Difference between revisions of "Notes:Diff algorithm"
m |
m (Saving work) |
||
Line 1: | Line 1: | ||
− | <!--{| class="wikitable" border="1" | + | __TOC__ |
+ | ==Overview== | ||
+ | My solution to comparing "things" is to form the problem as a classic "shortest path" combinatorial optimisation problem, the problem is formulated as follows: | ||
+ | * Given a finite string formed of symbols from an infinite or finite alphabet, {{M|A}} and | ||
+ | * another finite string of the same or different length, from the same alphabet, {{M|B}} | ||
+ | We wish to construct a finite sequence of instructions, {{M|1=I:=i_1,i_2,i_3,\ldots,i_n}} such that: | ||
+ | * Given {{M|A}} and {{M|I}} we can construct {{M|B}}. | ||
+ | To do this we will start with a marker initially at the first symbol of string {{M|A}} and instructions and an empty output string which we construct symbol by symbol. The instructions may be: | ||
+ | # {{C|>{{M|s}}}} - '''Insert''' - append {{M|s}} to the output string, do not move the marker forward | ||
+ | # {{C|<}} - '''Delete''' - deletes from the input string {{M|A}} by incrementing the marker and not writing any output | ||
+ | # {{C|1==}} - '''Accept''' - copy the symbol at the marked position of {{M|A}} to the output and move the marker forward by {{M|1}} | ||
+ | :(Assuming here that {{C|>}}, {{C|<}} and {{C|1==}} are not symbols that occur in the alphabet) | ||
+ | We will use the shorthands {{C|1={{M|n}}=}} for {{M|n}} consecutive accepts, so {{C|1=5=}} means {{C|1======}} and {{C|{{M|n}}<}} accordingly. | ||
+ | ===Example=== | ||
+ | Suppose we want to turn the erroneous phrase: | ||
+ | * {{C|Alex's Algoritm}} into the correct phrase: | ||
+ | * {{C|Alec's Algorithm}} | ||
+ | a "sensible" "diff" might be: | ||
+ | * {{C|1=3=>c<10=>h=}} | ||
+ | I use the word sensible because the diff: | ||
+ | * {{C|1=15<>A>l>e>c>'>s> >A>l>g>o>r>i>t>h>m}} is a valid diff! It just deletes everything and then enters the result letter by letter so isn't a very "good" diff. | ||
+ | ====Applying the diff==== | ||
+ | The input string is {{C|Alex's Algoritm}} and the diff is {{C|1=3=>c<10=>h=}}: | ||
+ | # {{C|'''A'''lex's Algoritm}} - {{C|}} - Initially the marker is for the first character of input {{M|A}} and the output is empty. | ||
+ | # {{C|Ale'''x''''s Algoritm}} - {{C|Ale}} - Instruction: {{C|1=3=}} - accept 3 characters from the input, moving the marker forward each time | ||
+ | # {{C|Ale'''x''''s Algoritm}} - {{C|Alec}} - Instruction: {{C|1=>c}} - insert the letter {{C|c}}, do not alter the marker | ||
+ | # {{C|Alex'''<nowiki>'</nowiki>'''s Algoritm}} - {{C|Alec}} - Instruction: {{C|1=>}} - delete, which means move the marker forward WITHOUT writing any output. | ||
+ | # {{C|Alex's Algorit'''m'''}} - {{C|Alec's Algorit}} - Instruction: {{C|1=10=}} - accept 10 times from the marker (so the {{C|'}} onwards 10 times) | ||
+ | # {{C|Alex's Algorit'''m'''}} - {{C|Alec's Algorith}} - Instruction: {{C|1=>h}} - insert {{C|h}} | ||
+ | # {{C|Alex's Algoritm}} - {{C|Alec's Algorithm}} - Instruction: {{C|1==}} - accept once (the marker is now in the "terminal" state) | ||
+ | |||
+ | Here I present a family of algorithms that can be used to find the "best" diff. | ||
+ | ==Edit/Diff graph example== | ||
+ | At any time where the marker isn't terminal we can delete or insert, we can never go backwards. We can also ''sometimes'' accept. We encode this information in an "edit graph" or "diff graph", in this graph we start at the top left and want to get to the bottom right, any path corresponds uniquely to a diff. Interpreted as follows: | ||
+ | # {{M|\rightarrow}} - delete. | ||
+ | # {{M|\downarrow}} - insert - insert the symbol for the row the head (the pointy side) of the arrow is at. | ||
+ | # {{M|\searrow}} - accept - we can only do this when the input string at the current position agrees with the output. | ||
+ | Weight the edges as follows: | ||
+ | # {{M|\rightarrow}} - {{M|1}} | ||
+ | # {{M|\downarrow}} - {{M|1}} - inserting and deleting have the same weight (so there is no preference) | ||
+ | # {{M|\searrow}} - {{M|0}} - accepts (going diagonally) are "free" | ||
+ | I shall show a diffgraph for the strings: | ||
+ | # {{C|abcabbac}} and | ||
+ | # {{C|abbab}} | ||
+ | Notice that I constructed this by taking {{C|abba}}, putting {{C|abc}} at the start, then putting a differing character at the end. As such we'd expect there to be a diff that inserts {{C|abc}}, accepts {{C|abba}}, deletes {{C|c}} and inserts {{C|b}}. | ||
+ | {| class="wikitable" border="1" | ||
|- | |- | ||
| <center><m> | | <center><m> | ||
Line 36: | Line 81: | ||
|- | |- | ||
! Example diffgraph between the strings {{C|abcabbac}} and {{C|abbab}} | ! Example diffgraph between the strings {{C|abcabbac}} and {{C|abbab}} | ||
− | |} | + | |} |
− | ==Example paths== | + | Notice any path produces a valid (although not necessarily good) diff, for example following every {{M|\rightarrow}} along the top and then following the {{M|\downarrow}} along the right produces a diff where we delete the input, then write the output letter by letter. Note we could also go down first, and then right, which corresponds to the diff that inserts the symbols of the second string then deletes all the symbols from the first. Let's mark some examples. |
+ | ===Example diff-paths=== | ||
{| class="wikitable" border="1" | {| class="wikitable" border="1" | ||
|- | |- | ||
Line 124: | Line 170: | ||
</m></center> | </m></center> | ||
|- | |- | ||
− | ! Example shortest paths | + | ! Example "shortest" paths |
|} | |} | ||
+ | Here: | ||
+ | * The <span style="color:#FF0000;">red</span> path is a "worst case", it has weight {{M|13}}, the red path deletes the first string then inserts each letter of the target string | ||
+ | ** Note that any path consisting only of inserts and deletes (only {{M|\rightarrow}} and {{M|\downarrow}} has weight {{M|13}} | ||
+ | * The <span style="color:#FF8800;">orange</span>, <span style="color:#00AA00;">green</span> and <span style="color:#0000FF;">blue</span> paths are all optimal, they all have weight {{M|5}}. Specifically: | ||
+ | ** <span style="color:#00AA00;">Green</span> - Objectively the best, it is "delete {{C|abc}} accept {{C|abba}} delete {{C|c}}, insert {{C|b}}" | ||
+ | ** <span style="color:#0000FF;">Blue</span> - Satisfactory, it is "accept {{C|ab}} delete {{C|c}} insert {{C|b}} accept {{C|ab}} delete {{C|bac}} |
Revision as of 00:14, 27 June 2016
Contents
Overview
My solution to comparing "things" is to form the problem as a classic "shortest path" combinatorial optimisation problem, the problem is formulated as follows:
- Given a finite string formed of symbols from an infinite or finite alphabet, [ilmath]A[/ilmath] and
- another finite string of the same or different length, from the same alphabet, [ilmath]B[/ilmath]
We wish to construct a finite sequence of instructions, [ilmath]I:=i_1,i_2,i_3,\ldots,i_n[/ilmath] such that:
- Given [ilmath]A[/ilmath] and [ilmath]I[/ilmath] we can construct [ilmath]B[/ilmath].
To do this we will start with a marker initially at the first symbol of string [ilmath]A[/ilmath] and instructions and an empty output string which we construct symbol by symbol. The instructions may be:
- >[ilmath]s[/ilmath] - Insert - append [ilmath]s[/ilmath] to the output string, do not move the marker forward
- < - Delete - deletes from the input string [ilmath]A[/ilmath] by incrementing the marker and not writing any output
- = - Accept - copy the symbol at the marked position of [ilmath]A[/ilmath] to the output and move the marker forward by [ilmath]1[/ilmath]
- (Assuming here that >, < and = are not symbols that occur in the alphabet)
We will use the shorthands [ilmath]n[/ilmath]= for [ilmath]n[/ilmath] consecutive accepts, so 5= means ===== and [ilmath]n[/ilmath]< accordingly.
Example
Suppose we want to turn the erroneous phrase:
- Alex's Algoritm into the correct phrase:
- Alec's Algorithm
a "sensible" "diff" might be:
- 3=>c<10=>h=
I use the word sensible because the diff:
- 15<>A>l>e>c>'>s> >A>l>g>o>r>i>t>h>m is a valid diff! It just deletes everything and then enters the result letter by letter so isn't a very "good" diff.
Applying the diff
The input string is Alex's Algoritm and the diff is 3=>c<10=>h=:
- Alex's Algoritm - - Initially the marker is for the first character of input [ilmath]A[/ilmath] and the output is empty.
- Alex's Algoritm - Ale - Instruction: 3= - accept 3 characters from the input, moving the marker forward each time
- Alex's Algoritm - Alec - Instruction: >c - insert the letter c, do not alter the marker
- Alex's Algoritm - Alec - Instruction: > - delete, which means move the marker forward WITHOUT writing any output.
- Alex's Algoritm - Alec's Algorit - Instruction: 10= - accept 10 times from the marker (so the ' onwards 10 times)
- Alex's Algoritm - Alec's Algorith - Instruction: >h - insert h
- Alex's Algoritm - Alec's Algorithm - Instruction: = - accept once (the marker is now in the "terminal" state)
Here I present a family of algorithms that can be used to find the "best" diff.
Edit/Diff graph example
At any time where the marker isn't terminal we can delete or insert, we can never go backwards. We can also sometimes accept. We encode this information in an "edit graph" or "diff graph", in this graph we start at the top left and want to get to the bottom right, any path corresponds uniquely to a diff. Interpreted as follows:
- [ilmath]\rightarrow[/ilmath] - delete.
- [ilmath]\downarrow[/ilmath] - insert - insert the symbol for the row the head (the pointy side) of the arrow is at.
- [ilmath]\searrow[/ilmath] - accept - we can only do this when the input string at the current position agrees with the output.
Weight the edges as follows:
- [ilmath]\rightarrow[/ilmath] - [ilmath]1[/ilmath]
- [ilmath]\downarrow[/ilmath] - [ilmath]1[/ilmath] - inserting and deleting have the same weight (so there is no preference)
- [ilmath]\searrow[/ilmath] - [ilmath]0[/ilmath] - accepts (going diagonally) are "free"
I shall show a diffgraph for the strings:
- abcabbac and
- abbab
Notice that I constructed this by taking abba, putting abc at the start, then putting a differing character at the end. As such we'd expect there to be a diff that inserts abc, accepts abba, deletes c and inserts b.
|
Example diffgraph between the strings abcabbac and abbab |
---|
Notice any path produces a valid (although not necessarily good) diff, for example following every [ilmath]\rightarrow[/ilmath] along the top and then following the [ilmath]\downarrow[/ilmath] along the right produces a diff where we delete the input, then write the output letter by letter. Note we could also go down first, and then right, which corresponds to the diff that inserts the symbols of the second string then deletes all the symbols from the first. Let's mark some examples.
Example diff-paths
|
Example "shortest" paths |
---|
Here:
- The red path is a "worst case", it has weight [ilmath]13[/ilmath], the red path deletes the first string then inserts each letter of the target string
- Note that any path consisting only of inserts and deletes (only [ilmath]\rightarrow[/ilmath] and [ilmath]\downarrow[/ilmath] has weight [ilmath]13[/ilmath]
- The orange, green and blue paths are all optimal, they all have weight [ilmath]5[/ilmath]. Specifically:
- Green - Objectively the best, it is "delete abc accept abba delete c, insert b"
- Blue - Satisfactory, it is "accept ab delete c insert b accept ab delete bac