idnits 2.17.1 draft-cheney-simplediff-00.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- == It seems as if not all pages are separated by form feeds - found 0 form feeds but 23 pages Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- ** The document seems to lack separate sections for Informative/Normative References. All references will be assumed normative when checking for downward references. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year -- The document date (October 2017) is 2385 days in the past. Is this intentional? Checking references for intended status: Informational ---------------------------------------------------------------------------- -- Missing reference section? '1' on line 970 looks like a reference -- Missing reference section? '0' on line 970 looks like a reference -- Missing reference section? '2' on line 812 looks like a reference -- Missing reference section? '4' on line 813 looks like a reference -- Missing reference section? '3' on line 811 looks like a reference Summary: 1 error (**), 0 flaws (~~), 2 warnings (==), 6 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 1 INTERNET-DRAFT 2 Independent Submission A. Cheney 3 Category: Informational prettydiff.com 4 October 2017 6 Simple Diff Algorithm 7 draft-cheney-simplediff-00.txt 9 Abstract 11 This document describes a simplified algorithm for performing 12 computational comparisons. The design goal is to achieve an approach 13 that uses the smallest amount of logic. Necessary secondary goals 14 include an approach that is language agnostic, fast, extensible, and 15 more predictable than other similar algorithms. 17 Copyright Notice 19 Copyright (c) 2017 IETF Trust and the persons identified as the 20 document authors. All rights reserved. 22 This document is subject to BCP 78 and the IETF Trust's Legal 23 Provisions Relating to IETF Documents 24 (http://trustee.ietf.org/license-info) in effect on the date of 25 publication of this document. Please review these documents 26 carefully, as they describe your rights and restrictions with respect 27 to this document. 29 This Internet-Draft is submitted in full conformance with the 30 provisions of BCP 78 and BCP 79. Internet-Drafts are working 31 documents of the Internet Engineering Task Force (IETF). Note that 32 other groups may also distribute working documents as 33 Internet-Drafts. The list of current Internet-Drafts is at 34 http://datatracker.ietf.org/drafts/current. 36 Internet-Drafts are draft documents valid for a maximum of six months 37 and may be updated, replaced, or obsoleted by other documents at any 38 time. It is inappropriate to use Internet-Drafts as reference 39 material or to cite them other than as "work in progress." 41 Comments 43 Comments are solicited and should be addressed to 44 info@prettydiff.com. 46 RFC XXXX Simple Diff Algorithm October 2017 48 Table of Contents 50 1. Introduction ....................................................3 51 2. Understanding how to compare ....................................3 52 2.1. It is more about the equality than the differences .........3 53 2.2. Algorithm quality ..........................................4 54 2.3. Speed ......................................................4 55 2.4. Output format ..............................................6 56 3. How the algorithm works at a high level .........................7 57 3.1. Getting by with a hash map and some counts .................7 58 3.2. The third and final pass through the data ..................8 59 3.3. The "fix" function ........................................11 60 3.4. Done ......................................................16 61 4. Extensibility ..................................................23 62 5. Security considerations ........................................23 63 6. IANA considerations ............................................23 64 7. References .....................................................23 65 8. Author's address ...............................................23 67 RFC XXXX Simple Diff Algorithm October 2017 69 1. Introduction 71 There are many comparison algorithms available, which are commonly 72 referred to as "diff" algorithms after the similarly named Unix 73 application. While there are many algorithms already available none 74 achieve speed, precision, and simplicity in an easily predictable 75 manner. 77 The most commonly fault of many diff algorithms is that they solve 78 for the incorrect problem. Instead of accomplishing the goal, to 79 find how two samples differ, they instead will try to solve for a 80 common computer science problem called "Longest Common Subsequence 81 (LCS)". The LCS problem seeks to find the longest blocks of equality 82 in samples of comparison, which sounds ideal. Unfortunately, the 83 execution of this approach is not predictable as it is bound to the 84 commonality of the compared samples. Furthermore, it isn't simple or 85 fast. Worse still is that this approach is a precision failure in 86 situations where differences are interspersed across frequent 87 repetition. 89 2. Understanding how to compare 91 2.1. It is more about the equality than the differences 93 A good diff algorithm will attempt to identify as much equality as 94 possible. Everything else qualifies as differences. The metric for 95 quality and precision is a smaller count of captured differences. 96 The smaller this number the better, presuming differences aren't 97 escaping undetected. 99 False negatives, which is allowing differences through without 100 detection, is really bad. This is absolute failure in a diff 101 algorithm. False positives, which is identifying more differences 102 than there actually are is also bad, but a false positive is much 103 better than a false negative. This means it is safer to report more 104 differences than are actually present, which is why a higher number 105 of reported differences is less precise. Maximum precision is 106 reporting differences without any false negatives or false positives. 108 One way to achieve higher precision is to first capture as much 109 equality as possible. For everything else that is left over 110 prioritize unique items first and report these items as differences. 111 Finally determine the next bit of equality or uniqueness and 112 everything in between is either a change, an insertion, or a 113 deletion. 115 RFC XXXX Simple Diff Algorithm October 2017 117 2.2. Algorithm quality 119 The primary priorities when writing this kind of code are execution 120 speed, precision (as previous described), and code simplicity. In 121 most cases precision is the most important factor in code design, 122 but in some cases speed is more important when processing large input 123 or a batch of thousands of files. Simplicity is necessary so that 124 other people can understand the code and modify it as necessary to 125 improve upon the design and additional features. No algorithm is 126 ever capable of meeting all needs for all use cases, so it is 127 important that other people can understand the code with minimal 128 effort. 130 2.3. Speed 132 Faster execution is the result of a couple of things. The most 133 important consideration for making an algorithm faster is to reduce 134 the number of passes through data where possible. After that 135 eliminate all costly operations. In most programming languages 136 simple arithmetic and static string comparisons are cheap to execute 137 particularly if the examined strings aren't changing. The 138 theoretical minimum number of data passes is two as you have to read 139 the contents of each submitted sample. This proposed algorithm 140 achieves speed by only taking three complete passes through data and 141 making all possible effort to never repeat a step or loop iteration. 143 This approach is linear and predictable where the number of 144 interations passing through data is computed from the sum of: 146 * number of iterations from the first sample 147 * number of iterations from the second sample 148 * the number of iterations from the smallest of those samples. 150 Performance of the mentioned approach is heavily based upon key 151 access in a hash map, which will likely vary by programming language. 153 Until a way to perform a comparison without reading from the samples 154 is discovered 2 data passes can be safely assumed as the fastest 155 possible approach. Between that and the Simple Diff approach there 156 are two possibilities for increased performance. The first 157 possibility is to make decisions and write output immediately upon 158 reading from the second sample so as to have only two data passes. 159 The challenge with this approach is that analysis occurs in the 160 moment of the current loop interation without knowledge of what comes 161 later in the second sample. A more practical second performance 162 possibility is to write a smaller hash map. Writing a smaller hash 163 map means moving some of the decision tree up front before a separate 164 formal analysis phase. In order for this approach to be viable this 165 early step in logic must be tiny, non-replicating, and a different 166 means of iteration must be devised to account for a data store of 167 unpredictable size. 169 RFC XXXX Simple Diff Algorithm October 2017 171 The proposed algorithm could be faster than the popular Myers' O(ND) 172 approach. That may or may not be true and no evidence was provided 173 either way (conjecture is not evidence). In terms of experimental 174 criteria algorithms are themselves largely irrelevant. More 175 important is the implementation of those algorithms. If exercising 176 the Myers' approach makes fewer decisions and has fewer total data 177 passes, as described in the previous paragraph, then it likely is 178 faster. Pronounced emphasis is applied to the word "total" because 179 this makes all the difference. 181 Many prior diff applications don't provide a complete third data 182 pass. Instead they provide the minimum two complete data passes over 183 the samples and various smaller passes as calculated by block moves 184 and edit distances. If these various fractional passes are non- 185 linear, which means starting and stopping in different places 186 respectively, their performance is less predictable. If these 187 fractional passes are non-linear and touch any given data index more 188 than once they have repetitive logic, and likely are not as fast. To 189 affirmatively guarantee superior performance over this Simple Diff 190 approach there needs to be fewer passes over data, which means no 191 repetition and a smaller number of iterations. Predictability 192 ensures the performance of an application scales roughly 193 proportionately to the size of the provided samples, where an 194 unpredictable approach would scale disproportionately (slower over 195 time). 197 The approach taken here is fast. It cannot be argued, 198 scientifically, it is the fastest ever (or slowest) approach for its 199 level of accuracy without also writing alternate algorithms into 200 applications with identical application constraints. One could argue 201 this approach is the fastest ever comparative algorithm for its level 202 of predictability and precision. 204 RFC XXXX Simple Diff Algorithm October 2017 206 2.4. Output format 208 This Simple Diff algorithm inherits its output format from the 209 application jsdifflib, https://github.com/cemerick/jsdifflib. The 210 output is a big array containing child arrays at each index. The 211 child arrays each contain 5 indexes in the format: 213 * type 214 * start in first sample 215 * end in first sample 216 * start in second sample 217 * end in second sample 219 There are 4 types defined in the output: 221 * "equal" - both array index items are identical at the indexes 222 compared 223 * "replace" - changes are present at indexes compared 224 * "insert" - the index of the second sample is unique to the 225 second sample 226 * "delete" - the index of the first sample is unique to the first 227 sample 229 RFC XXXX Simple Diff Algorithm October 2017 231 3. How the algorithm works at a high level 233 3.1. Getting by with a hash map and some counts 235 This algorithm takes after the Paul Heckel algorithm. First thing is 236 to transform the string input into arrays. The most primative way to 237 do this is to split the input by lines where each line of input is an 238 index in an array. Once the two input samples are converted to 239 arrays they will each have to be examined. 241 The first loop will walk through an array representing one of the 242 code samples and make a decision for a hash map which is named 243 "table" in the following code sample. The two arrays are simply 244 named "one" and "two". Each index of the array, by default is a line 245 of code, which will serve as a key in the hash map named "table". If 246 this key does not exist then it will be added. The value for each 247 key is an object with two properites named "one" and "two". These 248 properties are simply a count indicating the number of times the key 249 is encountered in the two arrays. If the key is already present then 250 we will simply increase the value of properties "one" or "two" 251 respective of the current array. 253 Here is an example of this step in code: 255 do { 256 if (table[two[b]] === undefined) { 257 table[two[b]] = { 258 one: 0, 259 two: 1 260 }; 261 } else { 262 table[two[b]].two += 1; 263 } 264 b += 1; 265 } while (b < lenb); 267 RFC XXXX Simple Diff Algorithm October 2017 269 3.2. The third and final pass through the data 271 Once the table is fully populated with all indexes from both arrays, 272 "one" and "two", we need a third and final loop to walk across the 273 data and make some simple decisions. Here is this complete step in 274 code. 276 do { 277 c = a; 278 d = b; 279 if (one[a] === two[b]) { 280 equality(); 281 } else if (table[one[a]][1] < 1 && table[two[b]][0] < 1) { 282 replaceUniques(); 283 } else if (table[one[a]][1] < 1 && one[a + 1] !== two[b + 2]) { 284 deletion(); 285 } else if (table[two[b]][0] < 1 && one[a + 2] !== two[b + 1]) { 286 insertion(); 287 } else if ( 288 table[one[a]][0] - table[one[a]][1] === 1 && 289 one[a + 1] !== two[b + 2] 290 ) { 291 deletionStatic(); 292 } else if ( 293 table[two[b]][1] - table[two[b]][0] === 1 && 294 one[a + 2] !== two[b + 1] 295 ) { 296 insertionStatic(); 297 } else if (one[a + 1] === two[b]) { 298 deletion(); 299 } else if (one[a] === two[b + 1]) { 300 insertion(); 301 } else { 302 replacement(); 303 } 304 a = a + 1; 305 b = b + 1; 306 } while (a < lena && b < lenb); 308 Before moving any further please understand this logic is fast and 309 simple, but it isn't precise at all. We will fix this later with a 310 child function named "fix". Corrections for precision are a 311 secondary step so as to not disrupt performance and also isolate 312 complexity into a separate single location. 314 In the above code references "a" and "b" are simply positive integer 315 incrementors. The "a" reference is always associated with the "one" 316 array while the "b" reference is always associated with the "two" 317 array. This is necessary so that each array may increment 318 independently without collision. 320 RFC XXXX Simple Diff Algorithm October 2017 322 The "c" and "d" references are secondary incrementors used as 323 closures in the child functions. These secondary incrementors allow 324 for child loops to occur without mutation to the primary 325 incrementors. 327 The first thing that happens in this loop is an attempt to discover 328 equality. Everything else is less important and so happens later in 329 the decision tree. 331 The second rule is to identify items that occur only one more time in 332 each array. If the next item to occur in the arrays is not equal but 333 occurs just once more then we know the item is a unique change. 335 The third rule is to determine if the current item is a deletion. If 336 the current item, and possibly additional following items, 337 is present in the first sample but no longer exists in the second 338 sample then it is a dynamic deletion. 340 The fourth rule is the same as the third rule but in reverse to 341 determine dynamic insertions. 343 The fifth rule is to determine if the current index occurs exactly 344 one more time in the first sample than in the second sample and yet 345 is not a match for the next item in the first sample with two items 346 up in the second sample. This is a static deletion, or rather a 347 deletion of a fixed number of items. 349 The sixth rule is the same as the fifth rule but in reverse to 350 determine static insertions. 352 The seventh and final rule determines that the current items in the 353 samples are non-unique changes. 355 The child functions are as follows: 357 equality = function simpleDiff_equality() { 358 do { 359 table[one[c]].one -= 1; 360 table[one[c]].two -= 1; 361 c += 1; 362 d += 1; 363 } while ( 364 c < lena && 365 d < lenb && 366 one[c] === two[d] 367 ); 368 fix(["equal", a, c, b, d]); 369 b = d - 1; 370 a = c - 1; 371 }, 373 RFC XXXX Simple Diff Algorithm October 2017 375 deletion = function simpleDiff_deletion() { 376 do { 377 table[one[c]].one -= 1; 378 c += 1; 379 } while ( 380 c < lena && 381 table[one[c]].two < 1 382 ); 383 fix(["delete", a, c, -1, -1]); 384 a = c - 1; 385 b = d - 1; 386 }, 387 deletionStatic = function simpleDiff_deletionStatic() { 388 table[one[a]].one -= 1; 389 fix([ 390 "delete", a, a + 1, 391 -1, 392 -1 393 ]); 394 a = c; 395 b = d - 1; 396 }, 397 insertion = function simpleDiff_insertion() { 398 do { 399 table[two[d]].two -= 1; 400 d += 1; 401 } while ( 402 d < lenb && 403 table[two[d]].one < 1 404 ); 405 fix(["insert", -1, -1, b, d]); 406 a = c - 1; 407 b = d - 1; 408 }, 409 insertionStatic = function simpleDiff_insertionStatic() { 410 table[two[b]].two -= 1; 411 fix([ 412 "insert", -1, -1, b, b + 1 413 ]); 414 a = c - 1; 415 b = d; 416 }, 418 RFC XXXX Simple Diff Algorithm October 2017 420 replacement = function simpleDiff_replacement() { 421 do { 422 table[one[c]].one -= 1; 423 table[two[d]].two -= 1; 424 c += 1; 425 d += 1; 426 } while ( 427 c < lena && 428 d < lenb && 429 table[one[c]].two > 0 && 430 table[two[d]].one > 0 431 ); 432 fix(["replace", a, c, b, d]); 433 a = c - 1; 434 b = d - 1; 435 }, 436 replaceUniques = function simpleDiff_replaceUniques() { 437 do { 438 table[one[c]].one -= 1; 439 c += 1; 440 d += 1; 441 } while ( 442 c < lena && 443 d < lenb && 444 table[one[c]].two < 1 && 445 table[two[d]].one < 1 446 ); 447 fix(["replace", a, c, b, d]); 448 a = c - 1; 449 b = d - 1; 450 } 452 It must be noted that most of the these child functions do contain 453 their own loops, but these loops are not duplicates. The primary 454 parent loop, the so called third and final loop, is adjusted forward 455 by the distance these smaller loops increment so that there is no 456 repetition or lose of execution speed. 458 3.3. The "fix" function 460 We can see from the child function code that arrays are generated 461 which appear to be the format described as the child arrays of the 462 output. Instead of pushing this data to the output array directly 463 it will be passed through a function named "fix" which serves as a 464 sort of quality control filter. This function checks for things 465 like: 467 RFC XXXX Simple Diff Algorithm October 2017 469 * A diff type immediately following the same diff type, which can be 470 combined into a single child array for the output 471 * An insertion immediately following a deletion, or vise versa, which 472 should likely be a replacement 473 * Whether false positives generated from replacements immediately 474 following an insertion or deletion can be eliminated 476 The general idea is to only accept a child array of the output as an 477 argument and make a decision after comparing it against the previous 478 child array of the output. Aside from two narrow edge cases there is 479 no analysis of the original data. Think of it like an appeals court 480 in that no new evidence is allowed, because only the laws are under 481 scrutiny. 483 The code for the fix function is as follows: 485 fix = function simpleDiff_fix(code) { 486 var prior = codes[codes.length - 1]; 487 if (prior !== undefined) { 488 if (prior[0] === code[0]) { 489 if (code[0] === "replace" || code[0] === "equal") { 490 prior[2] = code[2]; 491 prior[4] = code[4]; 492 } else if (code[0] === "delete") { 493 prior[2] = code[2]; 494 } else if (code[0] === "insert") { 495 prior[4] = code[4]; 496 } 497 return; 498 } 499 if (prior[0] === "insert" && prior[4] - prior[3] === 1) { 500 if (code[2] - code[1] === 1) { 501 if (code[0] === "replace") { 502 prior[0] = "replace"; 503 prior[1] = code[1]; 504 prior[2] = code[2]; 505 code[0] = "insert"; 506 code[1] = -1; 507 code[2] = -1; 508 } else if (code[0] === "delete") { 509 code[0] = "replace"; 510 code[3] = prior[3]; 511 code[4] = prior[4]; 512 codes.pop(); 513 prior = codes[codes.length - 1]; 514 if (prior[0] === "replace") { 515 prior[2] = code[2]; 516 prior[4] = code[4]; 517 return; 518 } 519 } 521 RFC XXXX Simple Diff Algorithm October 2017 523 } else if (code[0] === "delete") { 524 prior[0] = "replace"; 525 prior[1] = code[1]; 526 prior[2] = code[1] + 1; 527 code[1] = code[1] + 1; 528 } else if (code[0] === "replace") { 529 prior[0] = "replace"; 530 prior[1] = code[1]; 531 prior[2] = code[1] + 1; 532 c = prior[2]; 533 d = prior[4]; 534 return; 535 } 536 } else if ( 537 prior[0] === "insert" && 538 code[0] === "delete" && 539 code[2] - code[1] === 1 540 ) { 541 prior[4] = prior[4] - 1; 542 code[0] = "replace"; 543 code[3] = prior[4]; 544 code[4] = prior[4] + 1; 545 } else if ( 546 prior[0] === "delete" && 547 prior[2] - prior[1] === 1 548 ) { 549 if (code[4] - code[3] === 1) { 550 if (code[0] === "replace") { 551 prior[0] = "replace"; 552 prior[3] = code[3]; 553 prior[4] = code[4]; 554 code[0] = "delete"; 555 code[3] = -1; 556 code[4] = -1; 557 } else if (code[0] === "insert") { 558 code[0] = "replace"; 559 code[1] = prior[1]; 560 code[2] = prior[2]; 561 codes.pop(); 562 prior = codes[codes.length - 1]; 563 if (prior[0] === "replace") { 564 prior[2] = code[2]; 565 prior[4] = code[4]; 566 return; 567 } 568 } 569 } else if (code[0] === "insert") { 570 prior[0] = "replace"; 571 prior[3] = code[3]; 572 prior[4] = code[3] + 1; 573 code[3] = code[3] + 1; 575 RFC XXXX Simple Diff Algorithm October 2017 577 } else if (code[0] === "replace") { 578 prior[0] = "replace"; 579 prior[3] = code[3]; 580 prior[4] = code[4] + 1; 581 c = prior[2]; 582 d = prior[4]; 583 return; 584 } 585 } else if ( 586 prior[0] === "delete" && 587 code[0] === "insert" && 588 code[4] - code[3] === 1 589 ) { 590 prior[2] = prior[2] - 1; 591 code[0] = "replace"; 592 code[1] = prior[2]; 593 code[2] = prior[2] + 1; 594 } else if (prior[0] === "replace") { 595 if (code[0] === "delete") { 596 if (one[code[2] - 1] === two[prior[4] - 1]) { 597 if (prior[2] - prior[1] > 1) { 598 prior[4] = prior[4] - 1; 599 } 600 c = c - 1; 601 d = d - 1; 602 return; 603 } 604 if (one[code[2]] === two[prior[4] - 1]) { 605 if (prior[2] - prior[1] > 1) { 606 prior[2] = prior[2] - 1; 607 prior[4] = prior[4] - 11; 608 table[one[c - 1]][0] = 609 table[one[c - 1]][0] - 1; 610 } 611 } 612 } else if (code[0] === "insert") { 613 if (one[prior[2] - 1] === two[code[4] - 1]) { 614 if (prior[2] - prior[1] > 1) { 615 prior[2] = prior[2] - 1; 616 } 617 c = c - 1; 618 d = d - 1; 619 return; 620 } 621 if (one[code[2] - 1] === two[prior[4]]) { 622 if (prior[4] - prior[3] > 1) { 623 prior[2] = prior[2] - 1; 624 prior[4] = prior[4] - 1; 625 table[two[d - 1]][1] = 626 table[two[d - 1]][1] - 1; 627 } 629 RFC XXXX Simple Diff Algorithm October 2017 631 } 632 } 633 } 634 } 635 codes.push(code); 636 }; 638 RFC XXXX Simple Diff Algorithm October 2017 640 3.4. Done! 642 The following is the complete algorithm with all parts assembled: 644 function simpleDiff () { 645 var table = {}, 646 lines = function simpleDiff_lines(str) { 647 str = str 648 .replace(/\r\n/g, "\n") 649 .replace(/\r/g, "\n"); 650 return str.split("\n"); 651 }, 652 one = (typeof options.source === "string") 653 ? lines(options.source) 654 : options.source, 655 two = (typeof options.diff === "string") 656 ? lines(options.diff) 657 : options.diff, 658 lena = one.length, 659 lenb = two.length, 660 a = 0, 661 b = 0, 662 c = 0, 663 d = 0, 664 codes = [], 665 fix = function simpleDiff_fix(code) { 666 var prior = codes[codes.length - 1]; 667 if (prior !== undefined) { 668 if (prior[0] === code[0]) { 669 if ( 670 code[0] === "replace" || 671 code[0] === "equal" 672 ) { 673 prior[2] = code[2]; 674 prior[4] = code[4]; 675 } else if (code[0] === "delete") { 676 prior[2] = code[2]; 677 } else if (code[0] === "insert") { 678 prior[4] = code[4]; 679 } 680 return; 681 } 682 if ( 683 prior[0] === "insert" && 684 prior[4] - prior[3] === 1 685 ) { 686 if (code[2] - code[1] === 1) { 687 if (code[0] === "replace") { 688 prior[0] = "replace"; 689 prior[1] = code[1]; 690 prior[2] = code[2]; 692 RFC XXXX Simple Diff Algorithm October 2017 694 code[0] = "insert"; 695 code[1] = -1; 696 code[2] = -1; 697 } else if (code[0] === "delete") { 698 code[0] = "replace"; 699 code[3] = prior[3]; 700 code[4] = prior[4]; 701 codes.pop(); 702 prior = codes[codes.length - 1]; 703 if (prior[0] === "replace") { 704 prior[2] = code[2]; 705 prior[4] = code[4]; 706 return; 707 } 708 } 709 } else if (code[0] === "delete") { 710 prior[0] = "replace"; 711 prior[1] = code[1]; 712 prior[2] = code[1] + 1; 713 code[1] = code[1] + 1; 714 } else if (code[0] === "replace") { 715 prior[0] = "replace"; 716 prior[1] = code[1]; 717 prior[2] = code[1] + 1; 718 c = prior[2]; 719 d = prior[4]; 720 return; 721 } 722 } else if ( 723 prior[0] === "insert" && 724 code[0] === "delete" && 725 code[2] - code[1] === 1 726 ) { 727 prior[4] = prior[4] - 1; 728 code[0] = "replace"; 729 code[3] = prior[4]; 730 code[4] = prior[4] + 1; 731 } else if ( 732 prior[0] === "delete" && 733 prior[2] - prior[1] === 1 734 ) { 735 if (code[4] - code[3] === 1) { 736 if (code[0] === "replace") { 737 prior[0] = "replace"; 738 prior[3] = code[3]; 739 prior[4] = code[4]; 740 code[0] = "delete"; 741 code[3] = -1; 742 code[4] = -1; 743 } else if (code[0] === "insert") { 744 code[0] = "replace"; 746 RFC XXXX Simple Diff Algorithm October 2017 748 code[1] = prior[1]; 749 code[2] = prior[2]; 750 codes.pop(); 751 prior = codes[codes.length - 1]; 752 if (prior[0] === "replace") { 753 prior[2] = code[2]; 754 prior[4] = code[4]; 755 return; 756 } 757 } 758 } else if (code[0] === "insert") { 759 prior[0] = "replace"; 760 prior[3] = code[3]; 761 prior[4] = code[3] + 1; 762 code[3] = code[3] + 1; 763 } else if (code[0] === "replace") { 764 prior[0] = "replace"; 765 prior[3] = code[3]; 766 prior[4] = code[4] + 1; 767 c = prior[2]; 768 d = prior[4]; 769 return; 770 } 771 } else if ( 772 prior[0] === "delete" && 773 code[0] === "insert" && 774 code[4] - code[3] === 1 775 ) { 776 prior[2] = prior[2] - 1; 777 code[0] = "replace"; 778 code[1] = prior[2]; 779 code[2] = prior[2] + 1; 780 } else if (prior[0] === "replace") { 781 if (code[0] === "delete") { 782 if (one[code[2] - 1] === two[prior[4] - 1]) { 783 if (prior[2] - prior[1] > 1) { 784 prior[4] = prior[4] - 1; 785 } 786 c = c - 1; 787 d = d - 1; 788 return; 789 } 790 if (one[code[2]] === two[prior[4] - 1]) { 791 if (prior[2] - prior[1] > 1) { 792 prior[2] = prior[2] - 1; 793 prior[4] = prior[4] - 11; 794 table[one[c - 1]][0] = 795 table[one[c - 1]][0] - 1; 796 } 797 } 798 } else if (code[0] === "insert") { 800 RFC XXXX Simple Diff Algorithm October 2017 802 if (one[prior[2] - 1] === two[code[4] - 1]) { 803 if (prior[2] - prior[1] > 1) { 804 prior[2] = prior[2] - 1; 805 } 806 c = c - 1; 807 d = d - 1; 808 return; 809 } 810 if (one[code[2] - 1] === two[prior[4]]) { 811 if (prior[4] - prior[3] > 1) { 812 prior[2] = prior[2] - 1; 813 prior[4] = prior[4] - 1; 814 table[two[d - 1]][1] = 815 table[two[d - 1]][1] - 1; 816 } 817 } 818 } 819 } 820 } 821 codes.push(code); 822 }, 823 equality = function simpleDiff_equality() { 824 do { 825 table[one[c]][0] = table[one[c]][0] - 1; 826 table[one[c]][1] = table[one[c]][1] - 1; 827 c = c + 1; 828 d = d + 1; 829 } while (c < lena && d < lenb && one[c] === two[d]); 830 fix(["equal", a, c, b, d]); 831 b = d - 1; 832 a = c - 1; 833 }, 834 deletion = function simpleDiff_deletion() { 835 do { 836 table[one[c]][0] = table[one[c]][0] - 1; 837 c = c + 1; 838 } while (c < lena && table[one[c]][1] < 1); 839 fix(["delete", a, c, -1, -1]); 840 a = c - 1; 841 b = d - 1; 842 }, 843 deletionStatic = function simpleDiff_deletionStatic() { 844 table[one[a]][0] = table[one[a]][0] - 1; 845 fix([ 846 "delete", a, a + 1, 847 -1, 848 -1 849 ]); 850 a = c; 851 b = d - 1; 852 }, 854 RFC XXXX Simple Diff Algorithm October 2017 856 insertion = function simpleDiff_insertion() { 857 do { 858 table[two[d]][1] = table[two[d]][1] - 1; 859 d = d + 1; 860 } while (d < lenb && table[two[d]][0] < 1); 861 fix(["insert", -1, -1, b, d]); 862 a = c - 1; 863 b = d - 1; 864 }, 865 insertionStatic = function simpleDiff_insertionStatic() { 866 table[two[b]][1] = table[two[b]][1] - 1; 867 fix([ 868 "insert", -1, -1, b, b + 1 869 ]); 870 a = c - 1; 871 b = d; 872 }, 873 replacement = function simpleDiff_replacement() { 874 do { 875 table[one[c]][0] = table[one[c]][0] - 1; 876 table[two[d]][1] = table[two[d]][1] - 1; 877 c = c + 1; 878 d = d + 1; 879 } while ( 880 c < lena && 881 d < lenb && 882 table[one[c]][1] > 0 && 883 table[two[d]][0] > 0 884 ); 885 fix(["replace", a, c, b, d]); 886 a = c - 1; 887 b = d - 1; 888 }, 889 replaceUniques = function simpleDiff_replaceUniques() { 890 do { 891 table[one[c]][0] = table[one[c]][0] - 1; 892 c = c + 1; 893 d = d + 1; 894 } while ( 895 c < lena && 896 d < lenb && 897 table[one[c]][1] < 1 && 898 table[two[d]][0] < 1 899 ); 900 fix(["replace", a, c, b, d]); 901 a = c - 1; 902 b = d - 1; 903 }; 905 // * First Pass, account for lines from first file 906 // * build the table from the second file 908 RFC XXXX Simple Diff Algorithm October 2017 910 do { 911 if (options.diffspaceignore === true) { 912 two[b] = two[b].replace(/\s+/g, ""); 913 } 914 if (table[two[b]] === undefined) { 915 table[two[b]] = [0, 1]; 916 } else { 917 table[two[b]][1] = table[two[b]][1] + 1; 918 } 919 b = b + 1; 920 } while (b < lenb); 922 // * Second Pass, account for lines from second file 923 // * build the table from the first file 924 lena = one.length; 925 a = 0; 926 do { 927 if (options.diffspaceignore === true) { 928 one[a] = one[a].replace(/\s+/g, ""); 929 } 930 if (table[one[a]] === undefined) { 931 table[one[a]] = [1, 0]; 932 } else { 933 table[one[a]][0] = table[one[a]][0] + 1; 934 } 935 a = a + 1; 936 } while (a < lena); 937 a = 0; 938 b = 0; 940 // find all equality... differences are what's left over solve 941 // only for the second set test removing reverse test removing 942 // undefined checks for table refs 944 do { 945 c = a; 946 d = b; 947 if (one[a] === two[b]) { 948 equality(); 949 } else if (table[one[a]][1] < 1 && table[two[b]][0] < 1) { 950 replaceUniques(); 951 } else if ( 952 table[one[a]][1] < 1 && 953 one[a + 1] !== two[b + 2] 954 ) { 955 deletion(); 956 } else if ( 957 table[two[b]][0] < 1 && 958 one[a + 2] !== two[b + 1] 959 ) { 960 insertion(); 962 RFC XXXX Simple Diff Algorithm October 2017 964 } else if ( 965 table[one[a]][0] - table[one[a]][1] === 1 && 966 one[a + 1] !== two[b + 2] 967 ) { 968 deletionStatic(); 969 } else if ( 970 table[two[b]][1] - table[two[b]][0] === 1 && 971 one[a + 2] !== two[b + 1] 972 ) { 973 insertionStatic(); 974 } else if (one[a + 1] === two[b]) { 975 deletion(); 976 } else if (one[a] === two[b + 1]) { 977 insertion(); 978 } else { 979 replacement(); 980 } 981 a = a + 1; 982 b = b + 1; 983 } while (a < lena && b < lenb); 984 if (lena - a === lenb - b) { 985 if (one[a] === two[b]) { 986 fix(["equal", a, lena, b, lenb]); 987 } else { 988 fix(["replace", a, lena, b, lenb]); 989 } 990 } else if (a < lena) { 991 fix(["delete", a, lena, -1, -1]); 992 } else if (b < lenb) { 993 fix(["insert", -1, -1, b, lenb]); 994 } 995 return codes; 996 } 998 RFC XXXX Simple Diff Algorithm October 2017 1000 4. Extensibility 1002 Since the code is simple and its execution is predictable the 1003 approach is open to modification. In different contexts a most 1004 precise comparison of text is not the most desirable objective. One 1005 example is to ignore certain particular differences that might be 1006 considered as unimportant. This degree of flexibility and 1007 extensibility allows for custom precision. 1009 Custom flexibility allows for a variety of advanced concerns. Such 1010 concerns may include nuances in accounting for language specific 1011 criteria whether programming, spoken, or written. Due to flexible 1012 rules, whether colloquialisms or syntax variations, differences of 1013 literal text may commonly reside that are not represented as 1014 differences in the parsed, or interpreted, result. In artificially 1015 intelligent systems it is necessary to account for these variations 1016 as comparisons occur. 1018 5. Security considerations 1020 None. 1022 6. IANA considerations 1024 None. 1026 7. References 1028 Heckel, P. (April 1978) "A technique for isolating differences 1029 between files". Communications of the ACM. 21 (4): 264-268. 1030 doi:10.1145/359460.359467 1032 Myers, E. (1986) "An O(ND) Difference Algorithm and Its Variations". 1033 Retrieved from http://xmailserver.org/diff2.pdf on 3 October 2017. 1035 8. Author's address 1037 Austin Cheney 1038 http://prettydiff.com 1039 info@prettydiff.com 1041 EXPIRATION 6 April 2017