Skip to content

Latest commit

 

History

History
756 lines (613 loc) · 34 KB

cEP-0034.md

File metadata and controls

756 lines (613 loc) · 34 KB

Improve Diff Handling

Metadata
cEP 0034
Version 0.1
Title Improve Diff Handling
Authors Utkarsh Sharma mailto:[email protected]
Status Proposed
Type Feature

Abstract

This cEP describes the implementation of binary diffs and xml diffs which will allow coala to generate nested diffs for XML files and binary diffs for handling binary formats and demonstrates their use on a bear which would use the new diff functionality.

The cEP also discusses approaches for improving existing diff handling by adding context for diffs and improving FileFactory.

Introduction

Currently coala supports unified diff handling which produces a smaller diff with old and new text presented immediately adjacent. It creates line by line diffs which are suited towards text files. The diffs only show the line of the code for which the diff is shown with no context as to which function the code belongs to. Showing some context would help the users in finding the code easily.

Git gives the function definition using git show in which it is easier to understand what the code does because git gives information regarding what function it resides in even though the function definition would normally not be in the diff. This project implements a similar behavior in our coala output.

Bears can handle binary files by setting USE_RAW_FILES to True. However these bears will have to be in charge of managing the file (opening the file, closing the file, reading the file, etc). There is not a binary diff handling system set in place whose functionality the bears could use to handle binary files automatically.

There are currently no bears for handling binary formats. This project would enable coala to handle binary files and add a bear which uses the new binary diffing system. We will use the multidiff library for generating the binary diffs.

For generating nested diffs for XML files we will use the xmldiff library.

Currently coala only supports utf-8 encoded files. This project would add support for handling non utf-8 text files (utf16/32, latin1 etc.).

Fileproxy

coala doesn't have support for handling files with encodings other than UTF-8 (UTF-16/UTF-32/latin-1). The File Class responsible for decoding files only decodes files with encoding UTF-8 which raises a UnicodeDecodeError when trying to decode files with different encodings.

This can be fixed by using the detect_encoding method from coala_utils.

The string method in File class needs to be changed:

string

When detect_encoding is run on utf-32 files it returns encoding as utf-16 instead of utf-32. This is because byte order markers (for Little Endian) for utf-16 is b'\xff\xfe' and for utf-32 is b'\xff\xfe\x00\x00' and for utf-32 encoded files detect_encoding reads the first 4 chars which are b'\xff\xfe\x00\x00'.

running raw.startswith() would give a match for both UTF-16 and UTF-32 BOMS and the code would exit after matching UTF-16. The same case is for Big Endian arrangement.

Hence detect_encoding incorrectly recognizes utf-32 files as utf-16

This can be fixed by the patch shown below

def detect_encoding(filename, default='utf-8'):
    with open(filename, 'rb') as f:
        raw = f.read(4)  # will read less if the file is smaller

    for enc, boms in [
        ('utf-8-sig', (codecs.BOM_UTF8,)),
        ('utf-16', (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)),
        ('utf-32', (codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE))
    ]:
        if any(raw.startswith(bom) for bom in boms):
            if enc == 'utf-16' and raw in (codecs.BOM_UTF32_LE,
                                           codecs.BOM_UTF32_BE):
                return 'utf-32'
            return enc

    return default

This will be followed by writing cli tests for coala and adding support for non utf-8 encodings to coala_utils and write tests.

Problems found

SpellCheckBear uses an external linter program scspell which likely only supports utf-8 and it doesn't return a patch when coala gives it a utf-16 file.

Some other bears might also give the same error which could be problematic.

To fix this a possible approach could be that the bears give the encoding they can handle and coala needs to determine whether lossless encoding is possible and create a temporary file in the linter's accepted encoding (converting utf-16 to utf-8 in this case).

Context for diffs

This is a sample output by using PEP8 Bear on a python file

sample_output

The diffs only show the affected line of the code with no context as to which function the code belongs to.

There are 2 types of context which can be added - extra lines before and after the line, and info about parents in the syntax tree (i.e class and function name)

For the first type of context, we determine the context by checking the indentation and printing all lines that have a lower indentation than the last printed one

The following approach would be used to achieve this:

  • Get the line number and filename of the diff from sourcerange.
  • Get to the line number of diff.
  • Loop backward and compare the indentation of nth line with (n-1)th line, if equal, do nothing, if lesser, then print that line with its line number. Break out of the loop when there is a line that matches r\n\S.*

The function get_context and print_before_context to get the context and print the context respectively would be written in ConsoleInteraction.py and are shown below

def get_context(file_dict,
                sourcerange):
    """
    Gets context for the affected lines and also the line numbers
    of the context
    :param file_dict:       A dictionary containing all files as values with
                            filenames as key.
    :param sourcerange:     The SourceRange object referring to the related
                            lines to print.
    :return:                Affected file's contextual information.
    """

    related_lines = list()
    related_lines_number = list()

    affected_line = file_dict[sourcerange.file][
                    sourcerange.start.line-1].rstrip().replace('\t', 8*' ')

    indent_space_count = len(affected_line)-len(affected_line.lstrip())
    if indent_space_count > 0:
        for line in range(sourcerange.start.line-2, -1, -1):  # pragma: no cover
            rline = file_dict[sourcerange.file][
                    line].rstrip().replace('\t', 8*' ')
            previous_space_count = len(rline) - len(rline.lstrip())
            if previous_space_count < indent_space_count and rline:
                related_lines.append(rline)
                related_lines_number.append(line+1)
                indent_space_count = previous_space_count
            if indent_space_count == 0:
                break
    else:
        pre_context = list(range(sourcerange.start.line-2, -1, -1))
        pre_context = pre_context[0: min(2, len(pre_context))]
        for line in pre_context:
            related_lines.append(file_dict[sourcerange.file][line])
            related_lines_number.append(line+1)

    related_lines_number.reverse()
    related_lines.reverse()

    return related_lines, related_lines_number

def print_before_context(console_printer,
                         file_dict,
                         sourcerange):  # pragma: no cover
    """
    Prints the context for the affected lines.
    :param console_printer: Object to print messages on the console.
    :param file_dict:       A dictionary containing all files as values with
                            filenames as key.
    :param sourcerange:     The SourceRange object referring to the related
                            lines to print.
    """

    related_lines, related_lines_number = get_context(console_printer,
                                                      file_dict,
                                                      sourcerange)
    no_color = not console_printer.print_colored
    line_number = 0
    for line in related_lines_number:
        # Print affected file's line number in the sidebar.
        console_printer.print(format_lines(lines='', line_nr=line, symbol='['),
                              color=FILE_LINES_COLOR,
                              end='')

        line = related_lines[line_number].rstrip('\n')
        line_number += 1
        try:
            lexer = get_lexer_for_filename(sourcerange.file)
        except ClassNotFound:
            lexer = TextLexer()
        lexer.add_filter(VisibleWhitespaceFilter(
            spaces=True, tabs=True,
            tabsize=SpacingHelper.DEFAULT_TAB_WIDTH))
        # highlight() combines lexer and formatter to output a ``str``
        # object.
        printed_chars = 0
        if line == sourcerange.start.line and sourcerange.start.column:
            console_printer.print(highlight_text(
                no_color, line[:sourcerange.start.column - 1],
                BackgroundMessageStyle, lexer), end='')

            printed_chars = sourcerange.start.column - 1

        if line == sourcerange.end.line and sourcerange.end.column:
            console_printer.print(highlight_text(
                no_color, line[printed_chars:sourcerange.end.column - 1],
                BackgroundSourceRangeStyle, lexer), end='')

            console_printer.print(highlight_text(
               no_color, line[sourcerange.end.column - 1:],
               BackgroundSourceRangeStyle, lexer), end='')
            console_printer.print('')
        else:
            console_printer.print(highlight_text(
                no_color, line[printed_chars:], BackgroundMessageStyle, lexer),
                                  end='')
            console_printer.print('')

The output will print the lines before the code. It will look like this:

first_context

After implementing this we can improve the diffs to show parents as headers and print the extra lines of context before and after the affected line. A data structure can be added which stores the parent keywords.

The new type of context when run on a sample file will give the following diff output:

context_1 context_2 context_3 context_4 context_5 context_6 context_7

Binary Diffs

A binary file is defined by the absence of an end-of-line marker. For a binary diff, there are no lines, hence we can’t use our existing unified diff system for diffing binary files.

To show a binary diff, the UI needs to show byte-ranges of changes, using some textual encoding of the bytes (e.g. hex) with some extra bytes each side to help the user see where those bytes are. This can then be made more user-friendly by describing the change to the user with text and then do the binary change if accepted.

I will be testing the new binary diff system on binary formats like JPEG and PNG Files.

The following process would be followed to generate binary diffs:

  • Open the original file and the fixed file.
  • Iterate through both files from the start to the end - this gives us an address (offset) in the file (byte 0, byte 1, byte 2, etc)
  • compare the bytes at the current position, and if they are not the same, print the address and the two different bytes

To generate a binary diff, classes like SourceRange, etc are all problematic. They all assume there is a end-of-line separator

Currently coala has the option to run raw files. Processing allows raw files if the Bear states that it uses raw files. If this is enabled, bears are in charge of doing the file handling (opening it, closing it, reading it, etc). Bears that use binary files can't be mixed with Bears that use text files.

Bears which handle binary files are required to set USE_RAW_FILES=TRUE - coalib has code for this, but we have no bears which use this yet

Currently There are already some PRs for binary bears -

Below is the hexdump of the 2 sample binary files - bin_file1 and bin_file2

bin_file1

bin_file2

On running these 2 files with a binary diffing library multidiff we get the following output:

original_binary_diff

The library can be trained to get the following output:

trained_binary_diff

The binary diff shows:

  • The address/offset where the byte changes have occured
  • The bytes preceding and succeeding the changed byte range (context)
  • The operation performed - (<delete>, <insert>, <replace>)

The ascii row (3rd column in the original diff output) was removed for better looking diffs but can be added again if we need a different design.

ImageCompressionBear is a bear which uses the optimage library to check if an image can be compressed and calculates the number of bytes which would be reduced if it is compressed.

A sample mockup for the bear is given below -

class ImageCompressionBear(LocalBear):
    """
    Checks for possible optimizations for JPEGs and PNGs
    See https://github.com/sk-/optimage
    """
    LANGUAGES = {'Image'}
    REQUIREMENTS = {
        PipRequirement('optimage', '0.0.1'),
        AnyOneOfRequirements([
            DistributionRequirement(apt_get='jpegoptim',
                                    portage='jpegoptim',
                                    xbps='jpegoptim'),
            ExecutableRequirement('jpegoptim')
        ]),
        AnyOneOfRequirements([
            DistributionRequirement(apt_get='libjpeg-progs',
                                    portage='libjpeg-turbo',
                                    xbps='libjpeg-turbo-tools'),
            ExecutableRequirement('jpegtran')
        ]),
        AnyOneOfRequirements([
            DistributionRequirement(apt_get='pngcrush',
                                    portage='pngcrush',
                                    xbps='pngcrush'),
            ExecutableRequirement('pngcrush')
        ]),
        AnyOneOfRequirements([
            DistributionRequirement(apt_get='optipng',
                                    portage='pngcrush',
                                    xbps='pngcrush'),
            ExecutableRequirement('optipng')
        ]),
        AnyOneOfRequirements([
            DistributionRequirement(apt_get='zopfli',
                                    portage='zopfli',
                                    xbps='zopfli'),
            ExecutableRequirement('zopflipng')
        ])
    }
    AUTHORS = {'The coala developers'}
    AUTHORS_EMAILS = {'The coala developers'}
    LICENSE = 'AGPL-3.0'
    CAN_DETECT = {'Compression', 'Bloat'}
    USE_RAW_FILES = True

    def run(self, filename, file):
        """
        Check for how much the image file size can be optimized
        :param image_files: The image files that this bear will use
        """

        _, extension = os.path.splitext(filename)
        extension = extension.lower()
        compressor = optimage._EXTENSION_MAPPING.get(extension)

        if compressor is None:
            raise '{} extension is unsupported'.format(extension)

        output_filename = create_tempfile(suffix=extension)

        compressor(filename, output_filename)

        original_size = os.path.getsize(filename)
        new_size = os.path.getsize(output_filename)
        reduction = original_size - new_size
        reduction_percentage = reduction * 100 / original_size
        savings = 'savings: {} bytes = {:.2f}%'.format(
            reduction, reduction_percentage)

        if new_size < original_size:
            diff = Diff(filename)
            diff.binary_diff(filename, output_filename)
            yield Result.from_values(origin=self,
                                     message=('This Image can be '
                                              'losslessly compressed '
                                              'to {} bytes ({})'
                                              .format(new_size,
                                                      savings)),
                                     file=filename)

        os.remove(output_filename)

For using the bear with the new binary diff handling system a possible approach is listed below:

  • Run the bear with the image file which you want compressed
  • The bear will check if the image can be compressed, If it can, the bear will generate a binary diff with the changes to the byte range.
  • The user can then choose from the actions provided.

Modifying ImageCompressionBear

  • First we would need to create a new function binary_diff() in Diff.py which will generate the binary diff and store it in a class variable self.binary and call it in ImageCompressionBear.

  • travis runs Ubuntu 16.04 which does not install zopflipng on running apt-get zopfli (zopflipng gets installed in Ubuntu 18.04). So for now we can't run tests for PNG files in travis.

We will make use of the multidiff library to generate the binary diff. The binary diff would be stored as a string which would be iterated over to get the desired binary diff output in ShowPatchAction.

Changes in coala core

For bool(Diff[]) to return True (needed for ShowPatchAction to print the binary diffs), we can set a rename attribute for the Diff Class or we can also modify the Diff.modified() method to achieve the same goal.

We also need to add an additional parameter additional_info to SourceRange and Result class and set it to binary when we are printing binary diffs.

There are lots of changes in binary code when we modify a binary file. So it would not be feasible to print the affected lines. Hence we only call the print_lines() method in ConsoleInteraction.py when additional_info is not set to binary.

Then we need to create a new function print_binary_diff in ShowPatchAction.py which will print the binary diff.

The following files need to be changed:

Diff.py

  def binary_diff(self, filename, output_filename):
      """
      Generates a binary diff corresponding to this patch.

      The diff will display the byte addresses where the bytes are modified.
      """

      self.binary = main([filename, output_filename, '--diff',
                          '--html'])
      self.output = list(output_filename)

  @property
  def modified(self):
      """
      Calculates the modified file, after applying the Diff to the original.

      This property also adds linebreaks at the end of each line.
      If no newline was present at the end of file before, this state will
      be preserved, except if the last line is deleted.
      """
      if self.binary:
          return self._generate_linebreaks(self.output)
      return self._generate_linebreaks(self._raw_modified())  

ShowPatchAction.py

def print_binary_diff(difflines, filename, to_filename, printer):
    printer.print(format_line(filename, real_nr='------'), color='red')
    printer.print(format_line(to_filename, real_nr='++++++'), color='freen')

    addresses = re.findall(r'[\w]{6}:', difflines)
    bytes = re.split(r'[\w]{6}:', difflines)

    for index in range(len(addresses)):
        printer.print(format_line(bytes[index+1].lstrip(),
                                  real_nr=addresses[index][:-1]),
                      color='green')

The ouput tested with ImageCompressionBear is given below:

Executing section cli...
    [INFO][09:39:40] _jpegtran: best compressor for "/home/temp/gradient_bloat_and_metadata.jpg"

    gradient_bloat_and_metadata.jpg
    **** ImageCompressionBear [Section: cli | Severity: NORMAL] ****
    !    ! This Image can be losslessly compressed to 946 bytes (savings: 3529 bytes = 78.86%)
    [------] /home/temp/gradient_bloat_and_metadata.jpg
    [++++++] /home/temp/gradient_bloat_and_metadata.jpg
    [000010] 00 48 00 00 ff <span class='replace'>db 00</span> 43 <span class='replace'>00 01
    01 01 01 01 01 01</span>
    [000020] <span class='replace'>01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01</span>
    [000030] <span class='replace'>01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01</span>
    [000040] <span class='replace'>01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01</span>
    [000050] <span class='replace'>01 01 01 01 01 01 01 01 01 ff db 00</span> 43 <span
    class='replace'>01 01 01</span>
    [000060] <span class='replace'>01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01</span>
    [000070] <span class='replace'>01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01</span>
    [000080] <span class='replace'>01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01</span>
    [000090] <span class='replace'>01 01 01 01 01 01 01 01 01 01 01 01 01 01 ff c0</span>
    [0000c0] 00 00 00 00 00 00<span class='delete'>05 06 08 ff c4 00 18 01 01 01 01 01 01 00
    00 00 00 00 00 00 00 00 00 00 03</span> 06 07 <span class='insert'>09 ff c4 00 28
    10 00 01</span>
    [0000d0] 04 <span class='replace'>01 02 05</span> 03<span class='delete'>01 00 02 10 03 10
    00 00 01 df 3a ff 00 a6 94 38 f5 4a 3d 52 8e 96 ea d8 54 28 f5 4a 39 43 8f 96 e9
    d8 55 38 e5 0e 3d 52 8f 96 e9 d8 54 28 e5 4e 3d 42 8f ff c4 00 15 10 01 01 00 00
    00 00 00 00 00 00 00 00 00 00 00 00 02 00 ff da 00 08 01 01 00 01</span> 05 <span
    class='replace'>00 00 00 00 00 00 00 00 01</span> 04<span class='delete'>00 21 31
    51 01 a1 ff da 00 08 01 03 01 01 3f 01 6b 76 e3 5d b7 1c ed b8 e7 6d f6 39 db 71
    ae db f2 39 db 71 ce db 8e 76 dc 73 b6 fb 1a dd b8 d7 6d c7 37 6e 39 db 71 ce db
    fb 1a ed b9 ff c4 00 15 11 01 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 01 00
    ff da 00 08 01 02 01 01 3f 01 59 65 96 59 65 96 59 65 96 59 65 96 ff c4 00 14 10
    01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 40 ff da 00 08 01 01 00 06 3f 02
    07 ff c4 00 19 10 00 03 01 01 01 00 00 00 00 00 00 00 00 00 00 00 00 31 61 10 20
    21 ff da 00 08 01 01 00 01 3f 21 c8 59 0e 01 47 9a 08 c2 24 05 13 3f ff da 00 0c
    03 01 00 02 00 03 00 00 00 10 b1 30 96 44 31 8a ff c4 00 23 11 00 02 02 01 04 01
    05 01 00 00 00 00 00 00 00 00 01</span>
    [0000e0] 11 21 31 41 00 51 <span class='replace'>22</span> 71 81 91 a1 <span
    class='insert'>12 61 c1 02 32</span>
    [0000f0] b1 d1 f0 <span class='replace'>ff c4 00 18 01 01 01 01 01 01 00 00 00</span>
    [000100] <span class='replace'>00 00 00 00 00 00 00 00 04 07</span> 08 <span
    class='replace'>05 09 ff c4 00</span>
    [000110] <span class='replace'>2a 11 00 01 02 05</span> 03<span class='delete'>01 01 3f 10
    c7 ab cd 77 36 77 f5 e2 ac 8c df 87 9f e7 67 44 2b cb c6 f9 40 0a f2 30 b5 65 cc
    81 94 d7 0e 03 09 5b f2 34 8a e8 e5 24 1f 72 71 42 af 5b da 0d 9b 26 cc 95 4b f3
    d2 07 7c bb b0 ea b7 ee 5e 64 ad b1 6e 16 25 3c c7 d3 24 00 e7 2d d9 7b 21 97 e5
    ad 72 6c 50 89 b9 53 0e 8a c4 73 ef fe 9b 67 f2 02 10 22 5b 6e cc 1b 2e b3 0e 78
    73 ae 5c 49 61 4b c0 65 6f 02 f3 a2 33 bc a4 76 07 73 f1 56 22 f0 0c 26 7e d7 65
    8e ec 3c bb cf 99 56 cf be 33 9c ff 00 ff c4 00 24 11 00 01</span> 03 04 02 <span
    class='insert'>03 00 00 00 00 00</span>
    [000120] <span class='insert'>00 00 01 11 31</span> 02 <span class='replace'>05</span>
    21<span class='delete'>31</span> 41 51 <span class='insert'>00</span> 61 71 <span
    class='replace'>22 32</span> 81
    [000130] <span class='insert'>12</span> 91 <span class='insert'>a1</span> f0 <span
    class='insert'>b1 d1 c1 e1 f1 ff da 00 0c 03 01 00</span>
    [000140] <span class='insert'>02 11 03 11 00 3f 00 df 34 88 e8 30 24 80 e6 3f</span>
    [000150] <span class='insert'>0c 0b 31 8b b6 ce b5 f4 51 f3 b8 f0 8f 9c 91 70</span>
    [000160] <span class='insert'>75 e9 ac ee 77 de 91 b2 dd b9 ab 9c fb ec</span> a1 <span
    class='replace'>1a</span>
    [000170] <span class='replace'>31 10 ed f2 7c 03 97 36 e5 da 09 e2 89 1b 6a 5a</span>
    [000180] <span class='replace'>e3 f8 3b e5 6a 75 1e 9d cf 3b c7 a9 4d 5f c2 df</span>
    [000190] <span class='replace'>fc e4 e9 4a 44 93 f4 c7 86 8c 6e e6 05 1c 02 34</span>
    [0001a0] <span class='replace'>68 e2 de b4 46 71 9e 2d 4a 59 a9 1f 9e 4f 09 f5</span>
    [0001b0] <span class='replace'>8f 56 7c 66 e8 00 6f 22 c9 a5 48</span> d1 <span
    class='replace'>d7 0f 2c cc</span>
    [0001c0] <span class='replace'>e2 6b 6c 6c 72 58 e3 5b 91 72 58 55 38 fc 6a 3d</span>
    [0001d0] <span class='replace'>3c 9e 77 9f 59 35 88 1a a2 a6 cb 40 a1 11 d7 c8</span>
    [0001e0] <span class='replace'>d4 b5 1a 31</span> 10<span class='delete'>db 91 d2
    4d</span> ee <span class='replace'>cf ed 5b 0c bf 68 d8 6b a7 14</span>
    [0001f0] <span class='replace'>68 d5</span> 45 2a <span class='replace'>70</span> a5 <span
    class='replace'>6f 62 e1 91 4e 4e b6 0c f2 78</span>
    [000200] <span class='replace'>9e be b0 c6 e8 88 17 9a 9b 30 67 d2 84 68 ea 0b</span>
    [000210] <span class='replace'>80 37 ab 9c 8d f0 ec 19 ef 46 8e 27</span> 42 87 <span
    class='insert'>da e8</span>
    [000220] <span class='insert'>83 ce 38 ca 47 e7 73 ce fe b8 4d 0b 97 25 cd 4a</span>
    [000230] <span class='insert'>32 7d 5d 2a 46 92 44 49 6e 42 25 f9 18 fd ce 8b</span>
    [000240] <span class='insert'>14 4f 61 cb d5 12 99 c2 79 a8 58 ec f2 78 82 3e</span>
    [000250] <span class='insert'>bc d5 5d c2 b3 67 9a ad d4 a3 46 0f a7 85 8b c0</span>
    [000260] <span class='insert'>67 20 78 38 c1 0f d8 e8 f1 46 f8 ca fe 6c 9f 72</span>
    [000270] <span class='insert'>35 1f 9e 4f 2b 1f 5a 2a 87 5a 25 aa 8b 7a 7e 96</span>
    [000280] <span class='insert'>58 8d 1c 08 38 ba 88 eb e3 61 ae 94 71 3e 49 3b</span>
    [000290] <span class='insert'>27 df b4 d6 c2 9e 4f 0f 50 11 8b dd 55 ca e1 05</span>
    [0002a0] <span class='insert'>d7 ca a6 95 23 47 51 ef 8b 15 d0 64 6d 19 0c 68</span>
    [0002b0] <span class='insert'>e3 52 77 f0 fc e7 f4 97 58 f4 f2 79 df d6 aa 22</span>
    [0002c0] <span class='insert'>62 2a f5 4a d1 58 a5 a9 ba 84 68 c7 0c 4c 63 f8</span>
    [0002d0] <span class='insert'>0f 89 ce 8f 14 6f bb 92 bb 22 65 70 95 26 f7 8f</span>
    [0002e0] <span class='insert'>4e e7 7d fd 79 f6</span> c5 <span class='insert'>fe a0 14 40
    55 23 46 62 2d</span>
    [0002f0] <span class='insert'>da 0e 58 9d f3 47 a1 db 46 8e 30 18 ad e9 ef c9</span>
    [000300] <span class='insert'>6e 07 82 44 82 77 3c 09 1f 5b 2b 9a 17 2a d7 a2</span>
    [000310] <span class='insert'>d7 65 ae a5 a8 d2 33 1f 4b 60 b7 67 f7 da cf 5d</span>
    [000320] <span class='insert'>74 63 8d 54 28 ad 4f df 92 d8 e7 60 cf 27 7d fd</span>
    [000330] <span class='insert'>6a 2a 54 25 56 c1 4a 66 81 ef a5 28 d1 d7 0b 92</span>
    [000340] <span class='insert'>04 35 67 98 32 d0 fd 72 68 a2 70 b4 17 3f d5 08</span>
    [000350] <span class='insert'>e6 89 bb c7 67 93 c2 b1 9f 5d 6a 88 70 0e 4f e1</span>
    [000360] <span class='insert'>9c</span> 51 <span class='replace'>d4 a4 47 f6 b8 ae db 06
    e8 df 2f 0e 68 e3</span>
    [000370] <span class='replace'>c5 7e 05</span> 76 <span class='replace'>b5 68 15 58 15 15
    3a 90 4f 27</span> 9d<span class='delete'>b8 59 6d 75 fb a2 7b 71</span> c9
    [000380] <span class='insert'>19 08 a7 bb e5 dc a8 e5 c2 dd 42 44 95 c2 09 86</span>
    [000390] <span class='insert'>8a 3c f2 ef 9d ea 9c d1 44 ef 73 c0 a9 47 a8 5a</span>
    [0003a0] <span class='insert'>78</span> ae <span class='replace'>d1 e9 dc f1 7d 7d 79 bd
    51</span> d4 <span class='replace'>fc da f7 bf</span>
    [0003b0] <span class='replace'>ff</span> d9<span class='delete'>8d 0e 89 0b a8 50 ee 7d f9
    31 6d c8 15 85 4f f1 88 db 26 66 db fc 07 c5 1d 82 ac 02 a1 cf 24 9a 9d 07 29 2e
    15 ba 7c 7f ff c4 00 23 10 00 01 04 02 01 05 00 03 00 00 00 00 00 00 00 00 01 11
    21 31 41 00 51 81 61 71 91 a1 c1 b1 d1 f0 ff da 00 08 01 01 00 01 3f 10 88 20 24
    80 a5 bf 08 0a 21 69 94 bc 03 32 a7 d3 e8 1b 53 2a 55 18 be cf 48 d5 b5 2c 20 d0
    23 20 f0 b7 7a 78 d5 68 d8 19 95 51 7a 46 85 af 86 d0 c8 18 a8 03 71 2f 63 74 a8
    11 67 1d 0c e5 3b 06 75 ec 5b f6 f8 03 41 56 08 a4 0f 46 a8 85 f0 71 80 c6 a6 19
    b9 f5 a1 90 37 5a 91 1c 0b 1a 6b 08 0a 3b 57 f0 1f 1e f0 8c d2 a8 c6 d0 9d dc 1e
    0e b1 14 2c a2 9e 17 ae a4 f3 90 48 90 19 22 fb 82 e8 cb cd dc 11 e3 41 38 4f aa
    cb 1b 09 64 68 3d ed 56 f7 10 bf ff d9</span>
    [    ] *0. Do (N)othing
    [    ]  1. (O)pen file
    [    ]  2. (A)pply patch
    [    ]  3. Print (M)ore info
    [    ]  4. Add (I)gnore comment
    [    ]  5. Show Applied (P)atches
    [    ]  6. (G)enerate patches
    [    ] Enter number (Ctrl-D to exit):

XML Diff

This approach describes an XML patch framework that utilizes the xmldiff library to generate XML Diffs.

The library follows the following paper to generate XML Diffs - Change Detection in Hierarchically Structured Information

The xmldiff library has functionality for creating diffs for XML documents and also applying patches to them.

The following XPath data model node types can be added, replaced, or removed with this framework: elements, attributes, namespaces, comments, texts, and processing instructions.

A sample mockup for the xmldiff is given below:

patching-process

The library also has an xml outformat which gives a nested xml diff which we would make use of in generating our diffs.

outformat_xml

Note

Currently we have one bear for xml files XmlBear which is a linter bear which uses xmllint to lint xml files. The xml diffs we generate are for additions/deletions/editing in nodes and don't detect changes in formatting but actual changes in data.

Approach

For getting xml diffs we would need to create a new method process_output_xml_diff()in linter.py. Which would call a Diff method from_xml_diff() which would generate the xml diffs. Then we need to create a new method print_xml_diff()in ShowPatchAction to print the xml diffs.

Fortunately unlike binary diffs we won't have problems with SourceRange and File Handling so we don't need to modify them like we did in binary diffs.

Some things need to be taken into account while coming up with the design for XML Diffs:

  • The xml diffs print an xml string which would generally be more than the DIFF_EXCERPT_MAX_SIZE so we need to decide if we just need to show the stats (number of changes) to the user and print the xml diff only when the user selects the ShowAppliedPatches action.

  • We can still use the Diff operations for adding/deleting/replacing a line so that we don't need additional methods for xml diffs. We would print the whole xml diff which would include all the changes at once, so we need to think about how to show the context for the affected lines or whether to remove the context altogether.

  • xml files could be 200-300 lines long (or more!). As a result the xml diff printed would be very long and we need to decide whether we just want to print the list of patch operations in that case.

The following files need to be changed to add support for XML Diffs:

Diff.py

def from_xml_diff(self, new_file, original_file):

      original_content = "".join(original_file)
      if isinstance(new_file, str):
          new_file = "".join(new_file.splitlines(keepends=True))
      new_content = "".join(new_file)

      formatter = formatting.XMLFormatter(normalize=formatting.WS_BOTH,
                                          pretty_print=True)

      self.xml = main.diff_texts(original_content,
                                 new_content,
                                 formatter=formatter)

ShowPatchAction.py

def print_xml_diff(difflines, diff, printer):
    current_line_added = None
    current_line_subtracted = None
    for line in difflines:
        if line.startswith('@@'):
            values = line[line.find('-'):line.rfind(' ')]
            subtracted, added = tuple(values.split(' '))
            current_line_added = int(added.split(',')[0][1:])
            current_line_subtracted = int(subtracted.split(',')[0][1:])
        elif line.startswith('---'):
            print_from_name(printer, line[4:])
        elif line.startswith('+++'):
            print_to_name(printer, line[4:])
    print(diff.xml.rstrip())

Versions of XML Diffs we can print

The xmldiff library has 2 possible diff outputs

  • The diff output as xml
<data xmlns:diff="http://namespaces.shoobx.com/diff">
  <country name="Liechtenstein">
    <rank>1</rank>
    <year>2008</year>
    <gdppc>141100</gdppc>
    <neighbor name="ert4773" direction="E" diff:update-attr="name:happy"/>
    <neighbor name="Switzerland" direction="W"/>
  </country>
</data>
  • The diff output as a list of patch operations
[update-attribute, /data/country/neighbor[1], name, "ert4773"]