Automatic merging in git without conflicts (using word-by-word diff instead of line-by-line)












1














I’d like to automatically merge commits where each commit changes a different word on the same line. The goal is to use git as a document storage and access it programmatically (thus, ideally without having to resolve conflicts). In my use case, I know for sure that the changes will not overlap (they don't affect the same words, although the lie on the same line).



git-diff can show me the diff between two commits not only per-line but also per word or per character. For example:



$ git diff --word-diff-regex=. HEAD HEAD~


If git-diff can identify the words (as opposed to the entire lines) that changed, I was convinced I could make git-merge detect conflicts on word-by-word (or character-by-character) basis. I was wrong. From what I understood (source), deep down, the git-diff tool operates on lines and the word or character diff functionality works already with these line-based results returned by git.



In this answer, it is suggested to make use of the clean and smudge filters in order to store each word on a separate line in the snapshot. However, that seems to me to be too hacky.



What approach would you choose?










share|improve this question





























    1














    I’d like to automatically merge commits where each commit changes a different word on the same line. The goal is to use git as a document storage and access it programmatically (thus, ideally without having to resolve conflicts). In my use case, I know for sure that the changes will not overlap (they don't affect the same words, although the lie on the same line).



    git-diff can show me the diff between two commits not only per-line but also per word or per character. For example:



    $ git diff --word-diff-regex=. HEAD HEAD~


    If git-diff can identify the words (as opposed to the entire lines) that changed, I was convinced I could make git-merge detect conflicts on word-by-word (or character-by-character) basis. I was wrong. From what I understood (source), deep down, the git-diff tool operates on lines and the word or character diff functionality works already with these line-based results returned by git.



    In this answer, it is suggested to make use of the clean and smudge filters in order to store each word on a separate line in the snapshot. However, that seems to me to be too hacky.



    What approach would you choose?










    share|improve this question



























      1












      1








      1







      I’d like to automatically merge commits where each commit changes a different word on the same line. The goal is to use git as a document storage and access it programmatically (thus, ideally without having to resolve conflicts). In my use case, I know for sure that the changes will not overlap (they don't affect the same words, although the lie on the same line).



      git-diff can show me the diff between two commits not only per-line but also per word or per character. For example:



      $ git diff --word-diff-regex=. HEAD HEAD~


      If git-diff can identify the words (as opposed to the entire lines) that changed, I was convinced I could make git-merge detect conflicts on word-by-word (or character-by-character) basis. I was wrong. From what I understood (source), deep down, the git-diff tool operates on lines and the word or character diff functionality works already with these line-based results returned by git.



      In this answer, it is suggested to make use of the clean and smudge filters in order to store each word on a separate line in the snapshot. However, that seems to me to be too hacky.



      What approach would you choose?










      share|improve this question















      I’d like to automatically merge commits where each commit changes a different word on the same line. The goal is to use git as a document storage and access it programmatically (thus, ideally without having to resolve conflicts). In my use case, I know for sure that the changes will not overlap (they don't affect the same words, although the lie on the same line).



      git-diff can show me the diff between two commits not only per-line but also per word or per character. For example:



      $ git diff --word-diff-regex=. HEAD HEAD~


      If git-diff can identify the words (as opposed to the entire lines) that changed, I was convinced I could make git-merge detect conflicts on word-by-word (or character-by-character) basis. I was wrong. From what I understood (source), deep down, the git-diff tool operates on lines and the word or character diff functionality works already with these line-based results returned by git.



      In this answer, it is suggested to make use of the clean and smudge filters in order to store each word on a separate line in the snapshot. However, that seems to me to be too hacky.



      What approach would you choose?







      git git-merge git-diff git-conflict-resolution






      share|improve this question















      share|improve this question













      share|improve this question




      share|improve this question








      edited Nov 21 '18 at 22:58







      Adam Libuša

















      asked Nov 21 '18 at 22:40









      Adam LibušaAdam Libuša

      455516




      455516
























          1 Answer
          1






          active

          oldest

          votes


















          1














          What you would need to do, to make Git work the way you wish, is to modify the merge code. In theory this is not too difficult. In practice, I am not sure how difficult it would turn out to be.



          In that other answer, I mention xdelta. More precisely, Git uses modified versions of both xdelta and libxdiff. The Git source puts most of this code in a subdirectory. Up one level you'll find a few more bits of code that work with the library, such as xdiff-interface.c.



          If you modified these to allow the xdiff code to treat "words" (separated, presumably, by any white space) rather than "lines" as individual symbols for the Myers, patience, and histogram algorithms, and modified the calling code similarly, you should be able to get Git to do the merge based on words instead of lines. (Git now adds an "anchor" thing that you might need to do something about; I have not looked at how this works.) You'd have to choose how to insert any conflict markers as well—presumably, around these white-space-separated words.



          The algorithms themselves are concerned with matching (or failing to match) symbols in two different inputs. Unfortunately, the symbols are, in libxdiff, always lines. The standard (non-Git-modified) libxdiff interface is documented here, and the interface itself is centered on whole files, with the libxdiff code doing its own line-breaking.



          Internally in the modified xdiff, it looks as though Git assigns each line to a "record" so that the symbols it is comparing are record-by-record. If you assigned each white-space-separated word to a record instead, you'd mostly get what you want, ignoring the small matter of (later) dealing with any actual white-space separating the actual records. That is, in xdl_hash_record, all you would do is stop at any white-space instead of at newline, then discard additional white space between this line and the next when finding the "next" record, to build the records themselves. The code calling this changed diff might have to change as it may assume that "record number" implies "line number" (this is not clear to me off-hand).



          (It might also work better if you included leading or trailing white space in each record, and just had the comparison function, xdl_recmatch—in the same file—say that the symbols match if they match excluding their whitespace. Note that xdl_hash_record should hash the symbol minus the whitespace as well: the diff engine requires that the hashes match if the symbols match, and for performance, wants the hashes to differ if the symbols differ. The test, in essence, is this: symbols S1 and S2 with hashes H1 and H2 match if H1 == H2 and recmatch(S1,S2) says that they match. The H1==H2 test eliminates a lot of subroutine calls to slow "compare"s when the symbols obviously differ, but for symbols whose hashes match, the call is required to verify that they're really the same.)



          The main Myers algorithm itself has time complexity O(ND) where N is the number of symbols and D is the number of differences—i.e., the length of the eventual edit-script—between the two input-sets. When the symbols are lines, a 1000-line file has 1000 symbols; when the symbols are, instead, words, the 1000-line file may have 30000 symbols. So this will obviously be slower, but at least it's generally linearly slower. The histogram and patience algorithms are modifications of Myers that should behave similarly, time-wise, I think, but I haven't really studied them properly.)






          share|improve this answer





















            Your Answer






            StackExchange.ifUsing("editor", function () {
            StackExchange.using("externalEditor", function () {
            StackExchange.using("snippets", function () {
            StackExchange.snippets.init();
            });
            });
            }, "code-snippets");

            StackExchange.ready(function() {
            var channelOptions = {
            tags: "".split(" "),
            id: "1"
            };
            initTagRenderer("".split(" "), "".split(" "), channelOptions);

            StackExchange.using("externalEditor", function() {
            // Have to fire editor after snippets, if snippets enabled
            if (StackExchange.settings.snippets.snippetsEnabled) {
            StackExchange.using("snippets", function() {
            createEditor();
            });
            }
            else {
            createEditor();
            }
            });

            function createEditor() {
            StackExchange.prepareEditor({
            heartbeatType: 'answer',
            autoActivateHeartbeat: false,
            convertImagesToLinks: true,
            noModals: true,
            showLowRepImageUploadWarning: true,
            reputationToPostImages: 10,
            bindNavPrevention: true,
            postfix: "",
            imageUploader: {
            brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
            contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
            allowUrls: true
            },
            onDemand: true,
            discardSelector: ".discard-answer"
            ,immediatelyShowMarkdownHelp:true
            });


            }
            });














            draft saved

            draft discarded


















            StackExchange.ready(
            function () {
            StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53421427%2fautomatic-merging-in-git-without-conflicts-using-word-by-word-diff-instead-of-l%23new-answer', 'question_page');
            }
            );

            Post as a guest















            Required, but never shown

























            1 Answer
            1






            active

            oldest

            votes








            1 Answer
            1






            active

            oldest

            votes









            active

            oldest

            votes






            active

            oldest

            votes









            1














            What you would need to do, to make Git work the way you wish, is to modify the merge code. In theory this is not too difficult. In practice, I am not sure how difficult it would turn out to be.



            In that other answer, I mention xdelta. More precisely, Git uses modified versions of both xdelta and libxdiff. The Git source puts most of this code in a subdirectory. Up one level you'll find a few more bits of code that work with the library, such as xdiff-interface.c.



            If you modified these to allow the xdiff code to treat "words" (separated, presumably, by any white space) rather than "lines" as individual symbols for the Myers, patience, and histogram algorithms, and modified the calling code similarly, you should be able to get Git to do the merge based on words instead of lines. (Git now adds an "anchor" thing that you might need to do something about; I have not looked at how this works.) You'd have to choose how to insert any conflict markers as well—presumably, around these white-space-separated words.



            The algorithms themselves are concerned with matching (or failing to match) symbols in two different inputs. Unfortunately, the symbols are, in libxdiff, always lines. The standard (non-Git-modified) libxdiff interface is documented here, and the interface itself is centered on whole files, with the libxdiff code doing its own line-breaking.



            Internally in the modified xdiff, it looks as though Git assigns each line to a "record" so that the symbols it is comparing are record-by-record. If you assigned each white-space-separated word to a record instead, you'd mostly get what you want, ignoring the small matter of (later) dealing with any actual white-space separating the actual records. That is, in xdl_hash_record, all you would do is stop at any white-space instead of at newline, then discard additional white space between this line and the next when finding the "next" record, to build the records themselves. The code calling this changed diff might have to change as it may assume that "record number" implies "line number" (this is not clear to me off-hand).



            (It might also work better if you included leading or trailing white space in each record, and just had the comparison function, xdl_recmatch—in the same file—say that the symbols match if they match excluding their whitespace. Note that xdl_hash_record should hash the symbol minus the whitespace as well: the diff engine requires that the hashes match if the symbols match, and for performance, wants the hashes to differ if the symbols differ. The test, in essence, is this: symbols S1 and S2 with hashes H1 and H2 match if H1 == H2 and recmatch(S1,S2) says that they match. The H1==H2 test eliminates a lot of subroutine calls to slow "compare"s when the symbols obviously differ, but for symbols whose hashes match, the call is required to verify that they're really the same.)



            The main Myers algorithm itself has time complexity O(ND) where N is the number of symbols and D is the number of differences—i.e., the length of the eventual edit-script—between the two input-sets. When the symbols are lines, a 1000-line file has 1000 symbols; when the symbols are, instead, words, the 1000-line file may have 30000 symbols. So this will obviously be slower, but at least it's generally linearly slower. The histogram and patience algorithms are modifications of Myers that should behave similarly, time-wise, I think, but I haven't really studied them properly.)






            share|improve this answer


























              1














              What you would need to do, to make Git work the way you wish, is to modify the merge code. In theory this is not too difficult. In practice, I am not sure how difficult it would turn out to be.



              In that other answer, I mention xdelta. More precisely, Git uses modified versions of both xdelta and libxdiff. The Git source puts most of this code in a subdirectory. Up one level you'll find a few more bits of code that work with the library, such as xdiff-interface.c.



              If you modified these to allow the xdiff code to treat "words" (separated, presumably, by any white space) rather than "lines" as individual symbols for the Myers, patience, and histogram algorithms, and modified the calling code similarly, you should be able to get Git to do the merge based on words instead of lines. (Git now adds an "anchor" thing that you might need to do something about; I have not looked at how this works.) You'd have to choose how to insert any conflict markers as well—presumably, around these white-space-separated words.



              The algorithms themselves are concerned with matching (or failing to match) symbols in two different inputs. Unfortunately, the symbols are, in libxdiff, always lines. The standard (non-Git-modified) libxdiff interface is documented here, and the interface itself is centered on whole files, with the libxdiff code doing its own line-breaking.



              Internally in the modified xdiff, it looks as though Git assigns each line to a "record" so that the symbols it is comparing are record-by-record. If you assigned each white-space-separated word to a record instead, you'd mostly get what you want, ignoring the small matter of (later) dealing with any actual white-space separating the actual records. That is, in xdl_hash_record, all you would do is stop at any white-space instead of at newline, then discard additional white space between this line and the next when finding the "next" record, to build the records themselves. The code calling this changed diff might have to change as it may assume that "record number" implies "line number" (this is not clear to me off-hand).



              (It might also work better if you included leading or trailing white space in each record, and just had the comparison function, xdl_recmatch—in the same file—say that the symbols match if they match excluding their whitespace. Note that xdl_hash_record should hash the symbol minus the whitespace as well: the diff engine requires that the hashes match if the symbols match, and for performance, wants the hashes to differ if the symbols differ. The test, in essence, is this: symbols S1 and S2 with hashes H1 and H2 match if H1 == H2 and recmatch(S1,S2) says that they match. The H1==H2 test eliminates a lot of subroutine calls to slow "compare"s when the symbols obviously differ, but for symbols whose hashes match, the call is required to verify that they're really the same.)



              The main Myers algorithm itself has time complexity O(ND) where N is the number of symbols and D is the number of differences—i.e., the length of the eventual edit-script—between the two input-sets. When the symbols are lines, a 1000-line file has 1000 symbols; when the symbols are, instead, words, the 1000-line file may have 30000 symbols. So this will obviously be slower, but at least it's generally linearly slower. The histogram and patience algorithms are modifications of Myers that should behave similarly, time-wise, I think, but I haven't really studied them properly.)






              share|improve this answer
























                1












                1








                1






                What you would need to do, to make Git work the way you wish, is to modify the merge code. In theory this is not too difficult. In practice, I am not sure how difficult it would turn out to be.



                In that other answer, I mention xdelta. More precisely, Git uses modified versions of both xdelta and libxdiff. The Git source puts most of this code in a subdirectory. Up one level you'll find a few more bits of code that work with the library, such as xdiff-interface.c.



                If you modified these to allow the xdiff code to treat "words" (separated, presumably, by any white space) rather than "lines" as individual symbols for the Myers, patience, and histogram algorithms, and modified the calling code similarly, you should be able to get Git to do the merge based on words instead of lines. (Git now adds an "anchor" thing that you might need to do something about; I have not looked at how this works.) You'd have to choose how to insert any conflict markers as well—presumably, around these white-space-separated words.



                The algorithms themselves are concerned with matching (or failing to match) symbols in two different inputs. Unfortunately, the symbols are, in libxdiff, always lines. The standard (non-Git-modified) libxdiff interface is documented here, and the interface itself is centered on whole files, with the libxdiff code doing its own line-breaking.



                Internally in the modified xdiff, it looks as though Git assigns each line to a "record" so that the symbols it is comparing are record-by-record. If you assigned each white-space-separated word to a record instead, you'd mostly get what you want, ignoring the small matter of (later) dealing with any actual white-space separating the actual records. That is, in xdl_hash_record, all you would do is stop at any white-space instead of at newline, then discard additional white space between this line and the next when finding the "next" record, to build the records themselves. The code calling this changed diff might have to change as it may assume that "record number" implies "line number" (this is not clear to me off-hand).



                (It might also work better if you included leading or trailing white space in each record, and just had the comparison function, xdl_recmatch—in the same file—say that the symbols match if they match excluding their whitespace. Note that xdl_hash_record should hash the symbol minus the whitespace as well: the diff engine requires that the hashes match if the symbols match, and for performance, wants the hashes to differ if the symbols differ. The test, in essence, is this: symbols S1 and S2 with hashes H1 and H2 match if H1 == H2 and recmatch(S1,S2) says that they match. The H1==H2 test eliminates a lot of subroutine calls to slow "compare"s when the symbols obviously differ, but for symbols whose hashes match, the call is required to verify that they're really the same.)



                The main Myers algorithm itself has time complexity O(ND) where N is the number of symbols and D is the number of differences—i.e., the length of the eventual edit-script—between the two input-sets. When the symbols are lines, a 1000-line file has 1000 symbols; when the symbols are, instead, words, the 1000-line file may have 30000 symbols. So this will obviously be slower, but at least it's generally linearly slower. The histogram and patience algorithms are modifications of Myers that should behave similarly, time-wise, I think, but I haven't really studied them properly.)






                share|improve this answer












                What you would need to do, to make Git work the way you wish, is to modify the merge code. In theory this is not too difficult. In practice, I am not sure how difficult it would turn out to be.



                In that other answer, I mention xdelta. More precisely, Git uses modified versions of both xdelta and libxdiff. The Git source puts most of this code in a subdirectory. Up one level you'll find a few more bits of code that work with the library, such as xdiff-interface.c.



                If you modified these to allow the xdiff code to treat "words" (separated, presumably, by any white space) rather than "lines" as individual symbols for the Myers, patience, and histogram algorithms, and modified the calling code similarly, you should be able to get Git to do the merge based on words instead of lines. (Git now adds an "anchor" thing that you might need to do something about; I have not looked at how this works.) You'd have to choose how to insert any conflict markers as well—presumably, around these white-space-separated words.



                The algorithms themselves are concerned with matching (or failing to match) symbols in two different inputs. Unfortunately, the symbols are, in libxdiff, always lines. The standard (non-Git-modified) libxdiff interface is documented here, and the interface itself is centered on whole files, with the libxdiff code doing its own line-breaking.



                Internally in the modified xdiff, it looks as though Git assigns each line to a "record" so that the symbols it is comparing are record-by-record. If you assigned each white-space-separated word to a record instead, you'd mostly get what you want, ignoring the small matter of (later) dealing with any actual white-space separating the actual records. That is, in xdl_hash_record, all you would do is stop at any white-space instead of at newline, then discard additional white space between this line and the next when finding the "next" record, to build the records themselves. The code calling this changed diff might have to change as it may assume that "record number" implies "line number" (this is not clear to me off-hand).



                (It might also work better if you included leading or trailing white space in each record, and just had the comparison function, xdl_recmatch—in the same file—say that the symbols match if they match excluding their whitespace. Note that xdl_hash_record should hash the symbol minus the whitespace as well: the diff engine requires that the hashes match if the symbols match, and for performance, wants the hashes to differ if the symbols differ. The test, in essence, is this: symbols S1 and S2 with hashes H1 and H2 match if H1 == H2 and recmatch(S1,S2) says that they match. The H1==H2 test eliminates a lot of subroutine calls to slow "compare"s when the symbols obviously differ, but for symbols whose hashes match, the call is required to verify that they're really the same.)



                The main Myers algorithm itself has time complexity O(ND) where N is the number of symbols and D is the number of differences—i.e., the length of the eventual edit-script—between the two input-sets. When the symbols are lines, a 1000-line file has 1000 symbols; when the symbols are, instead, words, the 1000-line file may have 30000 symbols. So this will obviously be slower, but at least it's generally linearly slower. The histogram and patience algorithms are modifications of Myers that should behave similarly, time-wise, I think, but I haven't really studied them properly.)







                share|improve this answer












                share|improve this answer



                share|improve this answer










                answered Nov 21 '18 at 23:40









                torektorek

                184k17235316




                184k17235316






























                    draft saved

                    draft discarded




















































                    Thanks for contributing an answer to Stack Overflow!


                    • Please be sure to answer the question. Provide details and share your research!

                    But avoid



                    • Asking for help, clarification, or responding to other answers.

                    • Making statements based on opinion; back them up with references or personal experience.


                    To learn more, see our tips on writing great answers.




                    draft saved


                    draft discarded














                    StackExchange.ready(
                    function () {
                    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53421427%2fautomatic-merging-in-git-without-conflicts-using-word-by-word-diff-instead-of-l%23new-answer', 'question_page');
                    }
                    );

                    Post as a guest















                    Required, but never shown





















































                    Required, but never shown














                    Required, but never shown












                    Required, but never shown







                    Required, but never shown

































                    Required, but never shown














                    Required, but never shown












                    Required, but never shown







                    Required, but never shown







                    Popular posts from this blog

                    404 Error Contact Form 7 ajax form submitting

                    How to know if a Active Directory user can login interactively

                    Refactoring coordinates for Minecraft Pi buildings written in Python