1 Introduction
“Integration Hell” refers to the scenarios where developers integrate or
merge big chunks of code changes from software branches right before delivering a software product [
20]. In practice, this integration process is rarely smooth and seamless due to
conflicts, which can take developers hours or even days to debug so that branches can finally merge [
12]. To avoid “Integration Hell”, an increasing number of developers use
Continuous Integration (
CI) to integrate code frequently (e.g., once a day) and to verify each integration via automated build (i.e., compilation) and testing [
54,
57]. Nevertheless, CI practices do not eliminate the challenges posed by merge conflicts. Developers still rely on the merge feature of version control systems (e.g., git-merge [
25]) to automatically (1) integrate branches and (2) reveal conflicts that require manual resolution.
Such text-based merge usually produces numerous false positives and false negatives. For example, when two branches reformat the same line in divergent ways (e.g., add vs. delete a whitespace), git-merge reports a
textual conflict even though such a conflict is unimportant and poses no syntactic or semantic difference. Meanwhile, when two branches edit different lines modifying the program semantics in conflicting ways, git-merge silently applies both edits without reporting any conflict.
Background. In prior literature, many tools were proposed to improve text-based merge [
27,
28,
30,
39,
46,
59]. For instance, FSTMerge [
28] models program entities (e.g., classes and methods) as unordered tree nodes, matches entities based on their signatures, and uses text-based merge to integrate edits inside matched entities. JDime [
27,
39] extends FSTMerge by modeling both program entities and statements in its tree representation. It applies tree matching and amalgamation algorithms to integrate edits. Given textually conflicting edits, AutoMerge [
59] enumerates all possible combinations of the edit operations from both branches and recommends alternative conflict resolutions. However, all of these tools only compare edits applied to the same program entity; they do not check whether the co-application of edits to different entities can trigger any compilation or testing error. Crystal [
30] overcomes this limitation by building and testing tentatively merged code to reveal
higher-order conflicts (i.e., build conflicts and test conflicts).
Motivation. Despite the existence of diverse tools, some fundamental
research questions (
RQs) are yet to be explored, including
—
RQ1: How were the three types of conflicts (i.e., textual, build, and test conflicts) introduced in real-world applications?
—
RQ2: What are developers’ resolution strategies for different types of conflicts?
—
RQ3: What characteristics of conflicts are overlooked by existing tool design?
Exploring these questions is important for two reasons. First, by contrasting the characteristics of merge conflicts with the focus of existing merge tools, we can reveal the critical aspects of conflicts overlooked by such tools. Second, by examining how conflicts were introduced and resolved by developers, we can motivate new tools to address conflicts by mimicking developers’ practices or bringing humans into the loop.
Methodology. To investigate the RQs, we conducted a comprehensive characterization study of merge conflicts. A typical
merging scenario in software version history involves four program commits: base version
\(\boldsymbol {b}\), left branch
\(\boldsymbol {l}\), right branch
\(\boldsymbol {r}\), and developers’ merge result
\(\boldsymbol {m}\) (see Figure
2). To crawl for merging scenarios in the version history, we searched for any commit that has two parent commits. If a commit
c has two parent commits, we consider
c to be developers’ merge result
m, treat the first parent commit as
l, and regard the second parent as
r. To identify the common base version
b, we applied the command “
git merge-base” to the two parent commits.
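The parent-based identification of merging scenarios and the best-common-ancestor computation behind `git merge-base` can be sketched on a toy commit graph. This is a simplified, name-based model we introduce purely for illustration; in practice the real repository is queried via git itself:

```python
from collections import deque

def merge_scenarios(parents):
    """Yield (m, l, r) for every commit with exactly two parents.

    `parents` maps each commit id to the list of its parent ids
    (a toy stand-in for a real commit graph; hypothetical data).
    """
    for commit, ps in parents.items():
        if len(ps) == 2:
            yield commit, ps[0], ps[1]  # m, first parent l, second parent r

def ancestors(parents, commit):
    """All commits reachable from `commit`, including itself."""
    seen, queue = set(), deque([commit])
    while queue:
        c = queue.popleft()
        if c not in seen:
            seen.add(c)
            queue.extend(parents.get(c, []))
    return seen

def merge_base(parents, l, r):
    """A best common ancestor of l and r, in the spirit of git merge-base:
    a common ancestor that is not an ancestor of any other common ancestor."""
    common = ancestors(parents, l) & ancestors(parents, r)
    best = [c for c in common
            if not any(o != c and c in ancestors(parents, o) for o in common)]
    return best[0] if best else None
```

For the classic diamond history (b is the parent of both l and r, and m merges l and r), these functions recover b as the base version.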
As shown in Figure
1, we took a two-phase approach to extract and analyze merge conflicts in Java open-source software repositories. In Phase I, starting from an existing merging scenario, we applied git-merge, automated build, and automated testing in sequence. With git-merge, we tentatively generated a text-based merged version
\(\boldsymbol {A_m}\) from
l and
r (see Figure
2). We then built
\(A_m\) and tested it with developers’ predefined test cases. When any of these steps failed, we concluded that
l and
r have textual, build, or test conflicts. For deeper analysis, in Phase II, we manually inspected a sample set of scenarios with textual conflicts and all scenarios with build or test conflicts. For each of these scenarios, we compared the five program versions involved—
b,
l,
r,
m, and
\(A_m\)—to comprehend the root cause and resolution of conflict(s).
Our Results. In our study, we mined the software version history of 208 popular open-source Java projects, and found 117,218 merging scenarios (i.e., any code commit with two parent commits) in those repositories. With the two-phase methodology mentioned above, we identified 15,886 merging scenarios with textual conflicts, 79 scenarios with build conflicts, and 33 scenarios with test conflicts. Due to the huge number of revealed textual conflicts, we randomly picked 385 conflicts from distinct merging scenarios for manual inspection. We also manually analyzed all revealed build and test conflicts to characterize the root cause and developers’ conflict resolutions. The major findings are as below:
—
How were conflicts introduced? 65 out of 385 inspected textual conflicts are false positives because l and r edit adjacent lines instead of the same lines. 18 of the 65 false positives are located in non-Java files (e.g., pom.xml). Build conflicts occurred when the co-application of edits from l and r broke any def-use link between program elements (e.g., classes or libraries). For instance, when l updates a method from foo() to bar() and r adds a call foo(), the co-application of both edits can break a def-use link as it leads to a mismatch between the use (i.e., call) and def (i.e., declaration) of foo(). 85% (39/46) of test conflicts happened, as the co-application of edits broke def-use links between elements or led to mismatches between test oracles and software implementation.
—
How did developers manually resolve merge conflicts? Within the 320 true textual conflicts, developers resolved most conflicts (i.e., 206) by exclusively keeping changes from one branch. However, developers resolved most of the higher-order conflicts by combining all edits from both branches with additional edits. The additional edits solve build or test errors by repairing broken def-use links or fixing mismatches between implementation and tests. Such edits present systematic editing patterns that modify similar code in similar ways.
—
What conflicts cannot be handled by current tools? Existing tools detect textual conflicts in non-Java software artifacts (e.g., readme.txt) with relatively low precision (72%) and are thus unable to resolve true textual conflicts fully automatically. When textual conflicts exist and \(A_m\) cannot be generated, neither compilation nor testing is applicable for conflict detection. Even if \(A_m\) is available, compilation and testing can only reveal the symptoms (e.g., errors or failures) triggered by higher-order conflicts instead of the precise root causes. There is insufficient tool support for the detection and resolution of higher-order conflicts.
The insights from this study enlighten future software merge tools and suggest new research directions in related areas like systematic editing and change recommendation. The program and data presented in this work are publicly available at
https://figshare.com/s/c174e1ffd2ad02b15211.
3 Study Approach
Our approach has two phases (see Figure
1). Phase I uses automatic approaches to discover conflicts (Section
3.1). Phase II relies on manual inspection to comprehend the revealed conflicts and developers’ resolutions (Section
3.2).
3.1 Phase I: Automatic Detection of Conflicts
To ensure the representativeness of our study, we ranked Java projects on GitHub based on their popularity (i.e., star counts). We cloned repositories for the top 1,000 projects as our initial dataset. We then refined the dataset with two heuristics. First, we only kept the projects that can be built with Maven [
22], Ant [
24], or Gradle [
19], because these three build tools are popularly used and we will rely on them to build and test each naïvely merged version
\(A_m\). In particular, Gradle projects typically contain build files named
*.gradle; Maven and Ant projects often have build files named
pom.xml and
build.xml, respectively. We recognized the projects that can be built with Maven, Ant, or Gradle based on the existence of the corresponding build files. Second, we removed tutorial projects because they are not real Java applications and may not show real-world merging scenarios. Namely, if the readme file of a project contains a description like “this tutorial project serves for learning purposes”, we removed the project.
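The file-existence heuristic can be sketched as follows. This is a simplification we wrote for illustration; the function name and the recursive glob search are our own assumptions, and real projects may use other layouts:

```python
from pathlib import Path

def detect_build_tools(repo_root):
    """Guess which build tools a repository supports by looking for the
    characteristic build files described above (a toy heuristic sketch)."""
    root = Path(repo_root)
    tools = set()
    if any(root.rglob("*.gradle")):   # Gradle projects carry *.gradle files
        tools.add("gradle")
    if any(root.rglob("pom.xml")):    # Maven projects carry pom.xml
        tools.add("maven")
    if any(root.rglob("build.xml")):  # Ant projects carry build.xml
        tools.add("ant")
    return tools
```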
Consequently, our refined dataset comprises 208 repositories. Among the 208 repositories, we identified 117,218 merging scenarios by searching for any commit with two parent/predecessor commits. In each identified merging scenario, we regard the first and second parent commits as
l and
r, respectively. We treat their child commit and common ancestor commit as
m and
b. Figure
3 shows the distribution of projects based on their merging-scenario counts. According to this figure, three projects contain no more than two merging scenarios, while two projects contain more than
\(2^{13}\) or 8,192 scenarios. Among the 14 count intervals,
\((2^6, 2^7)\) corresponds to the most projects (i.e., 37). The median count per project is 131, which means that merging scenarios exist widely in the subject software repositories.
Although some of the 208 repositories have few merging scenarios (e.g., fewer than 10) or are maintained by single developers, we did not further refine this dataset for two reasons. First, if the developers of some repositories do not create or merge branches very often, our dataset should still cover those developers’ merging practices in order to be representative. Second, if some developers eagerly create and merge branches even though they are the sole contributors to their projects, we believe it is also important to cover the merging scenarios in their repositories. Because different projects may define test files in different ways, we used the filename pattern “*Test*.java” to search for test files in the latest versions of the 208 repositories. We found test files in 191 repositories; only 17 repositories have no test file. These numbers indicate that test files exist widely in the subject repositories; they are usually available when testing is required to reveal test conflicts.
As illustrated in Figure
1, in Phase I, we process each merging scenario by taking three steps sequentially. In
Step 1, we apply git-merge to
l and
r to generate a text-based merged version
\(A_m\). If this trial fails, git-merge reports all textual conflicts, and we record that scenario. Otherwise, if both
l and
r build smoothly and we successfully generate
\(A_m\), then, in
Step 2, we attempt to build
\(A_m\). If the attempt fails, we log all build errors and label the merging scenario to have build conflicts. Otherwise, if
\(A_m\) builds successfully and both
l and
r pass developers’ test cases, then in
Step 3, we further execute the successfully built version of
\(A_m\) with the developers’ test suite. If the program fails any test, we log all runtime errors and label the scenario to have test conflicts; otherwise, we skip the merging scenario as no conflict is revealed. By applying such a three-step method, we marked 15,886 merging scenarios with textual conflicts, 79 scenarios with build conflicts, and 33 scenarios with test conflicts. As shown in Table
3, these three types of scenarios come from 183, 37, and 22 repositories, respectively. Per repository, we found at most 4,172 merging scenarios with textual conflicts, six with build conflicts, and four with test conflicts. All reported numbers imply the prevalence of merge conflicts. Textual conflicts were reported more often than the other two types of conflicts.
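The sequential three-step filtering can be summarized as a small decision function. The three flags are hypothetical stand-ins for the outcomes of git-merge, building \(A_m\), and running the developers’ test suite; the paper’s preconditions (that l and r themselves build and pass tests) are assumed to hold:

```python
def classify_scenario(merge_ok, build_ok, tests_pass):
    """Label a merging scenario after the three sequential steps.

    Each step only runs when the previous one succeeds, mirroring
    the pipeline described above (a simplified sketch).
    """
    if not merge_ok:
        return "textual conflict"     # Step 1: git-merge reported conflicts
    if not build_ok:
        return "build conflict"       # Step 2: A_m fails to compile
    if not tests_pass:
        return "test conflict"        # Step 3: A_m fails developers' tests
    return "no conflict revealed"     # scenario is skipped
```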
3.2 Phase II: Manual Inspection
We inspected the merge conflicts to investigate the three research questions mentioned in Section
1. Due to the large number of merging scenarios with textual conflicts (i.e., 15,886), it is infeasible to analyze each conflict manually. We thus decided to sample 385 textual conflicts to reduce manual effort while ensuring the representativeness of our observations. In particular, when the 15,886 merging scenarios contain in total thousands or millions of textual conflicts, 385 is a statistically significant sample size at a 95% confidence level and a
\(\pm\)5% confidence interval [
23,
41]. To construct our sample set, we randomly picked one or more merging scenarios in each of the 183 repositories and examined one of the reported textual conflicts for each scenario. To manually analyze every sampled conflict, we studied five related program versions:
b,
l,
r,
m, and
\(A_m\).
We inspected all build conflicts without sampling because there are only a few such scenarios (i.e., 79). Different from git-merge, build tools report compilation errors but never pinpoint the conflicting edits. To locate and understand those edits that correspond to each compilation error, we checked five program versions (i.e., b, l, r, m, and \(A_m\)) and inspected edits applied to distinct locations. If the co-application of certain edits from l and r is responsible for the error, we identified the minimum set of involved edits as the root cause of a build conflict. Similarly, we inspected all scenarios labeled with test conflicts (i.e., 33). Automated testing reports runtime errors but does not locate any conflicting edits. To reveal those edits, we inspected all five program versions to compare the semantics and identified the minimum set of responsible edits.
In more detail, to locate the root cause of a higher-order conflict, we checked whether developers’ merged version m had any build or test error. If both \(A_m\) and m had build or test errors, we could not decide how those errors were introduced or resolved. In such scenarios, we checked three more commits after m in the software history. If none of these additional commits resolved the build or test errors introduced earlier, we concluded that the merging scenario had some unknown build/test conflicts. Otherwise, if m or any of the later commits being checked had zero build/test errors, we compared \(A_m\) with that commit to locate conflicts.
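The fallback procedure for deciding which commit to compare \(A_m\) against can be sketched as a small function (a simplified model with hypothetical boolean inputs; real inspection involves building and testing each commit):

```python
def locate_resolution(a_m_has_errors, m_has_errors, later_commits_ok):
    """Pick the commit to diff against A_m when diagnosing a higher-order
    conflict, following the procedure described above.

    `later_commits_ok` lists, for up to three commits after m, whether
    each one is free of build/test errors (hypothetical inputs).
    """
    if not a_m_has_errors:
        return None                  # nothing to diagnose
    if not m_has_errors:
        return "m"                   # developers fixed the errors in m
    for i, ok in enumerate(later_commits_ok[:3]):
        if ok:
            return f"m+{i + 1}"      # first later commit with zero errors
    return "unknown"                 # root cause stays unknown
```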
To ensure the quality of our manual inspection, two authors independently examined the sampled textual conflicts and all merging scenarios with build or test failures. The authors compared their descriptions of root causes and resolutions for all conflicts and extensively discussed any disagreement until reaching a consensus. As shown in Table
3, we sampled 385 textual conflicts, and manually located 107 build conflicts as well as 46 test conflicts in the 112 (i.e., 79 + 33) examined scenarios. Notice that one merging scenario can have one or more higher-order conflicts. Thus, the total number of higher-order conflicts (i.e., 107 + 46) is larger than the total number of examined scenarios (i.e., 112). We report our empirical findings for different conflict types in Sections
4–
6.
5 Study Results On Build Conflicts
This section presents our analysis of the 107 build conflicts.
5.1 RQ1: Root Causes of Build Conflicts
We characterized each inspected build conflict from two perspectives: the edited files and edit types (see Figure
5). We did not classify conflicts based on relative edit locations because no conflicting edits overlap textually. In other words, each pair of conflicting edits was smoothly integrated into
\(A_m\) by git-merge. In terms of edited files, 107 build conflicts can be classified into four categories. 79 conflicts exist among Java edits (e.g., Table
8(a)); 19 conflicts occur among edits to build scripts (e.g., Table
8(c)); seven conflicts are due to simultaneous edits in Java code and build scripts (e.g., Table
8(d)); and two conflicts are caused by unknown reasons. All these conflicts are true positives because they all trigger compilation errors. We classified the conflicts into four categories in terms of edit types: (1)
update vs. add, (2)
delete vs. add, (3)
others, and (4)
unknown.
5.1.1 Update vs. Add.
These 68 conflicts can be further classified into four subcategories: declaration-reference, super-sub, version-version, and dependency-code. We defined these subcategories based on the content and semantic dependencies of conflicting edits. Table
8 presents an example for each subcategory, and we will explain all of them in detail below.
(a) Declaration-reference. When
l (or
r) updates the declaration of a program entity and
r (or
l) adds references to the original declaration, there is a mismatch between referencers and the referencee. For example, in Table
8(a),
r updates a field name
EPHEMERAL to
DISTRO, while
l adds a reference to
EPHEMERAL. The integration of both edits causes a build error “cannot find symbol: variable EPHEMERAL”.
(b) Super-sub. There are scenarios where one branch adds a type reference via inheritance (i.e.,
A extends B) or implementation (i.e.,
A implements B), and the other branch updates a method of the super or subtype. The edit co-application makes method signatures inconsistent between two Java types. For instance, in Table
8(b),
r revises an interface
IndexDAO to declare a new method
getTaskLogs(...), while
l defines a class
ElasticSearch5DAO to implement the original interface. Because the defined class does not implement the newly added method, the compiler outputs an error “ElasticSearch5DAO is not abstract and does not override abstract method getTaskLogs(String) in IndexDAO”.
(c) Version-version. In some merging scenarios, one branch updates the version number of a defined artifact in build scripts while the other branch adds one or more references to the original artifact version. As shown in Table
8(c),
l adds a reference to version
0.2.0 of artifact
spring-cloud-alibaba; however,
r updates the artifact’s version from
0.2.0 to
0.2.0.BUILD-SNAPSHOT. When both edits are applied, the compiler or build system produces an error “Could not find artifact org.springframework.cloud:spring-cloud-alibaba:pom:0.2.0”.
(d) Dependency-code. There are cases where one branch updates a library dependency in build scripts while the other branch adds code that accesses APIs only supported by the original library. In Table
8(d),
l upgrades artifact
elasticsearch and
r adds code to call
Bucket.getKeyAsText(), which is only supported by the old library. The co-application of both edits triggers a compilation error “cannot find symbol: method getKeyAsText()”.
5.1.2 Delete vs. Add.
These 31 conflicts can be further put into two subcategories: declaration-reference and dependency-code. The first subcategory represents scenarios where
l (or
r) deletes an entity declaration and
r (or
l) adds one or more references to that entity. In Table
9,
r deletes an entire file
DubboTransportedMetadata.java and thus removes the class definition for
DubboTransportedMetadata. Meanwhile,
l adds an
import-declaration for the class. Consequently, the combination of both edits leads to an error “cannot find symbol: class DubboTransportedMetadata”. The second subcategory means one branch deletes the library dependency in build scripts, and the other branch adds references to APIs solely provided by that library. These two subcategories are similar to the subcategories (a) and (d) of
update vs. add, but different in terms of edit types.
5.1.3 Others.
Six conflicts were introduced for miscellaneous reasons. Four of them are
add vs. add conflicts. For example, as shown in Table
10, when both branches add declarations of the same field
required, integrating those edits can trigger a build error like “variable required is already defined in class CodegenParameter”. The other two instances are
update vs. update conflicts. Namely, when
l and
r update different code regions of the same method, the edit integration accidentally introduces the usage of undefined variables.
5.1.4 Unknown.
For two conflicts, we could not figure out how the edits from l and r conflict with each other. The reported errors are all about missing packages. For instance, one error is “package org.springframework.mock.web does not exist”. Although this seems to be a library configuration issue, we could not locate the root cause or developers’ manual resolution.
Summary. We notice an interesting common characteristic among the 105 conflicts with known reasons. When developers define and use program elements (e.g., classes or libraries) in software, they should always observe two constraints regarding def-use links. First, no element should be defined twice. Second, any used element should correspond to a defined element. If the edit integration between branches breaks def-use links by violating any constraint, build systems report errors, and those integrated edits result in build conflicts.
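The two def-use constraints can be expressed as a name-level check over a tentatively merged version. This is a toy model we introduce for illustration; real program elements would be resolved by a compiler or static analyzer rather than matched by name:

```python
def check_def_use(defs, uses):
    """Check the two def-use constraints on a merged version:
    (1) no element is defined twice, and
    (2) every used element corresponds to a defined element.

    `defs` lists defined element names (duplicates allowed, modeling both
    branches adding the same definition); `uses` lists referenced names.
    """
    violations = []
    seen = set()
    for d in defs:
        if d in seen:
            violations.append(f"duplicate definition: {d}")
        seen.add(d)
    for u in uses:
        if u not in seen:
            violations.append(f"undefined reference: {u}")
    return violations
```

For the field-rename example above, the merged version defines DISTRO but still uses EPHEMERAL, so the second constraint is violated.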
5.2 RQ2: Resolutions of Build Conflicts
For the 105 conflicts with identified root causes, we further studied developers’ manual resolutions. Table
11 presents the resolution distribution. As shown in the table, developers resolved 78 conflicts via L+R+M. When the co-application of edits from branches triggers any build error, developers usually applied adaptive changes to glue those edits. For instance, to resolve the conflict shown in Table
8(b), developers kept edits from both sides and inserted code to
ElasticSearch5DAO in order to implement the newly added method interface
getTaskLogs(...). Additionally, developers adopted L and R to separately resolve five and 22 conflicts.
Interestingly, the resolution distribution shown in Table
11 is quite different from developers’ resolutions to textual conflicts (see Table
5). Most build conflicts were resolved via L+R+M instead of L or R. One possible reason to explain this contrast is that build conflicts are between edits applied to distinct program locations, while textual conflicts are between divergent edits to the same location. Although it is hard to co-apply the edits that textually overlap, it is easier to co-apply the edits whose locations are different. Thus, it is more favorable for developers to keep edits from both sides when resolving build conflicts and make adaptive changes as needed.
Although the additional edits M vary with merging scenarios, we observed two important commonalities. First, the edits were always applied to remedy broken def-use links. For simplicity, we name such edits for link-repair purposes as
adaptive changes. Second, for many scenarios, the adaptive edits similar to M were already applied in either
l or
r. Take Table
8(c) as an example. When developers updated the artifact’s version number from
0.2.0 to
0.2.0.BUILD-SNAPSHOT in
r, they also revised build scripts to consistently update all version references in order to use the new artifact version. We use
consistent edits or
systematic edits to refer to the similar edits repetitively applied to multiple locations to address the similar or identical coding issues in those locations. In Table
8(c), we observed that developers applied adaptive changes to resolve all build errors triggered by the version upgrade in
r. These adaptive changes are very similar to the additional edits M because both sets of edits intend to fix the same kind of errors.
Among the 78 conflicts resolved via L+R+M, we found that 64 conflicts (82%) have M consistent with (i.e., similar to) the adaptive edits applied in one branch. These highly consistent edits indicate great opportunities for automatic conflict resolution.
5.3 RQ3: Discussion on Existing Tool Design
Limitations of Conflict Detectors. Table
2 shows three detectors for build conflicts—Crystal, WeCode, and CloudStudio. All of them rely on build systems to reveal build conflicts. However, based on our experience, the common methodology of these tools has three limitations.
First, the builder-based or compiler-based detection of conflicts is ineffective in reporting build conflicts when textual conflicts exist between l and r. In fact, during our manual inspection of textual conflicts, we noticed that build conflicts can coexist with textual conflicts in versions that git-merge fails to merge. For such scenarios, neither Crystal nor WeCode reports any build conflict, as they require users to remove all textual conflicts first and produce a merged version \(A_m\). Second, given a merged version, build systems may not report all compilation errors in one run. Namely, the initially revealed errors can prevent builders from detecting other errors. Thus, it can be time-consuming for developers to recognize all build errors through repetitive compilation and program revision. Our study did not change any program or fix any build error. Instead, we focused on the initial build errors. As a result, our analysis can miss some build errors hidden by the initially reported ones. It is possible that the 79 merging scenarios with detected build conflicts contain more than 107 build conflicts between branches. Third, given a build error, developers have to identify the conflicting edits manually. When l and r contain many edits, developers may find it challenging to locate the root cause. In our manual analysis, even though we spent lots of time investigating the root causes of every reported build error, there are still two errors for which we could not locate the conflicting edits. Existing tools do not help developers reason about conflicts.
Future Opportunities of Conflict Detectors. As mentioned in Section
5.2, build conflicts are mainly about the broken def-use links induced by software merge. The resolution edits always repair broken links, and those edits are often inferable from relevant adaptive changes in either
l or
r. Based on these observations, we envision a promising conflict detector to replace the usage of compilers with static program analysis. Specifically, the tool can contrast all edits separately applied in
l and
r, reason about the def-use links between edited program elements, and report conflicts whenever a def-use constraint between elements is violated. For example, for the merging scenario shown in Table
8(a), a future tool can scan for any updated field in either branch (e.g.,
r) and then check whether the other branch adds any reference to the original field before the update. If so, a broken def-use link is detected, and a conflict is reported accordingly. Without using any builder or compiler, such static analyzers can overcome the three limitations mentioned above.
Limitations and Opportunities of Conflict Resolvers. We were not aware of any tool that can resolve build conflicts, so we conducted a pilot study by applying four syntax-based tools (i.e., FSTMerge, JDime, IntelliMerge, and AutoMerge) to five merging scenarios with known build conflicts. According to our experiments, JDime and AutoMerge are unable to detect or resolve build conflicts. In the scenarios, when only Java files are involved in conflicts, both tools either naïvely integrate edits as git-merge does, or fail to process newly added Java files because they strictly require all three program versions (b, l, r) to be provided. In the scenarios when build scripts are also edited, both tools either fail to process non-Java files or output nothing. To sum up, JDime and AutoMerge are unable to detect and resolve build conflicts. Interestingly, when applied to the five scenarios, FSTMerge and IntelliMerge successfully handled one scenario by adding an extra edit to fix the broken def-use link but failed in the other four. In the scenario where FSTMerge and IntelliMerge resolved a conflict, l removes an import from a Java file, and r adds a reference to the originally imported class in the same file. The tools fixed the broken def-use link by adding back the removed import. Our pilot study shows that we lack systematic tool support for the resolution of build conflicts.
Based on our conflict characterization, we see great opportunities to create conflict resolvers. Specifically, a promising approach can infer systematic editing patterns from the adaptive changes applied in either
l or
r and customize the inferred patterns to resolve conflicts in
\(A_m\). Our insight is that
if developers resolve compilation errors in either branch for any edit that breaks def-use links, they are likely to resolve the same compilation errors in \(A_m\) in similar ways for that merged-in edit. Take Table
8(a) as an example. Given the conflicting edits between branches, a future tool can focus on the field-update edit and mine
r for any adaptive edits related to that update. Suppose adaptive edits were applied in
r to remedy broken links between the defined fields and field accesses (i.e., by updating all added field accesses to refer to
DISTRO). In that case, the future tool can similarly apply such edits to
\(A_m\) to resolve conflicts.
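As a minimal sketch of such pattern replay (regex-based and purely illustrative; a real resolver would operate on syntax trees rather than raw text):

```python
import re

def apply_adaptive_rename(a_m_source, old_name, new_name):
    """Replay a rename observed in one branch onto the merged version A_m,
    updating the references that the other branch added to the old name.

    A toy, text-level stand-in for the envisioned pattern customization.
    """
    # word-boundary match so e.g. EPHEMERAL_MODE is left untouched
    return re.sub(rf"\b{re.escape(old_name)}\b", new_name, a_m_source)
```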
6 Study Results On Test Conflicts
This section presents our analysis of the 46 test conflicts.
6.1 RQ1: Root Causes of Test Conflicts
We characterized test conflicts from two perspectives: the edited files and edit types. As shown in Figure
6, we classified conflicts into four categories based on edited files: 35 conflicts exist among Java edits (e.g., Table
12(b)); three conflicts are among edits to Java and other files (i.e.,
.xml or
.groovy); one conflict is among non-Java files (e.g., Table
12(a)); and seven conflicts were caused by unknown reasons. All these conflicts involve simultaneous edits to test oracles and/or software implementation. Compared with Figure
5, we have more test conflicts due to unknown reasons (i.e., 7 vs. 2). This is because build errors usually report the program elements that break def-use links, which helped us manually identify root causes. On the other hand, test errors only report the locations where abnormal program behaviors are observed. These locations may be far away from the places where program states initially become erroneous. Without project-specific domain knowledge and dynamic analysis tools, it is hard for us to manually reason about all root causes.
We classified conflicts into four categories in terms of edit types: (1)
update vs. add, (2)
add vs. add, (3)
delete vs. add, and (4)
unknown.
Update vs. add has four subcategories: declaration-reference, super-sub, dependency-code, and implementation-oracle. The first three subcategories are similar to the subcategories (a), (b), and (d) mentioned in Section
5, and they all break def-use links. However, these conflicts trigger test errors instead of build errors because part of the conflicting edits were applied to test code, which is compiled in the testing phase instead of the build phase.
The subcategory
implementation-oracle means that a conflict happens if one branch changes program implementation, while the other branch adds test code whose oracle matches the original implementation. For instance, in Table
12(a),
l updates the implementation of
KotlineJsonAdapter and
r adds a test case to
KotlineJsonAdapterTest in order to test the original implementation. The co-application of both edits triggers an assertion error “Expecting message: \(\lt\)“Non-null value ‘a’ was null at $”\(\gt\) but was: \(\lt\)“Non-null value ‘a’ was null at $.a”\(\gt\)”. It indicates that there is a semantic mismatch between the code implementation and test oracle.
Add vs. add means that
l and
r add the same method at distinct program locations of the same file, causing one method to have duplicated definitions. Take Table
12(b) as an example. Because both branches simultaneously define the same method
getId() in a file inside the test folder, the co-application of both edits triggers an error “method getId() is already defined in class Employee”. Notice that
such duplicated definition errors were introduced by git-merge, as text-based merge naïvely combines the edits that branches simultaneously applied to distinct program locations. These errors are only reported by automated testing instead of software build because the edits were applied to test files.
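The naïve combination can be illustrated with a small sketch. The crude regex-based signature scan below is illustrative only (not the implementation of any studied tool): over a merged file in which both branches appended the same method at different offsets, it reveals the duplicate getId() that javac later rejects.

```java
import java.util.*;

// Illustrative sketch: text-based merge combines non-overlapping hunks blindly,
// so two branches appending the same method at different offsets yields a file
// that declares it twice. A simple signature scan exposes the duplicate.
public class DuplicateMethodCheck {
    static List<String> signatures(String source) {
        List<String> sigs = new ArrayList<>();
        // crude signature matcher, for illustration only
        java.util.regex.Matcher m = java.util.regex.Pattern
            .compile("\\b(?:public|private|protected)?\\s*\\w+\\s+(\\w+\\s*\\([^)]*\\))")
            .matcher(source);
        while (m.find()) sigs.add(m.group(1).replaceAll("\\s+", ""));
        return sigs;
    }

    public static void main(String[] args) {
        // both branches simultaneously added getId() to Employee at distinct locations
        String merged =
            "class Employee {\n" +
            "  int id;\n" +
            "  int getId() { return id; }\n" +   // added by branch l
            "  String name;\n" +
            "  int getId() { return id; }\n" +   // added by branch r
            "}\n";
        Set<String> seen = new HashSet<>();
        for (String s : signatures(merged))
            if (!seen.add(s))
                System.out.println("duplicate definition: " + s);
    }
}
```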
The single
delete vs. add conflict was introduced when one branch deletes a class, and the other branch adds a test class to refer to that deleted class. This subcategory is similar to the
delete vs. add category mentioned in Section
5. The only difference is that the added reference exists in a test class, which is compiled in the testing phase instead of the build phase and thus triggers test errors.
Summary. We noticed three interesting phenomena. First, many of the observed test errors are compilation errors in test cases. They are revealed in the testing phase only because the test cases get compiled within this phase. We currently count the conflicts triggering these errors as test conflicts simply because the compiler-based detection of build conflicts is insufficient and produces false negatives. Second, the add-add conflicts are false negatives of the text-based merge. Git-merge naïvely introduced them into \(A_m\) without comparing the edits co-applied to different locations for conflict detection or resolution. Third, among the inspected 46 test conflicts, we observed 11 related to mismatches between code implementation and test oracles.
6.2 RQ2: Resolutions of Test Conflicts
For the 39 conflicts with revealed root causes, we further classified the corresponding manual resolutions. As shown in Table
13, developers resolved 21 conflicts via L+R+M, and all these conflicts belong to the
update vs. add category. For instance, to resolve the conflict shown in Table
12(a), developers kept edits from branches and revised the test oracle to match with the output of the updated implementation. Additionally, developers resolved 13 conflicts via R, all of which belong to the
add vs. add category. For instance, to resolve the conflict listed in Table
12(b), developers kept the right version because the methods repetitively added by branches cannot coexist in the same program context. These numbers suggest that developers prefer L+R+M over the other resolution strategies.
Similar to what we found among resolutions to build conflicts, the additional edits M applied to resolve test conflicts have two important characteristics. First, all edits were applied to fix def-use links or to match test oracles with code implementation. For simplicity, we also refer to such edits as adaptive changes. Second, they were usually consistent with the adaptive edits applied in one branch. Take Table
12(a) as an example. When developers updated the implementation of
KotlinJsonAdapter in
l, they also adapted assertions (i.e., test oracles) in
KotlinJsonAdapterTest to ensure test success. If we compare their adaptive changes in
l and the additional edits in
m (see Table
14), the two change-sets are identical despite their distinct program contexts. Among the 21 conflicts resolved via L+R+M, 12 conflicts have M similar or identical to the adaptive edits in one branch.
6.3 RQ3: Discussion on Existing Tool Design
Limitations of Conflict Detectors. Table
2 lists three detectors for test conflicts. Both Crystal and WeCode rely on automated testing to reveal test conflicts so that they can cover all test conflicts mentioned in Section
6.1. However, according to our experience, the common methodology of both tools suffers from five limitations. First, when test conflicts coexist with other types of conflicts, automated testing is infeasible: developers have to resolve the textual and build conflicts before they can run the test cases and expose test conflicts. During our manual analysis of textual conflict scenarios, we observed test conflicts coexisting with textual ones; for such scenarios, neither Crystal nor WeCode reveals any test conflict. Second, when multiple test conflicts coexist, the runtime errors triggered by some conflicts can stop program execution and prevent testing from detecting the other conflicts. Consequently, testing-based conflict detection can be slow and may require many test runs. Third, it is time-consuming for developers to write test cases and ensure sufficient test coverage. Fourth, both tools reveal only the symptoms (i.e., runtime errors) instead of the root causes of test conflicts; they provide no further assistance in conflict localization. Fifth, flaky tests pass and fail nondeterministically across program runs. For such scenarios, automated testing cannot effectively reveal the symptoms of conflicts, and we did not include such scenarios in our dataset.
SafeMerge is designed to statically reason about program semantics and thus overcome all limitations of testing-based conflict detection. However, due to its complex modeling of program semantics, SafeMerge cannot relate edits applied to distinct Java methods, nor can it analyze edits applied to non-code files. Therefore, SafeMerge cannot reveal any test conflict caused by edits simultaneously applied to distinct program elements (i.e., files or classes). Our study found all 39 conflicts to be introduced by simultaneous edits to different program elements, which limits the usefulness of SafeMerge in real-world scenarios. In fact, we tried to apply SafeMerge to these test conflicts but without success; therefore, we were unable to conduct a pilot study to validate the tool’s capability.
Opportunities of Conflict Detectors. SafeMerge demonstrates a promising way to detect test conflicts via static program analysis. However, SafeMerge is limited to intra-procedural analysis, while most conflicts we observed require inter-procedural analysis. The mismatch between the problem domain and the current solution indicates great opportunities for conflict detectors based on inter-procedural static program analysis.
Limitations and Opportunities of Conflict Resolvers. We are unaware of any tool that can resolve test conflicts, so we conducted a pilot study by applying four syntax-based tools (i.e., FSTMerge, JDime, IntelliMerge, and AutoMerge) to 15 merging scenarios with known test conflicts. Among these scenarios, 10 conflicts belong to add vs. add, which were wrongly introduced by git-merge, and five conflicts belong to update vs. add. As syntax-based tools were designed to align Java methods between branches based on method signatures and implementation, they should be able to detect or even resolve the add vs. add conflicts. Unsurprisingly, our study shows that the four tools resolved all 10 conflicts successfully: each tool recognized the duplicated addition of the same methods and included a single copy of those methods in the merged version. In other words, our pilot study confirmed that JDime, IntelliMerge, FSTMerge, and AutoMerge can nicely address the add vs. add conflicts incorrectly introduced by git-merge. Meanwhile, no syntax-based tool can detect or resolve the other five test conflicts (update vs. add), because these tools do not model or compare any program semantics for co-applied edits.
There is no tool that automatically resolves semantic mismatches in test conflicts. Based on our conflict characterization, nevertheless, we see great research opportunities to create conflict resolvers. Similar to the future tool design described in Section
5.3, a promising approach can infer systematic editing patterns from the adaptive changes in
l or
r and customize the inferred patterns to resolve conflicts in
\(A_m\). Our insight is that
if developers resolve test errors in one branch for edits that either break def-use links or cause mismatches between code and tests, they are likely to resolve the same errors in \(A_m\) in similar ways. Among the 21 conflicts resolved with L+R+M, the proposed approach should be able to resolve 12 conflicts.
7 Our Recommendations
Based on our study, we would like to provide the following recommendations for future research.
Resolution Prediction for Textual Conflicts. According to our analysis, it seems too ambitious to build a tool that automatically resolves all conflicts because the additional edits M involved in some strategies are specific to projects and domains. However, it is promising to predict developers’ resolution strategies, as the developers of different projects implicitly share decision-making patterns like preferring L and R over other strategies. Future tools can characterize merging scenarios from different perspectives (e.g., the file types, edit types, edit content, program contexts, authors, and timestamps) and train machine-learning models for strategy prediction. As developers resolve most conflicts by keeping all edits from one branch, a highly accurate prediction model can automatically resolve those conflicts with high success rates to reduce manual effort.
Detection of Higher-order Conflicts. Existing tools mainly detect higher-order conflicts via automatic compilation and testing. Both compilation and testing heavily rely on the existence of merged versions and require intensive human-in-the-loop interactions before uncovering all conflicts. Therefore, the applicability and capability of such tools are not always satisfactory.
We recommend that future tools statically analyze and contrast software branches for conflict detection, without requiring branches to be merged or compiled successfully. Specifically, a new approach can enumerate all situations where the naïve edit integration between branches can cause build or test errors and define patterns to represent those situations. For instance, conflicting situations can include (1) one branch renames or removes a program element (e.g., class, method, or field), while the other branch adds references to the original element; (2) one branch adds a method \(m()\) to a Java interface, while the other branch defines a new class to implement the interface but does not define any method body for \(m()\). With patterns defined for such conflicting situations, the new approach can search for pattern matches between l and r in any given merging scenario to detect conflicts. In this way, the future tool does not need automatic compilation or testing.
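As one possible realization, the second conflicting situation above could be matched over abstract descriptions of branch edits. The Edit record shape and the kind labels below are our own illustrative assumptions, not the API of any existing tool:

```java
import java.util.*;

// Hypothetical sketch of pattern-based, merge-free conflict detection: each branch's
// edits are abstracted into (kind, target, detail) tuples, and a known conflicting
// situation is matched -- here, "l adds a method to an interface while r adds a class
// implementing that interface without a body for the new method".
public class PatternDetector {
    record Edit(String kind, String target, String detail) {}

    static List<String> detect(List<Edit> l, List<Edit> r) {
        List<String> conflicts = new ArrayList<>();
        for (Edit e1 : l)
            if (e1.kind().equals("add-interface-method"))
                for (Edit e2 : r)
                    if (e2.kind().equals("add-implementing-class")
                            && e2.target().equals(e1.target())
                            && !e2.detail().contains(e1.detail()))
                        conflicts.add("class in r implements " + e1.target()
                                + " but lacks " + e1.detail());
        return conflicts;
    }

    public static void main(String[] args) {
        // l adds area() to interface Shape; r adds a class implementing Shape
        // that defines only perimeter() -- a build conflict in the merged version.
        List<Edit> l = List.of(new Edit("add-interface-method", "Shape", "area()"));
        List<Edit> r = List.of(new Edit("add-implementing-class", "Shape", "perimeter()"));
        detect(l, r).forEach(System.out::println);
    }
}
```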
Resolution of Higher-order Conflicts. In our study, we observed that most conflicts were resolved via L+R+M, and the additional edits M are often similar to the adaptive changes applied to one of the branches. The rationale behind the observed similarity is that when developers apply certain edits to remove the build or test errors in either branch, they are likely to apply similar edits to remove the same build or test errors in the naïvely merged software \(A_m\).
Some systematic editing tools like Sydit [
45] and Lase [
44] can generalize abstract program transformations from concrete code change examples and repetitively apply similar edits to similar code snippets. We see promise in extending these tools for automatic conflict resolution. For instance, given edits from
l and
r producing a higher-order conflict, a future tool can extend Lase to search for relevant adaptive changes applied to one branch. The tool can then extract systematic editing patterns from those adaptive changes, customize patterns for particular program locations, and apply the customized edits.
9 Threats to Validity
Threats to External Validity. Our study is based on the 538 conflicts extracted from 208 Java project repositories on GitHub. The characterization of conflicts and the observed resolutions may not generalize well to other conflicts, other projects, other programming languages, or other hosting platforms (e.g., Bitbucket). Our data collection started from 208 popular open-source Java projects on GitHub, but we only found 79 build conflicts and 33 test conflicts. Although we made our best effort by spending one year exploring all merging scenarios in those repositories, the relatively small set of higher-order conflicts is the best dataset we could create at this point, mainly due to the limitations of existing tool support. Consequently, the relatively low numbers can also limit the generalizability of our empirical findings. In the future, we plan to mitigate this threat by creating and using better tools to reveal higher-order conflicts more effectively.
As with prior work [
18,
29,
50,
58], we relied on the intuitive pattern “one child commit following two parent commits” to recognize merging scenarios in the software version history. However, it is possible that some developers may merge software branches without using the command
git merge or producing the commits matching our pattern. They may simply copy and paste code from branches, manually integrate the edits of distinct branches, and then discard some branches. As such non-typical software merge practices leave no obvious footprint in the software version history, we were unable to identify or study such merging scenarios. Therefore, our empirical findings may not generalize well to the non-typical merging scenarios.
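The commit-pattern heuristic we relied on can be sketched as follows; the toy history and method names below are hypothetical, and a real implementation would read parent links from the repository itself:

```java
import java.util.*;

// Illustrative sketch: a merging scenario is recognized as "one child commit
// following two parent commits", i.e., a commit with exactly two parents.
public class MergeScenarioFinder {
    static List<String> mergeCommits(Map<String, List<String>> parentsOf) {
        List<String> merges = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : parentsOf.entrySet())
            if (e.getValue().size() == 2)   // two parents => candidate merging scenario
                merges.add(e.getKey());
        Collections.sort(merges);
        return merges;
    }

    public static void main(String[] args) {
        // toy history: mainline c0 <- c1, branch c0 <- b1, merged at m
        Map<String, List<String>> history = Map.of(
            "c0", List.of(),
            "c1", List.of("c0"),
            "b1", List.of("c0"),
            "m",  List.of("c1", "b1"));
        System.out.println(mergeCommits(history));  // prints [m]
    }
}
```

Note that, as discussed above, merges performed by copy-pasting code and discarding branches leave no such two-parent commit, so this heuristic cannot find them.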
Threats to Construct Validity. Although we tried our best to manually inspect the sampled textual conflicts and all revealed build/test conflicts, it is possible that our manual analysis is subject to human bias and restricted by our domain knowledge. To alleviate the problem, two authors independently examined each sampled textual conflict, every reported build or test error in the merged software \(A_m\), and the related resolution edits applied by the developers. The two authors actively discussed the instances they disagreed upon until reaching a consensus. The root causes of some observed compilation/test errors were not manually located, mainly because we are not familiar with the project-specific software context. In the future, as we gain more knowledge of merge conflicts and create better tools to reveal the root causes automatically, we can further improve the quality of empirical findings.
Threats to Internal Validity. We adopted compilation and testing to reveal build and test errors and then manually analyzed merging scenarios to locate conflicts responsible for those errors. Both methodologies (i.e., compiler-based and testing-based conflict detection) have various limitations (see Sections
5.3 and
6.3). For instance, the compiler-based method is inapplicable when build conflicts coexist with textual conflicts. This is because the existence of textual conflicts prevents
git merge from creating a merged version
\(A_m\), making it impossible for a build process to tentatively build
\(A_m\) and to detect compilation errors. Similarly, the testing-based method is inapplicable when test conflicts coexist with textual (or build) conflicts. Additionally, the testing-based method only reports test failures or runtime errors; it does not directly pinpoint the conflicting edits responsible for any abnormal program behavior. Therefore, the number of higher-order conflicts reported in our study can be lower than the actual number of conflicts contained in the studied projects. In the future, we would like to overcome this limitation by creating better approaches for conflict detection.
Our current method automatically detects test conflicts only when (1) both l and r compile and pass all tests, and (2) \(A_m\) compiles but fails one or more tests. Given that this method revealed only 33 test conflicts in our study, people may be tempted to relax filtering condition (2) and require \(A_m\) only to partially compile but fail certain tests. We considered using this alternative method to enlarge the dataset of test conflicts, but concluded that it would not work for two reasons. First, we currently rely on Maven/Ant/Gradle to compile and test projects. All these build tools have internal predefined workflows: in a typical workflow, the test phase follows the build phase and only starts when the build phase completes successfully. We cannot manipulate this built-in process to run tests on partially compiled code. Even with the straightforward, standard built-in process, we already spent tremendous time and effort on all studied merging scenarios; it is hard to imagine how much more time and effort we would need to customize the workflows for distinct projects. Second, when partial compilation succeeds and test failures occur, it can be even harder to manually diagnose the root cause of test conflicts, because a test failure may be due either to incomplete compilation or to the merged edits. It would be more time-consuming to triage the root causes of test failures produced by partially compiled code, and we may lack the domain knowledge to always triage those root causes correctly.
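For clarity, filtering conditions (1) and (2) can be paraphrased as a small decision function. This is a simplified sketch of the logic only; the boolean flags stand in for actual Maven/Ant/Gradle build and test outcomes:

```java
// Simplified sketch of our scenario filter: classify a merging scenario from the
// build/test outcomes of the parent branches l, r and the merged version A_m.
public class ScenarioClassifier {
    static String classify(boolean parentsBuildAndTest, boolean mCompiles, boolean mTestsPass) {
        if (!parentsBuildAndTest) return "excluded";       // noise already exists in l or r
        if (!mCompiles)           return "build conflict"; // A_m fails to compile
        if (!mTestsPass)          return "test conflict";  // A_m compiles but fails tests
        return "clean merge";
    }

    public static void main(String[] args) {
        System.out.println(classify(true, true, false));  // prints test conflict
        System.out.println(classify(true, false, false)); // prints build conflict
    }
}
```

Relaxing condition (2) would amount to dropping the `mCompiles` check, which, as discussed above, the predefined build workflows do not support.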
10 Conclusion
Prior empirical studies showed that merge conflicts occur frequently and that conflict resolution is important but challenging. Our study comprehensively examined three kinds of conflicts and their resolutions, and characterized the conflicts that existing tools cannot handle. Unlike prior studies that focus on textual conflicts, our study is broader and deeper for two reasons. First, we examined higher-order conflicts in addition to textual conflicts, as (1) higher-order conflicts are harder to detect and resolve and (2) few tools are available to handle those conflicts. Second, by comparing the design and methodologies of existing approaches against the characteristics of real-world conflicts, we identified the limitations of current approaches and suggested future research to overcome those limitations. Our study intends to (1) explore the gap between real conflicts and current tool support and (2) suggest future research to close the gap. No prior work shares the same goals.
Our study provides multiple insights. First, developers usually resolved textual conflicts by keeping all edits from one branch. Our empirical study characterizes the scenarios where a future tool can accurately predict developers’ decision-making to select one branch over the other. Second, current tools mainly rely on two methods to uncover higher-order conflicts: compilation and testing. However, these tools are limited as (1) neither method is applicable when textual conflicts exist between branches, and (2) both methods require heavy human involvement to locate all conflicts. Our research shows that higher-order conflicts present typical commonalities (e.g., broken def-use links and test-implementation inconsistency), which future conflict detectors can leverage. Third, developers usually resolved higher-order conflicts by applying similar edits to similar code locations. By automating such practices, future tools can resolve higher-order conflicts.