Rebasing and merging: some git best practices

By Jonathan Corbet
April 14, 2009

In a typical development cycle, Linus Torvalds pulls patches from over 100 git trees into the mainline repository. While this is going on, it's not unusual for him to complain about how some of those trees are managed; most of the gripes have to do with excessive use of rebasing and merging operations. In a recent discussion on the dri-devel list, Linus clarified his rules somewhat on subsystem tree management. Your editor, on the theory that there might be a developer or two out there who does not read dri-devel, thought that it might be good to expose those rules more widely.

The git "rebase" operation takes a set of patches applied to one tree and reworks them to apply to a different tree. If a developer has written some patches against 2.6.29, he or she can use "git rebase" to turn them into patches against 2.6.30-rc1 instead. With git, rebasing can also be used to make edits to the commit history. If something needs to be fixed in a patch which was made some time ago, the developer can (1) remove the more recent patches from the tree, (2) make the needed changes, and (3) rebase the removed patches back onto the fixed patch. This technique can be used to silently disappear an embarrassing bug from the history, improve patch changelogs, fix a patch conflict against somebody else's tree, and more. It's something that git-based developers simply end up doing occasionally.

There are a couple of problems associated with rebasing, though. One of those is that it changes the commit history. Whenever a series of commits is rebased, anybody who was working with the old history is left out in the cold. If a heavily-used tree is rebased, all developers depending on that tree are forced to scramble to readjust to the new reality. The other problem is that rebased patches are changed patches; any testing that they saw may no longer be applicable. That is why Linus tends to grumble hard at trees which have obviously been rebased just prior to the sending of a pull request. The changes in those trees probably worked before the rebase, but the post-rebase changes have not been tested and may not work as well.

Rebasing is clearly a useful technique, though. Linus does not tell developers not to use it; in fact, he encourages it sometimes. The key rule that was passed down is this: Thou Shalt Not Rebase Trees With History Visible To Others. If a developer has pulled in somebody else's tree, the resulting tree can no longer be rebased, since that would break the second developer's history. Similarly, once a tree has been exported such that others may be making use of it, it can no longer be rebased.

On the other hand, private history can be rebased at will - and it probably should be. If a patch is seen to introduce a bug, it's best to fix it at the source rather than reverting it or adding a second, fixup patch; the result is a cleaner history which is less likely to create problems for people trying to bisect unrelated bugs. Your editor has found that rebasing is often needed to add tags ("Acked-by," for example) to patches which have been circulated for review. When one is creating a set of patches for the mainline kernel, one is really creating an entire history, not just the end result. Making that history clean and readable is to everybody's benefit.

The associated rule that goes with this, though, is that trees which are subject to rebasing should not be exposed to the world:

This means: if you're still in the "git rebase" phase, you don't push it out. If it's not ready, you send patches around, or use private git trees (just as a "patch series replacement") that you don't tell the public at large about.

So, in other words, trees which might be rebased should be kept private. They should also not have other developers' trees pulled into them.

It's worth noting that Linus very much practices what he preaches on this front. The mainline git repository accepts 10,000 or so changesets every development cycle, but it is never rebased. And that is a good thing: rebasing the mainline would cause massive pain throughout the development community.

Merging is the other place where subsystem maintainers can run afoul of the Chief Penguin. A "merge" in git is similar to a merge in most other source code management systems; it joins two (or more) independent lines of development into the current branch. Git merges differ, though, in that they can have more than two incoming branches; Ingo Molnar is famous for his use of "octopus merges" joining vast numbers of branches in a single operation. In almost all cases, performing a merge adds a special commit to the repository indicating that the merge has been done and noting which files, if any, had conflicts.

Merges go both ways. When Linus pulls a subsystem tree into the mainline, the result is a merge. But it is also common for developers to perform merges in the other direction; they will pull the mainline (or some higher-level subsystem tree) into a branch containing a local line of development. It is natural to want to develop code against the current state of the art; it gives confidence that the end result will work with everybody else's changes and minimizes the chances of an ugly merge conflict later on.

But excessive pulling from the mainline (as evidenced by the merge commits which result) tends to irritate Linus. As he put it:

But if I see a lot of "Merge branch 'linus'" in your logs, I'm not going to pull from you, because your tree has obviously had random crap in it that shouldn't be there. You also lose a lot of testability, since now all your tests are going to be about all my random code.

As anybody who has worked with tip-of-the-repository kernels knows, the state of the mainline at any random point can be, well, random. So frequent pulling of the mainline into a development branch will add a certain amount of randomness to that branch; this randomness is not particularly helpful for somebody who is trying to get a feature working. It also increases the chances that another developer who ends up in the middle of the series while running a bisect operation will encounter unrelated bugs. So Linus would rather that developers not pull down from upstream trees:

And, in fact, preferably you don't pull my tree at ALL, since nothing in my tree should be relevant to the development work _you_ do. Sometimes you have to (in order to solve some particularly nasty dependency issue), but it should be a very rare and special thing, and you should think very hard about it.

The reality of the situation tends not to be so strict, though. An occasional merge to stay on top of what's happening elsewhere can make sense. What Linus suggests, though, is that the merges happen at specific release points. So pulling the tip of the mainline tree into a development tree probably does not make sense, but there might be an argument for pulling in 2.6.29 or 2.6.30-rc1. Doing things this way allows development to be based on a (hopefully) relatively stable point where the issues are known.

The temptation to merge in the mainline during development can be hard to resist; one likes to know whether one's work is even remotely relevant to the current state of the code. Fortunately, git makes it really easy to create throwaway branches and test out merges and integration there. Once it's clear that things work, the test branch can be deleted and the (unmerged) development branch sent upstream.

Similar rules apply to the merging of downstream code. The receiving repository should be in a reasonably well defined and stable state; typically developers maintain a "for upstream" branch for this kind of merge. And the downstream code should be "ready": it should be at some sort of release point and not in a random state of development.

Of course, these rules are not absolute:

Git does allow people to do many different things, and solve problems different ways. I just want all the regular workflows to be "good practice", but then if you have to occasionally break the rules to solve some odd problem, go ahead and break the rules (and tell people why you did it that way this time).

Linus first started playing with BitKeeper in February, 2002, so the kernel community now has seven years worth of experience with distributed version control. But the truth of the matter is that we are still figuring out the best way to use this particular tool. This is a process which is likely to continue for some time yet. As other large projects move toward using tools like git, they may want to look hard at the processes and rules which have been developed in the kernel community; they might just be able to shorten their own learning experience.

Index entries for this article
Kernel	Development tools/Git
Kernel	Git

Rebasing and merging: some git best practices

Posted Apr 16, 2009 11:25 UTC (Thu) by Frej (guest, #4165) [Link]

Simple and easy...