Preserving Git Blame History when Refactoring
By Eric, 3 minute read
I have some big source files in a project I'm working on. Embarrassingly big files -- multiple thousands of lines long. Those files also have years of Git blame history associated with them, and not wanting to lose that history has deterred me from refactoring to break those files into more manageable modules. It is wonderful, after all, to turn on blame/annotate in my editor to see what bozo made the last edit (Oh, wait, I did that?) as well as when and why.
If you search for this problem, you'll find a bunch of Stack Overflow answers and blog posts, which I'll divide into two categories:
- Lofty Git expositions that tell you Git just doesn't work that way, implying that you don't really want what you think you want.
- Pragmatic steps you can take to coerce Git to keep the blame history when splitting files apart.
I was partial to category #2, because I was pretty sure I did want to preserve the history, and if the tool needed some arm twisting, so be it.
Unfortunately, the general workflow for category #2 involves a lot of copying files and deleting the parts you don't want in the final file. So if you want to split a file into 5 files, make 5 copies of the original and edit each one to delete the lines you don't want in that copy, with some strategically placed commits along the way. I actually did this once, and it took the better part of an incredibly tedious day to accomplish. This is where programmers naturally think, "There really needs to be a tool to automate this." So I made one.
I based the tool on a couple of blog posts by Raymond Chen, because he's smart and had nice step-by-step instructions with simple examples. I was coding away, having fun with the elusive promise of a side-project that might actually be useful. I had unit tests that followed Raymond's example, and they all passed! It was time to try my nifty tool in the real world... where it immediately failed for one of the primary uses. git-blame showed me, the one splitting the file, as the author of all of the lines in the extracted file, rather than the original authors.
Humbled a bit, I had to go back to category #1 of the posts about how Git works. Funny thing, if you're writing a tool to extend the functionality of some other tool, it is good to understand how the original tool works.
One of those posts led me to this explanation from Linus Torvalds about the difference between content and files, and how Git keeps the information that allows tools (like git blame) to figure out that content was moved from one file to another. This, as opposed to keeping some representation of the "move" operation between files in the repository. Linus had some confidence that this design choice was appropriate:
In other words, I'm right. I'm always right, but sometimes I'm more right than other times. And dammit, when I say "files don't matter", I'm really really Right(tm).
So, pragmatically, what do you do?
- Split the file apart without any special efforts.
- Commit the changes all together in a single commit.
- Use the -C flag when running git blame.
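Those three steps can be sketched as a throwaway shell session. This is only an illustration under assumptions: the file names (big.js, extra.js), the authors (alice, bob), and the commit_as and authors helpers are all made up for the demo, and the repository lives in a temporary directory.

```shell
#!/bin/sh
set -eu

# Throwaway repository so nothing real is touched.
repo=$(mktemp -d)
cd "$repo"
git init -q .

# Hypothetical helper: commit under a given (made-up) author name.
commit_as() {
    who=$1; shift
    GIT_AUTHOR_NAME=$who GIT_AUTHOR_EMAIL=$who@example.com \
    GIT_COMMITTER_NAME=$who GIT_COMMITTER_EMAIL=$who@example.com \
        git commit -q "$@"
}

# alice writes one big file (12 invented lines stand in for thousands).
for i in 1 2 3 4 5 6 7 8 9 10 11 12; do
    echo "function handler_number_$i() { return compute_something_important($i); }"
done > big.js
git add big.js
commit_as alice -m "Add big.js"

# Step 1: bob splits the file with no special tricks.
sed -n '7,12p' big.js > extra.js
sed -n '1,6p' big.js > big.tmp
mv big.tmp big.js

# Step 2: both halves of the split go into a single commit.
git add big.js extra.js
commit_as bob -m "Split big.js into big.js and extra.js"

# Step 3: compare who gets blamed for extra.js with and without -C.
authors() { sed -n 's/^author //p' | sort -u; }
plain_authors=$(git blame --line-porcelain extra.js | authors)
copy_authors=$(git blame -C --line-porcelain extra.js | authors)
echo "without -C: $plain_authors"
echo "with -C:    $copy_authors"
```

On a stock Git installation I'd expect the plain blame to credit bob for every line of extra.js and the -C blame to credit alice, because the lines were removed from big.js in the same commit that added them to extra.js.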
The -C flag, according to the documentation, detects moved or copied lines within a file, and detects lines moved or copied from other files that were modified in the same commit. That is exactly what I was looking for.
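The "modified in the same commit" condition matters. The git blame documentation says that giving -C twice additionally looks for copies from other files in the commit that creates the file. Here is a small sketch of the difference, again with made-up names (big.js, copy.js, alice, bob, and the commit_as and authors helpers), for the case where lines are copied into a new file while the original is left untouched:

```shell
#!/bin/sh
set -eu

# Throwaway repository so nothing real is touched.
repo=$(mktemp -d)
cd "$repo"
git init -q .

# Hypothetical helper: commit under a given (made-up) author name.
commit_as() {
    who=$1; shift
    GIT_AUTHOR_NAME=$who GIT_AUTHOR_EMAIL=$who@example.com \
    GIT_COMMITTER_NAME=$who GIT_COMMITTER_EMAIL=$who@example.com \
        git commit -q "$@"
}

# alice writes the original file.
for i in 1 2 3 4 5 6; do
    echo "function handler_number_$i() { return compute_something_important($i); }"
done > big.js
git add big.js
commit_as alice -m "Add big.js"

# bob copies some lines into a new file but does NOT modify big.js,
# so big.js is not "modified in the same commit".
sed -n '4,6p' big.js > copy.js
git add copy.js
commit_as bob -m "Copy part of big.js into copy.js"

# A single -C only considers files changed in bob's commit;
# -C -C also searches the other files present when copy.js was created.
authors() { sed -n 's/^author //p' | sort -u; }
one_c=$(git blame -C --line-porcelain copy.js | authors)
two_c=$(git blame -C -C --line-porcelain copy.js | authors)
echo "with -C:    $one_c"
echo "with -C -C: $two_c"
```

I'd expect a single -C to blame bob here and -C -C to blame alice. For a real split, where the lines are deleted from the original in the same commit, one -C should be enough.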
Except... I don't usually run git blame in the terminal -- my editor does it. Fortunately, in the GitLens extension for Visual Studio Code, there is an option you can use to supply -C when running blame. Add this to settings.json:
"gitlens.advanced.blame.customArguments": ["-C"]
So now that I can split those monstrous files apart, how big or small should the resulting files be? My rule of thumb: Small enough that it wouldn't be unreasonable to read the entire file if I needed to make a change to it.