I joined the company in April last year, as part of a team of Data Engineers. The team consisted of 3 interns and 3 Data Engineers collaborating through the version control system Git. We didn’t have much of a structure in place to peer review the code that went into production. We did use pull requests on GitHub at the time, but our git repositories had all the issues that an early-stage startup might have: inessential files (such as .DS_Store) committed to the repositories, hardcoded paths to local files, uninformative commit messages and commits that were not logical units, to name a few.
Since then, we have come a long way. Our team has grown from 6 engineers to 14 engineers. We have made significant improvements to our code review process. We ensure that at least two engineers see every piece of code before it goes into production. Our git commits are much more logical and are always accompanied by more meaningful and detailed commit messages than before. Through all of this, one of the decisions we had to make was to choose the right Git workflow, specifically whether to “merge” pull requests or use “rebase and merge” instead.
In this article, I first discuss the “rebase” workflow that we follow at zeotap in the Data Engineering team. Then, I argue that the “rebase” workflow builds better team dynamics compared to the “merge” workflow. While the “rebase” workflow is trickier to follow, it has numerous benefits. For instance, it forces the developers to make logical commits resulting in better code readability and code ownership. It also fits well with the agile software development strategy by ensuring frequent commits and pull requests.
At zeotap, every developer works on a particular part/feature of a project, building commits in their own branch. Once a logical portion of the feature is complete, the developer raises a pull request on GitHub. Then, at least two peer engineers review this pull request before it can be merged into the master/sprint branch.
While merging the pull request, we consider two choices. The first is to “merge” the pull request, which results in a merge commit: a special kind of commit that merges a developer branch into the master/sprint branch [https://www.atlassian.com/git/tutorials/merging-vs-rebasing]. A sequence of merge commits in a Git repository builds a complex tree of commits leading up to the current version of the code.
The other choice is to “rebase” all the commits in a developer branch on top of the master/sprint branch and then merge the pull request using a fast-forward merge strategy [https://ariya.io/2013/09/fast-forward-git-merge]. This results in a structure where it looks as if all the commits were made on the master/sprint branch directly and the developer branch never really existed. In practice, we actually delete the developer branch once the pull request is merged.
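To make the difference between the two choices concrete, here is a minimal sketch that builds both histories in a throwaway repository. The branch names (feature, merge-result, rebase-result) and file names are made up for illustration; they are not the names we use at zeotap.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Throwaway repository; every branch and file name here is hypothetical.
repo="$(mktemp -d)"
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Demo User"
git config user.email "demo@example.com"

# commit FILE MESSAGE: write a one-line file and commit it.
commit() {
  echo "$2" > "$1"
  git add "$1"
  git commit -q -m "$2"
}

commit base.txt "Initial commit"

# A developer branch with two commits.
git checkout -q -b feature
commit a.txt "Add feature part A"
commit b.txt "Add feature part B"

# Meanwhile, master moves ahead.
git checkout -q master
commit c.txt "Unrelated change on master"

# Choice 1: "merge" records a merge commit with two parents.
git checkout -q -b merge-result master
git merge -q --no-ff --no-edit feature

# Choice 2: rebase a copy of the developer branch, then fast-forward.
git checkout -q -b feature-rebased feature
git rebase -q master
git checkout -q -b rebase-result master
git merge -q --ff-only feature-rebased

echo "History after merge:"
git log --graph --oneline merge-result
echo "History after rebase and fast-forward:"
git log --graph --oneline rebase-result
```

The first graph should show a fork that is joined again by a merge commit, while the second should be a single straight line in which the developer's commits sit on top of the latest master commit.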
Before we compare the two workflows, one rule of the “rebase” workflow that we follow at zeotap is worth stating: we never force-update the master/sprint branch. Developers force-update only their own branches.
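The steps of such a workflow can be sketched roughly as follows. This is an illustration under assumptions, not our exact procedure: a local bare repository stands in for the GitHub remote, a second clone plays the role of a teammate, the branch name feature/report is hypothetical, and step 5 stands in for the merge that would normally be performed from the pull request page after review.

```shell
#!/usr/bin/env bash
set -euo pipefail

work="$(mktemp -d)"

# commit FILE MESSAGE: write a one-line file and commit it.
commit() {
  echo "$2" > "$1"
  git add "$1"
  git commit -q -m "$2"
}

# A bare repository stands in for the GitHub remote (assumption).
git init -q --bare "$work/origin.git"
git -C "$work/origin.git" symbolic-ref HEAD refs/heads/master

git init -q "$work/dev"
cd "$work/dev"
git symbolic-ref HEAD refs/heads/master
git config user.name "Developer"
git config user.email "dev@example.com"
git remote add origin "$work/origin.git"
commit base.txt "Initial commit"
git push -q -u origin master

# 1. Work on a developer branch and push it for review.
git checkout -q -b feature/report
commit report.txt "Add report generator"
git push -q -u origin feature/report

# 2. Meanwhile, a teammate's change lands on master.
git clone -q "$work/origin.git" "$work/teammate"
(
  cd "$work/teammate"
  git config user.name "Teammate"
  git config user.email "teammate@example.com"
  commit other.txt "Teammate change"
  git push -q origin master
)

# 3. Rebase the developer branch on the updated master.
git fetch -q origin
git rebase -q origin/master

# 4. Force-update the developer branch only, never master.
git push -q --force-with-lease origin feature/report

# 5. After review: fast-forward master, then delete the developer branch.
git checkout -q master
git merge -q --ff-only feature/report
git push -q origin master
git branch -q -d feature/report
git push -q origin --delete feature/report

git log --graph --oneline master
```

I use --force-with-lease rather than a plain --force in step 4 because it refuses to overwrite the remote branch if someone else has pushed to it in the meantime.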
Now, let’s compare both the “merge” and “rebase” workflows.
First of all, let me admit that “rebase” is complex, destructive and takes time to get used to. If there are conflicts, they need to be resolved for each commit that is rebased onto the target branch. Once the “rebase” is complete, one has to force push to the remote branch, which overwrites its history and is not a safe practice in general. GitHub does have mechanisms to ensure that the sprint/master branch is never rewritten once commits have landed on it; nevertheless, the process is complex.
On the other hand, “merge” is relatively easy, requires little effort, and git does most of the work for you. It creates a “merge” commit explicitly marking the merge of the developer branch into the sprint/master branch. This can lead to conflicts as well, but those are resolved in a single step for all commits in a branch, unlike with the “rebase” approach.
While “rebase” is hard and destructive, it reinforces the idea of logical commits. By logical commits, I mean that each commit contains one complete logical unit of change, such as a bug fix or a feature (or a self-contained part of one). This has really helped us improve the readability of the commit history in our git repositories. When a developer rebases their branch on the master/sprint branch, the new commits on the developer’s branch are replayed one by one on top of the master/sprint branch. If these commits are not logical, resolving conflicts during this process can be tedious. Logical commits let us see clearly how to resolve each conflict, based on the logical change made in the commit in question. This reinforces the habit of building logical commits while writing code. In the “merge” workflow, by contrast, all conflicts are resolved at once in the “merge” commit, so the need to create logical commits never arises in the same way.
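Here is a small sketch of what resolving a conflict commit by commit looks like. The file names, values and commit messages are invented for the example, and it assumes Git 2.17 or newer, where REBASE_HEAD is available. The rebase stops at exactly the commit that conflicts, so its message tells us which logical change we are resolving.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Throwaway repository; file names and values are hypothetical.
repo="$(mktemp -d)"
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Demo User"
git config user.email "demo@example.com"

# commit FILE CONTENT MESSAGE: overwrite a file and commit it.
commit() {
  printf '%s\n' "$2" > "$1"
  git add "$1"
  git commit -q -m "$3"
}

commit config.txt "timeout=10" "Add default timeout"

# Two logical commits on the developer branch.
git checkout -q -b feature
commit config.txt "timeout=30" "Raise timeout to 30"
commit retry.txt "retries=3" "Add retry setting"

# A conflicting change lands on master.
git checkout -q master
commit config.txt "timeout=20" "Raise timeout to 20"

# The rebase stops at the first commit that conflicts.
# "|| true" keeps the script running despite the non-zero exit status.
git checkout -q feature
git rebase master >/dev/null 2>&1 || true

# REBASE_HEAD names the commit that was being replayed when the rebase stopped.
stopped_at="$(git log -1 --format=%s REBASE_HEAD)"
echo "Rebase stopped at: $stopped_at"

# Resolve the conflict for this one logical change, then continue.
# GIT_EDITOR=true accepts the existing commit message without opening an editor.
printf '%s\n' "timeout=30" > config.txt
git add config.txt
GIT_EDITOR=true git rebase --continue >/dev/null

git log --oneline feature
```

Only the timeout commit needs attention here; the retry commit touches a different file and is replayed without any manual step.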
When git is unable to perform a “merge” due to conflicts in the code, it informs the developer, who has to resolve them by hand. In the “rebase” workflow, this happens while the pull request is being reviewed, whereas in the “merge” workflow, the conflicts are sometimes resolved only while merging the pull request. What we really like about the “rebase” workflow is that it takes away the choice of resolving conflicts later. It becomes the responsibility of the developer to ensure that all the changes are integrated correctly before the pull request is merged. The developer rebases their changes on the master/sprint branch before creating the pull request, rather than relying on Git or the code reviewers to merge the changes in the pull request.
A linear history is a straightforward consequence of using the “rebase” workflow. It gives the repository a linear sequence of commits, which is easy to reason about. From this history, it is easier to understand how the current state of the code came about. It is also much easier to undo commits in a linear history. With “merge” commits, by contrast, a much more complex tree structure of commits is built, which makes it difficult to identify the commits to undo and to revert them.
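As a rough illustration of undoing commits in each kind of history, the sketch below reverts one logical commit from a linear history, and then reverts a merge commit. The file and branch names are hypothetical. Reverting a merge commit requires telling git which parent to treat as the mainline with the -m option; a plain revert is refused.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Throwaway repository; file and branch names are hypothetical.
repo="$(mktemp -d)"
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Demo User"
git config user.email "demo@example.com"

# commit FILE MESSAGE: write a one-line file and commit it.
commit() {
  echo "$2" > "$1"
  git add "$1"
  git commit -q -m "$2"
}

commit base.txt "Initial commit"
commit a.txt "Add exporter"
commit b.txt "Add importer"

# Linear history: undo a single logical change with a plain revert.
git revert --no-edit HEAD~1 >/dev/null

# Merge workflow: a developer branch joined by a merge commit.
git checkout -q -b feature
commit c.txt "Add parser"
commit d.txt "Add parser tests"
git checkout -q master
git merge -q --no-ff --no-edit feature

# A plain revert of the merge commit is refused by git.
if git revert --no-edit HEAD >/dev/null 2>&1; then
  plain_revert="accepted"
else
  plain_revert="refused"
fi
echo "Plain revert of the merge commit: $plain_revert"

# With -m 1, git reverts everything the merge brought in relative to
# its first parent (master as it was before the merge).
git revert --no-edit -m 1 HEAD >/dev/null

git log --graph --oneline master
```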
This is the final advantage that I’d like to discuss in this article. I believe that the “rebase” approach fits better into the world of Agile Software Development. Zeotap’s environment is extremely fast-paced, and therefore requirements and code tend to change very frequently. In such a fast-paced environment, the “rebase” workflow helps us keep up with the code changes made by peer developers. We create and merge pull requests frequently to ensure that our branches do not lag behind the master/sprint branch. Given that “rebase” requires resolving conflicts commit by commit for all the commits in the pull request, the fewer the commits, the easier it becomes to resolve conflicts. It therefore pushes developers to merge changes as frequently as possible to reduce the effort of resolving conflicts. Sometimes, when we do find conflicts, we also discuss them among ourselves to understand the reason behind them. This makes us a much more collaborative team.
The “rebase” and “merge” approaches are both excellent Git workflow strategies that can be adopted at the workplace. I think the major difference is that the “rebase” workflow is destructive and takes some time to get used to, whereas the “merge” workflow is easier to use and doesn’t require training. Despite that, the “rebase” workflow has a slight advantage over “merge” in an agile environment: it requires developers to think more deeply about their commits in order to build logical commits and take ownership of each pull request, and it provides a linear history which is easy to reason about.
We have been using the “rebase” workflow for more than a year now. Whenever developers join the team, we help onboard them with our Git workflow. Once they get used to this workflow, the rest of the benefits simply follow.
Aman Mangal, Software Engineer at zeotap, is one of the many team members who work with a tremendous amount of data on a daily basis. His main tasks include managing the ingestion of different partners’ data into zeotap’s systems as well as monitoring and optimizing the data pipeline for smooth functioning. After completing his computer science studies in both India and the US, Aman had the opportunity to contribute to companies like Nokia Bell Labs, Zerostack and Lumiata before joining zeotap. His areas of expertise include data engineering, distributed systems and Linux containers, amongst others.
For more info on Aman’s profile, visit his LinkedIn page.