How to work in a team: version control and git

Differences between academia and industry

If you are an engineering student you are probably used to programming courses that focus on solving relatively small problems, often (but not always) in isolation. Perform a binary search, traverse a graph, implement stacks and queues, maybe handle a bit of concurrency, maybe a bit of functional programming. All very interesting and challenging, don't get me wrong, but not nearly enough for what's out there.

You see, in academia we rely on a lot of assumptions. A usual group project is based on the following assumptions:

  • There will only be one user at a time.
  • The code will only be seen by the group members and trashed (be forgotten) after the project is done.
  • The project will run only when the instructor has to evaluate it or for testing, probably on the machine of the team member who has a MacBook (my department was GNU/Linux-only).
  • Continuous integration means "continuously integrate new code, compile and test until things work".
  • Testing means "make sure it doesn't throw a segmentation fault when you do that thing that caused a segmentation fault last time".
  • Versioning (the topic of this article) means "remember to start Dropbox when your PC boots or we are screwed".

Unfortunately this is very true for research as well. I know many graduate students and most of them don't know what versioning is, how to write tests or even how to deploy an application so that actual people can finally use it. And whenever somebody says stuff like "but I just need to get my results and get my paper published, why would I care about all these things?" I get mad, oh I get very mad.

There's a contradiction here: the industry is usually known for secretive development, competition and profit-oriented decisions, while academia is usually known for open development, collaboration and decisions based on improving the state of the art, with no regard for the cost. And yet, how many research projects are actually published on a public repository? How many have demos deployed somewhere that can be accessed by interested parties or other researchers who want to improve on that?

Learning how to share your code with the world also teaches you how to write good, maintainable code. It's a win-win. Learning how to deploy your app, no matter how simple, teaches you how to advertise your accomplishments. If that's not a good skill to learn I don't know what is.

But I digress.

Versioning

If you work on a project you'll likely be part of a team. If you are not, the project is just not big enough yet. Or maybe you are thinking of open sourcing your project. Regardless, learning how to use versioning tools is fundamental if you want to call yourself a "software developer". Most versioning tools also allow you to trigger events (e.g. using "webhooks"). Imagine updating the code and having your tests (because you do write tests don't you?!?), deployment scripts, sanity checks and whatnot run automatically. Kinda like This Too Shall Pass by OK Go.

But what is versioning? The name says it all: it's a way of keeping versions of files stored in a repository. In our case these files happen to be text representing the source code of a program, but a VCS (Version Control System) can be used for all sorts of files. That is, imagine that a set of files and directories represents a state in time. Without a VCS you end up overwriting those files and, therefore, rewriting the state. With a VCS, you keep the states as a stack. At any point in time you can retrieve an older state or add a new one on top of the stack.

But what if multiple developers work on the same project? Sharing the same stack might yield unexpected results. Imagine Alice and Bob start working on the same state, the one on top of the stack. Now imagine they both change the same set of files but in different ways, for example they change line 48 of utils.py but Alice writes num = 10 while Bob writes num = 5. Now Alice and Bob want to add the new state on top of the stack. There are 3 courses of action:

  1. The latest state overwrites the previous one.
  2. We merge changes such that both lines are kept but the order is picked according to who submitted the new state first.
  3. We warn the developer who submits last that his/her changes cause a conflict with the current state.

The second option is downright silly. Imagine having tens of lines where the same variable is assigned different values. Chaos!

The first option might be viable but, you know, it's not respectful towards other developers to simply overwrite their changes and, most of the time, it's the cause of many errors. What happens if Charlie starts working on the state submitted by Bob? Wrong assumptions!

The third option is the one normally picked in modern VCS. This way the developers can decide whether it is Alice assigning the wrong value to num or Bob. Once the conflict is resolved things continue as usual.

But wouldn't it be nice if every developer could work on a separate stack and, only once a feature is complete or a bug is fixed, "merge" that stack with the main one? This technique is called "branching". Because of its popularity and relative ease of use we'll take a look at the VCS git.


The tree of life

When I had to learn git I started reading lots of articles and asked friends of mine who were more experienced in versioning for help. They all had the worst approach: they started teaching me how to use the git CLI. There are two groups of developers: those who do git add ., git commit, git push and start praying that their push doesn't get rejected and those who actually know what they are doing. Until a couple of years ago I was in the first group. Now that I'm part of the second, I invite you to join me. We have cookies!

The first, foremost and fundamental lesson about git is the following: a git repository is a tree.

...

Absorb that, internalize it, make it your mantra for the rest of your life or career: a git repository is a tree. This sentence is incredibly simple and, at the same time, incredibly powerful. Following Elon Musk's beloved first principles every time you have a problem with git you'll resort to this core concept, find what operations you want to perform on the tree and, only then, find the commands you need.

Now it's time to define what a repository is and see how to create one. As per the description on their website "Git is a free and open source distributed version control system". The keyword here is distributed. In practice, there usually exists a central repository (usually hosted on services like GitLab or GitHub) where all changes are "pushed" and from where developers "pull" to keep their own repositories up to date. As you can guess each developer has a subset or a one-to-one copy of the main repository and it's their responsibility to keep them in sync to avoid inconsistencies.

We will now create a local repository, perform some basic operations, create a remote repository and push our changes there.

Local repository

Depending on your system you might have to install git. Refer to this guide on how to do that. I can wait.

Protip: to master git you can read "Pro Git", it's the official book, available for free.

Now that you have git installed create a new directory (I'll call it git_test) and cd into it. Now let's create a new local repository:

git init

Git will create a hidden directory .git containing all the metadata about the repository. A very important file is .git/config which, unsurprisingly, contains the configuration variables of your repository. We'll visit Mr. config again, let's move on.
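
If you want to peek at what git init just created, here is a minimal sketch. It runs in a throwaway directory made with mktemp (my assumption, so it doesn't touch git_test) and only uses ls and git config to look around:

```shell
set -eu

# Throwaway directory so the experiment doesn't touch a real project.
repo=$(mktemp -d)
cd "$repo"

git init -q                  # creates the hidden .git directory

ls -a                        # .git shows up next to . and ..
git config --local --list    # prints the variables stored in .git/config
```

The --local flag restricts the listing to this repository's own .git/config, leaving out system-wide and user-wide settings.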

Let's add a file to our repository:

echo "line one" > test.txt

Adding a file does not trigger anything in the git domain. It's very important to understand one thing: git doesn't give a damn about your files. It only cares about your commits. As we'll see now a commit represents a state on the stack we were talking about earlier.

Let's try to create a commit:

git commit

You will likely get a message along the lines of nothing added to commit but untracked files present. That's because you created the file but you haven't told git to add that file to the new state. In fact try to run:

git status

Protip: if the output of git is not colored (and, trust me, you'll want colors while using git) you can run git config --global color.ui auto. If the output is still not colored you might want to follow this guide.

You'll see (similarly to the previous output) Untracked files and a list of the files that git is not tracking. Git will also likely suggest running git add <file> so let's do that:

git add test.txt

If you run git status again you'll see new file: test.txt (in green if you followed the protip). This means git is now keeping track of that file. But what if we have multiple directories and files and we want to add all of them at once? Git allows us to add directories recursively. Let's try it:

mkdir mydir

echo "hello" > mydir/newfile.txt

git add .

Remember, . in *nix represents the current directory so we are adding everything that is in it. If we run git status again we'll see that we now have two entries: mydir/newfile.txt and test.txt. We are happy with the state, let's commit!

git commit

This will open a text editor (usually vim or nano if you are on Mac or GNU/Linux). The content will be pre-filled by git with some metadata and an explanation of what to write for your commit message. Commit messages are extremely important, they help keep track of what happened when and, more importantly, why. Writing good commit messages is an art. For now let's focus on two important factors: you should always describe what you did at a high level and never write a commit message for the sake of it.

Examples of bad commit messages:

  • "Wrote some code" (...what?)
  • "Fixed something" (fixed what?)
  • "Added a new feature" (what feature?)
  • "Changed mymodule.py and added a new directory" (not high level)


Examples of good commits:

  • "Scaffolding for project XYZ"
  • "Wrote module for communicating with API ABC"
  • "Refactored module for API ABC"

If you exit the editor without writing a message the commit will be aborted. If you do write a message and exit, the commit will be created in the local repository. Once a commit is created it represents the latest state of that repository. If you go through and run git status you'll see something like:

On branch master  
nothing to commit, working directory clean  
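
The steps so far can be replayed in one go. This is a sketch under a few assumptions of mine: it runs in a temporary directory, it sets a placeholder identity so the commit doesn't fail on a fresh machine, it forces the first branch to be called master, and it passes the message with -m so no editor opens. git status --porcelain is the script-friendly form of git status.

```shell
set -eu

repo=$(mktemp -d)                         # throwaway directory (assumption)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master   # name the first branch master
git config user.name "Example Author"     # placeholder identity
git config user.email "author@example.com"

echo "line one" > test.txt
git status --porcelain    # "?? test.txt": the file exists but is untracked

mkdir mydir
echo "hello" > mydir/newfile.txt
git add .                 # stage everything under the current directory
git status --porcelain    # both files are now listed with an "A" (added)

git commit -q -m "Scaffolding for project XYZ"   # -m skips the editor
git status --porcelain    # prints nothing: the state is committed
git log --oneline         # a single commit
```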

Branching

To be precise, the last commit represents the latest state of the "branch" you are on. Remember we talked about stacks and branching? The branch is just a stack. You can create multiple branches as copies of existing branches (hence the mantra "a git repository is a tree") with the main branch, the "root", called master. We will now create a new branch called new_branch starting from master:

git branch new_branch

This command will create a new branch but we will still be on the master branch. The way git keeps track of what branch we are on is by using a cursor called HEAD. This is the current situation (where A is the first commit you created):

new_branch  -- A

master      -- A <- (HEAD)  

To switch to new_branch we run:

git checkout new_branch

The situation is now:

new_branch  -- A <- (HEAD)

master      -- A  

As usual we can run git status to check what's going on. In this case you'll see On branch new_branch. If you just want to know what branch you are on git branch is enough. Now let's add a commit to our new branch.

echo "line two" >> test.txt

git add .

git commit

Now let's check that our commit is on new_branch but not on master:

git log

This should show a total of two commits. If we check out master and do the same:

git checkout master

git log (shows all the commit history as a list)

We should see only one commit. What's going on here? By committing to new_branch we created the following situation (where B is the new commit):

new_branch  -- A -- B

master      -- A <- (HEAD)  

Since we are back on master HEAD is pointing to A and that is why we only see one commit in the history. With git log we explore the history of the current branch. Whenever you run git checkout ${BRANCH} you are simply moving HEAD to the last commit of the specified branch. Guess what? The names of the branches are simply shortcuts for the last commits of those branches! Gitception!
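
To convince yourself that branch names are just pointers to commits, here is a small sketch that rebuilds the A/B situation in a temporary directory and asks git which commit each name resolves to. The placeholder identity and the -m messages are my assumptions; git rev-parse is the command that turns a name into a commit ID.

```shell
set -eu

repo=$(mktemp -d)                         # throwaway directory (assumption)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master   # name the first branch master
git config user.name "Example Author"     # placeholder identity
git config user.email "author@example.com"

echo "line one" > test.txt
git add test.txt
git commit -q -m "Commit 1"               # commit A

git branch new_branch                     # a new name for A; HEAD stays on master
git checkout -q new_branch
echo "line two" >> test.txt
git commit -q -a -m "Commit 2"            # commit B, only on new_branch

git checkout -q master                    # HEAD moves back to A

git rev-parse master          # ID of A
git rev-parse new_branch      # ID of B
git rev-parse HEAD            # same ID as master

git log --oneline master      # one commit
git log --oneline new_branch  # two commits
```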

Refer to a commit

This is all very cool but, if you think about it, we still haven't used git for actual versioning. What if we decide the current state has some problems and we want to roll back to a previous state? As you may have noticed every time we checkout a branch (assuming there aren't untracked files) the content of our working directory changes to reflect the latest state of that branch (i.e. the commit HEAD is pointing to). But, as we saw earlier, checking out a branch simply means moving HEAD to the last commit of that branch. And if the latest commit of a branch is identified by the branch's name how are the other commits identified? Run git log and take a look at the list. You will see something like commit d1137cb6824c1893eb6e3c77549d31082c6a3cea. That alphanumerical string is a hexadecimal ID that uniquely identifies that commit across all branches. Therefore the name of the branch is just an alias for the ID of the latest commit, no more no less. If a commit is present in multiple branches (e.g. the initial commit A will be in both master and new_branch) it means we can move HEAD to that commit, without any issue, as long as we are on one of those branches.

Let's try to roll back to A (identified by another hex string):

git checkout new_branch

git log (copy the ID of the first commit)

git checkout ${ID_OF_FIRST_COMMIT}

You will see something like You are in 'detached HEAD' state. This means you are in a situation like this:

new_branch  -- A -- B  
               ^------(HEAD)
master      -- A  

HEAD is now pointing to a commit that is not the latest in your branch. In this state you can change and commit as much as you want without touching the original branch (new_branch in our case). But this is not what we want. We want to replace the last commit of the branch with the previous one! There are two ways of doing this:

  • git reset --hard ${ID_OF_FIRST_COMMIT}
  • git revert ${ID_OF_LAST_COMMIT}

The first method simply erases the last commit from the history of our branch. This is dangerous. Whenever you change the history of a branch (not just the last state) you must always act carefully. The --hard option means that, no matter what's in our working directory, it will be replaced with the state of the commit we are resetting to.
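
Here is a sketch of the detached HEAD detour followed by the hard reset, replayed in a temporary directory. The placeholder identity and the variable name first are assumptions of mine. git symbolic-ref -q HEAD succeeds only when HEAD points to a branch, so it makes a handy "am I detached?" check.

```shell
set -eu

repo=$(mktemp -d)                         # throwaway directory (assumption)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master   # name the first branch master
git config user.name "Example Author"     # placeholder identity
git config user.email "author@example.com"

echo "line one" > test.txt
git add test.txt
git commit -q -m "Commit 1"
first=$(git rev-parse HEAD)               # remember the ID of commit A

git checkout -q -b new_branch             # create new_branch and switch to it
echo "line two" >> test.txt
git commit -q -a -m "Commit 2"            # commit B

# Checking out an ID instead of a branch name detaches HEAD.
git checkout -q "$first"
git symbolic-ref -q HEAD || echo "HEAD is detached"

# Go back to the branch and drop B from its history.
git checkout -q new_branch
git reset -q --hard "$first"

git log --oneline     # only Commit 1 is left
cat test.txt          # line one
```

After the reset B is no longer reachable from new_branch, which is exactly why this method deserves care on branches you share with others.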

Protip: you can also omit --hard (git then falls back to its default, --mixed) or explicitly specify --soft. In both cases you reset to a commit without touching the files in the working directory. We won't cover these cases here but there are numerous use cases for them.
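
A quick sketch of what the gentler resets do, again in a temporary directory with a placeholder identity (my assumptions). The two-letter codes come from git status --porcelain: an M in the first column means "modified and staged", an M in the second column means "modified but not staged".

```shell
set -eu

repo=$(mktemp -d)                         # throwaway directory (assumption)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master   # name the first branch master
git config user.name "Example Author"     # placeholder identity
git config user.email "author@example.com"

echo "line one" > test.txt
git add test.txt
git commit -q -m "Commit 1"
echo "line two" >> test.txt
git commit -q -a -m "Commit 2"

# --soft: move the branch back one commit, keep the change staged.
git reset -q --soft HEAD~1
soft_state=$(git status --porcelain)

# No flag means --mixed: the change is unstaged but the file is untouched.
git reset -q
mixed_state=$(git status --porcelain)

echo "after --soft:  [$soft_state]"    # [M  test.txt]
echo "after --mixed: [$mixed_state]"   # [ M test.txt]
cat test.txt                           # still both lines
```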

The second method keeps the history intact but adds a new commit that simply reverts the changes applied by the specified commit. With the last commit we added the line "line two" to our test.txt file. If you are still in the detached HEAD state, go back to the branch first with git checkout new_branch. Then let's try reverting that change:

git revert ${ID_OF_SECOND_COMMIT} (be careful! Do not revert the first commit!)

Again you'll be given the chance to change the commit message. I highly suggest you leave the default text and simply add more information on, perhaps, why you reverted that commit.

Now if we run git log we'll see something like this:

commit 9f1fa5084e83efcbe1fe946fde28a31ad60f9b20  
[...]
    Revert "Commit 2"

    This reverts commit b5712bd073bd0035326ff92bb23a3206ebe6e5b3.

commit b5712bd073bd0035326ff92bb23a3206ebe6e5b3  
[...]
    Commit 2

commit d1137cb6824c1893eb6e3c77549d31082c6a3cea  
[...]
    Commit 1

And if you look into test.txt you'll see "line two" is not there anymore. Great! But now new_branch highly differs from master. We want to "merge" the changes into master to keep our branch synchronized.
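
The revert can be replayed without an editor too. This sketch assumes a temporary directory, a placeholder identity and the commit messages "Commit 1" and "Commit 2" from the log above. --no-edit tells git revert to keep the default message.

```shell
set -eu

repo=$(mktemp -d)                         # throwaway directory (assumption)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master   # name the first branch master
git config user.name "Example Author"     # placeholder identity
git config user.email "author@example.com"

echo "line one" > test.txt
git add test.txt
git commit -q -m "Commit 1"
echo "line two" >> test.txt
git commit -q -a -m "Commit 2"

# Undo the last commit by adding a new commit on top of it.
git revert --no-edit HEAD > /dev/null

git log --format=%s   # Revert "Commit 2", Commit 2, Commit 1
cat test.txt          # line one
```

Nothing was erased: the history grew by one commit, which is why revert is the safer choice on branches other people already pulled.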

Merging and rebasing

There are two techniques in git to merge changes into a single branch:

Merging

Merging, as the word says, simply "merges" the commits of the two branches into the new one. To understand what happens let's imagine we have the following situation:

new_branch  -- A -- B -- C

master      -- A -- D -- E <- (HEAD)  

If we merge new_branch into master:

git merge new_branch

The situation changes into:

new_branch  -- A -- B -- C

master      -- A -- B -- C -- D -- E <- (HEAD)  

Unless the merge can be done by simply moving the branch forward (a "fast-forward"), git will automatically create a merge commit with a message along the lines of Merge branch 'X' into Y.

Because git log lists the commits in time order it's like shuffling a deck of cards. As you can see B and C show up before D in master since that's their order in time. (Strictly speaking the diagram is what the history looks like in git log: under the hood git keeps both lines of commits and ties them together with the merge commit.) But most of the time we don't actually care when a commit was created but, rather, when it was merged into a branch. We can either remember to merge master back into new_branch every time we start to work on something new, to keep the timeline in the correct order (and fill our history with merge commits) or we can use rebasing.
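
Here is a sketch of that merge in a temporary directory. The helper commit_file is hypothetical (it just creates one file per commit so nothing conflicts) and the identity is a placeholder. Under the hood git records the merge as a commit with two parents, one per branch, and the last commands make that visible.

```shell
set -eu

repo=$(mktemp -d)                         # throwaway directory (assumption)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master   # name the first branch master
git config user.name "Example Author"     # placeholder identity
git config user.email "author@example.com"

# Hypothetical helper: one new file per commit, named after the commit.
commit_file() {
  echo "$1" > "$1.txt"
  git add "$1.txt"
  git commit -q -m "$1"
}

commit_file A
git checkout -q -b new_branch
commit_file B
commit_file C
git checkout -q master
commit_file D
commit_file E

git merge -q --no-edit new_branch   # creates the merge commit on master

git log --graph --oneline           # both lines of history plus the merge
git rev-list --parents -n 1 HEAD    # merge ID followed by its two parent IDs
```

The first parent (HEAD^1) is the commit master was on before the merge, the second (HEAD^2) is the tip of new_branch.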

Rebasing

Rebasing allows us to always append commits instead of interleaving them. This is especially useful if we want to keep a linear history without changing the order of commits branch-wise. Let's start with this situation:

new_branch  -- A -- B -- D

master      -- A -- C <- (HEAD)  

When merging into master we would, rightfully, expect B to go before C and D after. But before merging we can rebase new_branch on top of master:

git checkout new_branch

git rebase master

Now the situation has changed:

new_branch  -- A -- C -- B -- D

master      -- A -- C <- (HEAD)  

That's perfect! If we now merge new_branch into master:

git checkout master

git merge new_branch

The situation will be:

new_branch  -- A -- C -- B -- D

master      -- A -- C -- B -- D <- (HEAD)  

It makes sense. B was created before C but on another branch! And it was merged into master just now so, branch-wise, B was "created" in master at the time of merging, not before! If all of this is a bit confusing I suggest you draw the states on a piece of paper yourself and follow along. Once it clicks you can come back. I can wait.
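
If paper is not at hand, this sketch replays the rebase scenario in a temporary directory. As before, commit_file is a hypothetical helper that creates one file per commit, and the identity is a placeholder. One detail the diagrams gloss over: the rebased B and D are new commits with new IDs, they only carry the same changes and messages.

```shell
set -eu

repo=$(mktemp -d)                         # throwaway directory (assumption)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master   # name the first branch master
git config user.name "Example Author"     # placeholder identity
git config user.email "author@example.com"

# Hypothetical helper: one new file per commit, named after the commit.
commit_file() {
  echo "$1" > "$1.txt"
  git add "$1.txt"
  git commit -q -m "$1"
}

commit_file A
git checkout -q -b new_branch
commit_file B
git checkout -q master
commit_file C
git checkout -q new_branch
commit_file D

git rebase -q master         # replay B and D on top of C

git checkout -q master
git merge -q new_branch      # fast-forward: no merge commit needed

git log --reverse --format=%s   # A, C, B, D (oldest first)
```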


Conflicts

You might believe it takes at least two people to generate a conflict. You'd be wrong. Let's create two new branches from master:

git checkout master

git branch conflict1

git branch conflict2

Let's checkout conflict1 and add a new line to our test.txt file:

git checkout conflict1

echo "conflict1" >> test.txt

And commit. Now do the same for conflict2 but let's change the line:

git checkout conflict2

echo "conflict2" >> test.txt

And commit again. Here's the current situation:

conflict1   -- A -- B

master      -- A

conflict2   -- A -- C  

If we merge conflict1 into master (no rebasing necessary since there's only one commit of difference) we get the following:

conflict1   -- A -- B

master      -- A -- B

conflict2   -- A -- C  

If we merge conflict2 into master we get...A conflict!

conflict1   -- A -- B

master      -- A -- B -x- C   <CONFLICT!!!>

conflict2   -- A -- C  

What happened is that the commit C has been created on master but not yet attached to B. Git will warn about the conflict with a message like CONFLICT (content): Merge conflict in test.txt. To resolve the conflict we have two options:

  • Manually edit the files until all conflicts are resolved.
  • Use git mergetool.

The first option is ok for trivial conflicts but you should quickly learn how to use the second one. If we run git mergetool git will start iterating over all the files that contain conflicting lines and offer us the chance to apply the due modifications. At some point you'll see something like Hit return to start merge resolution tool (${MERGE_TOOL_NAME}): (on Mac the default tool is opendiff). Because I don't want this article to be OS-specific I will just let you figure out how your merging tool works. It's usually fairly intuitive (unless you get a vim window and you can't use vim, in that case go look for another merging tool).

For each conflict, no matter the tool, we'll be given the following options:

  • Keep the changes in the original branch.
  • Overwrite the changes with the new branch.
  • Keep both changes (but give more priority to the original branch).
  • Keep both changes (but give more priority to the new branch).
  • Delete the conflicting lines (almost never what you want).

Once the conflicts are resolved we'll have the following situation:

conflict1   -- A -- B

master      -- A -- B -- C -- M

conflict2   -- A -- C  

Where M is the merge commit created by git which will also include which files were conflicting.
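
The whole conflict can be reproduced and resolved by hand with the sketch below. It runs in a temporary directory with a placeholder identity, and it resolves the conflict by keeping both lines, which is just one of the possible choices (all assumptions of mine). git diff --name-only --diff-filter=U lists the files that are still unmerged.

```shell
set -eu

repo=$(mktemp -d)                         # throwaway directory (assumption)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master   # name the first branch master
git config user.name "Example Author"     # placeholder identity
git config user.email "author@example.com"

echo "line one" > test.txt
git add test.txt
git commit -q -m "A"
git branch conflict1
git branch conflict2

git checkout -q conflict1
echo "conflict1" >> test.txt
git commit -q -a -m "B"

git checkout -q conflict2
echo "conflict2" >> test.txt
git commit -q -a -m "C"

git checkout -q master
git merge -q conflict1                    # fast-forward, no conflict

# The second merge stops with a non-zero exit code.
if ! git merge --no-edit conflict2 > /dev/null; then
  echo "conflicting files:"
  git diff --name-only --diff-filter=U    # test.txt
  cat test.txt                            # shows the conflict markers
fi

# Manual resolution: keep both lines, then mark the file as resolved.
printf 'line one\nconflict1\nconflict2\n' > test.txt
git add test.txt
git commit -q --no-edit                   # concludes the merge (commit M)

git log --graph --oneline
```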

Fixing authorship

Before proceeding it is good practice to set the authorship of the commits. By default the commits' author will be your OS' user. But if you want to contribute to a project you'll likely want to set the username and email to values that reflect your actual identity so forget committing as lollipop99.

git config --global user.name "${YOUR_NAME}"

git config --global user.email "${YOUR_EMAIL}"

These two commands will suffice to set your identity straight across all repositories. If you wish to keep separate identities for different repositories simply omit the --global option.

Old commits' authorship will not change! Figuring out how to do that is left as a google sear...aehm...an exercise for the reader.
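
To see that old commits really keep their author, here is a small sketch. It uses per-repository settings (no --global) inside a temporary directory so your real configuration is left alone, and both identities are made up for the example.

```shell
set -eu

repo=$(mktemp -d)                 # throwaway directory (assumption)
cd "$repo"
git init -q

# First commit under the embarrassing identity.
git config user.name "lollipop99"
git config user.email "lollipop99@example.com"
echo "line one" > test.txt
git add test.txt
git commit -q -m "Commit 1"

# Change the identity (for this repository only), then commit again.
git config user.name "Ada Example"
git config user.email "ada@example.com"
echo "line two" >> test.txt
git commit -q -a -m "Commit 2"

git log --reverse --format='%an <%ae>'   # oldest first: lollipop99, then Ada Example
```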

Pulling and pushing

If you followed along so far and understood how the mechanics of git work, pushing and pulling are nothing new. Remember: we only created a local repository. Now we want to make this repository available to the public (or our team members). To do so we need to create a remote repository that can be accessed from anywhere in the world. The leading service for creating public git repositories is, hands down, GitHub. Go and create an account. I can wait.

Now create a new repository and fill in the required information (keep it simple, it's just for testing). I can wait.

Great! Now that your remote repository is ready it's time to fill it with what we have locally. The first thing we need to do is set a remote repository for the local one. GitHub will likely give you two options once the repository is ready: create a new local repository or push an existing repository. To proceed with the second choice we'll be given two commands:

git remote add origin ${URL_TO_GITHUB_REPOSITORY}

git push origin master (assuming you are working on master)

A repository can have multiple remote repositories; origin is simply the conventional name for the default one. The URL to a GitHub repository can follow two schemas: git@github.com:${USER}/${REPOSITORY_NAME} and https://github.com/${USER}/${REPOSITORY_NAME}. The most important difference is that the first one lets you push over SSH and, therefore, authenticate using your key pair, while the second one goes over HTTPS and asks for your credentials every time (unless you set up a credential helper). The choice is up to you: if you don't know what "authenticate using your key pair" means it won't make a big difference so just proceed with the default, whatever it is.

And now the grand finale: push means to submit all of the commits on the specified branch to the remote repository. pull means to retrieve all of the commits from the specified branch in the remote repository. That's it. If you treat a remote branch as "yet another branch" push and pull are just merges! You are either merging from the local branch to the remote one or vice versa. Let's visualize this:

master        A -- B

origin/master (empty)

When we do git push origin master it becomes:

master        A -- B

origin/master A -- B

Now imagine somebody else pushes two more commits to origin/master (C and D):

master        A -- B

origin/master A -- B -- C -- D

To sync your local repository you'll have to:

git pull origin master

And the result will be:

master        A -- B -- C -- D

origin/master A -- B -- C -- D

All the rules we saw still apply (yes, including conflicts!).
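
You can rehearse the whole push/pull dance without a GitHub account. In this sketch a bare repository on the local disk stands in for the remote (that, the directory names alice and bob, and the identities are all assumptions of mine). With a real remote only the URL passed to git remote add would change.

```shell
set -eu

base=$(mktemp -d)                   # throwaway directory (assumption)
remote="$base/remote.git"           # local stand-in for the GitHub URL
git init -q --bare "$remote"

# Our repository: two commits, then the first push.
git init -q "$base/alice"
cd "$base/alice"
git symbolic-ref HEAD refs/heads/master   # name the first branch master
git config user.name "Alice"
git config user.email "alice@example.com"
echo "line one" > test.txt
git add test.txt
git commit -q -m "A"
echo "line two" >> test.txt
git commit -q -a -m "B"
git remote add origin "$remote"
git push -q origin master

# Somebody else clones, adds C and D, and pushes them.
git clone -q -b master "$remote" "$base/bob"
cd "$base/bob"
git config user.name "Bob"
git config user.email "bob@example.com"
echo "line three" >> test.txt
git commit -q -a -m "C"
echo "line four" >> test.txt
git commit -q -a -m "D"
git push -q origin master

# Back in our repository: sync with the remote.
cd "$base/alice"
git pull -q origin master
git log --reverse --format=%s       # A, B, C, D (oldest first)
```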

Always remember: git works in terms of commits, not files! Now get out there and start convincing your friends to use git!

Frequently Answered Complaints (F.A.C.)

  • I hate copy/pasting the commit's ID every time. Is there a shortcut?

    • Yes. Remember HEAD is pointing to the latest commit of the current branch? Well, HEAD~1 will point to the previous commit, HEAD~2 to the one before that and so on. Using ~ will always follow the first parent when a commit has more than one (e.g. a merge commit). You can use ^ to explicitly select a parent. More info is in this awesome Stack Overflow thread.
  • I want to reset to a previous commit without touching the files.

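    • git reset --soft ${COMMIT_ID}. This moves the branch back and keeps your changes staged. If you omit the flag git uses --mixed (the actual default), which also leaves the files alone but unstages the changes.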
  • I made a mess, haven't committed my changes yet and want to rollback to the last working state.

    • Run git reset --hard without any other argument. The --hard flag makes sure to restore the tracked files to their last commit's state. Untracked files will not be touched. So if you run git status you will still see them untracked and what you do with them only depends on your purpose.
  • I want to hard reset to a specific commit.

    • git reset --hard ${COMMIT_ID}. The default for COMMIT_ID is HEAD, the current commit (if you are not in detached mode), hence the tip above.
  • I want to clean all the untracked files.

    • git clean -n (-n is for a dry-run) will tell you which untracked files would be removed. git clean -f will actually remove them.
  • I want to reset a single file to its previous state.

    • git checkout -- ${PATH_TO_FILE}. It's like git reset --hard but on a single file. DO NOT try git reset --hard ${PATH_TO_FILE}. In the best case you'll get an error along the lines of fatal: Cannot do hard reset with paths., in the worst you'll throw away any uncommitted change.
  • I want to retrieve a specific file from a specific branch.

    • Similarly to the above: git checkout ${BRANCH} -- ${PATH_TO_FILE}.
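
A few of the answers above can be tried safely in a sandbox. This sketch assumes a temporary directory, a placeholder identity and a made-up untracked file called junk.txt. It shows HEAD~1, restoring a single file and cleaning untracked files.

```shell
set -eu

repo=$(mktemp -d)                         # throwaway directory (assumption)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master   # name the first branch master
git config user.name "Example Author"     # placeholder identity
git config user.email "author@example.com"

echo "line one" > test.txt
git add test.txt
git commit -q -m "Commit 1"
echo "line two" >> test.txt
git commit -q -a -m "Commit 2"

# HEAD~1 is the commit right before the current one.
git log -1 --format=%s HEAD~1    # Commit 1

# Throw away an uncommitted change to a single file.
echo "oops" >> test.txt
git checkout -- test.txt
cat test.txt                     # line one, line two

# Preview, then remove, untracked files.
echo "scratch" > junk.txt        # hypothetical untracked file
git clean -n                     # Would remove junk.txt
git clean -f                     # Removing junk.txt
git status --porcelain           # prints nothing
```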

Conclusion

I hope you now appreciate the importance of using a VCS, regardless of the size of your project or the number of people in the team. Dropping that Dropbox premium subscription and moving all of your projects to a hosted VCS is your first step towards a happier life (and a sustainable career as a software developer).

(Special thanks go to XKCD for being an awesome webcomic.)