Case study: Basic use of Git#

This case study will take you through the setup and commands needed to create and use a local git repository. Note that as you work through this example the dates, username and email address will appear different from mine (so don’t panic).

The material here assumes you have Git installed on your machine.

Scenario#

You are writing a new analysis program for your healthcare organisation. Its a complicated task so you have decided to use Git to manage and version control your code. This is the first time you have used Git.

Step 1: Setup Git#

The first time you use Git you need to setup the username and email address that you are going to use. When you do this make sure that you select your own username and email address and not mine!

In a later section we will use Github for remote version control, collaboration, and sharing. I very much recommend using the email address you would normally use to sign up to an online service like GitHub. If you have privacy concerns then GitHub has a solution.

$ git config --global user.name "Tom Monks"
$ git config --global user.email "email_I_use_for_github@gmail.com"

You can check your config via

$ git config --list

Step 2: Initialise the Git repository#

Navigate to the directory on your machine where you want to store your code. If you need to create a new directory for example analysis_code

$ git init

When you issue the init command your terminal should return the following Initialised empty Git repository in <path to your directory/analysis_code/.git> This means there is a hidden directory called .git within your analysis_code directory. This is the git repository (I’ll shorten this to repo from now on).

Step 3: Add some code#

Within analysis_code we create a file main.py that contains the following code:

'''
main module

'''

def do_something():
    pass

if __name__ == '__main__':
    do_something()

Step 4: Check the status of the git repo#

The status command is something I would recommend using frequently. It let’s you know the status of the repo.

$ git status

After steps 1-4 git reports that we are on the master branch (please note that future releases of git will migrate from master to main branch); there are no commits yet and we have one untracked file: main.py

On branch master

No commits yet

Untracked files:
  (use "git add <file>..." to include in what will be committed)
	main.py

Step 5: Stage the changes#

Now we need to stage (add) the changes. Staging a file tells git that you want to include it in the next commit to the repository. There are many file types that you do not want to track in a repository. For example, python IDE’s such as VSCode or Jupyter produce additional files to support the experience of using them (for code analysis or checkpoints). These types of file clutter a repo, increase its size and are not needed to run the code within them.

Becareful of what you commit

I advise you always look at what you have staged before you commit. Don’t commit any sensitive information such as password or personally identifiable data.

Git is very flexible in how you stage changes. We could for example add all modified or untracked files:

$ git add .

That has the downside of being a kitchen sink approach and may include files you do not wish to commit in the staging. An alternative is to add all python files only

$ git add *.py

or add the specific python file:

$ git add main.py

If we follow any of these commands with git status then git tells which files are to be committed.

On branch master

No commits yet

Changes to be committed:
  (use "git rm --cached <file>..." to unstage)
	new file:   main.py

Step 6: Commit the changes#

We are now ready to commit our code to the repository. A commit also contains meta-data describing the commit. Writing a good commit message is a bit of an art. Before I give you some tips let’s have a look at the command

$ git commit -m "INIT: add main.py"

The option -m just let’s us add a short descriptive message the commit. A useful tactic a very bright colleague recommended to me was to divide a commit message into two parts: the commit type and a description of changes. INIT tells me and others the type of commit or feature it is targeting. In this case its the the initial commit. The remaining text add main.py is just referring to the file that has been added.

If we now follow this with the log command

$ git log

We can see the commit history

commit 6b91b6a06a2e624e8fd8a61b0fb50b46e7c994c3 (HEAD -> master)
Author: Tom Monks <email_I_use_for_github@gmail.com>
Date:   Tue Aug 3 15:48:11 2021 +0100

    INIT: add main.py

Commit message tips

  • TEST -testing code

  • DOCS - documentation (or README if limited to that)

  • PATCH - bug fix (or FIX)

When I am working on a ‘feature’ I also sometimes use a shorthand. For example, on the markdown on this page I was using `GIT:’. Here are a few real commits I made

  • GIT: case study 1 steps 1-7

  • GIT: + commit message tips

  • GIT: case study 1 typo fix

Step 7: Add a .gitignore#

To help avoid staging and committing files that you do not want you need to include .gitignore (a git ignore file). This is just a list of file types and directories that Git will ignore when staging changes. For example, let’s assume the analysis code inmain.py generates a .log file that reports intermediate results and if the run was successful. There is no value in committing this data to a repo as it varies on each run and increases the size of the repo. For this example let’s manually create a few log files and add them to the directory. We will also create a file called .gitignore

analysis_code
├── main.py
├── run_1.log
├── run_2.log
├── .gitignore

Before you add anything to .gitignore issue a status command

$ git status

The output will include both of the log files as untracked file to be staged.

On branch master
Untracked files:
  (use "git add <file>..." to include in what will be committed)
	.gitignore
	run_1.log
	run_2.log

We can avoid this problem by adding the following to .gitignore

# analysis log files
*.log

Now rerun the status command

$ git status

and you will find that both log files have been ignored.

On branch master
Untracked files:
  (use "git add <file>..." to include in what will be committed)
	.gitignore

So we now need to stage .gitignore and add an appropriate commit message. For my commit message I’ve chosen to use SETUP as the commit type as I am changing the setup of my git repo (another option might be CONFIG).

$ git add .gitignore
$ git commit -m "SETUP: .gitignore + *.log"

Step 8: View the diff#

We now have a couple of commits. Taking a look at the log we have

commit 250435844c4d13c08cc34045d0be9f31e0b76ebd (HEAD -> master)
Author: Tom Monks <email_I_use_for_github@gmail.com>
Date:   Tue Aug 3 16:04:03 2021 +0100

    SETUP: .gitignore + *.log

commit 6b91b6a06a2e624e8fd8a61b0fb50b46e7c994c3
Author: Tom Monks <email_I_use_for_github@gmail.com>
Date:   Tue Aug 3 15:48:11 2021 +0100

Each of these commit’s has a (rather long) identifier that we will call a hash. We can view the changes between two commits using the diff command

$ git diff 6b91b6a..2504358

Notice that I did not need to specify the full commit id (although you could copy paste). Instead git allowed me to specify a enough characters of each so that they were identifiable (typically 7). The diff reported by git is as expected:

diff --git a/.gitignore b/.gitignore
new file mode 100644
index 0000000..5281d88
--- /dev/null
+++ b/.gitignore
@@ -0,0 +1,2 @@
+# analysis log files
+*.log

This tells us that .gitignore was added including what it contained.

A tip is use the command git log --oneline This will provide a one line summary of each commit including the shortened commit hash.

Step 9: Add readme.md#

Finally let’s add some basic documentation to the project. We will do this using a markdown file called readme.md.

# Uber important health data science project

To run the code open a terminal and enter

`python3 main.py`

Once we have added the docs we can check the status of the repo, stage the changes, check that the changes have been successfully stage, and commit to the repo. After this we will look at the last two commits in the log to check everything has worked.

$ git status
$ git add *.md
$ git status
$ git commit -m "DOCS: run instructions"
$ git log -2
commit f732e5c1302c0f467853b1b256fcd4a4625b1a76 (HEAD -> master)
Author: Tom Monks <email_I_use_for_github@gmail.com>
Date:   Wed Aug 4 10:39:06 2021 +0100

    DOCS: +run instructions

commit 250435844c4d13c08cc34045d0be9f31e0b76ebd
Author: Tom Monks <email_I_use_for_github@gmail.com>
Date:   Tue Aug 3 16:04:03 2021 +0100

    SETUP: .gitignore + *.log

The final project folder looks like this:

analysis_code
├── main.py
├── readme.md
├── run_1.log
└── run_2.log

(Optional) Step 10: Tag the last commit as v1.0.0#

It is sometimes useful to tag a commit when it represents a milestone. This gives it a specific version number, for example v0.1.0, v1.0.0, v1.1.0 or v2.0.0, along with additional descriptive meta data. The formatting of the tags typically used is <major>.<minor>.<patch>. I find it very useful to think in terms of version numbers and milestones as opposed to individual commits (I’m sure I’m not alone there). Tags are stored in the git repo along with meta data. To tag the last commit we use

$ git tag -a v1.0.0 f732e5c -m "Analysis of full patient cohort"

You can view the tag via git tag show v1.0.0 and view all tags via git tag.

I don’t often include a tagging step in my local workflow.

The reason for that is tags are not automatically pushed to a remote repo (for example on GitHub). In fact I tend to create tagged releases on Github itself. That allows collaborators to download a milestone version of the code and provides an shared language when talking about versions/milestones with others.