Overview
Teaching: 20 min
Exercises: 0 minQuestions
What are the fundamental concepts and motivations of version control?
Objectives
Get a mental representation for commits and branches.
Understand the difference between forks and clones.
Understand the difference between Git and GitHub.
Introduction to the lesson
This tutorial provides a brief but hands-on introduction to Git and GitHub to exercise useful workflows within GitHub. Git is a version control system which allows files and data to be tracked synchronized, which allows collaboration and reproducibility. GitHub is a popular website that stores these repositories.
This tutorial avoids using the command line. Instead, we will practice collaborating and sharing using either the GitHub website or GitHub desktop application. Why? Because for many cases, it is enough. Especially if you are contributing to existing non-code projects, this may be the fastest, easiest way to do it. Git and GitHub provide collaboration tools to all kinds of projects, and there are all kinds of good ways to use it.
This serves as an introduction to Git. After this lesson, you will both be able to use Git, and feel much more confident taking a command-line Git course such as CodeRefinery’s git-intro course. Check out their other upcoming workshops.
What we will not cover
- Command line interface
- Cloning using SSH protocol and SSH keys
- Rebasing and squashing
- Many Git tricks which can be explored later
Why GitHub?
We will do this exercise on GitHub Links to an external site. but also GitLab Links to an external site. and Bitbucket Links to an external site. allow similar workflows and basically everything that we will discuss is transferable. With this material and these exercises we do not endorse the company GitHub Links to an external site.. We have chosen to demonstrate a number of concepts using examples with GitHub Links to an external site. because it is currently the most popular web platform for hosting Git repositories and the chance is high that you will interact with GitHub Links to an external site.-based repositories even if you choose to host your Git repository on another platform.
We also encourage course participants to use CodeRefinery’s Nordic research software repository platform Links to an external site., for more information see https://coderefinery.org/repository/ Links to an external site..
Version control in general
- Version control is a tool that can record snapshots of a project.
- You can think of version control like regularly taking a photo of your work (movie sets take regular polaroids to be able to recreate a scene the next day).
Snapshots (commits) in the nf-core/viralrecon Links to an external site. repository.
What we typically like to snapshot
- Software (this is how it started but Git/GitHub can track a lot more)
- Scripts
- Documents (plain text file much more suitable than Word documents)
- Manuscripts (Git is great for collaborating/sharing LaTeX manuscripts)
- Configuration files
- Website sources
- Data
Why are snapshots valuable? Reproducibility!
- We can always go back if we make a mistake.
- We have the means to refer to a well-defined version of a project when sharing, collaborating, and publishing.
- If we discover a problem, we can find out when it was introduced.
Version control with Git and GitHub
Difference between Git Links to an external site. and GitHub Links to an external site.
Git
- Tool that can record and synchronize snapshots.
- Not the only tool that can record snapshots (other popular tools are Subversion Links to an external site. and Mercurial Links to an external site.).
- Not only a tool but also a format that can be read by many different tools.
GitHub
- Service that provides hosting for Git repositories with a nice web interface.
- Not the only service that provides this (other popular services are GitLab Links to an external site. and Bitbucket Links to an external site.).
GitHub Desktop
- Graphical user interface to Git and GitHub which runs locally on your computer.
- There are other tools that can do this too (e.g. Sourcetree Links to an external site.).
Commits, branches, repositories, forks, clones
- repository: The project, contains all data and history (commits, branches, tags).
- commit: Snapshot of the project, gets a unique identifier (e.g.
c7f0e8bfc718be04525847fc7ac237f470add76e
). - branch: Independent development line, often we call the main development line
main
. - tag: A pointer to one commit, to be able to refer to it later. Like a sticky note that you attach to a particular commit (e.g.
phd-printed
orpaper-submitted
). - cloning: Copying the whole repository to your laptop - the first time. It is not necessary to download each file one by one.
- forking: Taking a copy of a repository (which is typically not yours) - your copy (fork) stays on GitHub and you can make changes to your copy.
GitHub file view of the nf-core/viralrecon Links to an external site. repository. This is the version of all files at a single point in time.
Github history view of the viralrecon Links to an external site. repository. This is the progression of the repository (with the commit message over time).
Network graph of all commits in the nf-core/viralrecon Links to an external site. repository. This shows the relationship between different forks of people who are contributing and sharing code.
Interesting repositories to explore these concepts
- Assembly and intrahost/low-frequency variant calling for viral samples
- Repository: https://github.com/nf-core/viralrecon/ Links to an external site.
- Commits, branches, forks: https://github.com/nf-core/viralrecon//network Links to an external site.
- Swedish COVID-19 data portal
- Activity inequality study
Links to an external site.
- Contains data and code necessary to create figures from their article.
- Data: https://github.com/timalthoff/activityinequality/tree/master/data Links to an external site.
- FiveThirtyEight story Why We’re Sharing 3 Million Russian Troll Tweets
Links to an external site.
- Contains data and readme file, no code.
- Data: https://github.com/fivethirtyeight/russian-troll-tweets Links to an external site.
- The NY Times Coronavirus (Covid-19) Data in the United States
- Contains data, readme, license, but no code. As of 2020 April, being updated every day.
- Data: https://github.com/nytimes/covid-19-data Links to an external site.
- Website: https://www.nytimes.com/interactive/2020/us/coronavirus-us-cases.html Links to an external site.
- CSV exports of the Getty Provenance Index
- Entire books are written using Git/GitHub:
- Papers under open review:
Why using repositories?
- All changes are recorded.
- We do not have to send changes via email.
- We can experiment with several ideas which might not work out (using branches).
- Several people can work on the same project at the same time (using branches).
- We do not have to wait for others to send us “the latest version” over email.
- We do not have to merge parallel developments by hand.
- Group-based access model where shared access is the default, instead of everything fundamentally owned by individuals who manage sharing as-needed: with Git you can easily have collaboration be the default.
- It is possible to serve websites directly from a repository.
Discussion
- How have you solved these in the past without version control?