Git and the Centralized Workflow
Note: This lesson is in beta status! It may have issues that have not been addressed.
Handouts for this lesson need to be saved on your computer. Download and unzip this material into the directory (a.k.a. folder) where you plan to work.
As your research project moves from conception, through data collection, modeling and analysis, to publishing and other forms of dissemination, its components can fracture, lose their development history, and, worst of all, become conflicted or lost.
This lesson explains a high-level strategy for organizing your collaborative workflow and introduces accompanying software and cloud solutions. This strategy for distributed work on a shared codebase, the centralized workflow, is widespread in collaborative research.
A central hub stores project files and their history. Researchers are spokes on the wheel, working on private copies of the project. Project integrity is maintained through rules enforced by the hub for synchronizing between hub and spokes.
Objectives for this lesson
- See what version control does
- Learn about centralized workflows
- Try out GitHub
- Make “commits” to a project file with git
- “Push” and “pull” project work to GitHub
- “Merge” your work with a GitHub collaborator’s
Git in the Shell
The namesake of GitHub is the command-line utility git. It performs the clone, push, pull, and merge procedures just mentioned, and many more. To use git from the command line, you issue commands through the Unix shell. These commands have their own special syntax.
If you aren’t familiar with Unix shell commands, you might want to look at this lesson from Software Carpentry. Or check out explainshell.com, which is a handy tool that gives you the help text associated with specific shell commands.
Note on terminology and configuration
As of October 1, 2020, all new repositories created on GitHub have a default branch called main. Previously, the default name was master. The change was made to promote inclusive language in the version control world. SESYNC is planning to update the GitLab server to match this new default. However, the git client will still use master if you create a repository locally, unless you configure it as described below.
You should also be aware that any documentation, tutorial, or StackOverflow post written before 2020 will assume your default branch is called master.
We recommend setting the default branch name for new repositories you create to main. Enter the following into your terminal prompt.
git config --global init.defaultBranch main
This option is available for git version 2.28 or later.
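If you would like to see what the setting does before changing your global configuration, a sketch along these lines may help. It assumes git 2.28 or later; the throwaway folder from mktemp and the variable names demo_dir and branch_name are only for illustration.

```shell
# Print the installed version; init.defaultBranch needs 2.28 or later.
git --version

# Create a throwaway repo in a temporary folder.
# -c applies the option to this one command only, so the global config is untouched.
demo_dir="$(mktemp -d)"
git -c init.defaultBranch=main init -q "$demo_dir"

# Ask git which branch the new, still empty repo starts on.
branch_name="$(git -C "$demo_dir" symbolic-ref --short HEAD)"
echo "New repositories start on: $branch_name"
```

Once the global option is set as recommended above, a plain git init should start on the same branch without the -c flag.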
The software has no GUI of its own, and works through commands always beginning with git given in the shell.
For example, the command to turn the “current folder” into a git repo is git init. You would run git init locally from an existing folder containing project code.
cd <path to directory>
git init
Add files to git’s watchlist with the “add” command. This action is also known as “staging” files.
git add <path to files>
git status
You can stage every change in the current folder, including new files, with the following command.
git add .
“Commit” records the added (staged) files as a new, labeled version in your project’s history.
git commit -m "initial commit"
*** Please tell me who you are.

Run

  git config --global user.email "email@example.com"
  git config --global user.name "Your Name"

to set your account's default identity.
Omit --global to set the identity only in this repository.

fatal: empty ident name (for <(null)>) not allowed
The above error message appears if you have not yet configured git on your local machine with a user name and email address.
Every commit needs an author. Follow git’s instructions, using a real email address so your commits can be associated with your GitHub account, and try again.
git commit -m "initial commit"
git status
Now, author information will be associated with any commits you make. This is a one-time configuration for each computer on which you use git.
Saving, staging, and committing are each separate steps, none of which implies any of the others. This may seem like a hassle, but it is a good thing! As your project grows larger, you will frequently save changes you don’t want to commit: staging lets you choose which changes get packaged into a commit.
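To see that the three steps really are separate, you could try something like the sketch below in a scratch folder. The file names analysis.R and notes.txt, the demo identity, and the temporary folder are made up for illustration, and git init -b assumes git 2.28 or later.

```shell
# Work in a throwaway repo so nothing in your real project changes.
work_dir="$(mktemp -d)"
cd "$work_dir"
git init -q -b main
git config user.name "Demo User"          # identity for this repo only
git config user.email "demo@example.com"

# "Save" two files to disk.
echo "x <- 1" > analysis.R
echo "half-finished idea" > notes.txt

# Stage and commit only the script.
git add analysis.R
git commit -q -m "add analysis script"

# notes.txt is saved, but it was never staged, so it is not in the commit.
committed_files="$(git ls-files)"
status_output="$(git status --short)"
echo "committed: $committed_files"
echo "status:    $status_output"
```

The short status marks notes.txt with ??, meaning git sees the file but is not tracking it yet.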
Look at the Log
Version control gives you access to the state of the repository at any previous commit. View this history in the log by running git log.
commit <sha>
Author: <author>
Date:   <datetime>

    initial commit
Edit your committed file with some small, breaking change. Create a second commit that includes this change, and make sure it shows up in the log.
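Let’s investigate the most recent commit with git show.
commit <sha>
Author: <author>
Date:   <datetime>

    <message>

<diff>
Undo the breaking change by reverting that commit. A revert adds a new commit that reverses the changes made by the commit you name.
git revert --no-edit <sha>
[main <sha>] Revert <message>
 1 file changed, 1 insertion(+), 1 deletion(-)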
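The whole cycle of breaking, inspecting, and reverting can be rehearsed in a scratch repo. The sketch below is one way to do it; the file script.R, its contents, and the commit messages are invented for the example, and HEAD is used in place of a copied sha.

```shell
work_dir="$(mktemp -d)"
cd "$work_dir"
git init -q -b main                        # -b needs git 2.28 or later
git config user.name "Demo User"
git config user.email "demo@example.com"

# First commit: a working line of code.
echo "total <- 1 + 1" > script.R
git add script.R
git commit -q -m "initial commit"

# Second commit: a small, breaking change.
echo "total <- 1 +" > script.R
git add script.R
git commit -q -m "break the sum"

# Inspect the history and the most recent commit.
git --no-pager log --oneline
git --no-pager show HEAD

# Undo the breaking commit with a new commit that reverses it.
git revert --no-edit HEAD

restored="$(cat script.R)"
commit_count="$(git rev-list --count HEAD)"
last_subject="$(git log -1 --format=%s)"
echo "file now reads: $restored ($commit_count commits in the log)"
```

Notice that the history grows to three commits: the revert does not erase the mistake, it records the correction.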
A Plug for Reproducible Research
Reproducibility is a core tenet of the scientific method. Experiments are reported in sufficient detail for a skilled practitioner to duplicate the result.
The principle applies equally to modeling, analysis, and perhaps most of all to data synthesis.
Hallmarks of reproducible research:
| Reviewable | All details of the method used are easily accessible for peer review and community review. |
| Auditable | Records exist to document how the methods and conclusions evolved, but may be private. |
| Replicable | Given sufficient resources, a skilled practitioner could duplicate the research without any guesswork. |
| Open | The originator grants permissions for reuse and extension of the research products. |
Let your workflow help achieve these same goals:
| Thoroughly comment scripts and share continuously with collaborators | Reviewable |
| Maintain project history to correct mistakes when necessary | Auditable |
| Provide “one-click” file & data sharing of a streamlined analysis “pipeline” | Replicable |
| Publicly release on GitHub (or similar) with (implied) open licensing | Open |
What’s a GitHub?
The origin is the central copy of the project, a repository that lives on GitHub. Every member of the team uses a local copy of the entire project, called a clone.
Cloning is the initial pull of the entire project and all its history. In general, a worker pulls the work of other teammates from the origin when ready to incorporate their work, and she pushes updates to the origin when ready to share her own work.
A commit is a unit of work: any collection of changes to one or more files in the repository. A versioned project is like a tree of commits, although the current tree has just one branch. After a worker creates a clone, the local copy is viewing the same commit as the origin.
Notice that the local and remote (origin) repos are both on a branch called main in the diagram below. This is the default name given to the primary version of the repository.
When the origin has commits that do not exist in the local repo, it has gotten ahead and a pull is required to synchronize state.
A pull, or initially a clone, applies commits copied from the origin to your local repo, syncing them up as if you had created identical commits locally.
In the opposite situation, commits created locally are not immediately synchronized to the origin.
A push copies local commits to the origin and applies them remotely.
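You can watch a clone fall behind and catch up without touching GitHub at all. In the sketch below a local bare repository stands in for the origin, and two folders named owner and collaborator play the two workers; all of these names, and the README contents, are assumptions made for the example.

```shell
sandbox="$(mktemp -d)"
cd "$sandbox"

# A bare repository plays the role of the origin on GitHub.
git init -q --bare -b main origin.git

# The owner starts the project and pushes the first commit.
git init -q -b main owner
cd owner
git config user.name "Owner"
git config user.email "owner@example.com"
git remote add origin "$sandbox/origin.git"
echo "# Project" > README.md
git add README.md
git commit -q -m "initial commit"
git push -q -u origin main

# The collaborator clones, commits, and pushes.
cd "$sandbox"
git clone -q origin.git collaborator
cd collaborator
git config user.name "Collaborator"
git config user.email "collaborator@example.com"
echo "- Collaborator" >> README.md
git commit -q -am "add collaborator"
git push -q origin main

# Back in the owner's clone: the origin has gotten ahead.
cd "$sandbox/owner"
git fetch -q origin
behind="$(git rev-list --count main..origin/main)"   # commits only on the origin
ahead="$(git rev-list --count origin/main..main)"    # commits only in the local repo
echo "before pull: ahead $ahead, behind $behind"

# A pull applies the missing commit locally.
git pull -q --ff-only origin main
behind_after="$(git rev-list --count main..origin/main)"
echo "after pull: behind $behind_after"
```

The --ff-only flag is a cautious choice for this case: it lets the pull succeed only when the local branch has no commits of its own to reconcile.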
Create a GitHub Repository
1. Sign in or create a GitHub account.
2. Create a personal access token.
IMPORTANT: As of August 2021, a personal access token is required in place of your account password when pushing to a GitHub repo over HTTPS. The link above is to a GitHub documentation page with very detailed instructions on how to navigate to the settings page where you can generate a token. When you are prompted to select the scopes (permissions) to give the token, check the box marked repo. After you generate the token, save it in a safe place; you will need it in a moment. The best place to save it long-term is a password manager such as LastPass.
3. Create a new repository on your GitHub page.
- Give the repo a name
- Add a short “tag line” to jog your memory
- Do not check any boxes or add any files, so that the repository starts out empty
You have created an empty repository. The quick start information provides clues on how to see your first commits.
Configure your clone
To push and pull from your local repo to GitHub, you must configure your local repo with the URL of the remote repo. By convention, we call the central copy the “origin”.
git remote add origin <URL>
Push your commit up to the origin.
IMPORTANT: When you are prompted to enter your password, paste your personal access token into the prompt, not the password that you use to sign in to GitHub.com in your browser. On Windows you will need to use Insert or right-click to paste, because Ctrl+V will not work in a terminal window.
Username for 'https://github.com': <username>
Password for 'https://<username>@github.com':
Counting objects: <progress>
Delta compression using up to 4 threads.
Compressing objects: <progress>
Writing objects: <progress>
<stats>
remote: Resolving deltas: <progress>
To 'https://github.com/<username>/<repo>.git'
<sha>..<sha> main -> main
Branch 'main' set up to track remote branch 'main' from 'origin'.
Counting objects: <progress>
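The output above ends with a note that main now tracks origin, which is what git push reports when it is given the -u (set upstream) option. The sketch below rehearses the configure-and-push sequence against a local bare repository instead of GitHub, so no token is needed. The folder names, the file notes.md, and the use of -u are assumptions for illustration rather than the exact commands of this lesson.

```shell
sandbox="$(mktemp -d)"
cd "$sandbox"

# Stand-in for the empty repository you created on GitHub.
git init -q --bare -b main origin.git

# A local project with one commit.
git init -q -b main project
cd project
git config user.name "Demo User"
git config user.email "demo@example.com"
echo "first draft" > notes.md
git add notes.md
git commit -q -m "initial commit"

# Tell the local repo where the origin lives, then check the setting.
git remote add origin "$sandbox/origin.git"
git remote -v

# Push main and remember origin/main as its upstream branch.
git push -q -u origin main

upstream="$(git rev-parse --abbrev-ref '@{u}')"
local_head="$(git rev-parse HEAD)"
remote_head="$(git ls-remote origin refs/heads/main | cut -f1)"
echo "main tracks $upstream"
```

With the upstream remembered, later pushes and pulls from this branch can be written simply as git push and git pull.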
Take a look at the repository on GitHub.
- There is a space for files
- There is a suggestion to create a README.md, a project summary in Markdown.
- You are looking at a branch called main.
- The commit history is available from the top bar.
- The “Clone or download” button provides a URL.
The online editor is good for quick and easy fixes, and for working on documentation. It’s a bad place to modify code, because the code is not tested before reaching the origin. It’s great for creating a project README.
Create a new file called “README.md” and add the following content on separate lines with a blank line in between.
- A title, preceded by # (the markdown “level 1” heading)
- An “About” section, preceded by ## (the markdown “level 2” heading)
- A “Contributors” section, preceded by ##
- Your name, preceded by - (the markdown bulleted list)
As you go, utilize the Preview tab to see the result of rendering your Markdown to HTML.
An essential component of the centralized workflow is the ability to merge commit histories that have diverged. Each fork in the log has to be re-integrated, and git does this automatically through merging.
git add <path>
git commit -m 'feel the learn'
[main <sha>] feel the learn
 5 files changed, 955 insertions(+)
Merge commits most commonly arise when a commit shows up on GitHub that isn’t in your local clone, as in the current situation.
Even though these changes do not conflict, GitHub won’t allow you to push. Take a moment to read the message; it gives a good explanation of what has happened.
To https://github.com/<username>/<repo>.git
 ! [rejected]        main -> main (fetch first)
error: failed to push some refs to 'https://github.com/<username>/<repo>.git'
hint: Updates were rejected because the remote contains work that you do
hint: not have locally. This is usually caused by another repository pushing
hint: to the same ref. You may want to first integrate the remote changes
hint: (e.g., 'git pull ...') before pushing again.
hint: See the 'Note about fast-forwards' in 'git push --help' for details.
The origin does not even attempt to reconcile diverging commit histories; it does not matter that the diverging commits affect separate files. In order to preserve the repo, the contributor is always responsible for “overseeing” the merge on a local clone.
Take the Hint!
remote: Counting objects: <progress>
remote: Compressing objects: <progress>
remote: <stats>
Unpacking objects: <progress>
From https://github.com/<username>/<repo>
   <sha>..<sha>  main -> origin/main
Auto-merging README.md
Merge made by the 'recursive' strategy.
 README.md | 1 +
 1 file changed, 1 insertion(+)
The message tells you about any changes made by this merge commit, which seamlessly integrates changes to the same file by multiple authors.
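The rejected push followed by a merging pull can be reproduced locally. In this sketch a bare repository again stands in for GitHub, and the folder names, file names, and commit messages are all hypothetical. The pull spells out --no-rebase because recent git versions ask you to choose how diverging histories are reconciled, and --no-edit accepts the default merge message.

```shell
sandbox="$(mktemp -d)"
cd "$sandbox"
git init -q --bare -b main origin.git       # stand-in for the GitHub origin

# Owner creates the project and pushes.
git init -q -b main owner
cd owner
git config user.name "Owner"
git config user.email "owner@example.com"
git remote add origin "$sandbox/origin.git"
echo "# Project" > README.md
git add README.md
git commit -q -m "initial commit"
git push -q -u origin main

# Collaborator adds a commit to the origin.
cd "$sandbox"
git clone -q origin.git collaborator
cd collaborator
git config user.name "Collaborator"
git config user.email "collaborator@example.com"
echo "## About" >> README.md
git commit -q -am "add About heading"
git push -q origin main

# Meanwhile the owner commits a different file locally.
cd "$sandbox/owner"
echo "x <- 1" > analysis.R
git add analysis.R
git commit -q -m "add analysis script"

# The origin refuses the push because the histories have diverged.
if git push -q origin main 2>/dev/null; then
  push_result="accepted"
else
  push_result="rejected"
fi
echo "first push: $push_result"

# Take the hint: pull (which merges locally), then push again.
git pull -q --no-rebase --no-edit origin main
git push -q origin main
merge_count="$(git rev-list --merges --count HEAD)"
echo "merge commits in history: $merge_count"
```

The merge happens in the owner's clone, not on the origin, which matches the rule that the contributor oversees every merge.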
Working with Collaborators
True collaboration goes deeper than commenting on a final report, but integrated work on a project from start to finish raises workflow challenges.
- Be it data, a script, or a write-up, who has the most up-to-date version?
- Will a teammate’s work overwrite any of your own?
- How do I recover the working version of code the PI broke?
Centralized workflows, managed by git, help solve these challenges.
- The origin becomes the official up-to-date repo, even if your work is a few commits ahead.
- Diverging files are easily reintegrated with a merge algorithm.
- The complete project history is available to check out.
Note, version control works really well with text. Non-textual components of your project (e.g. large or binary data) rarely live in a repository. Use cloud storage for more static files and a database for dynamic records.
Add a section where you can list collaborators to the README.md.
## Collaborators
- <your name>
- My Neighbor
Our aim is to let your project collaborator replace “My Neighbor” with his or her name.
Commit it with git
Before you can commit changes involving a new file, you have to tell the version control system (that’s git!) what changes to include.
git add README.md
git commit -m 'just me so far!'
Look at the git status and notice that your branch is ahead of origin/main! Push those commit(s) to your GitHub repo.
The first step to collaborative workflows is granting access to the origin of your project. Introduce yourself to your neighbor, and decide which of you will be the “owner” and which the “collaborator”. The owner will need the collaborator’s GitHub username.
Even on public GitHub repos, only the owner has “push access” by default. The owner can allow any other GitHub user to push by inviting collaborators under the Settings tab (Settings > Manage access > Invite a collaborator).
Add your neighbor as a collaborator!
As the collaborator on your neighbor’s repository, you have permission to edit his or her README.md. Make sure you accept the invitation to collaborate in your email!
The text below shows where you’ll see the owner’s name if you’re looking at the right repository (not your own). The collaborator should edit the file in the owner’s repo, replacing “My Neighbor” with his or her own name.
## Collaborators
- <the owner's name>
- <your name>
Write a meaningful commit message while “saving” your work. Note that on the GitHub editor, there’s no distinction between save and commit. The owners should then pull the new commit into their local clone of the project.
Diverging commits that do not affect the same files, or affect different lines within a file, can usually be merged automatically. That’s what happened in the previous example where everything happened in sequence. First, the owner committed and pushed, then the collaborator pulled, committed, and pushed, then the owner pulled again. But if both owner and collaborator modify the same lines of a file at the same time, git cannot safely merge the commits because it has no way of knowing which version to use. If git cannot safely merge commits, it guides you through conflict resolution.
A “merge conflict” will arise when two contributors change the same line of text, for example, if you both add a project description.
The owner adds a description under “# About” in the local clone. Meanwhile the collaborator adds a description under “# About” using the GitHub editor in the owner’s repository.
# About ...
The owner commits his or her change, but receives an error message from git when attempting to pull.
CONFLICT (content): Merge conflict in <path>
Automatic merge failed; fix conflicts and then commit the result.
Any conflicted region is fenced off in the named files with conflict markers and must be manually tidied up. The marker <<<<<<< indicates the beginning of your version of the conflicted section, and ======= indicates the beginning of your neighbor’s version, which ends with >>>>>>>.
<<<<<<< HEAD:main
...
=======
...
>>>>>>>
Follow all the instructions in the original message (or ask again with git status).
You have unmerged paths.
  (fix conflicts and run "git commit")

Unmerged paths:
  (use "git add <file>..." to mark resolution)
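If you would like to practice before trying this with your neighbor, the sketch below stages a conflict and resolves it entirely on your own machine. A local bare repository stands in for GitHub, and the folder names and the two competing descriptions are invented for the example. In real work you would edit the conflicted file by hand; here the agreed text is simply written over it.

```shell
sandbox="$(mktemp -d)"
cd "$sandbox"
git init -q --bare -b main origin.git       # stand-in for the GitHub origin

# Owner pushes a README with an empty About section.
git init -q -b main owner
cd owner
git config user.name "Owner"
git config user.email "owner@example.com"
git remote add origin "$sandbox/origin.git"
printf '# About\n\n' > README.md
git add README.md
git commit -q -m "add About section"
git push -q -u origin main

# Collaborator adds a description and pushes first.
cd "$sandbox"
git clone -q origin.git collaborator
cd collaborator
git config user.name "Collaborator"
git config user.email "collaborator@example.com"
echo "A project about birds." >> README.md
git commit -q -am "collaborator description"
git push -q origin main

# Owner adds a different description in the same place.
cd "$sandbox/owner"
echo "A project about trees." >> README.md
git commit -q -am "owner description"

# The pull cannot merge the two versions automatically.
if git pull -q --no-rebase --no-edit origin main >/dev/null 2>&1; then
  pull_result="merged"
else
  pull_result="conflict"
fi
marker_count="$(grep -c '^<<<<<<<' README.md)"
echo "pull: $pull_result, conflicted regions: $marker_count"

# Resolve: replace the fenced-off region with the text you agree on,
# mark the file as resolved with add, and commit to finish the merge.
printf '# About\n\nA project about birds and trees.\n' > README.md
git add README.md
git commit -q -m "merge project descriptions"
git push -q origin main
merge_count="$(git rev-list --merges --count HEAD)"
```

After the final push, the origin holds both descriptions' history plus the merge commit that reconciles them.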
Important note: If you find resolving merge conflicts confusing, the best way to avoid them is to pull before you push! That means always pull the most recent version of the repo from the remote before making changes. That way, merge conflicts will only occur if you and your collaborator(s) are working on the code at the exact same time.
Switch roles with your neighbor and repeat both Exercise 3 and the steps above to introduce and resolve a merge conflict.
Share and Contribute
The repository you created is an example of the heart of a distributed workflow. Putting the origin of your project on GitHub (or similar) will make it accessible not only by your collaborators, but also available for review and extension by your research community.
GitHub is the home of a large share of open source software, including many R and Python packages, that helps research advance. Through GitHub you can track issues with software you use, pitch in on solving problems, and even submit “pull requests” for new features you develop.