Small PRs, Big Impact: A Git Workflow for Data Scientists
Collaborating on projects using Git can be challenging – especially for scientists, academics, and those without a software engineering background. This was certainly true for me early in my career. Since then, I’ve learned a lot from the incredible people I’ve worked with as a machine learning engineer, and it’s a topic I’ve read quite a bit about.
The main bottleneck in any project – software, data science, or otherwise – is almost always the human element. A small investment in learning how to smooth out the collaboration process pays off substantially. I want to share some of what I’ve learned to make Git workflows more accessible to data scientists.
A Typical Git Workflow
Here’s the general workflow when working on a Git project:
- `git clone` a repository
- Plan a change to the repository
- `git checkout -b <branch>` to create a new branch
- Make changes
- `git push`
- Open a pull request (PR) on GitHub to merge `<branch>` into `main`
A major problem arises when a branch includes too many changes. As a reviewer, it’s hard not to shudder at the sight of 800+ changed lines. This leads to several issues:
- Reviewers procrastinate – reviewing so many changes is a gruelling task
- Reviews take longer
- Important details are missed
Reading code is much harder than writing it. There’s significant cognitive load involved in giving good-quality reviews (which raises the question – why bother with reviews if they’re ineffective?). Large PRs are unsustainable and don’t scale well across a team.
The key takeaway:
Smaller and more focused PRs tend to spend less time stuck in review limbo.
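One way to keep an eye on this is to measure a branch before opening the PR. The sketch below builds a throwaway repository so it can run anywhere; the branch name `feature`, the file names, and the `limit` threshold are all made up for illustration. In a real project only the two `git diff` commands matter, pointed at your own base branch.

```shell
#!/bin/sh
set -eu

# Identity for the throwaway commits below (illustrative values).
export GIT_AUTHOR_NAME="Example" GIT_AUTHOR_EMAIL="example@example.com"
export GIT_COMMITTER_NAME="Example" GIT_COMMITTER_EMAIL="example@example.com"

# Build a small sandbox repository with one base commit.
sandbox=$(mktemp -d)
git init --quiet "$sandbox"
cd "$sandbox"
printf 'a\nb\n' > data.txt
git add data.txt
git commit --quiet -m 'Initial commit'
base_branch=$(git symbolic-ref --short HEAD)  # main or master, depending on config

# Make some changes on a feature branch.
git checkout --quiet -b feature
printf 'a\nb\nc\nd\n' > data.txt
printf 'x\n' > new.txt
git add .
git commit --quiet -m 'Add rows and a new file'

# Three-dot diff: only what this branch changed since it forked from the base.
git diff --shortstat "$base_branch"...HEAD

# Sum added and deleted lines across all files reported by --numstat.
changed=$(git diff --numstat "$base_branch"...HEAD |
    awk '{ total += $1 + $2 } END { print total + 0 }')
echo "changed lines: $changed"

limit=400  # hypothetical threshold; pick whatever suits your team
if [ "$changed" -gt "$limit" ]; then
    echo "Consider splitting this branch into smaller PRs."
fi
```

The number is only a rough proxy: a 300-line rename is easier to review than 100 lines of subtle logic, so treat it as a prompt to think rather than a hard rule.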
Tips for a High-Quality Review
To ensure your PRs are easier to review and more likely to be approved quickly:
- Keep them small and focused
- Include tests and docstrings early
- Provide code that reviewers can run to verify the changes work
- Bonus points for screenshots or other testing evidence
- Don’t wait until the end to get feedback – ping someone on Slack and start a conversation
Breaking Down Large Changes
For large changes, I often create a “dev” branch where I can experiment freely. Once I’ve got a working draft, I split it into smaller, reviewable PRs.
For example, say I’m implementing a new data processing pipeline. I might split the work into PRs like:
- Implement skeleton code with placeholder endpoints
- Add data downloading function
- Add data cleaning function
- Add data transformation function
- Add data uploading function
This keeps each PR focused and easier to reason about.
Example Workflow
We’ll clone two copies of the repo onto our local machine:
$ git clone git@github.com:<user>/<repo>.git <repo>
$ git clone git@github.com:<user>/<repo>.git <repo>-dev
Go to your -dev version of the repo and use this purely for development:
$ cd <repo>-dev
$ git checkout -b <name>/dev/<task-name>
# Make *all* the changes needed
$ git add .
$ git commit -m 'Draft implementation of feature'
$ git push -u origin <name>/dev/<task-name>
You can open a draft PR for this branch, but if it’s too large, it’ll be hard to review. So let’s break it up using our other repo copy:
$ cd ../<repo>
$ git pull # Make sure main is up to date
$ git checkout -b <name>/feat/add-skeleton-and-docs
$ code .
# Make *only* the changes needed for skeleton code with placeholders
# You can copy-paste code or use `git diff`/`git apply`
$ git add .
$ git commit -m 'feat: Add skeleton for new service'
$ git push -u origin <name>/feat/add-skeleton-and-docs
Now we can open a PR for <name>/feat/add-skeleton-and-docs. It’s smaller, more focused, and should get reviewed much faster.
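For the `git diff`/`git apply` route mentioned in the comments above, here is a minimal sketch of moving just one file from the dev clone into the clean clone. Everything in it is illustrative: a local repository stands in for the GitHub remote, and the branch names (`dev/task`, `feat/add-download`) and file names are hypothetical.

```shell
#!/bin/sh
set -eu

# Identity for the throwaway commits below (illustrative values).
export GIT_AUTHOR_NAME="Example" GIT_AUTHOR_EMAIL="example@example.com"
export GIT_COMMITTER_NAME="Example" GIT_COMMITTER_EMAIL="example@example.com"

sandbox=$(mktemp -d)
cd "$sandbox"

# A local repository stands in for the GitHub remote.
git init --quiet origin
main_branch=$(git -C origin symbolic-ref --short HEAD)
printf 'print("hello")\n' > origin/app.py
git -C origin add app.py
git -C origin commit --quiet -m 'Initial commit'

# Two clones, as in the workflow above.
git clone --quiet origin repo
git clone --quiet origin repo-dev

# Dev clone: one big draft commit touching several files.
cd "$sandbox/repo-dev"
git checkout --quiet -b dev/task
mkdir pipeline
printf 'def download():\n    pass\n' > pipeline/download.py
printf 'def clean():\n    pass\n' > pipeline/clean.py
git add .
git commit --quiet -m 'Draft implementation of feature'

# Export only the part that belongs in the next small PR.
git diff "origin/$main_branch" dev/task -- pipeline/download.py > "$sandbox/download.patch"

# Clean clone: apply that patch on a fresh branch.
cd "$sandbox/repo"
git checkout --quiet -b feat/add-download
git apply --check "$sandbox/download.patch"  # dry run; fails if it would not apply
git apply "$sandbox/download.patch"
git add pipeline/download.py
git commit --quiet -m 'feat: Add data downloading function'
```

Restricting the diff to a path works well when the split follows file boundaries. When a single file needs to be divided across PRs, copy-pasting or editing the patch by hand is usually simpler.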
If there’s PR feedback, we can sync our dev branch after the PR is merged:
$ cd ../<repo>-dev
$ git checkout <name>/dev/<task-name>
$ git fetch
$ git merge origin/main
# Check what's left to merge
$ git diff origin/main
# Check which files still have differences
$ git diff origin/main --numstat
Repeat this process until all the code from the dev branch has been merged via small PRs. Then, you can safely delete the dev branch!
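To make “safely” concrete, one option is to delete the branch only once `git diff --quiet` reports nothing left relative to main. The sketch below simulates the whole cycle in a sandbox, with a local bare repository standing in for GitHub and a squash merge standing in for the merged PRs; both are assumptions, as are the branch and file names.

```shell
#!/bin/sh
set -eu

# Identity for the throwaway commits below (illustrative values).
export GIT_AUTHOR_NAME="Example" GIT_AUTHOR_EMAIL="example@example.com"
export GIT_COMMITTER_NAME="Example" GIT_COMMITTER_EMAIL="example@example.com"

sandbox=$(mktemp -d)
cd "$sandbox"

# A local bare repository stands in for the GitHub remote.
git init --quiet --bare origin.git
git clone --quiet origin.git repo-dev
cd repo-dev
printf 'hello\n' > README
git add README
git commit --quiet -m 'Initial commit'
main_branch=$(git symbolic-ref --short HEAD)
git push --quiet -u origin "$main_branch"

# The dev branch with the draft work.
git checkout --quiet -b dev/task
printf 'draft\n' > feature.txt
git add feature.txt
git commit --quiet -m 'Draft implementation of feature'
git push --quiet -u origin dev/task

# Simulate the small PRs landing on main as a squash merge.
git checkout --quiet "$main_branch"
git merge --quiet --squash dev/task
git commit --quiet -m 'feat: Add feature (squashed PR)'
git push --quiet origin "$main_branch"

# Sync the dev branch with main, as in the steps above.
git checkout --quiet dev/task
git fetch --quiet
git merge --quiet --no-edit "origin/$main_branch"

# Delete the dev branch only if nothing is left to merge.
if git diff --quiet "origin/$main_branch"; then
    git checkout --quiet "$main_branch"
    git branch -D dev/task               # -D because squashed history differs
    git push --quiet origin --delete dev/task
else
    echo "dev branch still has unmerged changes"
fi
```

The forced `-D` is needed here because a squash merge leaves the dev branch’s commits outside main’s history, so the gentler `git branch -d` would refuse even though the content is identical.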