Useful Git commands every Data Engineer should know

There is no getting around it. If you are working as a Data Engineer, you will be using some form of source control to manage and track changes to your pipeline code, handle your deployments (CI/CD), collaborate with your team, push infrastructure changes, and document your pipelines. If you are not using source control, stop reading now and implement it!

The most used source control system out there is Git, without a doubt. It is the single most popular version control system in use today, so much so that I struggle to name another. I’d like to sit here and say I know everything there is to know about Git, but I don’t. I use a handful of commands and they have served me well through the years. I probably only use 10% of what Git is capable of, and that’s being generous.

I’ve found that every company is different and every team has its own set of Git conventions in place. In this post I list the most common Git commands and the ones I’ve used most in my travels.

Setup Commands

Set the name and email address that will be attached to your commits. This is useful so people don’t think some random person is pushing changes to the repo. I’ve seen this happen first hand.

git config --global user.name "your name"
git config --global user.email "example@email.com"
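If you want to check what Git will actually stamp on your commits, you can read the values back. Here is a minimal sketch that does this in a throwaway repo, using a repo-level setting (no --global) so your real configuration is left alone. The directory and the name "Example Name" are placeholders I made up for the demo.

```shell
# Work in a throwaway repo so the real global config is left untouched.
demo_dir=$(mktemp -d)
cd "$demo_dir"
git init --quiet

# Without --global, the setting is stored in this repo's .git/config only.
git config user.name "Example Name"
git config user.email "example@email.com"

# Read the values back to confirm what Git will attach to commits made here.
configured_name=$(git config --get user.name)
configured_email=$(git config --get user.email)
echo "$configured_name <$configured_email>"
```

A repo-level value takes priority over the global one, which is handy if you use a different email for work and personal projects.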

Start a Project

When I’m starting a project I tend to log into the Git hosting service, create a new repository, clone it and begin working. However, on occasion I’ve had to turn a local directory into a repo.

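To create a local repo, use git init. Run with no arguments, it initialises the current directory as a Git repo; given a directory, it initialises that directory instead (creating it if needed).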

git init <directory>

A candidate for most used Git command is git clone, which downloads a remote repository. You will need to grab the repository’s SSH (or HTTPS) link from your Git host to do this.

# clone a remote repo
git clone <url>
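To see both commands side by side without needing a hosted repository, here is a rough sketch that uses a local bare repo as a stand-in for the remote. The folder names (pipeline, remote.git, clone) are made up for the demo; in real life the clone URL would be the link from your Git host.

```shell
workspace=$(mktemp -d)

# `git init <directory>` creates the directory if needed and puts a .git folder in it.
git init --quiet "$workspace/pipeline"

# A bare repo stands in for the hosted remote in this sketch.
git init --quiet --bare "$workspace/remote.git"

# Clone accepts any URL Git understands; a local path is enough for a demo.
# Git warns that the repository is empty, which is expected here.
git clone --quiet "$workspace/remote.git" "$workspace/clone"

# The clone remembers where it came from as the remote named "origin".
origin_url=$(git -C "$workspace/clone" remote get-url origin)
echo "$origin_url"
```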

Make a change

The git add . command adds all the changes in the current directory and its subdirectories to the staging area. I tend to be a little more specific about what I add. When using git add I always supply the files I changed rather than adding everything. You never know what else you might accidentally add.

git add <filename>
# or use "." to add all new and modified files in the current directory
git add .

It is important to note that the git add command only adds files to the staging area, and doesn’t actually commit the changes to the repository.

Once you’ve done your adding you can finally commit all staged files to the repository using git commit.

git commit -m "commit message something short but descriptive"

Leaving a short but useful message telling others what you have changed will help massively going forward. You will thank yourself later for doing this!
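The difference between staging and committing is easier to see in a small run-through. This is only a sketch in a throwaway repo; the file names and commit message are invented for the example.

```shell
repo=$(mktemp -d)
cd "$repo"
git init --quiet
git config user.name "Example Name"
git config user.email "example@email.com"

# Two new files, but only one of them is meant for this commit.
echo "select 1;" > load_orders.sql
echo "scratch notes" > notes.txt

# Stage just the file we want; notes.txt stays out of the staging area.
git add load_orders.sql
staged=$(git diff --cached --name-only)

# Commit what is staged, with a short but descriptive message.
git commit --quiet -m "Add orders load query"
last_message=$(git log --format=%s -1)

# notes.txt is still untracked, so it was not swept into the commit.
leftover=$(git status --porcelain)
echo "staged: $staged"
echo "committed: $last_message"
echo "left behind: $leftover"
```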

Branches

It is useful to be able to pull up a list of all local branches with git branch. Add the -r flag to show remote branches instead, or -a to show both remote and local branches.

git branch
git branch -r
git branch -a

When I first pull down a repository I always check out master, then create a new branch and make my changes there. To create a branch, use git branch and give your branch a name.

git branch <new branch>

To use that branch you need to switch to it. To switch to a branch and update the working directory do the following:

git checkout <branch>

Or you can create a new branch and switch to it in one step, which is way easier!

git checkout -b <branch>

You may need to clean up some old branches locally. To do this you can delete a merged branch with the -d flag, which refuses to delete a branch that still has unmerged changes. USE WITH CAUTION!

git branch -d <branch>
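Here is the whole branch life cycle as one short sketch: create and switch, switch back, then delete. It assumes a throwaway repo with a single empty commit, and the branch name feature/add-orders-pipeline is just an example. Note that switching is done with git checkout, while creating, listing and deleting are done with git branch.

```shell
repo=$(mktemp -d)
cd "$repo"
git init --quiet
git config user.name "Example Name"
git config user.email "example@email.com"
git commit --quiet --allow-empty -m "Initial commit"

# Remember the starting branch; its name (master, main, ...) depends on local Git settings.
base_branch=$(git rev-parse --abbrev-ref HEAD)

# Create a branch and switch to it in one step.
git checkout --quiet -b feature/add-orders-pipeline
current_branch=$(git rev-parse --abbrev-ref HEAD)
echo "now on: $current_branch"

# Switch back first; a branch cannot be deleted while it is checked out.
git checkout --quiet "$base_branch"

# -d only deletes fully merged branches; this one has no extra commits, so it qualifies.
git branch -d feature/add-orders-pipeline
remaining=$(git branch --list "feature/*")
```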

Undoing things

To undo a commit, first use the git log command to find the hash of the commit that you want to undo.

git log

Then use the git revert command to create a new commit that undoes the changes made by the previous commit. For example, if the commit hash is abcdefg, the command would be as below. This will open a text editor where you can enter a commit message for the new commit.

git revert abcdefg

Discard changes that you have made in the working directory, i.e. changes that have not been staged yet:

git restore <file>
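To make the two kinds of undo concrete, here is a sketch in a throwaway repo. The file config.yml and its contents are made up. I pass --no-edit to git revert so it keeps the default message instead of opening an editor, and git restore assumes Git 2.23 or newer.

```shell
repo=$(mktemp -d)
cd "$repo"
git init --quiet
git config user.name "Example Name"
git config user.email "example@email.com"

echo "v1" > config.yml
git add config.yml
git commit --quiet -m "Add config"

echo "v2" > config.yml
git commit --quiet -am "Change config to v2"

# Revert the latest commit: history is kept and a new commit undoes the change.
bad_commit=$(git rev-parse HEAD)
git revert --no-edit "$bad_commit"
after_revert=$(cat config.yml)

# Make an uncommitted edit, then throw it away with git restore.
echo "half-finished edit" > config.yml
git restore config.yml
after_restore=$(cat config.yml)

# Two original commits plus the revert commit.
commit_count=$(git rev-list --count HEAD)
echo "$after_revert $after_restore $commit_count"
```

The revert leaves the original commit in the history, which is usually what you want on a shared branch.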

Reviewing things

One of the most used commands in Git is git status. It lists all the new or modified files not yet committed.

git status

List the commit history, with commit IDs. Note that the --oneline flag displays the history in a simplified format with one line per commit.

git log

git log --oneline

git diff shows the difference between two states of a repository. Run on its own, it shows changes in the working directory that have not been staged yet. When used with two commit IDs it shows the difference between those two commits. I rarely use this but it can come in handy.

git diff

git diff commit-id-1 commit-id-2
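The review commands make more sense against a repo with a bit of history, so here is a small sketch with two commits and one uncommitted edit. The file orders.csv and its rows are invented. I capture the output in variables so it is easy to print or inspect.

```shell
repo=$(mktemp -d)
cd "$repo"
git init --quiet
git config user.name "Example Name"
git config user.email "example@email.com"

echo "id,amount" > orders.csv
git add orders.csv
git commit --quiet -m "Add orders header"
first_commit=$(git rev-parse HEAD)

echo "1,9.99" >> orders.csv
git commit --quiet -am "Add first order row"
second_commit=$(git rev-parse HEAD)

# An edit that is neither staged nor committed.
echo "2,4.50" >> orders.csv

# --short is the compact form of git status; " M" means modified but not staged.
status_short=$(git status --short)

# One line per commit, newest first.
history=$(git log --oneline)

# Plain git diff: working directory versus the staging area.
unstaged_diff=$(git diff)

# Two commit IDs: what changed between those commits (file names only here).
changed_files=$(git diff --name-only "$first_commit" "$second_commit")

echo "$status_short"
echo "$history"
echo "$changed_files"
```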

Synchronising

Use the git pull command to fetch the most recent version of the remote repository and update your local version with the changes. If there are no conflicts, the changes are merged automatically into your local version of the repository.

git pull
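You can try a pull without a real remote by faking one locally. In this sketch a bare repo plays the remote and two clones play two people; the folder names (remote.git, laptop, teammate) and the file schedule.txt are all made up for the demo.

```shell
workspace=$(mktemp -d)
git init --quiet --bare "$workspace/remote.git"

# First working copy: make a commit and publish it.
git clone --quiet "$workspace/remote.git" "$workspace/laptop"
cd "$workspace/laptop"
git config user.name "Example Name"
git config user.email "example@email.com"
echo "daily" > schedule.txt
git add schedule.txt
git commit --quiet -m "Add schedule"
git push --quiet origin HEAD

# Second working copy, cloned before the next change is published.
git clone --quiet "$workspace/remote.git" "$workspace/teammate"

# The first copy publishes another change.
echo "hourly" > schedule.txt
git commit --quiet -am "Run hourly"
git push --quiet origin HEAD

# The second copy is now behind; git pull fetches and merges the new commit.
cd "$workspace/teammate"
before_pull=$(cat schedule.txt)
git pull --quiet
after_pull=$(cat schedule.txt)
echo "$before_pull -> $after_pull"
```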

So there you have it: some of the most useful Git commands I use daily. I’m sure there are a ton more that would be useful to me, but these are the few that have served me well through the years.

Tim