Version control with Git and GitHub

quick start on Git and GitHub

This post is a summary note based on what I learned on DATA 550: Data Science Toolkit. I want to recap what I’ve learned but also refine and expand on it, turning it into a clear actionable workflow that I can rely on for future projects. It’s about creating a practical guide that I—and hopefully others—can follow with ease.

Step 1. Setting Up Git

Install Git

Git is a version control system that helps track changes in your code and collaborate with others. To start, you need to install it on your system.

Checking If Git is Installed

To confirm Git is installed, run:

git --version

Configure Git

After installing Git, you need to set up your user identity so that Git can properly attribute changes to you

git config --global user.name "Your Name"
git config --global user.email "your-email@example.com"

Step 2. Initializing a Git Repository

A Git repository (repo) is where all version history for a project is stored.

Create a New Repository

Navigate to the project folder (on your local computer) and initialize a Git repository:

git init

This creates a hidden .git folder that Git uses to track changes.

Track and Save Changes

To start tracking files, add them to Git and make an initial commit:

git add .
git commit -m "Initial commit"

Step 3. Tracking Changes and Committing

Add and Commit Files

The add command tells git which file you want to track on. git add . will track all the files in the current directory. The commit command then takes a snapshot on the added files, to track the current changes.

git add filename.txt  # Add specific file
git add .             # Add all changes
git commit -m "Commit message"

Check File Status

Check which files have been modified:

git status

Viewing Commit History

To see previous commits:

git log --oneline --graph --decorate --all

Step 4. Moving this workflow online with GitHub

GitHub is a platform that lets you store Git repositories online and collaborate with others.

Generate SSH Key (for authentication)

We will need SSH keys to build secure connections between our local computer and GitHub. The following command generates the key pair.

ssh-keygen -o -t rsa

After generating an SSH key, you need to add it to GitHub to connect your local machine with GitHub.

Here’s how:

  1. Copy Your SSH Key Run the following command to copy your public key:

    cat ~/.ssh/id_rsa.pub
    

    This will display the key. Copy the entire output.

  2. Add the Key to GitHub

    • Go to GitHub → Settings → SSH and GPG Keys
    • Click New SSH Key
    • Paste your copied key into the provided box
    • Click Add SSH Key
  3. Verify the Connection Run the following command to check if GitHub recognizes your key:

    ssh -T git@github.com
    

    If successful, you should see a message like:

    Hi your-username! You've successfully authenticated, but GitHub does not provide shell access.
    

You may see messages like “The authenticity of host ‘github.com’ can’t be established” while trying to connect to GitHub via SSH, it means that your SSH client (such as OpenSSH) doesn’t recognize the host key for GitHub’s server.

This message typically appears the first time you attempt to connect to a remote server using SSH. To proceed, you typically have the option to accept the authenticity of the host by typing “yes” when prompted.

Connect Local Repository to GitHub

  1. Create a repository on GitHub. It’s like your project’s home base. This is where all your Git history, changes, and progress will live, making it easy to keep track of everything as you work. Think of it as your project’s storybook, documenting every step along the way.

  2. Connect your local repository to the one you just created on GitHub:

git remote add origin https://github.com/your-username/repo-name.git

This ties your local project to the remote repository, so you can easily push your changes and keep everything in sync. Think of it as building a bridge between your computer and GitHub.

  1. Push code to GitHub:
git branch main  # Rename to main if needed
git push origin main

Step 5. Pulling and Pushing Changes

Fetch Latest Updates from GitHub

If you are working with a team, you need to pull the latest changes before making new edits:

git pull origin main

Push New Changes to GitHub

After making changes locally, push them to the remote repository:

git push origin main

Step 6. Branching and Merging

Branches allow you to work on new features without affecting the main codebase.

Create and Switch Branches

git branch new-feature  # Create a new branch
git checkout new-feature  # Switch to it

Merge Branches

Once a feature is complete, merge it back into the main branch:

git checkout main
git merge new-feature

Delete a Branch

After merging, you can delete the feature branch:

git branch -d new-feature

Step 7. Undoing Changes

Undo Last Commit (Keep Changes)

Revert the last commit but keep the changes staged:

git reset --soft HEAD~1

Undo Last Commit (Discard Changes)

Revert the last commit and remove all changes:

git reset --hard HEAD~1

Discard Changes in a File

Revert a specific file to the last committed version:

git checkout -- filename.txt

Force Remove File from Tracking

To stop tracking a file that was previously added to Git, without deleting it from your local disk.

git rm --cached filename.txt

Step 8. Working with Remote Repositories

Cloning a Repository

Copy an existing repository from GitHub to your local machine:

git clone https://github.com/username/repo-name.git

Forking a Repository

Forking allows you to make changes to someone else’s repository without affecting the original.

  1. Click Fork on the repository page.

  2. Clone your fork:

    git clone https://github.com/your-username/forked-repo.git
    

Submitting a Pull Request

A pull request (PR) lets others review and merge your changes.

  1. Push changes:

    git push origin branch-name
    
  2. On GitHub, go to Pull Requests and click New Pull Request.

Merging a Pull Request

Once approved, you can merge the PR into the main branch.

Syncing with Upstream Repository

Keep your fork updated with changes from the original repository:

git remote add upstream https://github.com/original-repo.git
git fetch upstream
git merge upstream/main

Step 9. Managing Large Files and Ignoring Files

Use .gitignore to Ignore Files

Specify files and folders that should not be tracked by Git. Create a .gitignore file and add:

output/*.rds
output/*.png

Track Large Files with Git LFS

Git LFS (Large File Storage) helps manage large files efficiently:

git lfs track "*.dta"