No Fluff Just Stuff 2013 Day 3: Gittin' Git

Gittin' Git (Raju Gandhi)

November 10, 2013:  2:15 PM – 3:45 PM


From the conference materials:

Git, at its core, leverages a relatively simple data structure to maintain history. In this session we will take a look at this data-structure, which in turn will give us a better view of how Git manages history, and how better to work with it.

Git has fast emerged as one of the leaders in DVCS. Git may seem arcane, but under the covers, leverages a very simple data-structure to store your version history. As developers, it has always served us well to know how things fundamentally work, and Git is no different. In this talk we will explore this data-structure, and how the various commands you invoke against Git mutate it.

Prerequisite: You must have used Git in the past or are actively using it on a project.


From my personal notes:

  • this is not an intro seminar on GIT – it is a seminar about how it works
  • someone asked why we should care how it works, and he asked them if they ever used Hibernate
  • bad example, they never used Hibernate, so he asked about SQL, and well, that was not a good pick either
  • there is a lot of confusion on the web as to which commands should be used when, and this seminar revolves around the whys
  • we are only going to focus on a couple things today – HEAD, "objects" (Object Database), and "refs" (References)
  • when GIT starts to work, it internally uses only 4 types of objects to store history – Blob, Tree, Commit, Tag
  • these 4 object types are stored as compressed files, so they are essentially unreadable as-is
  • in this session, we are not looking at Tags, which are "glorified PostIt Notes"
  • a Blob is the most primitive of all the objects
  • GIT only manages Blobs
  • the role of a Blob is to store the content of your files – the content
    of your files, not the files themselves, because it does not work at the
    file level
  • in contrast to CVS and Subversion, which work at the file level
  • SHA is essentially a digest (hash) algorithm – GIT generates a
    40-character SHA for a string that is unique for all practical purposes – e.g. echo 'Hello, NFJS!' | git hash-obj
  • .git/objects
  • GIT takes the first 2 characters of the SHA to create a directory name,
    and the rest of the SHA to create the file name within it to represent
    the string object
  • where GIT puts this object in the objects directory has nothing to do with the original file name – it depends only on the content
  • the first thing to remember is that if two pieces of information look
    exactly the same, GIT only stores it once in the repository
  • the second thing to remember is that these files are compressed, and
    need to be decompressed in order to read them once GIT zips them
  • how does GIT know the file name? the Tree stores the structure and the file names
  • "git write-tree" asks GIT to write the structure of the Tree – this is not a command you would use in real life
  • if you ask GIT to show you a Tree with "git cat-file -p" followed by
    the Tree's SHA, it lists the mode, the object type, the SHA, and the
    name of each entry
  • in the next example, the speaker added the file and a directory, not just a file, so a Tree was then created as well
  • the SHA of the Tree is calculated from the names and SHAs of the entries it contains
  • a Commit references a Tree
  • GIT first calculates leaf-level SHAs, and then works up the Tree to calculate the SHA for the Commit
  • since the Tree, the Author, and the Committer will be different for each commit, the SHA is guaranteed to be different
  • the parent of a Commit is the Commit that came immediately before it
  • every time you Commit, GIT stores the entire working directory, every single time – everything gets snapshotted
  • GIT is extremely efficient – Blobs are stored compressed (with zlib)
  • GIT reuses pointers to files when they are not changed as part of a Commit
  • in response to an attendee comment that GIT is just doing a delta, the
    speaker reiterated that this is not correct, GIT stores the full
    content, but it is reused
  • GIT uses persistent immutable data structures just like Clojure
  • in response to an attendee, the speaker explained that an amended
    Commit is really just another Commit with the first targeted for garbage
    collection
  • how many people are going to remember 40-character strings? – GIT provides an easy solution – the Branch
  • a Branch is really just a humane way to reference a Commit
  • a look under the hood in .git/refs
  • creating a new branch is just GIT taking the SHA of a Commit and storing it under the name that you choose
  • a Commit does not create a backup – a Push creates the backup
  • pushing to GitHub just copies the objects to GitHub
  • "git cat-file -t nameofdirectory" indicates the type of object – Commit etc
  • from a practical standpoint, Blobs and Trees do not matter, just the Commits
  • "git branch" just lists the Branches – the same thing as "ls .git/refs/heads/" – it's a glorified "ls"
  • how does GIT know what branch you are on? this is where HEAD comes in
  • the HEAD points to the currently checked out Commit – the HEAD is essentially a PostIt note that points to another PostIt note
  • HEAD points to Master, and Master points to a Commit – it is a symbolic reference
  • "git checkout" recreates the working directory to match whatever Tree you ask for
  • in GIT, the working directory is just a scratch pad – a fundamental shift from Subversion etc
  • someone argued that GIT can be slow, and the example they gave was
    managing images and PDFs with GIT – the speaker indicated that GIT
    really isn't made for this per se, it's used for version control
  • after some more arguments, the speaker said we could return to
    discussing what should be stored in GIT after we first understand GIT
  • a Detached HEAD is when you are just pointing to a Commit and not a Branch
  • Reset moves the HEAD and Branch – "git reset --soft someSHA"
  • when you do a Reset, you are moving a pointer – you can end up with a dangling reference
  • only the graph that is visible to GIT can get pushed
  • garbage collection occurs every 30 days according to the speaker, but you can always use "git gc"
  • as long as someone is pointing to a Commit, it cannot get garbage collected
  • at this point in the presentation, the speaker emphasized that all of
    what we are discussing is local, and the next time we read an article
    about GIT, everything will make much more sense
  • GIT 2 will only push the current Branch – the speaker thinks the
    current design of pushing everything is a fundamental flaw of current GIT
  • Merge joins 2 or more Commits and almost always creates a child Commit
  • Rebase relocates a Branch to a new parent – Rebase is like a Merge, but
    is not – essentially, it is not a marriage, but a relocation
  • with Rebase, the Commit gets rewritten
  • when merging public Commits, nothing gets destroyed – but as with public Reset, never, ever use public Rebase
  • the question is not whether you should use Merge or Rebase, it is whether you are working publicly or privately
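
To make the Blob notes above concrete (the 40-character SHA, the 2-character directory under .git/objects, identical content being stored once, and compression), here is a minimal Python sketch. It was not part of the talk. It assumes the SHA-1 object format Git used at the time, and the helper names (git_object_bytes, git_object_sha, loose_object_path) and the two file names are made up for illustration.

```python
import hashlib
import zlib


def git_object_bytes(obj_type, body):
    """Return the bytes Git hashes: "<type> <size>", a NUL byte, then the body."""
    header = f"{obj_type} {len(body)}".encode("ascii") + b"\x00"
    return header + body


def git_object_sha(obj_type, body):
    """40-character SHA-1 hex digest of the header plus the body."""
    return hashlib.sha1(git_object_bytes(obj_type, body)).hexdigest()


def loose_object_path(sha):
    """First 2 characters name the directory, the remaining 38 name the file."""
    return f".git/objects/{sha[:2]}/{sha[2:]}"


# echo appends a newline, so the hashed content includes it.
content = b"Hello, NFJS!\n"
sha = git_object_sha("blob", content)
print(sha)
print(loose_object_path(sha))

# Two hypothetical files with the same content map to one and the same object.
files = {"a.txt": content, "copy-of-a.txt": content}
distinct_objects = {git_object_sha("blob", data) for data in files.values()}
print(len(distinct_objects))  # 1

# Objects are zlib-compressed on disk and must be decompressed to be read.
stored = zlib.compress(git_object_bytes("blob", content))
print(zlib.decompress(stored))
```

Note that the file name never enters the hash, which is why the Tree has to carry it.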
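
The notes on Trees and Commits above (leaf SHAs feed the Tree SHA, the Tree SHA feeds the Commit SHA, and a rebased Commit gets rewritten) can be sketched the same way. This is a simplified illustration, not the speaker's code: the identity string, timestamp, and helper names are hypothetical, entries are sorted by plain name (real Git treats directory names slightly differently when sorting), and only file entries are shown.

```python
import hashlib


def git_sha(obj_type, body):
    """SHA-1 over "<type> <size>", a NUL byte, and the body."""
    header = f"{obj_type} {len(body)}".encode("ascii") + b"\x00"
    return hashlib.sha1(header + body).hexdigest()


def tree_body(entries):
    """entries: (mode, name, sha_hex) tuples, stored as "<mode> <name>", NUL, 20 raw SHA bytes."""
    out = b""
    for mode, name, sha_hex in sorted(entries, key=lambda e: e[1]):
        out += f"{mode} {name}".encode("utf-8") + b"\x00" + bytes.fromhex(sha_hex)
    return out


def commit_body(tree_sha, parent_sha, who, message):
    """A Commit names its Tree, its parent (if any), the author, and the committer."""
    lines = [f"tree {tree_sha}"]
    if parent_sha is not None:
        lines.append(f"parent {parent_sha}")
    lines.append(f"author {who}")
    lines.append(f"committer {who}")
    return ("\n".join(lines) + "\n\n" + message + "\n").encode("utf-8")


WHO = "A U Thor <author@example.com> 1384114500 -0500"  # hypothetical identity and time

# Leaf level first: the Blob SHAs.
blob_v1 = git_sha("blob", b"Hello, NFJS!\n")
blob_v2 = git_sha("blob", b"Hello again, NFJS!\n")

# Then the Trees, whose SHAs depend on the entry names and Blob SHAs.
tree_v1 = git_sha("tree", tree_body([("100644", "hello.txt", blob_v1)]))
tree_v2 = git_sha("tree", tree_body([("100644", "hello.txt", blob_v2)]))

# Then the Commits, each pointing at a Tree and at its parent.
first = git_sha("commit", commit_body(tree_v1, None, WHO, "first"))
second = git_sha("commit", commit_body(tree_v2, first, WHO, "second"))

# Same Tree, author, and message, but a different parent (as after a rebase):
# the Commit SHA changes, so the Commit is effectively rewritten.
other_root = git_sha("commit", commit_body(tree_v1, None, WHO, "other root"))
replayed = git_sha("commit", commit_body(tree_v2, other_root, WHO, "second"))

print(tree_v1 != tree_v2)   # True: changed content changes the Tree SHA
print(second != replayed)   # True: changed parent changes the Commit SHA
```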
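
Finally, the notes on Branches, HEAD, Detached HEAD, Reset, and garbage collection above can be modeled with a couple of dictionaries. This is a toy in-memory model, not how Git is implemented: the SHAs are made up, the function names are hypothetical, and it ignores the reflog, which in real Git also keeps Commits alive for a while.

```python
def resolve(refs, name="HEAD"):
    """Follow symbolic references ("ref: ...") until a Commit SHA is reached."""
    value = refs[name]
    while value.startswith("ref: "):
        value = refs[value[len("ref: "):]]
    return value


def current_branch(refs):
    """Return the Branch that HEAD points to, or None for a Detached HEAD."""
    head = refs["HEAD"]
    return head[len("ref: "):] if head.startswith("ref: ") else None


def reset(refs, sha):
    """Move the current Branch (or a Detached HEAD) to another Commit."""
    branch = current_branch(refs)
    if branch is not None:
        refs[branch] = sha
    else:
        refs["HEAD"] = sha


def reachable(refs, parents):
    """Every Commit that some reference can still reach by walking parents."""
    seen = set()
    stack = [resolve(refs, name) for name in refs]
    while stack:
        sha = stack.pop()
        if sha not in seen:
            seen.add(sha)
            stack.extend(parents.get(sha, []))
    return seen


# Three made-up Commits: A is the root, B's parent is A, C's parent is B.
A, B, C = "a" * 40, "b" * 40, "c" * 40
parents = {A: [], B: [A], C: [B]}

# HEAD is a PostIt note pointing at another PostIt note (the Branch).
refs = {"HEAD": "ref: refs/heads/master", "refs/heads/master": C}
print(current_branch(refs), resolve(refs) == C)

# Reset moves the Branch pointer; C is now dangling and eligible for collection.
reset(refs, B)
dangling = set(parents) - reachable(refs, parents)
print(dangling == {C})

# Checking out a Commit directly detaches HEAD from any Branch.
refs["HEAD"] = A
print(current_branch(refs))
```

Nothing in the model copies or deletes a Commit when resetting; only a pointer moves, which matches the speaker's point that Reset is cheap and entirely local.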
