By Erik Gfesser — Nov 11, 2013

No Fluff Just Stuff 2013 Day 3: Gittin' Git

Gittin' Git (Raju Gandhi)

November 10, 2013: 2:15 PM – 3:45 PM

From the conference materials:

Git, at its core, leverages a relatively simple data structure to maintain history. In this session we will take a look at this data-structure, which in turn will give us a better view of how Git manages history, and how better to work with it.

Git has fast emerged as one of the leaders in DVCS. Git may seem arcane, but under the covers it leverages a very simple data-structure to store your version history. As developers, it has always served us well to know how things fundamentally work, and Git is no different. In this talk we will explore this data-structure, and how the various commands you invoke against Git mutate it.

Prerequisite: You must have used Git in the past or are actively using it on a project.

From my personal notes:

this is not an intro seminar on GIT – it is a seminar about how it works
someone asked why we should care how it works, and he asked them if they ever used Hibernate
bad example, they never used Hibernate, so he asked about SQL, and well, that was not a good pick either
there is a lot of confusion on the web as to what commands should be used when, and this seminar revolves around the whys
we are only going to focus on a couple things today – HEAD, "objects" (Object Database), and "refs" (References)
when GIT starts to work, it internally uses only 4 types of objects to store history – Blob, Tree, Commit, Tag
these 4 object types are stored as compressed files (zlib, not actual ZIP archives) that are essentially unreadable until decompressed
in this session, we are not looking at Tags, which are "glorified PostIt Notes"
a Blob is the most primitive of all the objects
GIT only manages Blobs
the role of a Blob is to store the content of your files – the content of your files, not the files themselves, because it does not work at the file level
in contrast to CVS and Subversion, which work at the file level
a SHA is essentially a digest (hash) algorithm – GIT generates a unique 40-character SHA for a string – e.g. echo 'Hello, NFJS!' | git hash-obj
.git/objects
GIT takes the first 2 characters of the SHA to create a directory name, and the rest of the SHA to create the file name within it to represent the string object
the fact that GIT puts this file into the objects directory has nothing to do with the file name
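The hashing and directory layout described above can be sketched in Python. This is a minimal sketch, assuming the usual loose-object convention in which the SHA-1 is taken over a header of the form "blob", a space, the content length, and a NUL byte, followed by the content itself. The function names blob_sha and object_path are made up for illustration and are not part of GIT.

```python
import hashlib


def blob_sha(content: bytes) -> str:
    """Return the 40-character hex SHA-1 for this content stored as a Blob."""
    # Assumed header format: b"blob <length>\0", then the raw content.
    header = b"blob " + str(len(content)).encode() + b"\0"
    return hashlib.sha1(header + content).hexdigest()


def object_path(sha: str) -> str:
    """The first 2 characters name the directory, the remaining 38 name the file."""
    return f".git/objects/{sha[:2]}/{sha[2:]}"


# echo appends a trailing newline, so it is included here as well.
sha = blob_sha(b"Hello, NFJS!\n")
print(sha)
print(object_path(sha))
```

Note that the file name plays no part in the calculation, only the content does.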
the first thing to remember is that if two pieces of information look exactly the same, GIT only stores it once in the repository
the second thing to remember is that these files are compressed, and need to be decompressed in order to read them once GIT has compressed them
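Both points can be illustrated with a toy content-addressed store. This is only a sketch: the dictionary named store is a hypothetical in-memory stand-in for the objects directory, write_blob and read_blob are invented helper names, and zlib is assumed as the compression scheme.

```python
import hashlib
import zlib

store = {}  # hypothetical stand-in for .git/objects: SHA -> compressed bytes


def write_blob(content: bytes) -> str:
    """Compress and store the content under its SHA, returning the SHA."""
    raw = b"blob " + str(len(content)).encode() + b"\0" + content
    sha = hashlib.sha1(raw).hexdigest()
    # Identical content yields the identical SHA, so it is stored only once.
    store.setdefault(sha, zlib.compress(raw))
    return sha


def read_blob(sha: str) -> bytes:
    """Decompress a stored object and strip the header to recover the content."""
    raw = zlib.decompress(store[sha])
    _header, _, content = raw.partition(b"\0")
    return content


first = write_blob(b"same text\n")
second = write_blob(b"same text\n")
print(first == second, len(store))  # True 1
print(read_blob(first))             # b'same text\n'
```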
how does GIT know the file name? the Tree stores the structure and the file names
"git write-tree" asks GIT to write the structure of the Tree – this is not a command you would use in real life
if you ask GIT to show you what a Tree looks like with "git cat-file -p nameoffile", it lists, for each file, the name of the file and the SHA of its content, along with the mode and object type
in the next example, the speaker added the file and a directory, not just a file, so a Tree was then created as well
the SHAs of the individual files together determine the SHA of the Tree
a Commit references a Tree
GIT first calculates leaf-level SHAs, and then works up the Tree to calculate the SHA for the Commit
since the Tree, the Author, and the Committer will be different for each commit, the SHA is guaranteed to be different
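The bottom-up calculation described above can be sketched as follows. This is a simplified sketch under stated assumptions: the Tree body is assumed to be a sequence of mode, name, NUL byte, and raw 20-byte SHA per entry, the Commit body is assumed to be plain text lines, and only a flat directory of files is handled. The helper names, the file name notes.txt, and the identity string are all made up.

```python
import hashlib


def git_sha(kind: bytes, body: bytes) -> str:
    """Hash an object body with the assumed '<kind> <length>\\0' header."""
    header = kind + b" " + str(len(body)).encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()


def tree_sha(entries):
    """entries: (mode, name, hex_sha) tuples for the files in one directory."""
    body = b""
    # Simplification: entries are sorted by name; subdirectories are not handled.
    for mode, name, sha in sorted(entries, key=lambda e: e[1]):
        body += f"{mode} {name}".encode() + b"\0" + bytes.fromhex(sha)
    return git_sha(b"tree", body)


def commit_sha(tree, author, committer, message, parent=None):
    """A Commit references one Tree and, optionally, a parent Commit."""
    lines = [f"tree {tree}"]
    if parent:
        lines.append(f"parent {parent}")
    lines += [f"author {author}", f"committer {committer}", "", message]
    return git_sha(b"commit", ("\n".join(lines) + "\n").encode())


# Leaf level first: Blob SHAs, then the Tree, then the Commit.
blob_v1 = git_sha(b"blob", b"version 1\n")
blob_v2 = git_sha(b"blob", b"version 2\n")
tree_v1 = tree_sha([("100644", "notes.txt", blob_v1)])
tree_v2 = tree_sha([("100644", "notes.txt", blob_v2)])
who = "A. Developer <dev@example.com> 1384121700 -0600"  # made-up identity
c1 = commit_sha(tree_v1, who, who, "first")
c2 = commit_sha(tree_v2, who, who, "second", parent=c1)
print(tree_v1 != tree_v2, c1 != c2)  # True True
```

Changing a single file changes its Blob SHA, which changes the Tree SHA, which in turn changes the Commit SHA.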
the parent of a Commit is the Commit that came immediately before it
every time you Commit, GIT stores the entire working directory, every single time – everything gets snapshotted
GIT is extremely efficient – Blobs are stored as compressed files
GIT reuses pointers to files when they are not changed as part of a Commit
in response to an attendee comment that GIT is just doing a delta, the speaker reiterated that this is not correct, GIT stores the full content, but it is reused
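The difference between a delta and a reused full snapshot can be sketched with a toy model. Everything here is hypothetical: objects is an in-memory stand-in for the object store, snapshot is an invented helper, and the "Tree" is flattened to a simple name-to-SHA listing.

```python
import hashlib

objects = {}  # hypothetical object store: SHA -> full content


def snapshot(files: dict) -> dict:
    """Record every file in full and return a name -> SHA listing."""
    listing = {}
    for name, content in files.items():
        sha = hashlib.sha1(b"blob %d\0" % len(content) + content).hexdigest()
        objects.setdefault(sha, content)  # already-known content is not stored again
        listing[name] = sha
    return listing


first = snapshot({"README": b"hello\n", "main.py": b"print(1)\n"})
second = snapshot({"README": b"hello\n", "main.py": b"print(2)\n"})

print(first["README"] == second["README"])  # True: unchanged file, same pointer
print(len(objects))                          # 3 distinct contents, not 4
```

Each snapshot lists every file, yet the unchanged file costs nothing extra because both listings point at the same stored content.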
GIT uses persistent immutable data structures just like Clojure
in response to an attendee, the speaker explained that an amended Commit is really just another Commit with the first targeted for garbage collection
how many people are going to remember 40-character strings? – GIT provides an easy solution – the Branch
a Branch is really just a humane way to reference a Commit
a look under the hood in .git/refs
creating a new branch is just GIT taking the SHA of a Commit and storing it under the name that you choose
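The idea of a Branch as a named pointer can be sketched with ordinary files. This is a simulation in a throwaway directory, not a real repository: it assumes one small file per Branch under refs/heads, and create_branch, list_branches, and the SHA value are all made up for illustration.

```python
import tempfile
from pathlib import Path

# Throwaway directory standing in for a repository.
repo = Path(tempfile.mkdtemp())
heads = repo / ".git" / "refs" / "heads"
heads.mkdir(parents=True)


def create_branch(name: str, commit_sha: str) -> None:
    """A Branch is a file whose name you choose and whose content is a SHA."""
    (heads / name).write_text(commit_sha + "\n")


def list_branches():
    """Listing the directory is enough to see the Branch names."""
    return sorted(p.name for p in heads.iterdir())


sha = "1234567890abcdef1234567890abcdef12345678"  # placeholder Commit SHA
create_branch("master", sha)
create_branch("feature", sha)
print(list_branches())                          # ['feature', 'master']
print((heads / "feature").read_text().strip())
```

Two Branches pointing at the same Commit cost only two tiny files, which is why creating one is so cheap.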
a Commit does not create a backup – a Push creates the backup
pushing to GitHub just copies the objects to GitHub
"git cat-file -t nameofdirectory" indicates the type of object – Commit etc
from a practical standpoint, Blobs and Trees do not matter, just the Commits
"git branch" just lists the Branches – the same thing as "ls .git/refs/heads/" – it's a glorified "ls"
how does GIT know what branch you are on? this is where HEAD comes in
the HEAD points to the currently checked out Commit – the HEAD is essentially a PostIt note that points to another PostIt note
HEAD points to Master, and Master points to a Commit – it is a symbolic reference
"git checkout" recreates the working directory so that it looks like whatever Tree you want
in GIT, the working directory is just a scratch pad – a fundamental shift from Subversion etc
someone argued that GIT can be slow, and the example they gave was managing images and PDFs with GIT – the speaker indicated that GIT really isn't made for this per se, it's used for version control
after some more arguments, the speaker said we could return to discussing what should be stored in GIT after we first understand GIT
a Detached HEAD is when you are just pointing to a Commit and not a Branch
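The two states of HEAD can be sketched with a small resolver. This is a sketch under assumptions: the repository is modeled as a dictionary of file path to file content, a symbolic HEAD is assumed to hold a line beginning with "ref: ", and resolve_head and the SHA values are invented for illustration.

```python
def resolve_head(files: dict):
    """Return (branch_ref_or_None, commit_sha) for a dict of path -> content."""
    head = files["HEAD"].strip()
    if head.startswith("ref: "):
        # Symbolic reference: one note pointing to another note.
        ref = head[len("ref: "):]
        return ref, files[ref].strip()
    # Detached: HEAD holds a Commit SHA directly, with no Branch in between.
    return None, head


attached = {
    "HEAD": "ref: refs/heads/master\n",
    "refs/heads/master": "1111111111111111111111111111111111111111\n",
}
detached = {"HEAD": "2222222222222222222222222222222222222222\n"}

print(resolve_head(attached))
print(resolve_head(detached))
```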
Reset moves the HEAD and Branch – "git reset --soft someSHA"
when you do a Reset, you are moving a pointer – you can end up with a dangling reference
only the graph that is visible to GIT can get pushed
garbage collection occurs every 30 days according to the speaker, but you can always use "git gc"
as long as someone is pointing to a Commit, it cannot get garbage collected
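The link between moving a pointer and garbage collection can be sketched as a graph walk. This is a toy model only: the commit names c1 to c3 are placeholders, reachable is an invented helper, and anything else that might keep a Commit alive in a real repository is ignored.

```python
def reachable(refs: dict, parents: dict) -> set:
    """Collect every Commit that can be reached by following parents from the refs."""
    seen = set()
    stack = list(refs.values())
    while stack:
        commit = stack.pop()
        if commit not in seen:
            seen.add(commit)
            stack.extend(parents.get(commit, []))
    return seen


# History c1 <- c2 <- c3, with master pointing at the newest Commit.
parents = {"c1": [], "c2": ["c1"], "c3": ["c2"]}
refs = {"master": "c3"}

# A Reset only moves the pointer; no Commit is deleted at this moment.
refs["master"] = "c1"

dangling = set(parents) - reachable(refs, parents)
print(sorted(dangling))  # ['c2', 'c3']
```

The dangling Commits are exactly the ones nobody points to any more, which makes them candidates for collection and invisible to a push.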
at this point in the presentation, the speaker emphasized that all of what we are discussing is local, and the next time we read an article about GIT, everything will make much more sense
GIT 2 will only push the current Branch – the speaker thinks the current design of pushing everything is a fundamental flaw of current GIT
Merge joins 2 or more Commits and almost always creates a child Commit
Rebase relocates a Branch to a new parent – Rebase is like a Merge, but is not – essentially, it is not a marriage, but a relocation
with Rebase, the Commit gets rewritten
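Why relocation forces a rewrite can be sketched with the hashing idea used for Commits. This is a simplified sketch: it assumes the parent SHA is part of the text that gets hashed, it omits author, committer, and timestamps, and it keeps the Tree fixed even though a real Rebase would usually change it. The function name and the parent SHAs are placeholders.

```python
import hashlib


def commit_sha(tree: str, parent: str, message: str) -> str:
    """Hash a stripped-down Commit body that records its Tree and its parent."""
    body = f"tree {tree}\nparent {parent}\n\n{message}\n".encode()
    return hashlib.sha1(b"commit %d\0" % len(body) + body).hexdigest()


tree = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"  # SHA of an empty Tree
old_parent = "1" * 40  # placeholder for the original parent Commit
new_parent = "2" * 40  # placeholder for the Commit being relocated onto

before = commit_sha(tree, old_parent, "add feature")
after = commit_sha(tree, new_parent, "add feature")
print(before != after)  # True: same change, different parent, different Commit
```

Because the old SHA no longer exists in the relocated history, anyone who already had the old Commit now disagrees with you, which is the reason for the warning about public Rebase.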
when merging public Commits, nothing gets destroyed – but as with public Reset, never, ever use public Rebase
the question is not whether you should use Merge or Rebase, it is whether you are working publicly or privately
