For version control I used to rely mostly on SVN, and only on its most common operations. Recently many projects at my company have started moving to Git. Taking this opportunity, I plan to study how Git works and the principles behind it, as well as the design ideas it embodies. A colleague recommended "Pro Git"; I read it and liked it, so here are my reading notes. The book can be read online at: http://iissnan.com/progit/
Chapter 1: Getting Started
This chapter covers the history and basic features of Git, as well as how to install and configure it. The features mentioned include "snapshots, not differences", "nearly every operation is local", "data integrity at all times", "most operations only add data", and "the three states of a file". The last point is covered in the next chapter; here are some thoughts on the others.
Recording snapshots directly, not differences
The main difference between Git and other version control systems is that Git only cares whether the file data as a whole has changed, while most other systems care only about the specific differences in file content.
This approach requires Git to record the complete file for every version. To compare two consecutive versions of the same file, Git diffs the two files directly, whereas a difference-based system can simply read out the difference it has stored. When comparing versions that are further apart, however, the difference-based system has to combine all the intermediate differences before it can display the result. The further apart the versions are, the more computation a difference-based system needs, while Git is not affected at all.
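The cost difference can be illustrated with a toy model. This is only a sketch under simplifying assumptions, not how Git or any real system stores data: `SnapshotStore` and `DeltaStore` are hypothetical names, and a "difference" is reduced to a dictionary of changed lines.

```python
class SnapshotStore:
    """Keeps a full copy of the file for every version."""

    def __init__(self):
        self.versions = []

    def commit(self, lines):
        self.versions.append(list(lines))

    def get(self, n):
        # One lookup, however far apart the versions are.
        return self.versions[n], 1


class DeltaStore:
    """Keeps version 0 plus one delta (line index -> new text) per later version."""

    def __init__(self, base):
        self.base = list(base)
        self.deltas = []

    def commit(self, changes):
        self.deltas.append(dict(changes))

    def get(self, n):
        # Rebuilding version n means applying the first n deltas in order.
        lines = list(self.base)
        for delta in self.deltas[:n]:
            for index, text in delta.items():
                lines[index] = text
        return lines, n


snapshots = SnapshotStore()
snapshots.commit(["a", "b", "c"])
deltas = DeltaStore(["a", "b", "c"])
current = ["a", "b", "c"]
for i in range(1, 6):
    current = list(current)
    current[i % 3] = f"edit {i}"
    snapshots.commit(current)
    deltas.commit({i % 3: f"edit {i}"})

snap_lines, snap_steps = snapshots.get(5)
delta_lines, delta_steps = deltas.get(5)
print(snap_lines == delta_lines, snap_steps, delta_steps)
```

Both stores return the same content for version 5, but the delta store has to walk through every intermediate change to get there.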
This strategy can be seen as trading space for time. The cost per unit of storage keeps falling: terabyte hard drives are now dirt cheap, and it is commonplace for developers to use SSDs of several hundred gigabytes, so the extra space consumed hardly needs to be considered.
Maintaining data integrity at all times
Before anything is saved in Git, a checksum of its content is computed, and the result serves as the unique identifier and index of that data. In other words, it is impossible to modify a file or directory without Git knowing about it. This feature is part of Git's design philosophy and is built into the lowest level of its architecture. So if a file becomes incomplete in transit, or a damaged disk causes data loss, Git notices immediately.
Git computes the checksum with the SHA-1 algorithm: from the content of a file or the structure of a directory it calculates a SHA-1 hash that serves as a fingerprint string. The string consists of 40 hexadecimal characters (0-9 and a-f) and looks like this:
24b9da6552252987aa493b52f8696cd6d3b00373
Git's work relies entirely on this kind of fingerprint string, so you will see such hash values often. In fact, everything stored in the Git database is indexed by this hash, not by file name.
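As a minimal sketch, the fingerprint of a file's content can be reproduced with Python's hashlib. It assumes the object format described in Chapter 9 of the book (a small header with the type and length, followed by the content); `git_blob_id` is a name chosen here for illustration. Note that the file name is not an input at all.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the 40-character ID that identifies a blob with this content."""
    # Header: object type, a space, the content length in bytes, then a NUL byte.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


# Two files with different names but the same content share one ID.
readme_id = git_blob_id(b"test content\n")
copy_id = git_blob_id(b"test content\n")
other_id = git_blob_id(b"different content\n")
print(readme_id)
print(readme_id == copy_id, readme_id == other_id)
```

The first value should match what `git hash-object` reports for a file with the same content.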
The advantage of using a SHA-1 hash instead of the file name is that the hash has a fixed length and is well distributed, which is exactly what a hash key needs. SHA-1 is a widely used hash function, so its applications are not repeated here. A while ago Google announced that it would "gradually downgrade the security indicators for SHA-1 certificates in Chrome". That was done for security reasons and does not mean that SHA-1 is an inappropriate hash function for Git. Interested readers can look at the relevant analyses, for example "Depth: Why Google is eager to kill the encryption algorithm SHA-1".
What is the disadvantage of indexing by file name? The variable length is not the main problem. Take code managed with Maven: when dependencies are complex, every package has its own pom.xml, and since these file names are identical, indexing by name would lead to serious collisions.
Chapter 2: Git Basics
This chapter describes the most basic local Git operations: creating and cloning repositories, making changes, staging and committing those changes, and viewing the full history. I will not list the commands for these operations here; instead, let's look at the file states that were mentioned in Chapter 1 but not described in detail.
File states
To sort out how a file moves between its states, you can draw the following diagram. It shows the frequently used local file commands and the state changes they cause:
Besides file states, a brief word on what a tag means in Git. As everyone knows, every SVN revision has a revision number that starts at 1 and goes up with each commit. In Git, however, a commit is identified only by its SHA-1 checksum; there is no sequential version number.
How, then, do you identify a version of the code in Git when you release? You can mark it with a tag. A tag attaches a label to a specific commit; it can replace the SVN revision number and is more powerful.
Chapter 3: Git Branching
Branches and linked lists
To understand Git you must understand Git branches, and to understand branches you must first understand Git's four basic object types: blob, tree, commit, and tag. The book covers this part rather briefly; for details, see Chapter 1 of the "Git Community Book". Fortunately that book also has a web version, and this part is at: http://gitbook.liuhui998.com/1_2.html. Put simply, the four objects are:
- Blob: represents file content. It holds only the data, not the file name.
- Tree: represents one directory level and holds pointers to blob objects and other tree objects.
- Commit: stores the information for one commit and a pointer to the tree object of the root directory.
- Tag: marks a specific commit.
Branches are organized as linked lists of commits. Each branch points to its own commit object, and every commit on a branch inserts a new object at the head of the list. Take the master and testing branches in the figure above, where each green box represents a commit object. Which branch is in use is determined by where the HEAD pointer points.
A quick review of linked-list operations shows that as long as each branch stores its own list head, switching between branches is simply a matter of assigning HEAD to one of them. And for every commit, the list insertion is trivial.
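The linked-list view described above can be sketched in a few lines. This is an illustrative model only: `Commit` and `Repo` are hypothetical classes, and a real commit stores far more than a message and a parent.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Commit:
    message: str
    parent: Optional["Commit"] = None


class Repo:
    def __init__(self):
        root = Commit("initial commit")
        # Each branch only stores the head of its list.
        self.branches: Dict[str, Commit] = {"master": root}
        self.head = "master"  # name of the branch in use

    def branch(self, name: str) -> None:
        # A new branch is just another pointer to the current commit.
        self.branches[name] = self.branches[self.head]

    def checkout(self, name: str) -> None:
        # Switching branches only reassigns HEAD.
        self.head = name

    def commit(self, message: str) -> Commit:
        # Insert at the head of the list and move the branch pointer.
        new = Commit(message, parent=self.branches[self.head])
        self.branches[self.head] = new
        return new

    def log(self, name: str) -> List[str]:
        node, messages = self.branches[name], []
        while node is not None:
            messages.append(node.message)
            node = node.parent
        return messages


repo = Repo()
repo.branch("testing")
repo.checkout("testing")
repo.commit("work on testing")
repo.checkout("master")
repo.commit("work on master")
print(repo.log("master"))
print(repo.log("testing"))
```

Both branches share the initial commit and diverge after it, and neither `branch` nor `checkout` copies any data.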
Once you understand how Git implements versioning, the principles behind the other operations follow quickly with only basic knowledge of algorithms. The figures below are from Pro Git.
1. Creating a new branch iss53 from master just adds a pointer for that branch; as long as no changes are committed, iss53 points to the same commit as master. When new content is committed, the corresponding commit object is created and the iss53 pointer moves forward.
2. When master is an ancestor of the hotfix branch, merging hotfix back into master only requires moving the master pointer forward to hotfix, with no file processing at all. This is called a fast-forward.
3. When iss53 is merged back into master but master is not an ancestor of iss53, Git computes the most recent common ancestor of the two, merges the changes of both branches, and creates a new commit object. This object has two parents. If the merge hits a conflict, nothing is committed until the conflict has been resolved manually and staged with git add.
How do you find the first common node of two intersecting linked lists? This is a classic algorithm problem; see section 3.6 of "The Beauty of Programming", on determining whether two linked lists intersect, in particular extension problem 2.
4. (Personal speculation) Checking whether a branch has been merged into master: compare whether the two branch pointers point to the same object. More generally, a branch counts as merged when its tip commit can be reached from master by following parent pointers.
5. (Personal speculation) Deleting a branch that has been merged into master: delete the branch pointer directly. Deleting an unmerged branch (git branch -D XXX): the branch pointer is deleted; the commit objects that exist only on that branch become unreachable, and Git can remove them later when it cleans up.
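The pointer operations in the list above can be tried out on a small commit graph. The sketch below assumes the layout of the Pro Git figures (hotfix at C4, iss53 at C5, both branching off C2); the function names are made up for illustration, and the breadth-first search is a simplification of what Git really does when histories contain many merges.

```python
from collections import deque

# Parents of each commit; the layout is assumed from the Pro Git figures.
PARENTS = {
    "C0": [], "C1": ["C0"], "C2": ["C1"],
    "C3": ["C2"], "C4": ["C2"], "C5": ["C3"],
}


def ancestors(commit, parents):
    """All commits reachable from `commit` via parent pointers, itself included."""
    seen, queue = {commit}, deque([commit])
    while queue:
        for parent in parents[queue.popleft()]:
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


def can_fast_forward(target, source, parents):
    """True if `target` only needs to move its pointer forward to `source`."""
    return target in ancestors(source, parents)


def common_ancestor(a, b, parents):
    """First commit found walking back from `b` that is also reachable from `a`."""
    reachable_from_a = ancestors(a, parents)
    seen, queue = {b}, deque([b])
    while queue:
        commit = queue.popleft()
        if commit in reachable_from_a:
            return commit
        for parent in parents[commit]:
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return None


def is_merged(branch_tip, master_tip, parents):
    """A branch is merged if its tip is reachable from master."""
    return branch_tip in ancestors(master_tip, parents)


print(can_fast_forward("C2", "C4", PARENTS))   # master at C2, hotfix at C4
print(common_ancestor("C4", "C5", PARENTS))    # master at C4, iss53 at C5
print(is_merged("C5", "C4", PARENTS))          # iss53 not yet merged
```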
Merge or rebase?
A merge joins two branches directly: it creates a merge commit node with two parents, the merged branches A and B, and the node's content is the result of a three-way merge (branch A, branch B, and the most recent common ancestor of A and B). The nodes on the original lists remain, and the commit history of each branch is unchanged. As shown in the figure (from "Pro Git"):
A rebase takes the changes made on branch A and replays them as patches on top of another branch B; afterwards, the nodes of branch A become successors of branch B. Once the rebase is complete, the nodes unique to branch A have changed. As shown in the figure (from "Pro Git"), C3 and C3' are different nodes:
In fact, the final content produced by merge and rebase is the same, and conflicts still have to be resolved by hand in either case; the only difference is the commit history. Rebase is better suited to cleaning up commits that have not yet been published (that is, pushed to a remote repository). If commits that were already published are rebased after someone else has based work on them, the commit history can become very confusing. Detailed examples can be found in the section on rebasing in the book's branching chapter.
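Why C3 and C3' are different nodes even though they carry the same change can be shown with a simplified model. The sketch assumes a commit ID is a hash over the parent ID and the patch text only; real commit objects also include the tree, author, timestamps and message, and `commit_id` and `replay` are hypothetical helpers.

```python
import hashlib


def commit_id(parent_id: str, patch: str) -> str:
    """Simplified commit ID: depends on the parent and on the change itself."""
    data = f"parent {parent_id}\n{patch}".encode("utf-8")
    return hashlib.sha1(data).hexdigest()[:7]


def replay(patches, onto):
    """Re-apply patches one by one on top of `onto`; returns the new commit IDs."""
    new_ids, parent = [], onto
    for patch in patches:
        parent = commit_id(parent, patch)
        new_ids.append(parent)
    return new_ids


c1 = commit_id("", "add README")
c2 = commit_id(c1, "add main.c")
c3 = commit_id(c2, "fix typo")    # unique commit on branch A
c4 = commit_id(c2, "add tests")   # commit on branch B

# Rebasing branch A onto branch B replays the same patch with a new parent.
c3_rebased = replay(["fix typo"], onto=c4)[0]
print(c3, c3_rebased, c3 != c3_rebased)
```

Because the parent is part of what gets hashed, the replayed commit gets a new ID, which is why rebasing commits that others already build on causes trouble.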
Chapter 6: Git Tools
The "Debugging with Git" section mentions git bisect, which performs a binary search over commits. As everyone knows, a singly linked list does not itself support binary search, so presumably Git supports it in one of the following two ways:
(1) Save pointers to all commits between the start and end commits in a temporary array, and run the binary search on that array;
(2) Git uses a linked structure similar to a skip list. For skip lists, see http://www.cnblogs.com/liuhao/archive/2012/07/26/2610218.html.
A further guess: if commits are modified while a binary search is in progress, the search may give wrong results.
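Guess (1) can be sketched as follows. This only illustrates the idea for a linear history copied into a list; it is not a description of how git bisect is actually implemented, and `bisect_commits` and `is_bad` are names invented for the example.

```python
def bisect_commits(commits, is_bad):
    """Find the first bad commit.

    `commits` is ordered oldest to newest; commits[0] is known to be good
    and commits[-1] is known to be bad. Returns (first_bad, tests_run).
    """
    low, high = 0, len(commits) - 1  # invariant: commits[low] good, commits[high] bad
    tests_run = 0
    while high - low > 1:
        mid = (low + high) // 2
        tests_run += 1
        if is_bad(commits[mid]):
            high = mid
        else:
            low = mid
    return commits[high], tests_run


# Hypothetical history of 16 commits in which the bug appeared in c11.
history = [f"c{i}" for i in range(16)]
first_bad, steps = bisect_commits(history, lambda c: int(c[1:]) >= 11)
print(first_bad, steps)
```

With 16 commits only 4 checks are needed, instead of testing every commit in turn.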
Chapter 9: Git Internals
What are the plumbing commands?
The first time I read this chapter, I got somewhat lost from the second section onward and could not follow the line of thought. Only on the second reading did I understand a little, and I realized why: in many places the book only describes what you observe after running a plumbing command, without fully explaining what the command actually did. Most introductions to Git on the web are pragmatic and spend little ink on these plumbing commands. Fortunately, Git's own documentation is thorough: git help <command> also works for plumbing commands, so you can look them up yourself. For convenience, though, here is a brief introduction. The commands below are actually invoked as git xxx, for example git hash-object, abbreviated here as hash-object.
Of course, this introduction is not a translation of the documentation; it also includes some personal understanding, so the description of individual commands may differ slightly from the official text.
Another interesting fact: Git's high-level commands support auto-completion, while the plumbing commands do not.
hash-object
Computes the object ID of a file (with --stdin, the content is read from standard input instead). This ID is the key in the key-value store of Git's content-addressable filesystem. With the -w option the object is also written to Git's object database, rather than only printing its key on the screen.
cat-file
Displays the content or type of a Git object, given its object ID. The -p option prints the content in a readable format. You will find that the file written by hash-object looks like garbage when opened directly; to see the original content you must use git cat-file. From this one can guess that a Git object stores structural information in addition to the file content, and is possibly compressed. The end of this section of the book confirms it: Git first builds a header (object type and content length), appends the content, computes the SHA-1 checksum of the two (it becomes the path and name of the object file and is not stored inside the file itself), and finally compresses the result.
Running cat-file -p on a tree object shows that it contains references to other tree or blob objects (that is, their object IDs).
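The storage steps described above (header, content, SHA-1, compression) can be imitated with an in-memory dictionary standing in for the .git/objects directory. This is a sketch, not Git's code: `write_object` and `cat_file` are hypothetical helpers, zlib is the compression the book describes, and the two-character directory prefix follows the object paths shown in the book.

```python
import hashlib
import zlib


def write_object(store, obj_type, content):
    """Roughly what hash-object -w does: returns the object ID."""
    header = f"{obj_type} {len(content)}".encode("ascii") + b"\0"
    raw = header + content
    object_id = hashlib.sha1(raw).hexdigest()
    # The ID becomes the path: first two characters are the directory name.
    path = f"{object_id[:2]}/{object_id[2:]}"
    store[path] = zlib.compress(raw)
    return object_id


def cat_file(store, object_id):
    """Roughly what cat-file does: returns (type, content)."""
    raw = zlib.decompress(store[f"{object_id[:2]}/{object_id[2:]}"])
    header, _, content = raw.partition(b"\0")
    obj_type, size = header.decode("ascii").split(" ")
    if int(size) != len(content):
        raise ValueError("object is corrupt: length mismatch")
    return obj_type, content


store = {}
oid = write_object(store, "blob", b"test content\n")
print(oid)
print(list(store))
print(cat_file(store, oid))
```

The stored bytes are compressed, which is why opening an object file directly shows unreadable data.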
update-index
Creates or updates the index entry for a file, which puts the file into the staging area (recall the staged state of a file in Git). After this command, the next step is to run git write-tree. Running it repeatedly on the same file produces no message.
write-tree
Creates a tree object from the current index (note that the staging area may contain several files at this point). I think update-index and write-tree are separate commands so that Git can offer finer-grained control.
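The split between the two commands can be modelled as "collect entries first, hash them later". The sketch below makes several assumptions: the index is a plain dictionary rather than Git's binary index file, only a single flat directory is handled, and entries are simply sorted by name. `update_index` and `write_tree` are illustrative functions, not Git's implementation.

```python
import hashlib

EMPTY_BLOB = "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391"  # ID of a blob with no content


def update_index(index, mode, name, object_id):
    """Stage one file: remember its mode and the ID of its blob."""
    index[name] = (mode, object_id)


def write_tree(index):
    """Hash everything staged so far into one tree ID."""
    body = b""
    for name in sorted(index):
        mode, object_id = index[name]
        # Entry: "<mode> <name>", a NUL byte, then the 20 raw bytes of the ID.
        body += f"{mode} {name}".encode("utf-8") + b"\0" + bytes.fromhex(object_id)
    header = f"tree {len(body)}".encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


index = {}
empty_tree = write_tree(index)
update_index(index, "100644", "a.txt", EMPTY_BLOB)
one_file = write_tree(index)
update_index(index, "100644", "b.txt", EMPTY_BLOB)
two_files = write_tree(index)
print(empty_tree, one_file, two_files, sep="\n")
```

Several files can be staged before a single tree is written, which is the finer-grained control mentioned above.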
read-tree
Reads a tree object into the index (--prefix can give it a directory name that does not exist yet). With this command together with update-index and write-tree, we can assemble any directory and file structure we like. Note that I say "assemble" rather than "reorganize", because these three commands cannot split a directory structure apart.
commit-tree
Creates a commit object from a given tree object. If you run the command again with the first commit object as the parent, git log shows the complete history, that is, both commit objects.
update-ref
Safely updates the object ID that a reference points to. The result is the same as using git branch to point a branch (the reference given to update-ref) at a commit (the object ID given to update-ref).
Lightweight tags can also be created with update-ref.
symbolic-ref
Sets the reference that a symbolic reference points to, most commonly HEAD. If no target reference is given, it reads out the one the symbolic reference currently points to.
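How update-ref and symbolic-ref relate can be sketched with a dictionary that stands in for the files under .git. The helpers below are illustrative only, the "ref: " prefix follows the content of the HEAD file as shown in the book, and the second object ID is a placeholder rather than a real commit.

```python
PREFIX = "ref: "


def update_ref(refs, name, object_id):
    """Point a reference (a branch or lightweight tag) at an object ID."""
    refs[name] = object_id


def symbolic_ref(refs, name, target=None):
    """Read the target of a symbolic reference, or set it when `target` is given."""
    if target is None:
        return refs[name][len(PREFIX):]
    refs[name] = PREFIX + target
    return target


def resolve(refs, name):
    """Follow symbolic references until an object ID is reached."""
    value = refs[name]
    while value.startswith(PREFIX):
        value = refs[value[len(PREFIX):]]
    return value


MASTER_ID = "24b9da6552252987aa493b52f8696cd6d3b00373"
TEST_ID = "0123456789abcdef0123456789abcdef01234567"  # placeholder ID

refs = {"HEAD": "ref: refs/heads/master"}
update_ref(refs, "refs/heads/master", MASTER_ID)
update_ref(refs, "refs/heads/test", TEST_ID)
update_ref(refs, "refs/tags/v1.0", MASTER_ID)  # a lightweight tag is just a ref

before = resolve(refs, "HEAD")
symbolic_ref(refs, "HEAD", "refs/heads/test")  # what switching branches changes
after = resolve(refs, "HEAD")
print(before, after, symbolic_ref(refs, "HEAD"))
```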
gc
Cleans up the repository. What it actually does is pack and compress the objects.
verify-pack
Inspects the Git objects that gc has packed.
I ran into two translation errors while reading and have submitted a pull request for them:
1. Section 9.2: "--stdin specifies that the content is read from standard input (stdin); if this parameter is not given, you must specify the path to the file to be stored." This should be "the path to the file to be read".
2. Section 9.4: "you can then use the git cat-file command ..." is followed by the du command. The original text does not mention git cat-file here.
"Pro Git" reading random Thoughts