background
As the largest and most successful open source project, Linux has attracted contributions from programmers all over the world. So far, more than 20,000 developers have submitted code to the Linux kernel.
Surprisingly, for the first ten years of the project (1991 to 2002), Linus, as the project maintainer, did not use any configuration management tool; he merged the code that everyone submitted as patches by hand. This was not because Linus liked manual work, but because he was very picky about software configuration management (SCM) tools: neither the commercial ClearCase nor the open source CVS, SVN, and so on met his standards.
In his opinion, a version control system suitable for developing the Linux kernel had to meet several conditions: 1) it must be fast; 2) it must support heavy branching (thousands of branches developed in parallel); 3) it must be distributed; 4) it must handle large projects. It wasn't until 2002 that Linus finally found a tool that basically met his requirements: BitKeeper. BitKeeper was a commercial tool. Its vendor was willing to let the Linux community use it for free, on the condition that users comply with its no-decompilation clause. The default interface provided by BitKeeper obviously could not meet all the needs of community users, and a community developer decompiled BitKeeper and used an undisclosed interface, which led the vendor to withdraw the free license. As a last resort, Linus used a ten-day holiday to implement a DVCS of his own, Git, and released it to the community developers.
design
Git is now the de facto standard tool for software developers worldwide. There is no need to introduce Git or its usage here; today I want to talk about Git's internal implementation. Before reading on, let me ask you a question: if you were designing Git (or redesigning it), how would you design it? Which features would you implement in the first version? After reading this article, compare it with your own ideas. You are welcome to leave a comment to discuss.
The best way to learn the internal implementation of Git is to read Linus's initial code submission: check out the first commit of the git project (see the blog post "Tips for reading open source code"). You can see that the code base contains only a handful of files: a README, a Makefile build script, and a few C source files. The message of this commit is also very distinctive: Initial revision of "git", the information manager from hell.
commit e83c5163316f89bfbde7d9ab23ca2e25604af290
Author: Linus Torvalds <torvalds@ppc970.osdl.org>
Date: Thu Apr 7 15:13:13 2005 -0700
Initial revision of "git", the information manager from hell
In the README, Linus described the design ideas of Git in detail. For all of Git's seemingly complex work, there are only two abstractions in Linus's design: 1) the object database ("object database"); 2) the current directory cache ("current directory cache").
In essence, Git is a collection of object files. Code files are objects, directory trees are objects, and commits are objects too. The name of each object is the SHA1 value of its content. A SHA1 value is 160 bits long, written as 40 hexadecimal characters. Linus uses the first two characters as the directory name and the remaining 38 characters as the file name. Under objects in the .git directory you can see many directories with two-character names (letters and digits), each holding files whose names are 38-character hash values. This is all of Git's information.
Linus defines the data structure of an object as <tag in ASCII> (blob/tree/commit) + <space> + <length in ASCII> + <\0> + <binary data content>. You can inspect the object files in the objects directory with the xxd command (after decompressing them with zlib). For example, the content of a tree object file is as follows:
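The naming scheme can be illustrated with a short Python sketch. It assumes the hashing rule of current Git (SHA1 over a "type length\0" header followed by the content); the very first revision may have differed in detail. The helper names object_id and object_path are made up for this illustration.

```python
import hashlib

def object_id(obj_type: str, content: bytes) -> str:
    """Return the 40-character hex SHA-1 of an object (current Git's rule, assumed here)."""
    header = f"{obj_type} {len(content)}".encode("ascii") + b"\x00"
    return hashlib.sha1(header + content).hexdigest()

def object_path(sha1_hex: str, root: str = ".git/objects") -> str:
    """Use the first 2 hex characters as the directory and the other 38 as the file name."""
    return f"{root}/{sha1_hex[:2]}/{sha1_hex[2:]}"

sha = object_id("blob", b"hello world\n")
print(len(sha))          # 40 hex characters = 160 bits
print(object_path(sha))  # two-character directory, 38-character file name
```

The same content always yields the same name, which is why identical files are stored only once.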
00000000: 7472 6565 2033 3700 3130 3036 3434 2068 tree 37.100644 h
00000010: 656c 6c6f 2e74 7874 0027 0c61 1ee7 2c56 ello.txt.'.a..,V
00000020: 7bc1 b2ab ec4c bc34 5bab 9f15 ba {....L.4[....
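The dump above can be decoded by hand to check the layout. The following Python sketch rebuilds the 45 bytes from the hex dump and parses them; the function names are hypothetical, and the entry layout (mode, space, name, \0, 20 raw SHA1 bytes) is read off the dump itself.

```python
def parse_object(raw: bytes):
    """Split a decompressed object into (type, declared size, body)."""
    header, _, body = raw.partition(b"\x00")
    obj_type, size = header.decode("ascii").split(" ")
    if int(size) != len(body):
        raise ValueError("declared length does not match body length")
    return obj_type, int(size), body

def parse_tree_body(body: bytes):
    """Return a list of (mode, name, sha1_hex) entries from a tree body."""
    entries = []
    pos = 0
    while pos < len(body):
        space = body.index(b" ", pos)      # end of the mode field
        nul = body.index(b"\x00", space)   # end of the file name
        mode = body[pos:space].decode("ascii")
        name = body[space + 1:nul].decode("utf-8")
        sha1_hex = body[nul + 1:nul + 21].hex()  # 20 raw bytes, not hex text
        entries.append((mode, name, sha1_hex))
        pos = nul + 21
    return entries

# The bytes of the xxd dump shown above, one string per dump line.
RAW = bytes.fromhex(
    "74726565203337003130303634342068"
    "656c6c6f2e74787400270c611ee72c56"
    "7bc1b2abec4cbc345bab9f15ba"
)

obj_type, size, body = parse_object(RAW)
entries = parse_tree_body(body)
print(obj_type, size)
print(entries)
```

Note that the SHA1 inside a tree entry is stored as 20 raw bytes, which is why it looks like noise in the right-hand column of the dump.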
There are three types of objects: BLOB, TREE, and CHANGESET.
BLOB: Binary object. This is a file stored by Git. Unlike some VCSs (such as SVN), Git does not store deltas; it stores the complete content of each version of a file. For example, when you commit hello.c to a Git repository, a BLOB is generated that records the full content of hello.c; after you modify hello.c and commit again, a new BLOB is generated that records the full content of the modified hello.c. In Linus's design, a BLOB records only the content of the file; metadata such as the file name and file attributes is not included. That information is recorded in the second kind of object, the TREE.
TREE: Directory tree object. In Linus's design, a TREE object is a snapshot of the directory tree at one point in time. It contains file names, file attributes and the SHA1 values of BLOB objects, but no history. The advantage of this design is that the TREE objects of two revisions can be compared quickly: without reading any file content, the SHA1 values alone show which files are the same and which differ.
In addition, since file names and attributes are recorded in the TREE, BLOB objects can be reused when file attributes are changed, or when a file is renamed or moved to another directory, without its content changing. This saves storage. In Git's later evolution, the design of the TREE was optimized so that it became a snapshot of a single directory at a point in time, containing the SHA1 values of the TREE objects of its subdirectories. This saves storage for repositories with complex or deeply nested directory structures. History is recorded in the third kind of object, the CHANGESET.
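The comparison described here can be sketched in a few lines of Python. This is only an illustration: each TREE is modelled as a dictionary from path to BLOB SHA1, the file names and contents are invented, and the hashing follows current Git's blob rule as an assumption.

```python
import hashlib

def blob_sha(content: bytes) -> str:
    """SHA-1 of a blob, using current Git's 'blob <len>\\0' header (assumed)."""
    header = f"blob {len(content)}".encode("ascii") + b"\x00"
    return hashlib.sha1(header + content).hexdigest()

def diff_trees(old: dict, new: dict):
    """Compare two flat trees (path -> blob SHA-1) without reading any file content."""
    added = sorted(p for p in new if p not in old)
    removed = sorted(p for p in old if p not in new)
    modified = sorted(p for p in old if p in new and old[p] != new[p])
    return added, removed, modified

# Two hypothetical snapshots of the same project.
old_tree = {
    "hello.c": blob_sha(b"int main(void) { return 0; }\n"),
    "README": blob_sha(b"first draft\n"),
}
new_tree = {
    "hello.c": old_tree["hello.c"],            # unchanged: same SHA-1, same blob
    "README": blob_sha(b"second draft\n"),     # content changed: new SHA-1
    "Makefile": blob_sha(b"all:\n"),           # new file
}

added, removed, modified = diff_trees(old_tree, new_tree)
print(added, removed, modified)
```

Only the paths reported as modified need their BLOB contents read to produce a textual diff.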
Picture taken from Pro Git 1
CHANGESET: Commit object. A CHANGESET object records the SHA1 of the TREE object being committed, along with information such as the committer and the commit message. Unlike other SCM (software configuration management) tools, Git's CHANGESET object does not record file renames or attribute changes, nor does it record the deltas of file modifications. Instead, a CHANGESET records the SHA1 values of its parent CHANGESET objects, and the difference is obtained by comparing the TREE of this node with the TREE of its parent.
In his design, Linus allows a CHANGESET to have up to 16 parents. Although a merge with more than two parents looks strange, Git does in fact support multi-head merges of more than two branches.
After explaining the three objects, Linus emphasized trust (TRUST): although trust is not something Git's design deals with directly, Git can be trusted as a configuration management tool. The reason is that all objects are named by their SHA1 (Google's demonstration of a practical SHA1 collision came later, and the Git community is preparing to move to the more robust SHA256), while the act of checking objects in is vouched for by signing tools such as GPG.
Once you understand the three basic objects, it is easy to understand the two abstractions Linus originally designed for Git, the "object database" and the "current directory cache". Counting the working directory, Git has three layers, as shown in the following figure. The first is the working directory (Working Directory), where we view and write code. The second is the Git repository, the object database Linus described, i.e., the content stored in the .git folder of a repository (Linus named it .dircache in the first version). Between these two sits an intermediate staging area (Staging Area), i.e., the information stored in .git/index. When we run git add, we add the current modifications to this staging area.
Linus also explained the design of the "current directory cache". The cache is a binary file whose content structure resembles a TREE object. The difference is that the index does not contain nested index objects; the whole of the current directory tree is recorded in a single index file. This design has two advantages: 1. the complete content recorded in the cache can be restored quickly, so even if files in the working directory are accidentally deleted, they can all be restored from the cache; 2. files whose content differs between the cache and the working directory can be found quickly.
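To make the shape of such an object concrete, here is a hedged Python sketch that assembles a commit-style object with a parent limit of 16. The text layout (tree, parent, author, committer lines, a blank line, then the message) follows the format current Git uses, which is an assumption with respect to the first revision; the identity string and the parent IDs are placeholders, and build_commit is a hypothetical helper.

```python
import hashlib

MAX_PARENTS = 16  # the limit described in the text

def build_commit(tree_sha, parent_shas, identity, message):
    """Return the raw bytes of a commit-style object: header + \\0 + text body."""
    if len(parent_shas) > MAX_PARENTS:
        raise ValueError(f"at most {MAX_PARENTS} parents are allowed")
    lines = [f"tree {tree_sha}"]
    lines += [f"parent {p}" for p in parent_shas]  # zero parents for a root commit
    lines.append(f"author {identity}")
    lines.append(f"committer {identity}")
    body = ("\n".join(lines) + "\n\n" + message + "\n").encode("utf-8")
    header = f"commit {len(body)}".encode("ascii") + b"\x00"
    return header + body

EMPTY_TREE = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"  # SHA-1 of Git's empty tree
WHO = "A U Thor <author@example.com> 0 +0000"            # placeholder identity
PARENTS = ["a" * 40, "b" * 40, "c" * 40]                 # placeholder parent IDs

raw = build_commit(EMPTY_TREE, PARENTS, WHO, "three-way merge")
print(raw.split(b"\x00", 1)[1].decode("utf-8"))
print(hashlib.sha1(raw).hexdigest())  # the object's own name
```

Nothing in the object describes what changed; the commit only points at a tree and at its parents, and a diff is derived by comparing trees.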
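The second advantage can be sketched in Python. The sketch assumes a flat list of cache entries that store each file's size and modification time next to its SHA1, and it flags a file as changed when those attributes no longer match. CacheEntry, make_entry and changed_paths are hypothetical names, and the real index is a binary file rather than a Python list.

```python
import hashlib
import os
import tempfile
from dataclasses import dataclass

@dataclass
class CacheEntry:
    path: str       # full relative path: the index is flat, with no nested entries
    size: int
    mtime_ns: int
    sha1: str

def make_entry(root: str, path: str) -> CacheEntry:
    """Record a file's stat data and content hash, as a cache entry would."""
    full = os.path.join(root, path)
    st = os.stat(full)
    with open(full, "rb") as f:
        data = f.read()
    header = f"blob {len(data)}".encode("ascii") + b"\x00"
    sha1 = hashlib.sha1(header + data).hexdigest()
    return CacheEntry(path, st.st_size, st.st_mtime_ns, sha1)

def changed_paths(root: str, index: list) -> list:
    """List entries whose file is missing or whose size/mtime differ from the cache."""
    changed = []
    for entry in index:
        try:
            st = os.stat(os.path.join(root, entry.path))
        except FileNotFoundError:
            changed.append(entry.path)
            continue
        if (st.st_size, st.st_mtime_ns) != (entry.size, entry.mtime_ns):
            changed.append(entry.path)
    return changed

with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "src"))
    for path, text in [("a.txt", "one\n"), ("src/b.txt", "two\n"), ("c.txt", "three\n")]:
        with open(os.path.join(root, path), "w") as f:
            f.write(text)
    index = [make_entry(root, p) for p in ("a.txt", "src/b.txt", "c.txt")]

    with open(os.path.join(root, "a.txt"), "a") as f:
        f.write("edited\n")                      # size changes
    os.remove(os.path.join(root, "src/b.txt"))   # file disappears

    changed = changed_paths(root, index)
    print(changed)
```

Because only stat data is compared, no file content has to be read or hashed to find candidates for a diff.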
implementation
Linus completed Git's most basic functions in the very first commit, and it can be compiled and used. The code is extremely concise: the source files plus the Makefile come to only 848 lines in total. Interested readers can use the method described earlier to check out the earliest commit of Git, then compile it and play with it; all you need is a Linux environment.
Because of changes in library versions, you need to make some minor modifications to the original Makefile. The first version of Git depends on two libraries, openssl and zlib, which need to be installed manually. On Ubuntu, run: sudo apt install libssl-dev libz-dev. Then edit the LIBS= -lssl line in the Makefile, changing -lssl to -lcrypto and adding -lz. Finally, run make and ignore the compilation warnings. You will find that 7 executables have been built: init-db, update-cache, write-tree, commit-tree, cat-file, show-diff and read-tree.
The following is a brief introduction to the implementation of these executable programs:
(1) init-db: Initializes a local git repository; this is the git init command we now run whenever we create a repository. The only difference is that the repository and cache folder Linus created at the beginning was called .dircache, not the .git folder we are familiar with today.
(2) update-cache: Takes one or more file paths and adds the files to the cache. The implementation verifies that the path is valid, calculates the SHA1 value of the file, prepends the blob header to the file content, compresses the result with zlib and writes it to the object database (.dircache/objects); finally, it records the file path, file attributes and blob SHA1 value in the .dircache/index cache file.
(3) write-tree: Generates a TREE object from the directory tree information in the cache and writes it to the object database. The data structure of the TREE object is: 'tree ' + length + \0 + file list. Each entry in the file list is stored as file attributes + file name + \0 + SHA1 value. After the object is written successfully, the SHA1 value of the TREE object is returned.
(4) commit-tree: Generates a commit object from a TREE object and adds it to the version history. You pass in the SHA1 value of the TREE object to commit and, optionally, the parent commits (up to 16). The commit object contains the TREE, the parents, and the name, email and date of the committer and author. Finally, the new commit object file is written and the SHA1 value of the commit is returned.
(5) cat-file: Since all object files are compressed with zlib, you need this tool to view the content of an object file; it decompresses the object and writes it to a temporary file.
(6) show-diff: Quickly compares the current cache with the current working directory. Because file attributes (including modification time, length, etc.) are also stored in the cache data structure, it can quickly tell whether a file has been modified, and then show the difference.
(7) read-tree: Given the SHA1 value of a TREE object, prints the content of that TREE.
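How these programs fit together can be sketched with a toy in-memory model in Python. It is not a port of Linus's C code: the object database and the index are plain dictionaries, objects are named using current Git's hashing rule, the identity string is a placeholder, and the Python function names merely mirror the command names.

```python
import hashlib
import zlib

objects = {}  # stand-in for .dircache/objects: SHA-1 hex -> zlib-compressed object
index = {}    # stand-in for .dircache/index: path -> (mode, blob SHA-1 hex)

def write_object(obj_type: str, body: bytes) -> str:
    """Prepend the header, name the object by SHA-1, store it compressed."""
    raw = f"{obj_type} {len(body)}".encode("ascii") + b"\x00" + body
    sha = hashlib.sha1(raw).hexdigest()
    objects[sha] = zlib.compress(raw)
    return sha

def update_cache(path: str, data: bytes, mode: str = "100644") -> str:
    """Store the file content as a blob and record path/mode/SHA-1 in the index."""
    sha = write_object("blob", data)
    index[path] = (mode, sha)
    return sha

def write_tree() -> str:
    """Serialize the flat index as a tree: mode + ' ' + name + \\0 + 20 raw SHA-1 bytes."""
    body = b""
    for path in sorted(index):
        mode, sha = index[path]
        body += f"{mode} {path}".encode("utf-8") + b"\x00" + bytes.fromhex(sha)
    return write_object("tree", body)

def commit_tree(tree_sha: str, parents: list, message: str) -> str:
    """Create a commit object pointing at a tree and at up to 16 parents."""
    if len(parents) > 16:
        raise ValueError("at most 16 parents are allowed")
    who = "A U Thor <author@example.com> 0 +0000"  # placeholder identity
    lines = [f"tree {tree_sha}"] + [f"parent {p}" for p in parents]
    lines += [f"author {who}", f"committer {who}"]
    text = "\n".join(lines) + "\n\n" + message + "\n"
    return write_object("commit", text.encode("utf-8"))

def cat_file(sha: str):
    """Decompress an object and return (type, body)."""
    raw = zlib.decompress(objects[sha])
    header, _, body = raw.partition(b"\x00")
    return header.decode("ascii").split(" ")[0], body

blob1 = update_cache("hello.txt", b"hello world\n")
tree1 = write_tree()
commit1 = commit_tree(tree1, [], "first commit")

update_cache("hello.txt", b"hello again\n")
tree2 = write_tree()
commit2 = commit_tree(tree2, [commit1], "second commit")

print(cat_file(blob1))
print(cat_file(commit2)[1].decode("utf-8"))
```

After the second commit the first blob is still in the object database, unchanged, which reflects the point made earlier that each version of a file is stored in full.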