Sometimes the best way to deal with scale problems is to keep things simple and avoid them as much as you can. That is the approach GitHub takes. Linus Torvalds developed the Git source control tool ten years ago, and GitHub, which provides a repository hosting service for the tool, has grown explosively and become one of the focal points of open source software development.
It is understandable that programmers are picky about the tools they use to create code and share it with others, and that they in turn adjust and refine those tools. Software developers quite literally "live" in these systems, so the way a source code version control system works can have a positive or negative effect on the creative process of the people collaborating in it.
GitHub dates back to 2007. Its founders include PJ Hyett, the company's chief operating officer (COO); Chris Wanstrath, its chief executive officer (CEO); Tom Preston-Werner, its former chief executive; and Scott Chacon, its chief information officer (CIO). All of them were developing Ruby applications on the Rails framework and wanted a better way to collaborate on code, so they started building GitHub, which went into operation in 2008. They built it less to follow a business plan than to have a tool that would help them automate their own software development work.
As it turns out, GitHub is probably the world's largest Ruby on Rails application, or so reckons Sam Lambert, GitHub's director of systems, who had a short chat with The Platform (translator's note: a website, http://www.theplatform.net/) about the system. Lambert was not at liberty to say how many lines of code make up GitHub itself, and the company has not disclosed how many lines of code are hosted in GitHub repositories. But Lambert did give me some metrics on GitHub's usage growth and on how the system supports about 10 million programmers who, on behalf of organizations or as individuals, maintain 26 million open source projects.
"Basically it's a simple stack, and that's really important to us," Lambert said. "We try to keep the stack simple, with as few things in it as possible."
On the other hand, 2008 was a dividing line for startups: it came two years after Amazon Web Services launched its EC2 compute cloud, and GitHub could have been among the first to use the cloud instead of investing in its own infrastructure. It did not. Instead, the company's founders and the engineers they hired sketched out a technology stack and, working through chat tools, put together an inventive set of system management and software deployment tools to run basic IT operations at GitHub.
Of course, the company has its own private repository on GitHub in which it develops GitHub. Lambert did not reveal the exact size of the Ruby application that makes up GitHub, but he told us that the GitHub repository has had 250,000 commits, with hundreds of people contributing code and committing changes, although not all of them work at GitHub.
Project person
"GitHub was originally created for us. We are basically software engineers, so we wanted a good tool for development. We use GitHub to build GitHub, and we use it every day to manage everything," Lambert says. "The human resources and legal teams also use GitHub in their workflows, so it is not just programmers who use it. We are very fortunate to be able to work on our code in a way that other companies may not be able to. If you recruit developers to build an advertising system, they will not be enthusiastic about it unless they really care about ads. All of our developers like Git, and all of our work revolves around it, so we take special care of the tools we use every day."
The bottom of the GitHub stack is hardware: hundreds of x86 servers distributed across data centers. (GitHub does not reveal where these servers are located, but Lambert did say that GitHub is considering building data centers in other parts of the world as its global user base grows.)
"We use off-the-shelf machines from standard suppliers," Lambert said, without naming the suppliers or the configurations. "We have done a lot of optimization on the software side, but we have not done any unwarranted heavy customization of our hardware. As we scale up, we try to make the software more fault tolerant and to replicate the data across disposable machines so that we do not have to repair a machine. You just destroy it and put the data back on another machine. That makes the machines cheaper to buy and lets us expand at a lower cost."
"We do not need to build custom and unusual things, because once we do, we lose the benefits of what the community is doing. That also informs our choice of database, because MySQL is the database that everyone is using. If you run into a problem with it, someone else has run into it too, so you will not hit a failure that nobody understands."
Hardware is obviously not that interesting, especially to software engineers. But Lambert is particularly excited about Gpanel, GitHub's own deployment system, which is written in Ruby and hooks into the Puppet configuration tool, allowing anyone in the company to provision machines and release software onto them.
"This allows us to deploy software as if we were on a public cloud, while still giving us all the benefits of owning our own hardware."
The software foundation of GitHub is of course Linux, and Lambert said the company certainly has enough experts to roll its own Linux. It does not; it simply uses Canonical's Ubuntu Server distribution. For the databases behind Git code storage and other parts of the GitHub repository service, such as access control, GitHub relies on the MySQL relational database. GitHub supports its own Linux and MySQL software, as well as Ruby and Rails. GitHub employs leading maintainers from the Ruby and Rails communities, so it can be said to do its own technical support within those communities. In fact, as the application has scaled up, GitHub has come to run customized versions of Ruby and Rails.
Fork Code
"When it comes to the data, that is where the real scale problem is for us, and we store the data elastically and in a highly available way," Lambert says. "It is about making Git scalable and easy to use, because it was never designed with that in mind. By our measure, GitHub is one of the largest Ruby on Rails applications, and not many companies run Ruby at this scale. We stay lean and keep optimizing to keep it that way.
"We are not quite at the stage of what Facebook has done with PHP and HipHop, but we have people dedicated to working on the Ruby core to make it faster and leaner."
GitHub has tuned the Ruby interpreter and written its own garbage collection routines, and it is also keen to track down Ruby and Rails bugs as fast as possible, get the fixes into the GitHub application, and push them out to the Ruby and Rails communities. (Ruby development is hosted on GitHub, as is Rails development. MySQL development has only just moved over, since it took Oracle some time to do so.)
GitHub may be a machine for developers to fork code like crazy (well, to fork code, at least), so it is not surprising that GitHub is wary of the effort of forking its own stack. Lambert explains:
"The reason we keep GitHub as a Ruby on Rails application is that it is very easy and fast to learn. People start working on GitHub on their first day at the company. We really do not want a bespoke and different build, because if we had one, we would lose all the benefits of the community. This is also what informs our database selection, because MySQL is what everyone uses. If you run into a MySQL problem, it is a known one, and you will not hit an obscure error message that nobody understands. There are no strange errors without answers, because someone has already run into the problem you have."
GitHub's infrastructure has web servers, proxy servers, authentication servers, and a bunch of systems that run analytics on repositories, on uploads and commits, and on the millions of hosted projects, but the real core is the repositories themselves. Most of this data is text, of course, which does not take up much space compared with photos and with richer media such as video and audio, which are what fill up the disk drives behind the Internet.
Curiously, GitHub does not compress its text data with traditional data compression, but it has its own way of saving space. If a project is forked, only the changes relative to the original are saved in the fork. (We assume this approach also makes it easy to identify changes and to walk through each fork.) If GitHub saved a full copy of everything for every change and every fork, it would quickly hold countless petabytes of data, and traditional data compression would slow the system down. As it is, even with programmers adding hundreds of gigabytes of new data per day, the entire GitHub repository store is measured in hundreds of terabytes.
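One way to picture a fork that stores only its changes is a shared, content-addressed object store. The sketch below is an illustration under that assumption, not a description of GitHub's back end, which the article leaves unspecified. Git itself identifies objects by a hash of their content; the `ObjectStore` and `Repo` classes and the file names here are hypothetical.

```python
import hashlib


class ObjectStore:
    """Content-addressed store: identical content is kept only once."""

    def __init__(self):
        self._blobs = {}

    def put(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        self._blobs.setdefault(key, data)  # no-op if the content already exists
        return key

    def get(self, key: str) -> bytes:
        return self._blobs[key]

    def stored_bytes(self) -> int:
        return sum(len(blob) for blob in self._blobs.values())


class Repo:
    """A repository is just a mapping from path to object key."""

    def __init__(self, store, files=None):
        self.store = store
        self.files = dict(files or {})

    def write(self, path, data):
        self.files[path] = self.store.put(data)

    def read(self, path):
        return self.store.get(self.files[path])

    def fork(self):
        # The fork copies only the small path -> key table, not the content.
        return Repo(self.store, self.files)


store = ObjectStore()
origin = Repo(store)
origin.write("README", b"hello world\n")
origin.write("app.rb", b"puts 'v1'\n")
before = store.stored_bytes()

fork = origin.fork()
fork.write("app.rb", b"puts 'v2'\n")  # the only new content in the fork
added = store.stored_bytes() - before
```

In this model the fork costs nothing until it diverges, and then it costs only the bytes of the changed file; the original repository is unaffected by the fork's edits.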
If the Internet worked the way GitHub forks do, its many cat pictures could all be stored as changes against a single master cat picture. (Translator's note: this is a metaphor for the way a GitHub fork stores only the differences from the original.) (We are only half joking.)
"A lot of companies say they have reached terabytes and petabytes of data, and when you ask them what that data is, it is usually just rubbish," Lambert said with a smile. "Most big data companies are just storing events, and those are basically useless. We are very proud that we have stayed lean and optimized and that we do not store large amounts of useless data. Compared with our competitors, our ratio of storage to repositories shows that we are very, very lean. We try to store as little data as we can, and we have some very smart stuff in the back end that keeps forks lightweight. We have a lot of Git, but we do our best to optimize it."
Looking back at how GitHub has developed, the company is old school in owning its hardware, yet it can still quickly and easily get the storage and compute capacity it specifies and bring it online.
"We are always one step ahead. I would not call it strain, but there is pressure," said Lambert, who did not specify how fast the clusters are growing. "We get hundreds of gigabytes of new data per day, and repository usage has grown rapidly, but we have built the infrastructure to keep pace with the growth of the business because our planning has gone well, and there are no signs of it slowing down."
If GitHub is like any other hyperscaler, the growth of its infrastructure lags the growth of whatever is driving that infrastructure. Scaling services, storage, and users is hard, which is why there is so much engineering creativity at the hyperscalers.
Using a public GitHub repository is free, but the code in it can be fetched and forked by anyone who is interested. GitHub also offers private repositories, which is how it plans to make money. Prices range from $7 per month for a personal plan with 5 private repositories to $200 per month for a business plan under which a team of programmers can share 125 private repositories. Companies that want to run GitHub internally to develop code can buy a GitHub Enterprise license, which costs $2,500 per year for 10 seats and has the same look and feel as GitHub. GitHub Enterprise can be installed on on-premises hosts or on the Amazon Web Services or Microsoft Azure public clouds. At present GitHub and GitHub Enterprise are maintained by the same support team, but if you do internal development on GitHub Enterprise and want to open source the code to GitHub, there is no automated way to do it. Lambert hinted, though, that there is room for one.
In addition to the core Ruby on Rails application and the storage algorithms that keep Git code on file servers, GitHub is also working on other applications. "Some technology you just cannot pull off the shelf. Because we are the world's biggest code host, we have a lot of problems that are specific to our field," said Lambert.
One key area of development is providing richer project analytics and work analytics for programmers, since many companies use open source software to attract talent. GitHub is also expanding into new markets in which changing and forking documents is part of the collaborative process. Just as teams inside GitHub use the tool to track projects, architects, musicians, and other craftspeople have started using it, which could give GitHub another wave of growth.
GitHub raised $100 million from Andreessen Horowitz in its first venture round in July 2012, and in a second round this July it raised another $250 million from Sequoia Capital, Andreessen Horowitz, Thrive Capital, and Institutional Venture Partners. The company has not yet gone public, but that financing valued it at about $2 billion, and it has more cash, a growing base, and an expanding target market.
ChatOps Culture and distributed development
One important GitHub innovation is, strictly speaking, not part of the code, but it is definitely part of the company: Hubot, a chat bot that the company uses as its system management interface. The approach is often called ChatOps. It gives deployment operations aliases in the form of chat commands and uses the chat bot to carry out DevOps work in the same way. It is used all over GitHub.
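The idea of aliasing operations as chat commands can be sketched in a few lines. This is a minimal illustration and not Hubot's actual API; the `ChatBot` class, the `deploy` command, and its message format are hypothetical, and a real bot would call deployment tooling instead of appending to a list.

```python
import re


class ChatBot:
    """Maps chat messages to operations via regular expressions."""

    def __init__(self):
        self._handlers = []

    def respond(self, pattern):
        """Decorator that registers a handler for messages matching pattern."""
        def register(func):
            self._handlers.append((re.compile(pattern), func))
            return func
        return register

    def handle(self, message):
        """Run the first matching handler and return its reply."""
        for regex, func in self._handlers:
            match = regex.fullmatch(message.strip())
            if match:
                return func(*match.groups())
        return "Sorry, I don't know that command."


bot = ChatBot()
deploy_log = []  # stands in for a real deployment system


@bot.respond(r"deploy (\w[\w-]*) to (\w+)")
def deploy(app, env):
    deploy_log.append((app, env))
    return f"Deploying {app} to {env}"


reply = bot.handle("deploy github to production")
```

Because every command and its reply pass through the chat room, the whole team sees who ran what and when, which is a large part of the appeal of this style of operations.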
What open source software is used inside GitHub's systems?