How does Facebook do large-scale code deployment?

Source: Internet
Author: User


Translation: Bole Online (WRM). English original: code.facebook.com



Over time, the software industry has come up with several ways to deliver code faster, more safely, and with better quality. Most of these efforts center on ideas such as continuous integration, continuous delivery, agile development, DevOps, and test-driven development. All of these approaches share a common goal: to let developers get code to users quickly and correctly, in safe, small, incremental steps.


Facebook's development and deployment processes have grown organically to incorporate many of these rapid-iteration techniques without depending rigidly on any one of them. This flexible, pragmatic approach has allowed us to release our web and mobile products successfully on rapid schedules.


For many years, we pushed the Facebook front end three times a day using a simple trunk and release branch strategy. Engineers would select code changes on the trunk that had passed a series of automated tests and pull them into the daily release branch (a process known as "cherry-picking"). We typically handled 500 to 700 cherry-picks per day. Changes that were not cherry-picked went out with the weekly release branch.
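The branch/cherry-pick flow described above can be sketched roughly as follows. This is only an illustration, not Facebook's tooling: the `Change` type, its fields, and `split_for_release` are hypothetical names, and the rule that untested or unrequested changes simply wait for the weekly branch is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    """A hypothetical code change on trunk (not a real Facebook type)."""
    change_id: str
    tests_passed: bool
    cherry_pick_requested: bool


def split_for_release(trunk_changes):
    """Partition trunk changes into (daily, weekly) lists.

    A change joins the daily release branch only if an engineer requested
    a cherry-pick and it passed the automated tests. Everything else is
    assumed to wait for the weekly release branch.
    """
    daily, weekly = [], []
    for change in trunk_changes:
        if change.cherry_pick_requested and change.tests_passed:
            daily.append(change)
        else:
            weekly.append(change)
    return daily, weekly
```

With 500 to 700 cherry-picks a day, the `daily` list is the part that release engineers had to coordinate by hand, which is where the model eventually ran into trouble.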



From the few engineers we had in 2007 to the thousands we have today, this system scaled well. The good news is that as we added engineers, we got more done: the rate of code delivery grew in proportion to the size of the team. However, even with the right tools and automation, release engineers still had to put in a certain amount of manual effort to complete the daily and weekly pushes. We knew that batching up ever-larger chunks of code for delivery would not remain sustainable as the team kept growing.


By 2016, we could see that the branch/cherry-pick model was reaching its limit. We were taking in more than 1,000 changes a day on the main branch, and the weekly push sometimes contained as many as 10,000 changes. Coordinating and delivering a release of that size took a great deal of manual work every week, which was not sustainable.


In April 2016, we decided to move facebook.com to a quasi-continuous "push-from-master" system. Over the next year, we rolled it out gradually, first to 50% of employees, then to 0.1% of production, then 1%, and then 10%. Each step let us test how well our tools and processes handled the increasing push frequency and gave us real-world feedback. Our main goal was to make sure the new system made people's experience better, or at least did not make it worse. After almost a full year of planning and development, over three days in April 2017 we enabled 100% of our production environment to run code deployed directly from master.


Large-scale continuous delivery


A true continuous delivery system would release every individual commit to production as soon as it lands. At Facebook's rate of code submission, we instead needed a system that could push dozens to hundreds of changes every few hours. The changes in this quasi-continuous delivery model are usually small and incremental, and they rarely have a noticeable effect on the user experience. Each release is rolled out to 100% of production in tiers over a few hours, so we can stop the push at any point if we find a problem.



First, changes that have landed on the main branch and passed a series of automated tests are pushed to Facebook employees. Any regression introduced at this stage triggers a push-blocking alert, and an emergency stop button lets us keep the release from going any further. If all goes well, we push the changes to 2% of production, where we continue to collect signals and monitor alerts, especially for edge cases that our automated tests and internal employee testing did not catch. Finally, we roll the changes out to 100% of production, where a tool called Flytrap collects user reports and alerts us when anomalies occur.
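The tiered push described above can be sketched as a simple loop. This is a minimal illustration under stated assumptions: the stage names and percentages follow the article, but `run_rollout`, `deploy`, and `is_healthy` are hypothetical names, and the health check stands in for the push-blocking alerts and the emergency stop button.

```python
from typing import Callable, List, Tuple

# Stages follow the article: employees first, then 2%, then 100% of production.
STAGES: List[Tuple[str, float]] = [
    ("employees", 0.0),          # internal only, no production traffic
    ("production-2%", 2.0),
    ("production-100%", 100.0),
]


def run_rollout(deploy: Callable[[str, float], None],
                is_healthy: Callable[[str], bool]) -> List[str]:
    """Deploy stage by stage and stop at the first unhealthy stage.

    Returns the names of the stages that completed successfully.
    """
    completed = []
    for name, percent in STAGES:
        deploy(name, percent)
        if not is_healthy(name):
            # Stand-in for a push-blocking alert or the emergency stop button.
            break
        completed.append(name)
    return completed
```

Because the loop breaks as soon as a stage reports a problem, a regression caught by employees never reaches the 2% tier, and one caught at 2% never reaches everyone.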


Many changes are initially kept behind our Gatekeeper system, which lets us release mobile and web code independently of the new features it contains and helps reduce the risk that any particular update causes a problem. If we do find a problem, we can simply switch the gatekeeper off instead of reverting to a previous version or fixing forward.
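The idea of shipping code dark and turning features on separately can be illustrated with a small feature gate. This is not Gatekeeper's actual API, which the article does not describe; the `Gate` class, its fields, and the hash-based bucketing are assumptions chosen for the sketch.

```python
import hashlib


class Gate:
    """A hypothetical percentage-based feature gate with a kill switch."""

    def __init__(self, name: str, rollout_percent: float, enabled: bool = True):
        self.name = name
        self.rollout_percent = rollout_percent
        self.enabled = enabled

    def allows(self, user_id: str) -> bool:
        # The kill switch wins over everything else.
        if not self.enabled:
            return False
        # Hash gate name + user id so each user lands in a stable bucket 0..9999.
        digest = hashlib.sha256(f"{self.name}:{user_id}".encode()).hexdigest()
        bucket = int(digest, 16) % 10_000
        # 1% corresponds to 100 buckets; round to avoid float artifacts.
        threshold = round(self.rollout_percent * 100)
        return bucket < threshold
```

Setting `enabled = False` hides the feature for every user immediately, without a new deployment, which is the property the paragraph above relies on.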


This quasi-continuous release cycle has several advantages:


1. It removes the need for hotfixes


Under the three-pushes-a-day strategy, if a critical change had to go out immediately rather than at its scheduled push time, someone had to apply a hotfix. These out-of-band pushes were disruptive because they usually required manual work and could collide with the next scheduled push. In the new system, the vast majority of changes that would have needed a hotfix can simply be committed to master and go out in the next release.


2. Better support for a global engineering team


We tried to schedule the three daily pushes to accommodate our engineering offices around the world. Even so, the weekly push required all engineers to pay attention at a specific date and time, which was not always convenient in their time zones. The new quasi-continuous system means that engineers everywhere can develop and deliver code when it suits them.


3. It forces us to develop next-generation tools, automation, and processes that let the company scale


The project served as a stress test across many teams and systems. We improved our push tools, diff review tools, test infrastructure, capacity management systems, traffic routing systems, and many other areas. These teams came together because they wanted to see a faster release cycle and an automated deployment system succeed as quickly as possible. The improvements we made will help ensure the company is ready for future growth.


4. It makes the user experience better, faster


When it takes days or weeks to see how code behaves in production, the engineers who wrote it may already have moved on to something new. With continuous delivery, engineers do not have to wait a week or more for feedback on the code they commit. They can learn quickly what isn't working and make small enhancements or fixes right away, rather than waiting for the next big release. From an infrastructure perspective, the new system also lets us respond better to rare events that could affect users. Ultimately, this brings engineers closer to users, which helps both product development and product reliability.


Continuous release to mobile


Building a quasi-continuous system was possible on the web because we own the entire technology stack and could build or improve the tools we needed to make it a reality. Mobile is more challenging, because many of the existing development and deployment tools for mobile platforms make rapid iteration difficult.

Facebook has worked to improve this by building and open-sourcing a full set of tools for rapid mobile development, including Nuclide, Buck, Phabricator, a variety of iOS libraries, React Native, and Infer. Together, this build and test stack lets us produce high-quality code that is ready for rapid deployment to mobile platforms.


Our continuous integration stack is divided into three layers: build, static analysis, and testing.



When a developer commits code to the mobile main branch, every affected product is built. On mobile, this means building Facebook, Messenger, Pages Manager, Instagram, and other apps on every commit. We also build multiple variants of each product to make sure we cover all the chip architectures and simulators those products support.


During the build, we run linters (small tools that check code style and catch errors) together with our static analysis tool, Infer. This helps us catch null pointer exceptions, resource and memory leaks, unused variables, and risky system calls, and it flags violations of Facebook's coding rules.


The third system, which runs in parallel with the other two, is mobile test automation. It includes thousands of unit tests, integration tests, and end-to-end tests driven by tools such as Robolectric, XCTest, JUnit, and WebDriver.
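The three layers described above (build, static analysis, and tests) can be pictured as independent checks that run side by side for each commit. The sketch below is only an illustration: `run_ci`, `commit_is_green`, and the placeholder checks are hypothetical, and real checks would invoke tools such as Buck or Infer rather than return a constant.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict


def run_ci(commit: str, checks: Dict[str, Callable[[str], bool]]) -> Dict[str, bool]:
    """Run every check for a commit in parallel; return {check name: passed}."""
    with ThreadPoolExecutor(max_workers=max(1, len(checks))) as pool:
        futures = {name: pool.submit(fn, commit) for name, fn in checks.items()}
        return {name: future.result() for name, future in futures.items()}


def commit_is_green(results: Dict[str, bool]) -> bool:
    """A commit is green only if every layer passed."""
    return all(results.values())


# Placeholder checks; each stands in for one layer of the stack.
example_checks = {
    "build": lambda commit: True,
    "static_analysis": lambda commit: True,
    "tests": lambda commit: True,
}
```

Calling `commit_is_green(run_ci("abc123", example_checks))` returns `True` for these placeholders; a single failing layer is enough to mark the commit as not green.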


The build and test stack runs not only on every commit but also several times over the life cycle of a code change. On Android alone, we run 50,000 to 60,000 builds a day.



By applying traditional continuous delivery techniques to our mobile stack, we have gone from releasing every four weeks to every two weeks, and now to every week. On mobile we currently use the same strategy we previously used on the web: the branch/cherry-pick model. Although we release only one version per week, it is still important to test the code in real-world conditions as early as possible so that engineers get feedback quickly. We make a new mobile release candidate available every day to our canary users, including about 1 million Android beta testers.



Over the same period in which our release frequency increased, our mobile engineering team grew by a factor of 15 and our code delivery speed rose considerably. Even so, our data from 2012 to 2016 shows that engineer productivity on both Android and iOS stayed constant, whether measured by lines of code or by the number of pushes. Similarly, the number of critical issues arising from mobile releases stayed almost constant regardless of the number of deployments, which indicates that our code quality does not suffer as we scale.


As existing tools and methodologies continue to advance, this is an exciting time to work in release engineering. I'm very proud of the teams at Facebook that worked together to give us what I believe is the most advanced web and mobile deployment system at this scale. All of this was possible in part because of a strong central release engineering team that is treated as a first-class citizen within infrastructure engineering. Facebook's release engineering team will continue to drive initiatives that improve the release process for developers and users, and we will keep sharing our experience, tools, and best practices.


