Ele.me, valued at 60 billion: the architecture design and evolution path behind more than 9 million daily orders


The Ele.me website was born in 2009, founded by Zhang Xuhao and his classmates while they were still students at Tongji University.
Like many students, they didn't much care for their majors; rather than sit in the lab getting nowhere, they preferred playing games. Eventually they began figuring out how to start a business, and after trying two other projects, they finally settled on food delivery.
At first they took orders by phone, did the delivery themselves, and took a commission from restaurants, and the business slowly grew. Later, as orders picked up, they got busier, and reconciliation became troublesome and error-prone, so Zhang Xuhao decided to build a website.
There was no suitable .com domain available, so they came up with the name "ele.me" ("hungry yet?") and registered that domain. The company also got a name: Lazas, which sounded quite impressive. He later explained that the word means "passion" in Greek.
Short on funds, they promoted the site with posters printed and hauled around by tricycle, slowly building up some name recognition. Zhang worked hard on customer relations with the restaurants, often spending time with the owners, even going to the baths together, to lock in exclusivity.
From 2011 to 2015, Ele.me received investment from multiple VC institutions, including capital and traffic support from Tencent and Alibaba. Daily takeaway orders grew from the hundreds of thousands into the millions: around 3.3 million a day in 2015, 5 million in 2016, and more than 9 million by early 2017.
The Ele.me app has a portal advantage.

Overview
The mobile internet era has evolved hand in hand with its technology. Today the app has become the core channel through which most internet companies reach their users. As business volume grows, ever larger and more numerous apps keep challenging the depth of knowledge of every frontend and backend engineer, and it is by continually meeting those challenges that engineers have built today's mobile internet era.
Every technical person faces this challenge once site traffic gets large enough: more users, more business volume, and ever pickier users, while the frontend and backend architectures quietly, continuously evolve underneath.
Talking about architecture divorced from the business is just posturing; the development of each platform's backend architecture is a mirror of its business development.
At the beginning, the Ele.me website was probably little more than an idea: a business model, and the need to build it quickly.
"Fast" is the first place, and it doesn't take much effort to design the architecture.
Only when the site entered its expansion phase did it need real investment of energy, so the architecture could carry the exploding traffic. The Ele.me website has now been running for 8 years; daily order volume has broken 9 million, and it now has a fairly mature architecture.
I. Website Infrastructure
From the start, Ele.me used an SOA framework that made the system easier to scale. The SOA framework solves two problems:
1. Division of labor and collaboration
In the site's early days there may have been only one to five programmers, all busy on the same thing and familiar with each other's work; problems could often be solved just by shouting across the room.
As headcount grew, that approach clearly stopped working; one person updating code cannot mean re-releasing everyone else's code along with it. So the question of division of labor and collaboration arose.
2. Rapid expansion
Early on, order volume might grow from 1,000 to 10,000 a day. That is a tenfold increase, but the absolute numbers are still low, so the pressure on the site is not that great.
The real challenge is when orders go from 100,000 to 1 million, and from 1 million toward 10 million. The multiplier is still roughly ten, but the pressure on the entire site's architecture is enormous.
As mentioned earlier, from breaking 1 million orders in 2014 to 9 million today, the technical team has grown from just over 30 people at the start to more than 900.
Division of labor becomes more and more of a challenge. Services are split and split again, teams are divided and divided again, and a framework is needed to support all of this; that, too, is a function of the SOA framework.
In the architecture diagram, the middle is the architecture platform, the system-level view of the whole architecture; on the right are some service-related fundamentals, including underlying components and services:
First, the development language: our site started in PHP, then slowly began to migrate.
The founders were all college students, so the natural first choice was Python, and Python would still be a fine choice today; but we also brought in Java and Go. Here is why: lots of people write Python, yet few write it truly well, and as the business grows, more developers are needed. Given Java's mature ecosystem and Go's emerging one, we ended up with a multi-language technical ecology in which Python, Java, and Go coexist.
Within this architecture:
The Web API layer mainly handles HTTPS offloading, rate limiting, security checks, and other common operations that are independent of business logic.
The Service Orchestrator is the orchestration layer; through configuration, it performs protocol conversion between the internal and external networks, and aggregates and trims services.
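As a minimal sketch of what such an orchestration layer does (the service names, response fields, and the call_internal helper are all invented for illustration; this is not Ele.me's actual framework), an external request can fan out to several internal RPCs, with the aggregated result trimmed down to only the fields the client needs:

```python
# Hypothetical orchestration-layer sketch: convert an external request
# into internal RPC calls, aggregate the results, and trim fields.
# Service names and fields are illustrative only.
import asyncio


async def call_internal(service: str, method: str, **params) -> dict:
    """Stand-in for an internal RPC over the intranet protocol."""
    await asyncio.sleep(0)  # placeholder for real network I/O
    fake = {
        "order-service": {"status": "delivered", "internal_risk_flags": []},
        "shop-service": {"name": "Noodle House", "settlement_id": "s-001"},
    }
    return fake[service]


async def get_order_page(order_id: str) -> dict:
    # Aggregation: fan out to several internal services concurrently.
    order, shop = await asyncio.gather(
        call_internal("order-service", "get_order", order_id=order_id),
        call_internal("shop-service", "get_shop", order_id=order_id),
    )
    # Trimming: expose only what the mobile client needs; internal
    # fields like settlement_id and risk flags never leave the intranet.
    return {"order_id": order_id, "status": order["status"],
            "shop_name": shop["name"]}


print(asyncio.run(get_order_page("42")))
```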
On the right side of the architecture diagram are auxiliary systems built around these service frameworks, such as a job system for running scheduled tasks. We have nearly 1,000 services; how are they all watched? There has to be a monitoring system. With just over 30 people at the start, we could get by logging into machines to search logs; with more than 900, you cannot have everyone searching logs on the machines, so you need a centralized logging system. The other systems won't be expanded on here.
Rome was not built in a day; the infrastructure, too, was an evolutionary process.

II. Service Splitting
As the site grew, the original architecture could no longer keep pace. The first thing we had to do was split the big repo into small repos and the big service into small services, and move our shared basic services out onto separate physical machines.
It took more than a year to complete the service split, which was a long process.
In this process, the first thing is to define the APIs well, because once an API is online, the cost of changing it is substantial. Many people will depend on your API, and often you don't even know who they are; that is the big problem.
Then we abstract out the basic services. Many of these were originally coupled inside business code. Take payment: when the business was simple, tightly coupled code didn't matter, but as more and more business lines come to need payment, you cannot have each one implement payment itself. So we pulled out these basic services: a payment service, an SMS service, a push service, and so on.
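As a hedged sketch of what "pulling out" payment might look like (the endpoint, fields, and the PaymentClient name are assumptions made for the example), every business line talks to one shared payment service through a thin client, instead of re-implementing payment itself:

```python
# Illustrative only: a thin client for a shared payment service.
# The URL scheme and request fields are invented for this sketch.
import json
import urllib.request


class PaymentClient:
    def __init__(self, base_url: str):
        self.base_url = base_url

    def charge(self, order_id: str, amount_cents: int) -> dict:
        payload = json.dumps(
            {"order_id": order_id, "amount_cents": amount_cents}
        ).encode()
        req = urllib.request.Request(
            f"{self.base_url}/v1/charges",
            data=payload,
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)


# Every business line shares the same client and the same /v1 contract,
# so the payment team can evolve the implementation behind a stable API.
```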
Splitting services may look simple and unglamorous, but that is exactly where we had to start. In fact, during this period, all the earlier architecture work could have been postponed, because skipping architecture adjustments won't kill you, but failing to split the services really will.
Service splitting is necessarily a long process, and a genuinely painful one; it also requires a lot of supporting systems and engineering.
III. The Release System
Releases are the biggest source of instability. Many companies strictly restrict release time windows, for example: releases only on two days a week, absolutely no releases on weekends, and absolutely none during business peak hours.
We found that the biggest problem with releases is the lack of a simple, executable rollback operation. Who executes the rollback: the publisher, or someone else? If it's the publisher, the publisher isn't online 24 hours a day, and when a problem hits, nobody may be available. If someone else performs the rollback but there is no simple, unified rollback operation, that person would need to be familiar with the publisher's code, which is basically infeasible.
So we need a release system. The release system defines a unified rollback operation, and all services must implement the rollback operation as the release system defines it.
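A sketch of what such a unified contract could look like (the Release class and the deploy-tool command below are placeholders, not Ele.me's real system): every release carries the previous version with it, so rollback is always the same one-step operation, executable by anyone on call.

```python
# Hypothetical release-system contract: every release knows how to roll
# itself back with one uniform operation. "deploy-tool" is a placeholder
# for whatever actually switches the running artifact.
import subprocess


class Release:
    def __init__(self, service: str, new_version: str, old_version: str):
        self.service = service
        self.new_version = new_version
        self.old_version = old_version  # recorded so rollback is trivial

    def deploy(self) -> None:
        self._switch_to(self.new_version)

    def rollback(self) -> None:
        # One-step, service-agnostic rollback: re-point the service at
        # the previous artifact. No service-specific knowledge required.
        self._switch_to(self.old_version)

    def _switch_to(self, version: str) -> None:
        subprocess.run(["deploy-tool", self.service, version], check=True)
```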
At Ele.me, integration with the release system is mandatory for everyone; all systems must be fully connected to it. The release system's framework is genuinely important to the company and deserves a place at the top of the priority queue.
IV. Service Framework
Next comes Ele.me's service framework: splitting a big repo into small repos and a big service into small services, so that our services are as decoupled and spread out as possible. This requires a distributed service framework to support it.
The distributed service framework includes service registration, discovery, load balancing, routing, flow control, circuit breaking, degradation, and other functions; I won't expand on each of them here.
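To give a feel for the first two of these, here is a minimal registration-and-discovery sketch with naive random load balancing (an illustration of the concepts, not Ele.me's framework; a real framework layers health checks, routing, flow control, circuit breaking, and degradation on top):

```python
# Minimal service registration / discovery with random load balancing.
import random
from collections import defaultdict


class Registry:
    def __init__(self):
        self._instances = defaultdict(set)  # service name -> {"host:port"}

    def register(self, service: str, addr: str) -> None:
        self._instances[service].add(addr)

    def deregister(self, service: str, addr: str) -> None:
        self._instances[service].discard(addr)

    def discover(self, service: str) -> str:
        addrs = sorted(self._instances[service])
        if not addrs:
            raise LookupError(f"no live instance of {service}")
        return random.choice(addrs)  # naive client-side load balancing


registry = Registry()
registry.register("payment-service", "10.0.0.1:8000")
registry.register("payment-service", "10.0.0.2:8000")
print(registry.discover("payment-service"))
```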
As mentioned earlier, ours is a multilingual ecology of Python and Java, so our service framework must correspondingly support multiple languages.
This influenced some of our later middleware choices, such as the DAL layer.
V. DAL: The Data Access Layer
When the volume of business becomes larger, the database becomes a bottleneck.
In the early stages you can improve database performance by upgrading hardware: for example, moving to a machine with more CPUs, or replacing hard drives with SSDs or higher-end storage.
But hardware upgrades ultimately hit a capacity ceiling. And many business colleagues write code that operates on the database directly; more than once the database has been blown up in production. Once the database is down, there is no chance of recovering the business until the database itself is restored.
If the data inside the database is intact, the business can actually be compensated afterwards. So when we built the DAL service layer, the first thing we did was rate limiting; everything else could wait. Then we did connection multiplexing. Our Python framework uses a multi-process, single-thread-plus-coroutine model.
Multiple processes cannot share one connection. For example, with 10 Python processes deployed on a single machine and 10 database connections per process, scaling out to 10 machines means 1,000 database connections. For a database, connections are very expensive, so our DAL layer does connection multiplexing.
This is not the service's own connection multiplexing but multiplexing at the DAL layer: the services may hold 1,000 connections to the DAL, while behind it the DAL maintains perhaps only a dozen or so connections to the database. Once the DAL sees that a database request is part of a transaction, it pins the connection for you; when the transaction ends, the database connection goes back into the shared pool for others to use.
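The mechanism can be sketched like this (a toy model of the idea, assuming DB-API-style connection objects; the real DAL is far more involved): single statements borrow a shared connection and return it immediately, while a transaction pins one connection until commit or rollback.

```python
# Toy sketch of DAL connection multiplexing: many callers in, few real
# database connections out. Assumes DB-API-style connection objects.
import queue


class DALPool:
    def __init__(self, make_conn, size: int = 10):
        self._idle = queue.Queue()
        for _ in range(size):        # e.g. keep only ~10 real connections
            self._idle.put(make_conn())

    def execute(self, sql: str):
        conn = self._idle.get()      # borrow for one statement only
        try:
            return conn.execute(sql)
        finally:
            self._idle.put(conn)     # immediately back in the shared pool

    def transaction(self) -> "_Txn":
        return _Txn(self._idle)


class _Txn:
    """Pin one real connection for the whole transaction, then release."""

    def __init__(self, idle: queue.Queue):
        self._idle = idle

    def __enter__(self):
        self._conn = self._idle.get()      # held until the txn ends
        self._conn.execute("BEGIN")
        return self._conn

    def __exit__(self, exc_type, exc, tb):
        self._conn.execute("COMMIT" if exc_type is None else "ROLLBACK")
        self._idle.put(self._conn)         # back to the shared pool
        return False
```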
Then we added circuit breaking. A database can be circuit-broken too: when the database starts to "smoke" (overload), we kill off some database requests to keep the database from collapsing. Many problems have answers that look simple, but the thinking and logic behind them are not; you should know not only that something works, but why.
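The "kill some requests" idea can be illustrated with a small error-rate breaker (the thresholds and shedding ratio below are invented for the sketch):

```python
# Illustrative DAL circuit breaker: when the database's recent error
# rate is too high, shed most requests so the database never collapses.
# All thresholds here are made up for the example.
import random
import time


class DBCircuitBreaker:
    def __init__(self, max_error_rate: float = 0.5, window_s: float = 10.0):
        self.max_error_rate = max_error_rate
        self.window_s = window_s
        self.errors = 0
        self.total = 0
        self.window_start = time.monotonic()

    def allow(self) -> bool:
        if time.monotonic() - self.window_start > self.window_s:
            self.errors = self.total = 0           # new sampling window
            self.window_start = time.monotonic()
        overloaded = (self.total >= 20 and
                      self.errors / self.total > self.max_error_rate)
        if overloaded:
            return random.random() < 0.1           # shed ~90% of traffic
        return True

    def record(self, ok: bool) -> None:
        self.total += 1
        self.errors += 0 if ok else 1
```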
VI. Service Governance
With the service framework in place, the issue of service governance begins. Service governance is actually a big concept. The first part is instrumentation: you have to bury a great many monitoring points.
For example, for each request: did it succeed or fail, and what was its response time? All of these monitoring metrics go into the monitoring system.
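A common way to bury such points is to wrap every request handler so that success/failure counts and response times are reported automatically; the sketch below assumes a simple emit() reporter standing in for the real monitoring client:

```python
# Sketch of request instrumentation ("burying monitoring points").
# emit() is a placeholder for the real monitoring client.
import functools
import time


def emit(metric: str, value, tags: dict) -> None:
    print(metric, value, tags)  # stand-in: would send to the monitor


def monitored(service: str):
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.monotonic()
            try:
                result = fn(*args, **kwargs)
                emit("request.success", 1, {"service": service})
                return result
            except Exception:
                emit("request.failure", 1, {"service": service})
                raise
            finally:
                elapsed_ms = (time.monotonic() - start) * 1000
                emit("request.latency_ms", elapsed_ms, {"service": service})
        return wrapper
    return decorator


@monitored("order-service")
def create_order(order_id: str) -> str:
    return f"created {order_id}"


create_order("42")
```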
We have a big monitoring wall displaying a large number of metrics, with a dedicated technical team watching it 24 hours a day; if any curve fluctuates, they find someone to fix it.
We also built an alerting system. The monitoring wall can only display so much, just the important key metrics; beyond that, you need an alerting system.
Rome was not built in a day, and the infrastructure is an evolving process. Our resources and time are always limited; as architects and CTO, how do you deliver the most important things with those limited resources?
We have built a lot of systems and feel we've done quite well, but actually not: it feels like we're still in the stone age, because the problems keep multiplying and the demands keep growing. You always feel your systems are missing something, and there are so many features you still want to build.
Take the flow-control system: today we still require users to configure a concurrency number. But should that number need manual configuration at all? Could the concurrency limit be controlled automatically based on the state of the service itself?
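One toy illustration of that question (an AIMD-style rule with invented parameters, offered only to make the idea concrete): raise the limit while observed latency is healthy, and halve it when latency degrades.

```python
# Toy adaptive concurrency control: derive the limit from the service's
# own observed latency instead of a user-configured number (AIMD-style;
# the target and step sizes are invented for this sketch).
class AdaptiveConcurrency:
    def __init__(self, limit: int = 100, target_ms: float = 200.0):
        self.limit = limit
        self.target_ms = target_ms

    def on_sample(self, latency_ms: float) -> int:
        if latency_ms <= self.target_ms:
            self.limit += 1                       # additive increase
        else:
            self.limit = max(1, self.limit // 2)  # multiplicative decrease
        return self.limit


ctrl = AdaptiveConcurrency()
for sample_ms in (120.0, 150.0, 450.0, 130.0):
    print(ctrl.on_sample(sample_ms))
```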
Then there's the upgrade path: SDK upgrades are a very painful thing. For example, our Service Framework 2.0 shipped last December, yet services are still running 1.0. The SDK cannot be upgraded losslessly, so we have to control the timing and rhythm of upgrades ourselves.
Also, our current monitoring system only aggregates at the service level; it is not broken down by cluster or by machine. Could future metrics be divided by cluster and by machine?
The simplest example: a service with 10 machines may have a problem on only one of them, but its metrics get averaged across the other 9. You only see the whole service's latency rise, when in fact a single machine may have dragged down the entire service cluster. We don't yet have these extra monitoring dimensions.
Then there is intelligent alerting. Alerting should be fast, complete, and accurate. Today we are fast and reasonably complete.
But how to be more accurate? At peak, more than 1,000 alerts go out in a single minute. Are all 1,000 of them useful? Too many alerts amounts to no alerts at all: everyone grows numb and stops looking.
How can alerts be made more accurate?
