Taobao's big data road

Source: Internet
Author: User
Tags: big data, data synchronization, data warehouse, big data platform, data application

Taobao has developed rapidly since its launch in 2003. Behind 13 years of explosive business growth is a technology platform that has been continuously improved, and the Taobao big data platform is a very important part of it, carrying responsibility for data collection, data processing, and data application. On its way to today the platform has gone through three major stages, each facing different challenges. Based on my own understanding, I will look back at the story that big data has lived through over these years.


The first stage: the RAC era

Before 2008, Taobao ran on a single-node Oracle database. At that point it could not really be called a data warehouse: it handled only simple data processing work and had essentially no data warehouse architecture. As the business grew rapidly, the single Oracle node, which had no way to scale out, soon could not keep up in either compute or storage capacity.

After 2008, to cope with the growing data volume, an Oracle RAC cluster was introduced. It gradually grew from 4 nodes at the start to 20 nodes, becoming the largest RAC cluster in the world at that time, and it was also featured as a classic case on the Oracle official website. At the time, the RAC cluster performed very well in stability, security, storage capacity, and computing power, and the first-generation data warehouse architecture gradually took shape.

At this stage, ETL was implemented mainly with Oracle stored procedures. A large number of SQL script tasks ran on the cluster, and their scheduling was controlled and managed by Crontab. As the number of tasks kept growing, the biggest problem became how to ensure that thousands of scripts ran correctly every day, and how to detect and fix errors in time when something went wrong. This plagued the developers daily and kept them in constant firefighting mode. To solve it, the data team began developing its own scheduling system, named the Skynet scheduling system, which formed the architecture and prototype of the first-generation scheduling system.
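To make the limitation of time-based Crontab scheduling concrete, here is a minimal sketch of dependency-aware scheduling in Python. It is not Skynet's actual design, which the article does not describe. The task names, the `run_pipeline` helper, and the simulated failure are all hypothetical. The idea is that a task runs only after its upstream tasks succeed, and a failure raises an alert and blocks downstream tasks instead of letting them run on bad data.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_pipeline(tasks, deps):
    """Run tasks in dependency order.

    tasks: hypothetical mapping of task name -> zero-argument callable
    deps:  mapping of task name -> set of upstream task names
    Returns a mapping of task name -> "ok", "failed", or "skipped".
    """
    status = {}
    # static_order() yields every task after all of its upstream tasks.
    for name in TopologicalSorter(deps).static_order():
        blocked = [up for up in deps.get(name, ()) if status[up] != "ok"]
        if blocked:
            # An upstream task failed or was skipped, so do not run this one.
            status[name] = "skipped"
            continue
        try:
            tasks[name]()
            status[name] = "ok"
        except Exception as exc:
            status[name] = "failed"
            print(f"ALERT: task {name!r} failed: {exc}")
    return status


def failing_task():
    # Stand-in for a SQL script that breaks during the nightly run.
    raise RuntimeError("simulated SQL error")


tasks = {
    "load_orders": lambda: None,
    "load_users": lambda: None,
    "clean_orders": failing_task,
    "daily_report": lambda: None,
}
deps = {
    "load_orders": set(),
    "load_users": set(),
    "clean_orders": {"load_orders"},
    "daily_report": {"clean_orders", "load_users"},
}

status = run_pipeline(tasks, deps)
print(status)
```

With plain Crontab entries, `daily_report` would start at its scheduled time whether or not `clean_orders` had succeeded. In this sketch it is marked as skipped and the failure is reported once, at its source.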


The second stage: the Hadoop era

The launch of the scheduling system ended the daily firefighting, but the good times did not last. In 2008, Taobao's new B2C platform Taobao Mall (the predecessor of Tmall) went online; in 2009, Taobao became the largest comprehensive marketplace in China; on January 1st, Taobao released a new homepage, and further new launches followed. This rapid business growth brought a challenge for data: the volume processed every day kept doubling. The first bottleneck appeared with the website's access log data, which the RAC cluster could no longer handle. Although a RAC cluster can be expanded to a degree, it cannot scale linearly without limit, and expansion means high hardware and software costs. To cope with the growing data volume, in 2009 the data team began exploring new technology and evaluated two directions in parallel, Greenplum and Hadoop, mainly for the massive log data scenario. Hadoop was attractive because of its good linear scalability and because it is open source, so features suited to Taobao could be developed on top of the official version.

At the beginning of 2010, the final decision was made to abandon Greenplum and RAC and adopt Hadoop across the board. It was at this time that I joined the Taobao data team. Soon after, the data team started the O project. Over more than a month, the entire data team rewrote all the stored procedures on RAC as Hive and MapReduce scripts and moved all the data to Hadoop. The Hadoop cluster was named Yunti I (Ladder 1), and the data warehouse architecture of the Hadoop era took shape.

From the end of 2010, data application scenarios multiplied. Quantum Statistics (the official Taobao version) was released at the end of 2010. On April 1, 2011, Taobao released Data Cube, which opened its data to the outside world. The advertising and search teams also applied large amounts of data in their business systems, and internal data products became more and more mature. With data being applied at this scale, the problem became how to guarantee the accuracy and stability of the data across the whole process, from data collection through data processing to the final data application.

The first link to run into trouble was data synchronization. The business systems had a variety of data sources, including Oracle, MySQL, the log system, and crawler data, and there were just as many synchronization methods: some used shell scripts, some used Jdbcdump, and some used other approaches. For the engineers responsible for data synchronization, the most painful moments came when a business system changed its database: all the synchronization tasks had to be adjusted again and again, and adjusting hundreds of tasks at a time was extremely error-prone. To solve this, the data tools team began developing a dedicated synchronization tool, DataX, the predecessor of the synchronization center. At the same time it developed Dbsync, a real-time synchronization tool for databases, and TT for logs, which is still called TT today.
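The pain described above comes from every source having its own hand-written synchronization script. One common way to tame that is sketched below: each source and target type is implemented once as a reader or writer plugin, and every synchronization task becomes a small config that names a reader and a writer. This is an illustrative sketch only and does not reproduce DataX's real configuration format or API. The registries, the plugin names (`csv_text`, `log_lines`, `memory`), and `run_job` are all hypothetical, and in-memory data stands in for real databases.

```python
import csv
import io

# Hypothetical plugin registries: source/target type -> implementation.
READERS = {}
WRITERS = {}


def reader(kind):
    def register(fn):
        READERS[kind] = fn
        return fn
    return register


def writer(kind):
    def register(fn):
        WRITERS[kind] = fn
        return fn
    return register


@reader("csv_text")
def read_csv_text(cfg):
    # Stand-in for a database export: yields one dict per row.
    yield from csv.DictReader(io.StringIO(cfg["text"]))


@reader("log_lines")
def read_log_lines(cfg):
    # Stand-in for an access log: "<ip> <url>" per line.
    for line in cfg["lines"]:
        ip, url = line.split(" ", 1)
        yield {"ip": ip, "url": url}


@writer("memory")
def write_memory(cfg, rows):
    # Stand-in for the warehouse: appends rows to a list, returns the count.
    count = 0
    for row in rows:
        cfg["sink"].append(row)
        count += 1
    return count


def run_job(job):
    """Run one synchronization task described purely by configuration."""
    rows = READERS[job["reader"]["type"]](job["reader"])
    return WRITERS[job["writer"]["type"]](job["writer"], rows)


warehouse = []
job = {
    "reader": {"type": "csv_text", "text": "id,name\n1,alice\n2,bob\n"},
    "writer": {"type": "memory", "sink": warehouse},
}
rows_synced = run_job(job)
print(rows_synced, warehouse)
```

When a business system moves to a different kind of source, only the `reader` section of the affected job configs changes. The transfer logic in `run_job` and the writer plugins stay untouched, which is what makes adjusting hundreds of tasks less error-prone.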

The Skynet scheduling system was also continuously improved. It began to support hourly and even minute-level scheduling, integrated system functions such as automatic alerting, and was upgraded to a cloud-based version. Related peripheral systems such as the DQC system, the data map, and data lineage analysis launched one after another, and the data team kept growing.

During this period, the influence of the Double 11 online shopping carnival kept growing. It became an annual event for China's e-commerce industry and gradually began to affect the international e-commerce industry as well, with constantly refreshed transaction records keeping everyone on edge. To give decision makers first-hand data in an intuitive way, the live data room application was created, which needed statistics for the day of the event. Before 2013, these figures were computed by hourly jobs on Hadoop, so the data always lagged behind. In 2013, the data team began investing in a real-time computing platform, now known as Galaxy, and in the same year its first application went online for Double 11: the real-time version of the Double 11 live data room.
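The difference between the hourly approach and the real-time approach can be illustrated with a toy example. The sketch below is not how Hadoop or Galaxy work internally. The function and class names, the timestamps, and the amounts are all made up for illustration. It only shows why an hourly batch total lags: orders in the hour that is still open are invisible until that hour closes, whereas an incrementally updated total reflects each order as it arrives.

```python
HOUR = 3600  # seconds


def batch_visible_total(orders, now):
    """Total as seen by an hourly batch job at time `now`.

    Only orders from fully completed hours are included, because the
    current hour has not been processed yet.
    """
    last_closed_hour = (now // HOUR) * HOUR
    return sum(amount for ts, amount in orders if ts < last_closed_hour)


class RunningTotal:
    """Total maintained incrementally, one order event at a time."""

    def __init__(self):
        self.total = 0

    def on_order(self, amount):
        self.total += amount
        return self.total


# (timestamp in seconds since the event started, amount in cents)
orders = [(10, 100), (3500, 50), (3700, 25)]
now = 3900  # 1 hour and 5 minutes into the event

batch_total = batch_visible_total(orders, now)

stream = RunningTotal()
for ts, amount in orders:
    if ts <= now:
        stream.on_order(amount)

print("hourly batch sees:", batch_total)
print("real-time sees:", stream.total)
```

In this example the hourly batch view misses the order placed at second 3700 because its hour is still open, while the running total already includes it.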


The third stage: the era of MaxCompute (formerly ODPS), the independently developed big data platform

While Hadoop was being used at scale, another project was quietly under way: the ODPS system developed independently by the Alibaba Cloud team. All of the ODPS code was written by Alibaba itself, and it made many improvements over Hadoop in unification, security, manageability, and openness. The ODPS system was named Yunti II. From 2010 on, Yunti I and Yunti II coexisted for a long time.

During this period, the Group established CDO, a unified data platform business group, and invested in the research and development of big data platform tooling, including the compute and storage platform, the surrounding scheduling system, the metadata lineage system, the data quality management system, and DQC.

This state lasted until April 2013, when a new challenge emerged: the upper limit of the Hadoop cluster was 5,000 nodes. Projections from the data growth rate at the time showed that cluster storage was about to hit the wall, yet ODPS was not ready to completely replace Hadoop. So a very large project, the "5K Project", was launched to run the Yunti I and Yunti II clusters across data centers. At that time no company in the world was able to run a cluster across data centers, so the technical challenge was enormous. The project took nearly five months, overcame a large number of technical difficulties, and ultimately succeeded.

With the success of the "5K Project" and the gradual maturing of the ODPS architecture, the Group launched an even larger project, the "Moon Landing Project", to move all of the Group's data processing applications to ODPS. The project continued until 2015, when Hadoop was officially taken offline. Taobao big data had fully entered the ODPS era, and the whole data ecosystem grew richer. At the same time, Alibaba Cloud began to provide cloud services to the outside world, and big data solutions, as an important part of them, began to be offered externally.

Back in 2013, every member of the Taobao data team was busy handling all kinds of requests, and every day there were reports that could not be finished. To save itself, the data team began to explore a new data service model, thinking about how to systematically solve problems such as data redundancy, unifying metric definitions, data exchange, and user self-service. After a period of thinking and exploration, the team began to develop the Kongming Lamp product, which formed a complete data solution for different data roles.

The Kongming Lamp product upgraded the traditional development model and played a very good governance role in the overall big data build-out. At that time it covered most of the business units (BUs) within Taobao, reduced data costs, and saved a large amount of manpower. It also attracted external users such as Gaode Map and Ali Health, which built their big data on this system.

In 2014, the Group's public layer project was launched, and the data teams across the Group began to restructure and integrate their data content. At the same time, the CCO was formally established. Seven companies came to CCO to lead the technical team. Xue Kui came to CCO to lead the data warehouse team. CCO also launched its own public layer construction project on ODPS, integrating the service data of the Tao Department, 1688, ICBU, and AE. The public layer completed the Moon Landing project, and the service data portal product DIGO was built in cooperation with the DIC team and the RDC team.

Today, data has penetrated every corner of Alibaba. Alibaba Cloud has a strong algorithm team and a large number of data interface staff and analysts whose daily work revolves around data. With the ever deeper use of artificial intelligence and the continuous innovation and iteration of business systems, new requirements have been placed on data collection, processing, and application. How to provide better data services is something we need to keep thinking about. Data is about to enter a new era: the era of data intelligence.
