The technical architecture of Taobao's massive data products


What is the technical architecture behind Taobao's massive data products, and how does it cope with the massive traffic of Double 11? First, look at the diagram:


Following the flow of data, we divide the technical architecture of Taobao's data products into five layers (as shown in Figure 1): the data source layer, the computing layer, the storage layer, the query layer, and the product layer. At the top of the structure is the data source layer, which contains the databases of users, shops, merchandise, and transactions on the main Taobao site, as well as behavioral logs such as user browsing and searches. This series of data is the raw lifeblood of the data products.

As real-time data is generated at the data source layer, it is transmitted in real time through DataX, DBSync, and TimeTunnel, data transmission components developed in-house at Taobao, to a 1,500-node Hadoop cluster we call the "ladder", which is the main component of the computing layer. On the "ladder", roughly 40,000 jobs a day run different MapReduce computations over 1.5PB of raw data according to product requirements, and this computation is usually finished before two o'clock in the morning. Compared with the data seen in the front-end product, the results here are likely to be intermediate states, often a deliberate trade-off between data redundancy and front-end computation cost.

It is worth mentioning that some highly time-sensitive data, such as statistics for search terms, need to be pushed to the front end of the data product as quickly as possible. Computing them on the "ladder" would be relatively inefficient, so we built a streaming real-time computing platform called "Galaxy". "Galaxy" is also a distributed system: it receives real-time messages from TimeTunnel, performs real-time computation in memory, and flushes the results to NoSQL storage devices for front-end products to call, all in as short a time as possible.
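As a rough illustration, here is a minimal sketch of the kind of in-memory streaming aggregation described here, with a hypothetical message callback and key-value store interface standing in for TimeTunnel and the NoSQL device (neither interface is public):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of a streaming counter in the style of "Galaxy":
// consume messages, aggregate in memory, periodically flush to a KV store.
public class StreamingCounter {
    private final Map<String, Long> counts = new ConcurrentHashMap<>();

    // Called for each real-time message (e.g. a search-term event).
    public void onMessage(String searchTerm) {
        counts.merge(searchTerm, 1L, Long::sum);
    }

    // Periodically flush the in-memory aggregates to NoSQL storage
    // so front-end products can read near-real-time results.
    public void flush(KvStore store) {
        counts.forEach(store::put);
    }

    // Stand-in for the NoSQL storage device mentioned in the text.
    public interface KvStore {
        void put(String key, Long value);
    }
}
```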

It is easy to see that neither the "ladder" nor "Galaxy" is suited to providing real-time query services directly to products. For the "ladder", its role is offline computation only, and it cannot support the required performance and concurrency; as for "Galaxy", even though all the code is in our own hands, it is impractical to fold data receiving, real-time computation, storage, and querying into one distributed system without layering, so it, too, falls back to the current layered architecture.

To this end, we designed a storage layer dedicated to front-end products. In this layer we have MyFox, a distributed relational database cluster based on MySQL, and Prom, a NoSQL storage cluster based on HBase; the text below focuses on the implementation of these two clusters. In addition, other third-party modules also fall under the storage layer. The growing number of heterogeneous modules in the storage layer poses a challenge to front-end products, so we designed a unified data middle layer, glider, to mask this effect. Glider provides a RESTful interface over HTTP: a data product obtains the data it wants through a unique URL.
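For illustration, a minimal sketch of what a front-end call to such a RESTful middle layer might look like, using Java's standard HttpClient; the host, path, and parameters are hypothetical, since glider's actual URLs are not documented here:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class GliderClientSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical glider-style URL: host, path, and parameters are illustrative only.
        URI uri = URI.create("http://glider.example.com/report/top-items?date=20120601&limit=10");

        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(uri).GET().build();

        // The data product identifies the data it wants purely through the URL.
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}
```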

The above is a summary introduction to the technical architecture of Taobao's massive data products.

The query agent layer of the distributed MySQL cluster: MyFox

Taobao's data products use MySQL's MyISAM engine as the underlying data storage engine. On this basis, to cope with massive data, we designed MyFox, the query agent layer of the distributed MySQL cluster, which makes the partitioning transparent to front-end applications.


At present, the data stored in MyFox has reached 10TB, accounting for more than 95% of the Data Cube's total data, and it is growing by more than 600 million rows a day (Figure 2). The data is distributed roughly evenly across 20 MySQL nodes, and queries are served transparently through MyFox:


It is worth mentioning that not all of MyFox's 20 existing nodes are "equal". In general, users of data products care most about the data of the "last few days"; the older the data, the more likely it is to be left cold. For hardware-cost reasons, we therefore divide these 20 nodes into "hot nodes" and "cold nodes" (pictured above).

As the name suggests, "hot nodes" hold the latest and most frequently accessed data. For this data we want to give users the fastest possible query speed, so for disks we chose 15,000 rpm SAS drives; calculated at two machines per node, the unit storage cost is about 45,000 RMB/TB. Correspondingly, for "cold data" we chose 7,500 rpm drives, which store more data per disk at a cost of about 16,000 RMB/TB. Another benefit of separating hot and cold data is that it effectively improves the memory-to-disk ratio. As Figure 4 shows, a single "hot node" machine has only 24GB of memory while its disks hold about 1.8TB (300GB × 12 disks × 0.5), a memory-to-disk ratio of roughly 4:300, far below what is reasonable for a MySQL server. The consequence of too low a ratio is that one day even the data's indexes will no longer fit in memory; at that point large numbers of query requests must read indexes from disk, and efficiency is greatly compromised.
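As an illustration of the idea, here is a minimal sketch of routing queries between hot and cold node pools by data age; the 30-day cutoff and the node lists are assumptions, since MyFox's actual routing rules are not spelled out here:

```java
import java.time.LocalDate;
import java.util.List;

// Hypothetical routing between "hot" and "cold" MySQL node pools by data age.
public class HotColdRouter {
    private final List<String> hotNodes;   // 15,000 rpm SAS, recent data
    private final List<String> coldNodes;  // 7,500 rpm, historical data
    private final int hotDays;             // e.g. keep the last 30 days "hot" (assumed cutoff)

    public HotColdRouter(List<String> hotNodes, List<String> coldNodes, int hotDays) {
        this.hotNodes = hotNodes;
        this.coldNodes = coldNodes;
        this.hotDays = hotDays;
    }

    // Pick a node for a given partition date; spread load within the pool by hashing.
    public String route(LocalDate partitionDate, int partitionId) {
        boolean hot = partitionDate.isAfter(LocalDate.now().minusDays(hotDays));
        List<String> pool = hot ? hotNodes : coldNodes;
        return pool.get(Math.floorMod(partitionId, pool.size()));
    }
}
```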

NoSQL is a useful complement to SQL

With MyFox in place, everything looked perfect: developers were not even aware of MyFox's existence, and a SQL statement without any special modification could satisfy their requirements. This state of affairs lasted a long time, until one day we ran into a problem the traditional relational database could not solve: full-attribute filtering (shown in the following image).


This is a very typical example. To illustrate the problem, we still describe it in relational-database terms. For the notebook computer category, the filter conditions a user's query selects may involve a series of properties (fields) such as notebook size, notebook positioning, and hard disk capacity, and the distribution of property values is extremely uneven across the properties that may appear in a filter condition. As Figure 5 shows, notebook size has 10 enumerated values, while the "Bluetooth" property is a boolean value whose selectivity is very poor.

When the user's choice of filter conditions is uncertain, there are two ways to solve the full-attribute problem. One is to enumerate all possible filter-condition combinations, precompute them on the "ladder", and store the results in a database for querying; the other is to store the raw data and, at query time, filter out the matching records according to the user's conditions and aggregate them on the fly. The first scheme is clearly unworkable in practice, because the combinations of filter conditions are nearly endless. In the second scheme, the question is where the raw data should live: if you still use a relational database, how would you index such a table?

This series of questions led us to the idea of an engine that provides customized storage, on-the-fly computation, and query services: Prom, short for Prometheus (see below).


As the diagram shows, we chose HBase as Prom's underlying storage engine. We chose HBase because it is based on HDFS and has a good MapReduce programming interface. Although Prom is a general service framework for solving common problems, we will still use full-attribute filtering as the example to explain how Prom works. The raw data here are the previous day's transaction details on Taobao. In the HBase cluster, we store attribute pairs (combinations of attribute and attribute value) as the row-key. For the value corresponding to each row-key, we designed two column families: an index field holding the list of transaction IDs, and a data field holding the original transaction details. At storage time we deliberately make every element in each field fixed-length, so that the corresponding records can be retrieved quickly by offset, avoiding complex lookup algorithms and large numbers of random disk reads.
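A minimal sketch of the fixed-length layout described above, assuming 8-byte transaction IDs packed into a single byte array (the actual Prom record format is not public); the i-th element is located by pure offset arithmetic, with no search:

```java
import java.nio.ByteBuffer;

// Sketch: pack fixed-length elements so the i-th record is found by offset,
// with no lookup algorithm and no extra random reads.
public class FixedLengthIndex {
    private static final int ID_LENGTH = 8; // assume 8-byte transaction IDs

    // Pack a list of transaction IDs into one contiguous value (e.g. an HBase cell).
    public static byte[] pack(long[] tradeIds) {
        ByteBuffer buf = ByteBuffer.allocate(tradeIds.length * ID_LENGTH);
        for (long id : tradeIds) {
            buf.putLong(id);
        }
        return buf.array();
    }

    // Fetch the i-th ID directly by offset: position = i * ID_LENGTH.
    public static long get(byte[] packed, int i) {
        return ByteBuffer.wrap(packed).getLong(i * ID_LENGTH);
    }

    public static void main(String[] args) {
        byte[] cell = pack(new long[]{1001L, 1002L, 1003L});
        System.out.println(get(cell, 2)); // prints 1003
    }
}
```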


The illustration above uses a typical example to describe how Prom serves a query, and we won't go into further detail here. It is worth mentioning that the computations Prom supports are not limited to SUM; common computations in the statistical sense are all supported. For on-the-fly computation, we extended HBase: Prom requires that the data returned by each node be a "locally computed" local optimum, so that the final global result is simply a summary of the local optima returned by all the nodes. Clearly, this design exploits each node's parallel computing capacity to the fullest and avoids the network overhead of shipping large volumes of detail data.
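A minimal sketch of this local-then-global pattern, using average as the example: each node ships back a partial sum and count rather than raw rows, and the coordinator merely merges the partials. Class and method names are illustrative:

```java
import java.util.List;

// Sketch: each storage node returns a "local optimum" (partial sum and count);
// the coordinator merges partials instead of shipping raw detail rows.
public class PartialAggregate {
    final double sum;
    final long count;

    PartialAggregate(double sum, long count) {
        this.sum = sum;
        this.count = count;
    }

    // The global result is just a summary of the local results.
    static double globalAverage(List<PartialAggregate> partials) {
        double sum = 0;
        long count = 0;
        for (PartialAggregate p : partials) {
            sum += p.sum;
            count += p.count;
        }
        return count == 0 ? 0 : sum / count;
    }

    public static void main(String[] args) {
        // e.g. two nodes computed their partials in parallel
        System.out.println(globalAverage(List.of(
                new PartialAggregate(100.0, 4),
                new PartialAggregate(50.0, 2)))); // prints 25.0
    }
}
```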

Isolating the front end from the back end with the middle layer: glider

As mentioned above, MyFox and Prom provide data storage and underlying query solutions for the different needs of data products, but the consequent problem is that the various heterogeneous storage modules pose great challenges to front-end products. Moreover, the data needed by one front-end request can often not be obtained from a single module.

For example, to show "yesterday's hot-selling items" in the Data Cube, we first obtain a hot-item ranking from MyFox, but there the "item" is only an ID; the ID's description, pictures, and other data are missing. We then fetch these from the interface provided by the main Taobao site, match them against the ranking, and finally present the result to the user.


In essence, this is a join between heterogeneous "tables" in the broadest sense. So who should take charge of it? It is natural to add a middle tier between the storage layer and the front-end products, one responsible for the join and union computations across heterogeneous "tables" that isolates the front-end products from the back-end storage and provides a unified data query service. That middle layer is glider (see picture).
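A minimal sketch of such a heterogeneous join, assuming a ranking of IDs from one source and an ID-to-detail lookup from another; all names here are illustrative:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Sketch: join a hot-item ranking (IDs from MyFox) with item details
// (from another service) inside the middle layer.
public class HeterogeneousJoin {
    record Item(long id, String title) {}

    static List<Item> join(List<Long> hotIds, Map<Long, String> detailsById) {
        List<Item> result = new ArrayList<>();
        for (long id : hotIds) {
            // Preserve ranking order; skip IDs with no detail record.
            String title = detailsById.get(id);
            if (title != null) {
                result.add(new Item(id, title));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(join(List.of(3L, 1L), Map.of(1L, "laptop", 3L, "phone")));
    }
}
```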

Besides isolating the front end from the back end and integrating data across heterogeneous "tables", another role of glider that cannot be overlooked is cache management. As mentioned above, we treat the data in a data product as read-only within a given time period; this is the theoretical basis for using caching to improve performance.

Glider has two levels of caching: a second-level cache based on each heterogeneous "table" (datasource), and a first-level cache based on the merged results of individual requests. Beyond that, each heterogeneous "table" may have its own internal caching mechanism. Careful readers will have noticed MyFox's cache design in Figure 3: instead of caching the final summarized result, we cache each shard's partial result, the goal being a higher cache hit rate and less data redundancy.
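A minimal sketch in the spirit of this two-level design, with assumed cache names and loader interface; the real glider would merge several datasources where this sketch simply passes one through:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Sketch: first check the merged-result cache, then the per-datasource cache,
// and only then fall through to the underlying storage module.
public class TwoLevelCache {
    private final Map<String, String> resultCache = new ConcurrentHashMap<>();     // level 1: merged results
    private final Map<String, String> datasourceCache = new ConcurrentHashMap<>(); // level 2: per-"table" data

    public String get(String requestKey, String datasourceKey, Function<String, String> loader) {
        String merged = resultCache.get(requestKey);
        if (merged != null) {
            return merged;
        }
        String raw = datasourceCache.computeIfAbsent(datasourceKey, loader);
        // In glider this step would merge several datasources; here it is a pass-through.
        resultCache.put(requestKey, raw);
        return raw;
    }
}
```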

The biggest problem with caching is data consistency: how do we ensure that changes in the underlying data reach end users in the shortest possible time? This is necessarily a systemic effort, especially for such a layered system.

Each user request must carry cache-control "commands", which include the query string in the URL and the If-None-Match information in the HTTP headers. These cache-control "commands" must pass through every layer and finally reach the underlying heterogeneous-"table" storage modules. Along with its data, each heterogeneous "table" returns its own cache expiration time (TTL), and the expiration time of glider's final output is the minimum of the expiration times of all the heterogeneous "tables" involved. This expiration time is likewise passed up from the underlying storage layer and is ultimately returned to the user's browser via HTTP headers.
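The TTL rule above is simple to state in code; a sketch, with a hypothetical Datasource type standing in for a heterogeneous "table":

```java
import java.util.List;

// Sketch: the output's TTL is the minimum TTL across all datasources touched.
public class TtlPropagation {
    interface Datasource {
        long ttlSeconds(); // each heterogeneous "table" reports its own expiration
    }

    static long outputTtl(List<Datasource> touched) {
        return touched.stream()
                .mapToLong(Datasource::ttlSeconds)
                .min()
                .orElse(0L); // no datasource touched: do not cache
    }
}
```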

Another problem a caching system must consider is cache penetration and the avalanche effect on expiration. Cache penetration means querying data that does not exist: since the cache is written passively on a miss, and, for fault tolerance, nothing is written to the cache when the storage layer finds no data, every request for nonexistent data goes through to the storage layer, defeating the purpose of the cache.

There are several effective ways to deal with cache penetration. The most common is a Bloom filter: hash all the data that can possibly exist into a sufficiently large bitmap, so that a query for data that certainly does not exist is intercepted by the bitmap, sparing the underlying storage system the query pressure. In the Data Cube we adopted a more straightforward approach: if a query returns empty (whether because the data does not exist or because of a system failure), we still cache the empty result, but with a short expiration of no more than five minutes.
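A minimal sketch of this empty-result caching, with a hypothetical cache interface; the five-minute bound comes from the text, while the normal TTL is an illustrative value:

```java
import java.util.Optional;
import java.util.function.Function;

// Sketch: cache empty results too, but with a short TTL (here, up to 5 minutes),
// so nonexistent keys cannot hammer the storage layer on every request.
public class NullCaching {
    interface Cache {
        Optional<String> get(String key);
        void put(String key, String value, long ttlSeconds);
    }

    static String query(Cache cache, String key, Function<String, String> storage) {
        Optional<String> cached = cache.get(key);
        if (cached.isPresent()) {
            return cached.get();
        }
        String value = storage.apply(key); // may be null: not found or backend failure
        if (value == null) {
            cache.put(key, "", 300); // empty sentinel, expires within 5 minutes
            return "";
        }
        cache.put(key, value, 3600); // normal TTL (illustrative value)
        return value;
    }
}
```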

The avalanche effect that cache expiration has on the underlying system is terrifying, and unfortunately there is no perfect solution for it at present. Most system designers use a lock or a queue to ensure that the cache is written by a single thread (or process), so that masses of concurrent requests do not fall onto the underlying storage system when an entry expires. In the Data Cube, the expiration mechanism we designed can, in theory, spread the clients' data expiration times evenly along the time axis, which to some extent avoids the avalanche caused by many cache entries expiring at once.
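One common way to spread expiration times along the time axis is TTL jitter; a sketch of the general idea (the Data Cube's exact mechanism is not described in the article):

```java
import java.util.concurrent.ThreadLocalRandom;

// Sketch: add random jitter to the base TTL so entries written at the same
// moment do not all expire at the same moment.
public class TtlJitter {
    static long jitteredTtl(long baseTtlSeconds, double spread) {
        // spread = 0.2 scatters expirations within +/-20% of the base TTL
        double factor = 1.0 + (ThreadLocalRandom.current().nextDouble() * 2 - 1) * spread;
        return Math.max(1, Math.round(baseTtlSeconds * factor));
    }

    public static void main(String[] args) {
        System.out.println(jitteredTtl(3600, 0.2)); // e.g. between 2880 and 4320
    }
}
```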

Concluding remarks

It is on the basis of the architecture described in this article that the Data Cube now provides 80TB of compressed data storage, while the data middle layer glider supports 40 million queries a day with an average response time of 28 milliseconds (June 1 data), enough to meet business growth for some time to come. The system still has many imperfections, however. A typical example is that the layers communicate over short-lived HTTP connections, a strategy that directly causes a single machine's TCP connection count to climb very high at traffic peaks. So, a good architecture can greatly reduce development and maintenance costs, but it must itself keep changing as data volume and traffic change. I believe that within a few years, the technical architecture of Taobao's data products will certainly look different again.

Summary of other articles:

"1" massive data domain covers distributed database, distributed storage, real-time data computing, distributed computing and other technical directions.

For massive data processing, at the database level it comes down to two points: 1) how to distribute the pressure, the goal being to turn centralized into distributed; 2) using different storage solutions for different business data and data characteristics: an RDBMS or a KV store, different database software, centralized or distributed storage, or some other storage scheme.

"2" splits the database, including horizontal and vertical splits.

Horizontal splitting mainly solves two problems: 1) independence from any single piece of underlying storage; 2) scaling linearly by adding machines to absorb the pressure, supporting growth in data volume and in access load, including TPS (transactions per second) and QPS (queries per second). The approach is to split one large data table across different database servers according to some rule, as in the sketch below. Moving massive data from centralized to distributed may also involve disaster-tolerant backup across multiple IDCs.
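A minimal sketch of the usual routing rule for a horizontal split, hashing a shard key (here a hypothetical user_id) to one of N database servers:

```java
// Sketch: horizontal split routing. Rows of one big table are spread across
// N database servers by hashing the shard key.
public class ShardRouter {
    private final String[] shards; // JDBC URLs or node names of the MySQL servers

    public ShardRouter(String[] shards) {
        this.shards = shards;
    }

    // e.g. route by user_id; the same key always lands on the same shard.
    public String shardFor(long userId) {
        return shards[Math.floorMod(Long.hashCode(userId), shards.length)];
    }

    public static void main(String[] args) {
        ShardRouter router = new ShardRouter(new String[]{"db0", "db1", "db2", "db3"});
        System.out.println(router.shardFor(42L)); // always the same node for user 42
    }
}
```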

"3" Alibaba's data on different geographical data processing methods.

This is solved by three closely cooperating products: Erosa, Eromanga, and Otter. Erosa continuously parses the binlog of MySQL (or another database) and hands the parsed changes to Eromanga. Eromanga is a publish/subscribe product for incremental data: Erosa keeps producing the changing data and publishing it to Eromanga, and each business consumer (a search engine, data warehouse, or related business party) subscribes and continuously receives the changed data, by push or by pull, to do its own business processing. Otter handles data synchronization across IDCs, so that data changes are promptly reflected at the different active-active (AA) sites. Synchronization can produce conflicts; for now the local site's data takes priority, for example computer room A's data takes priority regardless of how it overwrites B's.
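The Erosa/Eromanga interfaces are not public; purely as an illustration of the subscription model described above, a consumer-side contract might look like this hypothetical sketch:

```java
// Hypothetical sketch of the publish/subscribe model described above:
// a consumer subscribes to incremental changes parsed from the binlog.
public interface IncrementSubscriber {

    // One parsed binlog change: which table, what kind of change, and the row data.
    record ChangeEvent(String table, String type /* INSERT/UPDATE/DELETE */, String rowJson) {}

    // Called (push mode) or polled (pull mode) for each incremental change;
    // the business side applies it to its own store, index, or warehouse.
    void onChange(ChangeEvent event);
}
```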

"4" for caching.

1) Pay attention to segmentation, and choose the granularity according to the business: the finer the cache is segmented, the higher the cache hit rate. 2) Determine the effective lifecycle of the cache.

"5" split policy

1) Splitting by field (the finest granularity). For example, split a table on its company field, partitioning the rows by company_id.

2) Splitting by table: moving individual tables out to separate MySQL instances or clusters, which is closer to a vertical split.

3) Splitting by schema, which is tied to the application. For example, the data of one module's service goes into one MySQL cluster and the data of another module's service goes into a different one, while the service offered externally remains the overall combination of these clusters, with Cobar responsible for coordinating them.

The evolution of web application architecture


Single application architecture: when site traffic is small, a single application with all features deployed together reduces deployment nodes and cost. At this stage, a data access framework (ORM) that simplifies the CRUD workload is key.

Vertical application architecture: when traffic grows, adding machines to the single application brings less and less acceleration, so the application is split into several independent applications to improve efficiency. At this stage, a web framework (MVC) that speeds up front-end page development is key.

Distributed service architecture: as vertical applications multiply, interaction between them becomes unavoidable; core business is extracted as independent services, gradually forming a stable service center, so that front-end applications can respond more quickly to changing market demands. At this stage, a distributed service framework (RPC) that improves business reuse and integration is key.

Flow computing architecture: as services multiply, problems such as capacity evaluation and wasted resources on small services gradually emerge; a scheduling center is then needed to manage cluster capacity in real time according to access pressure and improve cluster utilization. At this stage, a resource scheduling and governance center (SOA) that raises machine utilization is key.

A comparison of several communication protocols

Overall performance comparison: Socket (BIO/NIO/Netty/Mina) > RMI > HTTP invoker >= Hessian > REST >> Burlap > EJB >> Web Service.

Socket: if the protocol is well designed, sockets are undoubtedly the fastest, and also the most flexible and the most complex; efficient network frameworks such as Mina or Netty can reduce the development complexity. They are generally used only under very harsh performance requirements.

RMI: performance is somewhat lower, but still in the same order of magnitude as sockets. It can only communicate between Java systems, and over the Internet there are problems traversing firewalls. Spring's RMI wrapper performs slightly better than raw RMI, mainly because Spring uses proxies and caching, saving object-lookup time.

HTTP invoker: Spring-specific; both client and server must use the Spring framework. Like RMI, it uses Java serialization to transfer objects, with little performance difference.

Hessian: with small amounts of data its performance can even exceed RMI's; with complex object structures or large numbers of data objects it is about 20% slower than RMI. Its advantages are that it is lean, efficient, and usable across languages, currently supporting Java, C++, .NET, Python, Ruby, and others. In addition, Hessian can take full advantage of mature web containers, which helps when handling large numbers of users: resource allocation, thread queuing, and exception handling are all taken care of by the web container, whereas RMI itself provides no multithreaded server.

REST: a relatively simple, efficient web services architecture; slightly slower than Hessian but in the same order of magnitude, also based on HTTP, and with many successful cases.

Burlap: performs well only on very small amounts of data, and its performance drops sharply as data volume grows, typically taking about three times as long as RMI. The main reason: Hessian transmits data in binary, while Burlap uses XML, whose verbose markup makes the same structure far larger to transmit, and XML parsing consumes more resources, especially on large data.

EJB: based on the RMI protocol, with unremarkable performance, Java-only and not cross-language; it sees little use today, and Alibaba has completely abandoned EJB internally.

Web Service: the slowest of these remote call protocols; in general about 10 to 20 times slower than Hessian, and for the same request it transmits about 6 times as much data as Hessian, consuming a great deal of network bandwidth. General-purpose XML codecs are not efficient, and XML-to-Java-bean encoding and decoding is very resource-intensive, so it is not a good choice for high-concurrency, high-load sites; it is also not especially convenient to use.
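Since the summary below recommends Hessian, here is a minimal Hessian client sketch; the service interface and URL are hypothetical, and the server side would expose the same interface (for example via Hessian's servlet support):

```java
import com.caucho.hessian.client.HessianProxyFactory;

// Minimal Hessian client sketch: the proxy speaks binary Hessian over plain HTTP.
public class HessianClientSketch {

    // Hypothetical remote service contract shared by client and server.
    public interface HelloService {
        String hello(String name);
    }

    public static void main(String[] args) throws Exception {
        HessianProxyFactory factory = new HessianProxyFactory();
        HelloService service = (HelloService)
                factory.create(HelloService.class, "http://localhost:8080/hello");
        System.out.println(service.hello("world"));
    }
}
```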

Summary: Hessian and the REST architecture are, in my view, relatively good high-performance communication choices; if the performance requirements are particularly harsh, sockets can be used directly. At present, Alibaba's internal remote calls mainly use Hessian and Dubbo (based on the Mina/Netty frameworks), which have withstood harsh high-concurrency, high-load testing. Further references: Dubbo, Alibaba's core SOA service governance framework (some of the technologies it uses: Netty/NIO/Mina, ZooKeeper, Redis, the RMI protocol, the Hessian protocol); an introduction to the Taobao File System (TFS); the technical evolution of the Taobao open platform; Alipay's large-scale SOA system; why REST is better than RPC; Scalability Best Practices: Lessons from eBay.

(Original: http://server.51cto.com/taobao2012/)

Reproduced from: http://www.cnblogs.com/Mainz/p/3638370.html
