Taobao holds a vast amount of data with enormous commercial value. To date it has accumulated more than 3 billion shop and merchandise browsing records, 1 billion items online, and tens of millions of transaction, favorite, and review records. Mining real business value from this data, so as to help Taobao and its merchants operate on data and help consumers make rational shopping decisions, is the mission of Taobao's data platform and product department.
To this end, we have built a series of data products, such as the well-known Quantum Statistics, Data Cube, and Taobao Index. Although data products are not especially difficult from a business point of view, their computation, storage, and retrieval become dramatically harder under the constraint of "massive scale". Taking the Data Cube as an example, this article introduces Taobao's exploration of technical architecture for massive data products.
The technical framework of Taobao's massive data products
One of the most notable characteristics of data products is that data is not written in real time. Because of this, we can assume that the data in the entire system is read-only within a given time window, which lays a very important foundation for our cache design.
Figure 1 The technical framework of Taobao's massive data products
Divided by data flow, the technical architecture of Taobao's data products falls into five layers (as shown in Figure 1): the data source layer, the computing layer, the storage layer, the query layer, and the product layer. At the top of the stack is the data source layer, which holds the databases of Taobao's main site, such as users, shops, merchandise, and transactions, as well as behavioral logs such as user browsing and searching. This data is the raw lifeblood of the data products.
Data generated in real time at the data source layer is transmitted, via Taobao's in-house data transfer components DataX, DBSync, and TimeTunnel, to a 1,500-node Hadoop cluster we call the "Ladder", which forms the bulk of the computing layer. On the Ladder, roughly 40,000 jobs a day run MapReduce computations over 1.5 PB of raw data according to the requirements of the various products, and this processing is usually finished before 2 a.m. Compared with the data shown by the front-end products, the results produced here are often in an intermediate state, the outcome of a deliberate trade-off between data redundancy and front-end computation.
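As a hedged illustration of what such a pre-aggregation job might look like (the log format, field names, and granularity are hypothetical, not taken from the actual Ladder jobs), the following sketch reduces raw browse logs to per-item daily counts, the kind of "intermediate state" result the front end can combine further:

```python
# Minimal MapReduce-style sketch of a pre-aggregation job.
# Hypothetical input: raw browse-log lines of the form "date,item_id,user_id".
from collections import defaultdict

def map_phase(lines):
    """Emit a ((date, item_id), 1) pair for every browse record."""
    for line in lines:
        date, item_id, _user_id = line.strip().split(",")
        yield (date, item_id), 1

def reduce_phase(pairs):
    """Sum the counts for each (date, item_id) key."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return totals

raw_logs = [
    "2011-06-01,item42,u1",
    "2011-06-01,item42,u2",
    "2011-06-01,item7,u1",
]
print(reduce_phase(map_phase(raw_logs)))
# {('2011-06-01', 'item42'): 2, ('2011-06-01', 'item7'): 1}
```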
It is worth mentioning that some highly time-sensitive data, such as statistics for search terms, should be pushed to the front-end products as quickly as possible. Computing them on the Ladder would be relatively inefficient, so we built a streaming real-time computing platform called "Galaxy". Galaxy is also a distributed system: it receives real-time messages from TimeTunnel, performs the computation in memory, and flushes the results to NoSQL storage in the shortest possible time for the front-end products to use.
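A minimal sketch of that streaming pattern, with the message shape, flush policy, and NoSQL client all being assumptions for illustration rather than Galaxy's real interfaces:

```python
# Sketch of a Galaxy-style streaming aggregator: consume messages, aggregate
# in memory, periodically flush partial results to a NoSQL store.
import time
from collections import defaultdict

def nosql_put(key, value):
    """Stand-in for a NoSQL write; a real system would call its storage client here."""
    print(f"PUT {key} = {value}")

class StreamingCounter:
    def __init__(self, flush_interval=5.0):
        self.counts = defaultdict(int)      # in-memory aggregation state
        self.flush_interval = flush_interval
        self.last_flush = time.time()

    def on_message(self, msg):
        """Consume one real-time message, e.g. {'query': 'laptop'}."""
        self.counts[msg["query"]] += 1
        if time.time() - self.last_flush >= self.flush_interval:
            self.flush()

    def flush(self):
        """Write the partial counts out so front-end products can read them."""
        for query, count in self.counts.items():
            nosql_put(key=f"search:{query}", value=count)
        self.counts.clear()
        self.last_flush = time.time()
```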
Understandably, neither the Ladder nor Galaxy is suitable for serving real-time queries directly to the products. The Ladder is positioned purely for offline computation and cannot support the required performance and concurrency; as for Galaxy, even though all of its code is in our own hands, integrating data ingestion, real-time computation, storage, and querying into one distributed system cannot avoid layering, and it still ends up with the architecture described here.
For this reason, we designed a dedicated storage layer for the front-end products. In this layer we have MyFox, a distributed relational database cluster based on MySQL, and Prom, a NoSQL storage cluster based on HBase; the implementation of these two clusters is the focus of the following sections. Other third-party modules also fall under the storage layer.
The growing number of heterogeneous modules in the storage layer poses a challenge for the front-end products that use them. To address this, we designed a unified data middleware, glider, to shield them from the differences. Glider exposes a RESTful interface over HTTP, and each data product can obtain the data it wants through a unique URL.
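A hedged sketch of what fetching from such an interface could look like from a product's side; the host, path, and parameters below are invented for illustration and are not glider's real endpoints:

```python
# Sketch of a front-end product fetching one dataset from glider's RESTful
# HTTP interface by its unique URL. All URL details are hypothetical.
import json
import urllib.request

def query_glider(dataset, params):
    """GET one dataset from glider and return the parsed JSON body."""
    query = "&".join(f"{k}={v}" for k, v in params.items())
    url = f"http://glider.example.taobao.com/{dataset}?{query}"
    with urllib.request.urlopen(url) as resp:
        return json.loads(resp.read().decode("utf-8"))

# Example call (would only work against a real glider endpoint):
# hot_items = query_glider("hot_items_daily", {"date": "2011-06-01", "limit": "10"})
```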
The above is a general overview of the technical framework behind Taobao's massive data products; the rest of this article focuses on four design aspects of the Data Cube.
Relational databases are still king
Relational databases (RDBMS) have been widely used in industry since their introduction in the 1970s. More than 30 years of rapid development have produced a group of excellent database systems, such as Oracle, MySQL, DB2, Sybase, and SQL Server.
Figure 2 The data growth curve in MyFox
Although relational databases are at a disadvantage relative to non-relational databases in partition tolerance (tolerance of network partitions), their strong semantic expressiveness and ability to express relationships between data still give them an irreplaceable role in data products.
Taobao's data products chose MySQL's MyISAM engine as the underlying data storage engine. On top of it, to cope with massive data, we designed MyFox, a query proxy layer for a distributed MySQL cluster that makes the partitioning transparent to front-end applications.
Figure 3 The data query process in MyFox
At present, the data stored in MyFox has reached 10 TB, accounting for more than 95% of the Data Cube's total data, and it is still growing by more than 600 million records a day (Figure 2). The data is distributed roughly evenly across 20 MySQL nodes and is served transparently to the outside through MyFox at query time (as shown in Figure 3).
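The scatter-gather idea behind such a proxy can be sketched as follows; the shard naming, query stub, and merge rule are assumptions for illustration, not MyFox's actual implementation:

```python
# Sketch of a MyFox-style query proxy: send the same SQL to every MySQL
# shard in parallel, then merge the partial results for the caller.
from concurrent.futures import ThreadPoolExecutor

SHARDS = [f"mysql-node-{i:02d}" for i in range(20)]  # 20 MySQL nodes

def query_shard(shard, sql):
    """Stand-in for running SQL on one shard; a real proxy would use a MySQL client."""
    return [(shard, 0)]   # pretend each shard returns (group_key, partial_value) rows

def scatter_gather(sql):
    """Fan the query out to all shards and concatenate the partial rows."""
    with ThreadPoolExecutor(max_workers=len(SHARDS)) as pool:
        partials = pool.map(lambda s: query_shard(s, sql), SHARDS)
    merged = []
    for rows in partials:
        merged.extend(rows)
    return merged

rows = scatter_gather("SELECT ... FROM fact_table WHERE dt = '2011-06-01'")
```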
Figure 4 The node structure of MyFox
It is worth mentioning that not all of MyFox's 20 existing nodes are "equal". In general, users of data products care most about data from "the last few days"; the older the data, the less likely it is to be visited. Therefore, out of consideration for hardware cost, we split these 20 nodes into "hot nodes" and "cold nodes" (as shown in Figure 4).
As the name suggests, the hot nodes hold the most recent and most frequently accessed data. For this data we want to offer users the fastest possible query speed, so we chose 15,000 RPM SAS drives; calculated at two machines per node, the unit storage cost is about 45,000 RMB per TB. Correspondingly, for the "cold data" we chose 7,500 RPM drives, which store more data per disk at a unit cost of about 16,000 RMB per TB.
Another benefit of separating hot and cold data is that it effectively controls the memory-to-disk ratio. As Figure 4 shows, a single "hot node" machine has only 24 GB of memory while its disks hold roughly 1.8 TB when full (300 GB × 12 disks × 0.5 / 1024), a memory-to-disk ratio of about 4:300, far below the reasonable value for a MySQL server. The consequence of too low a ratio is that one day even the indexes of the data will no longer fit in memory; at that point a large number of query requests would have to read indexes from disk, and query efficiency would suffer greatly.
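A quick back-of-the-envelope check of the figures above, using the disk size and utilization factor quoted in the text:

```python
# Back-of-the-envelope check of the hot-node memory-to-disk ratio quoted above.
disk_size_gb = 300          # per-disk capacity
disks_per_machine = 12
utilization = 0.5           # roughly half of raw capacity usable for data

disk_tb = disk_size_gb * disks_per_machine * utilization / 1024
memory_gb = 24

print(f"usable disk per machine = {disk_tb:.2f} TB")            # about 1.76 TB
print(f"memory:disk = {memory_gb}:{disk_tb * 1024:.0f} GB")     # 24:1800, i.e. about 4:300
```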
NoSQL is a useful complement to SQL
With MyFox in place, everything looked perfect: developers did not even need to be aware of MyFox's existence, and an unmodified SQL statement could satisfy the requirements. This state lasted a long time, until one day we ran into a problem that the traditional relational database approach could not solve: the full-attribute selector (shown in Figure 5).
Figure 5 The full-attribute selector
This is a very typical example. To illustrate the problem, we will still describe it in relational database terms. For the category "laptop", the filter conditions a user selects may involve a series of attributes (fields) such as screen size, market positioning, and hard disk capacity, and the distribution of values is extremely uneven across the attributes that may appear in a filter condition. As Figure 5 shows, laptop screen size has about ten enumerated values, whereas "Bluetooth support" is a Boolean whose selectivity is very poor.
When the user's choice of filter conditions is not known in advance, there are two ways to solve the full-attribute problem. One is to precompute results for every possible combination of filter conditions on the Ladder and store them in a database for querying; the other is to store the raw records and, at query time, filter them by the chosen conditions and aggregate the relevant fields on the fly. Clearly the first approach is infeasible in practice, because the combinations of filter conditions are nearly unbounded. That leaves the second approach, which raises new questions: where should the raw data be stored, and if we stuck with a relational database, how would we even index such a table?
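To see why precomputing every combination is hopeless, consider a small, purely illustrative count (the attributes and their sizes below are made up, not Taobao's real schema); each attribute multiplies the number of precomputed result sets:

```python
# Illustrative count of precomputed filter combinations for one category.
# Each attribute contributes (number of values + 1) choices, the "+1" meaning
# "this attribute is not filtered". Attribute sizes are hypothetical.
from math import prod

attribute_value_counts = {
    "screen_size": 10,
    "positioning": 5,
    "hard_disk": 8,
    "bluetooth": 2,
    "brand": 50,
    "price_band": 12,
}

combinations = prod(n + 1 for n in attribute_value_counts.values())
print(combinations)  # 11 * 6 * 9 * 3 * 51 * 13 = 1,181,466 result sets per category
```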
This series of questions led us to the idea of an engine that offers "customized storage, on-the-fly computation, and query service", which is Prometheus, or Prom (see Figure 6).
Figure 6 The storage structure of Prom
As Figure 6 shows, we chose HBase as Prom's underlying storage engine, because it is built on HDFS and has a good programming interface to MapReduce. Although Prom is a general service framework for a class of common problems, we will again use full-attribute selection as the example to explain how it works. The raw data here is the previous day's transaction details on Taobao. In the HBase cluster, we store each attribute pair (the combination of an attribute and one of its values) as the row key. For the value corresponding to a row key, we designed two column families: an index field that holds the list of transaction IDs, and a data field that holds the original transaction details. When storing, we deliberately make every element in each field a fixed length, so that the corresponding record can be located quickly by offset, avoiding complex lookup algorithms and large numbers of random disk reads.
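A hedged sketch of this layout (the key format, field width, and encoding below are assumptions for illustration, not Prom's actual schema) shows how fixed-width elements allow the i-th entry to be found directly by offset:

```python
# Sketch of a Prom-like row layout: row key = attribute:value, an index
# column holding fixed-width transaction IDs, a data column holding details.
import struct

ID_WIDTH = 8  # every transaction ID stored as a fixed 8-byte big-endian integer

def encode_index(transaction_ids):
    """Pack the ID list into one fixed-width byte string for the index field."""
    return b"".join(struct.pack(">Q", tid) for tid in transaction_ids)

def get_nth_id(index_bytes, n):
    """Fetch the n-th transaction ID directly by offset, with no scanning."""
    start = n * ID_WIDTH
    return struct.unpack(">Q", index_bytes[start:start + ID_WIDTH])[0]

row_key = b"notebook_size:13-inch"          # attribute pair used as the row key
index_bytes = encode_index([10001, 10002, 10007])
assert get_nth_id(index_bytes, 2) == 10007
```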
Figure 7 The Prom query process
Figure 7 gives a typical example of Prom serving a query, which I will not describe in detail here. It is worth mentioning that the computations Prom supports are not limited to SUM; the common computations of statistics are all supported. For on-the-fly computation we extended HBase: Prom requires each node to return data that has already been "locally computed", a local result, so that the final global result is simply a merge of the local results returned by the nodes. This design makes full use of each node's parallel computing capacity and avoids shipping large volumes of detail data over the network.
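The local-then-global idea can be sketched for an average (the node abstraction and field names are hypothetical): each node returns only a small (sum, count) pair, and the coordinator merges those pairs instead of pulling raw detail rows:

```python
# Sketch of locally computed partial results merged into a global result.
def local_compute(rows):
    """Run on each storage node: reduce its detail rows to a partial (sum, count)."""
    values = [r["amount"] for r in rows]
    return (sum(values), len(values))

def global_merge(partials):
    """Run on the coordinator: combine partials into the global average."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count if count else 0.0

node_a = [{"amount": 120.0}, {"amount": 80.0}]
node_b = [{"amount": 100.0}]
print(global_merge([local_compute(node_a), local_compute(node_b)]))  # 100.0
```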
Isolating the front end and back end with a middle tier
As described above, MyFox and Prom provide data storage and low-level query solutions for different needs of the data products, but a consequent problem is that the various heterogeneous storage modules pose considerable challenges to the front-end products that use them. Moreover, the data needed for a single front-end request can often not be obtained from one module alone.
For example, to show "yesterday's hot items" in the Data Cube, we first fetch a hot-items ranking from MyFox, but the "item" there is only an ID, without the corresponding description, pictures, or other details. Those have to be fetched from an interface provided by Taobao's main site and matched against the ranking before the result is presented to the user.
Figure 8 The technical architecture of glider
An experienced reader can surely see that, in essence, this is a join between heterogeneous "tables" in the broadest sense. So who should take charge of it? It is natural to add a middle tier between the storage layer and the front-end products that is responsible for join and union computations across heterogeneous "tables", isolates the front-end products from back-end storage, and provides a unified data query service. This middle tier is glider (shown in Figure 8).
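A hedged sketch of such a heterogeneous join, using the hot-items example above; both data-source functions and all field names are hypothetical stand-ins rather than glider's real interfaces:

```python
# Sketch of a glider-style join between two heterogeneous sources:
# a ranking of item IDs (as from MyFox) and item details (as from a main-site API).
def fetch_hot_ranking(date):
    """Pretend MyFox result: ordered list of {'item_id', 'sales'} rows."""
    return [{"item_id": 42, "sales": 9000}, {"item_id": 7, "sales": 8100}]

def fetch_item_details(item_ids):
    """Pretend main-site API result: item_id -> {'title', 'pic_url'}."""
    return {42: {"title": "13-inch laptop", "pic_url": "http://example.invalid/42.jpg"},
            7:  {"title": "Bluetooth mouse", "pic_url": "http://example.invalid/7.jpg"}}

def join_hot_items(date):
    """Join the ranking with the details, preserving the ranking order."""
    ranking = fetch_hot_ranking(date)
    details = fetch_item_details([r["item_id"] for r in ranking])
    return [{**r, **details.get(r["item_id"], {})} for r in ranking]

print(join_hot_items("2011-06-01"))
```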
Caching is a systematic project
Besides isolating the front end from the back end and integrating data across heterogeneous "tables", another role of glider that must not be overlooked is cache management. As mentioned earlier, we assume that the data in the data products is read-only within a given time period; this is the theoretical basis for using caches to improve performance.
In Figure 8 we can see that glider has two layers of cache: a second-level cache based on each heterogeneous "table" (datasource) individually, and a first-level cache based on the independent request after integration. In addition, each heterogeneous "table" may have its own caching mechanism inside. Careful readers will have noticed MyFox's cache design in Figure 3: instead of caching the final result after the merge computation, we cache each fragment, with the goal of raising the cache hit rate and reducing data redundancy.
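A hedged sketch of the fragment-level caching idea described for MyFox (the cache keying and shard query stub are assumptions): by caching each shard's partial result separately, different merged queries that happen to need the same (shard, SQL) pair can share cache entries:

```python
# Sketch of fragment-level caching: cache per (shard, sql) partial result
# instead of the merged final result, to raise hit rate and cut redundancy.
fragment_cache = {}

def query_shard(shard, sql):
    """Stand-in for executing SQL on one MySQL shard."""
    return [(shard, 1)]

def cached_fragment(shard, sql):
    key = (shard, sql)
    if key not in fragment_cache:          # passive write on a cache miss
        fragment_cache[key] = query_shard(shard, sql)
    return fragment_cache[key]

def merged_query(shards, sql):
    """Merge per-shard fragments; the merge itself is recomputed each time."""
    result = []
    for shard in shards:
        result.extend(cached_fragment(shard, sql))
    return result

print(merged_query(["mysql-node-00", "mysql-node-01"], "SELECT ..."))
```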
The biggest problem with caching is data consistency: how do we ensure that changes in the underlying data are reflected to the end user in the shortest possible time? This is bound to be a systematic effort, especially for a system with as many layers as ours.
Figure 9 The cache control system
Figure 9 shows the Data Cube's design for cache control. A user's request carries cache-control "commands" with it, including the query string in the URL and the "If-None-Match" information in the HTTP headers. These cache-control commands must pass through every layer and eventually reach the heterogeneous "table" modules of the underlying storage. Each heterogeneous "table", besides its own data, returns the expiration time (TTL) of its data cache, and the expiration time of glider's final output is the minimum of the expiration times of the heterogeneous "tables" involved. This expiration time likewise has to be passed up from the underlying storage layer and is finally returned to the user's browser via HTTP headers.
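A hedged sketch of that TTL propagation (the response structure and header names used here are assumptions about one reasonable way to surface it): the merged response adopts the minimum TTL of the datasources it touched and exposes it as HTTP caching headers:

```python
# Sketch of TTL propagation: each datasource returns (data, ttl_seconds);
# the merged response carries the minimum TTL up to the HTTP layer.
def merge_responses(source_results):
    """source_results: list of (data, ttl_seconds) from heterogeneous 'tables'."""
    merged_data = [data for data, _ in source_results]
    ttl = min(ttl for _, ttl in source_results)   # the most conservative expiry wins
    return merged_data, ttl

def build_http_headers(ttl, etag):
    """Expose the expiry and validator to the user's browser."""
    return {"Cache-Control": f"max-age={ttl}", "ETag": etag}

data, ttl = merge_responses([(["myfox rows"], 300), (["prom rows"], 60)])
print(build_http_headers(ttl, etag='"v1"'))  # max-age=60, i.e. the minimum TTL
```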
Another problem a caching system must consider is cache penetration and the avalanche effect when caches expire. Cache penetration refers to querying data that does not exist at all: because the cache is written passively on a miss, and for fault-tolerance reasons nothing is written when the storage layer finds no data, every request for that nonexistent data goes all the way to the storage layer, and the cache loses its purpose.
There are several ways to deal effectively with cache penetration. The most common is to use a Bloom filter: all data that could possibly exist is hashed into a sufficiently large bitmap, so a query for data that definitely does not exist is intercepted by the bitmap, shielding the underlying storage system from the query pressure. In the Data Cube we adopted a more straightforward approach: if a query returns empty (whether because the data does not exist or because of a system failure), we still cache the empty result, but its expiration time is short, at most five minutes.
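A hedged sketch of that null-result caching (the cache structure and storage stub are assumptions); the only difference from ordinary caching is that an empty result gets a deliberately short TTL:

```python
# Sketch of caching empty results with a short TTL to blunt cache penetration.
import time

CACHE = {}            # key -> (value, expire_at)
EMPTY_TTL = 300       # empty results kept for at most five minutes
NORMAL_TTL = 3600     # normal results kept longer (illustrative value)

def storage_lookup(key):
    """Stand-in for the real storage-layer query; may return None."""
    return None

def cached_get(key):
    hit = CACHE.get(key)
    if hit and hit[1] > time.time():
        return hit[0]
    value = storage_lookup(key)
    ttl = EMPTY_TTL if value is None else NORMAL_TTL
    CACHE[key] = (value, time.time() + ttl)   # cache even the empty result
    return value

print(cached_get("nonexistent-item"))  # None, but the miss is now cached briefly
```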
The avalanche effect when a cache expires is terrifying for the underlying system, and unfortunately there is no perfect solution for it today. Most system designers consider using a lock or a queue to ensure that the cache is rewritten by a single thread (or process), so that a large number of concurrent requests do not all fall onto the underlying storage when the cache fails. In the Data Cube, the cache expiration mechanism we designed can, in theory, spread the expiration times seen by different clients evenly along the time axis, which to some extent avoids the avalanche caused by many cache entries expiring at the same moment.
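One way to spread expirations along the time axis is sketched below; the jitter scheme is an illustration of the general idea, not necessarily the Data Cube's exact mechanism:

```python
# Sketch of staggering cache expiry per client so that entries for the same
# data do not all expire at the same instant. The offset scheme is illustrative.
import hashlib

BASE_TTL = 600          # nominal expiry of the underlying data, in seconds
MAX_JITTER = 120        # spread clients across a two-minute window

def client_ttl(client_id, base_ttl=BASE_TTL, max_jitter=MAX_JITTER):
    """Deterministic per-client offset derived from a hash of the client id."""
    digest = hashlib.md5(client_id.encode("utf-8")).hexdigest()
    jitter = int(digest, 16) % max_jitter
    return base_ttl + jitter

print(client_ttl("browser-session-1"))  # somewhere in 600..719, stable per client
```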
Concluding remarks
Thanks to the architectural features described in this article, the Data Cube can currently provide 80 TB of data storage (before compression), and the data middleware glider supports 40 million query requests per day with an average response time of 28 milliseconds (figures from June 1), which is enough to meet business growth for some time to come.
Nevertheless, the system as a whole still has many imperfections. One typical example is that the layers communicate over short-lived HTTP connections, a strategy that directly leads to an extremely high number of TCP connections on a single machine at peak traffic. A good architecture can greatly reduce the cost of development and maintenance, but it must itself keep evolving as data volume and traffic change. I believe that within a few years, the technical architecture of Taobao's data products will surely look quite different.