With hundreds of millions of items stored on eBay and millions of new products added every day, a cloud system is needed to store and process petabytes of data, and Hadoop is a good choice.
Hadoop is a fault-tolerant, scalable, distributed computing framework built on commodity hardware. eBay used Hadoop to build a massive cluster system, Athena, which is divided into five layers (as shown in Figure 3-1), from the bottom up:
1. The Hadoop core layer, including the Hadoop runtime environment, a number of common utilities, and HDFS. Here the file system is optimized for reading and writing large blocks of data; for example, the block size was increased from 128MB to 256MB.
2. The MapReduce layer, providing APIs and controls for developing and executing jobs.
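As a rough sketch of the programming model this layer exposes (illustrative Python, not eBay's production code), a word count job pairs a map function that emits (word, 1) with a reduce function that sums the counts after the shuffle has sorted pairs by key:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    """Emit a (word, 1) pair for every word in the input lines."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    """Group the shuffled pairs by key and sum the counts per word."""
    for word, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

counts = dict(reduce_phase(map_phase(["hadoop on ebay", "hadoop scales"])))
print(counts)  # {'ebay': 1, 'hadoop': 2, 'on': 1, 'scales': 1}
```

In real Hadoop the two phases run as distributed tasks over HDFS blocks; the `sorted()` call here stands in for the framework's shuffle-and-sort step.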
3. The data acquisition layer, whose main frameworks are currently HBase, Pig, and Hive:
HBase is a multidimensional, sorted store modeled after Google's BigTable. It provides ordered access to data by maintaining partitions and key ranges, and its data is stored on HDFS.
Pig is a procedural language that provides operations for loading, filtering, transforming, extracting, aggregating, joining, and grouping data; developers use Pig to build data pipelines and data factories.
Hive is a declarative language for building data warehouses using SQL-like syntax. For developers, product managers, and analysts, the SQL interface makes Hive a good choice.
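To make the pipeline idea concrete, the stages a Pig script chains together (load, filter, group, aggregate) can be mimicked in plain Python; the listing records below are made-up stand-ins for data loaded from HDFS:

```python
# Hypothetical listing records standing in for data loaded from HDFS.
listings = [
    {"category": "phones", "price": 120.0, "sold": True},
    {"category": "phones", "price": 80.0,  "sold": False},
    {"category": "books",  "price": 15.0,  "sold": True},
]

# Filter -> group -> aggregate: the same stages a Pig script would chain.
sold = [r for r in listings if r["sold"]]
totals = {}
for r in sold:
    totals[r["category"]] = totals.get(r["category"], 0.0) + r["price"]
print(totals)  # {'phones': 120.0, 'books': 15.0}
```

Pig expresses the same pipeline declaratively in its own language (Pig Latin) and compiles it down to MapReduce jobs.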
4. The tools and libraries layer. UC4 is an enterprise-grade scheduler that automatically loads data from multiple data sources at eBay. The libraries include a statistics library (R), a machine learning library (Mahout), a math library (Hama), and eBay's own library for parsing logs (Mobius).
5. The monitoring and alerting layer. Ganglia is a distributed cluster monitoring system, and Nagios is used to raise alerts on critical events such as a server being unreachable or a disk being full.
eBay's enterprise servers run 64-bit RedHat Linux:
The NameNode is the master server responsible for managing HDFS;
The JobTracker is responsible for coordinating jobs;
The HBase master is responsible for storing root information about HBase storage and coordinating with the data nodes and region servers;
ZooKeeper is a distributed lock coordinator that guarantees consistency for HBase.
The nodes used for storage and computation are 1U machines running CentOS, each with two quad-core CPUs and 2TB of storage. Every 38-42 nodes form a rack, building a high-density grid. On the network side, the bandwidth from each node to the top-of-rack switch is 1Gbps, and from the rack switch to the core switch it is 40Gbps.
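With those numbers, the rack uplink is roughly non-oversubscribed; a quick sketch of the arithmetic (assuming 40 nodes per rack, the middle of the stated range):

```python
nodes_per_rack = 40   # middle of the stated 38-42 range
node_link_gbps = 1    # node to top-of-rack switch
uplink_gbps = 40      # rack switch to core switch

# Ratio of total node bandwidth to the rack's uplink capacity.
oversubscription = nodes_per_rack * node_link_gbps / uplink_gbps
print(oversubscription)  # 1.0 -> roughly a 1:1 ratio per rack
```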
This cluster is used by multiple teams at eBay, for both production and one-off jobs. The Hadoop Fair Scheduler is used here to manage allocation: it defines job pools for each team, grants permissions, limits the number of concurrent jobs per user and group, sets preemption timeouts, and performs delay scheduling.
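The fair-sharing idea behind pools can be sketched as a max-min allocation of task slots across teams; this is a simplified illustration with made-up pool names, not the scheduler's actual algorithm:

```python
def fair_share(capacity, demands):
    """Max-min fair allocation of `capacity` slots across pool demands."""
    alloc = {pool: 0 for pool in demands}
    remaining = dict(demands)
    while capacity > 0 and remaining:
        share = capacity // len(remaining)  # equal split of what is left
        if share == 0:
            break
        for pool in list(remaining):
            take = min(share, remaining[pool])
            alloc[pool] += take
            remaining[pool] -= take
            capacity -= take
            if remaining[pool] == 0:
                del remaining[pool]  # pool satisfied, redistribute the rest
    return alloc

# Small pools get their full demand; the big pool absorbs the leftover slots.
print(fair_share(100, {"search": 80, "research": 30, "adhoc": 10}))
# {'search': 60, 'research': 30, 'adhoc': 10}
```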
Figure 3-2 Data flow
The data flow is processed as shown in Figure 3-2. The system needs to handle 8TB to 10TB of new data per day, and Hadoop is mainly used for:
Ranking based on machine learning: Hadoop is used to compute ranking functions over several factors (such as price, listing format, seller track record, and relevance), and new factors are added to test hypotheses, improving the relevance of eBay search.
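A minimal sketch of such a ranking function, with made-up factor names and weights (a trained model would learn the weights from data rather than hard-code them):

```python
def rank_score(item, weights):
    """Linear combination of ranking factors; a learned model fits the weights."""
    return sum(weights[f] * item[f] for f in weights)

# Hypothetical weights and factor values, normalized to [0, 1].
weights = {"relevance": 0.6, "seller_record": 0.3, "price_competitiveness": 0.1}
items = [
    {"id": "a", "relevance": 0.9, "seller_record": 0.5, "price_competitiveness": 0.8},
    {"id": "b", "relevance": 0.7, "seller_record": 0.9, "price_competitiveness": 0.9},
]
ranked = sorted(items, key=lambda it: rank_score(it, weights), reverse=True)
print([it["id"] for it in ranked])  # ['b', 'a']
```

Adding a new factor to test a hypothesis amounts to adding one more key to `weights` and recomputing the scores over the historical data stored in the cluster.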
Mining item description data: data mining and machine learning techniques are used, in a completely unsupervised manner, to convert item description lists into key/value pairs associated with the items, broadening the coverage of the classification.
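One simple way to turn free-form descriptions into key/value pairs is pattern-based extraction of "attribute: value" fragments; this toy sketch only hints at the idea, as the real mining is unsupervised and far more sophisticated:

```python
import re

def extract_pairs(description):
    """Pull 'key: value' fragments out of free-form item text."""
    pattern = re.compile(r"(\w[\w ]*?)\s*:\s*([^,;]+)")
    return {k.strip().lower(): v.strip() for k, v in pattern.findall(description)}

desc = "Brand: Acme, Color: red; Screen size: 5 in"
print(extract_pairs(desc))
# {'brand': 'Acme', 'color': 'red', 'screen size': '5 in'}
```

Run over millions of listings as a MapReduce job, even crude extraction like this yields structured attributes for items that sellers never categorized explicitly.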
The challenges eBay researchers face in building and using the system, and some of their initial plans, are as follows:
Scalability: the current master, the NameNode, has scaling problems. As the cluster's file system grows, so does the amount of metadata it must keep in memory, so memory consumption keeps growing; 1PB of storage requires nearly 1GB of memory. Possible solutions are to partition the namespace hierarchically, or to combine HBase and ZooKeeper for federated metadata management.
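The memory pressure follows directly from that rule of thumb, roughly 1GB of NameNode heap per petabyte of storage (the cluster size below is hypothetical):

```python
gb_heap_per_pb = 1   # rule of thumb quoted in the text: ~1GB heap per 1PB stored
cluster_pb = 5       # hypothetical file system size in petabytes

# Heap grows linearly with stored data, since all metadata lives in memory.
needed_heap_gb = cluster_pb * gb_heap_per_pb
print(needed_heap_gb)  # 5 -> about 5GB of NameNode heap for a 5PB file system
```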
Availability: the availability of the NameNode is critical for production workloads. The open source community has proposed alternatives such as checkpoint and backup nodes, moving from the secondary NameNode to avatar nodes, journal metadata replication techniques, and more. eBay researchers are building their own production clusters based on these methods.
Data discovery: supporting data management, data discovery, and schema management on top of a system that stores unstructured data. A new project proposes merging Hive's metadata with Owl into a new system called Howl. eBay researchers are trying to connect it to their analytics platform so that users can easily discover data across the different data systems.
Data movement: eBay researchers are considering releasing data movement tools that support copying data between different subsystems, such as the data warehouse and HDFS.
Policies: quotas enable better archiving, backup, and other policies (the quotas in existing versions of Hadoop need improvement). eBay researchers set quotas for different clusters according to workload and cluster characteristics.
Metrics: eBay researchers are developing robust tools to measure data sources, consumption, budgets, usage, and so on.
eBay is also changing the way it collects, transforms, and uses data, in order to provide better business intelligence services.