To keep pace with ever-changing business requirements, JD's Jingmai team has built the Beidou platform, a decision-support data product for JD operations and product staff, on top of the JD Big Data Platform, using the popular open-source big data computing engine Hadoop.
I) Business analysis of Hadoop applications
Big data refers to data sets too large to be processed with traditional computing techniques. It is not a single technology or tool; rather, it spans many areas of business and technology.
The three mainstream distributed computing systems today are Hadoop, Spark, and Storm:
Hadoop is the current de facto standard for big data management and is used in many commercial applications. It easily integrates structured, semi-structured, and even unstructured data sets.
Spark uses in-memory computing: starting from multi-pass batch processing, it loads data into memory for repeated queries, and it combines multiple computing paradigms such as data warehousing, stream processing, and graph computing. Spark is built on HDFS and integrates well with Hadoop; its RDD abstraction is a distinguishing feature.
Storm is a distributed real-time computing system for high-speed, high-volume data streams; it adds reliable real-time data processing capabilities to Hadoop.
Hadoop is an open-source Apache framework written in Java that enables distributed processing of large data sets across clusters using a simple programming model. The Hadoop framework provides an environment for distributed storage and computation across clusters of computers. It is designed to scale from a single server to thousands of machines, each providing local computation and storage.
Hadoop is suited to massive, offline, and complex data. Typical application scenarios are as follows:
Scenario 1: Data analysis, such as JD's massive log analysis, product recommendation, and user behavior analysis
Scenario 2: Offline computation (heterogeneous computation + distributed computation), such as astronomical calculations
Scenario 3: Massive data storage, such as JD's storage clusters
Three practical scenarios arise from the Jingmai business:
- Jingmai user analysis
- Jingmai traffic analysis
- Jingmai order analysis
All three involve offline data, so Hadoop was chosen as the data computing engine for Jingmai's data products. As the business develops, a stream-computing engine such as Storm will be added.
II) The basic principles of Hadoop
The core design of Hadoop's distributed processing framework consists of:
HDFS (Hadoop Distributed File System): a distributed file system
MapReduce: a computing model and software architecture
2.1 HDFS
HDFS (Hadoop Distributed File System) is Hadoop's distributed file storage system.
It decomposes large files into multiple blocks, each stored as multiple copies, and provides a fault-tolerance mechanism that automatically recovers when a replica is lost or a node goes down. By default each block is 64 MB and is stored as 3 replicas, and block metadata is kept in memory as key-value mappings.
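The block-splitting arithmetic above can be sketched in a few lines; this is a plain-Python illustration of the default policy (64 MB blocks, 3 replicas), not actual HDFS code, and the function name is invented for the example.

```python
# Illustration of HDFS's default block policy: a file is split into
# fixed-size blocks, and each block is stored on several DataNodes.
import math

BLOCK_SIZE = 64 * 1024 * 1024   # 64 MB default block size
REPLICATION = 3                 # default replication factor

def plan_blocks(file_size_bytes):
    """Return (number of blocks, total bytes stored including replicas)."""
    n_blocks = math.ceil(file_size_bytes / BLOCK_SIZE)
    total_stored = file_size_bytes * REPLICATION
    return n_blocks, total_stored

# A 200 MB file needs 4 blocks (3 full + 1 partial) and 600 MB of raw storage.
blocks, stored = plan_blocks(200 * 1024 * 1024)
print(blocks, stored // (1024 * 1024))
```

The 3x storage overhead is the price paid for the automatic recovery described above: losing any single replica leaves two live copies from which to re-replicate.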
2.2 MapReduce
MapReduce is a programming model that encapsulates details such as parallel computation, fault tolerance, data distribution, and load balancing. A MapReduce job begins with the map step, which applies an operation to each record in the collection and emits key-value pairs; the pairs are then grouped by key, so that each key is associated with a list of values. The reduce step collapses each key's list of values into a single value. The advantage is that once a task is decomposed this way, it can be computed in parallel by a large number of machines, greatly reducing overall running time. Put plainly, MapReduce is a divide-and-conquer algorithm.
Algorithm: a MapReduce program executes in three phases: the map phase, the shuffle phase, and the reduce phase.
Map phase: the job of the map (or mapper) is to process input data. The input is generally a file or directory stored in Hadoop's file system (HDFS). The input file is passed to the mapper function line by line; the mapper processes the data and emits several small chunks of intermediate data.
Reduce phase: this phase combines the shuffle step and the reduce step. The reducer's job is to process the data coming from the mappers; after processing, it produces a new set of outputs, which are stored in HDFS.
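The three phases can be sketched as a word count in plain Python. This is a toy simulation of the model, not Hadoop's actual API; in a real job each phase would run in parallel across many machines.

```python
# Toy simulation of the MapReduce model: map -> shuffle (group by key) -> reduce.
from collections import defaultdict

def map_phase(lines):
    """Mapper: emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.split():
            yield word, 1

def shuffle_phase(pairs):
    """Shuffle: group all emitted values under their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reducer: collapse each key's list of values into a single value."""
    return {key: sum(values) for key, values in groups.items()}

lines = ["hello hadoop", "hello jingmai"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts)   # {'hello': 2, 'hadoop': 1, 'jingmai': 1}
```

Because the mapper treats every line independently and the reducer treats every key independently, both steps can be sharded across machines with no coordination beyond the shuffle, which is the divide-and-conquer point made above.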
2.3 HIVE
Hive is a data warehouse tool built on Hadoop. It maps structured data files to database tables, provides full SQL query capability, and converts SQL statements into MapReduce jobs. This SQL dialect is referred to as HQL. It lets users unfamiliar with MapReduce query, summarize, and analyze data using SQL, while MapReduce developers can plug in their own mappers and reducers to support Hive in more complex data analysis.
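To see how an HQL aggregation decomposes into MapReduce, consider a query like `SELECT dept, SUM(sales) FROM t GROUP BY dept`. The plain-Python sketch below mirrors the phases Hive would generate; the table `t` and its columns are invented for illustration, and Hive's real compiler is far more elaborate.

```python
from collections import defaultdict

# Invented sample rows standing in for a Hive table t(dept, sales).
t = [("books", 10), ("toys", 5), ("books", 7), ("toys", 3)]

# Map: emit the GROUP BY column as the key, the aggregated column as the value.
mapped = [(dept, sales) for dept, sales in t]

# Shuffle: rows sharing a key are routed to the same reducer.
groups = defaultdict(list)
for dept, sales in mapped:
    groups[dept].append(sales)

# Reduce: apply the aggregate function (SUM) once per key.
result = {dept: sum(values) for dept, values in groups.items()}
print(result)   # {'books': 17, 'toys': 8}
```

The GROUP BY column becomes the shuffle key, which is why the distribution of that column's values decides how evenly work spreads across reducers.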
As the figure above shows, Hadoop and MapReduce are the foundation of the Hive architecture. The Hive architecture includes the following components: CLI (command-line interface), JDBC/ODBC, Thrift Server, Web GUI, Metastore, and Driver (Compiler, Optimizer, and Executor).
III) Pitfalls we hit with Hadoop
When running Hive jobs, poorly written HQL easily causes data skew. It falls roughly into several categories: skew from null values, skew from joining columns of different data types, and skew in joins. Only by understanding Hadoop's principles and writing HQL skillfully can you avoid data skew and improve query efficiency.