Hadoop-Based Big Data Analysis: Application Scenarios and Practice

Source: Internet
Author: User
Tags: big data, data computing, Hadoop, data analysis, MapReduce

To keep pace with constantly changing business requirements, Jingdong's Jingmai team adopted Hadoop, a popular open-source big data computing engine, on top of the Jingdong Big Data Platform to build a decision-support data product for JD operations and product teams: the Beidou platform.


I) Hadoop application and business analysis

Big data is a collection of large data sets that cannot be processed using traditional computing techniques. It is not a single technology or tool, but rather involves many areas of business and technology.

The three mainstream distributed computing systems today are Hadoop, Spark, and Storm:

Hadoop is one of today's de facto standards for big data management and is used in many commercial applications. It can easily integrate structured, semi-structured, and even unstructured data sets.

Spark uses in-memory computing: starting from iterative batch processing, it loads data into memory for repeated querying, and it combines multiple computing paradigms such as data warehousing, stream processing, and graph computing. Spark builds on HDFS and integrates well with Hadoop; its RDD (Resilient Distributed Dataset) abstraction is a distinguishing feature.

Storm is a distributed real-time computing system for high-speed, large-volume data streams. It adds reliable real-time data processing capability to the Hadoop ecosystem.

Hadoop is an open-source Apache framework written in Java that allows large data sets to be distributed across a cluster and processed using a simple programming model. Applications built on the Hadoop framework run in an environment that provides distributed storage and computation across clusters of computers. Hadoop is designed to scale from a single server to thousands of machines, each offering local computation and storage.

Hadoop is suited to massive data, offline data, and complex data. Typical application scenarios are as follows:

Scenario 1: Data analysis, such as Jingdong's massive log analysis, product recommendation, and user behavior analysis

Scenario 2: Offline computation, such as astronomical computation (heterogeneous computing + distributed computing)

Scenario 3: Massive data storage, such as Jingdong's storage clusters


Three practical scenarios based on the Jingmai business:

  • Jingmai user analysis

  • Jingmai traffic analysis

  • Jingmai order analysis

All three involve offline data, so Hadoop was chosen as the computing engine for Jingmai's data products. As the business develops, stream computing engines such as Storm will be added.


II) The basic principles of Hadoop

The core design of the Hadoop distributed processing framework consists of two components:

HDFS (Hadoop Distributed File System): a distributed file system

MapReduce: a computing model and software architecture


2.1 HDFS

HDFS (Hadoop Distributed File System) is Hadoop's distributed file storage system.

HDFS decomposes large files into multiple blocks and keeps multiple copies of each block, providing a fault-tolerance mechanism that recovers automatically when a replica is lost or a node goes down. By default each block is stored in 3 copies, and one block is 64 MB. Block metadata is mapped into memory as key-value pairs.
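As a minimal sketch of how this looks from the client side (assuming a reachable cluster whose configuration is on the classpath, and a hypothetical file /logs/access.log), the standard HDFS FileSystem API in Java can report a file's replication factor, block size, and block locations:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlockInfo {
    public static void main(String[] args) throws Exception {
        // fs.defaultFS is normally picked up from core-site.xml on the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/logs/access.log"); // hypothetical path
        FileStatus status = fs.getFileStatus(file);

        // Replication defaults to 3; block size is configurable
        // (64 MB in older Hadoop versions, 128 MB in later ones).
        System.out.println("Replication: " + status.getReplication());
        System.out.println("Block size:  " + status.getBlockSize());

        // List every block of the file and the DataNodes holding its copies.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("Block at offset " + block.getOffset()
                    + " on hosts " + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}
```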


2.2 MapReduce

MapReduce is a programming model that encapsulates the details of parallel computation, fault tolerance, data distribution, and load balancing. A MapReduce job begins with the map step, which applies an operation to each document in the collection, groups the results by the keys they produce, and places each resulting list of values under its key. Reduce then collapses each list of values into a single value, returned under its key, regrouping by key until each key has exactly one value. The advantage is that once a task is decomposed this way, it can be computed in parallel by a large number of machines, reducing the total running time. Put plainly, the principle behind MapReduce is a divide-and-conquer algorithm.

Algorithm: a MapReduce program executes in three phases: the map phase, the shuffle phase, and the reduce phase.

Map phase: the job of the map, or mapper, is to process input data. The input data is generally a file or directory stored in the Hadoop file system (HDFS). The input file is passed to the mapper function line by line, and the mapper processes the data and produces several small chunks of data.

Reduce phase: this phase combines the shuffle step and the reduce step. The reducer's job is to process the data coming from the mapper; after processing, it produces a new set of outputs that are stored in HDFS.
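The canonical example of these three phases is word counting. The sketch below is the classic Hadoop WordCount job in Java: the mapper emits a (word, 1) pair for every word, the shuffle groups the pairs by word, and the reducer sums each group; input and output HDFS paths are taken from the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Map: emit (word, 1) for every word in an input line.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reduce: after the shuffle groups pairs by word, sum the counts.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Reusing the reducer as a combiner lets each map task pre-aggregate its own output locally, cutting down the data moved during the shuffle.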


2.3 Hive

Hive is a data warehouse tool built on Hadoop. It maps structured data files onto database tables, provides full SQL query capability, and converts SQL statements into MapReduce jobs. Its SQL dialect is called HQL, and it lets users unfamiliar with MapReduce query, summarize, and analyze data in SQL. MapReduce developers can also plug in their own mappers and reducers to support more complex analysis than HQL can express.

Hadoop and MapReduce are the foundation of the Hive architecture. The Hive architecture includes the following components: CLI (command-line interface), JDBC/ODBC, Thrift Server, web GUI, metastore, and Driver (compiler, optimizer, and executor).
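As a small illustration of the JDBC/ODBC path listed above (the HiveServer2 endpoint, credentials, and the page_view table are placeholder assumptions), a Java client can submit HQL and let Hive compile it into MapReduce jobs behind the scenes:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // Hive's JDBC driver; the endpoint and table below are hypothetical.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        String url = "jdbc:hive2://hiveserver:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "user", "");
             Statement stmt = conn.createStatement()) {
            // Hive compiles this HQL into one or more MapReduce jobs.
            ResultSet rs = stmt.executeQuery(
                "SELECT dt, COUNT(DISTINCT user_id) AS uv " +
                "FROM page_view GROUP BY dt");
            while (rs.next()) {
                System.out.println(rs.getString("dt") + "\t" + rs.getLong("uv"));
            }
        }
    }
}
```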


III) Pitfalls encountered with Hadoop

When running Hive jobs, badly written HQL easily causes data skew. The cases fall roughly into a few categories: skew from null values, skew from joining columns of different data types, and skew in joins on hot keys. Only by understanding how Hadoop works and using HQL skillfully can you avoid data skew and improve query efficiency.
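As one hedged sketch of handling the null-value case (the table and column names and the endpoint are hypothetical), the usual trick is to scatter null join keys with a random value so they no longer all hash to one reducer. The scattered keys cannot match any real user_id, so the join result is unchanged, and casting both sides to a common type also covers the mixed-data-type category:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class SkewSafeJoin {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        String url = "jdbc:hive2://hiveserver:10000/default"; // hypothetical endpoint
        try (Connection conn = DriverManager.getConnection(url, "user", "");
             Statement stmt = conn.createStatement()) {
            // Null keys all hash to the same reducer and skew the join;
            // replacing them with a random, unmatchable key spreads the load
            // without changing the result of the LEFT OUTER JOIN.
            stmt.execute(
                "CREATE TABLE order_user AS " +
                "SELECT o.order_id, u.user_name " +
                "FROM orders o " +
                "LEFT OUTER JOIN users u " +
                "  ON CASE WHEN o.user_id IS NULL " +
                "          THEN CONCAT('skew_', CAST(RAND() AS STRING)) " +
                "          ELSE CAST(o.user_id AS STRING) END " +
                "     = CAST(u.user_id AS STRING)");
        }
    }
}
```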
