Big data is undoubtedly hot: at this Hadoop development and application sharing session, seats and tickets could not keep up with demand, and the staff had to set up two venues to give more participants the chance to communicate face to face with the speakers.
For this event, the CSDN Cloud Computing Club invited Long, founder of the Hadoop big data company Red Elephant Cloud Teng; Wang Zhenping, senior engineer at Shanghai Bao Xin; and Lee, senior engineer at Zhaopin, to share their Hadoop and big data practice in depth.
Long: Hadoop principles, application scenarios, and core ideas
Long is the founder of the EasyHadoop community, a former R&D manager of the Storm audio platform, and the first person in China to pass the Cloudera Certified Developer for Apache Hadoop (CCDH) exam. He is also the founder and chief architect of Red Elephant Cloud Teng, has given big data talks many times at the China CIO Annual Meeting, the Aliyun Congress, and the Beijing University CIO Forum, and serves as a big data Hadoop expert for Data Wis. He delivered the first talk of this big data salon.
How Hadoop works
The Hadoop market is growing fast, and even banks and telecom operators have begun to try it. Long analyzed Hadoop mainly from the following three aspects:
Hadoop's principles and working mechanism
What has been proven, and what still needs to be tested and explored
Actual Use Cases
Drawing on practices from the EasyHadoop community and his start-up RedHadoop, Long described the tight links between Hadoop, big data, and cloud computing:
1. New data services: like Baidu, Tencent, Aliyun, and other large companies, organizations use platforms such as Hadoop to build a bigger data platform, collect data for analysis, and push the results out in various ways; this is the concept of data services.
2. Cloud computing brings competitiveness: in essence, this is the openness of data. Compared with a traditional database, it lets you carry out individualized analysis better, and Hadoop does exactly that.
A comparison between Hadoop and older platforms
The core of big data technology divides into two parts: virtualization technology and technologies like Hadoop. The two are opposites: virtualization is more about concentrating resources into a mainframe, while Hadoop, on the contrary, pools all kinds of scattered resources. Non-Hadoop platforms are typically core business systems, represented by IOE (IBM, Oracle, EMC). The pros and cons of the two kinds of systems are compared below:
Mainframe: stable and built from high-quality components, with very strong I/O capability; it can manage more disks and data, and it also dominates in CPU count. However, transfers between machines are limited, and storage and the compute cores share common bandwidth. Transfers between machines generate a large amount of disk I/O, creating disk bottlenecks, and the shared bandwidth becomes a problem as well. The difficulty of keeping many CPUs busy is also exposed; in general, I/O becomes the bottleneck of the whole system.
Hadoop: everything is fragmented. Files are split into blocks across different nodes, the computation is moved to the nodes where the data lives, and the nodes perform I/O in parallel, so each node needs to mount many disks. The number of map and reduce tasks is tied to the number of CPU cores, so the more cores, the more map slots can be configured and the faster the maps run. Moving the computation instead of moving the data in order to get higher I/O throughput is the essence of big data.
In this section, Long used several examples to analyze the MapReduce execution mechanism in more detail, and also explained the role and function of HBase.
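To make the MapReduce model concrete, here is a minimal word-count job written against the standard Hadoop Java API. It is a generic illustration of the map/shuffle/reduce flow described above, not code from Long's talk.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Map tasks run on the nodes that hold the input blocks,
      // so the computation moves to the data rather than the reverse.
      public static class TokenMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
          }
        }
      }

      // Reduce tasks receive all values for a key after the shuffle and aggregate them.
      public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable v : values) {
            sum += v.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }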
Hadoop Application Scenarios
Long believes that the main applications of Hadoop today are archiving, search engines, and data warehouses, with organizations using different Hadoop components to implement their own use cases. Beyond these three scenarios there is a relatively less common one, stream processing, which builds on Hadoop 2.0's ability to be combined with other frameworks; in the future, Hadoop will certainly evolve toward online data processing.
Hadoop Core Ideas
The Hadoop platform drives internal data to open up and lets everyone participate in reporting and data development. It can realize enterprise-wide data sharing: in particular, Hadoop's queue, resource pool, and task scheduler mechanisms let the whole model shift to sharing multiple resources, instead of the previous database approach of keeping data behind layer upon layer of isolation. Finally, Long also walked through several real-world practices.
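As a rough illustration of the queue mechanism mentioned above, the sketch below routes a job to a named scheduler queue so that the cluster's resource pools, rather than per-team silos, decide its share of capacity. The queue name "reporting" and the job name are hypothetical, and the property shown is the Hadoop 2.x one (Hadoop 1.x uses mapred.job.queue.name).

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class QueueSubmitExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Send this job to a shared queue instead of the default one; the
        // scheduler's resource pools then control how much of the cluster it gets.
        conf.set("mapreduce.job.queuename", "reporting");   // hypothetical queue name
        Job job = Job.getInstance(conf, "shared-cluster-report");
        // ... set mapper, reducer, and input/output paths as in a normal job ...
      }
    }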
Wang Zhenping: Architecture and challenges of a Hadoop-based transaction log platform
Wang Zhenping, a senior engineer at Shanghai Bao Xin who works in the financial industry, shared the Hadoop-based transaction log platform in depth from five aspects: background, requirements and objectives, challenges, system architecture, and related Hadoop knowledge:
Background
Usage scenarios: delays in credit card transactions; the causes and types of transaction failures; finding non-compliant acquiring institutions and merchants and producing the reasons.
Data characteristics: in volume, nearly 300 million transactions per day; in terms of data state, only the processed transactions are currently stored, and the raw transaction logs are not retained.
Requirements and objectives: second-level (seconds-latency) queries over transaction logs, transaction failure analysis, analysis of irregular transactions, user self-service analysis, and combination with other data to identify the reasons for failed transactions and produce analysis reports.
The challenges: how to collect the logs with minimal impact on the production systems, how to quickly move the 300-million-plus transaction logs per day into the Hadoop cluster, how to manage a large number of jobs, and how to implement second-level queries.
System Building and architecture
Building the system was a process of encountering problems and solving them. Based on the requirements and background above, Wang Zhenping shared his experience of solving each one:
1. Minimizing the impact of data collection: in general this simply means choosing the right time window and method based on the business. Here, collection runs every morning from 1:00 to 5:00, because the data is stored in binary form in local files spread across multiple machines. To obtain the data quickly, a one-to-one mapping is used between collection clients and business data sources, and each client can be configured to fetch data from different business systems at the same time.
2. Quickly moving 300 million+ transaction logs per day into the Hadoop cluster
Here Wang Zhenping abandoned MapReduce and chose an in-house loader, mainly because HDFS splits files into blocks for distribution while the files are stored in binary form. Factors such as file splitting, the demarcation between packets, incomplete messages, and the uncontrollable availability of the logs during parsing, together with the complexity of the log parsing specification, made MapReduce a poor fit.
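The sketch below illustrates the general idea of such a loader: read whole binary records on the source machine so packet boundaries stay intact, parse them outside the cluster, and write the output into HDFS through the FileSystem API. The length-prefixed record format and the paths are assumptions made for illustration, not Bao Xin's actual format.

    import java.io.DataInputStream;
    import java.io.EOFException;
    import java.io.FileInputStream;
    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BinaryLogLoader {
      public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // args[0]: local binary log file, args[1]: target HDFS path
        try (DataInputStream in = new DataInputStream(new FileInputStream(args[0]));
             FSDataOutputStream out = fs.create(new Path(args[1]))) {
          while (true) {
            int len;
            try {
              len = in.readInt();            // assumed 4-byte length prefix per record
            } catch (EOFException eof) {
              break;                          // end of the local log file
            }
            byte[] record = new byte[len];
            in.readFully(record);             // always read a complete record, never a fragment
            // Parse the binary record into a delimited text line here, then write it out.
            out.write(record);
            out.write('\n');
          }
        }
      }
    }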
3. Managing a large number of jobs
The figure above shows the company's job management structure, which mainly involves four components: a job orchestrator, responsible for arranging jobs; a job manager, responsible for job scheduling; a job status manager, which audits jobs and identifies problems; and job triggers, which fire jobs, their dependent jobs, or other jobs.
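A minimal sketch of how these four roles might be expressed as interfaces; the names, methods, and status values are purely illustrative and are not Bao Xin's actual implementation.

    /** Hypothetical sketch of the four job-management components described above. */
    public class JobManagementSketch {
      enum JobStatus { WAITING, RUNNING, SUCCEEDED, FAILED }

      /** Decides when a job should run (time windows, calendars, ordering). */
      interface JobOrchestrator { void schedule(String jobId, String cronExpression); }

      /** Submits jobs to the cluster and controls their execution. */
      interface JobManager { void submit(String jobId); void kill(String jobId); }

      /** Audits job results and flags problems for follow-up. */
      interface JobStatusManager { JobStatus check(String jobId); }

      /** Fires a job, plus any downstream jobs that depend on it. */
      interface JobTrigger { void fire(String jobId); void fireDependents(String jobId); }
    }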
Second-level queries: Wang Zhenping achieved seconds-latency queries through HBase storage, secondary indexes, parallel region queries, support for range queries over the data, encapsulation of the HBase access API to improve development efficiency, and cluster tuning.
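A simplified sketch of the secondary-index pattern described here: scan an index table to collect the row keys of matching transactions, then batch-get the full records from the data table. The table names, column family, and row-key layout are assumptions, and the code uses the current HBase client API rather than whichever version the platform ran at the time.

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class SecondaryIndexQuery {
      public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table index = conn.getTable(TableName.valueOf("txn_index"));   // hypothetical tables
             Table data  = conn.getTable(TableName.valueOf("txn_log"))) {

          // Step 1: scan the index table by the query attribute
          // (e.g. merchant id + date range) to collect matching row keys.
          Scan scan = new Scan();
          scan.setStartRow(Bytes.toBytes("M001|20130501"));
          scan.setStopRow(Bytes.toBytes("M001|20130502"));
          List<Get> gets = new ArrayList<>();
          try (ResultScanner rs = index.getScanner(scan)) {
            for (Result r : rs) {
              byte[] dataRowKey = r.getValue(Bytes.toBytes("f"), Bytes.toBytes("rowkey"));
              gets.add(new Get(dataRowKey));
            }
          }

          // Step 2: batch-get the full transaction records from the data table.
          Result[] transactions = data.get(gets);
          System.out.println("matched transactions: " + transactions.length);
        }
      }
    }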
Finally, Wang Zhenping also shared Shanghai Bao Xin's cluster status, related Hadoop knowledge, and his experience in using and learning Hadoop. On the usage side, he believes that in the initial stage you should plan the cluster scale, network, server hardware configuration, and environment well, and that during operation you should pay attention to cluster monitoring, the collection and analysis of runtime logs, and common operating-system tuning, with an emergency response process being an indispensable link. On the learning side, he believes it is necessary to read the source code and understand how the system works, but there is no need to modify it early on.
Lee: Hadoop practice at Zhaopin, and points to note
Lee, a senior engineer at Zhaopin, said the company's cluster has dozens of data nodes and shared how Hadoop is used at Zhaopin:
Web Log Analysis
Reasons for not using GA
Log data: user-generated logs, logs pushed from the CDN, and load-balancer logs
Mainly analyzing user-generated logs: traversing the May 30 logs (gzip-compressed) took 1 minute 21 seconds; using the load function in Piggybank with a regular expression to separate the fields, the same data took 2 minutes 18 seconds (a standalone sketch of this kind of parsing follows after this list)
Log collection.
Recommendation system (a very easy-to-understand recommendation algorithm: excluding noise data plus a certain number of matching rules)
How to solve the recommendation system's cold-start problem (using a simple homegrown approach)
Plans for the future recommendation system (machine learning)
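For reference, the sketch below shows, outside the cluster, the kind of regex-based field splitting over a gzip-compressed log that the Piggybank load function performs inside Pig; the log layout and the regular expression are assumptions, not Zhaopin's actual format.

    import java.io.BufferedReader;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    import java.util.zip.GZIPInputStream;

    public class GzipLogFieldSplit {
      // Assumed layout: timestamp, user id, and url separated by whitespace.
      private static final Pattern LINE = Pattern.compile("^(\\S+)\\s+(\\S+)\\s+(\\S+)");

      public static void main(String[] args) throws IOException {
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
            new GZIPInputStream(new FileInputStream(args[0])), "UTF-8"))) {
          String line;
          long parsed = 0;
          while ((line = reader.readLine()) != null) {
            Matcher m = LINE.matcher(line);
            if (m.find()) {
              // The fields become m.group(1..3); on the cluster each map task
              // would emit these as tuple fields instead of just counting them.
              parsed++;
            }
          }
          System.out.println("parsed lines: " + parsed);
        }
      }
    }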
He then shared, from his experience at Zhaopin, some points to pay attention to when using Hadoop:
The number of map and reduce slots per machine should match its CPU core count (when memory is limited, consider reducing the number of reduce slots)
The DataNode JVM heap should not exceed 2 GB; the number of disks on a DataNode should match its CPU core count, and no RAID should be used
The NameNode and Secondary NameNode are best placed on RAID; size the NameNode heap according to the HDFS scale, as 8 GB of memory can cover roughly 800 TB of data (excluding extreme cases with many small files, because regardless of file size, each file, directory, and block requires about 150 bytes of NameNode memory)
If the cluster is relatively small, consider compressing all source data before uploading it. Zhaopin currently uses gzip (a non-splittable format, but it saves a lot of disk space and is very cost-effective)
Configure Snappy compression for the shuffle to save network bandwidth (a minimal configuration sketch follows below)
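A minimal sketch of turning on Snappy compression for map output, which is what the shuffle transfers across the network. The property names shown are the Hadoop 2.x ones; Hadoop 1.x uses mapred.compress.map.output and mapred.map.output.compression.codec instead.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.SnappyCodec;
    import org.apache.hadoop.mapreduce.Job;

    public class ShuffleCompressionExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Compress intermediate map output with Snappy so the shuffle
        // moves less data over the network.
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.setClass("mapreduce.map.output.compress.codec",
            SnappyCodec.class, CompressionCodec.class);
        Job job = Job.getInstance(conf, "snappy-shuffle-job");
        // ... configure mapper, reducer, and input/output paths as usual ...
      }
    }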
Lee then went deeper into the technology with code, and also listed the main users of Pig:
Yahoo!: more than 90% of MapReduce jobs are generated by Pig
Twitter: more than 80% of MapReduce jobs are generated by Pig
LinkedIn: most MapReduce jobs are generated by Pig
Other major users: Salesforce, Nokia, AOL, ComScore
Finally, Lee also covered the main developers of Pig, including Hortonworks, Twitter, Yahoo!, and Cloudera.