Classifying big data processing workloads

Source: Internet
Author: User

With the rapid growth of the Internet, the mobile Internet, and the IoT, we have entered an era of massive data, and analyzing and processing that data has become an urgent and substantial need.

Hadoop has hard-to-match advantages in scalability, robustness, computational performance, and cost, and in practice it has become the mainstream big data processing platform among Internet companies.

Analysis and classification of big data processing

How well the Hadoop platform fits depends heavily on the workload. To help you judge whether it matches your business, this section classifies big data processing workloads from several angles; different requirements call for different data analysis architectures.

By how quickly results are needed, data analysis is mainly divided into real-time analysis and offline analysis.

Real-time data analysis is used mainly in finance, the Internet industry, and similar fields. The typical requirement is to return analysis results over billions of records fast enough that the user experience is not affected. To meet this requirement, one can build a parallel processing cluster from carefully designed databases, use an in-memory computing platform, or adopt an HDD architecture, but all of these increase hardware and software cost. Current tools for real-time analysis of massive data include EMC's Greenplum and SAP's HANA.

Applications without such strict response-time requirements, such as daily offline statistical analysis, machine learning, and search engine inverted index calculation, generally use offline analysis. Log data is imported into the analysis platform by data collection tools. With massive data, however, traditional ETL tools often fail outright, mainly because the cost of data format conversion is too high and their performance cannot keep up with massive data collection. The collection tools used by Internet companies include Facebook's open source Scribe, LinkedIn's open source Kafka, Taobao's open source TimeTunnel, and Hadoop's Chukwa. All of them can collect and transmit hundreds of MB of log data per second and upload it to a central Hadoop system.
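As one illustration of the offline workloads listed above, the search engine index calculation can be run as a batch job over collected documents. The following is only a minimal sketch: it assumes the index meant is an inverted index, it uses a tiny made-up corpus held in memory, and the function name `build_inverted_index` is hypothetical.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each lowercase term to the sorted list of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


# Toy corpus standing in for a day's collected pages (made-up data).
docs = {
    1: "hadoop stores log data",
    2: "kafka ships log data to hadoop",
    3: "offline analysis runs nightly",
}
index = build_inverted_index(docs)
print(index["hadoop"])   # ids of documents that mention "hadoop"
print(index["nightly"])  # ids of documents that mention "nightly"
```

Because nothing here has to answer a user within seconds, a job of this kind can run over the whole data set on a schedule, which is what makes it a fit for offline analysis.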

By the volume of data stored, big data is divided into three levels: memory level, BI level, and massive level.
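The three levels, described in the paragraphs that follow, can be read as a simple decision rule on data volume. The sketch below is one possible way to express it. The article gives no numeric boundaries, so both limits are supplied by the caller, and the names `StorageTier` and `classify_by_volume` as well as the example sizes are assumptions made for illustration.

```python
from enum import Enum


class StorageTier(Enum):
    MEMORY = "memory level"
    BI = "BI level"
    MASSIVE = "massive level"


def classify_by_volume(data_bytes, cluster_memory_bytes, bi_capacity_bytes):
    """Pick a tier from the data volume and two caller-supplied limits.

    Both limits are assumptions of this sketch: the article gives no
    numeric boundaries between the levels.
    """
    if data_bytes <= cluster_memory_bytes:
        return StorageTier.MEMORY
    if data_bytes <= bi_capacity_bytes:
        return StorageTier.BI
    return StorageTier.MASSIVE


TB = 1024 ** 4
# Hypothetical setup: 2 TB of total cluster memory, BI product sized for 50 TB.
print(classify_by_volume(1 * TB, 2 * TB, 50 * TB).value)
print(classify_by_volume(10 * TB, 2 * TB, 50 * TB).value)
print(classify_by_volume(500 * TB, 2 * TB, 50 * TB).value)
```

In practice the second limit is as much about cost as about capacity, so it is better treated as a budget decision than as a fixed technical threshold.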

The memory level means the data volume does not exceed the total memory of the cluster. Do not underestimate memory capacity: Facebook caches 320 TB of data in memory, and a current PC server can hold more than 100 GB of memory. At the memory level, an in-memory database can therefore be used, keeping hot data resident in memory to get fast analysis, which suits real-time business analysis well. The following figure shows a practical MongoDB analysis architecture.

[Figure: MongoDB analysis architecture]

Large MongoDB clusters have some stability problems, such as periodic blocking and synchronization failures, but MongoDB remains a promising NoSQL option for high-speed data analysis.

The BI level covers data that is too large for memory but can generally still be loaded into traditional BI products or specially designed databases for analysis. Mainstream BI products today offer TB-level data analysis solutions. There are too many to list individually.

The massive level refers to data volumes at which databases and BI products fail completely or become cost prohibitive. Many excellent enterprise-grade products handle massive data, but because of hardware and software cost, most Internet companies currently store the data in Hadoop's HDFS distributed file system and analyze it with MapReduce. A multidimensional data analysis platform based on MapReduce on Hadoop is introduced later in this article.
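To make the MapReduce style of analysis more concrete, here is a minimal single-process sketch of its three phases (map, shuffle, reduce) counting page views from log lines. It is plain Python and not the Hadoop API. The log format `user page`, the sample data, and all function names are assumptions made for illustration; on a real cluster the framework would distribute these phases across machines and read the input from HDFS.

```python
from collections import defaultdict


def map_phase(log_line):
    """Emit (page, 1) for one log line of the assumed form 'user page'."""
    _user, page = log_line.split()
    yield page, 1


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(page, counts):
    """Sum the per-line counts for one page."""
    return page, sum(counts)


def run_job(log_lines):
    """Run map, shuffle, and reduce in sequence inside one process."""
    pairs = (pair for line in log_lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


# Made-up log lines for illustration.
logs = ["alice /home", "bob /home", "alice /cart"]
print(run_job(logs))
```

The map and reduce functions never share state, which is the property that lets a framework such as Hadoop run many copies of each in parallel over data too large for a single database.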
