How to Analyze and Process Big Data

Source: Internet
Author: User
Keywords: big data, analyze big data, process big data, mining
1. Visual analysis
Big data analysis serves both specialist analysts and ordinary users, and the most basic requirement the two groups share is visual analysis: visualization presents the characteristics of big data intuitively and is easy for readers to accept.

2. Data mining algorithms

Data mining algorithms are the theoretical core of big data analysis. Different algorithms are designed for different data types and formats, so that they can bring out the characteristics of the data itself more rigorously.
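As one illustration of a mining algorithm matched to a data type, here is a minimal sketch of k-means clustering for two-dimensional numeric points. The article does not name a specific algorithm, so the choice of k-means, the function names, and the sample points are all assumptions made for the example.

```python
import random


def nearest(point, centroids):
    """Return the index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: (point[0] - centroids[i][0]) ** 2
        + (point[1] - centroids[i][1]) ** 2,
    )


def kmeans(points, k, max_iterations=50, seed=0):
    """Cluster 2-D points into k groups with plain Lloyd iterations."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct input points
    for _ in range(max_iterations):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        # Update step: move each centroid to the mean of its members.
        updated = []
        for i, members in enumerate(clusters):
            if members:
                updated.append((
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                ))
            else:
                updated.append(centroids[i])  # keep an empty cluster's centroid
        if updated == centroids:  # no centroid moved: converged
            break
        centroids = updated
    labels = [nearest(p, centroids) for p in points]
    return centroids, labels


# Hypothetical sample data: two visibly separate groups of points.
points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9), (8.5, 9.5)]
centroids, labels = kmeans(points, k=2)
```

The sketch only handles numeric coordinates; text or categorical data would need a different distance measure or a different algorithm, which is the point the paragraph above makes.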


3. Predictive analysis 

Predictive analysis is one of the end applications of big data analysis: features are mined from big data, a model is built from them, and new data is then fed into the model to predict future values.
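The build-a-model-then-predict cycle can be sketched with the simplest possible model, an ordinary least-squares line through one feature. The model choice, the function names `fit_line` and `predict`, and the sample numbers are assumptions for illustration; real predictive pipelines usually involve many features and a more capable model.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx  # assumes the xs are not all identical
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Apply a fitted (slope, intercept) model to a new input."""
    slope, intercept = model
    return slope * x + intercept


# Hypothetical history: the feature value and the outcome observed for it.
history_x = [1, 2, 3, 4]
history_y = [3, 5, 7, 9]
model = fit_line(history_x, history_y)
forecast = predict(model, 5)  # bring new data into the model
```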


4. Semantic engine
The growing variety of unstructured data brings new challenges for data analysis, and a set of tools is needed to analyze and refine such data. A semantic engine needs to be designed with enough artificial intelligence to extract information from the data actively.

5. Data quality and data management

Big data analysis is inseparable from data quality and data management. Whether in academic research or in commercial applications, high-quality data and effective data management help ensure that analysis results are accurate and valuable.
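One way to make "data quality" concrete is to measure it before analysis starts. The sketch below counts incomplete and duplicate records in a batch; the function name `quality_report`, the field names, and the sample records are hypothetical, and a production system would track many more checks than these two.

```python
def quality_report(records, required_fields, key_field):
    """Count incomplete and duplicate records in a list of dicts."""
    total = len(records)
    incomplete = 0
    duplicates = 0
    seen_keys = set()
    for rec in records:
        # A record is incomplete if any required field is missing or empty.
        if any(rec.get(field) in (None, "") for field in required_fields):
            incomplete += 1
        # A record is a duplicate if its key value was already seen.
        key = rec.get(key_field)
        if key is not None:
            if key in seen_keys:
                duplicates += 1
            else:
                seen_keys.add(key)
    completeness = (total - incomplete) / total if total else 1.0
    return {
        "total": total,
        "incomplete": incomplete,
        "duplicates": duplicates,
        "completeness": completeness,
    }


# Hypothetical sample batch with one empty email, one missing email, one repeated id.
records = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": ""},
    {"id": 1, "email": "a@example.com"},
    {"id": 3, "email": None},
]
report = quality_report(records, required_fields=("id", "email"), key_field="id")
```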

Big data processing I: Collection

Big data collection means using multiple databases to receive data sent from clients (web, app, sensors, and so on); users can also run simple queries and processing against these databases. The main characteristic and challenge of collection is high concurrency, because thousands of users may be accessing and operating on the system at the same time.
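The concurrency pattern described here can be simulated in a few lines: many clients write at once, and each write is routed to one of several front-end stores. In this sketch thread-safe in-memory queues stand in for the databases, and the routing rule (client id modulo shard count), the function name, and all sizes are assumptions for illustration.

```python
import queue
from concurrent.futures import ThreadPoolExecutor


def simulate_collection(num_clients=200, events_per_client=5, num_shards=4):
    """Simulate concurrent clients writing events to several front-end stores."""
    # Each Queue is a stand-in for one front-end database.
    shards = [queue.Queue() for _ in range(num_shards)]

    def client(client_id):
        # Route every client to a fixed shard so the load spreads out.
        shard = shards[client_id % num_shards]
        for seq in range(events_per_client):
            shard.put((client_id, seq))

    # Run the clients concurrently; leaving the block waits for all of them.
    with ThreadPoolExecutor(max_workers=32) as pool:
        list(pool.map(client, range(num_clients)))

    return [shard.qsize() for shard in shards]


events_per_shard = simulate_collection()
```

Spreading clients across several stores is what keeps any single store from becoming the bottleneck when the number of simultaneous writers grows.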


Big data processing II: Import / Preprocessing

Although the collection side already has many databases, analyzing this massive data effectively still requires importing it from the front end into a centralized large-scale distributed database or distributed storage cluster; some simple cleaning and preprocessing can be done during the import. The main characteristic and challenge of import and preprocessing is the sheer volume of imported data, which often reaches the order of 100 megabytes, or even gigabytes, per second.
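A minimal sketch of import with light cleaning follows. An in-memory SQLite database stands in for the centralized store, which in practice would be a distributed system; the table layout, the cleaning rules, and the function names are assumptions for the example. Rows are written in batches because per-row commits would not keep up with high import rates.

```python
import sqlite3


def batched(rows, size):
    """Yield lists of at most `size` rows from any iterable."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def import_events(conn, raw_rows, batch_size=1000):
    """Clean (user_id, amount) rows and bulk-insert them in batches."""
    conn.execute("CREATE TABLE IF NOT EXISTS events (user_id TEXT, amount REAL)")
    imported = 0
    for batch in batched(raw_rows, batch_size):
        cleaned = []
        for user_id, amount in batch:
            user_id = (user_id or "").strip()  # normalize whitespace, treat None as empty
            try:
                value = float(amount)
            except (TypeError, ValueError):
                continue  # drop rows whose amount is not numeric
            if user_id:  # drop rows without a user id
                cleaned.append((user_id, value))
        with conn:  # one transaction per batch
            conn.executemany("INSERT INTO events VALUES (?, ?)", cleaned)
        imported += len(cleaned)
    return imported


# Hypothetical raw rows as they might arrive from the front-end databases.
raw = [(" u1 ", "3.5"), ("u2", "bad"), ("", "1"), ("u3", 2), (None, "4")]
conn = sqlite3.connect(":memory:")
count = import_events(conn, raw, batch_size=2)
```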

Big data processing III: Statistics / Analysis 

Statistics and analysis mainly use distributed databases or distributed computing clusters to run ordinary analysis, classification, and aggregation over the massive data stored in them, which covers most common analysis needs. Workloads with real-time requirements may use EMC's Greenplum, Oracle's Exadata, or the MySQL-based column store Infobright, while batch workloads or those involving semi-structured data can use Hadoop. The main characteristic and challenge of this stage is the large volume of data involved, which consumes system resources heavily, especially I/O.
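Classification and aggregation over partitioned data generally follows a two-step shape: summarize each partition on its own, then merge the partial results. The sketch below shows that shape in plain Python on a single machine; it does not use the API of any of the products named above, and the function names and sample data are assumptions.

```python
from collections import defaultdict


def partial_summary(partition):
    """Summarize one partition: category -> [count, sum]."""
    acc = defaultdict(lambda: [0, 0.0])
    for category, amount in partition:
        acc[category][0] += 1
        acc[category][1] += amount
    return acc


def merge_summaries(partials):
    """Combine per-partition summaries into overall count, sum, and mean."""
    total = defaultdict(lambda: [0, 0.0])
    for part in partials:
        for category, (count, amount_sum) in part.items():
            total[category][0] += count
            total[category][1] += amount_sum
    return {
        category: {"count": c, "sum": s, "mean": s / c}
        for category, (c, s) in total.items()
    }


# Hypothetical data already split across two storage nodes.
partitions = [
    [("books", 10.0), ("toys", 5.0)],
    [("books", 20.0)],
]
summary = merge_summaries(partial_summary(p) for p in partitions)
```

Only the small per-partition summaries need to move between nodes, which matters when, as noted above, I/O is the resource under most pressure.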


Big data processing IV: Mining

Mining mainly runs various algorithms over existing data to make predictions, meeting more advanced analysis needs. A typical tool is Mahout on Hadoop. The main characteristics and challenges of this stage are that mining algorithms are complex and that both the data volume and the amount of computation are large. Commonly used data mining algorithms are also mainly single-threaded.
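To give a sense of why mining is computation-heavy, here is a minimal single-threaded sketch of one classic mining task, counting item pairs that frequently occur together. The task choice, the function name `frequent_pairs`, and the sample baskets are assumptions for illustration, and the code does not use Mahout's API. Each basket contributes a number of pairs that grows quadratically with its size, so the work grows quickly with data volume.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(transactions, min_support):
    """Return item pairs that appear together in at least min_support transactions."""
    counts = Counter()
    for basket in transactions:
        # Sort and de-duplicate so each pair has one canonical form per basket.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}


# Hypothetical shopping baskets.
transactions = [
    ["milk", "bread"],
    ["milk", "bread", "eggs"],
    ["bread", "eggs"],
    ["milk", "eggs"],
    ["bread", "milk"],
]
pairs = frequent_pairs(transactions, min_support=3)
```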

