Data processing is divided into data mining and data analysis. Broadly speaking, "data analysis" can refer to the whole process, and the process is much the same in either case.
Data mining usually consists of three steps: filtering, cleaning, and matching.
1. Filtering: Data that is not suitable for analysis is filtered out, like pulling defective products off a production line. Filtering works at the granularity of a group of data, and its rules can be based on data size or character length.
2. Cleaning: Also known as formatting. The data is split into parts. A record is typically composed of a time, a data source, a data body, and so on, much like a head, a body, and feet. The data is turned into the format we want. This step is also a labeling step, meaning the data is classified as it is processed.
3. Matching: Matching is field extraction, that is, extracting the useful parts of the data (regular expression processing). Because there are too many categories of data to write matching rules for all of them, this step requires automatic machine recognition. Note that the results of machine learning are not fully accurate, so that data is stored separately.
Data mining is the process of formatting unstructured and semi-structured data. In other words, it imposes rules on the data.
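The three steps can be sketched in Python as follows. This is only a minimal illustration, not the implementation behind this article: the line layout ("time | source | body"), the length limits, and the function names filter_lines, clean_line, and match_fields are all assumptions made for the example.

```python
import re

# Hypothetical raw lines in an assumed "<time> | <source> | <body>" layout.
RAW_LINES = [
    "2016-07-09 12:00:00 | serverA | cpu:90% mem:90%",
    "",          # empty line, dropped by the filter
    "garbage",   # too short, dropped by the filter
    "2016-07-09 12:00:00 | app-a | cpu:40% mem:40%",
]

MIN_LEN, MAX_LEN = 20, 500  # assumed character-length rule


def filter_lines(lines):
    """Step 1 (filtering): drop lines whose length is outside the allowed range."""
    return [ln for ln in lines if MIN_LEN <= len(ln) <= MAX_LEN]


def clean_line(line):
    """Step 2 (cleaning): split a line into time, source, and body.

    Returns None when the line does not have exactly three parts.
    """
    parts = [p.strip() for p in line.split("|")]
    if len(parts) != 3:
        return None
    time, source, body = parts
    return {"time": time, "source": source, "body": body}


# Matches "name:NN%" pairs such as "cpu:90%".
FIELD_RE = re.compile(r"(\w+):(\d+)%")


def match_fields(record):
    """Step 3 (matching): extract the useful fields from the body with a regex."""
    fields = {name: int(value) for name, value in FIELD_RE.findall(record["body"])}
    return {**record, "fields": fields}


def mine(lines):
    """Run filtering, cleaning, and matching in order."""
    cleaned = (clean_line(ln) for ln in filter_lines(lines))
    return [match_fields(rec) for rec in cleaned if rec is not None]


records = mine(RAW_LINES)
for rec in records:
    print(rec["time"], rec["source"], rec["fields"])
```

The output of each step is more regular than its input, which is the point: once every record has the same fields, it can be loaded into a table.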
Once data mining is complete, the data analysis phase begins.
Data analysis amounts to SQL aggregation operations. The purpose of formatting the data is to make it processable with SQL. In other words, you can analyze the data however you like, as long as you know how to operate a database.
However, data analysis is multidimensional: by dimension, it is divided into one-dimensional, two-dimensional, and three-dimensional analysis.
One-dimensional analysis is mainly based on table queries: multiple fields, individual fields, TopN, grouping, and other aggregate functions.
Two-dimensional analysis is mainly based on time. Why? Time-based analysis is more complex and is more closely tied to prediction (the prediction must come from the machine, not from what people think).
Three-dimensional analysis is mainly based on objects. That is, the data is modeled. Data modeling is like defining a Java class: virtual entities are constructed, and the analysis is performed on those entities.
Each of these dimensions builds on the previous one.
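As a rough sketch of the first two dimensions, the snippet below uses Python's built-in sqlite3 module. The table name metrics, its columns, and all sample values are made up for illustration. The one-dimensional queries are plain grouping and TopN. The two-dimensional query adds time by bucketing rows per hour, which is the kind of series a prediction step would then consume.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE metrics (source TEXT, ts TEXT, cpu INTEGER, mem INTEGER)")
conn.executemany(
    "INSERT INTO metrics VALUES (?, ?, ?, ?)",
    [
        # Illustrative values only.
        ("app-a", "2016-07-09 12:00:00", 40, 40),
        ("app-a", "2016-07-09 12:30:00", 60, 50),
        ("app-a", "2016-07-09 13:00:00", 80, 60),
        ("app-b", "2016-07-09 12:00:00", 50, 40),
        ("app-b", "2016-07-09 13:00:00", 40, 30),
    ],
)

# One-dimensional: grouping with an aggregate function (average CPU per source).
avg_cpu = conn.execute(
    "SELECT source, AVG(cpu) FROM metrics GROUP BY source ORDER BY source"
).fetchall()

# One-dimensional: TopN (here N = 1), the source with the highest peak memory.
top_mem = conn.execute(
    "SELECT source, MAX(mem) AS peak FROM metrics "
    "GROUP BY source ORDER BY peak DESC LIMIT 1"
).fetchall()

# Two-dimensional: the same aggregate, but bucketed by hour.
hourly_cpu = conn.execute(
    "SELECT strftime('%Y-%m-%d %H:00', ts) AS hour, AVG(cpu) "
    "FROM metrics GROUP BY hour ORDER BY hour"
).fetchall()

print(avg_cpu)
print(top_mem)
print(hourly_cpu)
```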
Is there four-dimensional or five-dimensional analysis? No, there is not. Here is an example from operations and maintenance:
Example: server operating condition
Server A 2016-07-09 12:00:00 cpu:90% mem:90%
Application a 2016-07-09 12:00:00 cpu:40% mem:40% (mem>60% to run properly)
Application b 2016-07-09 12:00:00 cpu:40% mem:40% (mem>30% to run properly)
Server A system 2016-07-09 12:00:00 cpu:10% mem:10%
So application a will not run properly: it has 40% of memory but needs more than 60%, and the server is already at 90%.
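The same example can be expressed as an object model, in the spirit of three-dimensional analysis. This is a hedged sketch: the class names Server and Application, the field mem_required, and the reading of the note in parentheses (the application needs more than that share of memory to run properly) are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Application:
    name: str
    cpu: int           # percent of server CPU in use
    mem: int           # percent of server memory in use
    mem_required: int  # assumed meaning: needs mem > mem_required to run properly

    def runs_properly(self) -> bool:
        return self.mem > self.mem_required


@dataclass
class Server:
    name: str
    cpu: int
    mem: int
    system_cpu: int
    system_mem: int
    apps: list = field(default_factory=list)

    def accounted_mem(self) -> int:
        """Memory used by the system plus all applications."""
        return self.system_mem + sum(app.mem for app in self.apps)

    def unhealthy_apps(self) -> list:
        """Names of applications that have less memory than they need."""
        return [app.name for app in self.apps if not app.runs_properly()]


# Values taken from the example above.
server = Server(
    name="A", cpu=90, mem=90, system_cpu=10, system_mem=10,
    apps=[
        Application("a", cpu=40, mem=40, mem_required=60),
        Application("b", cpu=40, mem=40, mem_required=30),
    ],
)

print(server.accounted_mem())   # 90, which matches the server's own reading
print(server.unhealthy_apps())  # ['a']
```

The analysis here runs against entities (a server that owns applications) rather than against flat rows, which is what distinguishes it from the table queries and time series of the first two dimensions.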
Complete flowchart of the entire data processing pipeline:
Enterprise-Class Big Data processing solution 03-Data Flow