Data Mining in the Big Data Age

Source: Internet
Author: User
Keywords: big data, data mining

In recent years, new forms of information, represented by social networking sites and location-based services, have emerged alongside the rapid development of cloud computing, mobile, and IoT technologies. Ubiquitous mobile devices, wireless sensors, and other equipment generate data at all times, and hundreds of millions of Internet users constantly produce data through their interactions: the big data era has arrived. Big data is now a hot topic; businesses and individuals alike are discussing or engaged in big-data-related work, and the data we create in turn surrounds us. While the market outlook for big data is promising, the public still sees a number of challenges in analyzing, managing, and using it: data volumes have long exceeded the terabyte scale, growth rates are alarming, and much of the data must be handled in real time.

At present, there is no complete consensus on the definition of big data. Wikipedia defines big data as information whose volume is so large that it cannot be acquired, managed, processed, and organized within a reasonable time by current mainstream software tools to serve business decision-making. IDC defines big data as a new generation of architectures and technologies designed to extract value more economically from high-frequency, high-volume data of differing structures and types. All definitions of big data are based on its characteristics, and each is formulated by elaborating and generalizing those characteristics. Across these definitions, the characteristics of big data can be summed up as the four Vs: volume, variety, velocity, and value.

1. Visualization Analysis of Big Data

From initial data integration through data analysis to the final interpretation of results, ease of use should run through the whole process of data analysis. In the big data age, data volumes are large and data structures are diverse, and their complexity has already surpassed what traditional relational databases can handle. In addition, as big data has penetrated every area of people's lives, many industries have begun to increase their demand for it. Ordinary users, however, tend to care most about how results are presented, and the complexity of the data limits their ability to obtain knowledge from big data directly. The visualization of data in big data analysis therefore deserves researchers' attention and further improvement.

(1) Visualization technology. Visualization is one of the most effective means of interpreting large amounts of data: results are presented to the user graphically, and a graphical display is easier to understand and accept than traditional textual output. In data visualization, the results produced by the mining layer of the underlying platform are rendered as images, mapping relationships, or tables and presented to users in simple, friendly, vivid, and intelligent graphical forms for analysis. At present, the most common visualization techniques for big data include the tag cloud, history flow, and spatial information flow. For data reaching PB scale or beyond, traditional charting methods can no longer achieve effective visualization, and scientific computing methods that process massive data quickly and accurately must be introduced. Scientific visualization uses 2D and 3D graphics to render data, providing a more intuitive representation for data analysis and research; it draws on computer graphics, image processing, computer vision, graphical user interfaces, and other fields. As an example of data visualization in practice, eBay, one of the world's largest commercial websites, selected Tableau's visualization software so that all employees can see graphical search links and monitor customer feedback and sentiment analysis for a given period, bringing business insight to eBay.
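
As a minimal sketch of the 2D/3D scientific visualization idea mentioned above, the following Python snippet uses matplotlib (a tool not named in the article; the data and column meanings are hypothetical) to render the same series in a 2D and a 3D view.

```python
# A minimal sketch of 2D/3D data visualization; the data below is invented
# for illustration and the tooling (matplotlib) is an assumption, not the
# software discussed in the article.
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401  (registers the 3D projection)

# Hypothetical data: daily search counts for a product keyword.
days = np.arange(30)
searches = 1000 + 50 * days + np.random.normal(0, 100, size=30)

fig = plt.figure(figsize=(10, 4))

# 2D view: trend over time.
ax1 = fig.add_subplot(1, 2, 1)
ax1.plot(days, searches, marker="o")
ax1.set_xlabel("Day")
ax1.set_ylabel("Searches")
ax1.set_title("2D trend view")

# 3D view: the same series plotted against a third hypothetical dimension.
ax2 = fig.add_subplot(1, 2, 2, projection="3d")
ax2.scatter(days, searches, np.random.rand(30), c=searches, cmap="viridis")
ax2.set_title("3D scatter view")

plt.tight_layout()
plt.show()
```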

(2) Web visualization. With the rapid development of networks and the continuous improvement of network performance, web-based data visualization has become a hotspot. There are already many web charting tools used to display stocks, weather data, and so on. The most widely used technologies are JavaScript, Flash, and Java applets, all of which can render graphics in the browser. For scientific computing data whose volume exceeds what ordinary charts handle, libraries such as EJSChart or JFreeChart can be used; their drawing speed, compatibility, and interactivity are good, making them a reasonable first choice. For building custom drawing tools, JavaScript and Flash can be chosen; both draw quickly, though with less variety. Many browsers, including those on mobile phones and tablets, now support HTML5, and JavaScript is a good choice when better cross-platform compatibility is required.

2. Common Data Mining Methods

Data mining is the most critical work in the big data age. Mining big data is the process of discovering valuable and potentially useful information and knowledge from large-scale databases that are incomplete, noisy, fuzzy, and random; it is also a decision-support process. It is mainly based on artificial intelligence, machine learning, pattern recognition, statistics, and related fields. By automatically analyzing big data and making inductive inferences, it can help enterprises, merchants, and users adjust market policy, reduce risk, face the market rationally, and make correct decisions. At present, data mining addresses many problems, including marketing strategy, background analysis, and enterprise crisis management, in many fields, especially commercial ones such as banking, telecommunications, and electric power. The common methods for mining big data are classification, regression analysis, clustering, association rules, neural networks, and Web data mining; these methods examine data from different angles.

(1) Classification. Classification finds the common characteristics of a set of data objects in a database and assigns them to different classes according to a classification scheme; its purpose is to map the data items in a database onto given categories through a classification model. It can be applied to category prediction and trend forecasting: for example, a Taobao shop can divide users into different classes according to their purchases over a period of time and recommend related categories of goods to each class, thereby increasing the shop's sales.
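
A minimal sketch of classification in Python with scikit-learn follows; the features, labels, and customer tiers are invented for illustration and are not taken from the article.

```python
# Classification sketch: learn a model from labeled examples, then map a new
# data item onto one of the given categories (all data here is hypothetical).
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# Each row: [number of purchases, average order value]; label: customer tier.
X = [[2, 30], [15, 200], [1, 10], [20, 500], [8, 120], [3, 45], [25, 300], [1, 20]]
y = ["casual", "loyal", "casual", "loyal", "loyal", "casual", "loyal", "casual"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = DecisionTreeClassifier(max_depth=3)
model.fit(X_train, y_train)

# Assign a new, unseen customer to one of the learned classes.
print(model.predict([[12, 150]]))
```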

(2) Regression analysis. Regression analysis reflects the attribute values of data in a database and discovers dependency relationships between attribute values through a function that maps the data. It can be applied to the prediction of data series and the study of correlations. In marketing, regression analysis can be applied in many ways; for example, by running a regression on this quarter's sales, one can forecast next-quarter sales trends and make targeted adjustments to marketing.
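
The sales-forecasting idea can be sketched in a few lines with scikit-learn's linear regression; the quarterly figures below are hypothetical.

```python
# Regression sketch: fit past quarterly sales and extrapolate the next quarter
# (all numbers are invented for illustration).
import numpy as np
from sklearn.linear_model import LinearRegression

quarters = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])   # quarter index
sales = np.array([120, 135, 150, 170, 180, 200, 215, 240])       # sales (10k yuan)

model = LinearRegression()
model.fit(quarters, sales)

# Predict the dependent attribute for the next quarter from the fitted function.
next_quarter = np.array([[9]])
print("Forecast for quarter 9:", model.predict(next_quarter)[0])
```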

(3) Clustering. Clustering is similar to classification, but its purpose is to group a set of data into several categories according to the similarities and differences among the data themselves, without predefined classes. The similarity among data belonging to the same category is very high, while the similarity between data in different categories is very low, and the correlation between classes is weak.
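
A minimal clustering sketch with scikit-learn's k-means is shown below; the customer measurements are hypothetical.

```python
# Clustering sketch: group customers by similarity without predefined labels
# (data is invented for illustration).
import numpy as np
from sklearn.cluster import KMeans

# Each row: [annual spend, visits per month] for one customer.
X = np.array([
    [200, 1], [220, 2], [250, 1],        # low-spend, infrequent
    [1500, 8], [1600, 10], [1400, 9],    # high-spend, frequent
    [700, 4], [750, 5], [680, 4],        # middle group
])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print(labels)                   # cluster assignment for each customer
print(kmeans.cluster_centers_)  # centroids: high similarity within a cluster
```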

(4) Association rules. Association rules capture the associations or interrelationships hidden between data items, that is, the appearance of one data item allows the appearance of other data items to be inferred. Mining association rules mainly involves two stages: the first finds all high-frequency itemsets in the massive raw data; the second generates association rules from those high-frequency itemsets. Association rule mining has been widely applied in the financial industry to predict customer needs; for example, banks bundle information that customers may be interested in on their ATMs, so that users see and obtain relevant information, improving the banks' marketing.
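
The two stages can be sketched in plain Python without a mining library: first count frequent item pairs (the high-frequency itemsets), then keep the rules whose confidence is high enough. The transactions and thresholds are hypothetical.

```python
# Association-rule sketch: frequent pairs, then rules A -> B with support
# and confidence (transactions and thresholds are invented for illustration).
from itertools import combinations
from collections import Counter

transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "bread", "butter"},
    {"milk", "butter"},
]

n = len(transactions)
item_counts = Counter(item for t in transactions for item in t)
pair_counts = Counter(pair for t in transactions for pair in combinations(sorted(t), 2))

min_support, min_confidence = 0.4, 0.6
for (a, b), count in pair_counts.items():
    support = count / n                       # stage 1: keep high-frequency pairs
    if support < min_support:
        continue
    for antecedent, consequent in ((a, b), (b, a)):
        # stage 2: confidence(A -> B) = support(A, B) / support(A)
        confidence = count / item_counts[antecedent]
        if confidence >= min_confidence:
            print(f"{antecedent} -> {consequent}  "
                  f"support={support:.2f}  confidence={confidence:.2f}")
```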

(5) Neural network method. As an advanced artificial intelligence technique, the neural network is well suited to dealing with nonlinear problems and with fuzzy, incomplete, or imprecise knowledge or data; its self-organizing processing, distributed storage, and high fault tolerance make it a good fit for data mining problems. Typical neural network models fall into three main categories. The first is the feedforward network model used for classification, prediction, and pattern recognition, represented mainly by function networks and the perceptron. The second is the feedback network model used for associative memory and optimization, represented by the Hopfield discrete and continuous models. The third is the self-organizing mapping method used for clustering, represented by the ART model. Although there are many neural network models and algorithms, there is no uniform rule for choosing among them for data mining in a specific field, and the learning and decision-making processes of a network are difficult to interpret.
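
The first category, a feedforward network used for classification, can be sketched with scikit-learn's multilayer perceptron; the customer features and labels are hypothetical.

```python
# Feedforward neural-network sketch (MLP) for classification; data and the
# network size are invented for illustration, not tuned for a real task.
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

# Features: [age, monthly spend]; label: 1 = responded to a promotion, 0 = did not.
X = [[22, 300], [45, 1200], [35, 800], [52, 1500],
     [28, 400], [60, 2000], [25, 350], [40, 900]]
y = [0, 1, 0, 1, 0, 1, 0, 1]

# Scale the inputs so the network trains stably.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

clf = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0)
clf.fit(X_scaled, y)

print(clf.predict(scaler.transform([[33, 700]])))
```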

(6) Web data mining. Web data mining is a comprehensive technique that discovers implicit patterns P from the Web document structure and usage collection C. If C is taken as the input and P as the output, the Web mining process can be seen as a mapping from input to output. The process is shown in the figure:

Figure 1 Web Data Mining flowchart

At present, more and more Web data arrive in the form of data streams, so mining Web data streams is very important. The commonly used Web data mining algorithms are the PageRank algorithm, the HITS algorithm, and the LOGSOM algorithm. All three treat users as generic users and do not differentiate between them. Web data mining currently faces several open problems, including how to classify users, the timeliness of website content, the time users spend on a page, and the links on a page and their number. With Web technology developing rapidly, these problems still deserve study and resolution.
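
As an illustration of the first of these algorithms, the following is a minimal PageRank sketch in plain Python over a tiny hypothetical link graph; real systems use far larger graphs and sparse or distributed implementations.

```python
# PageRank sketch: iteratively propagate rank over a link graph
# (the link structure below is invented for illustration).
def pagerank(graph, damping=0.85, iterations=50):
    nodes = list(graph)
    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        new_rank = {}
        for node in nodes:
            # Sum the rank contributed by every page that links to this node.
            incoming = sum(rank[src] / len(graph[src])
                           for src in nodes if node in graph[src])
            new_rank[node] = (1 - damping) / len(nodes) + damping * incoming
        rank = new_rank
    return rank

# Hypothetical link structure: page -> set of pages it links to.
links = {
    "A": {"B", "C"},
    "B": {"C"},
    "C": {"A"},
    "D": {"C"},
}
print(pagerank(links))
```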

3. Data Analysis Technology

Analysis is the core of big data processing. Traditional data analysis is aimed mainly at structured data, and the general process is as follows: first, a database stores the structured data; then a data warehouse is built; then the corresponding cubes are constructed and online analytical processing is performed as needed. This process is efficient for relatively small volumes of structured data. For big data, however, analysis technology faces three immediate problems: large data volumes, multiple data formats, and analysis speed, which standard storage technology cannot accommodate, so a more suitable analysis platform must be introduced. At present, the open-source Hadoop is a widely used data processing technology and the core technology for analyzing and processing big data.

Hadoop is a Java-based software framework for distributed data processing and analysis. Users can develop distributed programs without understanding the low-level details of distribution and take full advantage of a cluster's power for high-speed computation and storage. Its basic working principle is to decompose large-scale data into small, easily accessible batches that are distributed to multiple servers for analysis. It mainly comprises two functional modules: the distributed file system (HDFS) and the data processing engine (MapReduce). At the bottom, HDFS stores the files on all the storage nodes in the Hadoop cluster; above HDFS sits the MapReduce engine, which consists of a JobTracker and TaskTrackers. The structure is shown in the figure:

  

Figure 2 Hadoop composition architecture diagram

3.1 HDFS

Hadoop is designed to run on clusters of commodity hardware. Commodity hardware is not low-end hardware; its failure rate is much lower than that of low-end hardware. Hadoop does not need to run on expensive, highly reliable machines: even in a large cluster where node failure is probable, HDFS can keep running through a failure without noticeable disruption to the user. This design reduces machine maintenance costs, especially when users manage hundreds or even thousands of machines.

Hadoop is designed around an efficient write-once, read-many access pattern. Each analysis involves the whole dataset, and the high data throughput comes at the cost of high latency; for low-latency data access, HBase is a better choice. HDFS uses a master/slave architecture: an HDFS cluster consists of one NameNode (master) and multiple DataNodes (slaves). The NameNode is a central server responsible for managing the HDFS namespace and maintaining HDFS files and directories; this information is persisted to local disk as a namespace image file and an edit log file. It also records which DataNodes hold the blocks of each file, but it does not permanently save block locations, because the DataNodes re-report them when the system starts up. The NameNode is also responsible for controlling access by external clients.

A DataNode is an HDFS worker node, typically one machine in the cluster, responsible for managing the storage attached to that node. DataNodes store and retrieve data blocks at the request of clients or the NameNode, execute commands to create, delete, and replicate blocks, and periodically send the NameNode lists of the blocks they store, from which the NameNode verifies the block mappings and file system metadata.

3.2 MapReduce

MapReduce is a software framework for processing big data. Its core design idea is to split a problem into chunks and to push computation to the data rather than data to the computation. The simplest MapReduce application consists of at least three parts: a map function, a reduce function, and a main function. The model is relatively simple: the user's raw data is split into blocks, which are handed to different map tasks to process and output intermediate results; the reduce function reads the intermediate lists, sorts and merges the data, and outputs the final result, as sketched below.
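
The following Python sketch mirrors this map / shuffle / reduce flow on a classic word count; here the phases run in a single process purely for illustration, whereas Hadoop would distribute the map and reduce tasks across the cluster (for example via Hadoop Streaming with separate mapper and reducer scripts). The input blocks are hypothetical.

```python
# MapReduce-style word count, run in-process for illustration only.
from collections import defaultdict

def map_function(chunk):
    """Map phase: emit (word, 1) pairs for one block of raw input."""
    return [(word, 1) for word in chunk.split()]

def reduce_function(word, counts):
    """Reduce phase: aggregate the intermediate values for one key."""
    return word, sum(counts)

# Hypothetical input split into blocks, as Hadoop would split an HDFS file.
blocks = ["big data mining", "data analysis and data mining", "big data"]

# Map phase: each block is processed independently.
intermediate = defaultdict(list)
for block in blocks:
    for word, count in map_function(block):
        intermediate[word].append(count)   # shuffle: group values by key

# Reduce phase: sort keys and combine grouped values into final results.
for word in sorted(intermediate):
    print(reduce_function(word, intermediate[word]))
```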

3.3 Advantages and Problems of Hadoop

Hadoop is a software framework capable of processing large amounts of data in a distributed fashion, and it does so reliably, efficiently, and scalably. It is reliable because it assumes that compute elements and storage will fail, so it maintains multiple copies of the working data and redistributes work away from failed nodes. It is efficient because it works in parallel, speeding up processing. It is scalable because it can handle petabyte-scale data.

But like other emerging technologies, Hadoop also faces problems that need to be addressed. (1) At present, Hadoop lacks enterprise-grade data protection: developers must set the HDFS replication parameters manually, and relying on developers to choose them is likely to waste storage space. (2) Hadoop requires investment in a dedicated compute cluster, which usually results in isolated storage and compute resources, problems with storage or CPU utilization, and compatibility issues when that storage is shared with other programs.

4. Predictive Analysis Capability

Data mining enables users to understand data better, and the predictive analysis of big data allows users to make forward-looking judgments based on visual analysis and data mining results.

Compared with traditional data analysis, one important goal of big data analysis is to discover the rules hidden in massive data and let the database deliver its greatest value. The value lies not so much in the data itself as in the knowledge hidden in the relationships among the data. For example, the channels and interfaces through which enterprises and customers interact are increasingly rich, and they carry a large amount of interaction information and data between customers and the enterprise, customers and products, and customers and brands. If these data can be consolidated, organizations will have more opportunities to understand existing users accurately and to identify potential user groups.

To make full use of the value of big data, forecasts are made on the basis of visualization analysis and data mining results. In the big data age, predictive analysis gives enterprises insight into their customers and a more comprehensive, in-depth grasp of customer needs, preferences, consumption tendencies, and consumer psychology, helping enterprises improve their operational management and performance.

5. Conclusion

As data has exploded, we are surrounded by data of all kinds. Used correctly, big data brings people great convenience, but at the same time it poses technical challenges to traditional data analysis. This paper has analyzed the key technologies of big data analysis in detail, focusing on visualization technology, mining technology, and analysis technology. Overall, although we have entered the big data age, big data technology is still in its initial stage, and further developing and improving big data analysis technology remains a hot research topic.
