According to an IDC survey, the volume of data stored on electronic devices worldwide will grow roughly 30-fold by 2020, reaching 35ZB (equivalent to about 35 billion 1TB hard disks). The arrival of the big data wave brings new challenges to enterprises. For well-prepared companies it is an information gold mine: the ability to turn big data into valuable information will be an essential skill for the enterprise of the future. Against this backdrop, CSDN conducted a large-scale questionnaire survey of enterprise practitioners and summarized the current state of enterprise data business from thousands of responses. The results of that survey are presented here for your reference.
Data format characteristics in the big data age
First, let's look at the data format characteristics of the big data age. From an IT perspective, information structures have roughly gone through three waves. Note that each new wave does not replace the previous one: all three types of data structure continue to evolve and coexist, but one type usually dominates the others (a short sketch after the list contrasts the three):
Structured information - found in relational databases, this type has dominated IT applications for years. It is the information that mission-critical OLTP systems depend on, and because it is structured in the database it can be sorted and queried;
Semi-structured information - the second wave of IT, including e-mail, word-processing files, and information stored and published on the web. Semi-structured information is organized around content and can therefore be searched, which is the reason Google exists;
Unstructured information - in its essential form this is bit-mapped data. It must be put into a perceptible form (for example, the audio, video, and multimedia files that can be heard or seen). Much big data is unstructured, and its sheer size and complexity require advanced analysis tools to create, or impose, a structure that is easier to perceive and interact with.
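To make the contrast concrete, here is a minimal sketch in Python, with made-up field names and text, showing how the same kind of business record might appear as a structured row, a semi-structured document, and an unstructured note. It is only an illustration of the three waves, not a prescription for any particular storage system:

```python
# A minimal sketch (hypothetical field names) contrasting the three waves of data structure.

import json
import sqlite3

# Structured: a relational row with a fixed schema, easy to sort and query (OLTP style).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INTEGER, customer TEXT, amount REAL)")
db.execute("INSERT INTO orders VALUES (?, ?, ?)", (1, "Alice", 99.5))
print(db.execute("SELECT customer, amount FROM orders WHERE amount > 50").fetchall())

# Semi-structured: a self-describing document; fields can vary, but content is still searchable.
order_doc = json.loads('{"id": 1, "customer": "Alice", "notes": ["gift wrap", "expedite"]}')
print([n for n in order_doc["notes"] if "gift" in n])

# Unstructured: free text (or audio/video bytes); structure must be imposed by analysis tools.
support_email = "Hi, my order arrived late and the gift wrap was damaged..."
print("complaint" if "damaged" in support_email else "ok")
```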
Enterprise infrastructure for big data processing is lagging behind
The survey results show that nearly 50% of enterprises run fewer than 100 servers, those with 100 to 500 servers account for 22%, and those with 500 to 2,000 servers account for the remaining 28.4%. Clearly, most companies have not yet built out the hardware infrastructure needed to face big data. Given the current state of enterprise infrastructure, 50% of enterprises face problems in big data processing (small and medium-sized enterprises tackling big data should follow a collection, import/processing, query, and mining workflow; a minimal sketch of that workflow follows below).
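The collection → import/processing → query → mining workflow mentioned above can be summarized in a small sketch. This is only an illustration with hypothetical file names, fields, and functions; a real deployment would use log collectors, a data warehouse or Hadoop cluster, and proper mining libraries:

```python
# A minimal sketch of a collection -> import/processing -> query -> mining workflow.
# File names, fields, and the sample data are hypothetical.

import csv
import sqlite3
from collections import Counter

def collect(path):
    """Collection: read raw records from a source (here, a local CSV file)."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def load(records, db):
    """Import/processing: clean records and load them into a queryable store."""
    db.execute("CREATE TABLE IF NOT EXISTS events (user TEXT, action TEXT)")
    rows = [(r["user"], r["action"]) for r in records if r.get("user") and r.get("action")]
    db.executemany("INSERT INTO events VALUES (?, ?)", rows)

def query(db):
    """Query: answer a concrete business question with SQL."""
    return db.execute("SELECT action, COUNT(*) FROM events GROUP BY action").fetchall()

def mine(records):
    """Mining: look for simple patterns, here the most active users."""
    return Counter(r["user"] for r in records).most_common(3)

if __name__ == "__main__":
    # Write a tiny sample input so the sketch runs end to end (real data would come
    # from logs, sensors, or application databases).
    with open("events.csv", "w", newline="") as f:
        f.write("user,action\nalice,login\nbob,purchase\nalice,purchase\n")
    db = sqlite3.connect(":memory:")
    records = collect("events.csv")
    load(records, db)
    print(query(db))
    print(mine(records))
```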
But this is only a temporary situation: "cheap" server facilities will gradually be phased out as enterprise businesses develop, and in future hardware selection, multi-core, multi-socket processors, SSDs, and similar equipment will become the preferred choice. Facebook's Open Compute Project has set an example for the industry, applying the open source community model to server hardware and rack design; the PUE of its data centers is also among the best of its competitors.
Among enterprises with big data processing needs, 52.2% generate less than 100GB of data per day, 43.5% generate between 100GB and 50TB per day, and, surprisingly, 4.4% generate more than 50TB per day. As data volumes continue to grow, companies will be forced to expand their infrastructure. Licensing costs will keep rising, while open source technology avoids those ongoing fees. For enterprises that urgently need to change their traditional IT architectures, integrating traditional structured data with unstructured data has become a concern for everyone.
Challenges and problems enterprises face in big data processing
Today's big data exhibits "4V + 1C" characteristics. Variety: data generally includes structured, semi-structured, and unstructured types, and their processing and analysis methods differ; Volume: all kinds of devices produce huge amounts of data, and the petabyte level is the norm; Velocity: fast, timely processing is required; Vitality: analysis and processing models must change quickly as requirements change; Complexity: processing and analysis are very difficult.
As the chart shows, low resource utilization, poor scalability, and overly complex application deployment are the main problems facing enterprise data system architectures today. In fact, the primary requirement for big data infrastructure is to be forward-looking: as data keeps growing, users need to think about what kind of architecture to build at both the hardware and software levels. File systems with high resource utilization, high scalability, and good file storage will be the future trend.
The complexity of application deployment has spawned a new profession: the big data system administrator, who is mainly responsible for keeping the Hadoop cluster running day to day. This includes managing hardware directly or indirectly, for example adding machines so that the cluster continues to run stably, as well as system monitoring and configuration to ensure that Hadoop integrates with other systems.
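As a small illustration of that routine monitoring work, the sketch below polls the NameNode's JMX endpoint for capacity and live/dead DataNode counts. The host, port, and alert threshold are hypothetical, and the metric names follow the FSNamesystemState bean commonly exposed by the NameNode's /jmx servlet, so they should be verified against your own Hadoop version:

```python
# A minimal monitoring sketch: poll the NameNode's JMX endpoint for basic cluster health.
# Host, port, and the 80% threshold are hypothetical; metric names follow the
# FSNamesystemState bean exposed by the NameNode /jmx servlet (verify for your version).

import json
from urllib.request import urlopen

NAMENODE_JMX = ("http://namenode.example.com:9870/jmx"
                "?qry=Hadoop:service=NameNode,name=FSNamesystemState")

def cluster_health():
    with urlopen(NAMENODE_JMX, timeout=10) as resp:
        bean = json.load(resp)["beans"][0]
    used_ratio = bean["CapacityUsed"] / max(bean["CapacityTotal"], 1)
    return {
        "live_datanodes": bean["NumLiveDataNodes"],
        "dead_datanodes": bean["NumDeadDataNodes"],
        "capacity_used_pct": round(100 * used_ratio, 1),
    }

if __name__ == "__main__":
    health = cluster_health()
    print(health)
    # Simple alerting rule: flag dead nodes or storage running out.
    if health["dead_datanodes"] > 0 or health["capacity_used_pct"] > 80:
        print("WARNING: cluster needs attention (add hardware or recover nodes)")
```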
Data in multiple formats, read/write speed (the speed at which data moves between endpoints, processors, and storage), and sheer data volume are the technical challenges enterprises face most urgently in big data processing. As we all know, with the arrival of high-capacity data (terabyte, petabyte, and even exabyte scale), business data poses a greater challenge to IT systems: storing and securing the data, and accessing and using it later, have all become difficult. Big data is not just about volume. It includes more and more data in different formats, and these different formats require different processing methods. The most important job of data mining technology is to make full use of the useful data and discard the false and useless data.
The application of data analysis and mining tools within enterprises
Enterprise data mining in the cloud era faces three challenges. Mining efficiency: after entering the cloud computing era, BI thinking has shifted; mining that used to run on closed, in-house enterprise data now faces the huge volume of heterogeneous data brought in by the Internet, and the efficiency of parallel mining is very low. Multi-source data: with cloud computing, enterprise data may sit on a public cloud provider's platform or on a private cloud built by the enterprise, and handling these different data sources is itself a challenge. Heterogeneous data: the defining feature of web data is that it is semi-structured, such as documents, reports, web pages, audio, images, and video; cloud computing also brings a large number of SaaS applications based on Internet models, and sifting out the useful data is a challenge.
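On the mining-efficiency point, a common first step is to parallelize the work across cores or machines. The sketch below uses Python's multiprocessing module to count terms across several document chunks in parallel and then merges the partial results, a miniature map-and-reduce; the sample documents are made up:

```python
# A minimal sketch of parallel mining: count terms across document chunks in parallel,
# then merge the partial results (a miniature map/reduce). Sample data is made up.

from collections import Counter
from multiprocessing import Pool

def mine_chunk(docs):
    """Map step: term frequencies for one chunk of documents."""
    counts = Counter()
    for doc in docs:
        counts.update(doc.lower().split())
    return counts

def parallel_mine(chunks, workers=4):
    """Run the map step in parallel, then reduce by merging counters."""
    with Pool(workers) as pool:
        partials = pool.map(mine_chunk, chunks)
    total = Counter()
    for c in partials:
        total.update(c)
    return total

if __name__ == "__main__":
    chunks = [
        ["big data brings big challenges", "data mining finds value in data"],
        ["cloud platforms store enterprise data", "parallel mining improves efficiency"],
    ]
    print(parallel_mine(chunks).most_common(5))
```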
Besides price, the survey shows that slow response, inconvenient operation, inaccurate data, and inaccurate analysis are the four main problems enterprises face in data analysis and data mining. Commercial solutions are mature, but their cost is obvious. Data scientists who can process and analyze big data on open source platforms are another option: a data scientist has domain expertise and the ability to apply the right algorithms to the right problems, and can help create the big data products and solutions that drive the business forward.
The survey results show that Hadoop accounts for roughly half of the deployments, HBase, also open source, has a share of nearly one quarter, and commercial data analysis and mining platforms (such as Teradata, Netezza, and Greenplum) together hold only 13.9%. In the short term, open source analytics will be used ever more widely and grow rapidly; in the long run, hybrid approaches will compete in the market and there will be strong demand for both. Predictably, Hadoop, as a core technology of the enterprise data warehouse architecture, will continue to grow over the next 10 years.
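As a minimal taste of Hadoop as an analysis engine, the script below sketches the classic word count using Hadoop Streaming, which lets any executable act as mapper and reducer over standard input/output. The HDFS paths and the streaming jar location in the comment are placeholders that vary by installation:

```python
# wordcount.py - a classic Hadoop Streaming word count sketch.
# Streaming lets any executable act as mapper/reducer over stdin/stdout; the reducer
# receives keys already sorted. Paths and the jar location below are placeholders.
#
# Example submission (adjust paths to your installation):
#   hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
#     -files wordcount.py \
#     -mapper "python3 wordcount.py map" -reducer "python3 wordcount.py reduce" \
#     -input /data/raw/docs -output /data/out/wordcount

import sys

def map_phase():
    # Emit "word<TAB>1" for every word read from stdin.
    for line in sys.stdin:
        for word in line.strip().lower().split():
            print(f"{word}\t1")

def reduce_phase():
    # Input arrives sorted by key, so counts for one word are contiguous.
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = word, 0
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")

if __name__ == "__main__":
    reduce_phase() if len(sys.argv) > 1 and sys.argv[1] == "reduce" else map_phase()
```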
With the arrival of the cloud era, enterprises face more diverse ways of applying these tools. Delivering data mining through the cloud can improve mining efficiency and precision, and makes it easier to extend mining applications and build up industry-specific expertise. Collecting and storing vast amounts of new data is challenging, but new methods of analyzing that data are the tools that help the most successful companies pull away from their competitors.