By clearly defining the concepts related to big data, enterprises can plan their data systems correctly and position both traditional technologies and new technical methods appropriately.
With the rapid development of IT and the constant emergence of new technologies, the industry has become confused about many basic concepts. The same is true of big data, today's most popular field. The terms "structured data" and "unstructured data" are cited frequently, yet different parties often mean different things by them. This conceptual confusion makes it much harder for enterprises to plan their data systems clearly and correctly. Drawing on practical work, the author attempts to give clear definitions of some key data-related concepts, each with a brief analysis.
I. Classification by data format
Structured data
Definition: At present, this term refers specifically to relational-model data, that is, data managed in the form of relational database tables. Most enterprise business data is stored in this format.
Analysis: From a strictly professional point of view, equating "structured" with the relational model is not accurate. Given the current state of the industry, however, it is the most suitable definition, because it clearly and unambiguously denotes the business data that is traditionally most familiar to us.
Semi-structured data
Definition: Semi-structured and unstructured data are often mentioned together, and both terms are in practice used for everything that is not structured data. To be more precise, "semi-structured data" can be defined as data that does not follow the relational model but does have a basically fixed structural pattern, such as application log files, XML documents, JSON documents, and e-mail.
Analysis: From a professional point of view, both the structured and the semi-structured data described above are structured data. It is nevertheless advisable to keep this definition, since abandoning it would cause even greater confusion.
Unstructured data
Definition: All data other than structured and semi-structured data, that is, data without a fixed structural pattern, such as Word, PDF, PPT, and Excel documents, as well as images and videos in various formats.
Analysis: The distinction between semi-structured and unstructured data matters because enterprises currently process the two differently (in storage, access, and analysis). Most unstructured data is handled by content management methods, whereas there is as yet no effective management method for semi-structured data.
The distinction between structured, semi-structured, and unstructured data is simply a classification by data format, and it has a long history. Strictly speaking, both structured and semi-structured data have a basically fixed structural pattern, so both are structured data in the professional sense. In current industry usage, however, only relational-model data is called structured data. This usage suits enterprise data management and has real practical value.
In addition, semi-structured and unstructured data merely overlap with the currently popular notion of big data; in essence the two are not necessarily related. The industry tends to identify big data with semi-structured and unstructured data simply because big data technologies first proved their worth on semi-structured data. The mistake here is to confuse a data processing technology with a data format.
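The three format categories defined in this section can be sketched as a small classifier. This is only an illustration: the function name `classify_by_format`, the `is_relational_table` flag, and the specific file extensions are assumptions chosen to match the examples given above (log, XML, JSON, and e-mail files as semi-structured; office documents, images, and videos as unstructured).

```python
from pathlib import PurePath

# Hypothetical extension lists, chosen to mirror the article's examples.
SEMI_STRUCTURED_EXTENSIONS = {".log", ".xml", ".json", ".eml"}
UNSTRUCTURED_EXTENSIONS = {
    ".doc", ".docx", ".pdf", ".ppt", ".pptx", ".xls", ".xlsx",
    ".jpg", ".png", ".mp4",
}


def classify_by_format(name, is_relational_table=False):
    """Return the format category of a data asset.

    Relational tables are "structured" regardless of their name;
    everything else is classified by file extension.
    """
    if is_relational_table:
        return "structured"
    suffix = PurePath(name).suffix.lower()
    if suffix in SEMI_STRUCTURED_EXTENSIONS:
        return "semi-structured"
    if suffix in UNSTRUCTURED_EXTENSIONS:
        return "unstructured"
    return "unknown"


categories = {
    "orders": classify_by_format("orders", is_relational_table=True),
    "server.log": classify_by_format("server.log"),
    "report.PDF": classify_by_format("report.PDF"),
}
```

Note that the classifier looks only at format. Nothing in it says which processing technology handles the data, which is exactly the separation the paragraph above argues for.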
II. Classification by data processing technology
Big data (technology)
Definition: Big data is a concept that has arisen in recent years. The industry generally defines it by the "4 Vs": large volume, high velocity of change, a wide variety of types, and low value density. In fact, the term should properly refer to big data technologies, meaning new, low-cost techniques for processing massive data that differ from the SQL system, rather than to a data format or anything else.
Analysis: The industry's definition of big data is the most confused of all, and several misconceptions are common. One equates big data with semi-structured and unstructured data, whereas big data technologies merely played their first role with semi-structured data and have since spread to data of many structures. Another equates big data with Hadoop. Hadoop has indeed played a huge role in driving the big data wave, but many companies also use other big data methods to store and analyze certain business data efficiently.
Other questions remain. How should data managed by content management methods be classified? Some say big data is data with Internet characteristics; does that mean traditional enterprises have no big data? Some say big data simply means a large amount of data, but "large" has never been defined. On careful analysis, it is most accurate to define big data as a data processing technology. Apart from the SQL system and content management technology, the range of big data technologies is already rich. In addition, big data technology must emphasize low cost.
Relational database technology
Definition: This refers to the SQL processing system, which corresponds to the relational model in the classification by data format.
Analysis: Relational database technology remains the core of enterprise data management, and how big data technology should be positioned relative to it needs further consideration and study.
Content management technology
Definition: This mainly refers to the methods by which an enterprise organizes, manages, and accesses unstructured data, and also some structured data, according to its "content" characteristics. After relational database technology, it is currently the most commonly used set of technical methods and tools.
Analysis: Content management is currently the most important means by which enterprises handle unstructured data. By contrast, enterprises still have no effective way to manage and use semi-structured data.
Other technologies
Definition: An enterprise may also adopt other data management technologies, such as low-cost distributed file systems, MySQL federation, massive in-memory data management technology, and new technologies positioned between Hadoop and SQL systems (for historical data management). All of these fall within the scope of big data technology.
The MySQL federation used in the Internet industry stores structured, relational-model data, but it is not equivalent to a distributed relational database: it gives up the global consistency and integrity guarantees of a full relational database in exchange for better scalability. It therefore also belongs to big data technology.
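The trade-off described above can be illustrated with a toy federation. This is a minimal sketch under stated assumptions, not how any particular deployment works: in-memory SQLite databases stand in for separate MySQL servers, and the `ShardedUserStore` class, the `users` table, and the modulo routing rule are all hypothetical. Each shard enforces its own `UNIQUE` constraint, but no shard can see the others, so uniqueness is not guaranteed globally.

```python
import sqlite3


class ShardedUserStore:
    """Toy federation: independent databases, rows routed by user_id."""

    def __init__(self, shard_count):
        # Each in-memory SQLite database stands in for one MySQL server.
        self.shards = [sqlite3.connect(":memory:") for _ in range(shard_count)]
        for db in self.shards:
            db.execute(
                "CREATE TABLE users (user_id INTEGER PRIMARY KEY, email TEXT UNIQUE)"
            )

    def shard_for(self, user_id):
        # Simple modulo routing (an assumption made for illustration).
        return self.shards[user_id % len(self.shards)]

    def insert(self, user_id, email):
        db = self.shard_for(user_id)
        db.execute("INSERT INTO users VALUES (?, ?)", (user_id, email))
        db.commit()

    def count_email(self, email):
        # A global question requires querying every shard and merging results.
        total = 0
        for db in self.shards:
            row = db.execute(
                "SELECT COUNT(*) FROM users WHERE email = ?", (email,)
            ).fetchone()
            total += row[0]
        return total


store = ShardedUserStore(shard_count=2)
store.insert(1, "a@example.com")  # routed to shard index 1
store.insert(2, "a@example.com")  # routed to shard index 0, so no conflict is detected

duplicates = store.count_email("a@example.com")  # 2: the constraint held only per shard
```

Adding capacity here means adding shards, which is where the scalability comes from. The cost is that constraints and queries spanning shards become the application's responsibility.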
In addition, distributed file systems give enterprises a new way to manage large numbers of small files, and so also belong to big data technology. Low-cost massive in-memory data management technology allows the capacity of transaction systems to be upgraded further at low cost, and likewise belongs to big data technology. Big data technology is thus oriented toward structured, semi-structured, and unstructured data alike, not just semi-structured data.
Big data therefore refers to technical methods rather than to a data format: specifically, the emerging data management technologies other than the SQL system and content management technology. There is, however, no standard for how "big" big data must be. Among big data technologies, Hadoop is the most important but not the only one. Others include NoSQL, distributed file systems, MySQL relational database federation, and massive in-memory data management technology.
Among the data types that big data technology can handle, semi-structured data, which enterprises had not previously managed or processed, was merely where the technology was first applied. In fact, it can handle all data types: structured, semi-structured, and unstructured.
It must also be made clear that big data technology has to be low-cost; otherwise it has no foothold. At the same time, relational database technology remains the core of enterprise data management, and content management is currently the main means of organizing, storing, and accessing unstructured data. If big data technology is introduced to handle unstructured data, then apart from lower cost the reason should be data analysis that content management technology does not yet cover, such as image and video analysis. For banks and similar industries, however, this is probably a relatively distant need.
Enterprise IT personnel should define the above concepts clearly. Data types can be divided into three kinds: "structured", "semi-structured", and "unstructured". Data processing methods can be divided into "relational database technology", "big data technology", "content management technology", and others. With the concepts clearly defined, enterprises can not only plan their data systems correctly but also position traditional and new technical methods properly.
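As a planning aid, the two classifications can be held side by side in a small lookup table. The sketch below is only one possible reading of the article: the names `DataAsset`, `TECHNOLOGIES_BY_FORMAT`, and `candidate_technologies` are hypothetical. The first entry for each format is the technology the article treats as the current mainstay for that format, and big data technology appears for every format because the article argues it can handle all three.

```python
from dataclasses import dataclass

# Format -> candidate processing technologies, mainstay first.
TECHNOLOGIES_BY_FORMAT = {
    "structured": ["relational database technology", "big data technology"],
    "semi-structured": ["big data technology"],
    "unstructured": ["content management technology", "big data technology"],
}


@dataclass(frozen=True)
class DataAsset:
    name: str
    data_format: str  # "structured", "semi-structured", or "unstructured"


def candidate_technologies(asset):
    """Return the processing technologies to consider for an asset."""
    if asset.data_format not in TECHNOLOGIES_BY_FORMAT:
        raise ValueError(f"unknown data format: {asset.data_format!r}")
    # Return a copy so callers cannot modify the shared table.
    return list(TECHNOLOGIES_BY_FORMAT[asset.data_format])


inventory = [
    DataAsset("core banking tables", "structured"),
    DataAsset("application logs", "semi-structured"),
    DataAsset("scanned contracts", "unstructured"),
]
plan = {asset.name: candidate_technologies(asset) for asset in inventory}
```

Keeping the format of an asset separate from the technology chosen for it makes the positioning decision explicit instead of leaving it implied by a label such as "big data".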
(Responsible editor: The good of the Legacy)