What is structured data? What is semi-structured data?

Last Update:2016-07-24 Source: Internet

Author: User

Overview

Structured data is data stored in a database that can be logically expressed as a two-dimensional table. By contrast, data that cannot conveniently be represented by a two-dimensional logical table in a database is called unstructured data. It includes Office documents of all formats, text, images, XML, HTML, various reports, and audio/video information.

An unstructured database is a database whose field lengths are variable and in which each field can be made up of repeatable or non-repeatable sub-fields. It can handle structured data (such as numbers and symbols), and it is better suited to processing unstructured data (full text, images, sound, video, hypermedia, and so on).

An unstructured Web database is designed mainly for unstructured data. Its biggest difference from the relational databases that were popular before it is that it overcomes two relational limitations: structure definitions that are hard to change, and fixed data lengths. It supports repeating fields, sub-fields, and variable-length fields, so it can store and manage variable-length data and repeating data items. This gives it advantages that traditional relational databases cannot match when processing continuous information (including full-text information) and unstructured information (including all kinds of multimedia information).
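To make repeating fields, sub-fields, and variable-length fields concrete, here is a minimal Python sketch of such a record. The `Field` and `Record` classes and all field names are hypothetical and chosen only for illustration; they do not reflect the API or storage format of any particular unstructured database.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Field:
    """One occurrence of a field: a value of any length plus optional sub-fields."""
    value: str = ""
    subfields: Dict[str, str] = field(default_factory=dict)


@dataclass
class Record:
    """A record in which any field may occur zero, one, or many times."""
    fields: Dict[str, List[Field]] = field(default_factory=dict)

    def add(self, name: str, value: str = "", **subfields: str) -> None:
        # Appending instead of overwriting is what makes a field repeatable.
        self.fields.setdefault(name, []).append(Field(value, dict(subfields)))

    def values(self, name: str) -> List[str]:
        return [f.value for f in self.fields.get(name, [])]


doc = Record()
doc.add("title", "Quarterly report")
doc.add("author", "Li Wei", department="Sales")       # field with a sub-field
doc.add("author", "Zhang Min", department="Finance")  # the same field, repeated
doc.add("body", "Full text of any length goes here ...")

print(doc.values("author"))  # ['Li Wei', 'Zhang Min']
```

A fixed relational row would need either a second table or a schema change to hold the second author. Here the record simply grows.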

Structured data: row data stored in a database that can be logically expressed with a two-dimensional table structure.

Unstructured data: Office documents of all formats, text, images, XML, HTML, various reports, audio/video information, and more.

Semi-structured data lies between fully structured data (such as the data in relational or object-oriented databases) and completely unstructured data (such as sound and image files). HTML documents are an example of semi-structured data. Such data is generally self-describing: its structure and content are mixed together, with no clear separation between them.

Data Model:

Structured data: Two-dimensional table (relational)
Semi-structured data: trees, graphs
Unstructured data: None

DBMS data models include the network (mesh) data model, the hierarchical data model, and the relational data model.

Other:

Structured data: First structure, then data
Semi-structured data: First data, then structure
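The difference between "first structure, then data" and "first data, then structure" can be sketched with Python's standard `sqlite3` and `json` modules. The table, column, and key names below are assumptions made up for this illustration.

```python
import json
import sqlite3

# Structure first: the table schema must exist before any row is stored.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staff (work_no INTEGER PRIMARY KEY, name TEXT, gender TEXT)")
conn.execute("INSERT INTO staff VALUES (?, ?, ?)", (1001, "Li Wei", "F"))

try:
    # A column that the schema does not define is rejected.
    conn.execute("INSERT INTO staff (work_no, hobby) VALUES (?, ?)", (1002, "chess"))
    rejected = False
except sqlite3.OperationalError:
    rejected = True

# Data first: every record describes itself, and the overall structure is
# discovered afterwards by inspecting the data.
raw = (
    '[{"name": "Li Wei", "education": ["BSc"]},'
    ' {"name": "Zhang Min", "skills": {"python": 5}}]'
)
records = json.loads(raw)
discovered_keys = sorted({key for rec in records for key in rec})

print(rejected)         # True
print(discovered_keys)  # ['education', 'name', 'skills']
```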

With the development of network technology, and especially the rapid growth of Internet and intranet technology, the volume of unstructured data keeps increasing. As a result, the limitations of relational databases, which are mainly used to manage structured data, have become more and more apparent. Database technology has therefore entered the "post-relational database era" and is moving toward an era of unstructured databases built for network applications.

In China, unstructured databases are represented by the iBase database from Beijing IBase Software Co., Ltd. iBase is an end-user-oriented unstructured database. It is at an internationally advanced level in handling unstructured information, full-text information, multimedia information, and mass information, and in Internet/intranet applications, and it has achieved breakthroughs in unstructured data management and full-text retrieval. Its main advantages are as follows:

(1) Internet applications involve a large number of complex data types. Through its external file data type, iBase can manage all kinds of document and multimedia information. It also provides powerful full-text retrieval for document resources whose content is worth searching, such as HTML, DOC, RTF, and TXT files.

(2) It uses sub-fields, multivalued fields, and variable-length fields, which allow many different types of unstructured or arbitrary-format fields to be created. This breaks through the very strict table structure of relational databases and allows unstructured data to be stored and managed.

(3) iBase defines both unstructured and structured data as resources, so the basic element of the unstructured database is the resource itself, and a resource in the database can contain both structured and unstructured information. The unstructured database can therefore store and manage a wide variety of unstructured data, which shifts the database system from data management to content management.

(4) iBase is built on an object-oriented foundation that tightly combines enterprise business data with business logic, and it is particularly suited to representing complex data objects and multimedia objects.

(5) iBase is a database adapted to the needs of Internet development. Based on the idea that the Web is one massive wide-area database, it provides an online resource management system, iBase Web, which integrates the network server (WebServer) and the database server directly into a single whole. This makes the database system and database technology an important part of the Web, moves the database beyond its role as a mere back end of a Web system, and achieves a seamless combination of database and Web. It thereby opens up a broader field for information management and even e-commerce applications on the Internet and intranets.

(6) iBase is fully compatible with various large and medium-sized databases and provides import and link support for traditional relational databases such as Oracle, Sybase, SQL Server, DB2, and Informix.

From the above analysis we can predict that, with the rapid development of network technology and network applications, unstructured databases built entirely for Internet applications will become the next focus and hot technology after hierarchical, network (mesh), and relational databases.

Data classification: semi-structured data

Designing an information system inevitably involves storing data, and in general we keep the system's information in a designated relational database. We categorize the data by business area, design the appropriate tables, and then save each piece of information to its table. For example, when building a business system that must keep basic employee information (work number, name, gender, date of birth, and so on), we create a corresponding staff table.

But not all the information in a system maps this simply onto the fields of a table.

Structured data

This is the kind of data in the example above. It is the easiest type to handle: simply creating a corresponding table is enough.

Unstructured data

Examples are pictures, sounds, and videos. We usually cannot read the content of this kind of information directly, so the database can only store it in a BLOB field, which makes later retrieval very troublesome. A common practice is to create a table with three fields: a number (number), a content description (varchar(1024)), and the content itself (blob). The content is referenced by its number and retrieved through its description. There are now many tools for processing unstructured data, and the content managers commonly found on the market are one kind.
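The three-field table described above can be sketched with Python's standard `sqlite3` module. The table name `media`, the helper functions, and the placeholder byte strings are assumptions for illustration. A real system would store actual file contents and would likely use a dedicated full-text index rather than `LIKE`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE media (
        id          INTEGER PRIMARY KEY,  -- reference number
        description VARCHAR(1024),        -- searchable text about the content
        content     BLOB                  -- raw bytes, opaque to the database
    )
    """
)


def store(description: str, content: bytes) -> int:
    """Save the bytes with a description and return the reference number."""
    cur = conn.execute(
        "INSERT INTO media (description, content) VALUES (?, ?)",
        (description, content),
    )
    return cur.lastrowid


def search(keyword: str) -> list:
    """Find reference numbers by description; the BLOB itself is never searched."""
    rows = conn.execute(
        "SELECT id FROM media WHERE description LIKE ? ORDER BY id",
        (f"%{keyword}%",),
    )
    return [row[0] for row in rows]


def load(media_id: int) -> bytes:
    """Fetch the stored bytes by reference number."""
    row = conn.execute("SELECT content FROM media WHERE id = ?", (media_id,)).fetchone()
    return row[0]


photo_id = store("Team photo, annual meeting", b"\x89PNG placeholder bytes")
audio_id = store("Recording of the annual meeting speech", b"RIFF placeholder bytes")

print(search("photo") == [photo_id])  # True
```

Retrieval quality depends entirely on how well the description was written, which is why this approach is troublesome for later searches.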

Semi-structured data

This kind of data differs from both categories above: it is structured data, but its structure varies greatly. Because we need to understand the details of the data, we cannot simply treat it as unstructured data and save it as a file. And because the structure varies so much, we cannot simply create a table to match it. This article discusses two common methods for storing semi-structured data.

Let's start with an example of semi-structured data: storing employees' CVs. Unlike basic employee information, CVs vary greatly from one employee to another. Some employees have a simple CV that covers only education, while others have a very complex one that covers work history, marital status, immigration status, migration status, membership status, technical skills, and so on. There may also be information that we did not anticipate. Keeping all of this information intact is not easy, because we do not want the table structure to change while the system is running.

Storage method: convert the data into structured data

This approach usually starts with a rough survey of the information in existing CVs: summarize all the categories of information they contain and identify the information the system really cares about. Then create a sub-table for each category. In the previous example, we could set up an education sub-table, a work history sub-table, a party membership sub-table, and so on, and add a memo field to the main table. Information the system does not care about, and information that was not anticipated, is stored in that memo field.

Advantage: Queries and statistics are convenient.

Disadvantages: It cannot adapt as the data expands, and the extended information cannot be searched. Information that was not considered during the design phase, but that the system later turns out to care about, is not handled well.
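A minimal sketch of the sub-table approach, again using Python's standard `sqlite3` module. The table and column names (`staff`, `education`, `work_history`, `memo`) and the sample rows are hypothetical and stand in for whatever categories a real CV survey would produce.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE staff (
        work_no INTEGER PRIMARY KEY,
        name    TEXT NOT NULL,
        memo    TEXT              -- catch-all for CV details no sub-table covers
    );
    CREATE TABLE education (
        work_no INTEGER REFERENCES staff(work_no),
        degree  TEXT,
        school  TEXT
    );
    CREATE TABLE work_history (
        work_no  INTEGER REFERENCES staff(work_no),
        employer TEXT,
        years    INTEGER
    );
    """
)
conn.execute("INSERT INTO staff VALUES (1001, 'Li Wei', 'Volunteers at a local library')")
conn.execute("INSERT INTO staff VALUES (1002, 'Zhang Min', NULL)")
conn.executemany(
    "INSERT INTO education VALUES (?, ?, ?)",
    [(1001, "BSc", "School A"), (1001, "MSc", "School B"), (1002, "BSc", "School A")],
)
conn.execute("INSERT INTO work_history VALUES (1001, 'Company X', 3)")

# Statistics over an anticipated category are a plain SQL query.
degree_counts = conn.execute(
    "SELECT degree, COUNT(*) FROM education GROUP BY degree ORDER BY degree"
).fetchall()
print(degree_counts)  # [('BSc', 2), ('MSc', 1)]
```

The volunteering detail shows the drawback: it did not fit any sub-table, so it ended up as free text in `memo`, where it can only be matched by string search and cannot take part in statistics.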

Storage method: organize the data as XML and save it in a CLOB field

XML is probably the best fit for storing semi-structured data. Different categories of information can be kept in different nodes of the XML document.

Advantage: It scales flexibly. Information can be extended simply by changing the corresponding DTD or XSD.

Disadvantage: Query efficiency is relatively low, because queries and statistics must be done with XPath. As database support for XML improves, this performance problem is expected to be resolved.
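The XML approach can be sketched as follows with Python's standard `sqlite3` and `xml.etree.ElementTree` modules. SQLite has no CLOB type, so a TEXT column stands in for it here. The element names (`cv`, `education`, `degree`, and so on) are invented for this example, and ElementTree supports only a limited subset of XPath.

```python
import sqlite3
import xml.etree.ElementTree as ET

conn = sqlite3.connect(":memory:")
# A TEXT column plays the role of the CLOB field in this sketch.
conn.execute("CREATE TABLE staff_cv (work_no INTEGER PRIMARY KEY, cv_xml TEXT)")

cv_1001 = """
<cv>
  <education><degree>BSc</degree><degree>MSc</degree></education>
  <skills><skill level="5">Python</skill></skills>
</cv>
"""
cv_1002 = """
<cv>
  <education><degree>BSc</degree></education>
  <marital_status>married</marital_status>
</cv>
"""
conn.executemany(
    "INSERT INTO staff_cv VALUES (?, ?)", [(1001, cv_1001), (1002, cv_1002)]
)


def staff_with_degree(degree: str) -> list:
    """Return work numbers whose CV lists the given degree."""
    matches = []
    query = "SELECT work_no, cv_xml FROM staff_cv ORDER BY work_no"
    for work_no, cv_xml in conn.execute(query):
        # Every stored document is parsed on every query: this is the
        # efficiency cost of keeping XML in an opaque text column.
        root = ET.fromstring(cv_xml.strip())
        if any(node.text == degree for node in root.findall("./education/degree")):
            matches.append(work_no)
    return matches


print(staff_with_degree("MSc"))  # [1001]
```

The second CV carries a `marital_status` node that the first lacks, and no table definition had to change to store it.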

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
