What is structured data? What is semi-structured data?

Source: Internet
Author: User
Keywords: unstructured data, semi-structured data

Structured data is data stored in a database that can be expressed logically as a two-dimensional table. By contrast, data that is not convenient to represent in a database's two-dimensional logical tables is called unstructured data. It includes Office documents of all formats, text, pictures, XML, HTML, reports of various kinds, images, and audio/video information.

An unstructured database is a database whose fields have variable length and in which each field of a record can be made up of repeatable or non-repeatable subfields. It can handle structured data (such as numbers and symbols) and is even better suited to unstructured data (full text, images, sound, film, hypermedia, and so on).

An unstructured Web database is designed mainly for unstructured data. Its biggest difference from the relational databases that were popular before it is that it removes two restrictions of the relational model: a structure definition that is hard to change, and fixed-length data. It supports repeating fields, subfields, and variable-length fields, and it implements variable-length storage management for variable-length and repeating fields and data items. This gives it advantages over traditional relational databases in processing continuous information (including full-text information) and unstructured information (including multimedia information of all kinds).

Structured data: data stored in a database that can be logically expressed in a two-dimensional table structure.

Unstructured data: Office documents of all formats, text, pictures, XML, HTML, various reports, images, audio/video information, and so on.

Semi-structured data lies between fully structured data (such as data in relational or object-oriented databases) and completely unstructured data (such as sound and image files). HTML documents are an example of semi-structured data. It is generally self-describing: the structure and the content of the data are mixed together, with no clear separation between them.

Data Model:

Structured data: Two-dimensional tables (relational)

Semi-structured data: trees, graphs

Unstructured data: None

DBMS data models include the network (mesh) data model, the hierarchical data model, and the relational data model.
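As a rough illustration of the three categories listed above, the sketch below holds the same hypothetical employee as a relational row and as a tree, next to an opaque blob. The field names, sample values, and the `depth` helper are assumptions made for this example and do not come from the original text.

```python
# Structured: a fixed list of columns; every row has exactly the same shape.
COLUMNS = ("work_number", "name", "sex", "birth_date")
row = ("E001", "Li Lei", "M", "1990-05-01")

# Semi-structured: a tree whose nodes carry their own labels.
tree = {
    "work_number": "E001",
    "name": "Li Lei",
    "resume": {
        "education": [{"school": "Example University", "degree": "BSc"}],
        "skills": ["SQL", "XML"],
    },
}

# Unstructured: opaque bytes with no data model at all.
blob = b"\x89PNG placeholder bytes"


def depth(node):
    """Return the nesting depth of a tree built from dicts and lists."""
    if isinstance(node, dict):
        return 1 + max((depth(v) for v in node.values()), default=0)
    if isinstance(node, list):
        return 1 + max((depth(v) for v in node), default=0)
    return 0


# A relational row is flat once paired with its column names.
record = dict(zip(COLUMNS, row))
```

Here `depth(record)` is 1, because a table row is flat, while `depth(tree)` is 4, because the resume nests a list of dictionaries inside a dictionary.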

Other:

Structured data: structure first, then data

Semi-structured data: data first, then structure
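The difference in ordering can be sketched with Python's standard `sqlite3` and `json` modules. The table name, the sample documents, and the `discovered_fields` helper are hypothetical and serve only to illustrate the idea.

```python
import json
import sqlite3

# Structure first: the table must exist before any row can be stored.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staff (work_number TEXT PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO staff VALUES (?, ?)", ("E001", "Li Lei"))

# Data first: documents arrive as they are, and the structure is
# discovered afterwards by inspecting them.
documents = [
    json.loads('{"name": "Li Lei", "education": "BSc"}'),
    json.loads('{"name": "Han Meimei", "skills": ["SQL"], "married": true}'),
]


def discovered_fields(docs):
    """Collect every field name that appears in at least one document."""
    fields = set()
    for doc in docs:
        fields.update(doc.keys())
    return sorted(fields)


# The relational structure is known up front and can be read from the schema.
table_columns = [info[1] for info in conn.execute("PRAGMA table_info(staff)")]
```

In the relational case the columns are fixed by the `CREATE TABLE` statement; in the document case the set of fields is only known after reading the data, and it can grow with every new document.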

With the development of network technology, especially the rapid development of Internet and intranet technology, the volume of unstructured data keeps increasing. As a result, the limitations of relational databases, which are primarily used to manage structured data, are becoming more apparent. Database technology has therefore entered the era of the "post-relational database" and is developing toward unstructured databases built for network applications.

In China, the unstructured database is represented by the IBase database of Beijing National Faith Bass (ibase) Software Co., Ltd. IBase is an unstructured database oriented to end users. It is described as being at an internationally advanced level in handling unstructured information, full-text information, multimedia information, and massive volumes of information, as well as in intranet applications, and as having achieved breakthroughs in unstructured data management and full-text retrieval. Its main advantages are the following:

(1) Internet applications involve a large number of complex data types. Through its external file data type, IBase can manage many kinds of document and multimedia information, and it provides powerful full-text search for documents that are meaningful information resources, such as HTML, DOC, RTF, and TXT files.

(2) It uses subfields, multivalued fields, and variable-length fields to allow the creation of many different types of unstructured or free-form fields. This breaks through the very strict table structure of relational databases, so that unstructured data can be stored and managed.

(3) IBase defines both unstructured and structured data as resources, so that the basic elements of the unstructured database are the resources themselves, and a resource in the database can contain both structured and unstructured information. An unstructured database can therefore store and manage all kinds of unstructured data, shifting the database system from data management to content management.

(4) IBase is built on an object-oriented foundation that ties enterprise business data closely to business logic, which makes it especially suitable for expressing complex data objects and multimedia objects.

(5) IBase is a database designed for the needs of Internet development. Based on the idea that the Web is a massive wide-area-network database, it provides an online resource management system, IBase WEB, which integrates the network server (WebServer) and the database server (DB Server) directly into a single whole. This makes the database system and database technology an organic part of the Web, breaks through the limitation that a database serves only as the back end of a Web system, and achieves a seamless combination of database and Web. It thereby opens up a broader field for Internet/intranet information management and even e-commerce applications.

(6) IBase is fully compatible with a variety of large and medium-sized databases and provides import and link support for traditional relational databases such as Oracle, Sybase, SQL Server, DB2, and Informix.
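The subfield, multivalued-field, and variable-length-field mechanism described in point (2) can be pictured roughly as follows. This is a hypothetical in-memory sketch in Python; the `Record` class and its methods are invented for illustration and are not IBase's actual API or storage format.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union

# A field value is either plain text or a group of named subfields.
Value = Union[str, Dict[str, str]]


@dataclass
class Record:
    """A record whose fields may repeat and have no fixed length."""

    fields: Dict[str, List[Value]] = field(default_factory=dict)

    def add(self, name: str, value: Value) -> None:
        # Repeating a field name appends another value (multivalued field).
        self.fields.setdefault(name, []).append(value)

    def values(self, name: str) -> List[Value]:
        # A field that was never added simply has no values.
        return self.fields.get(name, [])


doc = Record()
doc.add("title", "Quarterly report")
doc.add("author", {"name": "Li Lei", "dept": "Finance"})    # subfields
doc.add("author", {"name": "Han Meimei", "dept": "Sales"})  # repeated field
doc.add("body", "Full text of any length ...")              # variable length
```

Unlike a relational row, this record needs no schema change to gain a second author or a longer body, which is the flexibility the list above attributes to unstructured databases.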

From the above analysis, we can predict that, with the rapid development of network technology and network application technology, the unstructured database based on Internet applications will become the next key and hot technology after the hierarchical database, the network database, and the relational database.

Data classification

Semi-structured data (semi-structured)

Designing an information system always involves storing data, and we generally keep the system's information in a designated relational database. We classify the data by business area, design a table for each class, and save each piece of information to its table. For example, if a business system has to maintain basic staff information (work number, name, sex, date of birth, and so on), we create a corresponding staff table.

But not all the information in a system maps so simply onto the fields of a table.

Structured data

This is the kind of data in the example above. It is the easiest category to handle: simply creating a corresponding table is enough.
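A minimal sketch of such a staff table, using Python's standard `sqlite3` module, might look like this. The column names, types, and sample rows are assumptions based on the fields named in the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE staff (
        work_number TEXT PRIMARY KEY,
        name        TEXT NOT NULL,
        sex         TEXT,
        birth_date  TEXT  -- ISO 8601 date string, e.g. 1990-05-01
    )
    """
)
conn.executemany(
    "INSERT INTO staff VALUES (?, ?, ?, ?)",
    [
        ("E001", "Li Lei", "M", "1990-05-01"),
        ("E002", "Han Meimei", "F", "1992-11-23"),
    ],
)

# Every row has the same columns, so queries and statistics are simple.
born_since_1992 = conn.execute(
    "SELECT name FROM staff WHERE birth_date >= '1992-01-01'"
).fetchall()
```

Because ISO 8601 date strings sort in chronological order, the plain string comparison in the query selects the staff born in 1992 or later.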

Unstructured data

Examples are pictures, sounds, and videos. We usually cannot read the content of this kind of information directly, and the database can only save it in a BLOB field, which makes later retrieval very troublesome. A common practice is to create a table with three fields: an ID (number), a content description (varchar(1024)), and the content (blob). Records are referenced by ID and retrieved by content description. There are also many tools for processing unstructured data; the content managers common on the market are one kind.
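The three-field table described above can be sketched with `sqlite3` as follows. The table name `media`, the sample rows, and the `find_by_description` helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE media (
        id          INTEGER PRIMARY KEY,
        description VARCHAR(1024),
        content     BLOB
    )
    """
)

# The database cannot look inside the bytes; only the description is searchable.
conn.execute(
    "INSERT INTO media (id, description, content) VALUES (?, ?, ?)",
    (1, "Team photo, annual meeting", b"\x89PNG placeholder bytes"),
)
conn.execute(
    "INSERT INTO media (id, description, content) VALUES (?, ?, ?)",
    (2, "Recording of product demo", b"RIFF placeholder bytes"),
)


def find_by_description(keyword):
    """Return (id, content) pairs whose description contains the keyword."""
    cursor = conn.execute(
        "SELECT id, content FROM media WHERE description LIKE ?",
        (f"%{keyword}%",),
    )
    return cursor.fetchall()
```

Retrieval quality depends entirely on how well the description was written, which is why the text calls searching this kind of data troublesome.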

Semi-structured data

This kind of data differs from both categories above: it is structured, but its structure varies a great deal. Because we need to understand the details of the data, we cannot simply pack it into a file and treat it as unstructured data; and because the structure varies so much, we cannot simply build one table that corresponds to it. This article mainly discusses two methods of storing semi-structured data.

First, consider an example of semi-structured data: storing an employee's resume. Unlike basic employee information, resumes are not uniform, and each employee's resume is quite different. Some employees have simple resumes that cover education only, while others have complex resumes that include work history, marital status, immigration, migration, party membership, technical skills, and so on. There may also be information that we did not anticipate. Keeping all of this information intact is not easy, because we do not want the table structure to change while the system is in operation.

Storage mode

Dissolve into structured data

This method usually starts with a rough survey of the information in existing resumes, summarizing all the categories of information they contain and deciding which information the system really cares about. A child table is then created for each category. In the example above, we could create an education table, a work table, a party membership table, and so on, and add a memo field to the main table to hold, as notes, information the system does not care about and information that was not anticipated at the start.

Advantage: queries and statistics are more convenient.

Disadvantages: it cannot adapt as the data expands, the extended information cannot be searched, and information that the system cares about but that was not considered during the project design phase cannot be stored well.
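A minimal sketch of this approach, assuming hypothetical table and column names, shows both the advantage and the disadvantage: anticipated categories get their own child tables and are easy to aggregate, while everything else falls into a free-text memo.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE staff (
        work_number TEXT PRIMARY KEY,
        name        TEXT NOT NULL,
        memo        TEXT  -- free text for anything the child tables do not cover
    );
    CREATE TABLE education (
        work_number TEXT REFERENCES staff(work_number),
        school      TEXT,
        degree      TEXT
    );
    CREATE TABLE work_history (
        work_number TEXT REFERENCES staff(work_number),
        employer    TEXT,
        start_year  INTEGER
    );
    """
)
conn.execute("INSERT INTO staff VALUES ('E001', 'Li Lei', 'Speaks some French')")
conn.execute("INSERT INTO education VALUES ('E001', 'Example University', 'BSc')")
conn.execute("INSERT INTO work_history VALUES ('E001', 'Example Corp', 2015)")
conn.execute("INSERT INTO work_history VALUES ('E001', 'Sample Ltd', 2019)")

# Anticipated categories are easy to count and join ...
jobs_per_employee = conn.execute(
    "SELECT s.name, COUNT(w.employer) FROM staff s "
    "LEFT JOIN work_history w ON w.work_number = s.work_number "
    "GROUP BY s.work_number"
).fetchall()

# ... but unanticipated details end up in the memo, searchable only as text.
memo_hits = conn.execute(
    "SELECT name FROM staff WHERE memo LIKE '%French%'"
).fetchall()
```

Adding a new category later, such as language skills, would require either a new table (a schema change on a running system) or more free text in the memo.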

Organize and save in XML format to CLOB field

XML is probably the best fit for storing semi-structured data: different categories of information are simply kept in different nodes of the XML document.

Advantage: it is flexible to extend. Extending the information only requires changing the corresponding DTD or XSD.

Disadvantage: query efficiency is low, because queries and statistics have to be done with XPath. As database support for XML improves, the performance problem can be well resolved.
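The XML approach can be sketched with `sqlite3` and `xml.etree.ElementTree` from the Python standard library. The element names, the sample resumes, and the `staff_with_skill` helper are assumptions for illustration; a database with native XML support would evaluate the XPath expression on the server instead.

```python
import sqlite3
import xml.etree.ElementTree as ET

conn = sqlite3.connect(":memory:")
# SQLite keeps large text in a TEXT column, which stands in for a CLOB here.
conn.execute("CREATE TABLE resume (work_number TEXT PRIMARY KEY, xml_doc TEXT)")

RESUMES = {
    "E001": "<resume><education><degree>BSc</degree></education></resume>",
    "E002": (
        "<resume>"
        "<education><degree>MSc</degree></education>"
        "<skills><skill>SQL</skill><skill>XML</skill></skills>"
        "<party joined='2010'/>"
        "</resume>"
    ),
}
conn.executemany("INSERT INTO resume VALUES (?, ?)", RESUMES.items())


def staff_with_skill(skill):
    """Scan every stored document and apply an XPath-style query to it."""
    matches = []
    for work_number, xml_doc in conn.execute(
        "SELECT work_number, xml_doc FROM resume ORDER BY work_number"
    ):
        root = ET.fromstring(xml_doc)
        # ElementTree supports a limited XPath subset, including [.='text'].
        if root.findall(f".//skill[.='{skill}']"):
            matches.append(work_number)
    return matches
```

The second resume carries nodes the first one lacks, and no table change was needed to store it. The cost is visible in `staff_with_skill`: every document is fetched and parsed for each query, which is the low query efficiency noted above.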
