Daily Knowledge (1): Semi-Structured Data

Source: Internet
Author: User

Semi-structured data has more structure than plain text, but it is more flexible than data in a relational database, which follows a strict theoretical model. OEM (Object Exchange Model) is a typical semi-structured data model.

Designing an information system always involves data storage. Generally, we store system information in a relational database: we classify the data by business area, design a table for each class, and save the information in the matching table. For example, if we build a business system and want to save basic information about employees (employee ID, name, gender, and date of birth), we create a corresponding staff table.
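As a minimal sketch of the staff table described above, the following uses Python's built-in sqlite3 module with an in-memory database. The table name, column names, and sample row are assumptions chosen for illustration; the article does not name a specific database or schema.

```python
import sqlite3

# In-memory database, used here only for illustration.
conn = sqlite3.connect(":memory:")

# Hypothetical schema for the four fields named in the text.
conn.execute(
    """
    CREATE TABLE staff (
        employee_id   INTEGER PRIMARY KEY,
        name          TEXT NOT NULL,
        gender        TEXT,
        date_of_birth TEXT  -- stored as an ISO 8601 date string
    )
    """
)

# Every employee fits the same fixed set of columns.
conn.execute(
    "INSERT INTO staff (employee_id, name, gender, date_of_birth) "
    "VALUES (?, ?, ?, ?)",
    (1, "Alice", "F", "1990-04-12"),
)

row = conn.execute(
    "SELECT name, date_of_birth FROM staff WHERE employee_id = ?", (1,)
).fetchone()
print(row)
```

Because each record has the same fields, one table definition is enough and queries stay simple.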

However, not all information in the system can be easily matched with fields in a table. I divide the data into three types by form:

1. Structured data, as in the example above. Data of this type is the easiest to handle: simply create a corresponding table.

2. Unstructured data, such as images, sounds, and videos. The content of this kind of information usually cannot be interpreted directly, and the database can only store it in a BLOB field, which makes later retrieval troublesome. A common approach is to create a table with three fields: number, content description varchar(1024), and content blob. Records are referenced by number and searched by content description. Many tools exist for processing unstructured data; the content managers commonly found on the market are one kind.

3. Semi-structured data. Such data differs from the two types above. It is structured, but its structure varies greatly. Because we need to work with the details of the data, we cannot simply pack it into a file and treat it as unstructured data. Because its structure varies so much, we also cannot simply create one table to match it. This article mainly discusses two common methods for storing semi-structured data.
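The three-field table mentioned for unstructured data in item 2 can be sketched roughly as follows, again with Python's sqlite3 module. The table name media, the helper functions search_media and load_media, and the sample bytes are hypothetical and only illustrate the "reference by number, search by description" idea.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Three fields as described in item 2: number, description, content.
conn.execute(
    """
    CREATE TABLE media (
        id          INTEGER PRIMARY KEY,
        description VARCHAR(1024),
        content     BLOB
    )
    """
)

# Placeholder bytes standing in for real image or audio data.
fake_image = bytes([0x89, 0x50, 0x4E, 0x47])
conn.execute(
    "INSERT INTO media (id, description, content) VALUES (?, ?, ?)",
    (1, "team photo, 2023 offsite", fake_image),
)
conn.execute(
    "INSERT INTO media (id, description, content) VALUES (?, ?, ?)",
    (2, "recorded welcome speech", b"\x00\x01"),
)


def search_media(conn, keyword):
    """Search by description, since the BLOB itself cannot be queried."""
    rows = conn.execute(
        "SELECT id FROM media WHERE description LIKE ? ORDER BY id",
        (f"%{keyword}%",),
    ).fetchall()
    return [r[0] for r in rows]


def load_media(conn, media_id):
    """Fetch the raw content by its number."""
    row = conn.execute(
        "SELECT content FROM media WHERE id = ?", (media_id,)
    ).fetchone()
    return row[0] if row else None


matching_ids = search_media(conn, "photo")
payload = load_media(conn, matching_ids[0])
```

The search only ever touches the description column, which is why retrieval quality depends entirely on how well each item was described when it was stored.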

Here is an example of semi-structured data: storing employee resumes. Unlike basic employee information, resumes differ greatly from one employee to the next. Some employees have very simple resumes that include, for example, only educational information. Others have complicated resumes covering work history, marital status, entry and exit records, account migration, Party membership, and technical skills. There may also be information nobody anticipated. It is usually hard to store all of this completely, because we do not want the table structure to change while the system is running. I usually use one of two methods to store this type of data.
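To make the variation concrete, here is a small sketch with two resumes represented as Python dictionaries. The field names and values are invented for illustration and are not taken from any real system.

```python
# A very simple resume: education only.
resume_a = {
    "education": [{"school": "Example University", "degree": "BSc"}],
}

# A more complicated resume with several extra categories.
resume_b = {
    "education": [{"school": "Example College", "degree": "Diploma"}],
    "work_history": [{"employer": "Example Corp", "title": "Engineer"}],
    "marital_status": "married",
    "skills": ["SQL", "Python"],
}

# Categories both resumes share, and categories only one of them has.
shared = sorted(set(resume_a) & set(resume_b))
only_in_b = sorted(set(resume_b) - set(resume_a))

print(shared)
print(only_in_b)
```

A single fixed table would need a column for every category that any resume might contain, which is exactly what makes this data awkward to store.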

Method 1: convert the data into structured data. With this method, we first make a rough survey of the information found in existing resumes, summarize all the categories of information they contain, and decide which information the system really cares about. We then create a sub-table for each category. In the example above, we could create a sub-table for education, a sub-table for work history, and a sub-table for Party membership, and add a remarks field to the main table to hold information that the system does not care about or that was not anticipated.

Advantage: queries and statistics are convenient.

Disadvantages: it cannot adapt to new kinds of data, and the extra information cannot be searched. Information that the system does care about, but that was overlooked during the project design phase, cannot be stored properly either.
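Method 1 might look roughly like the sketch below, using Python's sqlite3 module. The table and column names are assumptions for illustration; only the idea of one sub-table per category plus a remarks field on the main table comes from the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Main table with a remarks field, plus one sub-table per category.
conn.executescript(
    """
    CREATE TABLE staff (
        employee_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL,
        remarks     TEXT
    );
    CREATE TABLE education (
        employee_id INTEGER REFERENCES staff(employee_id),
        school      TEXT,
        degree      TEXT
    );
    CREATE TABLE work_history (
        employee_id INTEGER REFERENCES staff(employee_id),
        employer    TEXT,
        title       TEXT
    );
    """
)

# Anything that has no sub-table ends up as free text in remarks.
conn.execute("INSERT INTO staff VALUES (1, 'Alice', 'Holds a forklift licence')")
conn.execute("INSERT INTO staff VALUES (2, 'Bob', NULL)")
conn.executemany(
    "INSERT INTO education VALUES (?, ?, ?)",
    [(1, "Example University", "BSc"), (2, "Example College", "Diploma")],
)
conn.execute("INSERT INTO work_history VALUES (1, 'Example Corp', 'Engineer')")

# Statistics on an anticipated category are a plain SQL query.
degree_counts = conn.execute(
    "SELECT degree, COUNT(*) FROM education GROUP BY degree ORDER BY degree"
).fetchall()
print(degree_counts)
```

Counting degrees is easy because education has its own table, while the licence stored in remarks is just free text with no structure to query against.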

Method 2: organize the data as XML and save it in a CLOB field. XML is probably the format best suited to storing semi-structured data. Different types of information are saved under different nodes in the XML document.

Advantage: it can be extended flexibly. To add a new type of information, you only need to change the corresponding DTD or XSD.

Disadvantage: query efficiency is relatively low, because queries and statistics require XPath. This performance problem is expected to ease as databases improve their XML support.
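A rough sketch of Method 2 follows, using Python's sqlite3 and xml.etree.ElementTree modules. SQLite has no dedicated CLOB type, so a TEXT column stands in for it here; the XML layout, table name, and the helper employees_with_skill are assumptions for illustration. ElementTree supports only a limited subset of XPath, which is enough for this example.

```python
import sqlite3
import xml.etree.ElementTree as ET

conn = sqlite3.connect(":memory:")

# A TEXT column stands in for the CLOB field described in the text.
conn.execute(
    "CREATE TABLE resume (employee_id INTEGER PRIMARY KEY, resume_xml TEXT)"
)

# Each resume carries only the nodes that apply to that employee.
resumes = [
    (1, "<resume><education><degree>BSc</degree></education></resume>"),
    (
        2,
        "<resume>"
        "<education><degree>MSc</degree></education>"
        "<skills><skill>SQL</skill><skill>Python</skill></skills>"
        "</resume>",
    ),
]
conn.executemany("INSERT INTO resume VALUES (?, ?)", resumes)


def employees_with_skill(conn, skill):
    """Return IDs of employees whose resume XML lists the given skill."""
    matches = []
    cursor = conn.execute(
        "SELECT employee_id, resume_xml FROM resume ORDER BY employee_id"
    )
    for employee_id, xml_text in cursor:
        root = ET.fromstring(xml_text)
        # XPath-style lookup; resumes without a skills node yield no matches.
        skills = [node.text for node in root.findall("./skills/skill")]
        if skill in skills:
            matches.append(employee_id)
    return matches


sql_people = employees_with_skill(conn, "SQL")
print(sql_people)
```

Adding a new category means adding a new node to some resumes, with no change to the table. The cost is visible in the helper: every row's XML has to be parsed before it can be filtered, which illustrates the lower query efficiency noted above.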
