Analysis of file storage and search technology

Zhu Ligu, China Communications University

With the continuous development of information technology, the demand for storing all kinds of information, such as text files, web pages, emails, music, and movies, has exploded. Storage systems have become more capable in terms of reliability and performance. However, as storage capacity grows and the number and variety of stored items increase, retrieving and managing that information becomes increasingly difficult. This is in stark contrast to the Internet, where search engines have made finding information very convenient. Today it is harder for users to find what they want in their own storage system than to find information on the Internet.

Inefficient traditional file systems

Almost all file storage systems are built on file systems, and file systems are inseparable from operating systems. A file system consists of files and directories. Data is organized into named files according to content, structure, and purpose, and directories build the hierarchical structure of the file system: subdirectories can be created at each level to classify objects. In this way the file system can organize files effectively. The name of each subdirectory or file is unique within its directory, which ensures that a full path name never refers to two or more subdirectories or files at the same time.

However, there is no effective way to build the hierarchy based on what a file actually is. The defects of the hierarchy become more obvious when file security and file sharing are considered. For example, to share a file on the network, the file is typically copied to a public directory and access permissions are set on that directory. The file then has two copies in two different hierarchies, which makes file management very inconvenient, especially as the number of files grows.

In addition, the hierarchical structure makes file access less efficient. A directory hides its contents, and it may contain subdirectories at further levels, so it is difficult for users to know what lies under a directory. To access a file, you must walk the directory tree to the file's storage location. If you do not know that location, you must traverse the entire directory tree or use the operating system's search function, and the operating system can typically search only by file name.
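The cost of a name-only search can be illustrated with a minimal Python sketch. The function name `find_by_name` is hypothetical and is not taken from any particular operating system; it only shows that a search by file name must visit every directory below the starting point.

```python
import fnmatch
import os
from typing import Iterator


def find_by_name(root: str, pattern: str) -> Iterator[str]:
    """Yield paths under `root` whose file name matches a shell-style pattern."""
    # os.walk visits every directory below root, so the cost grows with the
    # size of the whole tree, not with the number of matching files.
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # Only the name is examined; content and metadata are ignored.
            if fnmatch.fnmatch(name, pattern):
                yield os.path.join(dirpath, name)
```

A call such as `list(find_by_name(".", "*.txt"))` returns every matching path under the current directory, but it cannot answer questions like "all files written by a given author", because that information is not part of the file name.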

Efficient and reliable semantic file systems

Over the past 10 years, file system technology has not changed significantly, while new data types (such as multimedia and email) carrying rich metadata have emerged. Metadata has not been given a prominent role, and data stored in a file system lacks semantic support, so the storage system cannot provide high-level, semantics-based access to related data. Because of these shortcomings in existing file systems, academia and industry have done a great deal of work on improving the efficiency of file management and search. The most important result of this research is the semantic file system, which makes full use of file metadata for browsing and searching.

A semantic file system uses metadata extraction tools to obtain more metadata, records user activity, and lets files be annotated manually or by other means. This information is then combined into a unified body of metadata. By establishing a link between unstructured files and database data, the storage system can quickly access files based on their attributes.

The semantic file system provides a new access method: associative access. Associative access is a form of content-based access that lets you reach files in a flexible way. File attributes are extracted automatically by a converter (transducer) for each specific file type and expressed as ⟨key, value⟩ pairs. The semantic file system also introduces the concept of virtual folders. In a virtual folder, a user can perform an attribute-based search; for the result set, the system creates symbolic links to the matching files, providing file access paths that cut across the directory hierarchy. For example, the virtual folders of WinFS and Spotlight can be represented as text files in XML format whose content is the list of results returned by a database query, containing links to the files or folders that satisfy certain rules. Because a virtual folder holds links rather than copies, a semantic file system can easily place a file under several different directory hierarchies at the same time without occupying more storage space.
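The attribute extraction and virtual folder ideas described above can be sketched as follows. This is a minimal illustration and not the design of WinFS, Spotlight, or any other real system: `extract_attributes` and `build_virtual_folder` are hypothetical names, the extracted attributes are deliberately trivial, and the sketch assumes a system where creating symbolic links is permitted.

```python
import os
from typing import Callable, Dict, List

Attributes = Dict[str, str]


def extract_attributes(path: str) -> Attributes:
    """Hypothetical converter: derive <key, value> pairs from a file."""
    _, ext = os.path.splitext(path)
    return {
        "name": os.path.basename(path),
        "type": ext.lstrip(".").lower(),
        "size": str(os.path.getsize(path)),
    }


def build_virtual_folder(paths: List[str], folder: str,
                         matches: Callable[[Attributes], bool]) -> List[str]:
    """Create `folder` and fill it with symlinks to files whose attributes match."""
    os.makedirs(folder, exist_ok=True)
    links = []
    for path in paths:
        if matches(extract_attributes(path)):
            # Limitation of this sketch: two matches with the same base name
            # would collide, and os.symlink would raise FileExistsError.
            link = os.path.join(folder, os.path.basename(path))
            # A symlink stores only the target path, so the file is not copied.
            os.symlink(os.path.abspath(path), link)
            links.append(link)
    return links
```

For example, `build_virtual_folder(paths, "music", lambda a: a["type"] == "mp3")` would gather links to all MP3 files in one folder, regardless of where the originals sit in the directory tree.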

A semantic file system can classify files efficiently. For example, TagFS, which is built on Filesystem in Userspace (FUSE), uses a smart tagging mechanism to attach tags to data files dynamically. Tagged files can then be classified according to the user's preferences and intentions and sorted by weight.
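Weighted tagging can be illustrated with a small in-memory index. This is only a sketch of the general idea; it does not reproduce TagFS's actual data structures or its FUSE interface, and the class name `TagIndex` and the meaning of the weights are assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, List


class TagIndex:
    """In-memory map from tag to {path: weight}. Not TagFS's actual design."""

    def __init__(self) -> None:
        self._tags: Dict[str, Dict[str, float]] = defaultdict(dict)

    def tag(self, path: str, tag: str, weight: float = 1.0) -> None:
        """Attach `tag` to `path`; a later call with the same pair updates the weight."""
        self._tags[tag][path] = weight

    def files(self, tag: str) -> List[str]:
        """Return the paths carrying `tag`, heaviest weight first."""
        weighted = self._tags.get(tag, {})
        return sorted(weighted, key=lambda p: weighted[p], reverse=True)
```

A file can carry any number of tags, so the same file appears in several classifications at once, which a single directory path cannot express.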

A semantic file system also lets you search data files efficiently. The Logic File System (LISFS) uses a database to provide search over the files in the system. Its database tables consist of mappings from keywords to objects, and the content of a directory is the set of objects that satisfy a query. Apple's Spotlight is a metadata and content indexing system integrated with the HFS file system. WinFS stores its metadata in a database, and Spotlight's index and search results are likewise stored in a database. Linux has a Spotlight-like system called Beagle, which uses inotify, the kernel's file system event service, and provides a pluggable infrastructure for supporting new file types.
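The idea of a directory whose content is a query result can be sketched with Python's built-in `sqlite3` module. The table name `keyword_map`, the schema, and the function names are assumptions made for illustration; they are not the schema used by LISFS, WinFS, or Spotlight.

```python
import sqlite3
from typing import Iterable, List


def open_index() -> sqlite3.Connection:
    """Create an in-memory table mapping keywords to file paths."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE keyword_map ("
        "keyword TEXT, path TEXT, PRIMARY KEY (keyword, path))"
    )
    return conn


def add_file(conn: sqlite3.Connection, path: str, keywords: Iterable[str]) -> None:
    """Record that `path` is described by each of `keywords`."""
    conn.executemany(
        "INSERT OR IGNORE INTO keyword_map VALUES (?, ?)",
        [(k, path) for k in keywords],
    )


def list_directory(conn: sqlite3.Connection, keywords: List[str]) -> List[str]:
    """Return the paths described by all of `keywords`, like listing a query directory."""
    if not keywords:
        return []
    placeholders = ",".join("?" * len(keywords))
    rows = conn.execute(
        f"SELECT path FROM keyword_map WHERE keyword IN ({placeholders}) "
        "GROUP BY path HAVING COUNT(DISTINCT keyword) = ? ORDER BY path",
        [*keywords, len(set(keywords))],
    )
    return [row[0] for row in rows]
```

Under this sketch, a "directory" such as `music/jazz` would correspond to `list_directory(conn, ["music", "jazz"])`: its content is computed from the index, not stored in a fixed place in the hierarchy.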

Integrated data management and search

Although semantic file systems have done much to optimize file storage and retrieval, and their methods are widely recognized, they have not changed the hierarchical nature of file storage. Semantic file systems are only an important complement to hierarchical file systems.

A newer idea is to combine file storage with the Web, which conveys information by adding links. In general, in Web and hypertext documents, links let users jump directly from one document to another. Links can be extended further through the Semantic Web.

To make the Semantic Web possible, the W3C has developed a range of standards, following to some extent the approach used to standardize HTML and HTTP. The Semantic Web standards are organized in layers. URI and Unicode are at the bottom; XML, namespaces, and schemas are in the middle, forming the self-describing document layer; and RDF sits on top of these layers. RDF provides a general metadata framework for a wide range of applications.
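RDF describes resources as subject, predicate, object triples, which is what makes it usable as a general metadata framework. The following sketch models file metadata as plain Python tuples instead of using an RDF library. The predicates come from the Dublin Core element vocabulary, while the file URI under `example.org`, the literal values, and the `match` helper are made up for illustration.

```python
from typing import Iterable, List, Optional, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

# Dublin Core element namespace (a real vocabulary); the file URI is made up.
DC = "http://purl.org/dc/elements/1.1/"
FILE = "http://example.org/files/report.pdf"

TRIPLES: List[Triple] = [
    (FILE, DC + "title", "Quarterly report"),
    (FILE, DC + "creator", "Alice"),
    (FILE, DC + "format", "application/pdf"),
]


def match(triples: Iterable[Triple], subject: Optional[str] = None,
          predicate: Optional[str] = None, obj: Optional[str] = None) -> List[Triple]:
    """Return the triples that agree with every field that is not None."""
    return [
        (s, p, o) for (s, p, o) in triples
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]
```

Because every statement has the same three-part shape, the same `match` call works for any kind of metadata, for example `match(TRIPLES, predicate=DC + "creator")` to find who created which file.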

In addition, the Semantic Web provides the ability to process content and introduces two further concepts: knowledge navigators and federated knowledge bases or databases. The Semantic Web may therefore become a universally accessible library.

If file storage becomes part of the Web, the way files are stored and searched may change. Based on this idea, we are developing a semantic network storage (snstor) system that provides a rich metadata structure and builds an online file system. To address the performance problems of a Web-based file storage system, we plan to adopt a faster data structure, the balanced tree, in place of multiple linked lists, and to use compressed files for efficient storage. We are also studying fault-tolerant data structures to increase storage reliability and availability, for example by developing a consistency-check program.
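The article does not say which kind of balanced tree is planned or how it is used. As one possible illustration, the sketch below uses an AVL tree, chosen here only as an example, to map file names to locations. Lookups take a number of steps proportional to the height of the tree, which stays logarithmic in the number of entries, whereas a linked list must be scanned entry by entry. All names in the code are hypothetical.

```python
from typing import Optional


class Node:
    def __init__(self, key: str, value: str) -> None:
        self.key, self.value = key, value
        self.left: Optional["Node"] = None
        self.right: Optional["Node"] = None
        self.height = 1


def _height(n: Optional[Node]) -> int:
    return n.height if n else 0


def _update(n: Node) -> None:
    n.height = 1 + max(_height(n.left), _height(n.right))


def _rotate_right(y: Node) -> Node:
    x = y.left
    y.left, x.right = x.right, y
    _update(y)
    _update(x)
    return x


def _rotate_left(x: Node) -> Node:
    y = x.right
    x.right, y.left = y.left, x
    _update(x)
    _update(y)
    return y


def insert(node: Optional[Node], key: str, value: str) -> Node:
    """Insert or update `key` and return the new root of the subtree."""
    if node is None:
        return Node(key, value)
    if key < node.key:
        node.left = insert(node.left, key, value)
    elif key > node.key:
        node.right = insert(node.right, key, value)
    else:
        node.value = value
        return node
    _update(node)
    balance = _height(node.left) - _height(node.right)
    if balance > 1:  # left side too tall
        if _height(node.left.left) < _height(node.left.right):
            node.left = _rotate_left(node.left)  # left-right case
        return _rotate_right(node)
    if balance < -1:  # right side too tall
        if _height(node.right.right) < _height(node.right.left):
            node.right = _rotate_right(node.right)  # right-left case
        return _rotate_left(node)
    return node


def search(node: Optional[Node], key: str) -> Optional[str]:
    """Return the value stored for `key`, or None if it is absent."""
    while node is not None:
        if key == node.key:
            return node.value
        node = node.left if key < node.key else node.right
    return None
```

The rotations keep the two subtrees of every node within one level of each other, so even keys inserted in sorted order, the worst case for an unbalanced tree, do not degrade lookups to a linear scan.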

The rapid growth in the number of files indicates that market demand for efficient file storage systems will rise sharply. File storage systems that integrate data management and search can improve storage efficiency and reduce storage costs, and will be welcomed by users.
