Research on storage strategy of small text corpus on Hadoop platform

Keywords: merging, Hadoop platform, small text corpus, HSCs, storage strategy


Zheng Lijie, Huazhong Normal University

A new storage strategy, HSCs (Hadoop Small Texts Corpus Storage), is proposed to resolve the conflict between distributed storage and retrieval speed when a small text corpus is stored on the Hadoop platform. The strategy first applies small text merging: a merge_client layer is added to the HDFS architecture, and many small text files are merged into large text files organized by directory structure, which effectively reduces memory pressure and the number of accesses to the DataNode. It then applies small file retrieval: a two-level index structure is added to each merged large text file, and a data structure is designed for the index records. An index file threshold is also introduced. When the threshold is reached, a virtual memory technique swaps out the least frequently used index files, which reduces the space complexity of file management. Together, these measures solve the problem of locating a small text inside a merged, directory-structured large text file and effectively improve small text retrieval speed. Finally, the experiments compare the write speed and text preprocessing speed of the small text corpus before and after merging, compare text retrieval speed under the HSCs method and the SequenceFile method, and compare retrieval speed with and without the virtual storage technique. The experimental results show that the proposed HSCs storage strategy is feasible and effective for processing small text corpora.
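The abstract gives no record layout or API, so the following is only a minimal in-memory sketch of the three ideas it describes: merging small texts into one large blob, locating them through a two-level index (directory, then file name), and swapping out the least frequently used index table once a threshold is reached. All names here (merge_small_texts, IndexCache, read_small_text, threshold) are hypothetical, a plain dictionary stands in for index files on disk, and nothing here talks to HDFS.

```python
from collections import Counter


def merge_small_texts(files):
    """Concatenate small files into one blob and build a two-level index.

    files maps "directory/name" paths to bytes (an assumed input shape).
    Returns (blob, index) where index[directory][name] = (offset, length).
    """
    blob = bytearray()
    index = {}
    for path in sorted(files):  # sorted so files of one directory sit together
        directory, _, name = path.rpartition("/")
        data = files[path]
        index.setdefault(directory, {})[name] = (len(blob), len(data))
        blob.extend(data)
    return bytes(blob), index


class IndexCache:
    """Keeps at most `threshold` second-level index tables resident."""

    def __init__(self, index, threshold):
        self.threshold = threshold
        self.resident = {}          # directory -> {name: (offset, length)}
        self.swapped = dict(index)  # stand-in for index files kept on disk
        self.hits = Counter()       # access count per directory

    def lookup(self, path):
        directory, _, name = path.rpartition("/")
        if directory not in self.resident:
            if len(self.resident) >= self.threshold:
                # Swap out the resident table with the lowest access count.
                coldest = min(self.resident, key=lambda d: self.hits[d])
                self.swapped[coldest] = self.resident.pop(coldest)
            self.resident[directory] = self.swapped.pop(directory)
        self.hits[directory] += 1
        return self.resident[directory][name]


def read_small_text(blob, cache, path):
    """Fetch one small text from the merged blob via the two-level index."""
    offset, length = cache.lookup(path)
    return blob[offset:offset + length]
```

With a threshold of 2 and three directories, reading from a third directory forces the least frequently accessed of the two resident index tables out to the swapped store, while lookups for frequently used directories stay in memory.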

