Research on storage strategy of small text corpus on Hadoop platform
Zheng Lijie, Huazhong Normal University
A new HSCs (Hadoop Small-texts Corpus Storage) strategy is proposed to resolve the tension between distributed storage and retrieval speed when a corpus of small texts is stored on the Hadoop platform. The strategy first applies small-text merging: a merge_client layer is added to the HDFS architecture, and many small text files are merged into a large text file organized by directory structure, which effectively reduces memory pressure and the number of accesses to the DataNode. It then applies small-file retrieval: a two-level index structure is added to the merged large text file, and a data structure is designed for the index records. An index file threshold is also introduced; when the threshold is reached, a virtual-memory technique swaps out the least frequently used index files, reducing the space complexity of file management. Together these steps address the problem of locating a small text inside the merged, directory-structured large file and effectively improve small-text retrieval speed. Finally, the experiments compare the writing speed and the text preprocessing speed of the small-text corpus before and after merging, compare text retrieval speed between the HSCs method and the SequenceFile method, and compare retrieval speed with and without the virtual-memory technique. The experimental results show that the proposed HSCs storage strategy is feasible and effective for handling small-text corpora.
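The merge-and-index idea described in the abstract can be illustrated with a small in-memory sketch. It is not the paper's implementation and does not touch HDFS: the class `MergedCorpus`, the parameter `max_resident` (standing in for the index file threshold), and the dictionary `swapped` (standing in for index files swapped to disk) are all hypothetical names chosen for illustration, and the index record is assumed to be a simple `(offset, length)` pair.

```python
from collections import defaultdict


class MergedCorpus:
    """Toy stand-in for one merged large file plus a two-level index."""

    def __init__(self, max_resident=2):
        self.blob = bytearray()           # the merged "large text file"
        self.resident = {}                # level 1: directory -> level-2 index kept in memory
        self.swapped = {}                 # level-2 indexes swapped out (stands in for disk)
        self.hits = defaultdict(int)      # access count per directory index
        self.max_resident = max_resident  # index threshold (hypothetical parameter)

    def _load(self, directory):
        """Return the level-2 index of a directory, swapping if needed."""
        if directory not in self.resident:
            index = self.swapped.pop(directory, {})
            if len(self.resident) >= self.max_resident:
                # Threshold reached: swap out the least frequently used index.
                victim = min(self.resident, key=lambda d: self.hits[d])
                self.swapped[victim] = self.resident.pop(victim)
            self.resident[directory] = index
        self.hits[directory] += 1
        return self.resident[directory]

    def add(self, path, text):
        """Append a small text to the blob and record (offset, length)."""
        directory, _, name = path.rpartition("/")
        data = text.encode("utf-8")
        self._load(directory)[name] = (len(self.blob), len(data))
        self.blob.extend(data)

    def read(self, path):
        """Look up directory, then file name, then slice the merged blob."""
        directory, _, name = path.rpartition("/")
        offset, length = self._load(directory)[name]
        return self.blob[offset:offset + length].decode("utf-8")


corpus = MergedCorpus(max_resident=2)
corpus.add("news/a.txt", "alpha")
corpus.add("news/b.txt", "beta")
corpus.add("blog/c.txt", "gamma")
corpus.add("wiki/d.txt", "delta")   # exceeds the threshold, so one index is swapped out
```

Reading a file whose directory index has been swapped out brings that index back and evicts another one, so at most `max_resident` directory indexes are held in memory at a time. A real deployment would persist the swapped indexes and the merged file in HDFS rather than in Python objects.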