Research on a Storage Strategy for Small-Text Corpora on the Hadoop Platform
Zheng Lijie, Huazhong Normal University
This paper proposes a new storage strategy, HSCs (Hadoop Smalltexts Corpus Storage), to resolve the conflict between distributed storage and retrieval speed when small-text corpora are stored on the Hadoop platform. The strategy first applies small-text merging: a merge_client layer is added to the HDFS architecture to merge many small text files into a large text file organized by directory structure, which effectively reduces memory pressure and the number of accesses to the DataNode. It then applies small-file retrieval: a two-level index structure is added to the merged large text file, and a data structure is designed for the index records. An index file threshold is also introduced; when the threshold is reached, a virtual-memory technique swaps out the least frequently used index files, reducing the space complexity of file management. Together these measures solve the problem of locating a small text inside the merged, directory-structured large text file and effectively improve the retrieval speed of small texts. Finally, experiments compare the write speed and text preprocessing speed of the small-text corpus before and after merging, compare text retrieval speed between the HSCs method and the SequenceFile method, and compare retrieval speed with and without the virtual storage technique. The experimental results show that the proposed HSCs storage strategy is feasible and effective for processing small-text corpora.
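The merge-then-index idea described above can be illustrated with a minimal in-memory sketch. This is not the paper's implementation: it does not touch HDFS, and the names IndexRecord, merge_small_texts, IndexCache, and read_small_text are hypothetical. It assumes the first index level is keyed by directory, the second by file name, and that the swap policy evicts the resident index block with the lowest access count.

```python
from dataclasses import dataclass


@dataclass
class IndexRecord:
    offset: int  # byte offset of the small text inside the merged file
    length: int  # byte length of the small text


def merge_small_texts(files):
    """Merge {"dir/name": bytes} into one blob plus a two-level index.

    Level 1 maps a directory to its index block; level 2 maps a file
    name inside that directory to an IndexRecord. (Assumed layout.)
    """
    merged = bytearray()
    index = {}
    for path in sorted(files):
        directory, _, name = path.rpartition("/")
        data = files[path]
        index.setdefault(directory, {})[name] = IndexRecord(len(merged), len(data))
        merged.extend(data)
    return bytes(merged), index


class IndexCache:
    """Keeps at most `threshold` index blocks resident in memory.

    When the threshold is reached, the least frequently used block is
    swapped out to `swapped`, a dict standing in for on-disk storage.
    """

    def __init__(self, index, threshold):
        self.swapped = dict(index)  # every block starts swapped out
        self.resident = {}
        self.hits = {}              # access count per directory
        self.threshold = threshold

    def _load(self, directory):
        if directory not in self.resident:
            block = self.swapped.pop(directory)  # KeyError if unknown
            if len(self.resident) >= self.threshold:
                victim = min(self.resident, key=lambda d: self.hits[d])
                self.swapped[victim] = self.resident.pop(victim)
            self.resident[directory] = block
        self.hits[directory] = self.hits.get(directory, 0) + 1
        return self.resident[directory]

    def lookup(self, path):
        directory, _, name = path.rpartition("/")
        return self._load(directory)[name]


def read_small_text(merged, cache, path):
    """Return the bytes of one small text from the merged blob."""
    rec = cache.lookup(path)
    return merged[rec.offset:rec.offset + rec.length]
```

With a threshold of 2 and three directories, reading from a third directory swaps out whichever resident block has been accessed least. A real deployment would store the merged blob and index blocks in HDFS rather than in Python objects.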