I finally see where my misunderstanding was. I used to think that distributed indexing and distributed search were two different things, but they are really two sides of the same problem: distributing the index across multiple machines is what makes distributed search possible. Since the index is already stored in a distributed manner, and search is driven by the index, the search is naturally distributed as well. After reading various explanations online I had always treated distributed indexing and distributed search as two independent processes; that was the mistake, although I am still not entirely sure my understanding is correct now.
After some investigation, we found that the data structure of the Lucene index files is quite complex. Every time the index is committed, the previously generated segments may be reorganized (merged) and new files are written, so simply appending index files to HDFS would involve a lot of work: you need a clear understanding of the index file format and of how the individual files depend on each other. The following three articles describe the Lucene index structure in detail; I do not fully understand all of it, but take a look if you are interested.
1. http://www.cnblogs.com/mikehe1117/archive/2006/08/02/466264.html
2. http://blog.csdn.net/pangliyewanmei/article/details/5721113
3. http://www.cnblogs.com/forfuture1978/archive/2009/12/14/1623597.html
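To make the segment behavior concrete, here is a minimal Lucene sketch (written against the Lucene 3.x API of that era; the index path is a hypothetical placeholder). Each commit writes a new set of segment files plus a new segments_N file that ties them together, which is why the files cannot simply be appended to or split up individually:

```java
import java.io.File;
import java.util.Arrays;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class SegmentDemo {
    public static void main(String[] args) throws Exception {
        // Hypothetical local index directory, used only for this demonstration
        FSDirectory dir = FSDirectory.open(new File("/tmp/demo-index"));
        IndexWriterConfig cfg = new IndexWriterConfig(
                Version.LUCENE_36, new StandardAnalyzer(Version.LUCENE_36));
        IndexWriter writer = new IndexWriter(dir, cfg);

        for (int i = 0; i < 2; i++) {
            Document doc = new Document();
            doc.add(new Field("id", "doc-" + i, Field.Store.YES, Field.Index.NOT_ANALYZED));
            writer.addDocument(doc);
            writer.commit(); // every commit writes a new segment plus a new segments_N file
            System.out.println("after commit " + i + ": " + Arrays.toString(dir.listAll()));
        }
        writer.close();
        // Removing any one of the listed segment files by hand leaves the whole index
        // unreadable (IndexReader.open fails), which is why these small files cannot
        // be split across storage locations independently.
    }
}
```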
For the same reason, it is not a good idea to use HBase to store the index files: the index files would be managed in a distributed manner, but these small files depend on each other, and essentially all of them have to be read together for every search. I now understand why distributed indexing is so difficult: the small segment files form one coherent whole and cannot be split up arbitrarily. I tried it: if one of the generated segment files is removed, the index can no longer be searched and Lucene reports an error about the missing file (the sketch above illustrates the same segment structure). There are three feasible distributed index storage solutions:
1. Directly use the distributed features provided by SOLR, that is, deploy multiple SOLR instances and designate masters and slaves.
2. Store the index on HDFS.
3. Use katta to manage the indexes.
1. In the first solution the index files are stored on the local file systems of multiple machines rather than on HDFS; since mining with mahout requires the data to be on HDFS, the first solution is not suitable.
2. HDFS is not well suited to storing a large number of small files and doing so brings additional computing overhead. The nutch + SOLR solution also stores the index directly on HDFS without considering that the index consists of many small files, so the second approach of storing the index directly on HDFS and querying it there is not advisable either.
References: 1. How does Hadoop handle small files?
Since the index files must be stored on HDFS and a large number of small files must be avoided, only two solutions remain:
1. Export the SOLR index file directory to HDFS directly through the Java API, and then use katta to shard the index (a sketch of the upload step is shown after this list).
2. Use nutch to put the index files on HDFS, and then use katta to shard the index.
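For the first option, the "export via the Java API" step essentially amounts to recursively copying the SOLR data/index directory into HDFS with the Hadoop FileSystem API. A minimal sketch, with all paths and the namenode address as hypothetical placeholders (the katta sharding step is not shown):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class IndexUploader {
    public static void main(String[] args) throws Exception {
        // fs.default.name / fs.defaultFS must point at the namenode,
        // e.g. hdfs://namenode:9000 (hypothetical address)
        FileSystem hdfs = FileSystem.get(new Configuration());

        Path localIndex = new Path("/var/solr/data/index"); // hypothetical local SOLR index dir
        Path hdfsTarget = new Path("/indexes/solr/shard0"); // hypothetical target dir on HDFS

        // Recursively copy the index directory into HDFS; this step has to be
        // repeated after every commit, which is the drawback discussed below.
        hdfs.copyFromLocalFile(false /* delSrc */, true /* overwrite */, localIndex, hdfsTarget);
        hdfs.close();
    }
}
```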
The advantage of the first solution is that there is no need to set up a nutch platform. As for the second solution: hadoop + Lucene = nutch, and SOLR is also based on Lucene, so isn't that somewhat redundant? An important part of nutch is its web crawler, and using it only to crawl local disk files seems like overkill. In addition, updating the index with nutch is quite troublesome because the scripts have to be modified, while our system already has the data ready and should not need script changes. There are two approaches to the distributed search itself: one is to point the search directly at the index directory on HDFS; the other is to copy the index to the local disk of each search node and have the nutch search servers on multiple machines listen to their local index directories. The second approach requires manually copying the indexes to the local machines, but it is much more efficient; querying the index directly on HDFS is not recommended.
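The copy-to-local approach boils down to pulling a shard of the index from HDFS onto the search node's disk and opening an ordinary searcher over the local copy. A rough sketch, reusing the hypothetical paths from above and a made-up field name, shown here with plain Lucene rather than the nutch search server:

```java
import java.io.File;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

public class LocalSearchNode {
    public static void main(String[] args) throws Exception {
        // 1. Pull the shard from HDFS onto the local disk
        //    (assumes the local directory /data/search already exists)
        FileSystem hdfs = FileSystem.get(new Configuration());
        hdfs.copyToLocalFile(new Path("/indexes/solr/shard0"), new Path("/data/search"));
        hdfs.close();

        // 2. Search the local copy with plain Lucene; this avoids the per-query
        //    HDFS round trips that make searching directly on HDFS slow
        IndexReader reader = IndexReader.open(FSDirectory.open(new File("/data/search/shard0")));
        IndexSearcher searcher = new IndexSearcher(reader);
        TopDocs hits = searcher.search(new TermQuery(new Term("id", "doc-0")), 10);
        System.out.println("hits: " + hits.totalHits);
        searcher.close();
        reader.close();
    }
}
```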
The disadvantage of the first solution is that the index files have to be re-uploaded to HDFS every time the index is committed, and since the index files are generated by solrj, the performance impact of this is unknown.
The advantages and disadvantages of the second solution are naturally the opposite of those of the first. The second solution also has to solve the problem of how to update the index when a new document is uploaded to GlusterFS.
The open-source projects Lily and nut do not describe how they store indexes or distribute search. Since we have been building our indexing on SOLR all along, and we have already implemented updating the index via solrj whenever a new document is uploaded to GlusterFS, we intend to adopt the first solution: after the index is updated, the index files are also pushed to HDFS, and katta is then used to shard and manage the index.
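Since our indexing path already goes through solrj, the update step looks roughly like the following sketch (the SOLR URL and field names are hypothetical placeholders; in our setup the document content comes from the file that was just uploaded to GlusterFS):

```java
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class GlusterIndexUpdater {
    // Hypothetical SOLR URL; in practice this points at the indexing core
    private final SolrServer solr = new HttpSolrServer("http://localhost:8983/solr");

    /** Called after a new document has been written to GlusterFS. */
    public void indexNewFile(String id, String path, String content) throws Exception {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", id);          // field names are hypothetical
        doc.addField("path", path);
        doc.addField("content", content);
        solr.add(doc);
        // Each commit rewrites Lucene segments; under the first solution this is
        // also the point at which the index directory gets pushed to HDFS.
        solr.commit();
    }
}
```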
References:
1. About nutch index splitting (this article analyzes the time consumption of various queries)
2. Distributed query deployment of nutch-1.0 (this article explicitly states that querying directly on HDFS is impractical, and instead copies the index from HDFS to the local disk, with multiple machines listening to their local index directories)
3. How to set up nutch (v1.1) and hadoop (this article builds a distributed search setup for nutch)
4. nutch + SOLR
5. Distributed configuration and installation of nutch
6. Local index update policy of nutch