A Solution for Concurrent Indexing in Lucene

Source: Internet
Author: User

Background: indexing 300,000 records with a single thread takes 10 minutes. To speed this up, I switched to multithreading.

At first, I had multiple threads share a single IndexWriter instance (which also means they all write to the same index directory). This is the practice recommended by Lucene in Action and the Lucene wiki. However, I kept getting an intermittent FileNotFoundException and could not work out why. The error reminded me of another problem: searching while the index is being built reports the same exception.
This is puzzling, because Lucene in Action clearly says that an index can be read while it is being written.

For my second attempt, I gave each thread its own IndexWriter instance, but all of them still wrote to the same directory.
This failed with write-lock errors, which matches what the book describes.

Finally, I had no choice but to give each thread its own IndexWriter instance writing to its own directory, with the last thread to finish merging all the indexes. For example, with four threads there are five directories: build_index, build_index1, build_index2, build_index3, and build_index4. Thread 1 writes to build_index1, thread 2 writes to build_index2, and so on. Whichever thread finishes last merges the indexes in build_index1 through build_index4 into build_index.
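The coordination in this scheme (one directory per thread, with the last thread to finish doing the merge) can be sketched in plain Java. This is only a minimal sketch, not the original code: the class name ParallelIndexSketch and the ShardIndexer and ShardMerger hooks are hypothetical placeholders for the actual Lucene work, and the even split of records across threads is an assumption. In a real build the merge hook would presumably wrap a Lucene call such as IndexWriter.addIndexes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

/** Sketch: one worker per shard directory; the last worker to finish merges. */
final class ParallelIndexSketch {

    /** Hypothetical hook: index records [from, to) into the given shard directory. */
    interface ShardIndexer {
        void index(String shardDir, int from, int to) throws Exception;
    }

    /** Hypothetical hook: merge all shard directories into the target directory. */
    interface ShardMerger {
        void merge(String targetDir, List<String> shardDirs) throws Exception;
    }

    /** Shard directory names base1..baseN, e.g. build_index1..build_index4. */
    static List<String> shardDirs(String base, int n) {
        List<String> dirs = new ArrayList<>();
        for (int i = 1; i <= n; i++) {
            dirs.add(base + i);
        }
        return dirs;
    }

    /** Half-open record range [from, to) for worker i (0-based) out of n workers. */
    static int[] slice(int total, int n, int i) {
        int from = (int) ((long) total * i / n);
        int to = (int) ((long) total * (i + 1) / n);
        return new int[] {from, to};
    }

    static void run(String base, int total, int threads,
                    ShardIndexer indexer, ShardMerger merger) {
        List<String> dirs = shardDirs(base, threads);
        AtomicInteger remaining = new AtomicInteger(threads);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Void>> futures = new ArrayList<>();
            for (int i = 0; i < threads; i++) {
                final int worker = i;
                Callable<Void> task = () -> {
                    int[] range = slice(total, threads, worker);
                    indexer.index(dirs.get(worker), range[0], range[1]);
                    // decrementAndGet returns 0 for exactly one worker: the last to finish.
                    if (remaining.decrementAndGet() == 0) {
                        merger.merge(base, dirs);
                    }
                    return null;
                };
                futures.add(pool.submit(task));
            }
            for (Future<Void> f : futures) {
                f.get(); // surfaces any exception thrown by a worker
            }
        } catch (Exception e) {
            throw new IllegalStateException("parallel index build failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Stub hooks that only print; real hooks would do the Lucene work.
        run("build_index", 300_000, 4,
            (dir, from, to) -> System.out.println("index [" + from + ", " + to + ") -> " + dir),
            (target, dirs) -> System.out.println("merge " + dirs + " -> " + target));
    }
}
```

The atomic counter makes sure exactly one thread performs the merge, and only after every shard has been written. If any worker fails, the counter never reaches zero, so a partial set of shards is never merged into build_index.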

With four threads, the whole run took about 7-8 minutes, of which merging the indexes took only about 20 seconds.
With 10 threads, the whole run took more than six minutes, of which merging took only 21 seconds.

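The improvement is modest: against the 10-minute single-threaded baseline, that is a speedup of roughly 1.25x to 1.4x with four threads and less than 1.7x with ten. I suspect this is because the data set is not large enough. The larger the data set, the more obvious the advantage of parallel processing should be.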

As the numbers show, merging the indexes is very fast, which brings another benefit. We normally use build_index as the search directory and, as mentioned above, building the index interferes with searching (even though the book says it should not). With this approach, most of the index-building work never touches build_index. Only the final merge writes to it, and that step is quick, so the problems that index building causes for search are greatly reduced.

If conditions permit, the same approach can be extended from multiple threads to multiple machines that build indexes simultaneously.

