Talking about Hadoop and Distributed Lucene


Lucene is the most widely used open-source search engine library. This article does not discuss how to make Lucene index updates visible in real time (see http://issues.apache.org/jira/browse/LUCENE-1313), nor how to modify Lucene's scoring mechanism to add factors such as PageRank; it only discusses distributed Lucene.

When Lucene comes up, Nutch is usually mentioned as well: Hadoop began as the crawler and indexer infrastructure that Doug Cutting built for Nutch and later split out into its own project. Within Nutch, Hadoop's role is to crawl pages and build indexes (the project's pages document the details of crawling and indexing). Because of the limited seek performance of Hadoop's file system, Nutch's distributed search relies on manual configuration and lacks any mechanism for managing indexes and servers. The concrete steps, sketched below: edit search-servers.txt on the web server to add the address and port of each search server; point searcher.dir in nutch-site.xml at the directory holding search-servers.txt; on each server that will provide search, manually copy the index files from HDFS to local disk; then launch DistributedSearch$Server to provide the search service. A failed search node is only discovered when an IPC search request times out.
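Roughly, that manual setup looks like this (a sketch assuming Nutch 0.8/0.9-era commands; the host names, port, and paths are made up for illustration):

    # search-servers.txt -- one "host port" pair per search server
    search-node-1 9999
    search-node-2 9999

    <!-- nutch-site.xml: point searcher.dir at the directory
         that contains search-servers.txt -->
    <property>
      <name>searcher.dir</name>
      <value>/opt/nutch/conf</value>
    </property>

    # On each search node: copy the index out of HDFS to local disk,
    # then start the distributed search server on the advertised port.
    bin/hadoop dfs -copyToLocal /user/nutch/crawl /data/crawl
    bin/nutch server 9999 /data/crawl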

Another way to distribute Lucene search is Solr (which I am not very familiar with). In Solr, the master server pushes all updates to the search servers automatically via cron, and a search is distributed across the search servers through a shards=host:port/base_url parameter, for example: http://localhost:8983/solr/select?shards=192.168.1.27:8983/solr,192.168.1.28:8983/solr&q=solr. The drawbacks are that there is no global mechanism for Lucene scoring factors such as lengthNorm, and no handling of node failures. Scalability also suffers because documents are assigned to shards by uniqueId.hashCode() % numServers. Recently Rackspace combined Solr, Hadoop, and Tomcat to search mail log data, but their write-up does not say what failure-handling mechanism, if any, is used.
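A minimal sketch of that assignment rule (illustrative only, not Solr's actual code):

    // Hash-modulo document-to-shard assignment as described above.
    public class ShardAssignment {
        static int shardFor(String uniqueId, int numServers) {
            // Mask off the sign bit so a negative hashCode() can't
            // produce a negative shard index.
            return (uniqueId.hashCode() & 0x7fffffff) % numServers;
        }
    }

Because the target shard depends on numServers, growing the cluster from N to N+1 servers changes the shard of almost every document and forces a near-total redistribution of the index, which is why this scheme scales poorly.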

The distributed Lucene implemented by HP Labs is mentioned on a Hadoop wiki page, but there has been no follow-up since the source was submitted on May 18, 2008.

Katta distributed search is an open-source project contributed by 101tec.com. Its primary purpose is to provide an efficient search service with load balancing. Katta uses ZooKeeper to track the liveness of the master node and the search nodes, to assign index files to search nodes, and to detect search-node failures. Each search node writes an ephemeral znode under ZooKeeper's "/nodes" node at startup, and the master sets a watch event to detect changes to it: when a node's connection to the ZooKeeper server is lost, ZooKeeper automatically deletes the ephemeral znode and notifies the master. Master failure is handled by the same pattern: only one master is active at a time, and it writes a "/master" znode to ZooKeeper; a standby master sets a watch on that znode and promotes itself to active master when the znode disappears. When a new index is deployed, a znode is added under the "/index" znode and the master assigns the index to search nodes. The "/nodes-to-shards" directory holds a znode per search node, under which is the list of index shards assigned to that node; the "/shards-to-nodes" directory holds a znode per shard, under which is the list of search nodes on which that shard is deployed.
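A minimal sketch of that registration-and-watch pattern using the plain ZooKeeper Java API (the host name and node name are illustrative; this is not Katta's actual code, and it assumes the "/nodes" parent znode already exists):

    import org.apache.zookeeper.*;
    import java.util.List;

    public class NodeRegistration {
        public static void main(String[] args) throws Exception {
            // The session timeout governs how quickly ZooKeeper
            // declares a disconnected node dead.
            ZooKeeper zk = new ZooKeeper("zk-host:2181", 5000, event -> {});

            // Search node side: register an EPHEMERAL znode under /nodes.
            // ZooKeeper deletes it automatically when the session expires.
            zk.create("/nodes/search-node-1", new byte[0],
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

            // Master side: watch /nodes; the watcher fires when a child
            // appears or disappears, i.e. a search node joins or dies.
            List<String> nodes = zk.getChildren("/nodes", event -> {
                if (event.getType()
                        == Watcher.Event.EventType.NodeChildrenChanged) {
                    System.out.println("search-node membership changed");
                    // re-read the children and reassign shards here
                }
            });
            System.out.println("live search nodes: " + nodes);
        }
    }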

Katta has no real-time updates at this stage (they are planned, and may resemble Dynamo: update consistency implemented with a quorum-style protocol), and no LRU or LFU caching policy. Its solution for distributed TF/IDF is to send two rounds of requests: first a request to each search node for document frequencies (reading only the .tis file, Lucene's term dictionary), then a search request to every node that carries the aggregated document frequencies along with the query.
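A minimal sketch of that two-phase exchange (the node interface and method names below are made up for illustration; Katta's real API differs):

    import java.util.*;

    // Hypothetical view of a remote search node for this sketch.
    interface SearchNode {
        // Phase 1: this node's document frequency for each query term.
        Map<String, Integer> docFreqs(List<String> terms);
        // Phase 2: search, scoring with the supplied *global* frequencies.
        List<String> search(String query, Map<String, Integer> globalDf);
    }

    class TwoPhaseSearch {
        static List<String> search(List<SearchNode> nodes, String query,
                                   List<String> terms) {
            // Phase 1: sum each term's document frequency across all
            // nodes, so idf reflects the whole corpus, not one shard.
            Map<String, Integer> globalDf = new HashMap<>();
            for (SearchNode node : nodes) {
                node.docFreqs(terms).forEach(
                    (term, df) -> globalDf.merge(term, df, Integer::sum));
            }
            // Phase 2: broadcast the query with the global counts.
            List<String> hits = new ArrayList<>();
            for (SearchNode node : nodes) {
                hits.addAll(node.search(query, globalDf));
            }
            return hits; // merging/re-sorting of hits omitted for brevity
        }
    }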

The index package in Hadoop's contrib uses MapReduce to build Lucene indexes; it is not used for searching.
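The core idea, roughly (a sketch using the old org.apache.hadoop.mapred API; class names and the shard count are illustrative, and this is not the contrib code itself):

    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;

    // Partition documents by shard in the map phase, then let each
    // reducer build one Lucene index. Lucene calls are elided.
    public class IndexingJob {
        public static class ShardMapper extends MapReduceBase
                implements Mapper<Text, Text, Text, Text> {
            private final Text shard = new Text();
            public void map(Text docId, Text docBody,
                            OutputCollector<Text, Text> out,
                            Reporter reporter) throws IOException {
                // Route each document to one of 4 shards by hashing its id.
                shard.set("shard-" + ((docId.hashCode() & 0x7fffffff) % 4));
                out.collect(shard, docBody);
            }
        }

        public static class IndexReducer extends MapReduceBase
                implements Reducer<Text, Text, Text, Text> {
            public void reduce(Text shard, Iterator<Text> docs,
                               OutputCollector<Text, Text> out,
                               Reporter reporter) throws IOException {
                // The real job would open a Lucene IndexWriter on local
                // disk, add one Document per value, close the writer,
                // and copy the finished index directory into HDFS.
                while (docs.hasNext()) {
                    docs.next(); // IndexWriter.addDocument(...) goes here
                }
                out.collect(shard, new Text("index built"));
            }
        }
    }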

With the software above, we can build an automated search service. Set up a web control server to supervise the entire process: build the index with Hadoop MapReduce, setting job.end.notification.url to our control server when the job is submitted; once the control server is notified that indexing has finished, it hands the index to Katta to provide the search service.
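For example, the notification hook can be set on the job configuration like this (a sketch; the control-server URL is hypothetical, while $jobId and $jobStatus are placeholders that Hadoop itself substitutes):

    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class SubmitIndexingJob {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(SubmitIndexingJob.class);
            // Ask Hadoop to notify our control server when the job ends.
            conf.set("job.end.notification.url",
                "http://control-server:8080/job-done?id=$jobId&status=$jobStatus");
            // ... input/output paths, mapper/reducer classes, etc. ...
            JobClient.runJob(conf); // on completion, Hadoop GETs the URL
        }
    }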
