SOLR integrates with MongoDB: real-time incremental indexing

Source: Internet
Author: User
Tags: solr, win32


Original link: http://www.656463.com/article/ZJZvIv.htm









First. Overview






A large amount of data is stored in MongoDB and needs to be searched quickly, so we set up a SOLR service.



Another point: once the data is indexed by SOLR, it can be reused across projects; you simply send requests to the SOLR service and get results back as XML, JSON, and other formats, which makes consuming the data much more flexible.
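For example, any project can query the SOLR service with a plain HTTP request. A minimal sketch in Python 2 (the host, port, and core name are assumptions that match the setup used later in this article):

import urllib2

# ask core0 for everything, serialized as JSON (wt=xml would return XML instead)
url = "http://localhost:8080/solr/core0/select?q=*:*&wt=json"
print(urllib2.urlopen(url).read())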






MongoDB and SOLR themselves are not introduced here; this article focuses on the complete procedure for combining SOLR with MongoDB to achieve real-time incremental indexing.



MongoDB's official website: http://www.mongodb.org/



The SOLR project page: http://lucene.apache.org/solr/






Second. Finding a Solution






Since the goal is to make SOLR work together with MongoDB, I started thinking about how to do it.






After searching around on the Internet, I narrowed things down to the following three approaches:



1. Use SOLR's DataImport function (Data Import)



Let's take a look at the description of the DataImport feature on the SOLR wiki: http://wiki.apache.org/solr/DataImportHandler



Most applications store data in relational databases or XML files and searching over such data is a common use-case. The DataImportHandler is a Solr contrib that provides a configuration driven way to import this data into Solr in both "full builds" and using incremental delta imports.



In other words: for data stored in relational databases and XML, SOLR provides the DataImportHandler to build both full and incremental indexes.






Hmm, nothing there about NoSQL support. Not quite believing it, I read a bit more closely.



The wiki only gives "Usage with RDBMS" and "Usage with XML/HTTP Datasource"; it appears that SOLR's DataImport does not currently support NoSQL.



Interested readers could try adding a Mongo DataImportHandler to SOLR, which would probably mean writing a Mongo driver underneath it; the effort could be considerable.



The key issue is that this approach is hard to control and the cost could be very high, so I did not take it.






Let me share an article here: a SOLR and MySQL integration guide.



The DataImport feature is indeed quite powerful, and its MySQL support is very good; I tried integrating SOLR with MySQL and the configuration process was simple.



However, MySQL is not the focus of this article, so I only tried it briefly and did not go deeper.






2. Read the data out of MongoDB with a scripting language (Script Update)



Put plainly: read the entire collection and traverse it.



This approach is the most intuitive, but it is not elegant: reusability and maintainability are low,



and the most serious problem is performance. It is acceptable below the million-document level, but once the data keeps growing, the performance problem becomes prominent.



If you still want to use this approach anyway, there is one more question to consider: on each traversal, do you rebuild the SOLR index in full or update it incrementally?



A full rebuild just overwrites everything, which is easy to do; with incremental updates, what do you do about documents that have been deleted from MONGO? A sketch of the full-rebuild variant follows.
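A minimal sketch of the traversal approach with pymongo and pysolr (the collection, core URL, and field names are assumptions matching the rest of this article; since it is a full rebuild, deletes are handled by the wipe):

import pysolr
from pymongo import MongoClient

solr = pysolr.Solr("http://localhost:8080/solr/core0")
coll = MongoClient("localhost", 27017).test.test   # assumed mongod on the default port

# read the whole collection and map it onto the schema fields
docs = [{"_id": str(d["_id"]), "name": d.get("name", "")} for d in coll.find()]

solr.delete(q="*:*")   # full rebuild: wipe the index first...
solr.add(docs)         # ...then re-add everything; O(collection size) on every run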






In short, this approach is not recommended: its complexity, in both time and space, is obvious.






3. Use MongoDB's oplog (Oplog Update)



MongoDB supports clustering, and the instances in a cluster communicate through a log of operations, called the oplog (operation log) in MongoDB, similar to MySQL's binlog.



MongoDB's documentation describes the oplog here: http://docs.mongodb.org/manual/reference/program/mongooplog/






Even if you still want to use approach 2 above, the oplog would make the work much easier.



First, the oplog is written in real time, and combined with a tailable cursor it allows the SOLR index to be updated in real time; see http://derickrethans.nl/mongodb-and-solr.html



Second, it is elegant: deciding whether an incremental change is an insert, update, or delete becomes O(1), as the sketch below shows.
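To make this concrete, a minimal sketch of tailing the oplog with pymongo 2.x (it assumes a replica set member on localhost:27001, as configured later in this article; the print stands in for real index updates):

import time
from pymongo import MongoClient

client = MongoClient("localhost", 27001)
oplog = client.local["oplog.rs"]   # the oplog is a capped collection in the local db

# a tailable cursor stays open and waits for new entries instead of closing
cursor = oplog.find(tailable=True, await_data=True)
while cursor.alive:
    try:
        entry = cursor.next()
        # entry["op"]: "i" = insert, "u" = update, "d" = delete
        # entry["ns"]: the namespace; entry["o"]: the document or change
        print("%s %s" % (entry["op"], entry["ns"]))
    except StopIteration:
        time.sleep(1)   # no new data yet; poll again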






So if you want to integrate SOLR and MongoDB through the oplog, the following questions need to be answered:



(1) How to enable the oplog and configure it appropriately



(2) How Mongo's tailable cursors work



(3) Which language to use, and which SOLR client to choose



(4) How to recover after the server goes down









Third. The Final Solution: Mongo-connector






Just as I was enthusiastically setting out to implement approach 3, I came across this: http://blog.mongodb.org/post/29127828146/introducing-mongo-connector



Unexpectedly, a Mongo-SOLR connector already existed; I was ecstatic.



It is exactly an implementation of approach 3. Everything was already solved, it uses Python, which suits the project, and it all felt almost too good to be true.






Git address: https://github.com/10gen-labs/mongo-connector



But the configuration still took me quite a while, so the whole process is recorded below.









Fourth. Project Environment and Tool Versions






Local test; server: Windows 7 32-bit

MongoDB: mongodb-win32-i386-2.4.5

Tomcat: 6

Python: 2.7.4

SOLR: 4.5.1

Mongo-connector: no version number provided

Python pysolr module

Python pymongo module

Python lxml module: lxml-3.2.3.win32-py2.7






Other modules may also be required; since I had installed them earlier, they are not listed. If you get a "module not found" error while running, just install the missing module.









Fifth. SOLR-side Preparation






This assumes you have already deployed SOLR successfully; the detailed deployment steps can be found with Google.



Here we only cover the configuration relevant to this test.






Use the multicore example shipped with SOLR, taking core0 as the example.



Modify schema.xml as follows: change the unique key to _id so it corresponds to MONGO's _id, and keep only a name field of type string.
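A minimal sketch of the relevant part of schema.xml, reconstructed from the description above (everything else stays as shipped in the multicore example):

<field name="_id" type="string" indexed="true" stored="true" required="true"/>
<field name="name" type="string" indexed="true" stored="true"/>

<uniqueKey>_id</uniqueKey>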







No other configuration needs to be changed.



Run it in Tomcat and check whether the core is configured successfully.









Sixth. MongoDB-side Preparation






See the description in the Mongo-connector project:



Since the connector does real time syncing, it is necessary to have MongoDB running, although the connector will work with both sharded and non sharded configurations. It requires a replica set setup.



In other words, to have the oplog available we need to start a replica set in MONGO.






1. Configure the replica set



(1) My MONGO_HOME is D:\mongodb.



The directory tree is as follows:



rs (d)
|----db (d)                 directory for the MONGO data files
|    |----rs1 (d)           data files of instance rs1
|    |----rs2 (d)           data files of instance rs2
|----log (d)                directory for the log files
|    |----rs1.log (f)       log file of instance rs1
|    |----rs2.log (f)       log file of instance rs2
|----mongod-rs1.bat         startup script of instance rs1
|----mongod-rs2.bat         startup script of instance rs2






mongod-rs1.bat contains the following (the value after --oplogSize, 100 here, is an illustrative assumption):



D:\mongodb\bin\mongod --port 27001 --oplogSize 100 --dbpath db\rs1 --logpath log\rs1.log --replSet rs/127.0.0.1:27002 --journal
pause



mongod-rs2.bat is the same except for the port, data directory, log file, and peer (same assumed oplog size):



D:\mongodb\bin\mongod --port 27002 --oplogSize 100 --dbpath db\rs2 --logpath log\rs2.log --replSet rs/127.0.0.1:27001 --journal
pause






(2) Run both scripts to start the two mongod instances.






(3) At this point they do not yet form a replica set; that still has to be configured. Open the MONGO shell, connect to localhost:27001 (instance rs1), and initiate the set.
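A sketch of the initiation step using pymongo (in the MONGO shell the equivalent is rs.initiate(config); the member list matches the two .bat files above):

from pymongo import MongoClient

client = MongoClient("localhost", 27001)   # connect to instance rs1
config = {
    "_id": "rs",                           # must match the --replSet name
    "members": [
        {"_id": 0, "host": "127.0.0.1:27001"},
        {"_id": 1, "host": "127.0.0.1:27002"},
    ],
}
client.admin.command("replSetInitiate", config)

# check the member states while the set comes up
print(client.admin.command("replSetGetStatus"))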


At this point, the configuration is complete.









Seventh. Mongo-connector Preparation






If you are using the default multicore configuration from the SOLR example, visit http://localhost:8080/solr/core0/admin/luke?show=schema&wt=json



You should see core0's schema in JSON form.






Open mongo_connector/doc_managers/solr_doc_manager.py



Make the following modifications: 1. import verify_url from util; 2. change ADMIN_URL to the part of core0's JSON-form schema URL that follows the base_url, because indexing is driven by the fields in the schema.
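A sketch of what those two edits could look like (identifier names follow the description above; the exact code differs by mongo-connector version):

# at the top of solr_doc_manager.py
from util import verify_url

# the schema URL relative to the core's base_url, i.e. the second half of
# http://localhost:8080/solr/core0/admin/luke?show=schema&wt=json
ADMIN_URL = 'admin/luke?show=schema&wt=json'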










Starting mongo-connector against a multicore SOLR reports a SOLR URL access error: the check expects you to pass http://localhost:8080/solr,



but http://localhost:8080/solr/core0 is the URL that actually works, so we need to pass that as the base_url.



The workaround is simple: disable the URL check.
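Sketched against the description above (the exact lines depend on the mongo-connector version), the check in mongo_connector.py is simply commented out:

# if not verify_url(url):        # hypothetical original check on the -t argument
#     print("Invalid Solr URL")
#     sys.exit(1)
# with the check disabled, http://localhost:8080/solr/core0 passes through
# unchanged and is used as the base_url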










The next step is to start mongo-connector. The start command is as follows:

C:\users\gmuser\desktop\mongo_connector>python mongo_connector.py -m localhost:27001 -t http://localhost:8080/solr/core0 -o oplog_progress.txt -n test.test -u _id -d ./doc_managers/solr_doc_manager.py







-m    the address of the mongod instance

-t    the SOLR base_url

-o    the file that records the oplog processing timestamp

-n    the MONGO namespaces to listen on, i.e. which collections of which databases; separate multiple namespaces with commas (here it is the test collection of the test database)

-u    the field used as the unique document key (_id here)

-d    the Python file that processes the docs






If it starts up without errors, your configuration has been successful.













Eighth. Testing the Incremental Index






First look at core0's status in SOLR: there are no records yet, Num Docs is 0.









Insert a document into MongoDB. It must contain the name field; remember our schema.xml above.
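For instance, from Python with pymongo 2.x (the name value is arbitrary):

from pymongo import MongoClient

client = MongoClient("localhost", 27001)
client.test.test.insert({"name": "hello solr"})   # the test db, test collection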









Watch the Mongo-connector output: it updates one record.









Now look at SOLR's status again: the document we just inserted is there.
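You can also verify from code with a small pysolr query (same core URL as configured above):

import pysolr

solr = pysolr.Solr("http://localhost:8080/solr/core0")
for doc in solr.search("name:*"):   # every document that has a name field
    print(doc)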







