Using Mongo Connector and Elasticsearch to Implement Fuzzy Matching

Source: Internet
Author: User
Keywords: big data, MongoDB, Elasticsearch

"Editor's note": The author of this blog post, Luke Lovett, is a Java engineer at MongoDB. He shows how far Mongo Connector has come after two years of development: it now keeps the systems at both ends of the connector in sync as updates occur. Luke also shows how to implement fuzzy matching with Elasticsearch.

The following is the translation:

Introduction

Let's say you're running MongoDB. Great, you can now run exact-match queries against your database. Now imagine that you're building a text search feature into your application, one that must cut through the noise of spelling mistakes and still return approximate results. For this daunting task, you need something like Lucene, Elasticsearch, or Solr. But that raises a question: how do these search tools query documents stored in MongoDB? And how do you keep the search engine's content up to date?

Mongo Connector fills the gap between MongoDB and some of the best search tools, such as Elasticsearch and Solr. It not only exports data from a MongoDB replica set or sharded cluster into these systems, but also keeps them consistent: when you insert, update, or delete documents in MongoDB, those changes are quickly propagated through Mongo Connector to the other end. You can even use Mongo Connector to stream operations to another replica set to simulate a "multi-master" cluster.

When Mongo Connector was first released in August 2012, its functionality was simple and it lacked fault tolerance. I have been working on Mongo Connector since November 2013, with help from the MongoDB Python team, and I'm excited to say that it has made great strides in both functionality and stability. This article describes these new features and shows how to use Mongo Connector to sync MongoDB operations to Elasticsearch, an open-source search engine. At the end of the article, we also show how to run fuzzy-matching text queries against the data flowing into Elasticsearch.

Getting a Dataset

For this article, we will pull in posts from Reddit, a popular link aggregation website. We recently added safe encoding of the data types supported by MongoDB for external database drivers. This makes it safe to replicate documents whose content we do not fully control. The following script streams newly created Reddit posts into MongoDB:

./reddit2mongo --mongo-host localhost --mongo-port 27017

As posts are processed, you should see the first 20 words of each title. This process simulates an application you are developing that writes its data to MongoDB.

Starting Mongo Connector

Next, we will start Mongo Connector. To download and install it, you can use pip:

pip install mongo-connector

For this example to work, we assume that Elasticsearch is installed and running on the local machine on port 9200. You can use the following command to replicate from MongoDB to Elasticsearch:

mongo-connector -m localhost:27017 -t localhost:9200 -d mongo_connector/doc_managers/elastic_doc_manager.py

Of course, if you only want to run text searches over post titles and content, you can use the --fields option to limit which fields are passed to Elasticsearch. This keeps the amount of replicated data to a minimum:

mongo-connector -m localhost:27017 -t localhost:9200 --fields title,text -d doc_manager.py

Just as reddit2mongo prints Reddit posts to stdout, Mongo Connector produces log output showing that the documents are being sent to Elasticsearch at the same time.
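To make the effect of the field restriction in the command above concrete, here is a minimal Python sketch of projecting a document down to a whitelist of fields. It is not Mongo Connector's own code: the function name `keep_fields`, the sample post, and the choice to always retain `_id` are assumptions made for illustration.

```python
def keep_fields(document, fields):
    """Return a copy of `document` with only the listed fields.

    Assumption for this sketch: `_id` is always kept so the copy can
    still be matched to its source document.
    """
    wanted = set(fields) | {"_id"}
    return {key: value for key, value in document.items() if key in wanted}


# Hypothetical Reddit post as it might be stored in MongoDB.
post = {
    "_id": "t3_abc123",
    "title": "My kitten learned to open doors",
    "text": "She figured it out yesterday.",
    "score": 42,
    "subreddit": "aww",
}

filtered = keep_fields(post, ["title", "text"])
print(filtered)
```

Only `title`, `text`, and `_id` survive; fields such as `score` never reach the search index, which is why less data has to be replicated.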

Searching with Elasticsearch

Now we will use Elasticsearch to run fuzzy-matching queries against our dataset as it streams in from MongoDB. Because the content comes straight from Reddit, we cannot predict what results the dataset will give us. As an example, here is a search for "kitten":

curl -XPOST 'http://localhost:9200/reddit.posts/_search' -d '{"query": {"match": {"title": {"query": "kitten", "fuzziness": 2, "prefix_length": 1}}}}'

Because this is a fuzzy search, we can even search for a word that does not exist, such as "kiten". Most people do not pay much attention to their spelling, so you can imagine how powerful this is when searching on whatever text users happen to type in:

curl -XPOST 'http://localhost:9200/reddit.posts/_search' -d '{"query": {"match": {"title": {"query": "kiten", "fuzziness": 2, "prefix_length": 1}}}}'

The fuzziness parameter sets the maximum "edit distance" allowed between the query text and a matching term in the field, and the prefix_length parameter requires results to match the first letter of the query. A separate article describes in detail how this feature is implemented. The misspelled query returns the same results as the correctly spelled one.
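The JSON body passed to curl above can also be assembled in code, which avoids quoting and brace-balancing mistakes. The sketch below uses only the Python standard library and builds the body without sending it; the helper name `build_fuzzy_match` is hypothetical, and the defaults simply mirror the values used in this article.

```python
import json


def build_fuzzy_match(field, text, fuzziness=2, prefix_length=1):
    """Build the body of a fuzzy `match` query for a single field."""
    return {
        "query": {
            "match": {
                field: {
                    "query": text,
                    "fuzziness": fuzziness,
                    "prefix_length": prefix_length,
                }
            }
        }
    }


# Serialize to the JSON string that would be sent as the request body.
body = json.dumps(build_fuzzy_match("title", "kitten"))
print(body)
```

The resulting string can be posted to the `_search` endpoint with any HTTP client.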
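To illustrate what an edit distance limit and a required prefix mean, here is a small Python sketch. It is an illustration of the idea only, not Elasticsearch's implementation: it computes the plain Levenshtein distance (insertions, deletions, substitutions) with dynamic programming, whereas Elasticsearch relies on Lucene and by default also treats a transposition of two adjacent characters as a single edit. The function names are hypothetical.

```python
def edit_distance(a, b):
    """Plain Levenshtein distance between strings `a` and `b`."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def fuzzy_matches(query, term, fuzziness=2, prefix_length=1):
    """True if `term` shares the required prefix and is within the edit limit."""
    if query[:prefix_length] != term[:prefix_length]:
        return False
    return edit_distance(query, term) <= fuzziness


print(edit_distance("kiten", "kitten"))
print(fuzzy_matches("kiten", "kitten"))
print(fuzzy_matches("kiten", "mitten"))
```

With these settings, "kiten" matches "kitten" because one insertion separates them, while "mitten" is rejected by the prefix check even though it is only a couple of edits away.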

Not Just Inserts

Although we have only demonstrated a continuous stream of documents from MongoDB to Elasticsearch, Mongo Connector is more than an import/export tool. When you update or delete documents in MongoDB, those operations are also applied in the other system, keeping it in step with the current primary node. If the primary fails over and a rollback occurs, Mongo Connector can detect the rollback and take the correct action to maintain consistency.
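The idea of replaying inserts, updates, and deletes against a second system can be sketched in a few lines of Python. This is a deliberately simplified model, not Mongo Connector's code: the change-record shape (`op`, `_id`, `doc`, `changes`) is invented for illustration and is not the MongoDB oplog format, and a plain dictionary stands in for the search index.

```python
def apply_operation(index, operation):
    """Apply one simplified change record to an in-memory 'search index'."""
    kind = operation["op"]
    doc_id = operation["_id"]
    if kind == "insert":
        index[doc_id] = dict(operation["doc"])
    elif kind == "update":
        # Merge the changed fields into the existing copy, if there is one.
        index.setdefault(doc_id, {}).update(operation["changes"])
    elif kind == "delete":
        # Deleting a document that is already gone is not an error.
        index.pop(doc_id, None)
    else:
        raise ValueError(f"unknown operation: {kind!r}")


index = {}
operations = [
    {"op": "insert", "_id": 1, "doc": {"title": "Kitten pics"}},
    {"op": "insert", "_id": 2, "doc": {"title": "Puppy pics"}},
    {"op": "update", "_id": 1, "changes": {"title": "More kitten pics"}},
    {"op": "delete", "_id": 2},
]
for operation in operations:
    apply_operation(index, operation)

print(index)
```

After the four operations are replayed in order, the index holds only the updated first document, mirroring what the source collection would contain. Rollback handling is not modeled here.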

Summary

What really matters here is that we can work with MongoDB and Elasticsearch at the same time. Without a tool like Mongo Connector, we would have to use something like mongoexport to periodically dump data from MongoDB to JSON and then upload that data to an otherwise idle Elasticsearch instance. We would also have no way to remove deleted documents from Elasticsearch ahead of time. That would be a real hassle, and it would give up Elasticsearch's near-real-time query capability.

Although Mongo Connector has improved significantly since its first release, it is still an experimental product and is not officially supported by MongoDB. However, I am committed to answering questions and handling feature requests and bug reports on the Mongo Connector GitHub page, and I also keep an eye on all of the Mongo Connector GitHub wiki pages.

Source link: How to Perform Fuzzy-Matching with Mongo Connector and Elasticsearch (translator: Hongye; editor: Zhonghao)
