Data synchronization in distributed cluster environment of Elasticsearch and MongoDB


1. What is ElasticSearch?

ElasticSearch is an open-source, distributed, RESTful search engine built on Lucene. Its role is to provide an additional component (a searchable repository) for applications that already have databases and web front ends. Elasticsearch provides the search algorithms and related infrastructure for an application; users simply upload the application's data to the Elasticsearch data store and then interact with it through RESTful URLs. Elasticsearch's architecture differs significantly from earlier search engine architectures because it is built to scale horizontally. Unlike Solr, its goal from the very beginning of its design was to be a distributed platform, which fits well with the rise of cloud and big data technologies. Elasticsearch is built on the stable open-source search engine Lucene and works much like a schemaless store of JSON documents.

Key Features of Elasticsearch

    • RESTful style

Every introduction to Elasticsearch inevitably mentions that it is a RESTful search engine. So what is RESTful? REST (Representational State Transfer) is a design and development approach for networked applications that reduces development complexity and increases system scalability. REST defines a set of design concepts and guidelines; applications that follow these guidelines are called RESTful. In a REST-style architecture, every request is made against a resource with a specific address given by a URL. For example, /schools/ may represent a collection of schools, /schools/1 represents the school with ID 1, and so on. This design style lets users interact with Elasticsearch in a simple and convenient way through RESTful APIs and tools such as curl, and avoids the hassle of managing XML configuration files. Here is a quick introduction.

Use the curl tool to perform CRUD (create, read, update, delete) operations against Elasticsearch.

• Index construction

To index a JSON document, submit a PUT request to the REST API with a URL consisting of the index name, type name, and ID, that is:

http://localhost:9200/<index>/<type>/[<id>]

Example:

curl -XPUT "http://localhost:9200/movies/movie/1" -d '
{
  "title": "The Godfather",
  "director": "Francis Ford Coppola",
  "year": 1972
}'

• Get indexed data by ID

Send a GET request to an already-built index; the URL is http://localhost:9200/<index>/<type>/<id>

Example:

curl -XGET "http://localhost:9200/movies/movie/1" -d ''

When no request body follows, the trailing -d '' can also be omitted.

• Delete a document

Deletes the single document identified by index, type, and ID. The URL is the same as for index creation and retrieval.

Example:

curl -XDELETE "http://localhost:9200/movies/movie/1"
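
The examples above cover create, read, and delete; for completeness, here is a minimal sketch of an update, reusing the movies/movie/1 document from above (the partial-document form assumes the _update API available in this version of Elasticsearch):

# Overwrite the whole document with a new PUT
curl -XPUT "http://localhost:9200/movies/movie/1" -d '
{
  "title": "The Godfather",
  "director": "Francis Ford Coppola",
  "year": 1972
}'

# Or apply a partial update through the _update endpoint
curl -XPOST "http://localhost:9200/movies/movie/1/_update" -d '
{
  "doc": { "director": "Francis Ford Coppola" }
}'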

    • Elasticsearch uses the gateway concept to make full backups easier.

Since Elasticsearch is designed specifically for distributed environments, how to persist the index information of all nodes is a real problem. Besides index information, cluster state, mappings, and transaction logs also need to be persisted. This information becomes very important when a node fails or the cluster restarts. Elasticsearch has a dedicated gateway module responsible for persisting this meta-information. (Does Solr manage this part through ZooKeeper?)
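
As a rough illustration, the classic gateway-related recovery settings live in config/elasticsearch.yml; the values below are only an assumed example for a three-node cluster, not settings taken from this article's setup:

# config/elasticsearch.yml (example values, assumed)
gateway.type: local                # persist cluster meta-information to local storage
gateway.recover_after_nodes: 2     # start recovery once 2 nodes have joined
gateway.recover_after_time: 5m     # or after waiting 5 minutes
gateway.expected_nodes: 3          # recover immediately when all 3 expected nodes are present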

    • Elasticsearch supports faceting (faceted search) and percolating

A facet is one of the multidimensional attributes of a thing. For example, a book has a topic, an author, an era, and so on. Faceted search is a way of filtering and narrowing down search results by these attributes. This is already implemented in Lucene, so Solr also supports faceted search. The percolating feature, however, is a highlight of the Elasticsearch design. The percolator lets you perform the reverse of the normal workflow in Elasticsearch: instead of indexing documents first and then running queries against them, you register many queries on an index and then send a percolate request for a given document, which returns the registered queries that match that document. As a simple example, suppose we want to find all tweets that contain the word "elasticsearch": we can register a query on the index, run each incoming tweet against the registered queries, and get back the queries that match it. The following is a simple example:

First, create an index:

curl -XPUT localhost:9200/test

Next, register a percolator query for the test index and name it kuku.

--- The local test was unsuccessful; no cause was found ---

curl -XPUT localhost:9200/_percolator/test/kuku -d '
{
  "query": {
    "term": {
      "field1": "value1"
    }
  }
}'

Now you can percolate a document to see which registered queries match it.

curl -XGET localhost:9200/test/type/_percolate -d '
{
  "doc": {
    "field1": "value1"
  }
}'

The resulting return structure is as follows

{"OK": true, "matches": ["Kuku"]}

--end--

    • Distributed features of Elasticsearch

Unlike Solr, Elasticsearch was designed from the start for distributed application environments, so it has many features that make building distributed applications convenient. For example, an index can be split into multiple shards, each shard can have multiple replicas, each node can hold one or more shards, and load balancing and routing across shard replicas are handled automatically. In addition, Elasticsearch is self-contained and does not require a servlet container such as Tomcat. An Elasticsearch cluster is self-discovering and self-managing (implemented with the built-in Zen Discovery module), and configuration is simple: all nodes just need the same cluster.name configured in config/elasticsearch.yml.
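
As a minimal sketch (the cluster name and index settings below are assumed examples, not taken from this article's setup), nodes join the same cluster via cluster.name, and shard and replica counts are chosen when an index is created:

# config/elasticsearch.yml on every node (assumed cluster name)
# cluster.name: my-es-cluster

# Create an index with explicit shard and replica counts
curl -XPUT "http://localhost:9200/myindex" -d '
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1
  }
}'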

    • Support for multiple data sources

Elasticsearch has a plug-in module called river that can import data from an external data source into Elasticsearch and index it. A river is a singleton on the cluster: it is automatically assigned to one node, and if that node goes down, the river is automatically reassigned to another node. Currently supported data sources include Wikipedia, MongoDB, CouchDB, RabbitMQ, RSS, Sofa, JDBC, filesystem, Dropbox, and more. The river follows a set of defined specifications, so plug-ins can be developed to suit your own application data.

2. How does Elasticsearch establish a data source connection?

Elasticsearch establishes a connection to each data source through a river. For MongoDB, this connection is mostly provided by third-party plug-ins: open-source contributors have built river plug-ins for various types of data management systems and message queues, which establish the river and index the data. This article focuses on combining MongoDB with ES, using the river developed by richardwilly98.

https://github.com/richardwilly98/elasticsearch-river-mongodb

3. MongoDB cluster environment construction. See: http://blog.csdn.net/huwei2003/article/details/40453159

4. How Elasticsearch builds a river for a truly distributed MongoDB cluster and indexes the data
1. First download and unzip Elasticsearch
   unzip elasticsearch-0.90.5.zip
2. Download and unzip elasticsearch-servicewrapper-master.zip
   unzip elasticsearch-servicewrapper-master.zip
   cd elasticsearch-servicewrapper-master
   mv service /root/gy/elasticsearch-0.90.5/bin
3. Start Elasticsearch
   sh elasticsearch start
4. Download the river plugin
   ./plugin --install com.github.richardwilly98.elasticsearch/elasticsearch-river-mongodb/1.7.1
It is worth mentioning that the river version must match both the MongoDB version and the Elasticsearch version; if they do not match, the river will not be able to index all the data in MongoDB into ES. The matching rules are listed at the link below. This test uses ES 1.1.2 + MongoDB 2.4.6.
https://github.com/richardwilly98/elasticsearch-river-mongodb
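
Before creating the river, it is worth confirming the versions actually running; a minimal sketch, assuming the host and port used in this article's setup:

# Elasticsearch version (the root endpoint returns version.number)
curl "http://localhost:9200/"

# MongoDB version, asked through the mongos router
mongo 192.168.225.131:37017 --eval "db.version()"
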
5 Establishing River
curl -XPUT "http://localhost:9200/_river/mongodb/_meta" -d '
{
  "type": "mongodb",
  "mongodb": {
    "servers": [{ "host": "192.168.225.131", "port": 37017 }],
    "db": "dbname",
    "collection": "collectionName",
    "gridfs": false,
    "options": {
      "include_fields": ["_id", "VERSION", "accession", "file"]
    }
  },
  "index": {
    "name": "indexname",
    "type": "meta"
  }
}'
Note: the index name must be lowercase, and "meta" in type is the collection name. Because this test uses a MongoDB sharding cluster environment, the river connects through the mongos router, so all the data in the Mongo cluster can be indexed normally. gridfs and options do not have to be set.
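
For a sharded cluster with more than one mongos router, the river's "servers" list can name each router; a minimal sketch, where the second router address 192.168.225.132:37017 is an assumed example and not part of this article's setup:

curl -XPUT "http://localhost:9200/_river/mongodb/_meta" -d '
{
  "type": "mongodb",
  "mongodb": {
    "servers": [
      { "host": "192.168.225.131", "port": 37017 },
      { "host": "192.168.225.132", "port": 37017 }
    ],
    "db": "dbname",
    "collection": "collectionName"
  },
  "index": { "name": "indexname", "type": "meta" }
}'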

# Build the river with curl (and create a resume index)
Curl-xput "Localhost:9200/_river/tbjobresume/_meta"-d '
{
"Type": "MongoDB",
"MongoDB": {
"Host": "192.168.225.131",
"Port": "37017",
"DB": "Mongomodeljobresume",
"Collection": "Tbjobresume"
},
"Index": {
"Name": "Resume",
' type ': ' Tbjobresume '} '

Description: in _river/tbJobResume, tbJobResume is the table name I used; it is best to use a different name each time an index is created. Do not lose the two single quotes around the 'content' after -d.
The type is followed by mongodb because the MongoDB database is being used.

mongodb: host (ip), port, db (name), and collection need no explanation.

index.name: the name of the index to be created, preferably lowercase (this appears to be required).

index.type: the collection name, i.e. the name of the data collection to be indexed.

Verify:
Curl "Http://localhost:9200/_river/tbJobResume/_meta"
This creates the resume index, and the data will be synchronized from MongoDB.
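
To check that the synchronization is actually happening, you can compare document counts on both sides; a minimal sketch, assuming the db and collection names used above:

# Number of documents indexed into the resume index
curl "http://localhost:9200/resume/_count?pretty"

# Number of documents in the source MongoDB collection (via the mongos router)
mongo 192.168.225.131:37017/mongoModelJobResume --eval "db.tbJobResume.count()"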

Special note: if a field in the tbJobResume table holds geographic coordinates and needs to be mapped to the geo_point type, set the mapping before creating the index, as follows:

curl -XPUT 'http://localhost:9200/resume' -d '
{
  "mappings": {
    "tbJobResume": {
      "properties": {
        "Location": {
          "type": "geo_point"
        }
      }
    }
  }
}'

After setting up the mapping, create the index as before.
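
Once the Location field is mapped as geo_point, searches can filter by distance; a minimal sketch using a geo_distance filter (the coordinates and the 10km radius are assumed examples):

curl -XPOST "http://localhost:9200/resume/tbJobResume/_search?pretty" -d '
{
  "query": {
    "filtered": {
      "query": { "match_all": {} },
      "filter": {
        "geo_distance": {
          "distance": "10km",
          "Location": { "lat": 31.23, "lon": 121.47 }
        }
      }
    }
  }
}'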


---The following is another index built---
Curl-xput "Localhost:9200/_river/tbjobposition/_meta"-d '
{
"Type": "MongoDB",
"MongoDB": {
"Host": "192.168.225.131",
"Port": "37017",
"DB": "Mongomodeljob",
"Collection": "Tbjobposition"
},
"Index": {
"Name": "Position",
' type ': ' Tbjobposition '} '

Curl "Http://localhost:9200/_river/tbJobPosition/_meta"

---------------
# Put index data with curl

Curl-xput "HTTP://LOCALHOST:9200/CUSTOMER/TBCUSTOMER/1"-d '
{
"_id": 1,
"Name": "Francis Ford Coppola 1",
"Sex": 1
}‘
This method creates a customer index and puts one piece of data into it; tbCustomer is the type.

curl -XPUT 'http://192.168.225.131:9200/dept/employee/32' -d '{"EmpName": "Emp32"}'
curl -XPUT 'http://192.168.225.131:9200/dept/employee/31' -d '{"EmpName": "Emp31"}'

This method likewise creates a dept index and puts data into it; employee is the type.
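
With a few documents indexed, they can be queried back through the search API; a minimal sketch using a match query on the EmpName field from the example above:

curl -XGET "http://192.168.225.131:9200/dept/employee/_search?pretty" -d '
{
  "query": {
    "match": { "EmpName": "Emp31" }
  }
}'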

The variable template that creates the river and index is as follows:

$ curl -XPUT "localhost:9200/_river/${es.river.name}/_meta" -d '
{
  "type": "mongodb",
  "mongodb": {
    "servers": [
      { "host": ${mongo.instance1.host}, "port": ${mongo.instance1.port} },
      { "host": ${mongo.instance2.host}, "port": ${mongo.instance2.port} }
    ],
    "options": {
      "secondary_read_preference": true,
      "drop_collection": ${mongo.drop.collection},
      "exclude_fields": ${mongo.exclude.fields},
      "include_fields": ${mongo.include.fields},
      "include_collection": ${mongo.include.collection},
      "import_all_collections": ${mongo.import.all.collections},
      "initial_timestamp": {
        "script_type": ${mongo.initial.timestamp.script.type},
        "script": ${mongo.initial.timestamp.script}
      },
      "skip_initial_import": ${mongo.skip.initial.import},
      "store_statistics": ${mongo.store.statistics}
    },
    "credentials": [
      { "db": "local", "user": ${mongo.local.user}, "password": ${mongo.local.password} },
      { "db": "admin", "user": ${mongo.db.user}, "password": ${mongo.db.password} }
    ],
    "db": ${mongo.db.name},
    "collection": ${mongo.collection.name},
    "gridfs": ${mongo.is.gridfs.collection},
    "filter": ${mongo.filter}
  },
  "index": {
    "name": ${es.index.name},
    "throttle_size": ${es.throttle.size},
    "bulk_size": ${es.bulk.size},
    "type": ${es.type.name},
    "bulk": {
      "actions": ${es.bulk.actions},
      "size": ${es.bulk.size},
      "concurrent_requests": ${es.bulk.concurrent.requests},
      "flush_interval": ${es.bulk.flush.interval}
    }
  }
}'

--template end--

--url--

The git address of this plugin: https://github.com/laigood/elasticsearch-river-mongodb

6. Test example: connect to the Mongo cluster; the meta collection contains 22,394,792 documents.

View the amount of ES data

I finally ran Elasticsearch on master1, master2, and master3; the three ES nodes rebalanced successfully, and the total number of documents was still 22,394,792.
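
A minimal sketch of how such a check can be made, assuming the index name "indexname" from the river definition in section 5 and the default ES port:

# Total number of documents in the index built by the river
curl "http://localhost:9200/indexname/_count?pretty"

# Cluster state, node count, and shard allocation after rebalancing
curl "http://localhost:9200/_cluster/health?pretty"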

 
