Elasticsearch and MongoDB Data Synchronization and Distributed Cluster Setup

Source: Internet
Author: User
Tags: curl, mongodb

A river can synchronize data from a variety of sources: Wikipedia, MongoDB, CouchDB, RabbitMQ, RSS, Sofa, JDBC, filesystem, Dropbox, and so on. Our company's business runs on MongoDB, so today I configured Elasticsearch-MongoDB synchronization on a test virtual machine. This is a rough record of the process, using richardwilly98/elasticsearch-river-mongodb.
The river synchronizes data by reading MongoDB's oplog, the collection MongoDB itself uses to replicate data between machines in a cluster; tailing it keeps the data in ES identical to the data in MongoDB. Since only a replica set has an oplog, MongoDB must run as a cluster. Note: this plugin only supports MongoDB in a cluster (replica set) environment, because only there does the oplog exist.
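Whether a given mongod actually has an oplog is easy to check from the mongo shell: on a replica-set member, the local database contains the capped collection oplog.rs. A minimal sketch (the host is an assumption based on this article's setup; the live call is left commented out):

```shell
# Assumption: a replica-set member is reachable at 10.253.1.71:27017.
# On such a member the "local" database contains the capped collection
# oplog.rs; a standalone mongod has no oplog, so the river has nothing to tail.
OPLOG_CHECK='db.getSiblingDB("local").getCollectionNames().indexOf("oplog.rs") >= 0'

# Against a live member this prints "true":
#   mongo --host 10.253.1.71 --quiet --eval "print($OPLOG_CHECK)"
echo "$OPLOG_CHECK"
```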
Elasticsearch and MongoDB must be at compatible versions for synchronization to work. I use Elasticsearch 1.4.4 and MongoDB 3.0.0; for the exact version requirements, refer to the compatibility table in the plugin's documentation.


MongoDB runs as a replica-set cluster; setting up the replica set is not covered in detail here, and the Elasticsearch installation and configuration are likewise omitted.
1. Install elasticsearch-river-mongodb
# ./elasticsearch-1.4.4/bin/plugin --install elasticsearch/elasticsearch-mapper-attachments/2.4.1

# ./elasticsearch-1.4.4/bin/plugin --install com.github.richardwilly98.elasticsearch/elasticsearch-river-mongodb/2.0.5
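To confirm that both plugins were loaded, the ES 1.x nodes-info API can list each node's plugins. A sketch, with the host taken from this article's setup (the live call is left commented out):

```shell
# Assumption: ES listens on 10.253.1.70:9200.
# In ES 1.x, GET /_nodes/plugins reports the plugins loaded on each node;
# look for "mapper-attachments" and the mongodb river in the response.
NODES_URL="http://10.253.1.70:9200/_nodes/plugins?pretty"
# Against a live node:
# curl -s "$NODES_URL"
echo "$NODES_URL"
```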

2. Create the river
curl -XPUT "http://10.253.1.70:9200/_river/threads_mongo_river/_meta" -d '
{
  "type": "mongodb",
  "mongodb": {
    "servers": [
      {"host": "10.253.1.71", "port": 27017}
    ],
    "db": "threads",
    "collection": "threads",
    "gridfs": false
  },
  "index": {
    "name": "test",
    "type": "threads"
  }
}'
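A malformed body makes the river silently fail to start, so it can help to validate the JSON locally before the PUT. A sketch reusing the river definition above (host, db, and collection names from this article; the live call is commented out):

```shell
# The river definition used in this article; validate it locally before sending.
BODY='{
  "type": "mongodb",
  "mongodb": {
    "servers": [ {"host": "10.253.1.71", "port": 27017} ],
    "db": "threads",
    "collection": "threads",
    "gridfs": false
  },
  "index": { "name": "test", "type": "threads" }
}'
echo "$BODY" | python3 -m json.tool > /dev/null && echo "valid JSON"
# Then register the river:
# curl -XPUT "http://10.253.1.70:9200/_river/threads_mongo_river/_meta" -d "$BODY"
```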
This minimal configuration just establishes the connection to MongoDB, names the db and collection to watch, and sets the Elasticsearch index and type to write into. There are more detailed settings I did not use, such as "options", which you can configure as your business requires. Below is a detailed configuration sample:
$ curl -XPUT "localhost:9200/_river/${es.river.name}/_meta" -d '
{
  "type": "mongodb",
  "mongodb": {
    "servers": [
      {"host": ${mongo.instance1.host}, "port": ${mongo.instance1.port}},
      {"host": ${mongo.instance2.host}, "port": ${mongo.instance2.port}}
    ],
    "options": {
      "secondary_read_preference": true,
      "drop_collection": ${mongo.drop.collection},
      "exclude_fields": ${mongo.exclude.fields},
      "include_fields": ${mongo.include.fields},
      "include_collection": ${mongo.include.collection},
      "import_all_collections": ${mongo.import.all.collections},
      "initial_timestamp": {
        "script_type": ${mongo.initial.timestamp.script.type},
        "script": ${mongo.initial.timestamp.script}
      },
      "skip_initial_import": ${mongo.skip.initial.import},
      "store_statistics": ${mongo.store.statistics}
    },
    "credentials": [
      {"db": "local", "user": ${mongo.local.user}, "password": ${mongo.local.password}},
      {"db": "admin", "user": ${mongo.db.user}, "password": ${mongo.db.password}}
    ],
    "db": ${mongo.db.name},
    "collection": ${mongo.collection.name},
    "gridfs": ${mongo.is.gridfs.collection},
    "filter": ${mongo.filter}
  },
  "index": {
    "name": ${es.index.name},
    "throttle_size": ${es.throttle.size},
    "bulk_size": ${es.bulk.size},
    "type": ${es.type.name},
    "bulk": {
      "actions": ${es.bulk.actions},
      "size": ${es.bulk.size},
      "concurrent_requests": ${es.bulk.concurrent.requests},
      "flush_interval": ${es.bulk.flush.interval}
    }
  }
}'
Some of the configuration items are explained below; see the GitHub wiki for the complete list:
db: name of the database to synchronize
host: MongoDB IP address (default localhost)
port: MongoDB port
collection: name of the collection to synchronize
fields: names of the fields to synchronize (comma-separated; all by default)
gridfs: whether the collection is GridFS (set to true if it is)
local_db_user: user name for the local database (omit if none)
local_db_password: password for the local database (omit if none)
db_user: user name for the database being synchronized (omit if none)
db_password: password for the database being synchronized (omit if none)
name: index name (must not already exist)
type: index type
bulk_size: maximum number of documents per bulk insert
bulk_timeout: timeout for a bulk insert
3. Test the synchronization
My test database holds very little data, so simply search it and check whether everything comes back.
$ curl -XGET "http://10.253.1.70:9200/test/threads/_search"

{
  "took": 20,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 4,
    "max_score": 1,
    "hits": [
      {
        "_index": "test",
        "_type": "threads",
        "_id": "54fa32b22c44cf67cb6a9d1b",
        "_score": 1,
        "_source": {
          "_id": "54fa32b22c44cf67cb6a9d1b",
          "title": "Where is I Car",
          "content": "Ask Yourself"
        }
      },
      {
        "_index": "test",
        "_type": "threads",
        "_id": "54fa2f5c2c44cf67cb6a9d19",
        "_score": 1,
        "_source": {
          "_id": "54fa2f5c2c44cf67cb6a9d19",
          "title": "This is title",
          "content": "What is the fuck"
        }
      },
      {
        "_index": "test",
        "_type": "threads",
        "_id": "54fa2f892c44cf67cb6a9d1a",
        "_score": 1,
        "_source": {
          "_id": "54fa2f892c44cf67cb6a9d1a",
          "title": "Are you OK",
          "content": "Yes,i am OK"
        }
      },
      {
        "_index": "test",
        "_type": "threads",
        "_id": "54fa49ccc104e2264e02deea",
        "_score": 1,
        "_source": {
          "_id": "54fa49ccc104e2264e02deea",
          "title": "Hello word",
          "content": "Hello hello haha"
        }
      }
    ]
  }
}

It appears the data has been synchronized. Now add a record to MongoDB and run the same search: the new record is found and the total increases by 1, so synchronization works.
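If the index ever needs rebuilding from scratch, the usual approach is to remove the river and recreate it; a river is stopped by deleting its entry under _river. A sketch (host and river name as used above; the live call is commented out):

```shell
# Deleting everything under _river/<name> stops and removes that river.
# Host and river name are the ones used in this article.
RIVER_URL="http://10.253.1.70:9200/_river/threads_mongo_river"
# Against the live cluster:
# curl -XDELETE "$RIVER_URL"
echo "$RIVER_URL"
```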

We now have Elasticsearch synchronized with MongoDB. High availability, scalability, and distribution are ES's core advantages and features: it can grow either vertically (scaling up) or horizontally (scaling out).
A node runs one instance of ES, and a cluster consists of one or more nodes with the same cluster.name that cooperate to share data and load. As nodes join or leave, the cluster redistributes the data evenly on its own.

One node in the cluster is elected the master node, which manages cluster-wide changes such as creating or deleting an index and adding or removing nodes. Any node can become the master. Our example has only one node, so it takes on the master role.

ES distributes data across the cluster using shards. Think of a shard as a container for data: documents are stored in shards, and shards are assigned to nodes in the cluster. As the cluster grows or shrinks, ES automatically migrates shards between nodes to keep the cluster balanced.

A shard is either a primary shard or a replica shard. Every document in an index belongs to exactly one primary shard, so the number of primary shards determines the maximum amount of data the index can hold. A replica shard is simply a copy of a primary shard. Replicas provide redundancy that protects against data loss on hardware failure, and they also serve read requests such as searches and document gets. The number of primary shards is fixed when the index is created, but the number of replicas can be changed at any time.
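That last point maps to two API calls: number_of_shards is set once at index creation, while number_of_replicas can be updated later through the _settings endpoint. A sketch, with the index name and counts as illustrative assumptions (live calls commented out):

```shell
# number_of_shards is fixed at creation; number_of_replicas can be changed
# later via _settings. Index name "blogs" and the counts are illustrative.
CREATE_BODY='{"settings": {"number_of_shards": 3, "number_of_replicas": 1}}'
UPDATE_BODY='{"index": {"number_of_replicas": 2}}'
echo "$CREATE_BODY" | python3 -m json.tool > /dev/null && echo "create body ok"
echo "$UPDATE_BODY" | python3 -m json.tool > /dev/null && echo "update body ok"
# Against a live cluster:
# curl -XPUT "http://10.253.1.70:9200/blogs" -d "$CREATE_BODY"
# curl -XPUT "http://10.253.1.70:9200/blogs/_settings" -d "$UPDATE_BODY"
```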

For the underlying principles, refer to the official documentation: "Life inside a Cluster".

To demonstrate horizontal scaling, add a virtual machine running a new ES instance. Our previous ES instance is 10.253.1.70; the new node is 10.253.1.71. Make sure the two nodes can reach each other.

Configure config/elasticsearch.yml.
The relevant configuration on 10.253.1.70:
cluster.name: elasticsearch_ryan
node.name: "cluster-node-1"
The relevant configuration on 10.253.1.71:
cluster.name: elasticsearch_ryan
node.name: "cluster-node-2"
What actually matters is that both nodes share the same cluster.name.
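One caveat: a shared cluster.name is only enough when multicast discovery works between the two VMs. If it does not, ES 1.x zen discovery can be pointed at the other node explicitly in elasticsearch.yml; a sketch for the 10.253.1.71 node (settings beyond those shown above are assumptions about your network):

```yaml
cluster.name: elasticsearch_ryan
node.name: "cluster-node-2"
# If multicast is blocked between the VMs, disable it and list the
# other node explicitly so zen discovery can find the cluster:
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["10.253.1.70"]
```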
Start the ES service on 10.253.1.71, then check the cluster status:

curl -XGET "http://10.253.1.70:9200/_cluster/health"
{
  "cluster_name": "elasticsearch_ryan",
  "status": "green",
  "timed_out": false,
  "number_of_nodes": 2,
  "number_of_data_nodes": 2,
  "active_primary_shards": 9,
  "active_shards": 18,
  "relocating_shards": 0,
  "initializing_shards": 0,
  "unassigned_shards": 0
}

You can now see there are 2 nodes. The status field reports the cluster state; the values mean:
green: all primary shards and replica shards are active
yellow: all primary shards are active, but not all replica shards are active
red: not all primary shards are active
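In scripts, rather than polling the health endpoint, the cluster-health API can block until a desired status is reached via its wait_for_status parameter. A sketch (host from this article; the live call is commented out):

```shell
# _cluster/health can block until the cluster reaches a given status,
# up to a timeout -- handy in deployment scripts. Host as in this article.
HEALTH_URL="http://10.253.1.70:9200/_cluster/health?wait_for_status=green&timeout=30s"
# Against the live cluster:
# curl -s "$HEALTH_URL"
echo "$HEALTH_URL"
```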
As an aside, I recommend elasticsearch-head, an ES distributed cluster management tool that installs as a plugin:
sudo elasticsearch/bin/plugin --install mobz/elasticsearch-head
After installation, open the admin interface at http://10.253.1.70:9200/_plugin/head/


It shows detailed information about the nodes in the distributed cluster, lets you inspect indices and run queries, and presents the cluster state very intuitively. You can keep adding data to MongoDB to test further.
