Elasticsearch is a great search engine: flexible, fast and fun. So how can you get started with it? This post will go through how to get content from an SQL database into Elasticsearch.
Elasticsearch has a set of pluggable services called rivers. A river runs inside an Elasticsearch node and imports content into the index. There are rivers for Twitter, Redis, files and, of course, SQL databases. The river-jdbc plugin connects to SQL databases using JDBC adapters. In this post we'll use PostgreSQL, since it is freely available, and populate it with some content that is also freely available.
So let's get started:
- Download and install Elasticsearch
- Start Elasticsearch by running bin/elasticsearch from the installation folder
- Install the river-jdbc plugin for Elasticsearch version 1.0.0.RC1:
```
./bin/plugin -install river-jdbc -url http://bit.ly/1dKqNJy
```
- Download the PostgreSQL JDBC jar file and copy it into the plugins/river-jdbc folder. You should probably get the latest version, which is for JDBC 4.1 (see the sketch after this list)
- Install PostgreSQL http://www.postgresql.org/download/
- Import the Booktown example database: download the SQL file from the Booktown database page
- Restart Elasticsearch
- Start PostgreSQL
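For the jar download step above, something along these lines should work. The exact file name and version here are assumptions; pick the current JDBC 4.1 build from http://jdbc.postgresql.org/download.html:

```
# Version number is hypothetical; use the latest JDBC 4.1 driver
curl -O http://jdbc.postgresql.org/download/postgresql-9.3-1100.jdbc41.jar
cp postgresql-9.3-1100.jdbc41.jar plugins/river-jdbc/
```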
By now you should have Elasticsearch and PostgreSQL running, and river-jdbc ready for use.
Now we need to put some content into the database, using psql, the PostgreSQL command line tool.
```
psql -U postgres -f booktown.sql
```
To execute commands against Elasticsearch we'll use an online service which functions as a mixture of Gist, the code snippet sharing service, and Sense, a Google Chrome plugin developer console for Elasticsearch. The service is hosted by http://qbox.io, who provide hosted Elasticsearch services.
Check that everything is correctly installed by opening a browser to http://sense.qbox.io/gist/8361346733fceefd7f364f0ae1ebe7efa856779e
Select the topmost request in the left-hand pane and press Ctrl+Enter on your keyboard, or click the little triangle that appears to the right, if you are more of a mouse-click kind of person.
You should now see a status message, showing the version of Elasticsearch, node name and such.
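For Elasticsearch 1.x the response looks roughly like this (abbreviated and illustrative; your node name and build details will differ):

```
{
  "status": 200,
  "name": "some-node-name",
  "version": {
    "number": "1.0.0.RC1"
  },
  "tagline": "You Know, for Search"
}
```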
Now let's stop beating around the bush and create a river for our database:
```
curl -XPUT "http://localhost:9200/_river/mybooks/_meta" -d '
{
    "type": "jdbc",
    "jdbc": {
        "driver": "org.postgresql.Driver",
        "url": "jdbc:postgresql://localhost:5432/booktown",
        "user": "postgres",
        "index": "booktown",
        "type": "books",
        "sql": "SELECT * FROM authors"
    }
}'
```
This creates a "one-shot" river that connects to PostgreSQL on Elasticsearch startup, and pulls the contents from the authors table into the booktown index. The index parameter controls which index the data will be put into, and the type parameter decides the type in the Elasticsearch index. To verify that the river was correctly uploaded, execute
```
GET /_river/mybooks/_meta
```
Restart Elasticsearch, and watch the log for status messages from river-jdbc. Connection problems, SQL errors or other problems should appear in the log. If everything went OK, you should see something like ... SimpleRiverMouth] bulk [1] success [19 items]
Time has come to check out what we got.
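In the Sense console, run a simple search against the index and type we defined in the river:

```
GET /booktown/books/_search
```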
You should now see all the contents from the authors table. The number of items reported under "hits" > "total" is the same as what we just saw in the log: 19.
But looking more closely at the data, we can see that the _id field has been auto-assigned some random values. This means that the next time we run the river, all the contents will be added again.
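A single hit from the search above illustrates this. The _id and source values here are made up, but this is the shape of it:

```
{
  "_index": "booktown",
  "_type": "books",
  "_id": "x4KgsEjTR7yeZMdyQ2LYzw",
  "_source": {
    "id": 1111,
    "last_name": "Doe",
    "first_name": "Jane"
  }
}
```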
Luckily, river-jdbc supports some specially labeled fields, which let us control how the contents should be indexed.
Reading up on the docs, we change the SQL definition to:
```
select id as _id, first_name, last_name
from authors
```
We need to start afresh and scrap the index we just created:
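A minimal way to do that from the Sense console, assuming the index name used above (after this, PUT the river _meta again with the updated SQL before restarting):

```
DELETE /booktown
```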
Restart Elasticsearch. Now you should see a meaningful ID in your data.
At this point we could start toying around with queries, mappings and analyzers. But that's not much fun with this little content. We need to join in some tables and get some more interesting data. We can join in the books table, and get all the books for all authors.
```
select authors.id as _id, authors.last_name, authors.first_name,
       books.id, books.title, books.subject_id
from public.authors left join public.books on books.author_id = authors.id
```
Delete the index, restart Elasticsearch and examine the data. Notice that we only get one book per author. Executing the SQL statement in pgAdmin returns more rows than the 19 we get in Elasticsearch. This is on account of the _id field: each time an existing record is indexed with the same _id as a new one, it is overwritten.
river-jdbc supports structured objects, which allow us to create arbitrarily structured JSON documents simply by using SQL aliases. The _id column is used for identity, and structured objects are appended to existing data. This is perhaps best shown by an example:
```
select authors.id as _id, authors.last_name, authors.first_name,
       books.id as \"books.id\", books.title as \"books.title\", books.subject_id as \"books.subject_id\"
from public.authors left join public.books on books.author_id = authors.id
order by authors.id
```
Again, delete the index, restart Elasticsearch, wait a few seconds before you search, and you'll find structured data in the search results.
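An author document should now carry its books as a nested array, along these lines (the values are made up for illustration; the shape is what the aliases produce):

```
{
  "last_name": "Doe",
  "first_name": "Jane",
  "books": [
    { "id": 1234, "title": "A First Book", "subject_id": 2 },
    { "id": 5678, "title": "A Second Book", "subject_id": 4 }
  ]
}
```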
Now we have seen that it is quite easy to get data into Elasticsearch using river-jdbc. We have also seen how it can handle updates. That gets us quite far. Unfortunately, it doesn't handle deletions: if a record is deleted from the database, it will not automatically be deleted from the index. There have been some attempts to create support for it, but in the latest release it has been completely dropped.
This is due to the plugin system having some serious problems, and it will perhaps be deprecated some time after the 1.0 release, or at least not actively promoted as "the way" (see the semi-official statement at the LinkedIn Elasticsearch Group). While it's extremely easy to use rivers to get data, there are a lot of problems in having a data integration process running in the same space as Elasticsearch itself. Architecturally, it's perhaps more correct to leave the search engine to itself, and build integration systems on the side.
Among the recommended alternatives are:
- Use an ETL tool like Talend
- Create your own script
- Edit the source application to send updates to Elasticsearch (see the sketch after this list)
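For that last option, the core of it is just HTTP calls whenever the application writes to the database. A minimal sketch with curl, using hypothetical id and field values:

```
# Index (or overwrite) a document when a row is inserted or updated
curl -XPUT "http://localhost:9200/booktown/books/1111" -d '
{
  "last_name": "Doe",
  "first_name": "Jane"
}'

# Remove the document when the row is deleted -- this covers the case rivers miss
curl -XDELETE "http://localhost:9200/booktown/books/1111"
```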
Jörg Prante, the man behind river-jdbc, recently started creating a replacement called Gatherer.
It is a gathering framework plugin for fetching and indexing data to Elasticsearch, with scalable components.
Anyway, we have the data in our index! Rivers may have their problems when used on a large scale, but you would be hard pressed to find anything easier to get started with. Getting data into the index easily is essential when exploring ideas and concepts, creating POCs or just fooling around.
This post has run out of space, but perhaps we can look at some interesting queries next time?