Choosing between ElasticSearch, MongoDB & Hadoop

Source: Internet
Author: User
Tags: hadoop ecosystem

An interesting trend has been developing in the IT landscape over the past few years. Many new technologies emerge and immediately latch onto the "Big Data" buzzword. And as older technologies add "Big Data" features in an attempt to keep up with the Joneses, we are seeing a blurring of the boundaries between various technologies. Say you have search engines such as ElasticSearch or SOLR storing JSON documents, MongoDB storing JSON documents, or a pile of JSON documents stored in HDFS on a Hadoop cluster. Interestingly enough, you can fulfill many of the same use cases with any of these three configurations.

ElasticSearch as a NoSQL database? Superficially, that doesn't sound right, but it is nonetheless a valid scenario. Likewise, MongoDB with support for MapReduce over sharded collections can accomplish many of the same things as Hadoop. And, of course, with the many tools you can layer on top of a Hadoop base (Hive, HBase, Pig, and the like) you can query data from your Hadoop cluster in a multitude of ways.

Given that, can we now say that Hadoop, MongoDB and ElasticSearch are all exactly equivalent? Of course not. Each tool still has a niche for which it is most ideally suited, but each has enough flexibility to fulfill multiple roles. The question now becomes "What's the ideal use for each of these technologies?", and that, my friends, is what we'll explore now.

ElasticSearch has begun to spread beyond its roots as a "pure" search engine and now adds some features for analytics and visualization, but at its core it remains primarily a full-text search engine for locating documents by keyword. ElasticSearch builds on top of Lucene and supports extremely fast lookup and a rich query syntax. If you have millions (or more) of text documents and you need to locate documents by keywords found in that text, ElasticSearch fits the bill perfectly. Yes, if your documents are JSON you can treat ElasticSearch as a sort of lightweight "NoSQL database". But ElasticSearch isn't quite a "database engine" and provides less support for complex calculations and aggregation as part of a query, although the "statistics" facet does provide some ability to retrieve calculated statistical information scoped to the given query. Facets in ElasticSearch are intended mainly to support a "faceted navigation" facility.
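To make the ideas of keyword lookup and faceted navigation concrete, here is a minimal pure-Python sketch. It is a toy model of an inverted index with a per-tag facet count, not how ElasticSearch or Lucene are actually implemented; the sample documents, the field names (title, tag) and the function names are all hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical sample documents; the field names are illustrative only.
DOCS = {
    1: {"title": "Hadoop cluster tuning", "tag": "hadoop"},
    2: {"title": "MongoDB cluster sharding", "tag": "mongodb"},
    3: {"title": "Tuning search relevance", "tag": "search"},
}


def build_index(docs):
    """Map each lowercased title word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, doc in docs.items():
        for word in doc["title"].lower().split():
            index[word].add(doc_id)
    return index


def search(index, docs, keyword):
    """Return matching ids (sorted) plus a facet: hit counts per tag."""
    hits = sorted(index.get(keyword.lower(), set()))
    facet = Counter(docs[doc_id]["tag"] for doc_id in hits)
    return hits, dict(facet)


INDEX = build_index(DOCS)
hits, facet = search(INDEX, DOCS, "cluster")
print(hits, facet)
```

The facet dictionary is what a user interface would use to offer "narrow by tag" links next to the result list, which is the faceted navigation the paragraph refers to.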

If you are looking to return a (usually small) collection of documents in response to a keyword query, and want the ability to support faceted navigation around those documents, then ElasticSearch is probably your best first choice. If you need to perform more complex calculations, run server-side scripts against your data, and easily run MapReduce jobs on your data, then MongoDB or Hadoop enter the picture.


MongoDB is a "NoSQL" database which was designed from the ground up to be highly scalable, with automatic sharding support and a number of additional performance optimizations. MongoDB is a document-oriented database which stores documents in a "JSON-like" format (technically BSON) with some extensions beyond plain JSON, for example a native date type. MongoDB provides a text index type for supporting full-text search against fields which contain text, so we can see that there is overlap between what you can do with ElasticSearch and MongoDB in terms of basic keyword search against a collection of documents.

Where MongoDB goes beyond ElasticSearch is in its support for features like server-side scripts in JavaScript, aggregation pipelines, MapReduce and capped collections. With MongoDB, you can use aggregation pipelines to process documents in a collection, streaming them through a sequence of pipeline operators where each operator transforms the documents. Pipeline operators can generate entirely new documents or remove documents from the final output. This is a very powerful facility for filtering, processing and transforming data as it is retrieved. MongoDB also supports running map/reduce jobs over the data in a collection, using custom JavaScript functions for the map and reduce phases of the operation. This allows for ultimate flexibility in performing any type of calculation or transformation on the selected data.
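The streaming nature of an aggregation pipeline can be sketched as follows. The PIPELINE value uses MongoDB's $match, $group and $sum operators as it might be handed to an aggregate call, but the two stage functions are only a pure-Python emulation written for this example so that it runs without a server. The field names (service, level) and the sample log documents are assumptions.

```python
from collections import Counter

# A pipeline definition in MongoDB's operator syntax.
# The field names "level" and "service" are made up for this example.
PIPELINE = [
    {"$match": {"level": "error"}},
    {"$group": {"_id": "$service", "count": {"$sum": 1}}},
]


def match_stage(docs, criteria):
    """Pass through only documents whose fields equal the criteria (like $match)."""
    for doc in docs:
        if all(doc.get(key) == value for key, value in criteria.items()):
            yield doc


def group_count_stage(docs, field):
    """Emit one new document per distinct field value with a count (like $group with $sum: 1)."""
    counts = Counter(doc[field] for doc in docs)
    for value, count in sorted(counts.items()):
        yield {"_id": value, "count": count}


LOGS = [
    {"service": "api", "level": "error"},
    {"service": "api", "level": "info"},
    {"service": "db", "level": "error"},
    {"service": "api", "level": "error"},
]

# Chain the stages: each one consumes the output of the previous one.
criteria = PIPELINE[0]["$match"]
group_field = PIPELINE[1]["$group"]["_id"].lstrip("$")
result = list(group_count_stage(match_stage(LOGS, criteria), group_field))
print(result)
```

The first stage removes documents from the stream and the second replaces the remaining ones with entirely new summary documents, which are the two behaviours the paragraph describes.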


Another extremely powerful feature in MongoDB is known as "capped collections". With the capped collections facility, a user can define a maximum size for a collection, after which the collection can simply be written to blindly, and it will roll over data as necessary to maintain the specified size limit. This feature is extremely useful for capturing logs and other streaming data for analysis.


Background: capped collections

Characteristics:

1. Documents can only be inserted and updated. Individual documents cannot be deleted, although the entire collection can be removed with drop().

2. Eviction works like a queue. Once the collection reaches its specified size, the oldest documents are pushed out (in insertion order, not by least-recent use) to make room for newly inserted data. Note that an update must not grow a document beyond its existing size, which is reasonable: no space is squeezed out to accommodate updated data.

3. Documents are stored in the order in which they were inserted. A regular collection is always indexed on _id, but in older MongoDB releases a capped collection has no such index by default.

4. Queries and inserts are fast. If writes outnumber reads, it is recommended not to add indexes, since maintaining them makes every write consume extra resources.
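The roll-over behaviour of a capped collection can be illustrated with a small pure-Python stand-in. This is only a sketch: a real capped collection is limited by size in bytes (optionally also by document count) and lives on the MongoDB server, whereas the hypothetical CappedLog class below is limited by document count and lives in memory.

```python
from collections import deque


class CappedLog:
    """Toy stand-in for a capped collection, limited by document count."""

    def __init__(self, max_docs):
        # deque(maxlen=...) silently discards the oldest item once it is full.
        self._docs = deque(maxlen=max_docs)

    def insert(self, doc):
        """Append a document; the oldest one is dropped if the limit is reached."""
        self._docs.append(doc)

    def find(self):
        """Return the retained documents in insertion order, oldest first."""
        return list(self._docs)


log = CappedLog(max_docs=3)
for seq in range(5):
    log.insert({"seq": seq})
print(log.find())
```

With the PyMongo driver, the server-side equivalent would be created with something like db.create_collection("log", capped=True, size=1048576), where the collection name and size are placeholders.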


As you can see, while ElasticSearch and MongoDB have some overlap in possible use cases, they are not the same tool. What about Hadoop? Isn't Hadoop "just MapReduce", which is supported by MongoDB anyway? Is there really a use case for Hadoop where MongoDB isn't just as suitable?

In a word, yes. Hadoop is the grandfather of MapReduce-based cluster computing frameworks. Hadoop provides probably the most flexible and powerful overall environment for processing large amounts of data, and definitely fits niches for which you would not use ElasticSearch or MongoDB.

To understand why this is true, look at how Hadoop abstracts storage (via HDFS) from the associated computational facility. With data stored in HDFS, any arbitrary job can be run against that data, using either Java code written to the core MapReduce API, or arbitrary code written in native languages using Hadoop Streaming. And starting with Hadoop 2 and YARN, even the core programming model is abstracted so that you aren't limited to MapReduce. With YARN you can, for example, implement MPI on top of Hadoop and write jobs in that style.
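As an illustration of the Hadoop Streaming style mentioned above, here is a hedged word-count sketch. Streaming jobs exchange tab-separated key/value lines over standard input and output; to keep the example runnable without a cluster, the mapper and reducer below are written as functions over lists of lines and the sort step is simulated locally. The function names and the sample input are assumptions.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' record per word, as a streaming mapper would print."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_records):
    """Sum counts per word; the input must already be sorted by key."""
    pairs = (record.split("\t") for record in sorted_records)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


# Local simulation of map -> shuffle/sort -> reduce on two input lines.
sample = ["Hadoop stores data", "Hadoop processes data"]
output = list(reducer(sorted(mapper(sample))))
print(output)
```

In an actual job, the two functions would typically live in separate scripts that read sys.stdin and print to standard output, and Hadoop would perform the sort between them.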

Additionally, the Hadoop ecosystem provides a staggering array of tools that build on top of HDFS and core MapReduce to query, analyze and process data. Hive provides a "SQL-like" language that allows business analysts to query data using a syntax they are already familiar with. HBase provides a column-oriented database on top of Hadoop. Pig and Sizzle provide even more alternative programming models for querying Hadoop data. With data stored in HDFS using Hadoop, you inherit the ability to simply plug in Apache Mahout and apply machine learning algorithms to your data. And with RHadoop, it is straightforward to use the R statistical language to perform advanced statistical analyses on Hadoop data.

So while Hadoop and MongoDB also have some overlapping use cases and share some useful functionality (seamless horizontal scalability, for example), it remains the case that each tool serves a specific purpose in enterprise computing. If you simply want to locate documents by keyword and perform simple analytics, then ElasticSearch may fit the bill. If you need to query documents that can be modeled as JSON and perform moderately more sophisticated analysis, then MongoDB becomes a compelling choice. And if you have a huge quantity of data that needs a wide variety of different types of complex processing and analysis, then Hadoop provides the broadest range of tools and the most flexibility.

As always, it's important to choose the right tool(s) for the job at hand. And in the "Big Data" space the sheer number of technologies and the blurry lines can make this difficult. As we can see, there are specific scenarios which best suit each of these technologies and, more importantly, the differences do matter. The best news of all, though, is that you are not limited to using only one of these tools. Depending on the details of your use case, it could actually make sense to build a combination platform. For example, ElasticSearch and Hadoop are known to work well together, with ElasticSearch providing rapid keyword search and Hadoop jobs powering the more complicated analytics.

In the end, it pays to take ample care and make the best choices for your computing environment. Before selecting any technology or platform, take the time to evaluate it carefully, and understand what scenarios it was designed to optimize for and what tradeoffs and sacrifices it makes. Start with a small pilot project to "kick the tires" before converting your entire enterprise to a new platform, and slowly grow into the new stack.

Follow these steps and you can successfully navigate the maze of "Big Data" technologies and reap the associated benefits.



This article was reposted from: Http://www.osintegrators.com/opensoftwareintegrators%7CChoosing-Between-ElasticSearch-MongoDB-%2526-Hadoop


PS: Only parts of the article have been translated so far; the rest will be completed later. Corrections to any translation errors are welcome.
