There has been an interesting phenomenon in the IT community over the past few years. Many new technologies emerge and immediately embrace "big data," while slightly older technologies bolt big-data features onto themselves to avoid falling behind, and the boundaries between the different technologies grow blurry. Search engines such as Elasticsearch and Solr store JSON documents; MongoDB stores JSON documents; and a pile of JSON documents can be stored in HDFS on a Hadoop cluster. With any of these three configurations you can accomplish many of the same things.
Can Elasticsearch be used as a NoSQL database? That may stretch the definition, but it is a reasonable scenario. Likewise, MongoDB, with sharding layered on top of its MapReduce support, can do some of the work Hadoop does. And of course, with the many tools that run on top of Hadoop (Hive, HBase, Pig, and the like), you can query data in a Hadoop cluster in a variety of ways.
So, can we now say that Hadoop, MongoDB, and Elasticsearch are interchangeable? Obviously not! Each tool has scenarios it fits best, but each also has considerable flexibility to play different roles. The question then becomes: "What is the most appropriate use scenario for each of these technologies?" Let's take a look.
Elasticsearch has outgrown its original role as a pure search engine and has added analysis and visualization features, but at its core it remains a full-text search engine. Elasticsearch is built on Lucene and supports extremely fast queries with a rich query syntax. If you have millions of documents that need to be located by keyword, Elasticsearch is definitely the best choice. And if your documents are JSON, you can treat Elasticsearch as a lightweight "NoSQL database." However, Elasticsearch is not a full database engine, and its support for complex queries and aggregations is limited, although statistical facets can provide some summary information about the results of a given query. Facets in Elasticsearch are primarily intended to support faceted browsing.
(Elasticsearch has since added a full aggregations feature, which supersedes facets.)
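As a sketch of what this looks like in practice, the following Python dict shows the kind of JSON body you would send with an Elasticsearch search request, combining a keyword query with a terms aggregation. The index and field names (`content`, `category`) are illustrative, not taken from any real schema.

```python
# A hypothetical Elasticsearch search body: a full-text keyword query plus
# a terms aggregation (the feature that superseded facets). This dict would
# be serialized to JSON and POSTed to an index's _search endpoint.
search_body = {
    "query": {
        "match": {"content": "big data"}  # fast full-text keyword match
    },
    "aggs": {
        "by_category": {
            # bucket the matching documents by category and count them,
            # e.g. to drive a faceted-navigation sidebar
            "terms": {"field": "category", "size": 10}
        }
    },
    "size": 5,  # return only the top 5 matching documents themselves
}
```

The `query` part is where Elasticsearch shines; the `aggs` part covers the summary statistics that facets used to provide.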
If you are looking for the small set of documents that match a keyword query, and you want faceted navigation over those results, then Elasticsearch is definitely the best choice. If you need to perform more complex calculations, run server-side scripts against the data, or easily launch MapReduce jobs, then MongoDB or Hadoop enters the picture.
MongoDB is a NoSQL database designed for high scalability, with automatic sharding and a number of additional performance optimizations built in. MongoDB is a document-oriented database that stores data as JSON (to be exact, BSON, a binary-encoded extension of JSON with additional native data types). MongoDB provides a text index type to support full-text search, so here we can see the boundary between Elasticsearch and MongoDB blurring: both handle basic keyword search over collections of documents.
Where MongoDB goes beyond Elasticsearch is in its support for server-side JavaScript, aggregation pipelines, MapReduce, and capped collections. With MongoDB, you can use an aggregation pipeline to process the documents in a collection, passing them through a sequence of pipeline stages in multiple steps. Pipeline stages can generate entirely new documents or remove documents from the final result, which makes this a very powerful facility for filtering, processing, and transforming data as it is retrieved. MongoDB also supports running map/reduce jobs over a collection, using custom JavaScript functions for the map and reduce steps. This gives MongoDB the ultimate flexibility to perform any kind of calculation or transformation on the selected data.
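To illustrate the pipeline idea without a live MongoDB server, the pure-Python sketch below mimics what a two-stage `$match` + `$group` pipeline does to a collection. The collection and field names (`orders`, `status`, `qty`) are hypothetical.

```python
# Sample documents, standing in for a MongoDB "orders" collection.
orders = [
    {"status": "shipped", "qty": 2},
    {"status": "pending", "qty": 5},
    {"status": "shipped", "qty": 3},
]

# The equivalent MongoDB pipeline would be:
#   db.orders.aggregate([
#       {"$match": {"status": "shipped"}},
#       {"$group": {"_id": "$status", "total": {"$sum": "$qty"}}},
#   ])

# Stage 1 ($match): remove documents from the stream.
shipped = [doc for doc in orders if doc["status"] == "shipped"]

# Stage 2 ($group): emit a brand-new summary document.
result = {"_id": "shipped", "total": sum(doc["qty"] for doc in shipped)}

print(result)  # {'_id': 'shipped', 'total': 5}
```

In a real deployment the stages run server-side, so only the small summary document crosses the network rather than the whole collection.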
Another extremely powerful MongoDB feature is "capped collections." With this feature, the user defines a maximum size for a collection; once it fills, writes simply roll over, overwriting the oldest data. This makes capped collections ideal for capturing logs and other streaming data for analysis.
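The roll-over behavior of a capped collection can be modeled with a fixed-size buffer. The following sketch uses Python's `collections.deque` with a `maxlen` to show the semantics; the cap of 3 entries and the event names are arbitrary.

```python
from collections import deque

# A capped collection has a fixed maximum size; once it is full, new writes
# overwrite the oldest entries. deque(maxlen=...) exhibits the same behavior.
log = deque(maxlen=3)  # a tiny cap, standing in for a capped collection

for i in range(5):
    log.append(f"event-{i}")  # writes past the cap silently evict the oldest

print(list(log))  # ['event-2', 'event-3', 'event-4']
```

As with a real capped collection, insertion order is preserved and only the newest entries survive, which is exactly what you want for a rolling log.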
As you can see, although Elasticsearch and MongoDB overlap in some scenarios, they are not the same tool. But what about Hadoop? Hadoop is MapReduce, and MongoDB already supports MapReduce in place! Is there a scenario that specifically calls for Hadoop, where MongoDB just isn't enough?
Yes! Hadoop is the original home of MapReduce, and it provides the most flexible and powerful environment for processing large volumes of data, handling scenarios that, without a doubt, cannot be handled by Elasticsearch or MongoDB.
To see this more clearly, note how Hadoop uses the HDFS abstraction to decouple storage from computation. With data stored in HDFS, any job can operate on it, whether written against the core MapReduce API or, via Hadoop Streaming, in practically any native language. Starting with Hadoop 2 and YARN, even the core programming model has been abstracted away, so you are no longer constrained to MapReduce: with YARN you can, for example, implement MPI on Hadoop and write jobs in that style.
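A minimal sketch of the Hadoop Streaming style of word count: the mapper and reducer are ordinary functions over lines of text, which Hadoop would run as separate processes with a shuffle/sort phase between them. Here they are wired together in one process purely to show the data flow; the input lines are made up.

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    """Emit (word, 1) pairs, as a streaming mapper writes to stdout."""
    for word in line.split():
        yield (word.lower(), 1)

def reducer(word, counts):
    """Sum all counts for one key, as a streaming reducer does."""
    return (word, sum(counts))

lines = ["big data big tools", "data everywhere"]

# Map phase: every input line becomes key/value pairs.
pairs = [kv for line in lines for kv in mapper(line)]
# Shuffle/sort phase: in a real job, the Hadoop framework does this,
# guaranteeing each reducer sees all values for a key together.
pairs.sort(key=itemgetter(0))
# Reduce phase: one call per distinct key.
counts = dict(
    reducer(word, (c for _, c in group))
    for word, group in groupby(pairs, key=itemgetter(0))
)

print(counts)  # {'big': 2, 'data': 2, 'everywhere': 1, 'tools': 1}
```

The point of the abstraction is that the map and reduce functions know nothing about distribution: Hadoop can run thousands of mapper and reducer processes across a cluster against data in HDFS without the functions changing.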
Additionally, the Hadoop ecosystem provides an interlocking set of tools built on HDFS and core MapReduce for querying, analyzing, and processing data. Hive provides a SQL-like language, letting business analysts query data with a familiar syntax. HBase provides a column-oriented database on top of Hadoop. Pig and Cascading provide two further programming models for querying Hadoop data. For data stored in HDFS, you can bring Mahout's machine-learning algorithms into your toolset, and with RHadoop you can run advanced statistical analysis over Hadoop data directly in the R statistical language.
So, although Hadoop and MongoDB also have partially overlapping scenarios and share some useful capabilities (such as seamless horizontal scaling), each has scenarios it is uniquely suited for. If you just need keyword lookup and simple analysis, Elasticsearch can do the job; if you need to query documents and run more complex analysis over them, MongoDB is a good fit; and if you have huge volumes of data requiring many kinds of complex processing and analysis, Hadoop offers the broadest range of tools and the most flexibility.
A timeless truth: choose the most suitable tool for the job at hand. In the big-data space, technologies are endless and the boundaries between them quite blurry, which makes choosing hard. As we have seen, each particular scenario has a technology that fits it best, and that difference matters. The best news is that you are not limited to one tool or technology; depending on the scenario you face, you can build an integrated system. For example, we know that Elasticsearch and Hadoop work well together: Elasticsearch serves fast keyword queries while Hadoop jobs handle the heavier, more complex analysis.
In the end, the right choice comes from thorough research and careful analysis. Whenever you choose a technology or platform, validate it carefully: understand which scenarios it suits, where it can be tuned, and what trade-offs you must accept. Start with a small pilot project, and once it proves out, roll the technology into your real platform and scale it up gradually.
Follow these recommendations, and you can successfully navigate the sea of big-data technologies and reap the rewards.
Original: Elasticsearch, MongoDB, and Hadoop comparison