Tags: elasticsearch mongodb hadoop
An interesting trend has been developing in the IT landscape over the past few years. Many new technologies develop and immediately latch onto the “Big Data” buzzword. And as older technologies add “Big Data” features in an attempt to keep up with the Joneses, we are seeing a blurring of the boundaries between various technologies. Say you have search engines such as Elasticsearch or Solr storing JSON documents, MongoDB storing JSON documents, or a pile of JSON documents stored in HDFS on a Hadoop cluster. Interestingly enough, you can fulfill many of the same use cases with any of these three configurations.
Elasticsearch as a NoSQL database? Superficially, that doesn’t sound right, but nonetheless it is a valid scenario. Likewise, MongoDB with support for MapReduce over sharded collections can accomplish many of the same things as Hadoop. And, of course, with the many tools that you can layer on top of a Hadoop base (Hive, HBase, Pig, and the like) you can query data from your Hadoop cluster in a multitude of ways.
Given that, can we now say that Hadoop, MongoDB and Elasticsearch are all exactly equivalent? Of course not. Each tool still has a niche for which it is best suited, but each has enough flexibility to fulfill multiple roles. The question now becomes “What is the ideal use for each of these technologies?”, and that, my friends, is what we will explore now.
Elasticsearch has begun to spread beyond its roots as a “pure” search engine and now adds some features for analytics and visualization, but at its core it remains primarily a full-text search engine for locating documents by keyword. Elasticsearch builds on top of Lucene and supports extremely fast lookup and a rich query syntax. If you have millions (or more) of text documents and you need to locate documents by keywords located in that text, Elasticsearch fits the bill perfectly. Yes, if your documents are JSON you can treat Elasticsearch as a sort of lightweight “NoSQL database”. But Elasticsearch is not quite a “database engine” and provides less support for complex calculations and aggregation as part of a query, although the “statistical” facet does provide some ability to retrieve calculated statistical information scoped to the given query. Facets in Elasticsearch are intended mainly to support a “faceted navigation” facility. (Later Elasticsearch releases replaced facets with the more general aggregations framework.)
If you are looking to return a (usually small) collection of documents in response to a keyword query, and want the ability to support faceted navigation around those documents, then Elasticsearch is probably your best, first choice. If you need to perform more complex calculations, run server-side scripts against your data, and easily run MapReduce jobs on your data, then MongoDB or Hadoop enter the picture.
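To make the "keyword query plus faceted navigation" idea concrete, here is a minimal sketch of a search request body built as a plain Python dictionary. The field names `body` and `tags` and the helper `build_search_body` are hypothetical, chosen only for illustration. The sketch uses a `match` query with a `terms` aggregation, which is how newer Elasticsearch releases express what the article calls a facet; it only builds the JSON and does not contact a cluster.

```python
import json

def build_search_body(keyword, text_field="body", facet_field="tags", size=10):
    """Build a search request body: keyword match plus per-value counts.

    text_field and facet_field are hypothetical field names.
    """
    return {
        # Return only a small page of matching documents.
        "size": size,
        # Full-text keyword match against one text field.
        "query": {"match": {text_field: keyword}},
        # Count matching documents per distinct value of facet_field,
        # which is the raw material for faceted navigation.
        "aggs": {
            "by_" + facet_field: {"terms": {"field": facet_field}}
        },
    }

body = build_search_body("hadoop")
print(json.dumps(body, indent=2))
```

A request like this would typically be sent to an index's `_search` endpoint; the response carries both the matching documents and the per-tag counts used to draw the navigation.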
MongoDB is a “NoSQL” database which is designed from the ground up to be highly scalable, with automatic sharding support and a number of additional performance optimizations. MongoDB is a document-oriented database which stores documents in a “JSON like” format (technically BSON) with some extensions beyond plain JSON, for example a native date type. MongoDB provides a text index type for supporting full-text search against fields which contain text, so we can see that there is overlap between what you can do with Elasticsearch and MongoDB, in terms of basic keyword search against a collection of documents.
Where MongoDB goes beyond Elasticsearch is its support for features like server-side scripts in JavaScript, aggregation pipelines, MapReduce support and capped collections. With MongoDB, you can use aggregation pipelines to process documents in a collection, streaming them through a sequence of pipeline operators where each operator transforms the documents as they pass through. Pipeline operators can generate entirely new documents or remove documents from the final output. This is a very powerful facility for filtering, processing and transforming data as it is retrieved. MongoDB also supports running map/reduce jobs over the data in a collection, using custom JavaScript functions for the map and reduce phases of the operation. This allows for ultimate flexibility in performing any type of calculation or transformation on the selected data.
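The pipeline idea is easy to see in miniature. The sketch below is a toy, in-memory imitation written in plain Python, not MongoDB itself: `match` drops documents much like a `$match` stage, and `group_sum` emits entirely new documents much like a `$group` stage with `$sum`. The `orders` data and all function names are made up for illustration. The `mongo_pipeline` literal shows what the equivalent stage list would look like when handed to MongoDB's aggregation framework.

```python
from collections import defaultdict

def match(docs, predicate):
    """Like a $match stage: drop documents that fail the predicate."""
    return (d for d in docs if predicate(d))

def group_sum(docs, key, field):
    """Like a $group stage with $sum: emit one new document per key."""
    totals = defaultdict(int)
    for d in docs:
        totals[d[key]] += d[field]
    # Sorted only to make this toy deterministic; MongoDB does not
    # guarantee the order of $group output.
    for k in sorted(totals):
        yield {"_id": k, "total": totals[k]}

def run_pipeline(docs, stages):
    """Stream documents through each stage in turn."""
    for stage in stages:
        docs = stage(docs)
    return list(docs)

# Hypothetical sample data.
orders = [
    {"customer": "ann", "status": "shipped", "amount": 30},
    {"customer": "bob", "status": "pending", "amount": 15},
    {"customer": "ann", "status": "shipped", "amount": 20},
    {"customer": "bob", "status": "shipped", "amount": 5},
]

# For reference only: the same two stages in MongoDB's pipeline syntax.
mongo_pipeline = [
    {"$match": {"status": "shipped"}},
    {"$group": {"_id": "$customer", "total": {"$sum": "$amount"}}},
]

result = run_pipeline(orders, [
    lambda docs: match(docs, lambda d: d["status"] == "shipped"),
    lambda docs: group_sum(docs, "customer", "amount"),
])
print(result)  # [{'_id': 'ann', 'total': 50}, {'_id': 'bob', 'total': 5}]
```

The point of the toy is the shape: each stage consumes a stream of documents and produces another stream, so filtering and reshaping compose naturally.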
Another extremely powerful feature in MongoDB is known as “capped collections”. With the capped collections facility, a user can define a maximum size for a collection, after which the collection can simply be written to blindly: once the limit is reached, it rolls over, overwriting the oldest data as necessary to maintain the specified size limit. This feature is extremely useful for capturing logs and other streaming data for analysis.
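A quick primer on capped collections
Characteristics:
1. You can only insert and update documents; individual documents cannot be deleted. Use drop() to remove the entire collection.
2. When the collection reaches its specified size, the oldest documents are pushed out to make room for new inserts. This resembles the LRU lists familiar from Oracle, with one difference: eviction follows insertion order (oldest first), not last access. Note that an update must not grow a document, because no space can be freed to hold the larger version.
3. Documents are kept in the order in which they were inserted. A normal collection always has an index on _id; older MongoDB releases (before 2.2) created capped collections without one, while later releases add it by default.
4. Queries and inserts are fast. If writes outnumber reads, avoid creating additional indexes, since every index adds work to each write.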
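The roll-over behavior can be sketched with a bounded `deque` from the Python standard library. This is a toy stand-in, not MongoDB: the class `CappedLog` and its methods are hypothetical names, and the cap here is a document count, whereas a real capped collection is capped by size in bytes (with an optional maximum document count).

```python
from collections import deque

class CappedLog:
    """Toy stand-in for a capped collection, capped by document count."""

    def __init__(self, max_docs):
        # A deque with maxlen drops from the opposite end when full.
        self._docs = deque(maxlen=max_docs)

    def insert(self, doc):
        # Appending to a full deque silently discards the oldest entry.
        self._docs.append(doc)

    def find(self):
        # Documents come back in insertion order.
        return list(self._docs)

log = CappedLog(max_docs=3)
for i in range(5):
    log.insert({"seq": i, "msg": "event %d" % i})

print([d["seq"] for d in log.find()])  # [2, 3, 4]
```

After five inserts into a collection capped at three, only the three newest documents remain, in the order they were written. That is exactly the property that makes capped collections convenient for logs: writers never need to prune.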
As you can see, while Elasticsearch and MongoDB have some overlap in possible use cases, they are not the same tool. But what about Hadoop? Isn’t Hadoop “just MapReduce”, which is supported by MongoDB anyway? Is there really a use case for Hadoop where MongoDB is not just as suitable?
In a word, yes. Hadoop is the grandfather of MapReduce-based cluster computing frameworks. Hadoop provides probably the most flexible and powerful overall environment for processing large amounts of data, and definitely fits niches for which you would not use Elasticsearch or MongoDB.
To understand why this is true, look at how Hadoop abstracts storage, via HDFS, from the associated computational facility. With data stored in HDFS, any arbitrary job can be run against that data, using either Java code written to the core MapReduce API, or arbitrary code written in native languages using Hadoop Streaming. And starting with Hadoop 2 and YARN, even the core programming model is abstracted so that you aren’t limited to MapReduce. With YARN you can, for example, implement MPI on top of Hadoop and write jobs in that style.
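As a rough illustration of the Hadoop Streaming style, here is a word count written as a mapper and a reducer that exchange tab-separated text lines. In an actual streaming job these would be two separate scripts reading standard input and would be handed to the streaming jar through its `-mapper` and `-reducer` options; in this sketch they are plain functions, and the hypothetical helper `run_locally` uses `sorted()` to stand in for the shuffle-and-sort step that Hadoop performs between the two phases.

```python
from itertools import groupby

def mapper(lines):
    """Emit one 'word<TAB>1' line per word, as a streaming mapper would."""
    for line in lines:
        for word in line.lower().split():
            yield "%s\t1" % word

def reducer(sorted_lines):
    """Sum counts for runs of identical keys; input must be sorted by key."""
    pairs = (line.split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield "%s\t%d" % (word, sum(int(count) for _, count in group))

def run_locally(lines):
    # sorted() stands in for the shuffle-and-sort step between the phases.
    return list(reducer(sorted(mapper(lines))))

print(run_locally(["big data big cluster", "data data"]))
# ['big\t2', 'cluster\t1', 'data\t3']
```

Because the contract between the phases is just sorted lines of text, the same two functions could be written in any language, which is the flexibility the paragraph above describes.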
Additionally, the Hadoop ecosystem provides a staggering array of tools that build on top of HDFS and core MapReduce to query, analyze and process data. Hive provides a “SQL like” language that allows business analysts to query data using a syntax they are already familiar with. HBase provides a column-oriented database on top of Hadoop. Pig and Sizzle provide two more alternative programming models for querying Hadoop data. With data stored in HDFS using Hadoop, you inherit the ability to simply plug in Apache Mahout to use advanced machine learning algorithms on your data. And with RHadoop, it is straightforward to use the R statistical language to perform advanced statistical analyses on Hadoop data.
So while Hadoop and MongoDB also have some overlapping use cases, and share some useful functionality (seamless horizontal scalability, for example), it remains the case that each tool serves a specific purpose in enterprise computing. If you simply want to locate documents by keyword and perform simple analytics, then Elasticsearch may fit the bill. If you need to query documents that can be modeled as JSON and perform moderately more sophisticated analysis, then MongoDB becomes a compelling choice. And if you have a huge quantity of data that needs a wide variety of different types of complex processing and analysis, then Hadoop provides the broadest range of tools and the most flexibility.
As always, it is important to choose the right tool(s) for the job at hand. And in the “Big Data” space the sheer number of technologies and the blurry lines between them can make this difficult. As we can see, there are specific scenarios which best suit each of these technologies and, more importantly, the differences do matter. The best news of all, though, is that you are not limited to using only one of these tools. Depending on the details of your use case, it may actually make sense to build a combination platform. For example, Elasticsearch and Hadoop are known to work well together, with Elasticsearch providing rapid keyword search, and Hadoop jobs powering the more complicated analytics.
In the end, it takes ample research and careful analysis to make the best choices for your computing environment. Before selecting any technology or platform, take the time to evaluate it carefully, understand what scenarios it was designed to optimize for, and what tradeoffs and sacrifices it makes. Start with a small pilot project to “kick the tires” before converting your entire enterprise to a new platform, and slowly grow into the new stack.
Follow these steps and you can successfully navigate the maze of “Big Data” technologies and reap the associated benefits.
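Reposted from: http://www.osintegrators.com/opensoftwareintegrators%7CChoosing-Between-Elasticsearch-MongoDB-%2526-Hadoop
P.S. Only part of the article has been translated so far; the rest will follow later. Corrections to the translation are welcome.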
Choosing Between Elasticsearch, MongoDB & Hadoop