Many readers ask whether Hadoop, currently in full swing, is a good fit for their own projects: when should you use SQL, when should you use Hadoop, and how do you choose between the two? Aaron Cordova answers this question with a single picture, describing in detail how to choose the right data storage and processing tools for different data scenarios. Aaron Cordova is an American expert in big data analysis and architecture, and the CTO and co-founder of Koverse.
On Twitter, @merv forwarded a blog post titled "Counting Triangles."
The post is about how to count triangles in a graph, comparing the results of doing so with Vertica's SQL and with Hadoop MapReduce. On 1.3 GB of data, Vertica comes out 22-40x faster than Hadoop, and needs only three lines of SQL, so at that scale it is both simpler and faster. But the result is not very interesting.
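The post's three-line query isn't reproduced here, but the relational idea behind it is a triple self-join on the edge table: find edges a-b and b-c, then check for the closing edge a-c. A minimal sketch of that join logic in Python (the `edges` set is a toy example, not the 1.3 GB benchmark data):

```python
# Count triangles the way the SQL self-join does, roughly:
#   SELECT count(*) FROM edges e1 JOIN edges e2 JOIN edges e3 ...
# Toy input; the real benchmark ran over 1.3 GB of edges.
edges = {(1, 2), (2, 3), (1, 3), (3, 4)}

# Keep each edge in one orientation (u < v) so every triangle is counted once.
canon = {(min(u, v), max(u, v)) for u, v in edges}

triangles = sum(
    1
    for (a, b) in canon
    for (c, d) in canon
    if b == c and (a, d) in canon  # edges a-b and b-d, closed by a-d
)
print(triangles)  # 1: the triangle {1, 2, 3}
```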
Writing the task, however, is a different story. Yes, SQL really is very simple in this case, as everyone knows: for a query like this, SQL is much simpler than MapReduce. But for custom distributed computation, MapReduce is much simpler than SQL, and MapReduce can do things that SQL cannot, such as processing unstructured data.
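To see where the extra verbosity of MapReduce comes from, and also why it generalizes, here is a sketch of the same triangle count decomposed into the two map/reduce rounds a Hadoop job would typically use, simulated in plain Python rather than written against the actual Hadoop API (the edge list is again a toy example):

```python
from collections import defaultdict
from itertools import combinations

# Toy edge list standing in for the real input.
edges = [(1, 2), (2, 3), (1, 3), (3, 4)]

# Round 1: map each edge under both endpoints; reduce to neighbor sets.
neighbors = defaultdict(set)
for u, v in edges:
    neighbors[u].add(v)
    neighbors[v].add(u)

# Round 2 map: emit "open triads" (two edges sharing node n) keyed by the
# missing pair, plus the original edges. Pairing only neighbors larger
# than n ensures each triangle is generated exactly once.
candidates = defaultdict(list)
for n, nbrs in neighbors.items():
    for a, b in combinations(sorted(x for x in nbrs if x > n), 2):
        candidates[(a, b)].append("triad")
for u, v in edges:
    candidates[(min(u, v), max(u, v))].append("edge")

# Round 2 reduce: a triad closes into a triangle iff its key pair is an edge.
triangles = sum(
    vals.count("triad") for vals in candidates.values() if "edge" in vals
)
print(triangles)  # 1
```

Each round maps onto one Hadoop job, which is why a MapReduce version of this task runs to dozens of lines while the SQL version fits in three; the payoff is that the same map/reduce skeleton also works on data that fits no relational schema.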
Using 1.3 GB of data to benchmark Vertica or Hadoop is like saying "let's hold a 50-meter race between a Boeing 737 and a DC-10." Over that distance, neither plane would even get off the ground. The comparison in the blog post above is the same story: neither technology is designed to handle data sets of this size.
Of course, it is a bonus if a system that scales is also fast on small data, but that is not what is being discussed here. Whether the performance gap remains this pronounced on large-scale data is far less obvious, and that is the claim actually worth proving.
To help you choose a technology based on your actual situation, I drew this flowchart:
Original link: http://aaroncordova.com/blog2/2012/01/do-i-need-sql-or-hadoop-flowchart.html
A flowchart that tells you whether you need SQL or Hadoop