Author: Chszs. Please credit the source when reprinting. Blog home: http://blog.csdn.net/chszs
I was once asked, "How much experience do you have with big data and Hadoop?" I told them that I use Hadoop regularly, but that the datasets I work with are rarely larger than a few terabytes.
They asked, "Can you do simple grouping and statistics with Hadoop?" I said of course, and told them I just needed to see some examples of the file format.
They handed me a flash drive containing all 600MB of their data, which, as far as I could tell, was not a sample but the entire dataset. For reasons I still don't understand, they were unhappy when my solution involved pandas.read_csv rather than Hadoop.
Hadoop has real limitations. Hadoop lets you run one kind of general-purpose computation, which I'll describe in pseudocode:
Scala-style pseudo-code:
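Roughly, the shape of the computation is this, where F emits key/value pairs and G combines the values within each group (a sketch, not runnable code):

    collection.flatMap( (k, v) => F(k, v) )                  // map: F turns each record into key/value pairs
              .groupBy( _._1 )                                // shuffle: group by key
              .map( _._2.reduce( (a, b) => G(a, b) ) )        // reduce: G combines the values in each group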
SQL-style pseudo-code:
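The same shape in SQL terms, again only a sketch, with F as the grouping expression and G as the aggregate:

    SELECT G(...) FROM table GROUP BY F(...)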
Goal: Count the number of books in the library
Map: You count the books on the odd-numbered shelves, I count the books on the even-numbered shelves. (The more people we have, the faster this part goes.)
Reduce: We get together and add up our separate counts.
The only things we get to write are F(k,v) and G(k,v); apart from performance optimizations in the intermediate step, everything else is fixed.
It forces every calculation, every grouping, and every count into this one pattern, which is like working in a straitjacket. Many computations are better expressed in other models. The only reason to put on the straitjacket is that the approach might scale to extremely large datasets, and in most cases your data is probably smaller than that by several orders of magnitude.
But because "Big Data" and "Hadoop" are buzzwords, plenty of people who don't actually need Hadoop are happy to put on the straitjacket anyway.
One, if my data volume is hundreds of megabytes, Excel may not be able to load it
What is "big" for Excel is not big data, and there are excellent tools for this range. My favorite is pandas: built on top of the NumPy library, it can load hundreds of megabytes of data into memory in an efficient vectorized format. On my three-year-old laptop, NumPy can multiply 100 million floating-point numbers in the blink of an eye. MATLAB and R are also excellent tools.
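Here's a rough sketch of the kind of in-memory work I mean; the file name and column names are invented for illustration:

    import numpy as np
    import pandas as pd

    # 100 million float64 values occupy about 800MB, which fits easily in RAM,
    # and the multiplication is a single vectorized operation.
    x = np.random.rand(100_000_000)
    y = 2.5 * x

    # Loading a few hundred MB of CSV and doing simple grouping and statistics
    # is equally routine (hypothetical file and columns).
    df = pd.read_csv("sales.csv")
    print(df.groupby("region")["amount"].sum())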
For hundreds of megabytes of data, the typical approach is a simple Python script that reads the file line by line, processes each line, and writes the result to another file.
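A minimal sketch of such a read-process-write script; the file names and the filtering rule are made up for illustration:

    # Stream the input one line at a time so memory use stays tiny
    # regardless of file size.
    with open("input.csv") as src, open("output.csv", "w") as dst:
        for line in src:
            fields = line.rstrip("\n").split(",")
            if fields[0] == "2013":          # hypothetical filter: keep one year only
                dst.write(line)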
Two, if my data is 10GB
I bought a new laptop with 16GB of RAM and a 256GB SSD. If you load a 10GB CSV file into pandas, it often occupies much less memory than you might expect: a numeric string such as "17284832583" can be stored as a 4-byte or 8-byte integer, and "284572452.2435723" as an 8-byte double-precision float.
At worst, you may not be able to load all the data into memory at the same time.
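Even then, pandas can stream the file in chunks rather than loading it whole. A minimal sketch, with the file and column names assumed for illustration:

    import pandas as pd

    total = 0.0
    # Process a few million rows at a time so the working set stays small.
    for chunk in pd.read_csv("big.csv", chunksize=5_000_000):
        total += chunk["amount"].sum()
    print(total)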
Three, if my data is 100GB, 500GB or 1TB
Buy a 2TB or 4TB hard drive, install PostgreSQL on your desktop PC or a server, and you're done.
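A minimal sketch of getting a large CSV into PostgreSQL from Python, chunk by chunk; the connection string, file name, and table name are assumptions for illustration, and it needs the psycopg2 driver installed:

    import pandas as pd
    from sqlalchemy import create_engine

    engine = create_engine("postgresql://localhost/analytics")   # hypothetical database

    # Append the file to a table a chunk at a time, so the CSV never has to fit
    # in RAM; after that, indexing and querying are ordinary SQL.
    for chunk in pd.read_csv("events.csv", chunksize=1_000_000):
        chunk.to_sql("events", engine, if_exists="append", index=False)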
Four, Hadoop is far weaker than SQL or Python scripts
In terms of expressing computations, Hadoop is weaker than SQL and weaker than a Python script.
SQL is a very straightforward query language, well suited to business analysis. SQL queries are generally simple, and they are also very fast: if your database uses the right indexes, a query that takes more than a second or two is the exception rather than the rule.
Hadoop has no concept of an index; it only knows full table scans. And Hadoop is full of leaky abstractions: at my last job I spent far more time fighting Java memory errors, file fragmentation, and cluster contention than I spent on the data analysis itself.
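To make the indexing point concrete, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for a real PostgreSQL setup; the database, table, and column names are invented:

    import sqlite3

    con = sqlite3.connect("events.db")          # hypothetical existing database and table
    con.execute("CREATE INDEX IF NOT EXISTS idx_events_user ON events (user_id)")

    # With the index in place this is a quick B-tree lookup; without any index
    # (which is Hadoop's only mode) the whole table has to be scanned.
    rows = con.execute(
        "SELECT user_id, COUNT(*) FROM events WHERE user_id = ? GROUP BY user_id",
        (42,),
    ).fetchall()
    print(rows)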
If your data is not structured like a SQL table (for example plain text, JSON objects, or binary blobs), it is usually just a matter of writing a small Python script that processes it row by row: store the data in files, process each file, and so on. Under Hadoop, the same job becomes a hassle.
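A minimal sketch of that row-by-row approach for JSON objects; the file name and field names are made up for illustration:

    import json

    with open("events.jsonl") as src:
        for line in src:
            event = json.loads(line)
            if event.get("type") == "purchase":        # hypothetical filter
                print(event["user_id"], event["amount"])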
Hadoop is also much slower than SQL or Python scripts for this kind of work. A SQL query that uses an index properly is just a lookup: PostgreSQL walks the index and retrieves exactly the rows for the key. Hadoop, by contrast, does a full table scan and then re-sorts the entire table; the re-sort can be made fast by sharding the table across multiple machines, but that is what it costs. And when processing binary blobs, Hadoop has to make repeated round trips to the namenode to locate and process the data, a job that a simple Python script handles more naturally.
Five, my data exceeds 5TB
You should consider using Hadoop; at this scale you don't have many other options.
The only real benefit of Hadoop is that it scales very well. If you have a single table containing many terabytes of data, Hadoop becomes a reasonable option for running full table scans over it. If your data is not that large, you should avoid Hadoop like the plague; traditional methods will solve the problem with far less trouble.
Six, Hadoop is an excellent tool
I don't hate Hadoop; I choose it when no other tool does the job well. Also, I recommend Scalding rather than Hive or Pig: Scalding lets you write Hadoop job chains in Scala, hiding the MapReduce plumbing underneath.