Don't jump on the Hadoop bandwagon: your data isn't that big.


This article, originally titled "Don't use Hadoop when your data isn't that big", is by Chris Stucchio, a researcher with years of experience and a former postdoc at the Courant Institute at New York University. He has worked on a high-frequency trading platform and served as CTO of a startup, and he prefers to call himself a statistician. By the way, he is now running his own business, offering consulting on data analysis and recommendation optimization; his email is stucchio@gmail.com.

"How many big data and Hadoop experiences do you have?" They asked me. I've been using Hadoop, but I rarely handle more than a few terabytes of tasks. I'm basically just a big data novice--know the concept, write the code, but not experience it on a large scale.

Then they asked, "Can you do a simple group-by and sum with Hadoop?" Of course I can, but I said I'd need to see the specific file format.

They handed me a USB flash drive with all their data on it: 600 MB. Yes, all of their data. For some reason, I used pandas.read_csv (pandas is a Python data analysis library) rather than Hadoop to finish the task, and they were quite displeased.
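
For a file that size, the whole job is a few lines of pandas. A minimal sketch, assuming a hypothetical data.csv with columns named "group" and "value" (the real schema wasn't given):

    import pandas as pd

    # read the whole 600 MB file into memory; a laptop handles this easily
    df = pd.read_csv("data.csv")

    # the requested "group by and sum" in one line
    totals = df.groupby("group")["value"].sum()
    print(totals)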

Hadoop really is limited. It does nothing more than run one general computation, expressed in SQL pseudocode as: SELECT G(...) FROM table GROUP BY F(...). You can only change the G and F operations, unless you want to do performance optimization in the intermediate step (which is no fun!). Everything else is fixed.

(On MapReduce: the author previously wrote "Explaining MapReduce in 41 words", which you can refer to.)

In Hadoop, every computation must be written as a map, a group-by, then an aggregate, in that sequence. It's like being forced into a straitjacket. Many computations fit other models far better. The only reason to put up with the straitjacket is that you can scale to very large datasets, and your dataset is probably nowhere near that order of magnitude.
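
The straitjacket is easy to see in plain Python. A toy sketch of the map / group-by / aggregate sequence, with made-up (key, value) records standing in for real input:

    from itertools import groupby

    records = [("a", 1), ("b", 2), ("a", 3)]  # hypothetical input

    # map: emit (F(record), value) pairs; here F is just the first field
    pairs = sorted(records, key=lambda kv: kv[0])

    # group by: collect values sharing a key; aggregate: apply G (here, sum)
    result = {k: sum(v for _, v in grp)
              for k, grp in groupby(pairs, key=lambda kv: kv[0])}
    print(result)  # {'a': 4, 'b': 2}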

But because Hadoop and "big data" are buzzwords, half the world wants to put on the straitjacket even when they don't need it.

But my data is hundreds of MB! Excel can't load it.

That's big for Excel, but it's not big data. There are plenty of good tools; I like pandas, which is built on NumPy. It can load hundreds of MB into memory in an efficient vectorized format, and on my three-year-old laptop NumPy can crunch through 100 million floating-point numbers in the blink of an eye. Matlab and R are also great tools.
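
The speed claim is easy to check for yourself. A rough sketch (timings will of course vary by machine):

    import time
    import numpy as np

    x = np.random.rand(100_000_000)  # 100 million float64 values, ~800 MB

    start = time.perf_counter()
    total = x.sum()  # roughly 1e8 floating-point additions
    print(f"summed in {time.perf_counter() - start:.3f} s")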

For hundreds of MB of data, a simple Python script that reads the file, processes it, and writes out a result is generally enough.
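
A minimal sketch of that read-process-write pattern, assuming a hypothetical input.csv whose third column is a numeric amount:

    # keep only the rows whose amount exceeds an assumed threshold
    with open("input.csv") as src, open("output.csv", "w") as dst:
        for line in src:
            fields = line.rstrip("\n").split(",")
            if float(fields[2]) > 100.0:
                dst.write(line)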

But my data is 10 GB!

I just bought a new laptop: 16 GB of RAM cost $141.98, plus an extra $200 for a 256 GB SSD. Besides, if you load a 10 GB CSV into pandas, it is often much smaller in memory: a number like "17284932583" can be stored as an 8-byte integer, and "284572452.2435723" as an 8-byte double.
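
You can check the in-memory footprint directly in pandas. A sketch with hypothetical column names; explicit dtypes keep every integer and float at 8 bytes each:

    import pandas as pd

    df = pd.read_csv("big.csv", dtype={"user_id": "int64", "amount": "float64"})

    # report how much RAM the parsed table actually occupies
    print(df.memory_usage(deep=True).sum() / 1e9, "GB in memory")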

Worst case, you might not be able to fit all the data into memory at once.
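
pandas supports exactly this case: stream the file in chunks and combine the partial results. A sketch, again with hypothetical column names:

    import pandas as pd

    totals = None
    for chunk in pd.read_csv("big.csv", chunksize=1_000_000):  # 1M rows at a time
        part = chunk.groupby("group")["value"].sum()
        totals = part if totals is None else totals.add(part, fill_value=0)
    print(totals)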

But my data is 100 GB / 500 GB / 1 TB!

A 2 TB hard drive costs just $94.99, and 4 TB is $169.99. Buy one, stick it in a desktop computer or server, and install PostgreSQL on it.

Hadoop is far less applicable than SQL or Python scripts

In terms of expressing computations, Hadoop is much weaker than SQL. Anything that can be written in Hadoop can be written more easily in SQL or in a simple Python script.

SQL is an intuitive query language with little abstraction, commonly used by business analysts as well as programmers. SQL queries tend to be very simple, and they are generally fast too: as long as the database is indexed correctly, it is rare for a query to take more than a few seconds.
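
A toy illustration using Python's standard-library sqlite3 module (the article recommends PostgreSQL, but the idea is the same; the table, index, and data here are made up):

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    con.executemany("INSERT INTO sales VALUES (?, ?)",
                    [("east", 10.0), ("west", 5.0), ("east", 7.5)])

    # the index is what keeps lookups fast; Hadoop has no equivalent
    con.execute("CREATE INDEX idx_region ON sales(region)")

    for row in con.execute("SELECT region, SUM(amount) FROM sales GROUP BY region"):
        print(row)  # ('east', 17.5) then ('west', 5.0)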

Hadoop has no notion of an index; it only knows full table scans. And Hadoop is piled with layers of abstraction: on my previous projects, so much time went into Java out-of-memory errors, memory fragmentation, and cluster contention that little was left for the actual data analysis.

If your data isn't structured like an SQL table (say, plain text, JSON, or binary), it is usually more straightforward to write a small Python or Ruby script that processes it row by row. Store the data in multiple files and handle them one at a time, as in the sketch below. Where SQL doesn't apply, Hadoop is less annoying from a programming standpoint, but it still holds no advantage over a Python script.
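
A sketch of row-at-a-time handling for non-tabular data, assuming a hypothetical events.json in newline-delimited JSON with a "type" field:

    import json

    with open("events.json") as src, open("clicks.json", "w") as dst:
        for line in src:
            event = json.loads(line)
            if event.get("type") == "click":  # assumed filter condition
                dst.write(line)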

Besides being harder to program for, Hadoop is usually slower than the alternatives. SQL queries are very fast as long as the indexes are used well: to compute a join, PostgreSQL simply looks at the index (if there is one) and fetches exactly the keys it needs. Hadoop, by contrast, does a full table scan and then re-sorts the entire table. Sorting across multiple machines can be parallelized, but it also brings the overhead of coordinating those machines. And to process binary files, Hadoop has to go back to the NameNode again and again; a simple Python script just streams straight off the local file system.
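
The index-lookup point can be seen with sqlite3's query planner; a toy sketch with made-up tables (PostgreSQL's EXPLAIN gives the same kind of answer):

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    con.execute("CREATE TABLE orders (user_id INTEGER, amount REAL)")
    con.execute("CREATE INDEX idx_orders_user ON orders(user_id)")

    # the plan shows the join resolved by index lookups, not a re-sorted full scan
    plan = con.execute("""
        EXPLAIN QUERY PLAN
        SELECT name, SUM(amount)
        FROM users JOIN orders ON orders.user_id = users.id
        GROUP BY users.id""").fetchall()
    for row in plan:
        print(row)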

But my data is over 5 TB!

Then your life is hard: you are stuck wrestling with Hadoop, without many other choices (a single high-spec machine stuffed with hard drives might still cope), and the alternatives tend to be very expensive (think IBM, Oracle, EMC and the like ...).

The only real advantage of Hadoop is scaling. If your data is a single multi-terabyte table, full-table scans are what Hadoop is good at. Otherwise, please cherish life and stay away from Hadoop. It isn't worth the trouble; traditional methods will save you both time and effort.

Original link: http://www.kankanews.com/ICkengine/archives/47621.shtml

