Stop bluffing! Do we really need to burn money blindly to pursue big data?

Source: Internet
Author: User

Big data is probably the hottest technical term right now. Hype means there is a bubble, and something worth reflecting on. On May 6, Christopher Mims of Quartz published an article titled "Most data isn't 'big,' and businesses are wasting money pretending it is". It is well argued and worth reading first. A translation follows:

If you are not yet in the big data camp, you had better get there. After all, competing requires big data; with only a small amount of data, your competitors will crush you.

Big data is the next big project that consultants and IT companies are selling to businesses, yet there are many problems behind the hype. Fortunately, honest big data practitioners (also known as data scientists) never set aside their skepticism, and they have put forward a series of reasons to be wary of the big data hype. They are as follows:

Reason one: even internet giants like Facebook and Yahoo! are not always dealing with big data, and applying Google-style tools to their workloads is often inappropriate.

Facebook and Yahoo! run their own giant clusters (large collections of powerful servers) to process data. The need for cluster processing is one of the hallmarks of big data; after all, data that can be processed on a home PC cannot be called big data. Having to split a job into many small pieces, with a series of computers each handling one piece, is typical of a big data problem comparable in scale to Google's computations over every web page in the world.

It now appears that at Facebook and Yahoo!, not every job needs a cluster of that scale. At Facebook, for example, most of the jobs that engineers submit to the cluster are in the MB-to-GB range and could be done on a single computer, or even a laptop.

Yahoo! is in a similar situation: the median job on its cluster processes only 12.5 GB of data. A typical desktop computer cannot handle that, but a single well-configured server is fully up to the task.

The points above are distilled from a Microsoft Research paper called "Nobody ever got fired for buying a cluster." The paper points out that even in the most data-hungry companies, most problems do not need a cluster, because a cluster is a relatively inefficient or even completely inappropriate solution for a large number of problem types.

Reason two: big data has become synonymous with data analysis, a conflation that is confusing and counterproductive.

Data analysis is as old as tabulating the grain in the royal granary, but now that the word "big" has to be attached to every mention of data, the very necessary practice of data analysis has been swept up in a larger but less useful fad. For example, one article urges readers to follow "3 steps to apply big data to your small business", when in fact a small business's data can be handled in a Google Docs spreadsheet, let alone Excel on a laptop.

This means that most of the data enterprises process is actually what Rufus Pollock of the Open Knowledge Foundation calls small data. It is important, even a "revolution," Pollock says. But it has little to do with big data.

Reason three: supersizing your data is increasingly not worth the candle.

Is more data always better? Not necessarily. If you are looking for correlations (is X related to Y in a way that gives me information I can act on?), more data can in fact bring more trouble.

The information that can be extracted from big data diminishes as the volume of data grows, writes Michael Wu, chief data analyst at Lithium, a social media analytics company. This means that past a certain point, the marginal return on additional data shrinks so much that collecting more is just a waste of time.
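The diminishing return can be illustrated in one simple statistical setting. This is only a sketch under an assumption that is not in the article: we are estimating an average from n independent measurements, where the standard error is sigma / sqrt(n). It is not Wu's own analysis, and the function names below are hypothetical.

```python
import math


def standard_error(sigma, n):
    """Standard error of a sample mean of n independent values: sigma / sqrt(n)."""
    return sigma / math.sqrt(n)


def gain_from_tenfold(sigma, n):
    """How much the standard error shrinks when the sample grows from n to 10 * n."""
    return standard_error(sigma, n) - standard_error(sigma, 10 * n)


# Assumed setting: sigma = 1.0. Each tenfold increase in data buys less precision.
for n in (100, 10_000, 1_000_000):
    print(f"n={n:>9,}: standard error {standard_error(1.0, n):.5f}, "
          f"gain from 10x more data {gain_from_tenfold(1.0, n):.5f}")
```

In this setting, every tenfold increase in data cuts the error by the same factor (the square root of 10), so the absolute improvement gets smaller each time while the cost of collecting the data keeps growing.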

One reason: the "bigger" the data, the more false positives turn up when you look for correlations. As data analyst Vincent Granville wrote in "The Curse of Big Data", it is easy to end up dealing with millions of correlations even when the data set includes only 1,000 entries. This means, he writes, that many of these correlations may look very strong purely by accident; if you use such a correlation in a predictive model, the results will be wrong.

This error is often seen in genetics, one of the original applications of big data. Scientists who sequence genomes have run endless searches for correlations in them and have come up with all sorts of fruitless results.
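The false-positive effect described above can be reproduced with pure noise. The following is a minimal sketch, not Granville's own calculation: the number of variables, the number of observations, and the 0.6 threshold are all assumed values chosen for illustration, and the function names are hypothetical.

```python
import itertools
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def count_spurious(num_vars=200, num_obs=20, threshold=0.6, seed=0):
    """Count pairs of independent random variables that look strongly correlated."""
    rng = random.Random(seed)
    # Every column is independent Gaussian noise, so no real relationship exists.
    columns = [[rng.gauss(0.0, 1.0) for _ in range(num_obs)]
               for _ in range(num_vars)]
    total = 0
    strong = 0
    for a, b in itertools.combinations(columns, 2):
        total += 1
        if abs(pearson(a, b)) >= threshold:
            strong += 1
    return total, strong


total_pairs, strong_pairs = count_spurious()
print(f"{strong_pairs} of {total_pairs} pairs of pure-noise variables have |r| >= 0.6")
```

Just 200 variables already produce 19,900 pairs to test, and some of them clear the threshold by chance alone. The number of pairs grows with the square of the number of variables, which is why casting a wider net produces more accidental "discoveries".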

Reason four: in some cases, big data may enlighten you, but it is just as likely to confuse you.

Once a company starts using big data, it finds itself mired in a series of difficult disciplines: statistics, data quality, and everything else that makes up "data science". Just as with the science that is published every day, much of which is later overlooked, corrected, or never verified, there are plenty of pitfalls.

Biases in data collection methods, lack of context, gaps in data aggregation, manual data processing, and overall cognitive biases can lead even the best researchers to find faulty models, said Kate Crawford, a visiting professor at the MIT Media Lab. "We may be caught in some sort of algorithmic illusion." In other words, even if you have big data, it is not something just anyone in the IT department can handle; it may take someone with a Ph.D. or equivalent experience. And when the work is done, the answer may be that you do not need "big data" at all.

So which is better: big data or small data?

Does your business need data? Of course it does. But only a pointy-haired boss of the Dilbert kind would buy the idea that size is what matters, as if it were a fashion. The problems that have always existed in science (data quality, overall goals, and the importance of context and intuition) are also inherent in how businesses use data to make decisions. Remember: Gregor Mendel uncovered the secrets of genetic inheritance using only a notebook's worth of data. What matters is the quality of the data, not its size.

Original link: Most data isn't "big," and businesses are wasting money pretending it is

(Responsible editor: The good of the Legacy)
