Big data analytics: the terminator of data sampling

Source: Internet
Author: User
Keywords: big data analytics, data sampling, data warehouse

"If you really want to know the truth that happens in your business, you need a lot of very detailed data." "Http://www.aliyun.com/zixun/aggregation/8302.html" research director of the Institute of > Data Warehousing (tdwi) Philip · Lusem in his latest report on TDWI's big data. "If you really want to see something you've never seen before, it helps you tap into data that has never been analyzed by business intelligence," he said. ”

This is the rationale for big data analytics, but it is not entirely unprecedented. The big data concept itself is a reminder of that: go back at least to the beginning of the 21st century, when storage and CPU technologies were being flooded with terabytes of data and the industry faced a data-scalability crisis. What is unprecedented is applying advanced analytics techniques, such as data mining, to datasets this large and this varied. That is the epoch-making significance of big data analytics, and a sign that the data-scalability crisis has ended, Russom said.

For enterprises the significance is considerable. Data mining, data analysis, and in some cases even reporting all have to run against the data the business collects, and those volumes long outstripped what the tools could handle. That is why workarounds such as data sampling came to be seen as a practical necessity.

"You can't put the entire dataset into a data mining plan. You have to choose the data you need, and you must ensure that the data is correct, because if you do not put the correct data, your technology may not work. Mark Madsen, a researcher at the Institute of Data Warehousing, told participants at the Prediction Analysis workshop.

"You can put a very small proportion of the data you collect into digging ... Sampling of probability events. "But decomposition will be very rare and become a very rare event that makes it hard to become a sample." ”

Ideally, you want to find all of these unusual "rare" events: fraud, customer churn, a looming supply chain disruption. They are the high-value needles hidden in your undifferentiated haystack of data, and they are hard to find.
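The arithmetic behind this is easy to check. In a back-of-the-envelope sketch (the rates below are invented for illustration, not from the article), a 1% sample of a million records is more likely than not to contain zero instances of an event that occurs only 50 times in the whole dataset:

```python
import math

# Hypothetical numbers, chosen only to illustrate the scale of the problem.
n_records = 1_000_000    # records collected
rare_rate = 0.00005      # 1 fraud case per 20,000 records -> 50 cases in total
sample_fraction = 0.01   # a seemingly generous 1% sample

expected_hits = n_records * rare_rate * sample_fraction  # 50 * 0.01 = 0.5
p_zero = math.exp(-expected_hits)  # Poisson approximation: P(no rare event in sample)

print(f"expected rare events in the sample: {expected_hits}")  # 0.5
print(f"chance the sample misses all of them: {p_zero:.0%}")   # ~61%
```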

IBM, Microsoft, Oracle, and Teradata, along with most other well-known BI and data warehouse (DW) vendors, have started selling products that integrate Hadoop. Some even tout their own implementations of the now-ubiquitous MapReduce algorithm.
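MapReduce itself is a simple programming model: a map step emits key/value pairs and a reduce step aggregates them per key; Hadoop's contribution is running those two steps across a cluster of commodity machines. A minimal in-memory sketch (illustrative only, not any vendor's implementation) conveys the shape:

```python
from collections import defaultdict
from itertools import chain

def map_reduce(records, mapper, reducer):
    """Toy single-process MapReduce: map each record to (key, value) pairs,
    group the pairs by key, then reduce each group to one result."""
    groups = defaultdict(list)
    for key, value in chain.from_iterable(mapper(r) for r in records):
        groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}

# Classic word count, run over every record rather than a sample.
docs = ["big data analysis", "data sampling", "big data"]
counts = map_reduce(
    docs,
    mapper=lambda doc: [(word, 1) for word in doc.split()],
    reducer=lambda word, ones: sum(ones),
)
print(counts)  # {'big': 2, 'data': 3, 'analysis': 1, 'sampling': 1}
```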

These vendors are not just talking about big data; they are talking about big data combined with advanced analytics techniques such as data mining, statistical analysis, and predictive analytics. In other words, they are talking about big data analytics.

According to The Data Warehousing Institute, however, big data analytics has not yet gone mainstream. In a recent TDWI survey, just over one-third (34%) of respondents said their businesses combine big data with some form of advanced analytics. And in most cases they take a very simple approach, for example, data sampling.

In fact, says Dave Inbar, senior director of big data products at data integration specialist Pervasive Software, "if companies aren't considering phasing out sampling and other such former best practices, they really are falling behind."

"If you continue to use the data sampling method, you can actually process all the data, but the scientific nature of the data is weakened." "he said. "In the world of Hadoop, there is no reason not to use commodity hardware or real smart software." In the past, we used sampling data, there might be economic cost considerations, or technical reasons. But today, these reasons are gone. Data sampling in the past is the best practice scenario, but I think it's time has passed. ”

"The problem with a needle in a haystack is not suitable for a sample, so too much emphasis on the training set may lead to problems." "Ultimately, it's easier to run an entire dataset than to follow statistical algorithms and worry about samples," says Madsen, who is responsible for information management consulting. Technology can handle data problems when there are assignment challenges, and access to statistical methods. ”

