The Real Meaning of Big Data: Automatic Mining of Big Data

Source: Internet
Author: User
Keywords: big data

Big data is blazing hot right now; almost everyone is talking about it. But I'm afraid very few people know what big data actually is, and a great many are being misled.

Big data doesn't mean a lot of data.

So merely storing a lot of data does not put you in the big data business, because "big data" is only a shorthand. Spelled out, it really means "big data mining": big data that has not been mined is nothing but crude oil that has not been extracted.

Big data also does not mean data mining in the ordinary sense.

Many people used to do data analysis or data mining, and when the book "The Age of Big Data" came out and big data caught fire, they turned themselves into big data experts. If that were all there is to it, there would be no need to talk about big data at all, because the thing has always existed; we would merely be calling it by another name. It is like insisting on saying "consume a beverage" instead of "drink water." Yes, that is what we call playing with concepts.

"Big Data Mining" actually has not said full, and then complete point, should be "large data automatic mining."

Data analysis or data mining in the past meant that people analyzed the data themselves and dug out regularities for later use.

With big data, however, not only is the volume of data too large, it usually has many dimensions as well. People simply cannot process such an enormous amount of data; often they do not even know where to start. The computer must therefore process it automatically and mine the patterns out of the data.

At present, though, computers cannot carry out the kind of rigorous, complex logical reasoning that people can, so they cannot analyze data using our mental models. A person may be able to spot patterns in a small amount of data but has no way to cope with a lot of it, which is why humans resort to sampling.

The computer is exactly the opposite: it cannot infer patterns from a small amount of data, but it has one advantage, raw speed, which makes it feasible to churn through massive amounts of data to find the patterns.

Since computers cannot yet perform complex logical reasoning, they use the simplest possible method: plain statistics, a "brute-force count." They tally which results followed which situations in the past, and when something similar happens again, they tell us which result is likely to follow.
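
To make this concrete, here is a minimal sketch in Python of that kind of brute-force counting. The event log, the situations, and the outcomes are all made up for illustration; the article itself names no dataset or algorithm.

```python
# A minimal sketch of "brute-force counting": tally past outcomes per situation,
# then predict the most frequent outcome when a similar situation recurs.
from collections import defaultdict, Counter

# Hypothetical history of (situation, outcome) pairs.
history = [
    ("rainy", "traffic_jam"),
    ("rainy", "traffic_jam"),
    ("rainy", "clear_roads"),
    ("sunny", "clear_roads"),
    ("sunny", "clear_roads"),
]

# Count how often each outcome followed each situation.
counts = defaultdict(Counter)
for situation, outcome in history:
    counts[situation][outcome] += 1

def predict(situation):
    """Return the most frequent past outcome for this situation and its relative frequency."""
    outcomes = counts[situation]
    if not outcomes:
        return None, 0.0
    outcome, n = outcomes.most_common(1)[0]
    return outcome, n / sum(outcomes.values())

print(predict("rainy"))   # -> ('traffic_jam', 0.666...)
print(predict("sunny"))   # -> ('clear_roads', 1.0)
```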

This points to another feature of big data: it is mostly about prediction, telling you directly what will happen in the future, rather than merely analyzing past trends and the current state and leaving the judgment about the future to people.

Why does such a simple method work? It comes back to the word "big": because the volume of data is so large, the statistical results are usually right.

Everyone knows this example: flip a coin and tally how often heads and tails come up. If you flip only 10 times, you might get heads 9 times and draw a completely wrong conclusion. But if you flip 100,000 times, a million times, or more, your tally will be essentially correct: heads and tails will each show up about 50% of the time.
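
Here is a tiny simulation of the coin-flip example; the flip counts and random seed are arbitrary. As the number of flips grows, the observed frequency of heads settles near 50%.

```python
# Simulate fair coin flips and watch the heads frequency converge toward 0.5.
import random

random.seed(0)  # fixed seed so the run is reproducible

for n in (10, 1_000, 100_000, 1_000_000):
    heads = sum(random.random() < 0.5 for _ in range(n))
    print(f"{n:>9,} flips: heads frequency = {heads / n:.4f}")
```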

Yes, automatic mining of big data rests on exactly this principle.

There is no rigorous causal analysis here, no deducing the result from its causes. Instead, the computer learns from statistics that when this kind of situation occurs, that kind of result usually follows; in other words, it captures the relationship between phenomenon and outcome. Hence a striking characteristic of big data: it cares only about correlation, not causation, or, more colloquially, it "knows the result without knowing why."

This is really a new way of analyzing and mining data that people worked out around the computer's strengths. It is completely different from the traditional way, which is why traditional data analysis or data mining experts cannot simply relabel themselves as big data experts.

But be careful: you will run into such "experts," some of them famous professors from prestigious universities. Walk into a bookstore and you will also see plenty of big data books, every cover carrying the words "big data" in large print, yet inside they are all traditional, manual data analysis methods that never touch big data at all. The book "The Age of Big Data" is, of course, an exception.

In addition, traditional neural networks, deep learning, and other artificial intelligence techniques are basically not big data in this sense either, because they still involve a great deal of human work, including building the model and designing the training scheme; they still need people who are thoroughly familiar with the business logic, and at present such methods struggle to deliver practical results. Big data, by contrast, simply lets the computer run statistics over huge amounts of data with simple but clever algorithms and find patterns that people could never have imagined. Big data here has essentially nothing to do with business logic; the analyst does not need to know the business at all. To analyze data from the mobile internet industry, for example, he does not need to know the industry's background or current state; he only needs to run statistics over a large amount of historical data to find its future trend.
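
As a rough illustration of "finding the future trend from historical statistics alone," here is a minimal sketch that fits a straight line to a made-up monthly series and projects it forward. The numbers are invented and the method (an ordinary least-squares line) is only one simple choice; the article does not say which algorithm such an analyst would actually use.

```python
# Fit a least-squares trend line to a hypothetical monthly metric
# and project it a few months ahead, using no knowledge of the business.
months = list(range(1, 13))                                  # months 1..12
values = [10, 12, 13, 15, 14, 17, 19, 20, 22, 21, 24, 26]    # invented monthly metric

n = len(months)
mean_x = sum(months) / n
mean_y = sum(values) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(months, values))
         / sum((x - mean_x) ** 2 for x in months))
intercept = mean_y - slope * mean_x

# Project the fitted trend three months into the future.
for future_month in (13, 14, 15):
    print(f"month {future_month}: projected value {intercept + slope * future_month:.1f}")
```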

At this point you must be itching to ask: can we find an example of real big data anywhere?

Let's start with a little story:

In the 1980s, two computer geeks at IBM were working on translation systems. At the time, the so-called experts were busy exploring the inner workings of language, grammar, syntax, and the like. The two nerds went a different way: they hunted down large amounts of corresponding texts in different languages and let the data do the work. Others criticized them with "this kind of computer brute force is not science." Later they were recruited away by the boss of a hedge fund. Today those two nerds are the co-CEOs of Renaissance Technologies, and the boss was Jim Simons.

The two Renaissance Technologies co-CEOs each earn roughly 100 million dollars a year, more than Wall Street CEOs, and yet the two are almost unknown. Their boss, James Simons, is more famous: a mathematician who wrote a celebrated theorem with Shiing-Shen Chern, was a colleague of Chen-Ning Yang, has an annual income of more than a billion dollars, and is now retired and devoted to philanthropy. Tsinghua University has a Chern-Simons building, built with money that Chen-Ning Yang persuaded Simons to donate.

In financial investment, the hedge funds that care about correlation rather than causation are the ones doing well (Renaissance Technologies, D. E. Shaw), while firms steeped in financial theory but weak in data analysis have no comparable record to show; the MIT finance scholar Lo frankly admits he does not understand what Renaissance Technologies is doing.

Hey, you there: stop staring at that 100-million-dollar annual income.

The key point here is that many people criticized them, saying "this kind of computer brute force is not science" (those critics are certainly "experts," otherwise they would not even be qualified to criticize), and the finance scholars do not understand what they are doing either.

What does that tell us? It tells us that even in the developed world few people endorse this approach, and fewer still know how to work this way, so you can imagine how many people in China know how to do it.

In China, if someone works in such a non-mainstream way, never mind becoming an expert or a professor, never mind earning hundreds of millions, you can estimate for yourself their odds of not going hungry.

Anyway, I know a guy who, starting in 2000, did exactly what those two American nerds did, using this "unscientific brute-force method" for semantic relevance analysis, the same kind of language work as their translation system. You could say he made a breakthrough in this area, but when he wrote his results up, the doctors and experts would not even read them. He now does ordinary IT work at a small company, barely making ends meet; for a long time he could not find a suitable job and nearly ended up washing dishes or working as a security guard.

Someone may ask: doing language this way, is that reliable? With a big data mindset you skip the "why"; the two tech nerds have already shown you the result.

But if you insist on knowing why, that can be explained too:

Language is in fact far more complicated than numbers. Take 1 and 2: a computer naturally knows their relationship, which is bigger, which is smaller, and by how much. But take the words "people" and "big": how is a computer supposed to know the relationship between them? The traditional method is heavy manual annotation (what professionals call part-of-speech tagging and the like). Getting the computer to learn the meanings and relatedness of words purely by mining data, without even handing it a basic thesaurus, making it build one itself, is extremely hard, almost unthinkable. But that guy did it.
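
For flavor, here is a minimal sketch of one generic way a computer can estimate word relatedness from raw text alone: count which words appear near which, then compare those co-occurrence profiles. The toy corpus is invented, and this is not presented as the method that person actually used.

```python
# Estimate word relatedness purely from co-occurrence counts in raw text,
# with no hand-built thesaurus or manual annotation.
from collections import defaultdict
from math import sqrt

corpus = [
    "the big dog chased the small cat",
    "a big truck blocked the small car",
    "the cat and the dog slept",
]

window = 2  # words within this distance count as co-occurring
cooc = defaultdict(lambda: defaultdict(int))
for sentence in corpus:
    words = sentence.split()
    for i, w in enumerate(words):
        for j in range(max(0, i - window), min(len(words), i + window + 1)):
            if i != j:
                cooc[w][words[j]] += 1

def cosine(a, b):
    """Cosine similarity between two words' co-occurrence vectors."""
    va, vb = cooc[a], cooc[b]
    dot = sum(va[k] * vb.get(k, 0) for k in va)
    na = sqrt(sum(v * v for v in va.values()))
    nb = sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

# Relatedness scores derived purely from the text itself.
print("dog ~ cat:  ", round(cosine("dog", "cat"), 3))
print("dog ~ truck:", round(cosine("dog", "truck"), 3))
```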

In other words, when it comes to big data, processing language is harder than processing numbers by well more than one or two orders of magnitude, so someone who can handle language can handle numbers easily. Besides, big data does not care what kind of data you have; the task is to find the correlations, so text and numbers are not all that different.

Some time ago the guy came across a problem of industry trend analysis. He said it took him only an hour to come up with an algorithm that, given enough data, would produce the result, but nobody in China would believe him.

Well, I have drifted a bit off topic, sorry. But now you know what real "big data" is. First, remember that big data is used for prediction, that is, it tells you the future result directly; second, keep the phrase "automatic mining of big data" firmly in mind, and then nobody will be able to fool you.
