Big data: maybe not as smart as you think?

You may not realize it, but data is no longer just a key component of computer systems: it has spread across every field and become a hub of the modern world.

A managing director at JPMorgan Chase recently called data "the lifeblood of the business." He made the remark at a major technical conference at which data was the main subject of discussion, and the meeting also analyzed in depth how institutions can move onto the "data-driven" path.

Harvard Business Review says "data scientist" will be "the sexiest job of the 21st century." In the article, the authors describe in detail how Netflix captures each user's actions and turns us from "happy users into unconscious puppets." The article also cautions that "massive data analysis and processing has become a reality, and the trend is growing."

The "big data" concept is used carelessly and haphazardly

All the articles mentioned above promote the advantages and power of big data in pursuit of publicity, marketing, or profit, and there is no doubt that big data has become the most dazzling technology trend of the year. If you are a technical person, you will have noticed that nobody can stop talking about big data right now. Yet at the same time everyone seems to be saying nothing, because few people can really say what big data is. Well, that conclusion is a little arbitrary. Strictly speaking, the current big data concept grows mainly out of a few real developments:

• New data, growing explosively, is being collected in bulk (stored, processed, and analyzed), thanks to the extreme thirst for information of industry giants such as Google, Facebook, and Amazon.

• Information is increasingly diverse in form: online purchases, Facebook status updates, tweets, shared pictures, and registration information of every kind.

• The entire industry hungers for a solution that can manage such huge volumes of data as quickly and efficiently as possible.

However, the big data concept seems overused and haphazardly applied, and even where it happens to be used correctly, its scope of application is not as broad as technologists think.

The three developments listed above are real. Google aggressively grabs every byte of information from all kinds of sources, trying to build profiles matching the usage habits of as many users as possible. This is a double-edged sword: take Google Now, which advertises itself as "recommending the right product before the customer even looks for it" (but we are not going to discuss the ethical issues here).

Clearly, data from all these sources does not arrive in any regular form. As a result, Google may well need a unique set of processing tools to handle it, at least for quantities and varieties of data that did not exist in the past.

The two most famous such tools are Hadoop, a distributed data-processing framework, and MapReduce, a programming model developed by Google that organizes diverse data from many sources into sets of key/value pairs. With Hadoop and MapReduce, Google can split a massive data collection into manageable chunks and process the chunks independently across a server farm.
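
To make the key/value idea concrete, here is a minimal single-machine sketch of the MapReduce pattern, using the canonical word-count example (the sample documents are invented for illustration). It only imitates the map, shuffle, and reduce phases that a real Hadoop or MapReduce deployment would spread across a server farm:

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit a (word, 1) key/value pair for every word in the document.
    for word in document.split():
        yield (word.lower(), 1)

def shuffle_phase(pairs):
    # Shuffle: group all emitted values under their key, as the
    # framework does between the map and reduce phases.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    # Reduce: collapse all counts for one key into a single total.
    return key, sum(values)

documents = ["big data is big", "data is data"]
pairs = (pair for doc in documents for pair in map_phase(doc))
counts = dict(reduce_phase(k, v) for k, v in shuffle_phase(pairs).items())
print(counts)  # {'big': 2, 'data': 3, 'is': 2}
```

As the article notes below, there is nothing revolutionary in these few lines; the value of the real systems lies in distributing the phases fault-tolerantly, not in the algorithmic idea itself.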

Does all this really hold up? Without that complex preprocessing, couldn't the original large data sets be managed just as easily and quickly by a relational database? Quite possibly.

Google's special needs

Is MapReduce really the crowned king of data-merging technology, a rewriting of the rules of the game? The answer is almost certainly no: the legitimacy of Google's patent on the technique has been questioned, and many existing products can achieve the same functionality in simpler ways. The basic MapReduce example Google has published online amounts to only a few dozen lines of Java code, and nothing in it reveals any revolutionary idea or breakthrough.

But let us assume that Google needed these tools to meet its own unique requirements; in other words, let us assume that no existing tool or database framework was enough to realize Google's technical ambitions. Even then, big data is clearly not a solution for every organization or for every massive computing application. Big data advocates have always believed it is, but we should not place such high expectations on these new databases and software models.

A large amount of data, even a proliferation of data, is nothing new. In investment banking, high-frequency trading systems have always had to handle many transactions within microseconds, and market data engines have for years been required to store and process thousands of price ticks per second.

Again, my friend Ken Caldeira, at the Carnegie Institution on the Stanford campus, is immersed in climate science; as you would expect, he often needs to deal with "petabyte-scale data." Another physicist colleague of mine, trained on Wall Street in data analysis, spent a long time in genome research after 2000; by his account, there was "a staggering amount of data to be analyzed" throughout the work.

In the big data era, datasets of unprecedented size are constantly cited, almost everyone is more or less in touch with them, and the previous generation of tools is said to have been powerless at such a scale.

But in most cases, Caldeira and my data-analyst friend are still using ... Python scripts and C++ to solve the problem. Yes, many big data users now work on massively parallel architectures, clusters, and cloud computing, but that has been happening for more than 10 years, and as my friends point out, "people often don't notice the difference in what they do in the cloud, because the cloud environment does not explicitly distinguish the contributions of different developers. Leveraging distributed databases for faster, more reliable redundancy matters to every user, at least because it helps us significantly compress existing hardware costs."
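
For a sense of what those plain scripts look like, here is a minimal sketch (the file name and chunk size are assumptions for the example) of processing a text file far larger than memory with nothing but standard Python, reading fixed-size chunks and aggregating as it goes:

```python
from collections import Counter

def count_words_in_large_file(path, chunk_size=1 << 20):
    """Stream a large text file in ~1 MB chunks and tally word counts
    without ever loading the whole file into memory."""
    counts = Counter()
    leftover = ""  # carry a partial word across chunk boundaries
    with open(path, encoding="utf-8") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            text = leftover + chunk
            words = text.split()
            # If the chunk ends mid-word, keep the tail for the next round.
            if not text[-1].isspace():
                leftover = words.pop() if words else ""
            else:
                leftover = ""
            counts.update(words)
    if leftover:
        counts[leftover] += 1
    return counts

# counts = count_words_in_large_file("corpus.txt")  # hypothetical input file
```

Nothing here demands a new paradigm; the same loop scales to very large files, and spreading it across cores or machines is an engineering adjustment rather than a conceptual leap.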

Can you imagine the bank you trust calculating your account information from tweets and Facebook posts?

Another factor driving the shift toward big data algorithms is the explosive growth in the variety of data. As mentioned earlier, companies like Google and Facebook need to build profiles and statistics from resources of every kind, and the format of the information is even more troublesome than its volume. Of course, not every user faces such a problem. When people discuss this new, messy, unstructured data, they mostly mean information from social networks and blogging platforms.

Do the core systems used in banking (where old relational databases still dominate transaction processing) really need access to social media data? Do inventory systems, digital catalogs, or the systems used by cancer researchers? We also need to ask how big data technology is supposed to work when, for whatever reason, the data cannot be distributed and stateless.

Highly unstructured data still occupies a specialized and relatively small niche market, however eye-catching its performance and position may be. Unlike today's common systems, big data technologies do not require data merged from a variety of sources to be parsed, translated, or preprocessed up front.
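
That difference is essentially schema-on-write versus schema-on-read. A minimal sketch of the latter, with record shapes invented purely for illustration: heterogeneous records are stored as-is, and structure is imposed only at query time.

```python
import json

# Heterogeneous records stored exactly as they arrived (schema-on-read);
# these shapes are invented for illustration.
raw_records = [
    '{"user": "alice", "tweet": "big data!"}',
    '{"customer_id": 42, "purchase": "book", "price": 12.5}',
    '{"user": "bob", "status_update": "hello"}',
]

def identities(records):
    # A relational system would demand one fixed schema before loading;
    # here we interpret whichever field identifies a user, per record,
    # only when the query actually runs.
    for raw in records:
        rec = json.loads(raw)
        who = rec.get("user", rec.get("customer_id"))
        if who is not None:
            yield who

print(list(identities(raw_records)))  # ['alice', 42, 'bob']
```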

If a company suddenly decides that it needs big data technology to push its business further, that implies a fundamental shift in the business itself and the start of a path completely different from the one it followed before; clearly, such an assumption is hard to sustain even in extreme cases.

Make your system scalable, and big data can come around overnight

The big data concept is often overused or completely misunderstood. However fast the data in a particular application grows, that growth alone does not make it a big data application. What is needed is to scale the system, and the process is not that complicated: make some design adjustments, and if the system was designed to scale in the first place, even those are unnecessary.
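
One example of such a design adjustment, as a hedged sketch (the node names and key are hypothetical): partition records across machines by hashing a key, so that capacity grows by adding nodes rather than by adopting a new paradigm.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]  # hypothetical server pool

def node_for(key: str) -> str:
    # Route each record to a node by hashing its key: a classic
    # sharding adjustment that spreads load across the pool.
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]

print(node_for("account:12345"))  # the same key always maps to the same node
```

In practice one would prefer consistent hashing, so that adding a node moves only a fraction of the keys; either way, this is ordinary system design, not a new discipline.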

Computer-based text analysis has been developing for a long time. I remember that back in the 1970s, scholars were analyzing Shakespeare's plays, hoping to find the frequencies and patterns of particular words; I was fascinated. If today's big data has brought any breakthrough here, it is to extend that work to ever larger bodies of text.
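
The underlying technique has barely changed since then. A minimal sketch of that kind of frequency analysis (two lines of Hamlet stand in for a full corpus):

```python
import re
from collections import Counter

text = """To be, or not to be, that is the question:
Whether 'tis nobler in the mind to suffer"""  # stand-in for a full play

# Tokenize into lowercase words and tally frequencies; the method is
# the same whether the corpus is one play or millions of pages.
words = re.findall(r"[a-z']+", text.lower())
for word, count in Counter(words).most_common(3):
    print(word, count)  # to 3, be 2, the 2
```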

Had these volumes of text appeared 20 years ago, when we were still groping our way, the work would have been impossible; scientists would have shaken their heads and said, "We have the technical foundation, but we can't actually pull it off." Even today, though I don't know the details of this analytical work, I find it hard to believe that researchers are building their algorithms in ways completely different from traditional sorting and searching. It is fair to say that big data has brought no groundbreaking achievement in this respect.

If DVD rental companies could have captured such details in the '90s, they would gladly have collected and analyzed them. The unsettling question this trend raises is how a movie rental company will use the data, how the data can be turned into profit, and how it will affect cooperation between companies. These figures are like pieces of a jigsaw puzzle, and the ability to assemble the puzzle gives an enterprise the absolute initiative.

"Data" does not become the root of all evils overnight, at best, it is another important resource of the new era. We should not be too superstitious about big data, abandon the existing data technology without any problem, and should not push all the problems into the big data. It is clear that traditional technology will not be "outdated" and the new technology can not be hoodwink.
