Analysis: How Can Big Data Develop in Depth?

Source: Internet
Author: User
Keywords: big data

Big data locked inside walls is dead data. Big data needs open innovation: from opening, sharing, and trading the data itself, to opening up value-extraction capability, to opening the underlying processing and analysis platforms. Only then can data flow like blood through the body of the data society, nourish the data economy, and let more long-tail enterprises and innovators with data thinking produce rich chemistry, creating a golden age of big data.

My Big Data Research Trail

I spent 4-5 years on mobile architecture and the Java virtual machine, another 4-5 years on many-core architecture and parallel programming systems, and in the most recent 4-5 years I have followed the trends: first the Internet of Things, and in recent years big data. Our team's big data research trajectory is shown in the following illustration:

From 2010 to 2012 we focused on the relationship between data and machines: horizontal scaling, fault tolerance, consistency, and hardware-software co-design, while sorting out the various computing paradigms, from batch processing (MapReduce) to stream processing, big SQL / ad hoc queries, graph computation, and machine learning. Our team is only one part of Intel's big data effort; the Shanghai team was the backbone of Intel's Hadoop distribution. Intel is now the largest shareholder in Cloudera and no longer ships its own distribution, but platform optimization, open source support, and vertical solutions remain the focus of Intel's big data research and development.

Starting from 2013 we turned to the relationship between data and people. For data scientists: distributed machine learning, feature engineering, and unsupervised learning. For domain experts: interactive analysis tools. For end users: interactive visualization tools. The Intel-supported research center at Carnegie Mellon University produced GraphLab and Stale Synchronous Parallelism; the research center at MIT built interactive visualization and big data analytics on SciDB; in China we worked mainly on Spark SQL and MLlib (the machine learning library), and are now also involved in deep learning algorithms and infrastructure.

In 2014 the focus shifted to the relationship between data and data. Our original work centered on open source, but we later realized that open source is only one part of open innovation. Open innovation in big data also requires opening up the data itself, opening up large-scale data infrastructure, and opening up value-extraction capability.

The Dark Sea of Data and Externalities

Here is an interesting picture. The yellow part is fossil data: data that is neither networked nor digitized, and most data lies in this sea. Only the data at the sea surface (what some call the Surface Web) is truly accessible to everyone; crawlers can crawl it and search engines can index it. Most of the rest lies in the dark sea (the so-called Dark Web), said to account for more than 85% of all data, sitting on the sea floor within isolated islands, inside enterprises and governments.

Data is to the data society what water is to a city or blood to a body. Cities are born beside rivers and nourished by them; once the blood stagnates, the body is in danger. For a society that calls itself a data society to survive, data must flow; otherwise that society loses many important functions.

We therefore hope data can meet like "golden wind and jade dew" (a Chinese allusion to a rare, felicitous encounter) and produce chemistry. Mr. Ma proposed the concept of "Internet+", and Intel likewise speaks of "Big Data ×": big data multiplied into every industry. As the following figure shows, data has a very interesting property called externality: data that is useless to me may be useful to you, my poison, your honey.

For example, when financial data collides with e-commerce data, internet finance such as microloans emerges. Where telecom data meets government data, it yields demographic insight that helps cities plan where people live, work, and play. For financial data combined with medical data, McKinsey cites many applications. Logistics data plus e-commerce data reveals the workings of various economic sectors; logistics data plus financial data produces supply-chain finance; and financial data can also react chemically with agricultural data. For instance, a team of former Google engineers used open U.S. meteorological data to build a micro-weather model for each plot of farmland, which can predict disasters and help farmers with insurance and claims.

Therefore we must take the road of data openness, so that data in different fields truly flows and combines to release the value of big data.

Three Concepts of Openness

1. Open Data

First is data openness in the narrow sense. Its main actors are governments and research institutions, which open up non-sensitive government data and scientific data. Some companies are now willing to open data too, like Netflix and some telecom operators, to help appreciate the value of their data and build ecosystems. But data openness is not the same as information disclosure. First, data is not information; information is extracted from data, and we want the raw data opened first. Second, openness should be active and free, whereas today one often must apply for disclosure, which is passive openness.

Tim Berners-Lee proposed a five-star scheme for open data to ensure its quality: one star means an openly licensed format, even a PDF; two stars means structured data, turning a document into a table such as an Excel file; three stars means a non-proprietary open format such as CSV; four stars means each data item can be located by a URI; five stars means the data links to other data, forming an open data graph.
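
The five-star scale is cumulative: a dataset only earns a star if it also meets every lower level. A minimal sketch of that logic (the metadata field names here are invented for illustration, not part of any standard API):

```python
# Toy classifier for Tim Berners-Lee's five-star open-data scale.
# Field names like "open_license" are hypothetical metadata keys.

def open_data_stars(meta):
    """Return the star rating (0-5) for a dataset described by `meta`."""
    checks = [
        meta.get("open_license", False),          # 1: open license, any format (even PDF)
        meta.get("structured", False),            # 2: structured data (e.g. Excel, not a scan)
        meta.get("open_format", False),           # 3: non-proprietary format (e.g. CSV)
        meta.get("uris_for_items", False),        # 4: each data item addressable by URI
        meta.get("linked_to_other_data", False),  # 5: linked to other datasets
    ]
    stars = 0
    for ok in checks:
        if not ok:
            break  # cumulative: a missing level caps the rating
        stars += 1
    return stars

csv_dump = {"open_license": True, "structured": True, "open_format": True}
print(open_data_stars(csv_dump))  # → 3
```

Note that an openly licensed CSV with no URIs stops at three stars, exactly the situation of most of today's government data portals.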

Today's mainstream data portals, like data.gov or data.gov.uk, are based on open source software. Intel's big data research center at MIT has also built a platform called DataHub. Its mascot is amusing: half elephant, representing database technology, and half octopus, borrowed from GitHub's Octocat. DataHub provides more functionality than hosting: manageability, structured data services and access control, managed data sharing, and in-situ visualization and analysis.

Data openness in the broad sense includes data sharing and trading, such as peer-to-peer data sharing or data transactions on a multilateral platform. Marx said ownership is the foundation of the economy, but today leasing the means of production is becoming mainstream (see "Lean Startup"). In a data setting, I need not own the data, or even the whole dataset; I can rent it, and the leasing process must protect the rights over the data.

First, "usable but not visible": I let you compute on my data without letting you see it. In 1982 Andrew Yao posed the "Millionaires' Problem": two millionaires want to know who is richer, but neither is willing to reveal how much money he has. This is the classic "usable but not visible" scenario. Real life offers many examples. The U.S. Department of Homeland Security has a terrorist watch list (dataset 1); airlines have passenger flight records (dataset 2). DHS asks the airlines for the flight records, and the airlines refuse on privacy grounds; the airlines ask DHS for the watch list and are refused because it is a state secret. Both sides want to find the terrorists, but neither will hand over its data. Is there a way to let dataset 1 and dataset 2 be matched against each other while still protecting both?

Second, the use of data must be auditable: what if the computing party secretly sends the data back out? Moreover, a pricing mechanism is needed. The data on the two sides is rarely of equal value, and the insights produced serve different purposes for each side, so pricing is required; it provides stronger incentives than egalitarian "big-pot" sharing.

From peer-to-peer sharing we move to multilateral data trading; from one-to-many data services to many-to-many data markets, and on to data exchanges. If today's data markets mostly buy and sell whole datasets, a data exchange is a market built on value discovery and pricing, like a stock exchange: small lots, high-frequency data transactions.

We have supported several studies to implement the capabilities mentioned above, such as "usable but not visible". The first case is the encrypted databases CryptDB and Monomi: on the data owner's side the database is fully encrypted, which also prevents many of today's data leaks. We have all heard of cases like an internet service provider's employee secretly selling data; once the data is encrypted, that no longer works. Second, such an encrypted database can still run the other party's ordinary SQL, because it uses homomorphic encryption and onion encryption, so part of SQL's semantics can be executed directly on ciphertext.
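
Homomorphic encryption is what lets a query run on ciphertext: certain operations on encrypted values correspond to arithmetic on the plaintexts. A minimal sketch of the additively homomorphic Paillier scheme, with tiny fixed primes chosen purely for illustration (real keys are 2048 bits or more, and a real system would use an audited crypto library):

```python
# Minimal Paillier sketch: multiplying ciphertexts adds plaintexts,
# so an untrusted server can compute an encrypted SUM.
import math
import secrets

p, q = 293, 433                 # toy primes, illustration only
n = p * q
n2 = n * n
g = n + 1
lam = math.lcm(p - 1, q - 1)
# With g = n + 1, L(g^lam mod n^2) = lam, where L(x) = (x - 1) // n.
mu = pow((pow(g, lam, n2) - 1) // n, -1, n)

def enc(m):
    while True:
        r = secrets.randbelow(n - 1) + 1
        if math.gcd(r, n) == 1:   # r must be invertible mod n
            break
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def dec(c):
    return ((pow(c, lam, n2) - 1) // n) * mu % n

# The server can SUM a column without ever seeing the plaintexts:
salaries = [enc(30), enc(45), enc(25)]
total_ct = math.prod(salaries) % n2   # ciphertext product = plaintext sum
print(dec(total_ct))  # → 100
```

This is why, in the CryptDB style of system, an aggregate like SUM can be answered by the database even though every stored value is ciphertext; only the key holder can decrypt the result.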

For the "Millionaires' Problem" we built another "usable but not visible" technology called the Data Café. Cafés are known as places where people collide ideas; the Data Café is a place where data collides with data to produce new value.

For example, take two e-commerce merchants, one selling clothes and one selling cosmetics. Each has only limited insight into its customers, but if the two datasets are analyzed together, a fuller picture of the shared users emerges. Or consider cancer, a long-tail disease: there are too many gene mutations, and each research institution's genome samples are limited, which partly explains why the cancer cure rate has risen only 8% in the past 50 years. If the data of several research institutions could meet in the café, cancer research would accelerate.
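
At its simplest, the "collision" of the two merchants' data is an inner join on a shared customer key, performed inside the protected café environment rather than by handing datasets over. A toy sketch (the user ids and attribute names are invented):

```python
# Toy "data collision": inner-join two customer tables on user id,
# merging each side's attributes into one fuller profile.
clothes = {"u1": {"style": "outdoor"}, "u2": {"style": "formal"}}
cosmetics = {"u2": {"skin": "dry"}, "u3": {"skin": "oily"}}

def collide(a, b):
    """Return merged records for users present in both tables."""
    return {uid: {**a[uid], **b[uid]} for uid in a.keys() & b.keys()}

print(collide(clothes, cosmetics))  # → {'u2': {'style': 'formal', 'skin': 'dry'}}
```

In the real Data Café this join would run under multi-party secure computation, so neither merchant sees the other's non-shared customers.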

Underneath the café is multi-party secure computation, based on joint research by Intel and Berkeley. On top of it runs a secure, trusted Spark, with auditing based on data lineage and pricing based on each dataset's contribution to the result.

2. Open Big Data Infrastructure

There are now people with big data thinking who are frustrated: they cannot get into the game because they do not know how to store or process data at this scale. This calls for cloud computing. Opening up infrastructure is the traditional Platform as a Service: Amazon AWS has Elastic MapReduce, and Google has BigQuery. These big data processing and analysis platforms lower the threshold for data thinkers and release their creativity.

Decide.com, for example, crawls hundreds of thousands of product pages a day and analyzes price information, structured and unstructured, to tell you which brand to buy and when the price is best. Only four PhDs work on the algorithms; everything else is handled by AWS. Another company, Prismatic, a personalized reading recommendation service, also runs on AWS. I have studied its computation, storage, and high-performance libraries, beautifully written in Clojure, a Lisp variant, with only three engineers actually doing the technology.

So once these infrastructures are socialized, the springtime of big data thinkers will soon arrive.

3. Open Value-Extraction Capability

The current model is generally one-to-one or one-to-many. Tesco and Dunnhumby are an example: Dunnhumby was a very small company when Tesco engaged it to build its customer loyalty program, a collaboration that has lasted for decades. Such long-term strategic cooperation produces more serious, long-range decisions than short-term data analysis services. Of course, Dunnhumby is no longer a small company and now provides data analysis services to other large companies as well. Walmart likewise worked with a small data analysis company and eventually acquired it, turning it into Walmart Labs.

A typical one-to-many model is Palantir, founded by Peter Thiel and several Stanford professors. Still private, it is valued in the billions and excels at providing data value-extraction services to governments and financial institutions. The real opening of this capability, though, is Kaggle. It is two-sided: on one side more than 100,000 analysts, on the other the enterprises with needs. Enterprises post tasks on Kaggle and analysts bid for the work. This may be the real answer to long-tail companies' value-extraction needs. Of course, it would be even better combined with our Data Café.
