Intel's Wu Gansha: The Context of Big Data Development

Source: Internet
Author: User
Keywords: big data

On the morning of the 26th, Wu Gansha, chief engineer of the Intel China Research Institute, delivered a speech titled "Big Data Development: Seeing Yourself, Seeing Heaven and Earth, Seeing All Beings." In it, Wu pointed out that the next wave of the technology revolution is ready, and that big data patterns can be divided into three categories. The first is seeing yourself: as Socrates said, know thyself. The second level is seeing heaven and earth: looking beyond yourself to the world around you, to understand communities and social behavior. The third is seeing all beings, that is, heaven, earth, nature, and everything in them; as the saying goes, all sentient beings have Buddha nature, and this refers to the laws of heaven, earth, nature, and all things.

In his speech he presented the software-defined city of the DRAGON era, DRAGON standing for Data-driven, Resilient, Automated, Gamified, Open, and Networked. A new big data ecosystem and service model, together with new technologies for data collection, storage, management, computation, and security, are the inevitable road to the DRAGON era. The new big data mindset includes the rapid depreciation of data over time, the idea that the accuracy of any individual data point is no longer critical, and abandoning the worldview that "data is a scarce resource."

In addition, he suggested that the future smart city public data and services platform should contain three layers: an urban operating system at the bottom, a data trading market in the middle, and a city app store at the top, all of which require the relevant technologies to be realized.

The following is the full text of the speech:

Wu Gansha: Good morning! It is a great honor to be on this stage. My title today is "The Thread of Big Data Development: Seeing Yourself, Seeing Heaven and Earth, Seeing All Beings." I believe many of you will agree that these three realms are among the most brilliant lines of the grandmaster. The organizing committee asked me to talk about how big data changes our lives, our work, and our thinking, so I put up this rather metaphysical title. I know I have dug myself a big hole; whether I can fill it in, and whether the title oversells the talk, please forgive me.

Speaking of myself, I have been at Intel for more than ten years. In the first four or five years I mainly worked on virtual machines, compilers, and mobile architecture; in the middle four or five years on multi-core and many-core architectures and parallel computing; and in recent years on distributed systems such as the Internet of Things and big data. As you can see, from mobile to multi-core and many-core to distributed systems, at each stage we identified a relatively long-term trend, took that trend as a conviction, and worked on it obsessively for four or five years. I very much agree that big data is a very exciting opportunity, and we also take it as our most important conviction. Why do I say that? Let me show you the macro pattern of technological revolutions. In human history there have been three technological revolutions. The first lasted 50 years and achieved mechanization. The second lasted a whole century, marked by electrification. The third is the most far-reaching revolution in human history: the emergence of information technology and its interplay with other industries.

Kondratiev of the former Soviet Union discovered the theory of long waves. Although the man himself fell victim to revolution in the Soviet Union, his fourth wave coincides remarkably well with our third technological revolution. So there is reason to believe that if 2008 marked the end of the fourth wave, we are now at the beginning of the fifth long wave, and there is reason to believe that we now face a 3.5th or fourth technological revolution. The climax of the next wave is on the horizon.

Look at the smaller cycles within the information revolution. We believe the information technology revolution has gone through three cycles. The first was architecture, represented by IBM's System/360 mainframe, which gave us compatible instruction sets, operating systems, and high-level language compilers. The second period was digitization; the third was networking, making information accessible to everyone. Now we have reason to believe something new is happening, a fourth cycle whose keywords, we think, are the mobile Internet, IoT, cloud computing, and big data; these will be the main thread of the fourth scientific revolution. We believe these four technologies are not separate, and I will show later how they are related.

So what exactly is big data? As I was just saying to our colleague from IBM, it is certainly not a database; databases are part of it, but big data is a way of thinking, and also a strategy, something combined with applications at the business level. I divide big data patterns into three categories. The first is seeing yourself: as Socrates said, you need to know yourself. The second level is seeing heaven and earth: looking beyond yourself to the world around you, to understand communities and social behavior. The third is seeing all beings, that is, heaven, earth, nature, and everything in them; as the saying goes, all sentient beings have Buddha nature, meaning the laws of heaven, earth, nature, and all things. Let us look at these three aspects in turn. First, seeing yourself. Christianity has a saying that every walk leaves footprints, and we constantly leave footprints on the Internet. For example, Peking University has visualized Weibo, Tsinghua has extracted Weibo keywords, and Prismatic has done micro-gossip; Coursera uses your interests and behavior to help you learn online; Klout is a social-influence platform that can compute your social influence, so that with a score above forty or fifty points you can, for example, enjoy airport VIP lounges for free. So that is the first aspect. The second aspect is the state of each person's mental health. The third is your consumer behavior: FICO, the American consumer credit scoring company, openly claims "I know what you will buy tomorrow," which extends to precision marketing, so-called nano-targeting.

Based on these new ideas, we need new methodologies. Of course, these methodologies are not my invention; many of them appeared first in theoretical essays and have recently been elaborated further. The first is moving from sampled data to the complete data set. At the first level, we must make comprehensive data collection a habit. At the second level, we must avoid subjectivity in data collection. A foreign author wrote a book arguing that "raw data" is itself an oxymoron, because collection carries subjective intent, so we should try to avoid it. How? We insert the collection points into the infrastructure through tools, not people. The third level: once you collect everything, you have to solve the storage problem.
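To make "tools, not people" concrete, here is a minimal Python sketch (the function and event names are hypothetical) of a collection point planted in the infrastructure itself: a decorator records every invocation uniformly, so no human decides case by case what is worth logging.

```python
import functools
import json
import time

def instrument(handler):
    """Wrap a handler so every invocation is recorded automatically.

    The collection point lives in the infrastructure (this decorator),
    not in the head of whichever engineer decides what to log.
    """
    @functools.wraps(handler)
    def wrapper(*args, **kwargs):
        record = {"event": handler.__name__, "ts": time.time()}
        result = handler(*args, **kwargs)
        record["duration_s"] = time.time() - record["ts"]
        # In a real system this would go to a collector, not stdout.
        print(json.dumps(record))
        return result
    return wrapper

@instrument
def lookup_route(origin, destination):
    return f"{origin}->{destination}"

lookup_route("Xizhimen", "Zhongguancun")
```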

The second is the integration of multiple data sources. We have many data sources: how do we integrate them with data-fusion algorithms, and how do we extract semantics from unstructured data? And if the data sources are spread across regions, such a geo-distributed, multi-center system is not the same as the distributed system inside a single data center, so how do I integrate multiple data sources across data centers?
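As a toy illustration of fusion plus crude semantic extraction (all data and keyword lists invented), consider joining a structured GPS feed with unstructured driver notes on a shared key:

```python
# Join two hypothetical sources on a shared key and pull crude
# "semantics" (keywords) out of the unstructured field.
taxi_gps = {"car_42": {"lat": 39.9, "lon": 116.4}}
driver_notes = {"car_42": "heavy congestion near the Third Ring Road"}

CONGESTION_WORDS = {"congestion", "jam", "blocked"}

def fuse(car_id):
    note = driver_notes.get(car_id, "")
    return {
        "car_id": car_id,
        **taxi_gps.get(car_id, {}),
        # Naive semantic extraction from unstructured text.
        "congested": any(w in note for w in CONGESTION_WORDS),
    }

print(fuse("car_42"))
```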

Third, big data plus a simple algorithm can be more meaningful than small data plus a complex algorithm. This has been confirmed in many areas: in machine translation, in search, and in the currently very popular deep learning, we find that when the data set is large, the algorithm can be simple and yet the results can be better. If the algorithm can also draw on context and accumulated knowledge, the results are better still. Google's search, for example, was originally based on statistics, but once it added the Knowledge Graph's capabilities, the results improved.
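One hedged way to see the effect: even the simplest possible learner, 1-nearest-neighbour on a synthetic noisy task, tends to improve as the training set grows, with no change to the algorithm at all. Everything below is invented for illustration.

```python
import random

random.seed(0)

def sample(n):
    """Synthetic task: label is 1 when x > 0, with 10% label noise."""
    pts = []
    for _ in range(n):
        x = random.uniform(-1, 1)
        noisy = random.random() < 0.1
        y = (x <= 0) if noisy else (x > 0)
        pts.append((x, int(y)))
    return pts

def nn_predict(train, x):
    # The simplest possible algorithm: 1-nearest-neighbour.
    return min(train, key=lambda p: abs(p[0] - x))[1]

test = sample(1000)
for n in (10, 100, 5000):
    train = sample(n)
    acc = sum(nn_predict(train, x) == y for x, y in test) / len(test)
    print(f"train size {n:>5}: accuracy {acc:.2f}")  # tends to rise with n
```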

There is also the relationship between correlation and causation. This idea has now appeared in many places: people say focus on correlation, never mind causality. It is not that we should never ultimately pursue causal relationships, but consider our traditional scientific attitude: upon seeing a correlation, I want to understand why, so I propose a hypothesis, build a model, and then verify the model, which introduces considerable subjectivity. In this era, I first try to find the correlation without considering cause and effect; find the correlation first, then study the causation. A man in the United States invented shotgun gene sequencing. He did not wait to spot a new species and then sequence it; he went directly to the sea, directly to the air of New York, where he could find millions of new gene fragments, and then compared those new fragments against existing organisms to obtain the correlations. This reminds me of the recent avian flu: we could have measured the air in the markets directly, so why rely on sampling? This way of thinking is very important.
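A minimal sketch of correlation-first thinking: compute a plain Pearson correlation between two series and treat a strong value only as a lead worth investigating, not as proof of cause. The two daily series below are invented for illustration.

```python
import math

def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical daily series: water-usage peak hour vs. commute delay.
water_peak = [7.1, 7.3, 6.9, 7.6, 7.4, 7.0, 7.5]
jam_minutes = [34, 39, 30, 46, 41, 31, 44]

r = pearson(water_peak, jam_minutes)
print(f"correlation r = {r:.2f}")  # a strong r flags a lead to study next
```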

Another point is descriptive versus predictive analysis. Our original reports and analyses were descriptive: I want to know what happened in the past and why; in the best case, I can understand what is happening right now. Predictive analysis looks forward: I want to know what will happen in the future. There is even prescriptive analysis: given what I want to happen in the future, what should I do to make it happen?
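The three levels can be shown in a few lines. In this sketch the sales figures, the ad-spend uplift, and the target are all invented: descriptive analysis summarizes the past, predictive analysis extrapolates, and prescriptive analysis chooses an action to bring about a desired future.

```python
# Toy daily sales series (hypothetical numbers).
sales = [100, 104, 109, 113, 118, 122]

# Descriptive: what happened?
print("mean so far:", sum(sales) / len(sales))

# Predictive: what will happen? Fit a straight line by least squares.
n = len(sales)
xbar, ybar = (n - 1) / 2, sum(sales) / n
slope = sum((x - xbar) * (y - ybar) for x, y in zip(range(n), sales)) / sum(
    (x - xbar) ** 2 for x in range(n)
)
forecast = ybar + slope * (n - xbar)  # prediction for day n
print("forecast for tomorrow:", round(forecast, 1))

# Prescriptive: what action makes the desired future happen?
# Assume (invented) each unit of ad spend lifts sales by 3 units;
# choose the smallest spend that reaches a 130-unit target.
target, uplift = 130, 3
spend = max(0, (target - forecast) / uplift)
print("ad spend needed:", round(spend, 1))
```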

Another point: real-time matters more than absolute accuracy. As you know, shopping-basket analysis uses historical data to produce a relatively accurate analysis, but the problem is that when a user is shopping in a supermarket, the best moment to reach him is while he is still browsing and looking for things, not at the final checkout. So real-time is very important. That is the first large class: mindset and methodology.
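A sketch of the real-time point: mine co-occurrence counts offline from historical baskets, then score recommendations while the shopper is still adding items, event by event. The baskets and items are invented.

```python
from collections import defaultdict

# Offline: co-occurrence counts mined from historical baskets.
history = [{"milk", "bread"}, {"milk", "cereal"}, {"bread", "butter"},
           {"milk", "bread", "butter"}]
co = defaultdict(lambda: defaultdict(int))
for basket in history:
    for a in basket:
        for b in basket:
            if a != b:
                co[a][b] += 1

def recommend(cart):
    """Score items that co-occurred with what is already in the cart."""
    scores = defaultdict(int)
    for item in cart:
        for other, cnt in co[item].items():
            if other not in cart:
                scores[other] += cnt
    return max(scores, key=scores.get) if scores else None

# Online: act while the shopper is still browsing, not at checkout.
cart = set()
for picked in ("milk", "bread"):   # events arriving in real time
    cart.add(picked)
    print(picked, "->", recommend(cart))
```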

As you can see in practical applications, modern traffic management, for example, requires many data sources: some data comes from the Beijing monitoring and command center, some from more than one city's systems. Daily video and image data plus raw data amount to hundreds of GB. Then there is structured data: 18 million mobile-phone location records, 20 million taxi GPS records per day, 19 million transit-card records per day, plus highway toll data, and static data such as census data. Even domains that seem to have nothing to do with transportation can in fact yield correlations. Our water-supply system, for example, can identify the morning rush hour, and a similar smart system can identify the peak time of office lighting each evening, from which it can estimate the evening traffic-jam period. Even the quality of our sleep is related to the state of our transportation, and sentiment analysis on social networks is in fact related to our traffic. Integrating multiple data sources in this way achieves the maximum value.

Great value also brings new thinking. First, data is a raw material: if we are now in the early days of a new industrial revolution, the third industrial revolution, then its raw material is our data, so data has raw value. At the same time, if data is the crude-oil reserve, then the information extracted from it is the refined crude, so it also has refined, derivative value. Data is an asset. We used to say the corporate IT department simply does not make money, but if data becomes an asset, IT can become a profit center, because data has first-use value and reuse value. For example, a logistics company holds personal information, shippers' data, and a great deal of customer data. The initial idea is naturally to use that data well and become more efficient. But think again: the company can in fact reuse those values. With shippers' credit data it can offer loans to shippers, even secured against the goods in transit; it can understand the economic performance of each market segment and become a financial information company. So data can be reused. Finally, data is currency: it can be traded, since it acts as money.

What is the new methodology based on this new thinking? It may be a combination of data asset products and socialized analysis services. To achieve these, we must first consider the democratization of data. How do we democratize data so that everyone can access it? Our governments should take the first step and open up their data; in the United States, New York and Chicago already have open data, which shows that government should lead the way. Beyond the data governments provide for free, other data should be paid for, priced through a market: a data set might be priced by volume or by data type. Moreover, not everyone who owns data has the ability to analyze it, so analysis must be socialized as a service: let others analyze your data for you, under the premise that data ownership and other rights are protected. In fact, companies in the United States are already doing this.
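Pricing by volume versus by data type might look like the following sketch; the rates and type names are invented purely to show the two schemes side by side.

```python
# Hedged sketch of the two pricing schemes mentioned above.
RATE_PER_GB = 0.5                                        # by volume
TYPE_RATES = {"traffic": 2.0, "census": 0.2, "sentiment": 1.0}  # by type

def price(dataset_gb, data_type, by="volume"):
    """Price a data set either by its size or by its type."""
    if by == "volume":
        return dataset_gb * RATE_PER_GB
    return dataset_gb * TYPE_RATES.get(data_type, 1.0)

print(price(300, "traffic", by="volume"))  # 150.0
print(price(300, "traffic", by="type"))    # 600.0
```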

All of this brings a new big data ecosystem. The first role is the data owner, the second the data intermediary, the third the data technology company. Many traditional industry customers may be data owners, but there are many new players as well. Microsoft, for example, provides data products and services and also exchanges data, so it takes on the role of a data intermediary. And a company like Alibaba may take on all three roles.

In the smart city, how do we arrange such an ecosystem? We believe that in the smart city of the future, a public data and services platform will appear, and the bottom of that platform is the city's operating system. As you know, an operating system manages and schedules resources; in our cities there are likewise many distributed storage, interconnect, and computing resources, and many distributed sensor resources. An operating system also provides high-level abstractions: we have files, processes, threads, and semaphores, and in urban life there are streetlights, roads, and all kinds of grids, so the city operating system can build these high-level abstractions. The second layer is the data trading market: you need a data mart where everyone can put data up for trade to generate value, like the data markets I just mentioned in New York, Chicago, and Dublin. The third tier is the city's app store, with a wide variety of applications that connect your personal data, your environment, and your service data. This three-tier architecture requires mastering new technologies: at the IaaS and PaaS layers you need multiple paradigms; at the DaaS layer you need data pricing and rights protection; at the SaaS layer you have to connect the city, the government, and personal life. Compare this with the traditional big data technology stack, whose bottom is computing, interconnect, and storage; there have in fact been many new developments here, with computing moving from a single node to rack-scale computing, and standard servers giving way to customized servers, hardware accelerators, hardware-software co-design, and so on. The information and results of data processing can be presented for users to consume, and the question of data rights is a relatively new concept.
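Here is a toy sketch of the three layers; every class name and interface is hypothetical, intended only to make the layering concrete: the city OS mounts sensor resources behind a file-like abstraction, the data market prices and trades them, and a top-layer app consumes them.

```python
class CityOS:
    """Bottom layer: manages distributed sensor resources and exposes
    them through a file-like high-level abstraction."""
    def __init__(self):
        self.sensors = {}
    def mount(self, path, read_fn):
        self.sensors[path] = read_fn
    def read(self, path):
        return self.sensors[path]()

class DataMarket:
    """Middle layer: data sets are listed with a price and traded."""
    def __init__(self):
        self.listings = {}
    def list_dataset(self, name, price, source):
        self.listings[name] = (price, source)
    def buy(self, name):
        price, source = self.listings[name]
        return price, source()

city = CityOS()
city.mount("/sensors/streetlight/avg_on_hour", lambda: 19.5)

market = DataMarket()
market.list_dataset("lighting", 10.0,
                    lambda: city.read("/sensors/streetlight/avg_on_hour"))

# Top layer: a city "app" consumes the purchased data.
cost, value = market.buy("lighting")
print(f"paid {cost}, average lights-on hour: {value}")
```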

Let me start with some new considerations on this stack. We think a big data system must be optimized for its specific application, and such a system has to weigh three factors: total volume, accuracy, and real-time performance. Our present observation is that in many cases you can satisfy only two of the three, not the whole triangle. For example, batch computation can satisfy volume and accuracy, but not real-time performance. Complex event processing can satisfy real-time performance, but it can only process the data within a window, which is relatively small, while staying real-time. Interactive ad hoc queries can sample the data to return results within seconds. Incremental computation balances the three aspects relatively well: the historical data sits on one side while new data is continuously folded in to produce new value. Of course, incremental computation must be combined with in-memory computation to achieve lower latency. Small-data personal computation sits at one end of the spectrum, where accuracy can be complete; city-scale computation sits at the other end, where volume dominates. So you have to make design trade-offs.
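A minimal sketch of incremental computation combined with in-memory state: keep a running aggregate and fold in each new batch, so the answer is always available at low latency without recomputing over the full history. The speed values are invented.

```python
class IncrementalMean:
    """Keep a running aggregate in memory and fold in each new batch,
    instead of recomputing over all historical data."""
    def __init__(self):
        self.count = 0
        self.total = 0.0
    def add_batch(self, values):       # new data arriving continuously
        self.count += len(values)
        self.total += sum(values)
    @property
    def mean(self):                    # answer is always ready: low latency
        return self.total / self.count if self.count else 0.0

speeds = IncrementalMean()
speeds.add_batch([42.0, 38.5, 40.2])   # historical data, already folded in
speeds.add_batch([12.3, 15.8])         # fresh data: rush hour begins
print(round(speeds.mean, 2))
```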

Based on these design trade-offs, we have also built a complete stack. The stack of course includes Hadoop; if you keep three replicas of everything, it wastes a great deal of resources. There are SQL and ad hoc queries, as well as graph computation, which enable large-scale data analysis and visualization, and underneath, the whole thing rests on IA platforms. The Intel institute is involved in much of this work; Intel, for example, now has its own Hadoop distribution.

Now, who owns the data? Who may use the data? Who is actually using it, and where is the management boundary? None of this is clear. Should Google's road-condition database be open? Does our social media data belong to the poster or to the social network? Is a dashcam recording owned by the insurance company, the car maker, or the person? Is your electronic medical record the hospital's or your own? In fact, these rights are not particularly clear, so we now stress that data carries three kinds of rights: the first is ownership, the second is privacy, and the third is the right to know about usage.

First, we have to protect ownership, which requires both law and technology. The second is privacy. As you know, privacy and service form a kind of dialectic; the key is that we must control how private data is used, and this control relies on the right to know about usage: the data owner must be able to measure how the data is used, how it is transformed, where its lineage flows, and how much value it produces. It is rather like the GPL: if I wrote version 1.0 of an open-source program and someone else builds 2.0 and sells it, can I share part of the profit?
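The "right to know about usage" could be approximated with a lineage log, as in this hypothetical sketch: every transformation records who used whose data and for what purpose, so the owner can later audit usage and, in principle, claim a share of the value.

```python
lineage_log = []

def transform(dataset, fn, user, purpose):
    """Apply fn, but record who used whose data, and for what."""
    lineage_log.append({"owner": dataset["owner"], "user": user,
                        "purpose": purpose, "derived_from": dataset["name"]})
    return {"owner": dataset["owner"],
            "name": dataset["name"] + "->" + fn.__name__,
            "rows": fn(dataset["rows"])}

def credit_flags(rows):
    return [r > 100 for r in rows]

raw = {"owner": "shipper_007", "name": "waybills", "rows": [120, 95, 240]}
derived = transform(raw, credit_flags, user="finance_dept",
                    purpose="credit scoring")

for entry in lineage_log:
    print(entry)  # the owner can audit: who used my data, and why?
```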

In conclusion: we need to understand the new ecosystem, participate in it, and provide new service models on the road to the DRAGON era; and in the collection, management, storage, analysis, and security of big data, we need new technology.

Finally, let me end with this: as I said, these technologies are not separate. Big data is the foundation and the core; cloud computing is the operation, the way and the means; and the mobile Internet and IoT are what materialize the value of big data and cloud computing.

I will stop here today. Thank you!
