IBM's Li Yonghui: The Watson Big Data and Analytics Platform

Keywords: big data, IBM, Watson, BDTC, BDTC 2014

[CSDN Live Report] On December 12-14, 2014, the 2014 China Big Data Technology Conference (Big Data Technology Conference 2014, BDTC 2014) and the second CCF Big Data Academic Conference opened at the Crowne Plaza Hotel New Yunnan in Beijing. The event was sponsored by the China Computer Federation (CCF), organized by the CCF Big Data Expert Committee, and co-organized by the Chinese Academy of Sciences and CSDN, with the theme of promoting big data research, application, and industrial development.

At the plenary session on the first day of the conference, Li Yonghui, Distinguished Engineer in IBM Greater China's Systems and Technology Group, delivered a talk titled "IBM Watson Big Data and Analytics Platform: A Technical Overview." Watson is named after IBM founder Thomas J. Watson. In 2011, IBM's 100th anniversary, it competed on the TV quiz show Jeopardy! and won after several rounds. Watson is not a single machine but a cluster with 2,880 processor cores. Its goals are to answer questions posed in natural human language, understand large volumes of unstructured data, learn on its own, and respond in real time. It is being applied in healthcare, finance, cross-industry applications, and cloud services.


Li Yonghui, Distinguished Engineer, IBM Greater China Systems and Technology Group

The following is a transcript of the speech:

Good morning, ladies and gentlemen. It is a pleasure to attend the 2014 China Big Data Technology Conference today and to give a technical overview of the Watson system, a big data analytics platform. If you have not heard of Watson, there is a lot to tell. Watson is named after IBM founder Thomas J. Watson. In 2011, on IBM's 100th anniversary, it competed on the TV quiz show Jeopardy!, won after several rounds, and the prize money was donated to charity. That was how we celebrated our centenary with Watson. Besides introducing what platform and what technology this machine is built on, I will also look ahead at where we are going, especially for big data analytics platforms. A 100-year-old company continues to be at the forefront of big data, and IBM China has just marked its 30th year, so we are growing together with China.

Data processing at IBM has traditionally been programmatic, along with structured data analysis and the reporting tools developed over the past few decades. These approaches are now hitting bottlenecks, and the bottlenecks come from the rise of big data. When you have to deal with very large volumes of data, new methods such as data mining and association analysis are needed. Today we still write programs and SQL statements to do the analysis. As data volumes grow significantly, hand-written programs will not keep up: there is not enough time, and the data is too large. That is why people are interested in Watson. Beyond winning the second man-versus-machine contest, it opens the era of cognitive computing. In addition to handling traditional applications, it uses a self-learning mechanism: you do not need to tell the computer what to learn, because it mines and retains information for you automatically and updates its capabilities as the data changes.

We are moving from the programming era to the cognitive computing era, and from traditional search to active data mining. Traditional search engines return deterministic keyword matches. In the future, the system will offer probabilities backed by evidence as a reference for your decisions.

In the future, beyond unstructured data, the Internet of Things, connected cars, and even wearable devices that produce body data may supply more dimensions of data for analysis, alongside human natural language analysis. IBM's research in this area will continue, and future development will be multi-faceted.

What does Watson look like? Watson is not a single machine. It is a cluster of IBM Power systems. For the centenary man-versus-machine contest we brought together our best engineers, our research labs, and our hardware and software platforms to build it. The platform consists of 10 racks, five in front and five behind, with 16 TB of memory. A response is required within two or three seconds, so much of the computation and analysis runs in memory. As for the operating system, many of the big data toolkits we see today come from the open source community, so the operating system we run also carries some open source tools.

IBM also put its own research tools inside. One of the most important tools IBM has contributed to the industry is UIMA (Unstructured Information Management Architecture) for natural language analysis, and we use a highly parallel architecture to support it. We also built deep data analysis tools, adopted a cluster approach, and optimized the environment. That is a rough introduction to the platform.

From this platform, we can see how it is brought into practical use for customers in China who are interested in this kind of analysis. The Watson platform used to be based on POWER7. POWER8 has now been released, scaling from 8 CPUs to 128 CPUs, with roughly double the performance of the Watson-era systems and very large memory capacity. From POWER7 to POWER8 the clock reaches 4.35 GHz, the highest-frequency chip, and each core supports 8 concurrent threads. Big data involves a great deal of parallel computation, so throughput is very good.

Memory speed is critical in big data processing, and here it is four times that of today's Intel platform, which helps in-memory processing. Data is written directly to memory, and this is achieved in hardware rather than in the program. That is a brief look at the hardware differences. On TeraSort, a standard benchmark in the big data field, POWER8 is more than twice as fast as the fastest result published for Intel. Why did IBM build the Watson platform this way? The platform has to support our high-speed analysis.

Big data problems need new solutions today. With POWER8, IBM published an open standard that lets an adapter card plugged into the system connect directly to the CPU. It is an industry innovation and also an open standard. One customer uses it for keyword queries, a very common big data scenario. The original setup ran an open source tool on 24 machines, and growing data meant adding more machines. Today memory can be extended through flash. With one such card in a POWER8 machine attached to a flash unit, we provide 40 TB of memory space for data interaction. Where 24 machines were needed before, we use just one. One 2U machine plus a 2U flash unit replaces the original four racks, for about a threefold cost saving.

I have talked a lot about hardware innovation, but that is not all. The first talk this morning mentioned that the future direction is open source. IBM Power is now open as well. Our alliance is called the OpenPOWER alliance, with 65 companies worldwide taking part, including Google, which has developed its own POWER8-based machine for future internal use. Eleven of the member companies are in Greater China. We are open to the world, and the Chinese government has shown great interest. In the past few months we have received a lot of support from it. Two months ago in Suzhou, together with Yang Xueshan, Vice Minister of Industry and Information Technology, we announced the establishment of China's POWER technology ecosystem alliance. In the future we may see Power chips produced in China. This is a truly open platform.

So much for hardware. Now let me turn to Watson's software. Software needs a benchmark: how do you tell good from bad? We developed Watson to compete on the quiz show Jeopardy!. Do not assume the questions are simple, where you ask and I give a clear answer. The clues hide a lot of puns, and to answer we have to understand what the whole question is really asking. Being confident and answering very quickly is very hard.

In our analysis, if I want to design a Watson machine that beats humans, I first need to know how humans perform. This chart plots the results of Jeopardy! games: red dots are winners, and gray dots are contestants who played but lost. The cluster of red dots is what we call the winners' region. To build a machine that beats people, I have to raise its performance into that red region. We started in 2006 with the first generation of our QA system and spent four years developing it into Watson, step by step. At first the curve was far from the winners' region. As for the axes, the X axis is the percentage of questions answered: if there are 10 questions in a game and you attempt all 10, that is 100%. Precision is the share of attempted answers that are correct, with 100% meaning every answer was right. As you can see, human contestants perform very well, so the machine needed a great deal of optimization to reach that level.
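The two axes described above can be illustrated with a small calculation. The following is a minimal sketch, not Watson's evaluation code: it assumes each question has a confidence score and a correct/incorrect flag, and the `sample` data and the function name `precision_at_coverage` are made up for illustration.

```python
from typing import List, Tuple


def precision_at_coverage(results: List[Tuple[float, bool]], coverage: float) -> float:
    """Precision when only the most confident `coverage` fraction is answered."""
    if not 0 < coverage <= 1:
        raise ValueError("coverage must be in (0, 1]")
    # Rank questions from most to least confident.
    ranked = sorted(results, key=lambda r: r[0], reverse=True)
    # Number of questions actually attempted (at least one).
    answered = max(1, round(len(ranked) * coverage))
    correct = sum(1 for _, ok in ranked[:answered] if ok)
    return correct / answered


# Hypothetical results: (confidence, was the answer correct?)
sample = [
    (0.95, True), (0.90, True), (0.80, True), (0.60, False), (0.55, True),
    (0.40, False), (0.30, False), (0.20, True), (0.10, False), (0.05, False),
]

for pct in (0.3, 0.5, 1.0):
    print(f"answered {pct:.0%}: precision {precision_at_coverage(sample, pct):.0%}")
```

With this made-up data, precision falls as more questions are attempted, which is why a system may prefer to answer only when it is confident.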

How does Watson implement quiz-game analysis in software? We use a technology called DeepQA. It analyzes the nature of the question, distributes the work across many machines, performs analysis, search, and comparison in parallel, and combines the results into one answer. A question is decomposed into many semantic components. Semantic analysis captures the important words, and these generate a lot of information for the next stage of analysis. In this process data produces more data, which produces still more data, so it is not surprising for a single question to end up generating 100,000 pieces of data.

The difficulty is that I have to buzz in within seconds, or someone else takes the question. When we were developing Watson we did a comparison: at first, one question took two hours to analyze. In the end we deployed on a cluster of 2,880 POWER7 cores to bring the response down to seconds. The answering process works like this. I analyze the keywords in the question and use them to run searches. From the search results I find the simplest answers, the candidate answers. For each candidate, the work is split across machines to search for supporting evidence and check its relevance. Based on relevance, each candidate gets a score, and the scores tell the machine whether to answer. If my confidence is high I answer the question. If not, I stay silent, because a wrong answer loses points. That is the basic process.
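The steps just described (keywords, candidate answers, evidence scoring, and a confidence threshold) can be sketched in a few lines. This is a toy illustration under stated assumptions, not DeepQA itself: the two-entry `CORPUS`, the word-overlap score, and the 0.5 threshold are all hypothetical stand-ins for Watson's far richer search and scoring.

```python
from typing import Dict, List, Optional, Tuple

STOP_WORDS = {"the", "a", "an", "of", "is", "this", "in", "was", "who", "what"}

# Hypothetical toy corpus: candidate answer -> supporting passage.
CORPUS: Dict[str, str] = {
    "Deep Blue": "ibm chess computer that defeated the world champion in 1997",
    "Watson": "ibm question answering system that won a quiz show in 2011",
}


def extract_keywords(question: str) -> List[str]:
    """Lowercase the question and drop punctuation and stop words."""
    words = [w.strip("?,.!").lower() for w in question.split()]
    return [w for w in words if w and w not in STOP_WORDS]


def score_candidates(keywords: List[str]) -> List[Tuple[str, float]]:
    """Score each candidate by the share of keywords found in its evidence."""
    scored = []
    for candidate, passage in CORPUS.items():
        passage_words = set(passage.split())
        hits = sum(1 for k in keywords if k in passage_words)
        confidence = hits / len(keywords) if keywords else 0.0
        scored.append((candidate, confidence))
    return sorted(scored, key=lambda c: c[1], reverse=True)


def answer(question: str, threshold: float = 0.5) -> Optional[str]:
    """Return the best candidate, or None (stay silent) if confidence is low."""
    best, confidence = score_candidates(extract_keywords(question))[0]
    return best if confidence >= threshold else None


print(answer("This IBM chess computer defeated the world champion in 1997"))
print(answer("Who painted the Mona Lisa?"))
```

Returning `None` models the decision not to buzz in, since a wrong answer costs points.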

The technology just mentioned has UIMA as one of its core parts. We also know that customers want vendor-supported products alongside open source technology for their internal big data analysis. Around UIMA, combined with speech analysis and simple image retrieval, we package a one-stop series of services: data is crawled through data access, parsed and analyzed, and the results are finally combined and analyzed again.

Although Watson does its computation in memory to be fast enough, think about it: I have to educate that machine. It needs training to answer questions, and every day a great deal of new data has to be loaded into it, so how do I manage that? Customers doing big data run into another problem. I often hear: open source is great, we bought a lot of machines to run it, we add a machine this year and another next year, and then new models come out and we want to buy those too. We often see poor resource utilization, so how to mobilize resources is also a problem. IBM sees this too, and we provide a scheduling platform that supports open source tools in addition to its own workloads. Open source programming is a clear trend, and the platform can package a number of open source tools together and allocate resources effectively. You just submit a job, and the system looks at which resources are most idle and dispatches the job there. This multi-tenant solution helps customers share resources efficiently across many projects and many users.
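The dispatch rule described here, sending each submitted job to whichever resource is most idle, can be sketched as follows. This is a simplified illustration and not IBM's scheduler: the `Node` class, the slot counts, and the node names are assumptions, and a real multi-tenant scheduler would also handle priorities, quotas, and job completion.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    name: str
    total_slots: int
    jobs: List[str] = field(default_factory=list)

    @property
    def free_slots(self) -> int:
        return self.total_slots - len(self.jobs)


def submit(job: str, nodes: List[Node]) -> str:
    """Place the job on the node with the most free slots and return its name."""
    target = max(nodes, key=lambda n: n.free_slots)
    if target.free_slots <= 0:
        raise RuntimeError("no free slots in the cluster")
    target.jobs.append(job)
    return target.name


def run_demo() -> List[str]:
    # Hypothetical two-node cluster shared by all submitters.
    nodes = [Node("node-a", 2), Node("node-b", 4)]
    return [submit(f"job-{i}", nodes) for i in range(5)]


print(run_demo())
```

Because every job goes to the least loaded node, the larger node absorbs more work and neither sits idle while the other is full.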

As for Watson, the challenges big businesses face, such as information lifecycle management and information security, are exactly the same as those small businesses encounter. The more data you have, the more important handling it becomes. So when we do big data, we also need to consider how to manage the data effectively. Although Watson computes in memory, the data must be backed up regularly, so I need a manageable file system.

IBM has a file system called GPFS, a highly parallel general-purpose file system that has been in use for more than 15 years and underlies all of IBM's high-performance computing systems. Its benefits include the flexibility to add and remove data nodes, and high parallelism that increases throughput. Underneath, it can do tiered storage management: very important data, such as keywords, can live on high-speed flash, while data from decades ago sits on slower storage, so storage is managed efficiently. GPFS can also migrate data to tape automatically, which helps solve the data management problem. It provides a general file system interface, so commands such as cd all work, which means existing management tools and scripts can be used on GPFS. Watson stores a large amount of data this way and loads key data into memory at startup. There is also a remote replication mechanism, offering synchronous remote replication or asynchronous cross-site replication, so that in a global environment data can be served locally or remotely through one file system for everyone. In the future we will also provide a gateway to open source or public cloud storage platforms. That is the GPFS environment.
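The tiering idea (hot data on flash, older data on slower storage, the oldest on tape) can be illustrated with a simple age-based rule. This sketch is not GPFS policy syntax or its API. The tier names and the 30-day and 365-day cut-offs are hypothetical, chosen only to show the decision logic.

```python
from datetime import date

# Hypothetical tiers: (maximum age in days, tier name), fastest first.
TIERS = [(30, "flash"), (365, "disk")]
ARCHIVE_TIER = "tape"


def choose_tier(last_access: date, today: date) -> str:
    """Pick the fastest tier whose age limit covers the file's last access."""
    age_days = (today - last_access).days
    for max_age, tier in TIERS:
        if age_days <= max_age:
            return tier
    # Anything older than every limit goes to the archive tier.
    return ARCHIVE_TIER


today = date(2014, 12, 12)
for accessed in (date(2014, 12, 1), date(2014, 6, 1), date(1994, 1, 1)):
    print(accessed, "->", choose_tier(accessed, today))
```

A real policy would typically also weigh file size, importance, and pool capacity, not just the last access date.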

Now for Watson's future plans. As I said, Watson was the platform for the second man-versus-machine contest, on IBM's 100th anniversary, and it carried the name of the company's founder, so it could not lose. Our first man-versus-machine contest was in 1997, which people born after 2000 may not have heard of. That first contest used the Deep Blue platform, a chess-playing system built on POWER2 machines with 32 nodes. Today's Watson has 90 POWER7 nodes with 2,880 cores. Our next step is practical deployment, one industry at a time, and the first industry is healthcare. Why? Watson's deep analysis technology needs to be closely tied to an industry. We chose healthcare, starting with cancer treatment: collecting medical information to help doctors treat cancer. Then comes the financial industry, and now cross-industry work. Most recently, some services announced this year are offered free over the Internet.

We started with cancer. We load large amounts of material: hundreds of thousands of journal articles and patient cases are scanned in. When a new patient arrives, the system gives the doctor evidence-based treatment suggestions drawn from the latest medical journals. I want to stress that this does not replace the doctor. It helps the doctor solve problems. Doctors are people too and cannot spend large amounts of time every year studying new developments. A doctor who manages 5 to 15 hours a year on new medical techniques is already doing remarkably well. There is also biotechnology, and diseases such as Ebola that remain unsolved, where the machine can help.

As for the next step, I mentioned that Watson is delivered as a service. It is already open, and eight services are currently available for free. You give it an article, and after scanning it identifies the language of the text. It can distinguish 20 languages. Once the language is known, the next step can pick the most suitable analysis tool or translate between languages. From your writing it can also determine what type of user you are, for example an extroverted user or a knowledge-oriented user, which helps you offer personalized service to customers. The tools available now are fairly basic, such as text type analysis, and more and more services will become available in the future.

Finally, as Lee, the first speaker, also mentioned, I hope big data becomes cross-domain, and the further it crosses the better. IBM also hopes to do more cross-domain work with Chinese customers in this field. Big data is the natural resource of a new generation, as IBM's president has said. Over the past year we have carried out cross-domain cooperation with many industries, which also demonstrates IBM's ability to support big data development. One recent example is the analysis project IBM did with Tencent during the World Cup this June and July, analyzing online comments from World Cup viewers in real time to see the hot topics of each match, which stars were being praised, and so on. It was a very successful collaboration.

To close, here is a very short film showing which other fields IBM Watson can cross into.

For more highlights, follow the live coverage of the 2014 China Big Data Technology Conference (BDTC), @CSDN Cloud Computing on Sina Weibo, and the CSDN Big Data WeChat account.
