Apache Software Foundation Chairman Doug Cutting: The Data Revolution


"Csdn Live Report" December 2014 12-14th, sponsored by the China Computer Society (CCF), CCF large data expert committee contractor, the Chinese Academy of Sciences and CSDN jointly co-organized to promote large data research, application and industrial development as the main theme of the 2014 China Data Technology Conference (big Data Marvell Conference 2014,BDTC 2014) and the second session of the CCF Grand Symposium was opened at Crowne Plaza Hotel, New Yunnan, Beijing.

Doug Cutting, Chief Architect of Cloudera and Chairman of the Apache Software Foundation, was the first speaker and delivered a speech titled "The Data Revolution."


Doug Cutting, Chief Architect of Cloudera and Chairman of the Apache Software Foundation

The following is a transcript of the speech:

Doug Cutting: I am honored and very happy to be here to talk to you today. We are in the midst of a data revolution. We have a new way of building data systems that is more efficient, more dynamic, and more effective than the way we worked before. This new approach is boosting the economy, and China is ready for this innovation and can benefit from it. So I'd like to start by introducing the premises behind this revolution. On the hardware front, there has been continuous innovation over the past 50 years, with exponential growth in processor speed, processor price/performance, memory size, and processing capacity. No other technical field has seen development like this: the improvement has been on the order of a millionfold.

Productivity has improved in other fields too, such as agriculture and other industries, but no industry has experienced anything close to the millionfold growth of computing. We need to ask where such big changes in computing will lead us. So we start from the assumption that hardware, including memory and processing capacity, will be far more capable in the future than it is today. We have seen hardware progress before, but never as much as now, so it is important to think about what drives it. The driver is that we use computers in almost every sector of our lives.

We now use computers not only in accounting and science, but also in agriculture, transportation, health care, and government. We have deployed hardware that can generate and store data in all kinds of places: hospitals, parks, parking lots. If we collect this data, we can better understand our behavior, our businesses, and our tasks. That can improve our lives and our work and raise our standard of living through new discoveries and inventions, so this data is very valuable.

Companies that use data effectively can take a leading position in their industry. Companies that cannot use data effectively will lose, because their competitors are using it effectively. Our devices are producing data, and by collecting it we can better discover potential in the century ahead. Hardware is becoming more cost-effective and data is growing, and software is what combines the two. We need a way to create software that lets us take full advantage of both.

In the past 10 to 20 years, the way software is developed has changed. Previously, software was generally owned by a company and licensed to its users. The software was basically controlled by that company, which meant that someone else controlled the software you depended on as you developed or used it. That is an uncomfortable position to be in, but open source frees us from this feeling of being controlled. We can share and develop software ourselves, because the software is shared by all of us. Faced with a choice between open source and proprietary software, people have clearly chosen open source.

I have personal experience of this from one project. Lucene is a search technology I developed in the 1990s. In 2000 I released it as open source to everyone, and one could say it was the best search technology of 2000. Since then it has gradually become the most widely used search engine technology in the world, and also the most powerful. That is not because it had a head start in technology, but because it is open source. Open source has played a crucial role in this revolution.

We are now at a stage where, if we want a technology to become a basic technology used across the industry, the basic requirement is that it be open source. When a new technology is launched, people are increasingly inclined to release it as open source. This is also what the market demands, so we treat open source as an important standard in our software development. We now have the three major elements, or three pillars, of the new revolution: hardware, software, and the new way of developing software that I just mentioned, open source.

In 2003 I was working on an open source search engine. We had to collect data from millions of web pages and build a search engine on top of it. We knew we had to distribute this work across many machines, so we took advantage of increasingly inexpensive hardware, which made it easy to scale out. Unfortunately, a lot of manual work was needed just to run on five machines. With five machines running at the same time, we needed one person working on it full time; with 20 machines we would need five people. How could we get this work done? We found a perfect solution that automated these manual tasks. When a hard disk or memory fails, the data is not lost and the system does not crash. The system continues to run, and this lets us keep it running across thousands of servers. The second system we used has the same properties: automation and reliability. When something goes wrong, the system does not crash but continues to run, and it helps us manage reliability better.
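As an illustration of the automatic recovery described above (not part of the speech), the idea of rescheduling work away from a failed machine can be sketched in a few lines of Python. This is a minimal single-process sketch under stated assumptions: the function name `run_with_failover`, the worker names, and the toy "work" (measuring a string's length) are all hypothetical, and none of this reflects Hadoop's actual API.

```python
def run_with_failover(tasks, workers, failed_workers):
    """Assign each task to a worker, skipping workers known to be down.

    tasks:          list of strings (hypothetical units of work)
    workers:        list of worker names, tried in round-robin order
    failed_workers: set of worker names simulating crashed machines
    Returns a dict mapping each task to (worker that ran it, result).
    """
    results = {}
    for i, task in enumerate(tasks):
        for attempt in range(len(workers)):
            worker = workers[(i + attempt) % len(workers)]
            if worker in failed_workers:
                # Simulated machine failure: reschedule on the next worker
                # instead of aborting the whole job.
                continue
            # Toy computation standing in for real work.
            results[task] = (worker, len(task))
            break
        else:
            # Every worker was tried and none was healthy.
            raise RuntimeError(f"no healthy worker for task {task!r}")
    return results
```

Calling `run_with_failover(["a", "bb", "ccc"], ["w0", "w1", "w2"], {"w1"})` completes all three tasks even though `w1` is down; the task that would have gone to `w1` is simply run on `w2`. A real system also has to detect failures and replicate data, which this sketch leaves out.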

So you can see that this is really a new approach, and it has revolutionized the way we process and store data. Because commodity hardware is now cheaper and more cost-effective, it lets us store more data and process more data. We can transform the data, explore it more conveniently, and gain flexibility in the process. We also have a general-purpose processing layer that supports a variety of methods. In 2006 we refined the technology I just mentioned into what we call Hadoop today. Google had developed its technology in-house; we provided an open source technology so that more people could use it. In 2006 we had 40 to 50 nodes running. Later, with Yahoo's help, we developed a scalable system that could run on thousands of machines and run smoothly for weeks at a time. In effect, it created a new way of computing for the world.

At that time it mainly supported batch computation and processing. Over the course of the project its shortcomings were gradually resolved, and it became useful in more and more industries. It is very flexible and has great vitality. This is because there is a great deal of competition in the developer community, where we see different systems emerge and keep evolving. The ecosystem began with batch processing, which it handles very well, and that platform is still the core of our platform today and provides good general-purpose functionality. Later we gradually moved beyond batch processing to support other kinds of workloads. All of these capabilities run on the same cluster, are compatible with one another, and can share data. They can run over large amounts of data with different engines, and they can be very responsive.

This gives us a new approach. We used to spend a lot of time designing systems to get data, load it into another system, and so on. Now it is completely different: we can run different applications and handle different workloads on a single system, with great flexibility and agility. The platform is also beginning to support more kinds of processing, such as in-memory processing and stream processing, both of which Spark supports. As the technology continues to advance, it lets us transform the traditional single-purpose data store into an enterprise data hub, a new platform for data applications.

When I first read Google's paper, I thought more and more people would become interested in this kind of platform. And we have found that more and more people are choosing Hadoop and building their systems on it, including Microsoft, Oracle, and others. We see the Hadoop ecosystem becoming one of the default platforms for data applications. I think this trend will continue for a very long time, because it is a very dynamic and powerful system.

We have more than 20 open source projects. All of them are independent and not controlled by any single party, and they can be improved continuously. The key is that these projects help us build a better system: we have better file systems, schedulers, and so on, and we see Spark replacing some of the older mechanisms. This is a new change in our community, and it lets our system better meet people's needs.

I think we can all get involved. Google, for example, released a new paper a few years ago about transactions on data. Before that, it was felt that handling transactions on large-scale data was very hard, but Google's paper showed that it can actually be done. We can exchange and share data without losing the core of the data, and we can do that on this platform. I think it is an inevitable trend, because we need a system for this kind of transactional data.

The enterprise data hub is like the smartphone, a tool we use constantly in our lives. A smartphone is a phone, a tool for texting, and a camera, and it has many other functions as well. A dedicated camera may take better pictures, but most of us choose a smartphone, because it is very convenient to have one device with a variety of features that can be combined and connected to other systems. The same is true of the enterprise data hub: it integrates a variety of tools and makes it easier for us to get our work done.

So I believe we are in the middle of a data revolution. We have newer, more effective ways to manage data, and they will help drive economic development over the next 10 years. I think China is now ready to embrace this new revolution. The revolution is just beginning, but I think this truly is the data era, and the enterprise data hub will help China become a data-driven economy and country. Thank you very much!

