"Big data is not hype, not bubbles. Hadoop will continue to follow Google's footsteps in the future. "Hadoop creator and Apache Hadoop Project founder Doug Cutting said recently.
As a batch computing engine, Apache Hadoop is the open source software framework at the core of big data. It is often said that Hadoop is unsuitable for the online, interactive processing needed for true real-time visibility into data. Is that the case? "I believe Hadoop has a future beyond batch processing," says Doug Cutting, creator of Hadoop and founder of the Apache Hadoop project, now at Cloudera.
"Batch processing has its own niche. For example, you need to move a lot of data and analyze all the data. But I still think what people really want is a combination of batch processing and online computing. Hadoop will be the core of the enterprise's future mainstream data processing system. "Cutting said.
Where is Hadoop headed?
At the recently concluded Strata Conference + Hadoop World, Cutting explained the core ideas behind the Hadoop stack and its future direction. "Hadoop is seen as a batch computing engine, and in fact that is where we started, with MapReduce. MapReduce is a wonderful tool, and there are plenty of books on the market about how to express algorithms on it," Cutting said.
MapReduce is a programming model, designed by Google, that uses distributed computing to process massive amounts of data in parallel. MapReduce takes an input and divides it into smaller sub-problems, which are assigned to different nodes for parallel processing. The answers to the sub-problems are then combined to form the final output.
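To make the model concrete, here is a minimal sketch of the canonical word-count job written against the Hadoop MapReduce Java API. It is an illustration only, not anything Cutting presented; the input and output paths are supplied as command-line placeholders.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map step: split each input line into words, emitting (word, 1) pairs.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce step: sum the counts emitted for each word across all mappers.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The framework handles the splitting, shuffling, and node assignment described above; the programmer supplies only the map and reduce functions.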
"It's very effective," cutting says, "and it allows you to move the calculations to the data." That way, when you're working with data, you don't have to replicate data everywhere, and it's a shared platform. Building a distributed system is a complex process, so we don't want to redeploy it over and over again. MapReduce proved to be a solid foundation, relying on MapReduce developed a number of tools, such as pig and hive. ”
Key features of Hadoop
To demonstrate the versatility of the Hadoop big data platform, Cutting described what he considers Hadoop's two core themes. First, the Hadoop platform is scalable: it handles not only small datasets that fit in memory but also very large ones.
"One of the key factors in assessing scalability is affordability." We run on a general-purpose hardware platform because it allows you to expand further. If you can buy 10 times times of storage, you can store 10 times times the amount of data. So affordability is the key, and that's why we use general-purpose hardware because it's the most affordable platform. "Cutting said.
Another key feature of Hadoop is that it is open source. Cutting points out that open source software is very affordable: users can pay vendors, but they pay for the value those vendors provide. They are not locked into years of license fees, and over time vendors must earn users' trust and confidence by continuing to deliver value. In addition, Hadoop lets users store data in its original form and apply different schemas later, at the time the data is actually used.
Another popular observation in big data is that, in general, analyzing more data does more to help you understand a problem than a smarter algorithm does. In other words, you are often better off spending your time collecting data than tuning an algorithm on a smaller dataset. Intuitively, it is much like working with images: if you are trying to analyze a picture, you are better off zooming in on a high-resolution image than on a low-resolution one.
Cutting also points out that batch processing is not the defining feature of Hadoop. For example, HBase, modeled on Google's BigTable, is part of the Hadoop stack and has become a very successful open source, non-relational, distributed database. HBase is an online system, not a batch computing system.
"HBase also supports batch processing, and it shares storage, HDFS, with the other components of the Hadoop stack. I think that is one of the reasons HBase is so popular. HBase is integrated with the rest of the system rather than being a standalone system. It shares the stack's storage, and it shares features such as availability, security, and disaster recovery," Cutting explained.
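As an illustration of the online, row-at-a-time access pattern that distinguishes HBase from a batch job, here is a minimal sketch using the modern HBase Java client API. The "users" table, its "profile" column family, and the row key are assumptions made for the example.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseOnlineAccess {
    public static void main(String[] args) throws Exception {
        // Cluster location is read from hbase-site.xml on the classpath.
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             // A "users" table with a "profile" column family is assumed to exist.
             Table table = connection.getTable(TableName.valueOf("users"))) {

            // Online write: insert or update a single row by key.
            Put put = new Put(Bytes.toBytes("user42"));
            put.addColumn(Bytes.toBytes("profile"),
                          Bytes.toBytes("email"),
                          Bytes.toBytes("user42@example.com"));
            table.put(put);

            // Online read: fetch that row back immediately, with no batch job.
            Result result = table.get(new Get(Bytes.toBytes("user42")));
            byte[] email = result.getValue(Bytes.toBytes("profile"),
                                           Bytes.toBytes("email"));
            System.out.println(Bytes.toString(email));
        }
    }
}
```

Because the table lives in HDFS, the same data remains available to MapReduce jobs, which is the storage sharing Cutting describes.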
The future of the technology
If Hadoop is not just a batch computing platform but a more general data processing platform, what will it look like, and where is it going? "Of course we want an open source big data platform that runs on commodity hardware," says Cutting. "At the same time, we want it to scale linearly: if you need to store 10 times the data, you only need to buy 10 times the hardware. No matter how big your dataset grows, you can keep scaling this way."
The same is true for performance. Whether you need more batch throughput or lower latency, you only need to add hardware, and the same holds for interactive queries: adding hardware yields a linear improvement in performance and processing capacity. "People tend to assume that with a big data platform something has to be given up," Cutting said. "I don't think so. In the long run, we won't have to give up any functionality."
On the future technical direction of Hadoop, Cutting says Google has provided a roadmap. "After Google published the GFS and MapReduce papers, we quickly reproduced them in the Hadoop project. Over the years, Google has inspired Hadoop's open source stack in many ways: Google's Sawzall system spawned Pig and Hive, and BigTable directly inspired HBase. I was excited to see Google publish a paper this year on Spanner, which describes how to implement transactions in a distributed database system. Many people may think this won't become reality any time soon, but it shows us the way forward," Cutting said.
Cutting points out that Spanner, as a complex technology, will not become part of Hadoop quickly, but it does define the direction of development. He also mentioned Impala, the database engine Cloudera recently released, which runs SQL queries against datasets stored in HDFS and HBase. Impala will give users a new interactive, online query experience; it likewise follows some of Google's published research and had been in development for some time before release. Cutting believes Impala will evolve into a general-purpose technology platform.
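To show what that interactive experience looks like from application code, here is a hedged sketch of querying Impala over JDBC. Impala speaks the HiveServer2 protocol, so the standard Hive JDBC driver can connect; the host, port, table name, and no-auth setting below are all assumptions for illustration, not a definitive configuration.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ImpalaQueryExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection string; requires the Hive JDBC driver jar
        // on the classpath. 21050 is Impala's usual HiveServer2-protocol port.
        String url = "jdbc:hive2://impala-host:21050/default;auth=noSasl";
        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement();
             // "web_logs" is an assumed table; the query runs interactively,
             // without launching a MapReduce job.
             ResultSet rs = stmt.executeQuery(
                 "SELECT page, COUNT(*) AS hits FROM web_logs GROUP BY page")) {
            while (rs.next()) {
                System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
            }
        }
    }
}
```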
"We already know where to go and how to achieve our goals," he said. So, I encourage you to start using Hadoop now because you will reap more in the future. "Cutting said.
"Big data is not hype, not bubbles. Hadoop will continue to follow Google's footsteps in the future. "Hadoop creator and Apache Hadoop Project founder Doug Cutting said recently.
As a batch computing engine, Apache Hadoop is the open source software framework for large data cores. It is said that Hadoop does not apply to the online interactive data processing needed for real real-time data visibility. Is that the case? "I believe Hadoop has a future beyond the batch," says Doug Cutting, founder of Hadoop and the creator of the Apache Hadoop project, Cloudera. ”
"Batch processing has its own niche. For example, you need to move a lot of data and analyze all the data. But I still think what people really want is a combination of batch processing and online computing. Hadoop will be the core of the enterprise's future mainstream data processing system. "Cutting said.
Where do Hadoop go?
At the just-concluded Strata conference+hadoop World Conference, Cutting explained the core idea of the Hadoop stack and its future direction. "Hadoop is seen as a batch-processing computing engine, in fact, this is where we started (combining MapReduce)." MapReduce is a great tool and there are a lot of books on the market for how to deploy algorithms on MapReduce. "Cutting said.
MapReduce is a programming model, designed by Google, to use distributed computing to process massive amounts of data in parallel. MapReduce gets an input and divides it into smaller child problems that are assigned to different nodes for parallel processing. They then regroup the answers of the child questions to form the output.
"It's very effective," cutting says, "and it allows you to move the calculations to the data." That way, when you're working with data, you don't have to replicate data everywhere, and it's a shared platform. Building a distributed system is a complex process, so we don't want to redeploy it over and over again. MapReduce proved to be a solid foundation, relying on MapReduce developed a number of tools, such as pig and hive. ”
Hadoop Key Features
To demonstrate the versatility of the Hadoop large data platform, cutting describes what he considers to be the two core themes of Hadoop. First, the Hadoop platform is extensible, not only for small datasets stored in memory, but also for processing large datasets.
"One of the key factors in assessing scalability is affordability." We run on a general-purpose hardware platform because it allows you to expand further. If you can buy 10 times times of storage, you can store 10 times times the amount of data. So affordability is the key, and that's why we use general-purpose hardware because it's the most affordable platform. "Cutting said.
Another key feature of Hadoop is open source. Cutting points out that open source software is very affordable. Developers can pay the suppliers, but pay for the value they provide. Developers do not have to pay for years, and over time suppliers need to gain the trust and confidence of developers by providing value to you. In addition, for Hadoop, users can save data in their original form, and then use different patterns when you use the data.
Another popular practice in large data areas is that, in general, analyzing more data than smarter algorithms can help you better understand your problems. In other words, you should spend more time collecting data than the algorithms used to adjust smaller datasets. Intuitively, it's much like a high-resolution image, and if you're trying to parse a picture, you should choose to zoom in on a high-resolution image instead of a low-resolution one.
Cutting also points out that batch processing is not a typical feature of Hadoop. For example, the hbase that imitates Google BigTable is part of the Hadoop stack, which has become a very successful open source, relational, and distributed database. HBase is an online computing system, not a batch computing system.
HBase also supports batching, which shares storage with HDFS and other components of the Hadoop stack. I think that's one of the reasons why HBase is so popular. HBase are integrated into other parts of the system rather than being an independent system. It can be shared with other components of the stack, and can be shared with features such as availability, security, and disaster recovery. Cutting explained.
Technology future
If Hadoop is not just a batch computing platform, but rather a more generic data-processing platform, what will it be, and where will it go? ' Of course we want to have open source large data platforms and be able to run on general-purpose hardware, ' says cutting. At the same time, we want it to have linear scalability, that is, if you need to store 10 times times the data, you only need to buy 10 times times the hardware. No matter how big your dataset is, you can expand it in this way.
The same is true for performance. For batch performance, if you need a larger batch throughput or a smaller latency, you only need to increase the number of hardware. The same is true for interactive queries. Adding hardware will give you a linear extension of performance and data processing levels. "People tend to think that with a large data platform, something needs to be discarded," cutting said. I don't think so. In the long run, we don't have to give up any functionality. ”
On the future direction of Hadoop's technology, cutting says Google has given a roadmap. "After Google published the GFS and MapReduce papers, we quickly copied them into the Hadoop project. Over the years, Google has inspired Hadoop's open source stack in many ways. Google's sawzall system spawned pig and hive, while BigTable directly inspired HBase. I am excited to see that this year Google published a paper called Spanner, which describes the mechanism for implementing transmission in a distributed database system. Many people may think that this will not be a reality soon, but it shows us the way forward. "Cutting said.
Cutting points out that as a complex technology, spanner does not quickly become part of Hadoop, but it does define the direction of technology development. He also mentions the Impala (Cloudera recently released Database Engine), which can use SQL queries to store datasets stored in HBase. Impala will bring a new experience of interactive online queries to users, and it has also followed some of Google's research results and has been released for some time. Cutting that Impala will develop into a common technology platform.
"We already know where to go and how to achieve our goals," he said. So, I encourage you to start using Hadoop now because you will reap more in the future. "Cutting said.
(Responsible editor: Lu Guang)