One Article to Understand Hadoop

Source: Internet
Author: User
Tags: Hortonworks, MapR, Hadoop ecosystem

We are honored to have witnessed Hadoop's first decade, from nothing to dominance. Moved by how quickly the technology has changed, I hope this article offers an in-depth look at Hadoop's yesterday, today, and tomorrow, and a glimpse of the next ten years.

  • This article is divided into four parts: technology, industry, applications, and outlook.

  Technology

  

When the project started in 2006, the word "Hadoop" referred to only two components: HDFS and MapReduce. Over the following ten years the word has come to mean the "core" (the core Hadoop project) plus a growing ecosystem around it. In this respect it is very similar to Linux, which also consists of a kernel and an ecosystem.

Hadoop released stable version 2.7.2 in January. It has grown from the traditional troika of HDFS, MapReduce, and HBase into a vast ecosystem of more than 60 related components; more than 25 of them are included in the major distributions, spanning data storage, execution engines, and programming and data-access frameworks.

After Hadoop 2.0 pulled resource management out of MapReduce and turned it into a general-purpose framework, the stack evolved from the three-tier structure of 1.0 into today's four-tier architecture:

  1. Bottom: the storage layer, the HDFS file system

  2. Middle: resource and data management, such as YARN and Sentry

  3. Upper: compute engines such as MapReduce, Impala, and Spark

  4. Top: higher-level packages and tools built on the compute engines, such as Hive, Pig, and Mahout

  

  Storage layer

  HDFS has become the de facto standard for big-data storage on disk and is used for online storage of large volumes of log-like files. After years of development, the HDFS architecture and feature set have largely solidified: important features such as HA, heterogeneous storage, and short-circuit local reads have all been implemented, and apart from erasure coding there is little left on the roadmap that is truly exciting.

As HDFS stabilizes, its community becomes less active and its usage scenarios become mature and fixed, while more and more file formats are layered on top of it. Columnar formats such as Parquet already serve existing BI-style analytics well; new formats will appear for new scenarios, for example array-oriented storage for machine-learning workloads. HDFS itself will continue to extend support for emerging storage media and server architectures.
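
To make the columnar-format point concrete, here is a minimal PySpark sketch of the Parquet workflow described above; the HDFS paths and column names are illustrative assumptions, not taken from the original text.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("parquet-example").getOrCreate()

    # Read raw log-like records from HDFS (hypothetical path and schema).
    logs = spark.read.csv("hdfs:///data/raw/access_logs", header=True, inferSchema=True)

    # Persist them as columnar Parquet files on HDFS.
    logs.write.mode("overwrite").parquet("hdfs:///data/warehouse/access_logs_parquet")

    # A BI-style aggregation touches only two columns, so the columnar layout
    # lets the engine skip the rest of each row.
    (spark.read.parquet("hdfs:///data/warehouse/access_logs_parquet")
          .groupBy("status_code").count()
          .show())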

  HBase released its 1.0 version in 2015, which also marked its move toward stability. New HBase features include clearer interface definitions, multiple region replicas to support highly available reads, per-column-family flush, and separate RPC queues for reads and writes. Future HBase will add few new features and instead evolve in stability and performance, especially large-heap support and memory/GC efficiency.

  Kudu is a new distributed storage engine that Cloudera released in October 2015, completely independent of HDFS. Its design draws on the Spanner paper Google published in 2012. Given Spanner's great success inside Google, Kudu has been hailed as an important component of the next-generation analytics platform, handling fast data ingestion and analysis and filling the gap between HDFS and HBase. Its appearance will bring the Hadoop market even closer to the traditional data-warehouse market.
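
As a rough illustration of how Kudu sits between HDFS and HBase (structured tables with primary keys, but built for analytic scans), below is a sketch using the kudu-python client; the master address, table name, and schema are assumptions made only for this example.

    import kudu
    from kudu.client import Partitioning

    # Connect to a (hypothetical) Kudu master.
    client = kudu.connect(host="kudu-master.example.com", port=7051)

    # Define a simple schema; Kudu tables require a primary key.
    builder = kudu.schema_builder()
    builder.add_column("event_id").type(kudu.int64).nullable(False).primary_key()
    builder.add_column("ts").type(kudu.unixtime_micros).nullable(False)
    builder.add_column("value").type(kudu.double)
    schema = builder.build()

    # Hash-partition rows across tablets for parallel scans and writes.
    partitioning = Partitioning().add_hash_partitions(column_names=["event_id"], num_buckets=4)

    client.create_table("events", schema, partitioning)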

  The Apache Arrow project defines a specification for columnar in-memory data processing and interchange. Developers from across the Apache Hadoop community are working to make it a de facto standard for big-data systems.

  

The Arrow project is backed by big-data heavyweights such as Cloudera and Databricks, and many of its committers are core developers of other star projects such as HBase, Spark, and Kudu. Considering that projects like Tachyon have found relatively few down-to-earth applications, Arrow's high-profile debut may well make it the future standard interface for in-memory analytical data.
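
For a feel of what a columnar in-memory specification means in practice, here is a tiny sketch using the pyarrow library; the field names and values are made up for illustration.

    import pyarrow as pa

    # Build an in-memory, columnar Arrow table.
    table = pa.table({
        "user_id": pa.array([1, 2, 3], type=pa.int64()),
        "score":   pa.array([0.5, 0.7, 0.9], type=pa.float64()),
    })

    # Because the layout is standardized, the same buffers can be handed to
    # other Arrow-aware systems (here simply converted to pandas) without
    # per-row serialization.
    print(table.to_pandas())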

  Control layer

  This layer covers both data management and resource management.

As Hadoop clusters grow and serve more external workloads, the management layer has to share and utilize resources effectively and reliably. YARN, born out of MapReduce 1.0, became the common resource-management platform of Hadoop 2.0, and thanks to its central position in the stack the industry is optimistic about its prospects in resource management.

Other resource-management frameworks such as Mesos, and now the rise of Docker, will all influence YARN's future development. Improving YARN's performance, integrating with container technology, scheduling short-lived tasks better, supporting multi-tenancy more completely, and managing resources at a finer granularity are all urgent needs that real production use demands YARN solve. For Hadoop to go further, YARN still has a great deal of work ahead.
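
As one small example of what multi-tenant resource management looks like from an application's point of view, here is a hedged PySpark sketch that submits work to a specific YARN queue with explicit container requests; the queue name and sizes are illustrative assumptions.

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("nightly-etl")
             .master("yarn")                               # let YARN manage the resources
             .config("spark.yarn.queue", "analytics")      # a capacity/fair-scheduler queue
             .config("spark.executor.instances", "10")     # containers requested from YARN
             .config("spark.executor.memory", "4g")
             .config("spark.executor.cores", "2")
             .getOrCreate())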

  On the other hand, big-data security and privacy are getting more and more attention. Hadoop relies on Kerberos, and only Kerberos, for authentication, while every component has its own authentication and authorization policies. The open-source community has never seemed to care much about security; without components such as Hortonworks' Ranger or Cloudera's Sentry, a big-data platform is largely running unprotected.

Cloudera's newly launched RecordService component puts Sentry at the front of the security race. RecordService not only provides a consistent level of security granularity across all components, but also offers a record-based underlying abstraction (somewhat like Spring, replacing the original Kite SDK), so that upper-level applications are decoupled from the underlying storage while gaining a reusable data model across components.
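
To show what consistent, fine-grained authorization can look like in practice, here is a sketch of role-based grants issued through HiveServer2 (the kind of policy a layer like Sentry can enforce); the connection details, role, group, and database names are purely illustrative assumptions.

    from pyhive import hive

    # Connect to a (hypothetical) HiveServer2 endpoint on a secured cluster.
    cursor = hive.connect(host="hiveserver2.example.com", port=10000).cursor()

    # Role-based, database-level authorization expressed in SQL and enforced
    # by the security layer rather than by each application.
    cursor.execute("CREATE ROLE analyst")
    cursor.execute("GRANT SELECT ON DATABASE sales TO ROLE analyst")
    cursor.execute("GRANT ROLE analyst TO GROUP analysts")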

  Compute engine layer

  One of the biggest differences between the Hadoop ecosystem and others is its "single platform, multiple applications" model. A database has a single engine underneath and serves only relational applications, so it is "single platform, single application"; the NoSQL market has hundreds of products, each built for a different scenario and completely independent, which is "multiple platforms, multiple applications". Hadoop, by contrast, uses a single HDFS store at the bottom with many components above it serving multiple scenarios, for example:

    • Deterministic data analysis: mostly simple statistical tasks such as OLAP, where fast response matters; implemented by components such as Impala.

    • Exploratory data analysis: mostly information-discovery tasks such as search, focused on collecting unstructured information in full; implemented by components such as Search.

    • Predictive data analysis: mostly machine-learning tasks such as logistic regression, focused on the sophistication and computational power of the models; implemented by components such as Spark and MapReduce.

    • Data processing and transformation: mostly ETL tasks such as data pipelines, focused on IO throughput and reliability; implemented by components such as MapReduce.

    • ...

The most dazzling of these is Spark. IBM has announced plans to train a million Spark developers, Cloudera's One Platform initiative makes Spark the default general-purpose execution engine for Hadoop, and Hortonworks fully supports Spark as well. We believe Spark will be at the heart of future big-data analytics.

  Although Spark is fast, it is still unsatisfying in production and needs further work on scalability, stability, and manageability. At the same time, Spark's streaming capability is limited; achieving sub-second latency for data ingestion or processing still requires other streaming products. Cloudera's announcement that it aims to make Spark streaming cover 80% of applications takes this flaw into account. We do see that in real-time analytics (as opposed to simple data filtering or distribution) scenarios, many deployments of S4- or Storm-style streaming engines have gradually been replaced by Kafka plus Spark Streaming.
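
Below is a minimal sketch of the Kafka plus Spark Streaming pattern just mentioned, assuming the classic direct-stream API of the Spark 1.x/2.x era; the broker address and topic name are illustrative.

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils

    sc = SparkContext(appName="kafka-spark-streaming")
    ssc = StreamingContext(sc, batchDuration=5)   # 5-second micro-batches

    # Pull events straight from Kafka (hypothetical broker and topic).
    stream = KafkaUtils.createDirectStream(
        ssc, topics=["events"],
        kafkaParams={"metadata.broker.list": "broker1:9092"})

    # Stand-in for real analytics: count the events in each micro-batch.
    stream.map(lambda kv: kv[1]).count().pprint()

    ssc.start()
    ssc.awaitTermination()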

Spark's popularity will gradually send MapReduce and Tez into the museum.

  Service Layer

  The service layer wraps the programming details of the underlying engines and gives business users a higher-level access model, for example Pig and Hive.

The hottest part of this layer is the OLAP SQL market. Today 70% of Spark usage comes from Spark SQL! So which SQL-on-Hadoop engine is strongest: Hive, Phoenix, Facebook's Presto, Spark SQL, Cloudera's Impala, MapR's Drill, IBM's Big SQL, or Pivotal's open-source HAWQ?

This is perhaps the most fragmented corner of the ecosystem. Technically, almost every engine has its specific scenarios; ecologically, each vendor has its own favorite, so SQL on Hadoop is not just a technical contest (and for the sake of this article's neutrality, no verdict is offered here). What can be expected is that these SQL tools will consolidate; some products will fall behind in the competition, and we look forward to the market's choice.
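
For reference, here is what the same style of analytic SQL looks like through one of these engines, Spark SQL; the table and column names are invented for the example, and an equivalent query could in principle be pointed at Hive, Impala, Presto, and the rest.

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("sql-on-hadoop")
             .enableHiveSupport()        # read tables registered in the Hive metastore
             .getOrCreate())

    top_products = spark.sql("""
        SELECT product_id, SUM(amount) AS revenue
        FROM sales
        GROUP BY product_id
        ORDER BY revenue DESC
        LIMIT 10
    """)
    top_products.show()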

  The surrounding tools are flourishing; the most important are visualization, task management, and data management.

Many open-source tools support Hadoop-based query programming with instant graphical output, such as Hue and Zeppelin. A user can write some SQL or Spark code, add descriptive tags to it, pick a visualization template, and save the result for others to reuse; this pattern is also called "agile BI". Commercial products in this area, such as Tableau and Qlik, compete fiercely.

Oozie, the original scheduling tool, can chain MapReduce tasks to run in sequence; later tools such as NiFi and Kettle provide more powerful scheduling and are worth a try.

There is no doubt that Hadoop's data governance is relatively rudimentary compared with the traditional database ecosystem. Atlas is Hortonworks' new data-governance tool; it is not fully mature but is making progress. Cloudera's Navigator is the core of the Cloudera commercial distribution, bundling lifecycle management, data lineage, security, auditing, and SQL-migration tools. Cloudera later acquired Explain.io and integrated its product as the Navigator Optimizer component, which helps users migrate traditional SQL applications to the Hadoop platform and provides optimization advice that can save months of work.

  Algorithms and machine learning

Automated, intelligent mining of data value based on machine learning is the most compelling vision of big data and Hadoop, and the ultimate expectation many companies hold for their big-data platforms. As more data becomes available, the value of future big-data platforms will depend increasingly on how much intelligence they can compute.

Machine learning is slowly stepping out of the ivory tower, turning from a research topic for a small number of academics into a data-analysis tool that many enterprises are validating and adopting, and it is entering our daily lives more and more.

Beyond earlier open-source machine-learning projects such as Mahout, MLlib, and Oryx, this year brought many remarkable events as a number of star players joined in (a short MLlib sketch follows the list below):

  • January 2015: Facebook open-sourced its cutting-edge deep-learning tools for Torch.

  • April 2015: Amazon launched its machine-learning platform, Amazon Machine Learning, a fully managed service that makes it easy for developers to build and deploy predictive models from historical data.

  • November 2015: Google open-sourced its machine-learning platform TensorFlow.

  • In January, IBM open-sourced SystemML, which became an official Apache incubator project.

  • At the same time, Microsoft Research Asia open-sourced its distributed machine-learning toolkit DMTK on GitHub. DMTK consists of a framework for distributed machine learning and a set of distributed machine-learning algorithms for applying machine learning to big data.

  • December 2015: Facebook open-sourced "Big Sur", a server built for neural-network research, equipped with high-performance GPUs and designed for deep-learning workloads.
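
As promised above, here is a minimal sketch of the MLlib style of predictive analysis using Spark's DataFrame-based API; the four-row inline dataset exists purely to make the example self-contained.

    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.linalg import Vectors
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("mllib-logreg").getOrCreate()

    # A toy labelled dataset: (label, feature vector).
    training = spark.createDataFrame(
        [(1.0, Vectors.dense([0.0, 1.1, 0.1])),
         (0.0, Vectors.dense([2.0, 1.0, -1.0])),
         (0.0, Vectors.dense([2.0, 1.3, 1.0])),
         (1.0, Vectors.dense([0.0, 1.2, -0.5]))],
        ["label", "features"])

    # Fit a logistic-regression model and inspect its coefficients.
    model = LogisticRegression(maxIter=10, regParam=0.01).fit(training)
    print(model.coefficients)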

  Industry

  Tens of thousands of businesses use Hadoop, and many make money on it. Almost every large enterprise is already using, or planning to use, Hadoop technology. By how they position and use Hadoop, companies in the Hadoop industry can be divided into four tiers:

  • First tier: these companies already use Hadoop as a strategic big-data weapon.

  • Second tier: these companies turn Hadoop into products.

  • Third tier: these companies create products that add value to the overall Hadoop ecosystem.

  • Fourth tier: these companies consume Hadoop and offer Hadoop-based services to smaller companies in the first and second categories.

  

Today Hadoop has been technically proven, widely recognized, and is even maturing. Nothing traces Hadoop's trajectory better than the distributions released by commercial companies. Since Cloudera became the first Hadoop commercialization company in 2008 and launched the first Hadoop distribution in 2009, many large companies have joined the ranks of shipping Hadoop products.

The term "release" is a unique symbol of open source culture, it seems that any company as long as the open source code to a package, and then more or less a condiment can have a "distribution", but behind the mass ecosystem components of the value of screening, compatibility and integration assurance and support services.

  • Before 2012, distributions mostly consisted of patches on top of Hadoop, and several privatized forks of Hadoop appeared, which reflected quality deficiencies in the Hadoop product itself. The extremely high activity of the HDFS and HBase communities in the same period bears this out.

  • Later, companies focused more on tooling, integration, and management: not on providing "a better Hadoop" but on how to make better use of the "existing" Hadoop.

  • After 2014, with the rise of Spark and other OLAP products, it became accepted that the offline scenarios Hadoop is good at had largely been solved, and vendors hoped to expand the ecosystem to adapt to new hardware and enter new markets.

  Cloudera takes a hybrid open-source approach. The core components form CDH (Cloudera's Distribution including Apache Hadoop), which is open source, free, kept in sync with the Apache community, and unlimited in use, so the basic Hadoop capabilities remain available to users without vendor lock-in. The data-governance and system-management components are closed source and require a commercial license; they help customers use Hadoop better and more easily, for example to deploy security policies. In its commercial components Cloudera also provides operational capabilities needed to run Hadoop in an enterprise production environment that the open-source community does not cover, such as zero-downtime rolling upgrades and asynchronous disaster recovery.

  

  Hortonworks takes a 100% open-source strategy, with the product named HDP (Hortonworks Data Platform). All software is open source and free to use; Hortonworks sells commercial technical-support services. Compared with CDH, its management software is the open-source Ambari, data governance uses Atlas, the security component is Ranger rather than Sentry, and for SQL it continues to hold tight to Hive.

  

  MapR follows the model of a traditional software vendor, shipping a proprietary implementation; users cannot use the software until they have purchased it. Its main OLAP product is Drill, though it does not exclude Impala.

  

Today the mainstream public clouds such as AWS and Azure already provide Hadoop-based PaaS services on top of their original IaaS virtual-machine offerings. In the future this market will outgrow private Hadoop deployments.

  Applications

  The Hadoop platform unlocks unprecedented computing power while dramatically reducing its cost. Such a leap in the productivity of the underlying core infrastructure inevitably leads to the rapid build-out of the big-data application layer.

The applications on Hadoop can be broadly divided into two categories:

  IT optimization

  Moving applications and services that are already implemented onto the Hadoop platform, for more data, better performance, or lower cost. This benefits the enterprise by raising the output ratio and cutting production and maintenance costs.

Over the years, Hadoop has proven to be a very suitable solution for several such scenarios, including:

  • Online query of historical log data: traditional solutions store the data in expensive relational databases, which are not only costly and inefficient but also struggle with the high-concurrency traffic of online services. An architecture with HBase as the underlying store and query engine is ideal for fixed-pattern (non-ad-hoc) queries such as flight lookups or personal transaction queries, and it has become the standard solution for online query applications; China Mobile has explicitly mandated HBase in its enterprise technical guidance for the detailed-bill query business of all its branches. (A minimal sketch of this pattern follows this list.)

  • ETL tasks: many vendors offer excellent ETL products and solutions that are already widely used. In big-data scenarios, however, traditional ETL hits serious challenges in performance and QoS guarantees. Most ETL tasks are lightweight but IO-heavy, while traditional IT hardware, such as the minicomputers that host databases, is designed for compute-heavy tasks, and even with the latest network technology its IO tops out at a few dozen GB.

    Hadoop's distributed architecture offers the perfect remedy: its share-nothing scale-out design provides linearly scalable, effectively unlimited IO, which keeps ETL efficient, while the framework's load balancing, automatic failover, and similar features guarantee the reliability and availability of task execution.

  • Data-warehouse offload: traditional data warehouses run many offline batch-processing jobs, such as daily and monthly reports, that consume a great deal of hardware resources, and these are exactly the kind of work Hadoop excels at.
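
Here is the sketch of the fixed-pattern online query mentioned in the first bullet above, using the happybase Thrift client; the table name, row-key layout, and column family are illustrative assumptions about how such a detailed-bill table might be modelled.

    import happybase

    # Connect to a (hypothetical) HBase Thrift gateway.
    connection = happybase.Connection("hbase-thrift.example.com")
    table = connection.table("billing_detail")

    # Row keys of the form <user id>-<yyyymm>-<timestamp> give each user's
    # monthly records a contiguous, non-ad-hoc access path.
    for key, data in table.scan(row_prefix=b"user123-201601"):
        print(key, data[b"d:amount"], data[b"d:callee"])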

A frequently asked question is whether Hadoop can replace the data warehouse, in other words whether an enterprise can use free Hadoop and avoid buying expensive data-warehouse products. Michael Stonebraker, a leading authority in the database field, said in a technical exchange: "The data-warehouse and Hadoop use cases overlap heavily, and these two markets are bound to merge in the future."

We believe Hadoop will sooner or later displace today's products in the data-warehouse market, though not in its current form. For now, Hadoop is only a complement to data-warehouse products, forming a hybrid architecture with them to serve upper-level applications together.

  

  Business optimization

  Implementing on Hadoop the algorithms and applications that were previously out of reach, incubating new products and businesses from existing lines, and creating new value. New business brings new markets and customers, and thus new revenue.

Hadoop provides powerful computing capability, and specialized big-data applications have excelled in almost every vertical, from banking (anti-fraud, credit scoring, and so on) and healthcare (especially genomics and drug research) to retail and services (personalized services and smart services such as Uber's automatic dispatching).

Inside the enterprise, a variety of tools have emerged to help users run core functions. For example, big-data applications combine large amounts of internal and external data, updated in real time, to help sales and marketing figure out which customers are most likely to buy; customer-service applications help personalize service; HR applications help identify how to attract and retain the best employees.

Why has Hadoop been so successful? The question may sound like hindsight, but when we marvel today that Hadoop achieved such dominance in only ten years, it is natural to ask why it all happened. Comparing it with other projects of the same period, we believe many factors combined to create this miracle:

  • Technical framework: the data-local computing that Hadoop champions, together with its scalability, reliability, and flexible multi-layer architecture, are intrinsic factors in its success. No other system of comparable complexity could satisfy users' changing needs so quickly.

  • Hardware trends: the scale-up architecture represented by Moore's law hit technical bottlenecks, and ever-growing computing demands forced software to look for distributed solutions; at the same time, advances in commodity PC servers made a cluster of inexpensive nodes like Hadoop feasible, with an attractive price/performance advantage.

  • Engineering validation: by the time Google published the GFS and MapReduce papers, they were backed by substantial internal deployment and practical use, and Hadoop itself had already proven its engineering reliability and usability at internet companies such as Yahoo before it reached the wider industry, which greatly increased confidence and sped up adoption; the large number of deployments in turn pushed Hadoop toward maturity.

  • Community driven: the Hadoop ecosystem has always stayed open source, and the friendly Apache license all but removes the entry barrier for vendors and users, building the largest, most diverse, and most active developer community ever, which keeps driving technical progress and has carried Hadoop past many earlier and contemporary projects.

  • Focus on the foundation: Hadoop's goal is a distributed computing framework that makes application developers' work easier. The industry's sustained attention keeps strengthening this foundation, with continued progress in areas such as resource management and security, clearing the obstacles to deployment in enterprise production environments.

  Next-generation analytics platform

The Apache Hadoop community has grown at a frantic pace over the past decade and is now the de facto standard big-data platform. But there is still much to do! The future value of big-data applications lies in prediction, and the core of prediction is analytics. What will the next-generation analytics platform look like? It will inevitably face, and must solve, the following problems:

  1. More and faster data.

  2. Newer hardware features and architectures.

  3. More advanced analytics.

  4. Stronger security.

So over the next few years we will keep watching the "post-Hadoop era" next-generation enterprise big-data platform take shape:

  1. The era of in-memory computing has arrived. With the growth of advanced analytics and real-time applications, demand for processing power is higher, and the bottleneck of data processing is moving back from IO to the CPU. Memory-centric Spark is replacing IO-throughput-centric MapReduce as the default general-purpose engine for distributed big-data processing. As a general-purpose engine that supports both batch and near-real-time stream processing, Spark will be able to cover more than 80% of application scenarios.

    However, Spark's core is still batch processing; it is good at iterative computation but cannot satisfy every scenario. Tools designed for special scenarios will complement it, including:

    A) OLAP. OLAP, especially aggregation-heavy online statistical analysis, differs greatly from simple offline batch processing in how data is stored, organized, and processed.

    B) Knowledge discovery. Unlike traditional applications that answer known questions, the value of big data lies in discovering and answering unknown questions. The goal is therefore to maximize the analyst's intelligence and turn data retrieval into data exploration.

  2. Unified data access and management. Because data is stored in different formats and locations, accessing it today requires different interfaces, models, and even languages; meanwhile, differing granularity of data storage creates many challenges for security control and governance. The future trend is to separate low-level deployment and operations details from upper-level business development, which requires the platform to guarantee the following:

    A) Security. The ability to enforce on the big-data platform the same caliber of data-security policies as in traditional data-management systems, including integrated user-rights management, fine-grained access control, encryption and decryption, and auditing across components and tools.

    B) A unified data model. By abstracting the data description we can manage the data model centrally and reuse data-parsing code, while also shielding upper-level processing from the details of the underlying storage, decoupling development/processing from operations/deployment.

  3. Simplified real-time applications. Users now care not only about collecting data in real time but also about making it visible and getting analysis results online as soon as possible. Whether the earlier delta architecture or today's lambda architecture, both seek a solution for fast data. Cloudera's newly unveiled Kudu, though not yet shipping in the product, is probably the best answer to this problem: by using a single platform to simplify the "ingest" side of fast data, it is the coming solution for log-style data analysis.

  Looking forward to the next 10 years

  Ten years from now, Hadoop should be a synonym for an ecosystem and a set of standards. The underlying storage layer will not be limited to existing architectures such as HDFS, HBase, and Kudu, and the upper processing components will be as plentiful as an app store: any third party will be able to develop its own components against Hadoop's data-access and compute-communication protocols, and users will pick suitable components from the marketplace, based on the characteristics of their data and their computing needs, and have them deployed automatically.

Of course, some obvious trends will inevitably affect Hadoop's path:

    • Cloud computing

Already 50% of big-data workloads run in the cloud, and within three years this ratio may rise to 80%. Hadoop's development on public clouds will require better security support there.

    • Hardware

Rapid hardware advances will force the community to revisit Hadoop's roots, and the Hadoop community will certainly not stand idly by.

    • Internet of Things

The Internet of Things will bring massive, distributed, and decentralized data sources, and Hadoop will adapt to that development.

What happens in the next ten years? Here are some of the author's speculations:

  1. The SQL and NoSQL markets will merge, NewSQL and Hadoop technologies will converge, and the Hadoop market will merge with the data-warehouse market, but product fragmentation will persist.

  2. Hadoop will integrate with other resource-management technologies and cloud platforms, absorbing technologies such as Docker and unikernels to unify resource scheduling and management, offer complete multi-tenancy and QoS capabilities, and merge enterprise data-analysis centers into a single architecture.

  3. Enterprise big-data products will become scenario-driven. Companies that today sell products and technology directly will mature and shift toward services, and a growing number of new companies will offer industry- and scenario-specific solutions, such as personal online-credit kits and services.

  4. Big-data platforms will "split" by scenario. Unlike today, when big data is simply equated with Hadoop as one framework, future data platforms will be segmented into tiered solutions and products according to data scale (from tens of TB to ZB) and application scenario (various dedicated application clusters), and customized integrated appliances may even appear.

  Postscript

Hadoop has now become the "new normal" for enterprise data platforms. We are honored to have witnessed Hadoop's decade from nothing to dominance. Moved by the technology, we hope this reading of Hadoop's yesterday, today, and tomorrow can serve as a small gift for its tenth birthday.

The author's knowledge is limited and time was short, so please forgive any shallow or rough passages and offer your advice. Some content is quoted from the internet and the original sources could not all be traced; we ask those original authors' forgiveness as well.

Big data has a bright tomorrow, and Hadoop will surely become an essential skill for enterprise software; we hope to witness it together.

  About the author

  Chen Yu, pre-sales technical manager and industry consultant at Cloudera, senior solution architect, formerly a core developer of the Intel Hadoop distribution. He joined Intel's compiler department in 2006 to work on server middleware software, specializing in debugging and optimizing server software, and led a team that built an XSLT language processor with world-leading performance. Since 2010 he has worked on Hadoop product development and solution consulting, successively responsible for Hadoop products, HBase performance tuning, and industry solutions, and has implemented and supported multiple Hadoop clusters of several hundred nodes in industries such as transportation and telecommunications.
