Big Data Evolution Trajectory


When it comes to open-source big data processing platforms, Hadoop is the undisputed patriarch of the family: it is the open-source implementation of Google's GFS and MapReduce. Similar distributed storage and computing platforms existed before it, but Hadoop is the one that truly enabled industrial applications, lowered the barrier to use, and drove industry-wide deployment. Thanks to the ease of use and fault tolerance of the MapReduce framework, together with its storage system (HDFS) and computing system, Hadoop became one of the cornerstones of big data platforms. It can satisfy most offline storage and offline computing needs with good performance, and the remaining offline cases can often be handled with Hadoop as well when performance requirements are modest. In the initial stage of building a big data platform, Hadoop can therefore cover more than 90% of offline storage and computing needs, making it the first choice for most companies starting out.
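To make the ease of use concrete, below is a minimal sketch of the kind of job the MapReduce framework makes simple: a word count written as Hadoop Streaming mapper and reducer scripts in Python. The file names and the sample invocation are assumptions for illustration, not anything prescribed by Hadoop itself.

```python
#!/usr/bin/env python
# mapper.py -- reads raw text on stdin and emits "word<TAB>1" for every word.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t%d" % (word, 1))
```

```python
#!/usr/bin/env python
# reducer.py -- input arrives sorted by key, so counts for one word are adjacent.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print("%s\t%d" % (current_word, current_count))
```

A job like this is typically submitted with the hadoop-streaming jar shipped with Hadoop (for example `hadoop jar hadoop-streaming-*.jar -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input /logs -output /wordcount`); the framework takes care of splitting the input, shuffling, sorting, and retrying failed tasks.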


As Hadoop clusters grow larger, the single NameNode becomes a problem: first, the memory of a single machine is limited, so the number of files it can track cannot keep growing; second, it is a single point of failure, which seriously undermines the availability of the cluster. Several distributed NameNode schemes have therefore appeared in the industry to remove this single point. In addition, so that multiple computing frameworks can run on the same cluster and machine resources can be fully reused, Hadoop introduced YARN. YARN is a general-purpose resource manager responsible for resource scheduling and resource isolation. It aims to become a unified resource management center for the various computing frameworks, allowing MapReduce, Storm, Tez, and others to run on the same cluster at the same time.

Hadoop solved the basic problem of the big data platform, but as business needs became more refined, people raised higher expectations and requirements in particular sub-domains, and a number of more efficient, more specialized platforms emerged. First came variants and alternatives built on the ideas of the Hadoop framework itself, such as HaLoop and Dryad, but these were largely never deployed at scale, either because the improvements were not significant enough or because they were superseded by platforms redesigned from the ground up.

HBase was born out of the need to analyze massive numbers of web pages on the Hadoop platform, and then grew into a general-purpose distributed NoSQL database. Hadoop draws on the design of Google's GFS and MapReduce, and Google's Bigtable corresponds to HBase in the Hadoop ecosystem. HBase enriches the way Hadoop stores data: on top of HDFS's file-based storage it provides tabular storage, so that the many attributes of a web page can be extracted and stored as individual fields, greatly improving the efficiency of web page query and analysis. At the same time, HBase is widely used as a general-purpose NoSQL store: it is a column-oriented, non-relational database that makes up for HDFS's weakness in random reads and writes and provides low-latency data access. HBase itself, however, does not provide a scripting or query language (such as SQL) for accessing data. To make access more convenient, Pig, originally a scripting language for Hadoop, began to support HBase. Pig is a lightweight scripting language for operating on Hadoop and HBase, and people who do not want to write MapReduce jobs can access data easily with it.
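As an illustration of the tabular, low-latency access HBase adds on top of HDFS, here is a small sketch using the third-party happybase Python client, which talks to HBase through its Thrift gateway; the host, table name, and `page` column family are assumptions made up for this example.

```python
import happybase

# Connect through an HBase Thrift gateway (hostname is an assumption).
connection = happybase.Connection('hbase-thrift-host')
table = connection.table('webpages')

# Store several extracted attributes of one crawled page as columns of a row
# keyed by the (reversed) URL.
table.put(b'com.example/index.html', {
    b'page:title': b'Example Domain',
    b'page:status': b'200',
    b'page:fetched_at': b'2015-06-01T12:00:00Z',
})

# Low-latency random read of a single row by key -- the access pattern that
# HDFS alone does not serve well.
row = table.row(b'com.example/index.html')
print(row[b'page:title'])
```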

Another well-known system similar to HBase is Hypertable, an open-source implementation of Bigtable written in C++, but with fewer people maintaining it and an increasingly active Hadoop ecosystem, Hypertable was gradually forgotten. Another system that must be mentioned is Cassandra, originally developed by Facebook and also a distributed NoSQL database. Unlike HBase and Hypertable, which follow Bigtable, Cassandra combines Amazon's Dynamo storage model with Bigtable's data model. One of its major features is the use of a gossip protocol to build a decentralized, peer-to-peer store: all servers are equivalent and there is no single point of failure. Compared with HBase, Cassandra is simpler to configure, has fewer platform components, and is easier to deploy and operate; in CAP terms it favors availability and partition tolerance, does not provide row locks, and is not well suited to storing very large files. HBase, by contrast, is more complex to configure, has more components, and is somewhat more cumbersome to deploy and operate; in CAP terms it favors consistency and partition tolerance, and it provides row locks and can handle very large files.
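For contrast with HBase, the following sketch uses the DataStax Python driver (cassandra-driver) to talk to a Cassandra cluster; because every node is a peer, any node can serve as a contact point. The contact points, keyspace, and table are illustrative assumptions.

```python
from datetime import datetime
from cassandra.cluster import Cluster

# Any nodes can be contact points; there is no master node to single out.
cluster = Cluster(['10.0.0.1', '10.0.0.2'])
session = cluster.connect('demo_keyspace')

session.execute(
    "INSERT INTO user_events (user_id, event_time, action) VALUES (%s, %s, %s)",
    ('u42', datetime(2015, 6, 1, 12, 0), 'login'),
)

rows = session.execute(
    "SELECT action FROM user_events WHERE user_id = %s", ('u42',)
)
for row in rows:
    print(row.action)
```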

Although Hadoop's MapReduce framework is easy to use, for data-warehouse-style requirements that are traditionally expressed in SQL, calling the map and reduce interfaces directly is still relatively cumbersome, and it is a real barrier for users unfamiliar with MapReduce. Hive was born to solve this problem. It builds a data warehouse framework on top of Hadoop that maps structured data files to database tables and provides SQL-like query interfaces, bridging the gap between Hadoop and data warehouse operations and greatly improving the efficiency of data query and reporting work. On the one hand, users familiar with SQL can migrate to the Hive platform at very low cost; on the other hand, data whose volume exceeds what a traditional data warehouse architecture can hold can also be migrated to Hive. The Hive platform has therefore become the core of many companies' big data warehouse solutions.
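The sketch below shows what the SQL-like interface buys in practice: a Hive query issued from Python through HiveServer2 using the third-party PyHive client. The host, table, and columns are assumptions; Hive compiles the statement into MapReduce jobs behind the scenes.

```python
from pyhive import hive

conn = hive.Connection(host='hiveserver2-host', port=10000)
cursor = conn.cursor()

# Plain HiveQL: Hive turns this into one or more MapReduce jobs over HDFS data.
cursor.execute("""
    SELECT dt, COUNT(*) AS pv
    FROM access_logs
    WHERE dt >= '2015-06-01'
    GROUP BY dt
""")

for dt, pv in cursor.fetchall():
    print(dt, pv)
```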

Hive and HBase overlap only slightly in function, and the main difference is this: HBase is in essence a database that provides low-latency reads and writes at the storage layer and can be used in near-real-time scenarios, but it does not offer an SQL-like query language, so querying and computing over its data is less convenient (Pig has a learning cost). Hive, in essence, maps SQL statements to MapReduce jobs; its latency is high but it is easy to use, suits offline scenarios, and does not store data itself. In addition, Hive can be built on top of HBase and used to access HBase data.

The arrival of Hive bridged the gap between Hadoop and data warehousing, but as Hive was adopted more widely, its efficiency was found to be not that high, because Hive queries are executed as MapReduce jobs in the compute layer rather than at the storage layer. They are therefore constrained by the data transfer and interaction of the MapReduce framework and by the overhead of job scheduling. To make Hadoop-based data warehouse operations more efficient, a different implementation appeared after Hive: Impala. Impala does not execute its queries as MapReduce jobs; it skips Hadoop's compute layer and reads and writes Hadoop's storage layer, HDFS, directly. Because the compute layer is bypassed, all of its overhead is eliminated, including the per-job data interaction within the compute layer and the disk I/O between successive rounds of computation. Reading and writing HDFS directly allows more flexible data interaction and better read/write efficiency. Impala implements columnar storage of nested data and adopts a multi-level query tree, so that queries and result aggregation can be executed quickly in parallel across thousands of nodes. According to some public figures, Impala is 3 to 68 times faster than Hive depending on the scenario, and in some special cases as much as 90 times faster.

Hadoop greatly lowered the barrier to computing over massive data, making it possible for businesses to quickly adopt it for big data analysis, and as analysis and computation deepened, differentiated needs slowly emerged. People began to find that for some computations, faster turnaround means greater benefit and a better user experience. At first, to improve timeliness on the Hadoop platform, a whole batch of data would be cut into hour-level or even sub-hour-level slices, turning the work into relatively lightweight tasks; the result for the current slice could then be computed quickly on Hadoop and combined with the previously accumulated results to obtain the overall result, achieving reasonably good timeliness. But as competition in the Internet industry intensified, timeliness received more and more attention, especially for real-time analysis and statistics, and a large number of requirements expected results within minutes or even seconds. The practical limit for Hadoop's timeliness is roughly ten minutes; constrained by cluster load and scheduling policy, going stably below that is very difficult unless the cluster is dedicated. To achieve higher timeliness, producing results within minutes, seconds, or even milliseconds, Storm came into being. It abandoned the MapReduce architecture entirely and redesigned an architecture suited to stream computing, driven by the flow of data: each arriving record triggers computation and can produce a result, so timeliness is very high, generally at the second level. Moreover, its computing topology, a directed acyclic graph, provides a very flexible and rich computing model that covers common real-time computing needs, so Storm has been widely deployed in the industry.
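The contrast with batch computing can be seen in a plain-Python sketch of the tuple-at-a-time model (this is not Storm's actual API): each arriving record immediately updates the running result and is emitted downstream, so output latency is per record rather than per batch.

```python
from collections import defaultdict

counts = defaultdict(int)

def emit(result):
    # Stand-in for sending a tuple to the next bolt in the topology.
    print("updated:", result)

def on_tuple(word):
    """Called once per incoming record, like a bolt processing one tuple."""
    counts[word] += 1
    emit((word, counts[word]))   # results are available immediately

# Stand-in for an endless stream of records.
for word in ["spark", "storm", "spark"]:
    on_tuple(word)
```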

Storm's core framework guarantees reliability by delivering each record at least once: normally a record is sent once, and it is re-sent on failure. Intermediate processing logic may therefore receive the same record twice. For most businesses this causes no extra problems, or the error can be tolerated, but for businesses with strict transactional requirements it can cause trouble; deducting a payment twice, for example, is unacceptable to the user. To solve this, Storm introduced the transactional topology, which implements exactly-once processing semantics and was later superseded by the newer Trident mechanism. Trident also provides aggregation and query operations over real-time data, such as join, groupBy, and filter.
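Why at-least-once delivery matters for transactional work is easiest to see in a tiny sketch: the same message may be redelivered after a failure, so the handler has to be idempotent. The in-memory set of processed ids below is purely illustrative; a real system would keep this state durably.

```python
processed_ids = set()   # illustration only; real systems persist this state
balance = 100

def handle_charge(msg_id, amount):
    """Deduct the amount at most once, no matter how often the message arrives."""
    global balance
    if msg_id in processed_ids:
        return                     # duplicate redelivery: ignore it
    balance -= amount
    processed_ids.add(msg_id)

handle_charge("txn-1", 30)
handle_charge("txn-1", 30)         # redelivered after a failure; no effect
print(balance)                     # 70, not 40
```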

A system similar to Storm is Yahoo's S4, but Storm's user base is far larger, so Storm has developed more rapidly and its features are more complete.

As big data platforms became widespread, people were no longer satisfied with simple mining such as statistics and data correlation, and gradually began to apply machine learning and pattern recognition algorithms to deep mining of massive data. Because these algorithms are usually complex, computationally intensive, and were implemented as single-machine algorithms, applying them to massive data before Hadoop was almost impossible, at least for industrial use: first, a single machine could not process that much data; second, even if it could, the computation would take too long, often weeks to months, which is unacceptable in terms of resources and time; third, there was no easy-to-use parallel computing platform for quickly turning a single-machine algorithm into a parallel one, so the cost of parallelization was high. With Hadoop these problems were solved: a large number of machine learning and pattern recognition algorithms could be quickly parallelized with the MapReduce framework, and they are now widely used in search, advertising, natural language processing, personalized recommendation, security, and other services.

The catch is that these machine learning and pattern recognition algorithms are usually iterative, typically running for dozens to hundreds of rounds. On Hadoop that becomes dozens to hundreds of serial jobs, and between every two consecutive jobs a large amount of I/O is spent passing data; by some incomplete statistics, most iterative algorithms spend around 80% of their time on Hadoop in this I/O. If that overhead could be removed, the speedup would be enormous, so an in-memory computing trend emerged in the industry, with Spark as its leader. Spark proposed the concept of the RDD: the results of each round are kept distributed in memory, and the next round reads the previous round's data directly from memory, saving a great deal of I/O. It also offers a richer set of data operations than Hadoop's MapReduce, so some computations that would have to be decomposed into several rounds of Hadoop jobs can be expressed in a single Spark job. As a result, iterative workloads such as machine learning and pattern recognition typically run several to dozens of times faster on Spark than on Hadoop. On the other hand, Spark was designed to handle both MapReduce-style and iterative computation, so older MapReduce workloads can also be migrated to the Spark platform. Thanks to this compatibility with Hadoop computation and its excellent performance on iterative computation, the maturing Spark platform gained popularity rapidly.
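A minimal PySpark sketch of the pattern described above: the working dataset is cached in memory once, and each iteration reads it from memory instead of going back to disk. The input path and the toy update rule (a gradient step toward the mean) are assumptions for illustration.

```python
from pyspark import SparkContext

sc = SparkContext(appName="iterative-sketch")

# Parse the data once and keep it distributed in memory for later rounds.
points = sc.textFile("hdfs:///data/points.txt") \
           .map(lambda line: float(line)) \
           .cache()

m = 0.0
for _ in range(20):
    # Each round is one distributed pass over the cached RDD, not over HDFS.
    grad = points.map(lambda x: x - m).mean()
    m += 0.5 * grad

print("estimate:", m)
sc.stop()
```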

People increasingly found that Spark's advantages could be extended to more areas, and Spark is now moving toward becoming a general-purpose, multifunctional big data platform. To bring Spark into the data warehousing world, developers introduced Shark, which provided an SQL-like query interface on the Spark framework that was fully compatible with Hive QL; it has since been superseded by Spark SQL, which offers a better user experience. Spark SQL covers all of Shark's features, accelerates query analysis of existing Hive data, and also supports relational queries directly on native RDD objects, significantly lowering the barrier to use. In the field of real-time computing, the Spark Streaming project builds a real-time computing framework on Spark by splitting the data stream into small time slices (for example, a few seconds) and executing them as small batches. Thanks to Spark's in-memory computing model and low-latency execution engine, real-time computation that was impractical on Hadoop becomes possible on Spark. Although its timeliness falls a little short of a dedicated stream processing system, it can still be used in many real-time and near-real-time scenarios. In addition, Spark has Bagel in the graph computing domain, which is essentially an implementation of Google's Pregel on Spark. It provides a graph-based computing model and was later replaced by the newer Spark graph API, GraphX.
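A small Spark SQL sketch of the relational querying mentioned above: a DataFrame is registered as a temporary view and queried with SQL. It uses the SparkSession API of later Spark releases; the data and names are made up for illustration.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-sketch").getOrCreate()

# A tiny in-memory DataFrame standing in for data loaded from Hive or HDFS.
df = spark.createDataFrame(
    [("alice", 3), ("bob", 5), ("alice", 2)],
    ["user", "clicks"],
)
df.createOrReplaceTempView("clicks")

# Relational query executed by Spark's engine rather than by MapReduce.
spark.sql("SELECT user, SUM(clicks) AS total FROM clicks GROUP BY user").show()

spark.stop()
```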

As big data clusters grow larger, the probability of local failures rises and the distributed consistency of the cluster's core data becomes harder and harder to guarantee. ZooKeeper was created to solve this thorny problem. It implements a consensus protocol in the Paxos family (ZAB) and provides a distributed consistency service to the cluster, so that other platforms and applications can achieve distributed consistency of their data simply by calling its services, without worrying about the implementation details; big data platform developers can then focus on their own platform's features. For example, Storm uses ZooKeeper to store cluster metadata (such as node information, status, and task assignments), so its fault tolerance can be implemented simply and efficiently: even if a component fails, its replacement can quickly register with ZooKeeper, obtain the metadata it needs, and resume the failed task. Beyond distributed consistency, ZooKeeper is also used for leader election, hot-standby switchover, service discovery, distributed locks, configuration management, and more.
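The kind of coordination described above looks roughly like the following sketch, which uses the third-party kazoo Python client: a worker registers itself as an ephemeral node so the rest of the cluster can discover it and notice when it disappears. Hosts and paths are assumptions.

```python
from kazoo.client import KazooClient

zk = KazooClient(hosts='zk1:2181,zk2:2181,zk3:2181')
zk.start()

zk.ensure_path('/cluster/workers')

# An ephemeral node vanishes automatically if this process loses its session,
# which is the building block for failure detection and failover.
zk.create('/cluster/workers/worker-', b'10.0.0.7:8000',
          ephemeral=True, sequence=True)

# React whenever a worker joins or leaves the cluster.
@zk.ChildrenWatch('/cluster/workers')
def on_membership_change(children):
    print('live workers:', children)

# A real worker keeps running here; calling zk.stop() would drop its
# ephemeral node and signal failure to the rest of the cluster.
```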

Data moves throughout its life cycle: it is generated, collected, stored, computed over, and eventually destroyed, and it flows between these states as well as within a single state (within the computing state, for instance). The carriers upstream and downstream of these flows are varied: terminals, online log servers, storage clusters, compute clusters, and so on. On the back end, most of this movement happens between big data platforms, so the platforms are not isolated; in processing data they tend to form upstream/downstream relationships, and moving data from upstream to downstream needs a data pipeline that correctly connects each producer to its consumers. A Kafka cluster solves this scenario well. Kafka, originally developed by LinkedIn, is a distributed publish/subscribe messaging system. A Kafka cluster can act as a big data pipeline responsible for correctly connecting every kind of upstream and downstream: each upstream system sends its data to the Kafka cluster, while each downstream system flexibly selects the upstream data it needs by subscribing to it, and Kafka supports multiple downstream subscribers for the same upstream data. When data is produced upstream, Kafka persists it for a configurable time window, waiting for downstream consumers to read it. Kafka's data persistence and internal fault tolerance give it good data reliability, making it suitable for both offline and online message consumption.
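A sketch of the pipeline role using the third-party kafka-python client: an upstream producer publishes log lines to a topic, and a downstream consumer group independently subscribes to the same topic. Broker addresses, topic, and group names are assumptions.

```python
from kafka import KafkaProducer, KafkaConsumer

# Upstream: publish a record to the pipeline.
producer = KafkaProducer(bootstrap_servers='kafka1:9092')
producer.send('access-logs', b'GET /index.html 200')
producer.flush()

# Downstream: any number of consumer groups can read the same topic
# independently within Kafka's retention window.
consumer = KafkaConsumer('access-logs',
                         bootstrap_servers='kafka1:9092',
                         group_id='realtime-analytics',
                         auto_offset_reset='earliest')
for message in consumer:
    print(message.value)
    break
```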

Figure: the dates when each of the platforms above first appeared.

Big data platforms have greatly increased the industry's productivity, making it simpler and more efficient to store and compute over massive amounts of data. Using these platforms, applications serving huge numbers of users can be built quickly; the mobile Internet has developed rapidly under their catalysis and changed our lives.

