Point of view: Streaming computing drives real-time business change
Source: Internet
Author: User
Keywords: Real-time streaming computing driving business change data processing
Over the past year, many vendors focused mainly on integrating Hadoop or NoSQL data processing engines and on improving basic data storage. Much of Hadoop's success comes from its use of MapReduce. MapReduce is a programming model for processing very large datasets, together with the execution framework that runs it. Its core ideas are borrowed from features of functional programming languages and vector programming languages.
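The functional roots of MapReduce can be illustrated with a small word count. This is a minimal single-process sketch in plain Python, not Hadoop's API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from functools import reduce


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework would between the two phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: fold all values for one key into a single result."""
    return key, reduce(lambda a, b: a + b, values, 0)


def word_count(documents):
    """Run map, shuffle, and reduce over a fixed list of documents."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

Note that `word_count` needs the full list of documents before it starts, which is the batch assumption discussed later in the article.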
Today, many vendors, including Microsoft, IBM, Oracle, Cloudera, and MapR, have rolled out products that incorporate Hadoop. The Oracle NoSQL Database, for example, is one of the components of the Big Data Appliance that Oracle announced at its global conference. The Big Data Appliance also includes Hadoop, the Oracle Database Hadoop adapters, the Oracle Database Hadoop loader, and the R language system.
Also this month, Microsoft unveiled a preview release of its Apache Hadoop-based service for Windows Azure. According to Microsoft, it allows Hadoop applications to be deployed in a matter of hours, where this used to take days. This trend will continue in the coming year. As we have seen, Hadoop technology is widely deployed in many areas.
But Hadoop also faces some difficult cases. Its batch processing is widely appreciated, but it falls short in areas that require real-time computing, such as mobile, web clients, finance, and online advertising. These areas produce large volumes of data, and there is often not enough storage space to keep everything each business receives. Stream computing can analyze the data in real time as it arrives and decide whether to discard useless data, without going through a map/reduce stage.
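The idea of analyzing data on arrival and discarding what is not needed can be sketched with Python generators. This is only an illustration under assumptions: the event layout (a dict with `type` and `amount` fields) and the function names are hypothetical, and a real system would read from a network source rather than a list.

```python
def keep_useful(events, is_useful):
    """Yield useful events one at a time; rejected events are dropped, never stored."""
    for event in events:
        if is_useful(event):
            yield event


def running_total(events, is_useful):
    """Consume a (possibly unbounded) iterator and yield the total after each kept event."""
    total = 0
    for event in keep_useful(events, is_useful):
        total += event["amount"]
        yield total
```

Because both functions are generators, only the current event and the running total are held in memory, regardless of how many events pass through.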
From the standpoint of real-time computing, Yahoo!'s distributed stream computing platform S4 has advantages over Hadoop. MapReduce systems mainly address the processing of static data: in current MapReduce implementations, by the time a computation starts, the data is generally already in place (for example, saved to a distributed file system).
A stream computing system, by contrast, generally starts before all of the data is in place; the data keeps flowing in continuously. Unlike batch processing systems, which focus on total data processing throughput, a stream computing system focuses on processing latency: the faster incoming data is processed, the better.
The design of Yahoo! S4 draws heavily on IBM's Stream Processing Core (SPC) middleware. The difference is that SPC uses a subscription model, whereas S4 combines the MapReduce and Actor models. The design goals of Yahoo! S4 are: a simple programming interface; high availability and high scalability; avoiding disk I/O wherever possible and using local memory to reduce processing latency; a decentralized, symmetric architecture in which all nodes have the same responsibilities, which eases deployment and maintenance; customizability; and a design that is sound, easy to use, and flexible.
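The combination of key-based grouping (as in MapReduce) with actor-style units that keep their state in local memory can be sketched as follows. This is a rough illustration, assuming one in-memory processing element per key; the class names `CounterPE` and `Dispatcher` are hypothetical and this is not the S4 API.

```python
class CounterPE:
    """A hypothetical processing element: holds in-memory state for one key."""

    def __init__(self, key):
        self.key = key
        self.count = 0

    def on_event(self, event):
        """Update local state for this key and emit the new count."""
        self.count += 1
        return self.key, self.count


class Dispatcher:
    """Routes each event to the processing element that owns the event's key."""

    def __init__(self, pe_factory, key_of):
        self.pe_factory = pe_factory  # callable that builds a PE for a new key
        self.key_of = key_of          # callable that extracts the key from an event
        self.instances = {}           # key -> PE, kept in memory (no disk I/O)

    def dispatch(self, event):
        key = self.key_of(event)
        pe = self.instances.get(key)
        if pe is None:
            # First event for this key: create its processing element lazily.
            pe = self.instances[key] = self.pe_factory(key)
        return pe.on_event(event)
```

In a distributed setting, the key would also decide which node hosts the processing element; the sketch keeps everything in one process for brevity.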
Storm, Twitter's real-time data processing platform, has also received widespread attention (Twitter released its source code at the Strange Loop conference, held on September 19 in St. Louis). Storm's momentum has been strong, and the tools Twitter has developed have made it more powerful.
Storm is mainly used in the following three areas. Stream processing: Storm can process new data in real time and update databases, with both fault tolerance and scalability. Continuous computation: Storm can run a query continuously and return the results to the client immediately, for example by pushing hot topics from Twitter to the browser. Distributed remote procedure call (distributed RPC): Storm can be used to run intensive queries in parallel. Here the Storm topology is a distributed function that waits for invocation messages; when it receives one, it evaluates the query and returns the results. For example, distributed RPC can run searches in parallel or process large collections of data.
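The distributed RPC pattern, fanning one query out over partitions in parallel and merging the answers, can be approximated on a single machine. The sketch below uses Python's standard `concurrent.futures` thread pool rather than Storm itself; the function names and the in-memory partitions are assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def search_partition(partition, term):
    """Evaluate the query against one partition of the data."""
    return [doc for doc in partition if term in doc]


def parallel_search(partitions, term):
    """Fan the query out to every partition in parallel, then merge the hits."""
    workers = max(1, len(partitions))  # the pool requires at least one worker
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_partition = pool.map(search_partition, partitions, [term] * len(partitions))
        # Sort the merged hits so the result does not depend on partition layout.
        return sorted(hit for hits in per_partition for hit in hits)
```

A caller sees a single function call and a single result, while the work behind it is spread across workers, which is the essence of the pattern described above.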
Another well-known distributed stream system is Borealis, developed jointly by Brandeis University, Brown University, and MIT. It evolved from the earlier stream systems Aurora and Medusa. Development of Borealis has since stopped, and the latest release dates from 2008.
Borealis is backed by a wealth of papers and by complete user and developer documentation. It is implemented in C++ and runs on x86-based Linux platforms. The system is open source and uses a large number of third-party open source components, including ANTLR for query language translation and NMSTL, a C++ network programming library.
The stream model of Borealis is basically consistent with that of other stream systems: operators accept multiple input streams and produce output streams. For fault tolerance, computation is deterministic, so in systems with high fault-tolerance requirements the inputs to stream operators are ordered.
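Why ordered inputs matter for deterministic computation can be shown with a small merge of timestamped streams. This is an illustrative sketch, not Borealis's actual mechanism: it assumes each stream is already sorted by timestamp and breaks ties by stream index, so every replica that sees the same streams produces the same sequence. The function names are hypothetical.

```python
import heapq


def _tag(index, stream):
    """Attach the stream index to each (timestamp, value) tuple as a tie-breaker."""
    for timestamp, value in stream:
        yield timestamp, index, value


def ordered_merge(*streams):
    """Merge timestamp-sorted streams into one deterministic sequence."""
    tagged = [_tag(index, stream) for index, stream in enumerate(streams)]
    # heapq.merge lazily merges already-sorted inputs; tuples compare by
    # timestamp first, then by stream index.
    for timestamp, _index, value in heapq.merge(*tagged):
        yield timestamp, value
```

An operator that consumes the merged sequence sees events in one fixed order, so replaying the same inputs after a failure yields the same results.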
With the growing demand for real-time computing, distributed stream computing will become the next major focus of distributed computing and a powerful complement to MapReduce frameworks such as Hadoop.