Translation: In-Stream Big Data Processing (streaming big data processing)




Original: http://highlyscalable.wordpress.com/2013/08/20/in-stream-big-data-processing/



Ilya Katsov



For quite some time now, the big data community has generally recognized the inadequacy of batch-only data processing.



Many applications have an urgent need for real-time querying and stream processing. In recent years, driven by this need, a series of solutions has emerged: Twitter Storm, Yahoo S4, Cloudera Impala, Apache Spark, and Apache Tez have joined the big data and NoSQL camp. This article explores the techniques used in streaming systems, analyzes their relationship to large-scale batch processing and to OLTP/OLAP databases, and explores how a unified query engine could support streaming, batch, and OLAP processing at the same time.






At Grid Dynamics, we faced the need to build a streaming data processing system that handles 8 billion events per day and provides fault tolerance and strict transactional semantics; that is, events must be neither lost nor processed more than once.



The new system is a complement to and successor of an existing system. The existing system is based on Hadoop; its data processing latency is high and its maintenance cost is too high.



Such requirements and systems are quite generic and typical, so we describe them below as a canonical model, an abstract problem statement.



A high-level overview of our production environment:









This is a typical big data infrastructure: applications in multiple data centers produce data, the data is delivered through a collection subsystem to HDFS in a central facility, and the raw data is aggregated and analyzed with the standard Hadoop stack (MapReduce, Pig, Hive). The aggregated results are stored in HDFS and NoSQL and then exported to an OLAP database for access by custom user applications. Our goal is to equip the whole facility with a new streaming engine (see the bottom of the figure) that handles the most intensive data streams and delivers pre-aggregated data to HDFS, reducing the amount of raw data in Hadoop and the load on batch jobs.






The design of the streaming engine is driven by the following requirements:





    • SQL-like functionality: the engine must run SQL-like queries, including joins and various aggregation functions over time windows, to implement complex business logic.

      The engine can also use relatively static data (admixtures) derived from the aggregated results.

      More complex multi-pass data mining algorithms are outside the short-term scope.

    • Modularity and flexibility: the engine should not be limited to running SQL-like queries with the corresponding pipelines created and deployed automatically; it should also be possible to connect modules and conveniently assemble more complex data processing chains.

    • Fault tolerance: strict fault tolerance is a basic requirement for the engine. As sketched in the figure, one possible design is a set of distributed data processing pipelines implementing joins, aggregations, or chains of such operations, connected through fault-tolerant persistent buffers. These buffers also enable a publish/subscribe communication style, so pipelines can be added or removed very conveniently, which maximizes the modularity of the system.

      Pipelines can also be stateful, and the engine's middleware should provide persistent storage to enable a stateful checkpoint mechanism.

      All of these topics are discussed in later sections of this article.

    • Interoperability with Hadoop: the engine should be able to ingest both streaming data and data from Hadoop, serving as a custom query engine on top of HDFS.

    • High performance and portability: the system should deliver thousands of messages per second even on the smallest clusters. The engine should be compact and efficient, and deployable on small clusters in multiple data centers.


To figure out how to implement such a system, we discuss the following topics:





    • First, we discuss the relationship between streaming data processing systems, batch processing systems, and relational query engines, and show that streaming systems can reuse many techniques that have already been applied in those other types of systems.

    • Second, we describe some of the patterns and techniques used in building streaming frameworks and systems.

      In addition, we survey emerging technologies and provide some implementation tips.





The article is based on a research project developed at Grid Dynamics Labs. Much of the credit goes to Alexey Kharlamov and Rafael Bagmanov, who led the project, and to the other contributors: Dmitry Suslov, Konstantine Golikov, Evelina Stepanova, Anatoly Vinogradov, Roman Belous, and Varvara Strizhkova.


Basics of distributed query processing


Distributed stream processing is obviously related to distributed relational databases. Many standard query processing techniques can be applied in a streaming engine, so it is very useful to understand the classical algorithms of distributed query processing and their relationship to streaming and to other popular frameworks such as MapReduce.



Distributed query processing has been developing for decades and is a very large area of knowledge. We start with a concise overview of the main techniques to provide a basis for the discussion below.


Partitioning and shuffling


Distributed and parallel query processing relies heavily on data partitioning: large datasets are split into multiple partitions so that separate processes can handle them independently.



Query processing may consist of multiple steps, and each step can have its own partitioning strategy, so data shuffling operations are widely used in distributed databases.



Although optimal partitioning for selection and projection operations can require some care (for example, for range queries), for stream filtering it is usually enough to distribute data across processors with hash partitioning.



Distributed joins are not so simple and require deeper study. In a distributed environment, parallel joins are implemented through data partitioning: data is distributed across processors, each processor runs a serial join algorithm (such as nested-loops join, sort-merge join, or hash join) on its part of the data, and the final result is assembled from the results of the different processors.



Distributed joins mainly rely on two data partitioning techniques:





    • Disjoint data partitioning
    • Divide and broadcast join





Disjoint data partitioning shuffles the data into different partitions using the join key, so that the partitions do not overlap.



Each processor runs the join on its own partition of the data, and the final result is a simple union of the results from the different processors. Consider an example where datasets R and S are joined on a numeric key k and partitioned with a simple modulo function (assuming the data has already been distributed across the processors according to some policy):
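As an illustration, here is a minimal sketch of a disjoint-partition join in Python, assuming small in-memory datasets, a modulo-based partitioning function, and illustrative field names and partition count (none of which come from the article's figure):

```python
# A minimal sketch of a disjoint-partition (partitioned hash) join.
from collections import defaultdict

NUM_PARTITIONS = 3

def partition(dataset, key):
    """Shuffle tuples into disjoint partitions by the join key modulo the partition count."""
    parts = defaultdict(list)
    for row in dataset:
        parts[row[key] % NUM_PARTITIONS].append(row)
    return parts

def local_hash_join(r_part, s_part, key):
    """Serial hash join executed independently on each processor's partition."""
    hash_table = defaultdict(list)
    for s in s_part:
        hash_table[s[key]].append(s)
    return [{**r, **s} for r in r_part for s in hash_table[r[key]]]

R = [{"k": 1, "r": "a"}, {"k": 2, "r": "b"}, {"k": 4, "r": "c"}]
S = [{"k": 1, "s": "x"}, {"k": 4, "s": "y"}]

r_parts, s_parts = partition(R, "k"), partition(S, "k")
# The final result is a simple union of the per-partition join results.
result = [row for p in range(NUM_PARTITIONS)
          for row in local_hash_join(r_parts[p], s_parts[p], "k")]
print(result)
```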






The divide and broadcast join algorithm works as follows: dataset R is divided into disjoint partitions (R1, R2, and R3 in the figure), while dataset S is copied to all processors. In a distributed database, the partitioning step itself is usually not part of query processing, because the data is already distributed across the nodes at load time.





This strategy is suitable for joining a large dataset with a small one, or for joining two small datasets. Streaming systems can apply this technique, for example, to join static data (an admixture) with a data stream.



GROUP BY processing also relies on shuffling and is essentially similar to MapReduce. Consider a scenario in which a dataset is grouped by a string field and the numeric field is then summed within each group:






In this example, the computation consists of two steps: a local rollup and a global rollup, which correspond roughly to the map and reduce operations. The local rollup is optional; the raw data can instead be transferred, shuffled, and aggregated directly in the global rollup phase.
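A minimal sketch of this two-step rollup, assuming in-memory partitions and illustrative (key, value) pairs:

```python
# Two-step GROUP BY: optional local rollup per processor, then a global rollup.
from collections import defaultdict

def local_rollup(partition):
    """Pre-aggregate on the processor that holds the raw data (map-side combine)."""
    sums = defaultdict(int)
    for key, value in partition:
        sums[key] += value
    return list(sums.items())

def global_rollup(shuffled):
    """Merge the partial sums after records with the same key are shuffled together."""
    sums = defaultdict(int)
    for key, partial in shuffled:
        sums[key] += partial
    return dict(sums)

partitions = [[("a", 1), ("b", 2), ("a", 3)],   # data held by processor 1
              [("b", 4), ("c", 5)]]             # data held by processor 2

partials = [pair for p in partitions for pair in local_rollup(p)]
print(global_rollup(partials))   # {'a': 4, 'b': 6, 'c': 5}
```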



The general observation of this section is that all the algorithms above can be naturally implemented with a message-passing architectural pattern: the query execution engine can be viewed as a distributed network of nodes connected by message queues, conceptually similar to streaming pipelines.





Pipeline


In the previous section, we noted that many distributed query processing algorithms resemble message-passing networks.



However, this alone does not explain the efficiency of streaming: all operations in a query should be chained so that data flows smoothly through the entire pipeline. That is, no operation should block processing by waiting for a large portion of the input without producing any output, and there should be no need to write intermediate results to disk.



Some operations, such as sorting, are inherently incompatible with this idea (obviously, a sort cannot produce any output before it has consumed its entire input), but pipelined algorithms are suitable for a great many scenarios. A typical pipeline looks like this:










In this example, a hash join of four datasets R1, S1, S2, and S3 is performed using three processors. First, hash tables are built for S1, S2, and S3 in parallel; then the tuples of R1 flow through the pipeline, probing the S1, S2, and S3 hash tables for matching records. Streaming naturally uses this technique to join a data stream with static data.



In relational databases, join operations can also use the symmetric hash join algorithm or other advanced variants.



A symmetric hash join is a generalization of the hash join algorithm. Whereas a normal hash join requires at least one input to be fully available before any output can be produced (that input is used to build the hash table), a symmetric hash join maintains a hash table for each of the two inputs and fills them as tuples arrive:









When a tuple arrives, it first probes the hash table of the other data stream; if a matching record is found, the result is emitted. The tuple is then inserted into the hash table of its own data stream.
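The procedure can be sketched as follows, as a simplified in-memory version with an illustrative join key and no buffer eviction:

```python
# A minimal sketch of a symmetric hash join on two streams of dict-shaped tuples.
from collections import defaultdict

class SymmetricHashJoin:
    def __init__(self, key):
        self.key = key
        self.tables = {"left": defaultdict(list), "right": defaultdict(list)}

    def on_tuple(self, side, row):
        """Probe the other stream's hash table, emit matches, then insert the tuple."""
        other = "right" if side == "left" else "left"
        k = row[self.key]
        matches = [{**row, **m} for m in self.tables[other][k]]
        self.tables[side][k].append(row)
        return matches

join = SymmetricHashJoin("k")
print(join.on_tuple("left",  {"k": 1, "l": "a"}))   # [] (no match yet)
print(join.on_tuple("right", {"k": 1, "r": "x"}))   # one joined tuple
```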



Of course, such an algorithm does not make sense for joining infinite streams in their entirety. In most scenarios, joins are performed over a finite time window or some other type of buffer, for example an LFU cache that keeps the most frequently used tuples of each stream.



A symmetric hash join is appropriate when the buffer is large relative to the stream rate, when the buffer is frequently flushed by application logic, or when the cache eviction policy is unpredictable. In other cases a simple hash join suffices, because the buffer is kept full at all times and does not block the processing flow:






It is worth noting that stream processing often requires sophisticated stream correlation algorithms in which records are matched not on equality conditions but on scoring metrics; in such scenarios, a more complex cache structure must be maintained for the two streams.





Stream processing patterns


In the previous sections, we discussed a number of standard query processing techniques that can also be used in large-scale parallel stream processing.



Conceptually, it seems that an efficient distributed database query engine should be capable of stream processing, and vice versa: a streaming system should also be able to act as a distributed database query engine. Shuffling and pipelining are the key techniques of distributed query processing, and they can be implemented naturally with message-passing networks.



But the real situation is not that simple.



In a database query engine, reliability is less critical because a read-only query can always be re-executed, whereas a streaming system must pay close attention to reliable message processing. In this section, we discuss the techniques streaming systems use to guarantee message delivery, along with some other patterns that are not typical of standard query processing.


Stream playback


In a streaming system, the ability to turn back the clock and replay a data stream is critical, for the following reasons:





    • It is the only way to guarantee correct data processing. Even if the data processing pipeline is fault-tolerant, it is difficult to ensure that the processing logic itself is defect-free. People always face the need to fix and redeploy the system and to replay the data through the new version of the pipeline.
    • Problem investigation requires ad hoc queries. If a problem occurs, people may need to add logging or modify the code and re-run the system on the data that triggered the problem.

    • Even if errors are infrequent and the system is fault-tolerant as a whole, the streaming system must be designed so that it can re-read specific messages from the data source when an error occurs.





As a result, input data typically flows from the data source into the streaming pipeline through a persistent buffer that allows clients to move their read pointers back and forth.









The Kafka message queuing system implements exactly such a buffer: it supports scalable distributed deployment and fault tolerance while providing high performance.



Stream playback requires that the system design consider at least the following requirements:





    • The system can store raw data for a preconfigured period of time.

    • The system can undo part of the processing results, replay the corresponding input data, and produce a new version of the results.

    • The system can rewind the data quickly, replay it, and then catch up with the continuously arriving stream (a sketch of rewinding a Kafka read pointer follows this list).
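As a concrete illustration, here is a minimal sketch of moving a read pointer back in a Kafka-backed buffer, assuming the kafka-python client; the topic name, broker address, offset, and the reprocess function are illustrative assumptions, not part of the system described in this article.

```python
# Rewind a Kafka consumer to an earlier offset and replay events from there.
from kafka import KafkaConsumer, TopicPartition

def reprocess(payload):
    # Hypothetical downstream processing; stands in for the real pipeline entry point.
    print(payload)

tp = TopicPartition("events", 0)                       # illustrative topic/partition
consumer = KafkaConsumer(bootstrap_servers="localhost:9092",
                         enable_auto_commit=False)
consumer.assign([tp])

# Move the read pointer back to a previously processed offset and replay
# everything from that point; new events are picked up once we catch up.
consumer.seek(tp, 42000)                               # illustrative offset
for record in consumer:
    reprocess(record.value)
```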




Lineage tracking


In a streaming system, events flow through a chain of processors until they reach a final destination (such as an external database). Each input event produces a directed graph of descendant events (its lineage), and the graph ends with the final results.



To guarantee reliable data processing, the entire graph must be processed successfully, and processing must be restartable in case of failure.



Achieving efficient lineage tracking is a challenge. Let us first look at how Twitter Storm tracks messages to guarantee "at least once" processing semantics:





  • All events emitted by a data source (the first node of the processing graph) are marked with a random event ID. For each initial event, the framework maintains an [event ID -> signature] key-value pair, where the signature is initialized to the event ID.
  • When a downstream node receives the initial event, it can produce zero or more events, each carrying its own random event ID together with the ID of the initial event.
  • When an event is successfully received and processed by the next node in the graph, that node updates the signature of the corresponding initial event by XOR-ing into it a) the ID of the incoming event and b) the IDs of all events produced from it. In part 2 of the figure, event 01111 generates events 01100, 10010, and 00010, so the signature of event 01111 becomes 11100 (= 01111 (initial value) XOR 01111 XOR 01100 XOR 10010 XOR 00010).

  • An event can also be generated from multiple input events. In that case the event is associated with several initial events and carries several initial IDs (the yellow-background event in part 3 of the figure).

  • A signature of 0 indicates that the event graph was fully processed: the last node confirms that the final event in the graph was processed successfully and that no new events were emitted downstream. The framework then sends a commit message to the event source node (part 3).

  • The framework periodically walks the table of initial events looking for events that are still not fully processed (i.e., whose signature is not 0). Such events are marked as failed, and the framework asks the source node to replay them. (Translator's note: Storm does not replay messages automatically. Messages can be sent with a message ID parameter, and replay logic can be implemented based on the IDs of failed messages; in practice a spout is typically connected to a message queuing system and uses the queue's replay capability.)
  • It is important to note that, because XOR is commutative, the order of signature updates does not matter. In the figure, the confirmation in part 2 could happen after the one in part 3, which makes fully asynchronous processing possible.

  • It is also worth noting that the algorithm is not strictly reliable: some combinations of IDs could accidentally drive a signature to 0. However, 64-bit event IDs are enough to guarantee a very low error probability, roughly 2^(-64), which is acceptable for most applications. The main advantage of the algorithm is that it needs only a small amount of memory for the signature table. A minimal sketch of this XOR signature scheme follows the list.
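The sketch below uses the small binary IDs from the example above for readability; a real implementation would use random 64-bit IDs, and the acker structure here is only an illustration of the bookkeeping, not Storm's actual code.

```python
# XOR-based signature tracking (Storm-style acking), minimal sketch.
class AckerSketch:
    def __init__(self):
        self.signatures = {}          # root event id -> signature

    def emit_root(self, root_id):
        self.signatures[root_id] = root_id

    def ack(self, root_id, input_id, output_ids):
        """A node acknowledges one input event and the events it produced:
        XOR the input id and all output ids into the root's signature."""
        sig = self.signatures[root_id] ^ input_id
        for out in output_ids:
            sig ^= out
        self.signatures[root_id] = sig
        return sig == 0               # 0 means the whole lineage graph is done

acker = AckerSketch()
acker.emit_root(0b01111)
# The root event produces three child events; signature becomes 0b11100.
print(acker.ack(0b01111, 0b01111, [0b01100, 0b10010, 0b00010]))  # False
# Each child is then fully processed and produces nothing further.
for child in (0b01100, 0b10010, 0b00010):
    done = acker.ack(0b01111, child, [])
print(done)   # True: signature returned to 0, lineage fully processed
```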







The implementation above is elegant and has a decentralized property: each node sends confirmation messages independently, and no central node is needed to track lineage explicitly.



However, transactional processing becomes harder for data flows that maintain sliding windows or other kinds of buffers. For example, a sliding window may contain thousands of events, many of which are in an uncommitted or in-flight computational state and need to be persisted frequently, which makes event-by-event acknowledgment hard to manage.



Apache Spark [3] takes a second approach, based on the idea of treating the result as a function of the input data. To simplify lineage tracking, the framework processes events in batches, so the result is also a batch, and each output batch is a function of the input batches.



Results can then be computed batch by batch; if a computation fails, the framework simply re-runs it. Consider the following example:






In this example, the framework joins two streams over a sliding window and then passes the result through one more processing stage. The framework splits the streams into batches, assigns an ID to each batch, and can fetch any batch by its ID at any time.



Stream processing is thus split into a series of transactions, each of which takes a group of input batches, transforms them with the processing function, and saves the results. In the figure, the red highlighted section represents a single transaction. If a transaction fails, the framework simply re-runs it; importantly, transactions can run in parallel.
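A minimal sketch of this "result as a function of input batches" idea, with an illustrative in-memory batch store and a deterministic processing function standing in for the real pipeline:

```python
# Results are deterministic functions of identified input batches, so a failed
# transaction can simply be re-run from the stored inputs.
batch_store = {}      # batch_id -> list of raw events (replayable input)
results = {}          # batch_id -> processed output

def process(batch):
    """Deterministic processing function; re-running it yields the same result."""
    return sorted(set(batch))

def run_transaction(batch_id):
    results[batch_id] = process(batch_store[batch_id])

batch_store[1] = ["c", "a", "b", "a"]
run_transaction(1)
del results[1]          # simulate a failure after (or during) the transaction
run_transaction(1)      # recovery: re-run the same function on the same batch
print(results[1])       # ['a', 'b', 'c']
```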



This approach is simple and powerful, enables centralized transaction management, and naturally provides "exactly once" message processing semantics.



The technique also supports batch processing and stream processing at the same time, since the input is split into a series of batches regardless of whether the data is streaming or static.


State checkpoints


In the previous section, we used signatures (checksums) in the lineage tracking algorithm to provide "at least once" message delivery semantics.



This technique improves the reliability of the system, but leaves at least two open issues:





    • Many scenarios require "exactly once" processing semantics. For example, if some messages are delivered twice, a counting pipeline will produce wrong counts.
    • As messages are processed, the computation state of nodes in the pipeline is updated.

      That state must be persisted or replicated to avoid loss when a node fails.





Twitter Storm addresses these issues with the following protocol:





    • Events are grouped into batches, and each batch is associated with a transaction ID.

      Transaction IDs increase monotonically (e.g., the first batch has ID 1, the second ID 2, and so on). If the pipeline fails while processing a batch, the batch is re-sent with the same transaction ID.

    • First, the framework notifies the nodes in the pipeline that a new transaction has started. Then the framework sends the batch of data through the pipeline.

      Finally, the framework announces that the transaction is finished, and all nodes commit their state changes (for example, by updating an external database).

    • The framework guarantees that transactions are committed in order; for example, transaction 2 cannot be committed before transaction 1. This lets each processing node persist its state changes with the following logic (sketched after this list):
      • The latest transaction ID is persisted together with the state.
      • If the transaction ID the framework asks to commit differs from the ID stored in the database, the state is updated, for example by incrementing the counter in the database. Thanks to the strong ordering of transactions, each batch of data is applied exactly once.
      • If the current transaction ID equals the ID in the database, this is a replay of the batch, and the node ignores the commit: the node has already processed this batch and updated its state, and the transaction must have failed because of an error elsewhere in the pipeline.

      • Ordered commits are critical for "exactly once" processing semantics. However, strictly sequential transaction processing is undesirable, because the first node of the pipeline would sit idle until all downstream nodes finished. This can be alleviated by processing transactions in parallel and serializing only the commits, as illustrated in the figure.
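A minimal sketch of this commit logic, with an in-memory dict standing in for the external database and a simple counter as the node's state:

```python
# Idempotent state update relying on ordered commits and a stored transaction ID.
db = {"txid": 0, "counter": 0}

def commit(txid, increment):
    """Apply a batch's state change exactly once."""
    if txid == db["txid"]:
        # Same txid as the last commit: this is a replayed batch whose commit
        # already succeeded earlier; skip the update to stay idempotent.
        return
    # Strong ordering guarantees this batch has not been applied yet,
    # so it is safe to update the state and record the new transaction ID.
    db["counter"] += increment
    db["txid"] = txid

commit(1, 10)
commit(2, 5)
commit(2, 5)          # replay of batch 2 after a downstream failure: ignored
print(db)             # {'txid': 2, 'counter': 15}
```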








Provided that the data source is fault-tolerant and can replay data, this transactional protocol guarantees "exactly once" processing semantics. However, even with batching, persistent state updates can cause a serious performance degradation.



Therefore, intermediate computational state should be minimized or avoided wherever possible.



As an aside, state writes can be implemented in different ways. The most straightforward approach is to copy the in-memory state to persistent storage as part of the transaction commit.



This does not work well for large state (such as sliding windows). An alternative is to store a transaction log, i.e., the sequence of operations that transforms the original state into the new one (for a sliding window, the set of added and evicted events). Although this approach makes recovery more cumbersome, because the state must be rebuilt from the log, it can deliver better performance in many scenarios.





Additivity and Sketches


Additivity of intermediate and final computational results is very important: it greatly simplifies the design, implementation, maintenance, and recovery of streaming data processing systems. Additivity means that the result for a larger time range or a larger data partition can be combined from the results of smaller time windows or smaller partitions. For example, the daily page view count equals the sum of the hourly page view counts.



Additivity of state goes hand in hand with the stream splitting discussed in the previous section: each batch can be computed or recomputed independently, which simplifies lineage tracking and reduces the complexity of state maintenance.



Achieving additivity is often not easy:





    • In some scenarios, additivity is indeed easy. For example, simple counts are additive.
    • In some scenarios, extra information must be stored to achieve additivity. For example, a system that reports the average purchase price per hour cannot compute the daily average from the 24 hourly averages alone; but if it also stores the purchase count for each hour, the daily average is easy to derive (see the sketch after this list).
    • In other scenarios, additivity is very hard or even impossible to achieve. For example, a system that counts unique visitors to a site cannot simply add daily counts: if 100 unique users visited the site yesterday and 100 today, the number of unique users over the two days can be anywhere between 100 and 200.

      One would have to maintain lists of user IDs and obtain additivity through intersections and unions of those lists.

      The size and processing cost of the user ID lists is comparable to the raw data itself.
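A minimal sketch of the average-price example, assuming illustrative per-hour (sum, count) pairs:

```python
# Making a non-additive metric (average purchase price) additive by storing
# extra information: per-hour (sum, count) pairs merge by simple addition,
# and the daily average is derived at the end.
hourly = [(120.0, 4), (75.0, 3), (230.0, 10)]   # illustrative (sum, count) per hour

total_sum = sum(s for s, _ in hourly)
total_count = sum(c for _, c in hourly)
print(total_sum / total_count)                   # daily average purchase price
```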






Sketches are an efficient way to turn non-additive values into additive ones. In the example above, the ID lists can be replaced by compact additive statistical counters. These counters provide approximate rather than exact values, which is acceptable in many applications. Sketches are popular in domains such as internet advertising and can be seen as a stream processing pattern in their own right. See [5] for an in-depth overview of sketching techniques.
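As an illustration of the idea, here is a minimal sketch of a mergeable approximate distinct counter in the spirit of the Flajolet-Martin estimator; it is not one of the production-grade algorithms surveyed in [5], and the ID ranges are illustrative.

```python
# An exact ID list replaced by a tiny, mergeable sketch for distinct counting.
import hashlib

def trailing_zeros(n):
    return (n & -n).bit_length() - 1 if n else 64

def sketch(ids):
    """The sketch is just the max number of trailing zero bits seen in hashed IDs."""
    r = 0
    for i in ids:
        h = int(hashlib.sha1(str(i).encode()).hexdigest(), 16)
        r = max(r, trailing_zeros(h))
    return r

def merge(r1, r2):
    return max(r1, r2)          # sketches from two days/partitions combine via max

def estimate(r):
    return 2 ** r / 0.77351     # standard Flajolet-Martin correction factor

yesterday = sketch(range(0, 100))
today = sketch(range(50, 150))
print(estimate(merge(yesterday, today)))   # rough estimate of distinct users
```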


Logical Time Tracking


Stream processing typically depends on time: aggregations and joins are usually performed over sliding time windows, and processing logic often depends on the interval between events. Obviously, a streaming system needs its own notion of time and cannot rely on the CPU wall clock, because data streams and particular events are replayed when problems occur, so correct time tracking is not trivial.



Usually, a global notion of logical time is implemented as follows:





    • Every event produced by the original system is marked with a timestamp.

    • Each processor in the pipeline tracks the maximum timestamp it has seen in the stream; if the persisted global clock lags behind, it is advanced to this maximum timestamp. All other processors synchronize their time with the global clock.
    • When data is replayed, the global clock is reset (a minimal sketch follows this list).
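A minimal sketch of this scheme, with an in-memory stand-in for the persisted global clock:

```python
# Logical time tracking: processors advance a shared logical clock from event
# timestamps; replay resets the logical clock, not the CPU wall clock.
global_clock = {"time": 0}

class Processor:
    def __init__(self):
        self.max_seen = 0

    def on_event(self, event_timestamp):
        self.max_seen = max(self.max_seen, event_timestamp)
        # Advance the global logical clock if this processor is ahead of it.
        if self.max_seen > global_clock["time"]:
            global_clock["time"] = self.max_seen

def start_replay(from_timestamp):
    global_clock["time"] = from_timestamp   # rewind logical time before replay

p = Processor()
for ts in (100, 105, 103):
    p.on_event(ts)
print(global_clock)        # {'time': 105}
start_replay(100)          # logical time is reset for the replayed data
```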




Aggregation in persistent storage


We have already discussed how persistent storage can be used for state checkpointing, but that is not the only way a streaming system can use external storage. Consider using Cassandra to join multiple data streams over a time window.



Instead of maintaining in-memory event buffers, we can save the incoming events of all data streams to Cassandra, using the join key as the row key, as shown in the figure:












On the other side, a second processor periodically iterates over the records, assembles and emits joined records, and evicts events that have fallen outside the time window.
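A minimal sketch of this store-and-join pattern, with an in-memory dict standing in for Cassandra and illustrative stream and field names:

```python
# Store-and-join: write events from both streams under their join key, then a
# periodic process scans the rows, emits joins and evicts expired entries.
import time

store = {}   # join_key -> {"stream_a": [...], "stream_b": [...], "ts": ...}

def write_event(stream, key, event):
    row = store.setdefault(key, {"stream_a": [], "stream_b": [], "ts": time.time()})
    row[stream].append(event)

def periodic_join(window_seconds=60):
    now, joined = time.time(), []
    for key in list(store):
        row = store[key]
        if row["stream_a"] and row["stream_b"]:
            joined.append((key, row["stream_a"], row["stream_b"]))
            del store[key]                       # joined: clean up the row
        elif now - row["ts"] > window_seconds:
            del store[key]                       # timed out of the window
    return joined

write_event("stream_a", "user42", {"click": "/home"})
write_event("stream_b", "user42", {"purchase": 19.99})
print(periodic_join())
```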



Cassandra can also facilitate this processing by sorting events according to their timestamps.



It is crucial to understand that a careless implementation can defeat the purpose of stream processing entirely: even with fast storage systems such as Cassandra or Redis, writing each data item individually introduces a serious performance bottleneck.



On the other hand, using a storage system provides a more robust state persistence capability, and in many scenarios an acceptable performance target can be achieved with optimizations such as batched writes.


Sliding window aggregation


Stream processing frequently deals with queries such as "what is the sum of some value over the last 10 minutes of the stream", i.e., continuous queries over a time window. The most straightforward way to answer such a query is to compute the aggregation function (such as sum) for each window instance independently. Obviously, this is not optimal, because two consecutive window instances overlap to a large extent.



If the window at time t contains the samples {s(0), s(1), s(2), ..., s(t-1), s(t)}, then the window at time t+1 contains {s(1), s(2), s(3), ..., s(t), s(t+1)}. This observation enables incremental processing.






Incremental computation over a time window is also widely used in digital signal processing, in both software and hardware. A typical example is computing a sum value.



If the sum over the current window is known, the sum over the next window can be obtained by adding the new sample and subtracting the oldest sample that slides out of the window.
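A minimal sketch of such an incremental sliding-window sum (window size and samples are illustrative):

```python
# Incremental sliding-window sum: add the newest sample, subtract the oldest.
from collections import deque

class SlidingSum:
    def __init__(self, window_size):
        self.window = deque(maxlen=window_size)
        self.total = 0

    def add(self, sample):
        if len(self.window) == self.window.maxlen:
            self.total -= self.window[0]   # the oldest sample is about to drop out
        self.window.append(sample)
        self.total += sample
        return self.total

s = SlidingSum(window_size=3)
for x in (1, 2, 3, 4, 5):
    print(s.add(x))   # 1, 3, 6, 9 (2+3+4), 12 (3+4+5)
```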









Similar techniques can be used not only for simple aggregation functions such as sums and products, but also for more complex transformations. For example, the SDFT (sliding discrete Fourier transform) algorithm [4] is much more efficient than recomputing an FFT (fast Fourier transform) for each window.


Query processing pipeline: Storm, Cassandra, Kafka


Now let us return to the practical problem posed at the beginning of the article. We designed and implemented our streaming system on top of Storm, Kafka, and Cassandra, applying the techniques described above. Here we provide only a concise overview of the solution; a detailed description of all the implementation pitfalls and tricks would be too lengthy and probably deserves a separate article.









The system naturally uses Kafka 0.8.



Kafka serves as a partitioned, fault-tolerant event buffer that enables stream replay and makes it easy to add new event producers and consumers, improving the scalability of the system. Kafka's ability to rewind read pointers also makes it possible to access incoming batches randomly and therefore to implement Spark-style lineage tracking. It is also possible to point the system's input at HDFS to process historical data.



As described earlier, Cassandra is used to implement state checkpoints and persistent-storage aggregation. In many usage scenarios, Cassandra also stores the final results.



Twitter Storm is the cornerstone of the system.



All active query processing is performed in Storm topologies, which interact with Kafka and Cassandra.



Some data flows are simple: data arrives in Kafka, Storm reads and processes it, and the results are stored in Cassandra or elsewhere. Other data flows are more complex: one Storm topology passes data to another topology via Kafka or Cassandra. Two flows of this kind are shown in the figure (the red and blue curved arrows).


Towards a unified big data processing platform


Existing technologies such as Hive, Storm, and Impala let us process big data with ease: batch processing for complex analytics and machine learning, real-time query processing for online analytics, and stream processing for continuous queries. Furthermore, the Lambda Architecture can effectively integrate these solutions. This brings us to the question of how these technologies and approaches might converge into a unified solution in the future. In this section we discuss the most prominent common ground between distributed relational query processing, batch processing, and stream processing, in order to see how they could be combined into a solution that covers all user scenarios and is therefore the most promising direction in this field.



The key observation is that relational query processing, MapReduce, and stream processing can all be implemented with the same concepts and techniques, such as shuffling and pipelining. At the same time:





    • Stream processing must guarantee strict data delivery and persistence of intermediate state.

      These properties are not critical for batch processing, where the computation can simply be restarted.

    • Stream processing is inseparable from pipelining. For batch processing, pipelining is less critical and in some scenarios not applicable at all. Apache Hive is based on staged MapReduce with materialized intermediate results and does not take full advantage of pipelining.





These two points imply that a tunable persistence strategy (in-memory message passing versus on-disk materialization) and tunable reliability would be notable features of the unified query engine we are imagining.



The unified query engine would provide a set of processing primitives and interfaces to higher-level frameworks.






Among emerging technologies, the following two are worth watching:





    • Apache Tez [8], part of the Stinger Initiative [9]. Apache Tez is designed to replace the MapReduce framework by introducing a set of fine-grained query processing primitives. Its goal is to let frameworks such as Apache Pig and Apache Hive decompose their query statements and scripts into efficient query processing pipelines instead of a series of MapReduce jobs, which are usually very slow because they store intermediate results.
    • Apache Spark [10]. The Spark project is probably the most advanced and promising unified big data processing platform; it already includes a batch processing framework, a SQL query engine, and a stream processing framework.




References




  1. A. Wilschut and P. Apers, "Dataflow Query Execution in a Parallel Main-Memory Environment"
  2. T. Urhan and M. Franklin, "XJoin: A Reactively-Scheduled Pipelined Join Operator"
  3. M. Zaharia, T. Das, H. Li, S. Shenker, and I. Stoica, "Discretized Streams: An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters"
  4. E. Jacobsen and R. Lyons, "The Sliding DFT"
  5. Elmagarmid, Data Streams: Models and Algorithms
  6. N. Marz, "Big Data Lambda Architecture"
  7. J. Kinley, "The Lambda Architecture: Principles for Architecting Realtime Big Data Systems"
  8. http://hortonworks.com/hadoop/tez/
  9. http://hortonworks.com/stinger/
  10. http://spark-project.org/













