S4: Yahoo!'s Distributed Stream Computing Platform

Source: Internet
Author: User

I recently read Yahoo!'s paper on S4, "S4: Distributed Stream Computing Platform". Here are my notes; criticism and corrections are welcome:

The first topic is the application scenario for stream computing, that is, why we need it. Yahoo! is a commercial search engine provider, and advertising accounts for a large part of its revenue. Ads are billed on a "cost-per-click" basis, that is, by the number of clicks. Yahoo! therefore needs to show the most relevant ads on every search results page, where relevance to the user's search is determined by the user's preferences, geographic location (country, region), previous search keywords, and previous clicks. Serving the most relevant ad on the user's next search, using all of this information in real time, requires processing incoming user requests as they arrive. This is stream processing: the data is constantly "streaming" in rather than sitting still.

Because the data arrives as a "stream", the batch-oriented Hadoop platform is not suited to this job, so Yahoo! developed S4. S4 is short for Simple Scalable Streaming System.

S4 treats a stream as a sequence of "events" (elements). Each event has the form (K, A), where K is the key and A is the attributes, somewhat like a MapReduce key-value pair.

The basic computing unit of S4 is the PE (Processing Element). Each PE is identified by four components: its functionality, which processes and computes on the data; the type of event it accepts; the keyed attribute of that event type; and the value of the keyed attribute in the events it actually accepts.

The third and fourth items can be a bit confusing, so take WordCount as an example. Suppose there is a kind of PE that counts words (similar to the Reduce step of WordCount in MapReduce), and PE1 and PE2 are both PEs of this kind. The event format is (word, count). The functionality of PE1 and PE2 is to add up the counts for a word, the accepted event type is (word, count), and the keyed attribute of that event type is word. The keyed attribute value of the events PE1 actually accepts is "hello", and the keyed attribute value of the events PE2 actually accepts is "world".
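The WordCount example can be sketched in a few lines of Python. This is only an illustration of the four identifying parts, not S4's actual API: the class name `WordCountPE`, the event type name "WordEvent", and the `accepts`/`process` methods are all made up for this sketch.

```python
from dataclasses import dataclass


@dataclass
class WordCountPE:
    """A keyed PE; the numbered comments mirror the four identifying parts."""
    event_type: str        # 2. type of event the PE accepts
    keyed_attribute: str   # 3. attribute of that event type used as the key
    key_value: str         # 4. the one key value this instance handles
    total: int = 0         # running state, not part of the identity

    def accepts(self, event_type: str, attributes: dict) -> bool:
        return (event_type == self.event_type
                and attributes.get(self.keyed_attribute) == self.key_value)

    def process(self, attributes: dict) -> None:
        # 1. functionality: add the incoming count to the running total
        self.total += attributes["count"]


pe1 = WordCountPE("WordEvent", "word", "hello")
pe2 = WordCountPE("WordEvent", "word", "world")

stream = [
    ("WordEvent", {"word": "hello", "count": 1}),
    ("WordEvent", {"word": "world", "count": 2}),
    ("WordEvent", {"word": "hello", "count": 3}),
]
for event_type, attributes in stream:
    for pe in (pe1, pe2):
        if pe.accepts(event_type, attributes):
            pe.process(attributes)

print(pe1.total, pe2.total)  # 4 2
```

PE1 and PE2 share the first three parts and differ only in the key value, which is why each word ends up with its own PE instance.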

There is a special kind of PE that is keyless: it accepts every event of its accepted type, regardless of key value. In the WordCount example, such a PE accepts the event (W, count) for every word W. You may already be thinking that this looks a bit like the map in MapReduce. In fact, I think that in general an S4 PE plays the role of map or reduce in MapReduce; the difference is that MapReduce's input is static and processed in batches, while S4's data keeps streaming in.
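To make the map analogy concrete, here is a rough sketch of a keyless PE in Python. It assumes a hypothetical upstream event type "TextEvent" carrying raw text, which these notes do not describe; the class and method names are also invented for illustration and are not S4's API.

```python
class KeylessSplitterPE:
    """Keyless PE: identified only by its functionality and event type."""
    event_type = "TextEvent"  # hypothetical event type name

    def accepts(self, event_type, attributes):
        # There is no keyed attribute and no key value to compare against,
        # so every event of the right type is accepted.
        return event_type == self.event_type

    def process(self, attributes):
        # Emit one (word, count) event per word, much like a map step.
        return [("WordEvent", {"word": word, "count": 1})
                for word in attributes["text"].split()]


splitter = KeylessSplitterPE()
emitted = splitter.process({"text": "hello world hello"})
print([attrs["word"] for _, attrs in emitted])  # ['hello', 'world', 'hello']
```

The emitted (word, count) events would then be picked up by keyed PEs, one per distinct word, which play the reduce role.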

Xiaodong Zhang of Ohio State University has a paper, "DOT: A Matrix Model for Analyzing, Optimizing and Deploying Software for Big Data Analytics in Distributed Systems", which argues that MapReduce and Dryad are essentially the same and can both be reduced to the DOT model. I think S4 can be reduced to the DOT model as well; only the form of the data input differs.

A PN (Processing Node) is the logical host of PEs; one PN can hold multiple PEs. A PN listens for incoming events, executes operations on the events that arrive, dispatches events to the right PEs, and emits output events with the help of the communication layer. Each keyed PE is mapped to exactly one PN (chosen by hashing the value of its keyed attribute), while keyless PEs are mapped to all PNs.
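The PE-to-PN mapping can be illustrated with a small routing function. This is a sketch under assumptions: the cluster size `NUM_NODES`, the function names, and the choice of CRC32 as the hash are all picked here for illustration, not taken from S4.

```python
import zlib
from typing import List, Optional

NUM_NODES = 4  # hypothetical number of logical PNs


def node_for_key(key_value: str, num_nodes: int = NUM_NODES) -> int:
    """Pick the single logical PN that hosts the PE for this key value."""
    # crc32 is used because Python's built-in hash() of a str changes
    # between processes, and routing needs a stable result.
    return zlib.crc32(key_value.encode("utf-8")) % num_nodes


def candidate_nodes(attributes: dict,
                    keyed_attribute: Optional[str]) -> List[int]:
    """Return the logical PNs on which a matching PE can live."""
    if keyed_attribute is None:
        # Keyless PE: present on every PN.
        return list(range(NUM_NODES))
    # Keyed PE: exactly one PN, determined by the key value.
    return [node_for_key(attributes[keyed_attribute])]


event = {"word": "hello", "count": 1}
print(candidate_nodes(event, "word"))  # one node id; same on every run
print(candidate_nodes(event, None))    # [0, 1, 2, 3]
```

Because the hash is deterministic, every event with the word "hello" reaches the same PN, so the count for that word lives in one place.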

The communication layer provides cluster management, automatic failover (handing work over to standby nodes), and the mapping between physical nodes and logical nodes. The PN described above is only a logical node; the communication layer maps it to a real physical node. This makes physical location transparent, which helps a great deal with failure recovery.
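Why location transparency helps with recovery can be seen in a toy model of the logical-to-physical mapping. Everything here is hypothetical (the `CommunicationLayer` class, its methods, and the host names); it only shows the idea that senders address logical nodes while the layer swaps in a standby host on failure.

```python
class CommunicationLayer:
    """Toy mapping from logical node ids to physical hosts."""

    def __init__(self, active, standby):
        self.mapping = dict(enumerate(active))  # logical id -> physical host
        self.standby = list(standby)            # idle hosts ready to take over

    def resolve(self, logical_id):
        return self.mapping[logical_id]

    def fail(self, host):
        """Remap every logical node on a failed host to a standby host."""
        for logical_id, current in self.mapping.items():
            if current == host:
                if not self.standby:
                    raise RuntimeError("no standby host available")
                self.mapping[logical_id] = self.standby.pop(0)


layer = CommunicationLayer(["host-a", "host-b"], ["host-c"])
layer.fail("host-b")
print(layer.resolve(0), layer.resolve(1))  # host-a host-c
```

Senders keep using logical id 1 before and after the failure; only the layer's internal table changes.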

 

 
