Comparison of operator layers of common computing frameworks

Last Update:2014-09-05 Source: Internet

Author: User

Background

Some time ago I designed the operator layer for our in-house computing framework, and in the process I compared the operator layers of several open-source computing frameworks. This article gives a coarse-grained summary of that comparison.

The following figure shows one way to split a computing framework into abstraction layers. For details, refer to the slides of the Spark SQL talk given at the Spark Meetup in Hangzhou last Sunday.

Pig-Latin
A DSL on top of Hadoop MapReduce. It is procedural and well suited to large-scale data analysis, and the syntax is elegant. Unfortunately, it can only be used from the CLI.

A = LOAD 'xx' AS (c1:int, c2:chararray, c3:float);
B = GROUP A BY c1;
C = FOREACH B GENERATE group, COUNT(A);
C = FOREACH B GENERATE $0, $1.c2;
X = COGROUP A BY a1, B BY b1;
Y = JOIN A BY a1 (LEFT|FULL|LEFT OUTER), B BY b1;

Cascading
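A wrapper around Hadoop MapReduce; Twitter's Summingbird builds on Cascading (through Scalding). Every operator is created with new, and the pipe instance is threaded through the assembly: each new operator takes the previous pipe as a constructor argument and the result is assigned back to the same variable.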

// define source and sink Taps.
Scheme sourceScheme = new TextLine( new Fields( "line" ) );
Tap source = new Hfs( sourceScheme, inputPath );

Scheme sinkScheme = new TextLine( new Fields( "word", "count" ) );
Tap sink = new Hfs( sinkScheme, outputPath, SinkMode.REPLACE );

// the 'head' of the pipe assembly
Pipe assembly = new Pipe( "wordcount" );

// For each input Tuple
// parse out each word into a new Tuple with the field name "word"
// regular expressions are optional in Cascading
String regex = "(?<!\\pL)(?=\\pL)[^ ]*(?<=\\pL)(?!\\pL)";
Function function = new RegexGenerator( new Fields( "word" ), regex );
assembly = new Each( assembly, new Fields( "line" ), function );

// group the Tuple stream by the "word" value
assembly = new GroupBy( assembly, new Fields( "word" ) );

// For every Tuple group
// count the number of occurrences of "word" and store result in
// a field named "count"
Aggregator count = new Count( new Fields( "count" ) );
assembly = new Every( assembly, count );

// initialize app properties, tell Hadoop which jar file to use
Properties properties = new Properties();
AppProps.setApplicationJarClass( properties, Main.class );

// plan a new Flow from the assembly using the source and sink Taps
// with the above properties
FlowConnector flowConnector = new HadoopFlowConnector( properties );
Flow flow = flowConnector.connect( "word-count", source, sink, assembly );

// execute the flow, block until complete
flow.complete();
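The "new operator wraps the previous pipe" pattern can be illustrated without Hadoop. The following is only a minimal in-memory sketch: the classes Pipe, Each and GroupCount are hypothetical stand-ins written for this illustration, not Cascading's actual classes, and tuples are simplified to plain strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Toy model only: Pipe, Each and GroupCount are hypothetical, not Cascading classes.
final class PipeChainSketch {

    /** A pipe keeps a reference to its upstream pipe, so operators form a chain. */
    static class Pipe {
        private final Pipe previous;

        Pipe() { this(null); }                      // head of the assembly
        Pipe(Pipe previous) { this.previous = previous; }

        /** What this operator does to its input; the head passes tuples through. */
        List<String> apply(List<String> tuples) { return tuples; }

        /** Runs all upstream operators first, then this one. */
        final List<String> run(List<String> input) {
            List<String> upstream = (previous == null) ? input : previous.run(input);
            return apply(upstream);
        }
    }

    /** Applies a one-to-many function to every tuple. */
    static class Each extends Pipe {
        private final Function<String, List<String>> function;

        Each(Pipe previous, Function<String, List<String>> function) {
            super(previous);
            this.function = function;
        }

        @Override
        List<String> apply(List<String> tuples) {
            List<String> out = new ArrayList<>();
            for (String tuple : tuples) {
                out.addAll(function.apply(tuple));
            }
            return out;
        }
    }

    /** Groups equal tuples and emits "value=count" in first-seen order. */
    static class GroupCount extends Pipe {
        GroupCount(Pipe previous) { super(previous); }

        @Override
        List<String> apply(List<String> tuples) {
            Map<String, Integer> counts = new LinkedHashMap<>();
            for (String tuple : tuples) {
                counts.merge(tuple, 1, Integer::sum);
            }
            List<String> out = new ArrayList<>();
            for (Map.Entry<String, Integer> e : counts.entrySet()) {
                out.add(e.getKey() + "=" + e.getValue());
            }
            return out;
        }
    }

    /** Builds the assembly by repeatedly wrapping and reassigning the same variable. */
    static List<String> wordCount(List<String> lines) {
        Pipe assembly = new Pipe();
        assembly = new Each(assembly, line -> Arrays.asList(line.trim().split("\\s+")));
        assembly = new GroupCount(assembly);
        return assembly.run(lines);
    }
}
```

Calling PipeChainSketch.wordCount(Arrays.asList("a b a", "b c")) yields the list [a=2, b=2, c=1]. The point of the sketch is the shape of wordCount(): the assembly variable always refers to the tail of the chain, which is why the style reads as a sequence of reassignments.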

Trident
Trident is the high-level abstraction that Storm provides. It extends the exactly-once semantics of transactional topologies to meet transactional requirements. The primitives are overly abstract, and building a topology involves many repetitive field definitions.

TridentState urlToTweeters =
       topology.newStaticState(getUrlToTweetersState());
TridentState tweetersToFollowers =
       topology.newStaticState(getTweeterToFollowersState());

topology.newDRPCStream("reach")
       .stateQuery(urlToTweeters, new Fields("args"), new MapGet(), new Fields("tweeters"))
       .each(new Fields("tweeters"), new ExpandList(), new Fields("tweeter"))
       .shuffle()
       .stateQuery(tweetersToFollowers, new Fields("tweeter"), new MapGet(), new Fields("followers"))
       .parallelismHint(200)
       .each(new Fields("followers"), new ExpandList(), new Fields("follower"))
       .groupBy(new Fields("follower"))
       .aggregate(new One(), new Fields("one"))
       .parallelismHint(20)
       .aggregate(new Count(), new Fields("reach"));

RDD
Resilient Distributed Datasets, the distributed data sets in Spark, come with a rich set of primitives. The flexibility of the RDD primitives comes from the functional nature and syntactic sugar of the Scala language, and their richness comes from the richness of the Scala collection API. It is hard to achieve the same expressiveness in Java. Even so, RDD is a very valuable reference.

scala> val textFile = sc.textFile("README.md")
textFile: spark.RDD[String] = spark.MappedRDD@2ee9b6e3

scala> textFile.count() // Number of items in this RDD
res0: Long = 126

scala> textFile.first() // First item in this RDD
res1: String = # Apache Spark

scala> val linesWithSpark = textFile.filter(line => line.contains("Spark"))
linesWithSpark: spark.RDD[String] = spark.FilteredRDD@7dd4af09

scala> textFile.filter(line => line.contains("Spark")).count() // How many lines contain "Spark"?
res3: Long = 15

scala> textFile.map(line => line.split(" ").size).reduce((a, b) => if (a > b) a else b)
res4: Long = 15

scala> val wordCounts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)
wordCounts: spark.RDD[(String, Int)] = spark.ShuffledAggregatedRDD@71f027b8

scala> wordCounts.collect()
res6: Array[(String, Int)] = Array((means,1), (under,2), (this,3), (Because,1), (Python,2), (agree,1), (cluster.,1), ...)
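For comparison, here is a rough Java analogue of the last two chains in the shell session. It is only a sketch under assumptions: it uses Java 8's java.util.stream on an in-memory list rather than Spark, and the class and method names (StreamWordCount, wordCounts, maxWordsPerLine) are invented for this illustration. It shows the shape such a chain takes in Java and says nothing about distribution.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// In-memory illustration only; this is not Spark's Java API.
final class StreamWordCount {

    /** Mirrors flatMap(split) -> map(word => (word, 1)) -> reduceByKey(_ + _). */
    static Map<String, Integer> wordCounts(List<String> lines) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(line.split(" ")))
                // toMap with a merge function plays the role of reduceByKey here
                .collect(Collectors.toMap(word -> word, word -> 1, Integer::sum, TreeMap::new));
    }

    /** Mirrors map(line => line.split(" ").size).reduce(max). */
    static int maxWordsPerLine(List<String> lines) {
        return lines.stream()
                .map(line -> line.split(" ").length)
                .reduce(0, Integer::max);
    }
}
```

The pairing step and the per-key reduction collapse into a single collector call, so the chain is no longer a one-to-one image of the Scala version; how much that matters for an operator-layer design is a judgment call.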

SchemaRDD
The "table" RDD in Spark SQL, which additionally provides a DSL for SQL. However, this DSL only covers SQL-style operations, so its expressiveness is limited. It is a "vertical" DSL.

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// createSchemaRDD is used to implicitly convert an RDD to a SchemaRDD.
import sqlContext.createSchemaRDD

// Define the schema using a case class.
case class Person(name: String, age: Int)

// Create an RDD of Person objects and register it as a table.
val people = sc.textFile("examples/src/main/resources/people.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt))
people.registerAsTable("people")

// SQL statements can be run by using the sql methods provided by sqlContext.
val teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")

// DSL: where(), select(), as(), join(), limit(), groupBy(), orderBy() etc.
val teenagers = people.where('age >= 10).where('age <= 19).select('name)
teenagers.map(t => "Name: " + t(0)).collect().foreach(println)

Apache Crunch
An open-source implementation of Google's FlumeJava. It is a standard operator layer that now supports both Hadoop and Spark jobs.
Crunch follows the FlumeJava design. It implements distributed, immutable data collections such as PCollection and PTable, together with the four basic primitives parallelDo(), groupByKey(), combineValues(), and flatten(). Other primitives such as count(), join(), and top() can be derived from these four. Deferred evaluation and the MSCR (MapShuffleCombineReduce) operation are also implemented.
Writing Crunch jobs depends heavily on Hadoop. In essence, Crunch is a way to write MapReduce pipelines on a batch computing framework. There are not many primitives, and parallelDo() is not a good fit for a streaming context. Many of its features are not needed here, but its abstract data representation, interface model, and flow control are worth borrowing.

public class WordCount extends Configured implements Tool, Serializable {
  public int run(String[] args) throws Exception {
    // Create an object to coordinate pipeline creation and execution.
    Pipeline pipeline = new MRPipeline(WordCount.class, getConf());
    // Reference a given text file as a collection of Strings.
    PCollection<String> lines = pipeline.readTextFile(args[0]);

    PCollection<String> words = lines.parallelDo(new DoFn<String, String>() {
      public void process(String line, Emitter<String> emitter) {
        for (String word : line.split("\\s+")) {
          emitter.emit(word);
        }
      }
    }, Writables.strings()); // Indicates the serialization format

    PTable<String, Long> counts = words.count();
    // Instruct the pipeline to write the resulting counts to a text file.
    pipeline.writeTextFile(counts, args[1]);
    // Execute the pipeline as a MapReduce.
    PipelineResult result = pipeline.done();

    return result.succeeded() ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    int result = ToolRunner.run(new Configuration(), new WordCount(), args);
    System.exit(result);
  }
}
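The claim that count() can be derived from the basic primitives can be made concrete with a small in-memory model. The sketch below is an assumption-laden illustration, not Crunch's implementation: the static methods borrow the FlumeJava primitive names but operate on ordinary java.util lists and maps, the class name DerivedCount is invented, and deferred evaluation and MSCR fusion are not modeled at all.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiConsumer;
import java.util.function.BinaryOperator;
import java.util.function.Consumer;

// Toy in-memory model; the method names mirror FlumeJava primitives,
// but nothing here is Crunch's real API.
final class DerivedCount {

    /** parallelDo: run a DoFn-like callback on every element; it may emit 0..n outputs. */
    static <S, T> List<T> parallelDo(List<S> input, BiConsumer<S, Consumer<T>> doFn) {
        List<T> out = new ArrayList<>();
        for (S element : input) {
            doFn.accept(element, out::add);
        }
        return out;
    }

    /** groupByKey: collect all values that share a key. */
    static <K, V> Map<K, List<V>> groupByKey(List<Map.Entry<K, V>> pairs) {
        Map<K, List<V>> grouped = new LinkedHashMap<>();
        for (Map.Entry<K, V> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    /** combineValues: fold each key's values with an associative combiner. */
    static <K, V> Map<K, V> combineValues(Map<K, List<V>> grouped, BinaryOperator<V> combiner) {
        Map<K, V> combined = new LinkedHashMap<>();
        for (Map.Entry<K, List<V>> group : grouped.entrySet()) {
            // groupByKey never produces an empty group, so reduce() always has a value
            combined.put(group.getKey(), group.getValue().stream().reduce(combiner).get());
        }
        return combined;
    }

    /** count, derived: emit (element, 1), group by element, sum the ones. */
    static <T> Map<T, Long> count(List<T> input) {
        List<Map.Entry<T, Long>> ones = DerivedCount.<T, Map.Entry<T, Long>>parallelDo(
                input,
                (element, emitter) -> emitter.accept(new AbstractMap.SimpleEntry<T, Long>(element, 1L)));
        return combineValues(groupByKey(ones), Long::sum);
    }
}
```

With this model, DerivedCount.count(Arrays.asList("a", "b", "a")) returns a map with a mapped to 2 and b mapped to 1. join() and top() could be layered on the same three calls in a similar way, which is the sense in which the primitive set is small but sufficient.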

Summary

The final figure compares the implementation layers of the various data pipeline projects on Hadoop:

End of article :)

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
