Interpreting Spark from a .NET Parallel perspective

Source: Internet
Author: User
Tags: shuffle

For a developer like me who has been working on the .NET platform, big data terms such as Hadoop, Spark, and HBase are unfamiliar, yet for distributed computing .NET has something similar: Parallel (I am not talking about HDInsight). In this article I try to explain Spark from the perspective of the Parallel class library on .NET.

Let's start with the most clichéd example in C# (not HelloWorld): counting the frequency of each word that appears in an article.

The following C # code is the frequency of statistical words written using. NET Parallel.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;

namespace WordCountDemo
{
    using System.IO;
    using System.Threading;

    class Program
    {
        /// <summary>
        /// Let's take counting the words in an article as the example
        /// (counting the words in an article is the HelloWorld of big data computing).
        ///
        /// WordCountFlow is the word-counting program.
        /// WordCountDetail takes WordCountFlow apart line by line and explains each step in detail.
        /// </summary>
        /// <param name="args"></param>
        static void Main(string[] args)
        {
            string filePath = @"D:\BigDataSoftware\spark-2.1.0-bin-hadoop2.7\README.md";

            WordCountFlow(filePath);
            Console.WriteLine("----------------------");
            WordCountDetail(filePath);
        }

        /// <summary>
        /// The flow of the word-counting program
        /// </summary>
        /// <param name="filePath"></param>
        static void WordCountFlow(string filePath)
        {
            File.ReadAllLines(filePath).AsParallel()
                .SelectMany(t => t.Split(' '))
                .Select(t => new { word = t, tag = 1 })
                .GroupBy(t => t.word).Select(t => new { word = t.Key, count = t.Select(p => p.tag).Aggregate((a, b) => a + b) })
                // If you are unfamiliar with the Aggregate function, the line above is equivalent to:
                // .GroupBy(t => t.word).Select(t => new { word = t.Key, count = t.Sum(p => p.tag) })
                .ForAll(t => Console.WriteLine($"PartitionId:{Thread.CurrentThread.ManagedThreadId} ({t.word}-{t.count})"));
        }

        /// <summary>
        /// A detailed, line-by-line explanation of the word-counting program
        /// </summary>
        /// <param name="filePath"></param>
        static void WordCountDetail(string filePath)
        {
            // Read the whole article; each line is stored as one string in the array lines.
            string[] lines = File.ReadAllLines(filePath);

            // AsParallel() is the core method of the parallel class library: it splits the
            // string[] lines array into several partitions.
            // Suppose the article has 500 lines; this method might break the string[500] lines into
            // (string[120] partitionA), (string[180] partitionB), (string[150] partitionC), ...
            // The .NET runtime uses a partitioning algorithm based on the program's current load
            // (mainly CPU usage) to decide exactly how many partitions to create.
            // Roughly speaking (this is not precise), expect about one partition per logical CPU core.
            // In the subsequent computation, the runtime requests a separate thread for each partition:
            // e.g. partitionA is processed by thread 001, partitionB by thread 002, ...
            ParallelQuery<string> parallelLines = lines.AsParallel();

            // Each line stored in partitionA, partitionB, partitionC, ... is split into words on spaces;
            // the result is still held in a ParallelQuery<string> structure.
            // The notes below marked with **** assume some knowledge of functional programming and can be skipped.
            // **** If you know some functional programming, you know lambdas are inherently lazy. Set a
            // **** breakpoint on the next line: when the debugger stops there and you hover over parallelWords,
            // **** you will not see any words. The runtime has not actually split the lines into words yet;
            // **** this line only records the computation logic.
            ParallelQuery<string> parallelWords = parallelLines.SelectMany(t => t.Split(' '));

            // Tag each word with the count 1. This line returns ParallelQuery<var>, where var is inferred
            // by the compiler; here the type is effectively:
            // class AnonymousType
            // {
            //     public string word { get; set; }
            //     public int tag { get; set; }
            // }
            var wordPairs = parallelWords.Select(t => new { word = t, tag = 1 });

            // Group by word and sum the tags within each group, similar to the SQL:
            //   SELECT word, COUNT(tag) FROM wordPairs GROUP BY word
            // Note that the same word may be spread across different partitions. Take the common English
            // word "the": there may be 3 "the"s in partitionA and 2 in partitionB, but partitionA and
            // partitionB are processed by different threads. If the runtime is smart enough, it should
            // first count partitionA's (the, 3), then partitionB's (the, 2), and finally merge and
            // re-deal all the partitions (shuffle) before the subsequent computation.
            // After the shuffle, both the partitions and the data inside them differ from before.
            // The type of wordCountPairs here is effectively:
            // class AnonymousType
            // {
            //     public string word { get; set; }
            //     public int count { get; set; }
            // }
            var wordCountPairs = wordPairs.GroupBy(t => t.word)
                .Select(t => new { word = t.Key, count = t.Select(p => p.tag).Aggregate((a, b) => a + b) });

            // Print the results. Because the threads execute in no particular order, the PartitionIds
            // in the output are out of order as well.
            wordCountPairs.ForAll(t => Console.WriteLine($"PartitionId:{Thread.CurrentThread.ManagedThreadId} ({t.word}-{t.count})"));
        }
    }
}

Program Run Results

  

Through this C# example, we have seen how Parallel splits an article into a number of partitions and computes on the different partitions in parallel; during the computation it may be necessary to "shuffle", that is, to re-deal the original partitions.
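To make the partition-and-shuffle idea concrete before moving on to Spark, here is a minimal sketch in plain Scala (no Spark, no real threads). The two hard-coded "partitions" and the countMerged helper are illustrative assumptions for this article, not any library API:

```scala
object ShuffleSketch {
  // Count words inside each partition independently (local aggregation),
  // then merge the per-partition results by key: that regrouping is the "shuffle".
  def countMerged(partitions: Seq[Seq[String]]): Map[String, Int] = {
    val localCounts: Seq[Map[String, Int]] =
      partitions.map(_.groupBy(identity).map { case (w, ws) => (w, ws.size) })
    localCounts
      .flatMap(_.toSeq)                                    // bring partial counts together
      .groupBy(_._1)                                       // regroup by word (the shuffle)
      .map { case (w, pairs) => (w, pairs.map(_._2).sum) } // merge partial counts
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical partitions, as if the runtime had split the article in two.
    val partitionA = Seq("the", "cat", "the")
    val partitionB = Seq("the", "dog")
    println(countMerged(Seq(partitionA, partitionB))) // the -> 3, cat -> 1, dog -> 1 (map order may vary)
  }
}
```

The word "the" appears in both partitions, so neither local count alone is the answer; only after the merge step does the total (the, 3) emerge, which is exactly why a shuffle is unavoidable for GroupBy-style computations.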

Now suppose this program ran on a cluster, with the partitions distributed across different machines: the computation could then harness the power of many machines rather than the threads of a single machine. Yeah, you guessed it, that is Spark. The Scala wordCountFlow function below counts word frequencies on Spark. Like the C# WordCountFlow it is five lines of code, and the logic of those five lines corresponds exactly. The difference is that Spark distributes the data across different machines and lets each machine do its share of the computation; and of course, when a shuffle is needed, the data on different machines is aggregated and re-dealt into new partitions. The partitions in Spark and in .NET Parallel do not correspond exactly (in Spark, one machine may hold multiple partitions), and "shuffle" is Spark-specific terminology, but the basic principle is similar.

package WordCountExample

import org.apache.spark.{SparkConf, SparkContext, TaskContext}

/**
  * Created by stevenchennet on 2017/3/10.
  */
object WordCount {
  def main(args: Array[String]): Unit = {
    // file path
    val filePath = "D:\\BigDataSoftware\\spark-2.1.0-bin-hadoop2.7\\README.md"
    wordCountFlow(filePath)
  }

  def wordCountFlow(filePath: String): Unit = {
    // A SparkContext object is constructed from a SparkConf object.
    // SparkConf carries the main settings; for example, setMaster("local[*]") means
    // open multiple local threads for parallel processing.
    // SparkContext is the core object through which Spark executes tasks.
    // The five lines of code below correspond one to one with the five lines of the C# WordCountFlow.
    new SparkContext(new SparkConf().setAppName("WordCount").setMaster("local[*]"))
      .textFile(filePath)
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
      .foreach(t => println(s"Partition: ${TaskContext.getPartitionId()}  (${t._1} - ${t._2})"))
  }
}

Program Run Results

  

In .NET Parallel, if one thread crashes during the computation, the whole program crashes with it. In a cluster computation, letting the entire cluster fail because of a single outage would be a poor design, so Spark can persist the data to be computed before computing; if one machine crashes, its computing tasks can be pulled over to another machine and recomputed.
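As a toy illustration of that recover-by-recomputing idea, here is a sketch in plain Scala. The Partition class below is made up for this example; Spark's real mechanism (RDD lineage tracked by the scheduler) is far more involved, but the core property is the same: a deterministic computation plus its input can be re-run anywhere:

```scala
object LineageSketch {
  // A "partition" remembers its input and the pure function that produces its
  // result (its lineage), so the result can be rebuilt on any machine.
  final case class Partition[A, B](input: Seq[A], compute: Seq[A] => B) {
    def result: B = compute(input)
  }

  def main(args: Array[String]): Unit = {
    val p = Partition(Seq("the", "the", "dog"), (ws: Seq[String]) => ws.count(_ == "the"))
    val first = p.result       // computed on "machine 1"
    // "machine 1" crashes; another machine re-runs the same lineage:
    val recomputed = p.result
    println(recomputed == first) // deterministic recomputation gives the same answer
  }
}
```

Because compute is a pure function of input, losing a computed result is never fatal: any node holding the lineage can reproduce it.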
