Introduction to Monoids and Semigroups with Spark

Source: Internet
Author: User

What on earth is a monoid? The definition:

A monoid (see reference 1; the English name is used below) is a set with an associative binary operation (+) and an identity element i such that, for any x, x + i = i + x = x. Note that, unlike a group, it does not require inverse elements. Equivalently, a monoid is a semigroup with an identity element.

That is not much help on its own. Let's look at some examples and then come back to a simpler definition.

https://blog.safaribooksonline.com/2013/05/15/monoids-for-programmers-a-scala-example/

1. Integers and addition
Associativity => (a + b) + c == a + (b + c) and identity => 0 + n == n + 0 == n
2. Integers and multiplication
Associativity => (a * b) * c == a * (b * c) and identity => 1 * n == n * 1 == n
3. Lists and concatenation
Associativity => List(1,2) ++ (List(3,4) ++ List(5,6)) == (List(1,2) ++ List(3,4)) ++ List(5,6) == List(1,2,3,4,5,6) and identity => List(1) ++ List() == List() ++ List(1) == List(1)
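
As a quick sanity check, the three examples above can be tested on a few sample values. This is only a sketch: `lawsHold` is a hypothetical helper introduced here, and passing on a handful of samples does not prove the laws in general.

```scala
object MonoidLawCheck {
  // Checks associativity and the identity law on the given sample values.
  // `lawsHold` is a hypothetical helper, not part of any library.
  def lawsHold[T](samples: List[T], zero: T)(op: (T, T) => T): Boolean = {
    val associative = (for {
      a <- samples
      b <- samples
      c <- samples
    } yield op(op(a, b), c) == op(a, op(b, c))).forall(identity)
    val hasIdentity = samples.forall(x => op(x, zero) == x && op(zero, x) == x)
    associative && hasIdentity
  }

  val ints = List(-2, 0, 1, 7)
  val additionOk: Boolean = lawsHold(ints, 0)(_ + _)
  val multiplicationOk: Boolean = lawsHold(ints, 1)(_ * _)
  val concatOk: Boolean =
    lawsHold(List(List(1, 2), List(3, 4), List.empty[Int]), List.empty[Int])(_ ++ _)
}
```

All three flags should come out true for these samples.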

This makes it look as if any binary operation forms a monoid. Can we find some counterexamples?

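For example, the average:

avg(10, avg(20, 30)) != avg(avg(10, 20), 30)

The left side is avg(10, 25) = 17.5, while the right side is avg(15, 30) = 22.5.

Subtraction is another: (10 - 5) - 2 = 3, but 10 - (5 - 2) = 7.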
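
The two counterexamples can be checked directly. A minimal sketch, where `avg` is a two-argument helper defined here purely for illustration:

```scala
object NotMonoids {
  // Two-argument average, defined here only for the counterexample.
  def avg(a: Double, b: Double): Double = (a + b) / 2

  val avgRightFirst: Double = avg(10, avg(20, 30)) // avg(10, 25) = 17.5
  val avgLeftFirst: Double = avg(avg(10, 20), 30)  // avg(15, 30) = 22.5

  val subLeftFirst: Int = (10 - 5) - 2  // 3
  val subRightFirst: Int = 10 - (5 - 2) // 7
}
```

In both cases the grouping changes the result, so neither operation is associative.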

A semigroup is the same as a monoid except that it does not need an identity element, so it is more general.
The key is the associativity requirement on the binary operation: it means you can ignore how the calculation is grouped. It also means that the calculation is easy to run in parallel.
A monoid is essentially a protocol that a particular type follows. Does that give us any clues about how to implement monoids and semigroups in Scala?
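
To make the point about grouping and parallelism concrete, here is a small sketch that reduces a list in chunks and then combines the partial results. `chunkedReduce` is a hypothetical helper; it runs sequentially, but each chunk could in principle be handled by a different worker.

```scala
object ChunkedReduce {
  // Reduces each chunk independently, then combines the partial results.
  // With an associative `op` the result matches a plain left-to-right reduce.
  // Assumes `xs` is non-empty and `chunkSize` is positive.
  def chunkedReduce[T](xs: List[T], chunkSize: Int)(op: (T, T) => T): T =
    xs.grouped(chunkSize).map(_.reduce(op)).reduce(op)

  val numbers = List(1, 3, 5, 6, 8, 10, 12)

  // Addition is associative: both strategies agree.
  val sequentialSum: Int = numbers.reduce(_ + _)
  val chunkedSum: Int = chunkedReduce(numbers, 3)(_ + _)

  // Subtraction is not associative: the two strategies disagree.
  val sequentialDiff: Int = numbers.reduce(_ - _)
  val chunkedDiff: Int = chunkedReduce(numbers, 3)(_ - _)
}
```

With addition both values are 45; with subtraction the sequential result is -43 and the chunked result is -7, which is why associativity is the property that matters here.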

Implementing Monoids with Type Classes
It is up to the developer to enforce the associativity rule!

    trait Semigroup[T] {
      def op(a: T, b: T): T
    }

    trait Monoid[T] extends Semigroup[T] {
      def zero: T
    }

Now that we know how to express a monoid in Scala, can we implement IntAdditionMonoid?

    object Monoids {
      implicit object IntAdditionMonoid extends Monoid[Int] {
        def op(a: Int, b: Int): Int = a + b
        def zero: Int = 0
      }
    }

Good. Where can we use it now? Let's see how to use it with methods such as reduce.

    trait Semigroup[T] {
      def op(a: T, b: T): T
    }

    trait Monoid[T] extends Semigroup[T] {
      def zero: T
    }

    implicit object IntAdditionMonoid extends Monoid[Int] {
      def op(a: Int, b: Int): Int = a + b
      def zero: Int = 0
    }

    val listA = List(1, 3, 5, 6)

    // The Monoid instance for A is looked up implicitly.
    def reduceWithMonoid[A](seq: Seq[A])(implicit ev: Monoid[A]): A = {
      seq.reduce(ev.op(_, _))
    }

    println(reduceWithMonoid(listA)) // prints 15
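
The reduce-based version never uses `zero`. As a sketch of where the identity element earns its keep, the variant below folds from `zero`, so it is also defined for an empty sequence. `foldWithMonoid` is a hypothetical name, and the traits are repeated only to keep the snippet self-contained.

```scala
object FoldWithZero {
  trait Semigroup[T] { def op(a: T, b: T): T }
  trait Monoid[T] extends Semigroup[T] { def zero: T }

  implicit object IntAdditionMonoid extends Monoid[Int] {
    def op(a: Int, b: Int): Int = a + b
    def zero: Int = 0
  }

  // Starts from the identity element, so an empty input simply yields `zero`.
  def foldWithMonoid[A](seq: Seq[A])(implicit ev: Monoid[A]): A =
    seq.foldLeft(ev.zero)(ev.op(_, _))

  val total: Int = foldWithMonoid(List(1, 3, 5, 6)) // 15
  val empty: Int = foldWithMonoid(List.empty[Int])  // 0, where reduce would throw
}
```

This is the practical difference between the two abstractions: a semigroup is enough for reduce on non-empty data, while a monoid also gives a sensible answer for empty data.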

Let's define a few more and see how they behave.

    trait Semigroup[T] {
      def op(a: T, b: T): T
    }

    trait Monoid[T] extends Semigroup[T] {
      def zero: T
    }

    implicit object IntAdditionMonoid extends Monoid[Int] {
      def op(a: Int, b: Int): Int = a + b
      def zero: Int = 0
    }

    // We now have to use a class, because type parameters are required:
    // tuples are themselves parameterized types.
    // Our goal is to define functionality for tuples that contain monoid-abiding types.
    class Tuple2Semigroup[A, B](implicit sg1: Semigroup[A], sg2: Semigroup[B])
        extends Semigroup[(A, B)] {
      def op(a: (A, B), b: (A, B)): (A, B) =
        (sg1.op(a._1, b._1), sg2.op(a._2, b._2))
    }

    // We cannot make the class above an implicit class, because that actually does
    // something different (more on this in the aside about the Pimp My Library pattern).
    // Instead we can use another feature of implicits: an implicit method.
    // This function takes the Semigroups of the tuple's element types and
    // returns a Semigroup of the tuple itself.
    implicit def tuple2Semigroup[A, B](implicit sg1: Semigroup[A], sg2: Semigroup[B]): Semigroup[(A, B)] = {
      new Tuple2Semigroup[A, B]()(sg1, sg2)
    }

    val listA = List((1, 2), (3, 4), (5, 2), (6, 9))

    def reduceWithMonoid[A](seq: Seq[A])(implicit ev: Semigroup[A]): A = {
      seq.reduce(ev.op(_, _))
    }

    println(reduceWithMonoid(listA))
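
One use of tuple semigroups ties back to the earlier counterexample: the average is not associative, but a (sum, count) pair is, so the average can be recovered after the reduction. The sketch below assumes its own minimal `Semigroup` trait and hypothetical names (`IntAddition`, `reduceWith`) so that it stands alone.

```scala
object AverageViaTuples {
  trait Semigroup[T] { def op(a: T, b: T): T }

  implicit object IntAddition extends Semigroup[Int] {
    def op(a: Int, b: Int): Int = a + b
  }

  // Combines pairs component-wise using the semigroups of the components.
  implicit def tuple2Semigroup[A, B](implicit sa: Semigroup[A], sb: Semigroup[B]): Semigroup[(A, B)] =
    new Semigroup[(A, B)] {
      def op(x: (A, B), y: (A, B)): (A, B) = (sa.op(x._1, y._1), sb.op(x._2, y._2))
    }

  def reduceWith[A](seq: Seq[A])(implicit ev: Semigroup[A]): A = seq.reduce(ev.op(_, _))

  // Pair every value with a count of 1, then reduce to (sum, count).
  val (sum, count) = reduceWith(List(10, 20, 30).map(v => (v, 1)))
  val average: Double = sum.toDouble / count // 20.0
}
```

Because the pair is combined associatively, the (sum, count) reduction can be split across partitions, and only the final division is done once at the end.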


Notice how the aggregation logic is contained in the monoid definition. In effect we can redefine how a collection is aggregated, which makes the code highly reusable and scalable. Let's look at one more example before we turn to Spark.

Semigroups can also be applied easily to merging two maps: the keys are combined, and the values of keys that appear in both maps are summed.

    trait Semigroup[T] {
      def op(a: T, b: T): T
    }

    trait Monoid[T] extends Semigroup[T] {
      def zero: T
    }

    implicit object IntAdditionMonoid extends Monoid[Int] {
      def op(a: Int, b: Int): Int = a + b
      def zero: Int = 0
    }

    // Here we only need to assume that the values form a semigroup,
    // as the keys are just being combined.
    class MapSemigroup[K, V](implicit sg1: Semigroup[V]) extends Semigroup[Map[K, V]] {
      // We aggregate starting from one of the maps, then loop through the
      // key/value pairs of the other one and combine.
      // This way any keys that don't appear in the looping map are there already,
      // and all keys that appear in both are overwritten with the combined value.
      // Note: aggregate is deprecated since Scala 2.13; sequentially,
      // foldLeft with the first function gives the same result.
      def op(iteratingMap: Map[K, V], startingMap: Map[K, V]): Map[K, V] =
        iteratingMap.aggregate(startingMap)(
          (currentMap: Map[K, V], kv: (K, V)) => {
            val newValue: V = startingMap.get(kv._1).map(v => sg1.op(v, kv._2)).getOrElse(kv._2)
            currentMap + (kv._1 -> newValue)
          },
          // This is the combine part (if done in parallel, there could be two different
          // maps that need to be combined). It assumes that all keys are already combined.
          (mapOne: Map[K, V], mapTwo: Map[K, V]) => mapOne ++ mapTwo
        )
    }

    // As before, we cannot make the class above an implicit class, because that actually
    // does something different (more on this in the aside about the Pimp My Library pattern).
    // Instead we use an implicit method. This function takes a Semigroup of the
    // values and returns a Semigroup of the Map itself.
    implicit def mapSemigroup[K, V](implicit sg1: Semigroup[V]): Semigroup[Map[K, V]] = {
      new MapSemigroup[K, V]()(sg1)
    }

    val mapA = Map("A" -> 1, "B" -> 2, "D" -> 5)
    val mapB = Map("A" -> 3, "C" -> 3, "D" -> 1)
    val mapC = Map("B" -> 10, "D" -> 3)

    def reduceWithMonoid[A](seq: Seq[A])(implicit ev: Semigroup[A]): A = {
      seq.reduce(ev.op(_, _))
    }

    println(reduceWithMonoid(List(mapA, mapB, mapC)))
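
For comparison, the same merge can be written as a plain function with foldLeft. This is a sketch rather than the author's implementation: `mergeMaps` is a hypothetical helper that takes the combining operation directly instead of a type class instance.

```scala
object MergeMaps {
  // Merges two maps, combining the values of shared keys with `op`.
  def mergeMaps[K, V](left: Map[K, V], right: Map[K, V])(op: (V, V) => V): Map[K, V] =
    right.foldLeft(left) { case (acc, (k, v)) =>
      acc + (k -> acc.get(k).map(op(_, v)).getOrElse(v))
    }

  val mapA = Map("A" -> 1, "B" -> 2, "D" -> 5)
  val mapB = Map("A" -> 3, "C" -> 3, "D" -> 1)
  val mapC = Map("B" -> 10, "D" -> 3)

  // A -> 1 + 3, B -> 2 + 10, C -> 3, D -> 5 + 1 + 3
  val merged: Map[String, Int] =
    List(mapA, mapB, mapC).reduce((l, r) => mergeMaps(l, r)(_ + _))
}
```

Here `merged` is Map("A" -> 4, "B" -> 12, "C" -> 3, "D" -> 9). The type class version buys one thing over this helper: the combining operation is chosen from the value type, so callers do not pass it around.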

Pimp My Library example, and why we don't use implicit classes

An implicit class takes a single constructor argument, which is the value to be "pimped". You can then define methods that become "available" on that type as though they were native functionality!

    implicit class PimpedString(s: String) {
      def pimpMyString(): String = s + " is pimped"
    }

    println("My String".pimpMyString())
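
The two ideas can be combined: an implicit class can add a method to any type that has a semigroup instance. The sketch below is an illustration only; `SemigroupOps` and `combine` are hypothetical names, and the trait is redefined so that the snippet stands alone.

```scala
object SemigroupSyntax {
  trait Semigroup[T] { def op(a: T, b: T): T }

  implicit object IntAddition extends Semigroup[Int] {
    def op(a: Int, b: Int): Int = a + b
  }

  // Wraps any value; `combine` only compiles when a Semigroup[T] is in scope.
  implicit class SemigroupOps[T](val self: T) {
    def combine(other: T)(implicit sg: Semigroup[T]): T = sg.op(self, other)
  }

  val one = 1
  val three: Int = one.combine(2)
  val total: Int = List(1, 3, 5, 6).reduce((a, b) => a.combine(b))
}
```

The implicit class supplies method syntax, while the implicit object supplies the behavior. That split is why the semigroup instances earlier are objects and methods rather than implicit classes.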

Spark uses the Pimp My Library pattern to add methods that are available only on specific types of RDDs, such as key-value pair RDDs:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala

    /**
     * Extra functions available on RDDs of (key, value) pairs through an implicit conversion.
     */
    class PairRDDFunctions[K, V](self: RDD[(K, V)])
        (implicit kt: ClassTag[K], vt: ClassTag[V], ord: Ordering[K] = null)
      extends Logging
      with SparkHadoopMapReduceUtil
      with Serializable {}

It may look as though you now need to design your own monoid/semigroup library. Don't worry: Twitter has already done it and made it usable with Spark (meaning that everything is serializable). They have also written it in a way such that it performs very well.
https://github.com/twitter/algebird

Monoids and Spark together

When aggregating RDDs in Spark we have to write a lot of functions. Unfortunately, these functions are largely the same yet hard to write in a generic way. Monoids are one way to achieve that. Here is a practical example:

    import scala.reflect.ClassTag
    import com.twitter.algebird.{HLL, Monoid}

    // BananaTimestamp, checkResetState and the two input streams are defined
    // elsewhere in the author's project and are not shown in this post.

    // This is a call from an aggregation section that updates state with the HyperLogLog object
    val stateUniques = makeModelUniquesTime.updateStateByKey(updateTotalCountState[HLL])

    // This is a call from an aggregation section that updates state with the Long
    val statePV = makeModelCountReduceWithTime.updateStateByKey(updateTotalCountState[Long])

    // This was originally implemented as two methods, one for HLL and one for Long.
    // With monoids we can write a single method that takes care of both cases.
    def updateTotalCountState[U](values: Seq[(BananaTimestamp, U)], state: Option[(BananaTimestamp, U)])
        (implicit monoid: Monoid[U], ct: ClassTag[U]): Option[(BananaTimestamp, U)] = {
      val defaultState: (BananaTimestamp, U) = (null, monoid.zero)
      values match {
        case Nil => Some(state.getOrElse(defaultState))
        case _ =>
          val hdt = values(0)._1
          // The reduction logic is now contained in the monoid definitions as opposed to
          // these functions. We can instead distil this to what it takes to update state.
          val v = values.map { case (_, a) => a }.reduce(monoid.plus)
          val stateReceived = state.getOrElse(defaultState)
          if (checkResetState(stateReceived._1, hdt)) Some((hdt, v))
          else Some((hdt, monoid.plus(v, stateReceived._2)))
      }
    }
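
To see the shape of this update function without Spark or Algebird, here is a stripped-down sketch. It assumes a local `Monoid` trait with `zero` and `plus` (mirroring the method names used above) and leaves out the timestamps and the reset logic; `updateTotal` and `LongAddition` are hypothetical names.

```scala
object UpdateStateSketch {
  // Local stand-in for a monoid type class with `zero` and `plus`.
  trait Monoid[T] {
    def zero: T
    def plus(a: T, b: T): T
  }

  implicit object LongAddition extends Monoid[Long] {
    def zero: Long = 0L
    def plus(a: Long, b: Long): Long = a + b
  }

  // Combines the new batch of values with the previous state, if any.
  def updateTotal[U](values: Seq[U], state: Option[U])(implicit monoid: Monoid[U]): Option[U] = {
    val batch = values.foldLeft(monoid.zero)(monoid.plus(_, _))
    Some(monoid.plus(state.getOrElse(monoid.zero), batch))
  }

  val first: Option[Long] = updateTotal(Seq(1L, 2L, 3L), Option.empty[Long]) // Some(6)
  val second: Option[Long] = updateTotal(Seq(4L), first)                     // Some(10)
  val idle: Option[Long] = updateTotal(Seq.empty[Long], second)              // Some(10)
}
```

Supporting another state type then only requires another implicit instance; `updateTotal` itself does not change, which is the reuse the example above is after.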

Original link: https://thewanderingmonad.wordpress.com/2015/05/17/introduction-to-monoids-and-semigroups-with-spark/

References
1. Semigroup and monoid: http://hongjiang.info/semigroup-and-monoid/
2. https://zh.wikipedia.org/wiki/%E5%B9%BA%E5%8D%8A%E7%BE%A4
3. http://www.ituring.com.cn/article/195776

