Main contents of the course: 1. Spark lab environment setup; 2. Contents of the 4 labs; 3. Common functions; 4. Variable sharing
1. Spark Lab Environment Setup (Windows)
A. Download and install VirtualBox
Run the installer as Administrator. The course requires the latest version, 4.3.28; if the virtual machine fails to open under it, you can use 4.2.12 instead, which does not affect the course.
B. Download and install Vagrant, then restart
Run the installer as Administrator
C. Download the virtual machine
C1. Add Vagrant to PATH, e.g. d:\hashicorp\vagrant\bin
C2. Create a directory to hold the virtual machine, e.g. myvagrant
C3. Download mooc-setup-master.zip, unzip it, and copy the Vagrantfile into myvagrant
C4. Open the VirtualBox GUI, open CMD, cd into myvagrant, and run the command vagrant up
This downloads the virtual machine and starts it (a sample run is sketched below). If the download completes but the virtual machine fails to start, try opening it from the VirtualBox GUI; if that still errors, try VirtualBox 4.2.12.
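For example, a typical first run from the Windows command prompt looks like this (assuming the directory is d:\myvagrant; the prompts are illustrative):

d:\> cd myvagrant
d:\myvagrant> vagrant up
(on the first run this downloads the virtual machine image, then boots it)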
Instructions for use:
i. Starting and stopping the virtual machine: open a command prompt and cd into myvagrant; vagrant up starts the virtual machine, vagrant halt shuts it down
ii. IPython Notebook: browse to http://localhost:8001
To stop a running notebook, click Running, then Stop
Click a .py file to open and run it as a notebook
iii. Download an SSH client and log in to the virtual machine at address 127.0.0.1, port 2222, with username vagrant and password vagrant
Once logged in, type pyspark to enter the PySpark interactive shell (a sample session is sketched below)
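For example, a minimal first session (the hostname in the prompt and the SparkContext output are illustrative):

$ ssh -p 2222 vagrant@127.0.0.1      # password: vagrant
vagrant@vm:~$ pyspark
>>> sc                               # the SparkContext the shell creates for you
<pyspark.context.SparkContext object at 0x...>
>>> sc.parallelize([1, 2, 3]).sum()  # quick sanity check
6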
3. Common functions
The life cycle of an RDD in Spark:
Create the RDD (parallelize, textFile, etc.)
Transform the RDD
(this creates a new RDD and does not change the original RDD)
1. Operate on each element: map, flatMap, mapValues
2. Filter: filter
3. Sort: sortBy
4. Combine results by key: reduceByKey, groupByKey
5. Combine two RDDs: union, join, leftOuterJoin, rightOuterJoin
The RDD produced by the above steps is only an execution plan, like an operating manual; it does not actually produce data in memory. This is called lazy evaluation.
Cache the RDD in memory with cache(); to check whether an RDD is cached, read its .is_cached property.
Trigger evaluation with an action (including top, take, takeOrdered, takeSample, sum, count, distinct, reduce, collect, collectAsMap); see the sketch below.
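A minimal end-to-end sketch of this life cycle in the PySpark shell (the data and variable names are made up for illustration):

lines = sc.parallelize(["a b", "b c", "c"])     # 1. create an RDD
words = lines.flatMap(lambda l: l.split())      # 2. transformations: lazy,
pairs = words.map(lambda w: (w, 1))             #    nothing is computed yet
counts = pairs.reduceByKey(lambda x, y: x + y)
counts.cache()                                  # 3. mark for the in-memory cache
print(counts.is_cached)                         # True
print(counts.collect())                         # 4. action: triggers evaluation,
                                                #    e.g. [('a', 1), ('b', 2), ('c', 2)]

Until collect() runs, Spark has only recorded the plan; the cached data appears in memory the first time an action evaluates it.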
4. Variable sharing
Spark has two different ways to share variables:
A. Broadcast variables: after broadcast, each worker keeps one copy of the variable, which can be read but not modified
>>> b = sc.broadcast([1, 2, 3, 4, 5])
>>> sc.parallelize([0, 0]).flatMap(lambda x: b.value).collect()
[1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
B. Accumulators: workers can only add to an accumulator; its value cannot be read on a worker
If the accumulator is a scalar, it is easy to use:
>>> a = sc.accumulator(0)
>>> rdd = sc.parallelize([1, 2, 3])
>>> def f(x):
...     global a
...     a += x
>>> rdd.foreach(f)
>>> a.value
6
If the accumulator is a vector, you need to define an AccumulatorParam, implementing both the zero method and addInPlace:
>>> from pyspark.accumulators import AccumulatorParam
>>> class VectorAccumulatorParam(AccumulatorParam):
...     def zero(self, value):
...         return [0.0] * len(value)
...     def addInPlace(self, val1, val2):
...         for i in xrange(len(val1)):
...             val1[i] += val2[i]
...         return val1
>>> va = sc.accumulator([1.0, 2.0, 3.0], VectorAccumulatorParam())
>>> va.value
[1.0, 2.0, 3.0]
>>> def g(x):
...     global va
...     va += [x] * 3
>>> rdd.foreach(g)
>>> va.value
[7.0, 8.0, 9.0]
Introduction to Big Data with Apache Spark Course Summary