Main contents of the course: 1. Spark lab environment setup; 2. Contents of the 4 labs; 3. Common functions; 4. Variable sharing
1. Spark Lab Environment Setup (Windows)
A. Download and install VirtualBox
Run the installer as Administrator. The course requires the latest version, 4.3.28; if the virtual machine fails to open under it, you can use 4.2.12 instead, which does not affect the course.
B. Download and install Vagrant, then restart
Run the installer as Administrator
C. Download the virtual machine
C1. Add Vagrant to PATH, e.g. d:\hashicorp\vagrant\bin
C2. Create a directory to hold the virtual machine, e.g. myvagrant
C3. Download mooc-setup-master.zip, unzip it, and copy the Vagrantfile into myvagrant
C4. Open the VirtualBox GUI, open CMD, cd into myvagrant, and run the command vagrant up
This downloads the virtual machine and starts it (a sample run is sketched below). If the download completes but the virtual machine fails to start, try opening it from the VirtualBox GUI; if that still errors, try VirtualBox 4.2.12.
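For example, a typical first run from the Windows command prompt looks like this (assuming the directory is d:\myvagrant; the prompts are illustrative):

d:\> cd myvagrant
d:\myvagrant> vagrant up
(on the first run this downloads the virtual machine image, then boots it)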
Instructions for use:
i. Starting and stopping the virtual machine: open a command prompt and cd into myvagrant; vagrant up starts the virtual machine, vagrant halt shuts it down
ii. IPython Notebook: browse to http://localhost:8001
To stop a running notebook, click Running, then Stop
Click a .py file to open and run it as a notebook
iii. Download an SSH client and log in to the virtual machine at address 127.0.0.1, port 2222, with username vagrant and password vagrant
Once logged in, type pyspark to enter the PySpark interactive shell (a sample session is sketched below)
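For example, a minimal first session (the hostname in the prompt and the SparkContext output are illustrative):

$ ssh -p 2222 vagrant@127.0.0.1      # password: vagrant
vagrant@vm:~$ pyspark
>>> sc                               # the SparkContext the shell creates for you
<pyspark.context.SparkContext object at 0x...>
>>> sc.parallelize([1, 2, 3]).sum()  # quick sanity check
6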
3. Common functions
The life cycle of an RDD in Spark:
Create the RDD (parallelize, textFile, etc.)
Transform the RDD
(this creates a new RDD and does not change the original RDD)
1. Operate on each element: map, flatMap, mapValues
2. Filter: filter
3. Sort: sortBy
4. Combine results by key: reduceByKey, groupByKey
5. Combine two RDDs: union, join, leftOuterJoin, rightOuterJoin
The RDD produced by the above steps is only an execution plan, like an operating manual; it does not actually produce data in memory. This is called lazy evaluation.
Cache the RDD in memory with cache(); to check whether an RDD is cached, read its .is_cached property.
Trigger evaluation with an action (including top, take, takeOrdered, takeSample, sum, count, distinct, reduce, collect, collectAsMap); see the sketch below.
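A minimal end-to-end sketch of this life cycle in the PySpark shell (the data and variable names are made up for illustration):

lines = sc.parallelize(["a b", "b c", "c"])     # 1. create an RDD
words = lines.flatMap(lambda l: l.split())      # 2. transformations: lazy,
pairs = words.map(lambda w: (w, 1))             #    nothing is computed yet
counts = pairs.reduceByKey(lambda x, y: x + y)
counts.cache()                                  # 3. mark for the in-memory cache
print(counts.is_cached)                         # True
print(counts.collect())                         # 4. action: triggers evaluation,
                                                #    e.g. [('a', 1), ('b', 2), ('c', 2)]

Until collect() runs, Spark has only recorded the plan; the cached data appears in memory the first time an action evaluates it.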
4. Variable sharing
Spark has two different ways to share variables:
A. Broadcast variables: after broadcast, each worker keeps one copy of the variable, which can be read but not modified
>>> b = sc.broadcast([1, 2, 3, 4, 5])
>>> sc.parallelize([0, 0]).flatMap(lambda x: b.value).collect()
[1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
B. Accumulators: workers can only add to an accumulator; its value cannot be read on a worker
If the accumulator is a scalar, it is easy to use:
>>> a = sc.accumulator(0)
>>> rdd = sc.parallelize([1, 2, 3])
>>> def f(x):
...     global a
...     a += x
>>> rdd.foreach(f)
>>> a.value
6
If the accumulator is a vector, you need to define an AccumulatorParam, implementing both the zero method and addInPlace:
>>> from pyspark.accumulators import AccumulatorParam
>>> class VectorAccumulatorParam(AccumulatorParam):
...     def zero(self, value):
...         return [0.0] * len(value)
...     def addInPlace(self, val1, val2):
...         for i in xrange(len(val1)):
...             val1[i] += val2[i]
...         return val1
>>> va = sc.accumulator([1.0, 2.0, 3.0], VectorAccumulatorParam())
>>> va.value
[1.0, 2.0, 3.0]
>>> def g(x):
...     global va
...     va += [x] * 3
>>> rdd.foreach(g)
>>> va.value
[7.0, 8.0, 9.0]
Introduction to Big Data with Apache Spark Course Summary