Introduction to Big Data with Apache Spark Course Summary

Source: Internet
Author: User
Tags: pyspark

Main contents of the course: 1. Spark lab environment setup 2. The four labs 3. Common functions 4. Variable sharing
1. Spark Lab Environment Setup (Windows)

A. Download and install VirtualBox

Run as Administrator. The course requires the latest version, 4.3.28; if you run into the problem that the virtual machine cannot be opened, version 4.2.12 works just as well.

B. Download and install Vagrant, then restart

Run as Administrator

C. Download the virtual machine

C1. Add Vagrant to the Path, e.g. d:\hashicorp\vagrant\bin

C2. Create a directory for virtual machine storage, such as myvagrant

C3. Download mooc-setup-master.zip, unzip it, and copy the Vagrantfile into myvagrant

C4. Open the VirtualBox GUI, open a CMD window, cd to myvagrant, and run the command vagrant up

This starts downloading the virtual machine and then opens it. If the download completes but an error occurs when opening the virtual machine, you can try opening it from the VirtualBox GUI; if the error persists, try VirtualBox 4.2.12.

Instructions for use: I. Start/stop the virtual machine: open the VirtualBox interface, then in CMD cd into myvagrant.

vagrant up starts the virtual machine; vagrant halt shuts it down.

II. IPython Notebook: open http://localhost:8001 in a browser

To stop a running notebook, click Running, then stop it.

Click a .py file to run the notebook.

III. Download an SSH client and log in to the virtual machine at address 127.0.0.1, port 2222, username vagrant, password vagrant.

After logging in, type pyspark to open the PySpark interactive shell.
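
Once inside, the predefined SparkContext sc can be used directly. A minimal sketch of a first command (the sample data here is made up for illustration):

>>> sc.parallelize([1, 2, 3, 4]).map(lambda x: x * 2).collect()
[2, 4, 6, 8]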

3. Common functions


The life cycle of an RDD in Spark (a worked example follows these steps):

Create the RDD (parallelize, textFile, etc.)

Transform the RDD

(A new RDD is created; the original RDD is not changed. The main transformations are:

1. Operate on each element: map, flatMap, mapValues

2. Filter: filter

3. Sort: sortBy

4. Combine results by key: reduceByKey, groupByKey

5. Combine two RDDs: union, join, leftOuterJoin, rightOuterJoin)

The RDDs in the steps above are only the equivalent of an operating manual (an execution plan); no data is actually materialized in memory yet. This is called lazy evaluation.


Cache the RDD in memory with cache(); to check whether an RDD is cached, access its .is_cached property.

Trigger evaluation with actions, including top, take, takeOrdered, takeSample, sum, count, distinct, reduce, collect, and collectAsMap.
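
Putting the steps together, a minimal sketch of the full life cycle (assuming the shell's predefined sc; the word-count data is made up for illustration):

>>> lines = sc.parallelize(["a b", "b c", "c d"])                     # create
>>> pairs = lines.flatMap(lambda l: l.split()).map(lambda w: (w, 1))  # transform (lazy)
>>> counts = pairs.reduceByKey(lambda x, y: x + y)                    # still lazy, nothing computed
>>> counts = counts.cache()                                           # mark for in-memory caching
>>> counts.is_cached
True
>>> sorted(counts.collect())                                          # the action triggers evaluation
[('a', 1), ('b', 2), ('c', 2), ('d', 1)]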


4. Variable sharing

Spark has two different ways to share variables:

A. Broadcast variables: after a variable is broadcast, each worker keeps a copy of it, but it can only be read, not modified.

>>> b = sc.broadcast([1, 2, 3, 4, 5])
>>> sc.parallelize([0, 0]).flatMap(lambda x: b.value).collect()
[1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
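
A typical use (a sketch, not from the course) is broadcasting a small lookup table so that every task can read it without it being shipped inside each closure:

>>> lookup = sc.broadcast({"a": 1, "b": 2, "c": 3})
>>> sc.parallelize(["a", "b", "c"]).map(lambda k: lookup.value[k]).collect()
[1, 2, 3]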

B. Accumulators: workers can only write to (add to) them; they cannot be read on the workers.

If the accumulator is just a scalar, it is easy to use:

>>> a = sc.accumulator(0)            # create the accumulator (this line was missing from the notes)
>>> rdd = sc.parallelize([1, 2, 3])
>>> def f(x):
...     global a
...     a += x
>>> rdd.foreach(f)
>>> a.value
6
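
A common pattern (a sketch, not from the course) is using an accumulator to count malformed records as a side effect of a transformation; the count is only updated once an action runs:

>>> bad = sc.accumulator(0)
>>> def parse(s):
...     global bad
...     try:
...         return [int(s)]
...     except ValueError:
...         bad += 1           # record the bad input, emit nothing
...         return []
>>> sc.parallelize(["1", "x", "3"]).flatMap(parse).collect()
[1, 3]
>>> bad.value
1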

If the accumulator is a vector, you need to define an AccumulatorParam that implements both the zero method and addInPlace:

>>> from pyspark.accumulators import AccumulatorParam
>>> class VectorAccumulatorParam(AccumulatorParam):
...     def zero(self, value):
...         return [0.0] * len(value)
...     def addInPlace(self, val1, val2):
...         for i in xrange(len(val1)):
...             val1[i] += val2[i]
...         return val1
>>> va = sc.accumulator([1.0, 2.0, 3.0], VectorAccumulatorParam())
>>> va.value
[1.0, 2.0, 3.0]
>>> def g(x):
...     global va
...     va += [x] * 3
>>> rdd.foreach(g)
>>> va.value
[7.0, 8.0, 9.0]



