Spark is primarily memory-based, falling back to disk when necessary.
Spark first puts data in memory; whatever does not fit in memory is spilled to disk.
It can therefore compute not only on data that fits in memory, but also on data that does not.
If the data is much larger than memory, consider the data-placement strategy and algorithm tuning, because Spark was originally designed around in-memory processing.
Spark runs at scales from 5~10 machines up to clusters of about 8000 nodes.
Big-data computing problems Spark addresses: interactive queries (the shell, Spark SQL), batch processing, machine learning, and more.
At the bottom sits the RDD (Resilient Distributed Dataset), which supports a wide variety of paradigms on top of it: stream processing, SQL, SparkR, and so on.
========== Spark features ============
To understand Spark, understand the following:
1. Distributed multi-machine computing
Different nodes each process a portion of the data; the nodes process independently of one another, achieving parallelism through distribution.
The cluster manager is responsible for allocating resources to each node; after each node finishes its computation, the results are aggregated for unified output.
2. In-memory + disk computing
For example: 3 million records distributed to 3 machines, say 1 million per machine (the split may not be even); each machine keeps in memory what fits and spills the rest to disk.
3. Iterative computing is the true essence of Spark
A computation is divided into N stages; when one stage ends, the next begins.
A shuffle moves data from one node to another.
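The iteration and shuffle just described can be sketched in spark-shell. This is a minimal sketch, assuming a live SparkContext `sc` (as spark-shell provides); the sample data is made up for illustration:

```scala
// Sketch: an iterative computation in Spark (assumes a live SparkContext `sc`).
// Each pass builds on the previous pass's RDD; reduceByKey moves records with
// the same key between nodes, i.e. a shuffle.
var current = sc.parallelize(Seq(("a", 1.0), ("b", 2.0), ("a", 3.0)))
for (i <- 1 to 3) {
  // reduceByKey repartitions by key: this is the node-to-node shuffle step
  current = current.reduceByKey(_ + _).mapValues(_ * 0.5)
}
current.collect()  // action: triggers the whole chain of stages
```

Each loop iteration only extends the lineage; nothing executes until the `collect()` action at the end.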
========== Development ============
The program we write is distributed by the driver to each machine.
Why are most production programs written in Java?
Because there are more Java developers than Scala developers;
Java integrates more easily with Java EE systems;
follow-up maintenance is more convenient.
Cons: developing Spark in Java is verbose.
The following example is implemented in both Scala and Java.
Development happens on one machine; the job is submitted from another machine.
Data sources Spark can process: Spark workers can read from a variety of sources; besides HDFS and HBase, also Hive, Oracle, MySQL, and others.
Note: Hive is a data warehouse / data engine; Spark SQL can cover much of this functionality, but it does not completely replace Hive.
Output destinations for processed data: HDFS, HBase, Hive, Oracle, S3, or results returned directly to the client, etc.
========== Running ============
Everything is based on the RDD (Resilient Distributed Dataset).
Elasticity, aspect one:
Data is sharded and kept in memory by default; if it does not fit, part of it is spilled to disk. The user does not need to care where the data lives; the RDD switches between memory and disk automatically.
Elasticity, aspect two:
Efficient lineage-based fault tolerance. Suppose a job has 1000 steps and an error occurs at step 901; Spark recomputes from step 900 rather than from the beginning.
Elasticity, aspect three:
If a task fails, it is automatically retried a configured number of times; if it still fails after the retries are exhausted, the task fails.
Elasticity, aspect four:
If a stage fails, it is automatically retried a configured number of times, and only the failed shards (partitions) are recomputed.
Note: a stage is simply a phase of the job.
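The memory-plus-disk elasticity above is what `MEMORY_AND_DISK` persistence expresses. A minimal sketch, assuming a live SparkContext `sc`; the input path is hypothetical:

```scala
import org.apache.spark.storage.StorageLevel

// Sketch: MEMORY_AND_DISK persistence (assumes a live SparkContext `sc`).
// Partitions that fit stay in memory; the rest spill to disk automatically.
// The caller never has to track where each partition lives.
val rdd = sc.textFile("/some/large/input")     // hypothetical path
rdd.persist(StorageLevel.MEMORY_AND_DISK)
rdd.count()   // first action materializes (and persists) the partitions
rdd.count()   // second action reuses the persisted partitions
```

The lineage recorded by the RDD is what lets Spark rebuild any partition (in memory or on disk) that is lost.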
========== When to cache ============
1. The computation is particularly time-consuming.
2. The computation chain is very long.
3. After a shuffle: if a later step fails, recovery can start from the cached result instead of redoing the shuffle.
4. Before a checkpoint: the preceding steps are cached, so if the checkpoint is interrupted, the earlier results are preserved.
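Point 3 above, caching right after a shuffle, can be sketched as follows. This is a sketch, assuming a live SparkContext `sc`; the input path is hypothetical:

```scala
// Sketch: cache immediately after an expensive shuffle
// (assumes a live SparkContext `sc`; the path is hypothetical).
val pairs   = sc.textFile("/some/input")
                .flatMap(_.split(" "))
                .map(w => (w, 1))
val reduced = pairs.reduceByKey(_ + _)  // expensive: causes a shuffle
reduced.cache()                         // keep the post-shuffle result

reduced.count()                         // materializes and caches the data
reduced.filter(_._2 > 10).count()       // reuses the cache; no second shuffle
```

`cache()` is shorthand for `persist` at the memory-only storage level; later actions on `reduced` read the cached partitions instead of repeating the shuffle.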
========== RDD example ============
Besides Hadoop, Spark has its own start-all script.
Start Hadoop with ./start-dfs.sh
Start Spark with ./start-all.sh
http://master:18080 shows information about jobs that have already run (the history server).
From Spark's bin directory: ./spark-shell --master spark://master:7077
val data = sc.textFile("/library/wordcount/input/data"), or sc.textFile("hdfs://master:9000/library/wordcount/input/data")
Spark creates the RDD from the input automatically.
data.toDebugString shows the RDD's lineage (its dependencies).
You can see that the MapPartitionsRDD is partitioned and distributed across different machines.
data.count counts the records (and triggers the computation).
http://master:4040 shows the running job in the web UI.
The data does not move; the code moves to the data, which stays distributed across the machines.
An HDFS block is typically 128 MB; actual partition sizes may differ from block sizes.
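The relationship between partitions and blocks can be inspected directly in spark-shell. A sketch, assuming the same `sc` and input path as the example; the partition count shown depends on the file's block layout:

```scala
// Sketch: inspecting and changing partitioning (assumes SparkContext `sc`).
val data = sc.textFile("/library/wordcount/input/data")  // path from the example
println(data.partitions.length)  // partition count; often ~= number of HDFS blocks
val repart = data.repartition(8) // set an explicit partition count if needed
```

`textFile` also takes an optional second argument, a minimum number of partitions, if you want more splits than the block layout gives you.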
val flatted = data.flatMap(_.split(" "))
This creates a new MapPartitionsRDD.
val mapped = flatted.map(word => (word, 1)) // count 1 per word
val reduced = mapped.reduceByKey(_ + _)
Values with the same key are summed, which produces a shuffle.
reduced.saveAsTextFile("/library/wordcount/input/data/output/onclick4")
In addition:
Even if the path data2222 does not exist, loading is lazy, so sc.textFile will not raise an error; only an action such as data.count will.
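This lazy-loading behavior is easy to demonstrate in spark-shell. A sketch, assuming a live SparkContext `sc`; the bad path is taken from the note above:

```scala
// Sketch: lazy evaluation (assumes a live SparkContext `sc`).
// textFile only records the lineage; nothing is read yet, so a
// nonexistent path does NOT fail here...
val data = sc.textFile("/no/such/path/data2222")

// ...it fails only when an action forces the computation:
// data.count()  // at this point Spark reports the missing input path
```

Transformations (textFile, map, flatMap, reduceByKey) are lazy; only actions (count, collect, saveAsTextFile) trigger execution, and that is when input errors surface.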
Next up: developing, testing, and running programs in Eclipse from both the Java and Scala perspectives.
Homework: write a blog post describing your basic understanding of Spark.
Teacher Liaoliang's contact card:
"China's first person of Spark"
Sina Weibo: http://weibo.com/ilovepains
WeChat public account: DT_Spark
Blog: http://blog.sina.com.cn/ilovepains
Mobile: 18610086859
QQ: 1740415547
Email: [Email protected]
This article is from the "A Flower Proud in the Cold" blog; reprinting declined.
Spark operating principles and RDD parsing (DT Big Data DreamWorks)