Tonight I listened to Liaoliang's seventh lesson, "Spark Operating Principles and RDD Decryption". The after-class assignment was on Spark fundamentals; my summary is as follows:
1 Spark is a distributed, memory-based computing framework that is particularly well suited to iterative computation.
2 MapReduce has only two stages, map and reduce, whereas Spark can chain an arbitrary number of iterative steps, making it more flexible, more powerful, and better suited to building complex algorithms.
3 Spark does not replace Hive. Hive remains the data warehouse storage layer; Spark SQL only replaces Hive's compute engine.
4 Spark's intermediate data can be kept in memory or spilled to disk.
5 A partition is a subset of an RDD's data, the unit that gets distributed across machines.
6 Note: beginners should test after every few steps; otherwise, when something goes wrong, you will not know where the error is.
7 `val data = sc.textFile("/user")` does not need the `hdfs://` prefix; the file system is inferred from the context (the configured default).
8 Reading a file yields a HadoopRDD; stripping the file offsets yields a MapPartitionsRDD, i.e. a series of data shards distributed across different machines.
9 Move computation to the data instead of moving data to the computation.
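Points 7 and 8 above can be sketched in a few lines of Scala. This is a minimal local-mode sketch, not course code: the app name and input path are placeholders, and it assumes a Spark environment is available. `toDebugString` prints the lineage, which shows the MapPartitionsRDD sitting on top of the HadoopRDD that the lesson mentions.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddLineageSketch {
  def main(args: Array[String]): Unit = {
    // Local mode for experimenting; in production the master
    // is supplied by spark-submit instead.
    val conf = new SparkConf().setAppName("rdd-lineage").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // textFile infers the file system from the default configuration,
    // so an explicit hdfs:// prefix is optional (point 7).
    val data = sc.textFile("/user")

    // The lineage shows a MapPartitionsRDD built on a HadoopRDD (point 8).
    println(data.toDebugString)

    // The RDD is split into partitions distributed across machines (point 5).
    println(s"partitions: ${data.getNumPartitions}")

    sc.stop()
  }
}
```

Running this and printing the lineage after each transformation is also a practical way to follow point 6: test every few steps so you can see exactly where things go wrong.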
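To illustrate why Spark suits iterative computation better than two-stage MapReduce (points 1, 2, and 4), here is a hypothetical sketch, not from the lesson: the input path, the derived values, and the number of passes are all made up for illustration. Persisting the intermediate RDD keeps it in memory (spilling to disk if needed), so each pass reuses cached partitions instead of re-reading the input.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.storage.StorageLevel

object IterativeSketch {
  // Hypothetical iterative job: repeatedly refine an estimate over the same data.
  def run(sc: SparkContext): Double = {
    val values = sc.textFile("/user").map(_.length.toDouble)

    // Intermediate data can live in memory or on disk (point 4);
    // MEMORY_AND_DISK caches in memory and spills what does not fit.
    values.persist(StorageLevel.MEMORY_AND_DISK)

    var estimate = 0.0
    for (_ <- 1 to 10) {
      // Each pass reuses the cached partitions rather than launching a fresh
      // read-map cycle from HDFS, which is what makes Spark friendlier to
      // iterative algorithms than MapReduce's fixed map-then-reduce shape.
      estimate = values.map(v => v - estimate).mean()
    }

    values.unpersist()
    estimate
  }
}
```

In MapReduce the same loop would write each iteration's output back to HDFS and re-read it in the next job, which is exactly the overhead the in-memory model avoids.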
In addition, teacher Liaoliang passed along a message:
Writing Spark in Java: there is a larger talent pool, it integrates more easily with Java EE, and it is easier to maintain, so all examples in the later classes will be given in both Scala and Java.
For follow-up courses, see Liaoliang's Sina Weibo, Liaoliang_dt Big Data Dream Factory: http://weibo.com/ilovepains
Liaoliang, China's leading Spark expert; public account: Dt_spark
Spark3000 disciple: summary of the seventh lesson, "Spark Operating Principles and RDD Decryption"