Spark Getting Started Knowledge

1. Building the Spark Development Environment in Java

1.1 JDK Installation

Install the Oracle JDK. I installed JDK 1.7. Create a new system environment variable JAVA_HOME with the value "C:\Program Files\Java\jdk1.7.0_79" (adjust it to match your installation path).

Also add C:\Program Files\Java\jdk1.7.0_79\bin and C:\Program Files\Java\jre7\bin to the system Path variable.

1.2 Spark environment variable configuration

Download the Spark build matching your Hadoop version from http://spark.apache.org/downloads.html. I downloaded spark-1.6.0-bin-hadoop2.6.tgz: the Spark version is 1.6 and the corresponding Hadoop version is 2.6.

Unzip the downloaded file; assume the extracted directory is D:\spark-1.6.0-bin-hadoop2.6. Add D:\spark-1.6.0-bin-hadoop2.6\bin to the system Path variable and create a new SPARK_HOME variable with the value D:\spark-1.6.0-bin-hadoop2.6.

1.3 Hadoop Toolkit Installation

Spark is built on top of Hadoop and calls the relevant Hadoop libraries at runtime. If the Hadoop runtime is not configured, Spark prints related error messages; they do not affect execution, but it is still worth configuring the Hadoop libraries.

1.3.1 Download Hadoop 2.6. I downloaded hadoop-2.6.0.tar.gz.

1.3.2 Unzip the downloaded archive and add its library directory D:\hadoop-2.6.0\bin to the system Path variable. Create a new HADOOP_HOME variable with the value D:\hadoop-2.6.0. Download the Windows build of winutils, add winutils.exe to hadoop-x.x.x/bin, and put hadoop.dll under C:\Windows\System32.
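Note: if the HADOOP_HOME variable is not picked up (for example, before the machine has been restarted), the winutils location can also be set from code at the start of the driver's main method. A minimal sketch, assuming Hadoop was unpacked to D:\hadoop-2.6.0 (adjust the path to your own layout):

// The directory must contain bin\winutils.exe; D:\hadoop-2.6.0 is an assumed path.
System.setProperty("hadoop.home.dir", "D:\\hadoop-2.6.0");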

1.4 Eclipse Environment

Create a new Java project and add spark-assembly-1.6.0-hadoop2.6.0.jar from D:\spark-1.6.0-bin-hadoop2.6\lib to the project's build path.

Notes:

A. After configuring the local environment variables HADOOP_HOME (its bin directory must contain winutils.exe) and SPARK_HOME, restart the machine.

/* SimpleApp.java */
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.*;
import org.apache.spark.api.java.function.Function;

public class SimpleApp {
    public static void main(String[] args) {
        String logFile = "file:///spark-bin-0.9.1/README.md"; // any local or HDFS text file
        SparkConf conf = new SparkConf().setAppName("Spark application in Java");
        JavaSparkContext sc = new JavaSparkContext(conf);
        JavaRDD<String> logData = sc.textFile(logFile).cache();

        long numAs = logData.filter(new Function<String, Boolean>() {
            public Boolean call(String s) { return s.contains("a"); }
        }).count();

        long numBs = logData.filter(new Function<String, Boolean>() {
            public Boolean call(String s) { return s.contains("b"); }
        }).count();

        System.out.println("Lines with a: " + numAs + ", lines with b: " + numBs);
    }
}
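To run this class directly from Eclipse without spark-submit, a master URL has to be set in code. A minimal sketch, using local mode (it only replaces the SparkConf line in the program above; local[*] uses all local cores):

SparkConf conf = new SparkConf()
        .setAppName("Spark application in Java")
        .setMaster("local[*]"); // run inside the Eclipse JVM, using all local cores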

2. Running the Spark demo on TDH

Create a test.txt file, put it under /tmp on HDFS, and execute the following command:

spark-submit \
  --master yarn-cluster \
  --num-executors 2 \
  --driver-memory 1g \
  --executor-memory 1g \
  --executor-cores 1 \
  --class org.apache.spark.examples.JavaWordCount \
  "/usr/lib/discover/lib/spark-examples-1.5.1-hadoop2.5.2-transwarp.jar" \
  "/tmp/test.txt"

3. The difference and connection between client and cluster

When testing with the Spark master set to yarn, the error "Could not parse Master URL: 'yarn'" was reported.

Running on Discover (TDH), there is no spark://host:port standalone master URL and port 7077 cannot be reached; and if the code running on Discover specifies the master as yarn-cluster, a SparkException is thrown.

With open-source Spark, the master URL given in the test code determines where the job runs: a spark://... URL runs it on the specified node, while specifying local runs it on the local machine, i.e., single-machine mode. YARN mode starts 2 executors by default, no matter how many worker nodes you have.

In standalone mode there is one executor per worker, and the number of executors cannot be changed.

A partition is a slice of the dataset in an RDD; by default there are typically 2 partitions.

The number of tasks run on the executors is determined by the number of partitions (for the final result, by the number of partitions in the last stage).
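A minimal sketch of inspecting and changing the partition count from the Java API (the path and the numbers are illustrative):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class PartitionDemo {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("Partition demo").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // textFile takes a minimum partition count; each partition becomes one task in a stage
        JavaRDD<String> lines = sc.textFile("/tmp/test.txt", 4);
        System.out.println("partitions after load: " + lines.partitions().size());

        // repartition changes the partition count, and with it the task count of later stages
        JavaRDD<String> repartitioned = lines.repartition(8);
        System.out.println("partitions after repartition: " + repartitioned.partitions().size());

        sc.stop();
    }
}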

Broadly speaking, yarn-cluster is suitable for production environments, while yarn-client is suitable for interaction and debugging, that is, when you want to see the application's output quickly.

Before we go into the deeper differences between yarn-cluster and yarn-client, we first need to understand a concept: the ApplicationMaster. In YARN, every application instance has an ApplicationMaster process, which is the first container the application starts. It is responsible for dealing with the ResourceManager to request resources, and after obtaining them it tells the NodeManagers to start containers for the application.

At a deeper level, when a Spark application runs in a YARN cluster:

Spark: driver + executors (executors are JVM processes)

YARN: AM + containers (containers are JVM processes)

The difference between yarn-cluster and yarn-client mode is really a difference in the ApplicationMaster process. In yarn-cluster mode, the driver runs inside the AM (ApplicationMaster), which is responsible for requesting resources from YARN and supervising the health of the job; once the user has submitted the job, the client can be shut down and the job keeps running on YARN, which is why yarn-cluster mode is not suitable for interactive jobs. In yarn-client mode, the ApplicationMaster only requests executors from YARN, and the client communicates with the requested containers to schedule their work, which means the client cannot go away.

4. spark-submit parameters

Parameter name and meaning:

--master MASTER_URL: can be spark://host:port, mesos://host:port, yarn, yarn-cluster, yarn-client, or local
--deploy-mode DEPLOY_MODE: where the driver program runs, client or cluster
--class CLASS_NAME: main class name, including the package name
--name NAME: application name
--jars JARS: comma-separated third-party jars the driver depends on
--py-files PY_FILES: comma-separated list of .zip, .egg, and .py files to place on the PYTHONPATH of Python applications
--files FILES: comma-separated list of files to be placed in the working directory of each executor
--properties-file FILE: path of the file to load application properties from, default conf/spark-defaults.conf
--driver-memory MEM: amount of memory used by the driver program
--driver-java-options: extra Java options for the driver program
--driver-library-path: library path of the driver program
--driver-class-path: class path of the driver program
--executor-memory MEM: memory per executor, default 1G
--driver-cores NUM: number of CPU cores used by the driver, Spark standalone mode only
--supervise: whether to restart the driver after a failure, Spark standalone mode only
--total-executor-cores NUM: total number of cores used by all executors, Spark standalone and Spark on Mesos modes only
--executor-cores NUM: number of cores per executor, default 1, Spark on YARN mode only
--queue QUEUE_NAME: the YARN queue the application is submitted to, "default" queue by default, Spark on YARN mode only
--num-executors NUM: number of executors to launch, default 2, Spark on YARN mode only
--archives ARCHIVES: comma-separated archives to be extracted into the working directory of each executor, Spark on YARN mode only

5. Explanation of some parameters and their meanings

Startup Parameters

./bin/spark-submit \
  --master yarn-cluster \
  --num-executors 100 \
  --executor-memory 6G \
  --executor-cores 4 \
  --driver-memory 1G \
  --conf spark.default.parallelism=1000 \
  --conf spark.storage.memoryFraction=0.5 \
  --conf spark.shuffle.memoryFraction=0.3 \

num-executors

Parameter description: this parameter sets the total number of executor processes used to run the Spark job. When the driver requests resources from the YARN cluster manager, YARN starts the corresponding number of executor processes on the cluster's worker nodes according to this setting. The parameter is very important: if it is not set, only a small number of executors are started by default, and the Spark job then runs very slowly.

Parameter tuning recommendation: for a typical Spark job, 50 to 100 executor processes is appropriate; setting too few or too many executors is not good. Too few cannot make full use of the cluster's resources; too many may exceed what the queue can provide.

executor-memory

Parameter description: this parameter sets the memory of each executor process. The executor memory size often directly determines the performance of the Spark job, and it is also directly related to the common JVM OOM exceptions.

Parameter tuning recommendation: 4 GB to 8 GB per executor process is usually appropriate. This is only a reference value; the concrete setting depends on your department's resource queue. Check the maximum memory limit of your team's resource queue: num-executors times executor-memory is the total amount of memory your Spark job requests (that is, the sum across all executor processes), and it must not exceed the queue's maximum. In addition, if you share the resource queue with other people on your team, it is best that the memory you request does not exceed 1/3 to 1/2 of the queue's total, so that your Spark job does not consume all the resources in the queue and prevent your colleagues' jobs from running.

executor-cores

Parameter description: this parameter sets the number of CPU cores of each executor process. It determines how many task threads each executor can execute in parallel. Since each CPU core executes only one task thread at a time, the more CPU cores an executor process has, the faster it can finish all the task threads assigned to it.

Parameter tuning recommendation: setting 1 to 4 CPU cores per executor is appropriate. This also depends on your department's resource queue: check the queue's maximum CPU core limit and, given the number of executors you set, work out how many cores each executor can be assigned. Likewise, if you share the queue with others, it is appropriate that num-executors * executor-cores stays at roughly 1/3 to 1/2 of the queue's total CPU cores, to avoid affecting other people's jobs.

spark.default.parallelism

Parameter description: this parameter sets the default number of tasks for each stage. It is extremely important; if it is not set, it may directly affect your Spark job's performance.

Parameter tuning recommendation: 500 to 1000 default tasks is appropriate for a Spark job. A common mistake is not to set this parameter at all; Spark then derives the number of tasks from the number of blocks in the underlying HDFS files, by default one task per HDFS block. That default is generally on the low side (for example, a few dozen tasks), and if the number of tasks is too small, all of the executor parameters set earlier are wasted: no matter how many executor processes, how much memory, and how many cores you have, with only 1 or 10 tasks perhaps 90% of the executor processes have no task to execute at all, which is a waste of resources. The principle recommended on the Spark website is to set this parameter to 2 to 3 times num-executors * executor-cores; for example, with 300 executor CPU cores in total, setting 1000 tasks is reasonable, and the Spark cluster's resources can then be fully used.
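A minimal sketch of applying that rule of thumb from code instead of with --conf on the command line (the executor numbers are illustrative and mirror the startup parameters above):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class ParallelismConfig {
    public static void main(String[] args) {
        int numExecutors = 100;  // illustrative, matching the startup parameters above
        int executorCores = 4;

        SparkConf conf = new SparkConf()
            .setAppName("Parallelism example")
            .setMaster("local[*]") // local master only so the sketch runs standalone; on YARN it comes from spark-submit
            // roughly 2-3x the total executor cores: 3 x 100 x 4 = 1200 default tasks per stage
            .set("spark.default.parallelism", String.valueOf(3 * numExecutors * executorCores));

        JavaSparkContext sc = new JavaSparkContext(conf);
        System.out.println("spark.default.parallelism = " + sc.getConf().get("spark.default.parallelism"));
        sc.stop();
    }
}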

spark.storage.memoryFraction

Parameter description: this parameter sets the fraction of executor memory that can hold persisted (cached) RDD data; the default is 0.6. That is, by default 60% of an executor's memory can be used for persisted RDD data. Depending on the persistence level you choose, data that does not fit in memory is either not persisted or written to disk.

Parameter tuning recommendation: if the Spark job has many RDD persistence operations, this value can be increased appropriately to make sure the persisted data fits in memory, avoiding the situation where there is not enough memory to cache all of the data and it can only be written to disk, which lowers performance. If, however, the job has many shuffle operations and few persistence operations, reduce this value. In addition, if the job runs slowly because of frequent GC (the job's GC can be observed in the Spark web UI), which means the tasks do not have enough memory to execute user code, it is also advisable to lower this value.

spark.shuffle.memoryFraction

Parameter description: this parameter sets the fraction of executor memory that a task can use for aggregation during shuffle, after pulling the output of the previous stage's tasks; the default is 0.2. That is, by default an executor has only 20% of its memory available for this operation. If the memory used during shuffle aggregation exceeds this 20% limit, the excess data is spilled to disk files, which can greatly degrade performance.

Parameter tuning recommendation: if the Spark job has few RDD persistence operations and many shuffle operations, it is advisable to reduce the memory fraction for persistence and increase the fraction for shuffle, to avoid running out of memory when the amount of shuffled data is large and having to spill to disk, which lowers performance. In addition, if the job runs slowly because of frequent GC, which means the tasks do not have enough memory to execute user code, it is also advisable to lower this value.
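A minimal sketch of shifting executor memory from caching toward shuffle for a shuffle-heavy job with little RDD persistence (the 0.4/0.4 values are illustrative, not a recommendation from the text above):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class ShuffleMemoryConfig {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
            .setAppName("Shuffle memory example")
            .setMaster("local[*]") // local master only so the sketch runs standalone
            // take memory away from caching (default 0.6) and give more to shuffle (default 0.2)
            .set("spark.storage.memoryFraction", "0.4")
            .set("spark.shuffle.memoryFraction", "0.4");

        JavaSparkContext sc = new JavaSparkContext(conf);
        System.out.println("storage fraction = " + sc.getConf().get("spark.storage.memoryFraction"));
        System.out.println("shuffle fraction = " + sc.getConf().get("spark.shuffle.memoryFraction"));
        sc.stop();
    }
}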
