1: Spark 1.0.0 property configuration methods
Spark properties control most application settings and can be configured separately for each application.
Spark 1.0.0 provides three ways to configure properties (a combined sketch of all three follows the list below):
- SparkConf
  - SparkConf passes property values directly to the SparkContext;
  - Common properties have dedicated setters, such as setMaster() for the master URL and setAppName() for the application name;
  - Other properties are set as key-value pairs through the set() method, for example set("spark.executor.memory", "1g").
- Command-line parameters
  - Properties are passed as command-line parameters when an application is submitted with spark-submit or spark-shell;
  - This makes it easy to adjust the runtime environment of each application;
  - spark-submit --help or spark-shell --help shows the complete list of options.
- File configuration
  - Property settings are written to a text file as key-value pairs, one setting per line;
  - The default file is conf/spark-defaults.conf; when an application is submitted, spark-submit checks whether this file exists and, if so, loads the property settings it contains;
  - A different file can be specified with the spark-submit parameter --properties-file.
- Priority
  - SparkConf > command-line parameters > file configuration
- Viewing the property configuration
  - The application's web UI (http://<driver>:4040) shows the property configuration and can be used to check that properties are set correctly;
  - Only properties explicitly set through the three methods above are displayed; all other properties take their default values;
  - For most internal control properties, the system already provides reasonable defaults.
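A minimal sketch of the three methods, assuming a hypothetical application name, master URL, and application jar. The SparkConf calls are the first method; the comments show roughly equivalent spark-submit parameters and spark-defaults.conf entries.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Method 1: SparkConf -- properties are passed directly to the SparkContext.
val conf = new SparkConf()
  .setMaster("spark://master:7077")        // common property: cluster manager (placeholder URL)
  .setAppName("MyApp")                     // common property: application name (placeholder)
  .set("spark.executor.memory", "1g")      // any other property via set(key, value)
val sc = new SparkContext(conf)

// Method 2: command-line parameters at submit time, e.g.
//   spark-submit --master spark://master:7077 --name MyApp \
//     --executor-memory 1g my-app.jar
//
// Method 3: file configuration -- one key-value pair per line in
// conf/spark-defaults.conf (or a file passed with --properties-file):
//   spark.master           spark://master:7077
//   spark.app.name         MyApp
//   spark.executor.memory  1g
```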
2: Common properties in Spark 1.0.0
A: Application properties
| Property name | Default | Description |
|---|---|---|
| spark.app.name | None | The name of the application |
| spark.master | None | The cluster manager to connect to |
| spark.executor.memory | 512m | Amount of memory to use per executor |
| spark.serializer | org.apache.spark.serializer.JavaSerializer | Serializer used for network data transfer and caching. The default Java serializer works with any serializable Java object and offers good compatibility, but it is quite slow; when speed matters, org.apache.spark.serializer.KryoSerializer is recommended. Any subclass of org.apache.spark.serializer.Serializer can also be used. |
| spark.kryo.registrator | None | To use the Kryo serializer, create a class that extends KryoRegistrator and set spark.kryo.registrator to point to it. |
| spark.local.dir | /tmp | Directory used for scratch space, holding map output files and RDDs spilled to disk. It should be on a fast local disk; multiple directories on different disks can be listed, separated by commas. Note: from Spark 1.0 onward this property is overridden by the environment variable SPARK_LOCAL_DIRS (standalone, Mesos) or LOCAL_DIRS (YARN) set by the cluster manager. |
| spark.logConf | false | Log the effective SparkConf when the SparkContext starts. |
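As an illustration of spark.serializer and spark.kryo.registrator, here is a minimal sketch; the Person class and the MyRegistrator name are hypothetical stand-ins for application classes.

```scala
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoRegistrator

// Hypothetical application class whose instances will be serialized by Kryo.
case class Person(name: String, age: Int)

// A KryoRegistrator subclass that registers the application's classes with Kryo.
class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo) {
    kryo.register(classOf[Person])
  }
}

// Point spark.serializer at the Kryo serializer and spark.kryo.registrator at the class above.
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "MyRegistrator")
```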
B: Runtime environment
| Property name | Default | Description |
|---|---|---|
| spark.executor.memory | 512m | Amount of memory to allocate per executor process (in the same format as JVM memory strings, e.g. 512m or 2g) |
| spark.executor.extraJavaOptions | None | Extra JVM options to pass to executors. Note that this cannot be used to set Spark properties or heap size. |
| spark.executor.extraClassPath | None | Extra entries to prepend to the executor classpath, mainly for backward compatibility with older versions of Spark. |
| spark.executor.extraLibraryPath | None | Special library path to use when launching the executor JVM. |
| spark.files.userClassPathFirst | false | Whether executors give user-added JARs precedence over Spark's own JARs when loading classes. This can be used to resolve conflicts between Spark's dependencies and the application's dependencies. It is currently an experimental feature. |
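For example, extra JVM options for executors could be supplied as follows (the GC-logging flags are only illustrative); note that heap size still has to go through spark.executor.memory rather than an -Xmx option.

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.executor.memory", "2g")   // heap size: set here, not via -Xmx in extraJavaOptions
  .set("spark.executor.extraJavaOptions",
       "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")  // example GC-logging flags
```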
C: Shuffle operations
| Property name | Default | Description |
|---|---|---|
| spark.shuffle.consolidateFiles | false | If true, intermediate files created during a shuffle are consolidated. Consolidation can improve file-system performance for shuffles with a large number of reduce tasks. Setting this to true is recommended on ext4 or XFS file systems; on ext3, file-system limitations can degrade performance on machines with more than 8 cores. |
| spark.shuffle.spill | true | If true, total memory use is limited by spilling data to disk during shuffles. The spill threshold is specified by spark.shuffle.memoryFraction. |
| spark.shuffle.spill.compress | true | Whether to compress data spilled during shuffles. If enabled, spark.io.compression.codec is used. |
| spark.shuffle.compress | true | Whether to compress map output files, using spark.io.compression.codec. |
| spark.shuffle.file.buffer.kb | 100 | Size in KB of the in-memory buffer for each shuffle file output stream. These buffers reduce the number of disk seeks and system calls made when creating intermediate shuffle files. |
| spark.reducer.maxMbInFlight | 48 | Maximum size (in MB) of map output fetched simultaneously by each reduce task. Since each output requires a buffer to receive it, this represents a fixed memory overhead per reduce task, so keep it small unless memory is plentiful. |
D: Spark UI
| Property name | Default | Description |
|---|---|---|
| spark.ui.port | 4040 | Port of the application's web UI |
| spark.ui.retainedStages | 1000 | Number of stages the web UI retains before garbage collecting |
| spark.ui.killEnabled | true | Allow stages and their corresponding jobs to be killed from the web UI |
| spark.eventLog.enabled | false | Whether to log Spark events, used to reconstruct the web UI after the application has finished. |
| spark.eventLog.compress | false | Whether to compress logged Spark events, if spark.eventLog.enabled is true. |
| spark.eventLog.dir | file:///tmp/spark-events | If spark.eventLog.enabled is true, the base directory in which Spark events are logged. Within this base directory, Spark creates a subdirectory for each application and logs that application's events there. This can be set to an HDFS directory so that a history server can read the history files. |
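For instance, to let a history server rebuild the web UI after the application finishes, event logging might be enabled as follows (the HDFS path is a placeholder):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.eventLog.enabled", "true")                            // record Spark events
  .set("spark.eventLog.compress", "true")                           // optionally compress them
  .set("spark.eventLog.dir", "hdfs://namenode:8020/spark-events")   // placeholder event-log directory
```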
E: Compression and serialization
| Property name | Default | Description |
|---|---|---|
| spark.broadcast.compress | true | Whether to compress broadcast variables before sending them. |
| spark.rdd.compress | false | Whether to compress serialized RDD partitions. This can save substantial space at the cost of some extra CPU time. |
| spark.io.compression.codec | org.apache.spark.io.LZFCompressionCodec | Codec used to compress internal data such as RDD partitions and shuffle output. Spark provides two codecs: org.apache.spark.io.LZFCompressionCodec and org.apache.spark.io.SnappyCompressionCodec. Snappy compresses and decompresses faster, while LZF achieves a better compression ratio. |
| spark.io.compression.snappy.block.size | 32768 | Block size (in bytes) used by the Snappy codec. |
| spark.closure.serializer | org.apache.spark.serializer.JavaSerializer | Serializer used for closures. Currently only the Java serializer is supported. |
| spark.serializer.objectStreamReset | 10000 | When org.apache.spark.serializer.JavaSerializer is used, the serializer caches objects to avoid writing redundant data, which prevents those objects from being garbage collected. Calling reset on the serializer flushes that cache so old objects can be collected. Set this to <= 0 to disable the reset; by default the serializer is reset every 10000 objects. |
| spark.kryo.referenceTracking | true | Whether to track references to the same object when serializing data with Kryo. This is necessary if the object graph contains cycles, and useful if it contains multiple copies of the same object; otherwise it can be disabled to improve performance. |
| spark.kryoserializer.buffer.mb | 2 | Maximum object size that Kryo allows (Kryo creates a buffer at least as large as the largest single object to be serialized). Increase this value if Kryo reports a "buffer limit exceeded" exception. Note that there is one buffer per core on each worker. |
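A short sketch of how spark.rdd.compress relates to serialized storage: the property only affects partitions stored in serialized form, such as MEMORY_ONLY_SER (the application name, master URL, and data are illustrative).

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val conf = new SparkConf()
  .setAppName("compress-demo")          // illustrative name
  .setMaster("local[2]")
  .set("spark.rdd.compress", "true")    // compress serialized RDD partitions
val sc = new SparkContext(conf)

// Only serialized storage levels are affected; plain MEMORY_ONLY is not compressed.
val data = sc.parallelize(1 to 1000000)
data.persist(StorageLevel.MEMORY_ONLY_SER)
data.count()
```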
F: Execution behavior
| Property name | Default | Description |
|---|---|---|
| spark.default.parallelism | Local mode: number of cores on the local machine; Mesos fine-grained mode: 8; otherwise: total number of cores on all executors, or 2, whichever is larger | Default number of tasks to use for shuffle operations in the cluster (groupByKey, reduceByKey, etc.) when the user does not specify one. |
| spark.broadcast.factory | org.apache.spark.broadcast.HttpBroadcastFactory | Broadcast implementation class |
| spark.broadcast.blockSize | 4096 | Block size (in KB) used by TorrentBroadcastFactory. Too large a value reduces parallelism during a broadcast (making it slower); too small a value may hurt BlockManager performance. |
| spark.files.overwrite | false | Whether to overwrite a file added through SparkContext.addFile() when the target file already exists and its contents do not match. |
| spark.files.fetchTimeout | false | Whether to use a communication timeout when fetching files added by the driver through SparkContext.addFile(). |
| spark.storage.memoryFraction | 0.6 | Fraction of the Java heap to use for Spark's memory cache |
| spark.tachyonStore.baseDir | System.getProperty("java.io.tmpdir") | Directory in Tachyon used to store RDDs. The URL of the Tachyon file system is set by spark.tachyonStore.url. Multiple Tachyon directories can be listed, separated by commas. |
| spark.storage.memoryMapThreshold | 8192 | Block size, in bytes, above which Spark memory-maps blocks when reading them from disk. This prevents Spark from memory-mapping very small blocks; in general, memory mapping has high overhead for blocks near or below the page size of the operating system. |
| spark.tachyonStore.url | tachyon://localhost:19998 | URL of the underlying Tachyon file system. |
| spark.cleaner.ttl | Unlimited | How long, in seconds, Spark remembers any metadata (generated stages, generated tasks, and so on). Periodic cleanup ensures that stale metadata is forgotten, which is useful for long-running jobs such as 24/7 Spark Streaming applications. Note that RDDs persisted in memory are also cleared once they expire. |
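As an example of spark.default.parallelism, a shuffle operation such as groupByKey falls back to this value when no partition count is passed explicitly (the property value and data below are illustrative).

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._   // pair-RDD operations such as groupByKey

val conf = new SparkConf()
  .setAppName("parallelism-demo")
  .setMaster("local[4]")
  .set("spark.default.parallelism", "8")   // default number of shuffle tasks
val sc = new SparkContext(conf)

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
pairs.groupByKey().partitions.size        // 8: taken from spark.default.parallelism
pairs.groupByKey(2).partitions.size       // 2: an explicit partition count overrides the default
```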
G: Network Communication
| Property name | Default | Description |
|---|---|---|
| spark.driver.host | Local hostname | Hostname or IP address of the driver. |
| spark.driver.port | Random | Port on which the driver listens. |
| spark.akka.frameSize | 10 | Maximum size, in MB, of messages exchanged between the driver and executors. A larger value allows the driver to receive larger computation results. |
| spark.akka.threads | 4 | Number of actor threads used for communication. Can be increased for drivers with many CPU cores in large clusters. |
| spark.akka.timeout | 100 | Communication timeout between Spark nodes, in seconds. |
| spark.akka.heartbeat.pauses | 600 | This and the following two parameters configure Akka's built-in failure detector, in seconds. If they are hard to set correctly, the failure detector can simply be left disabled; it is usually enabled only for special requirements. A sensitive failure detector can help locate misbehaving executors, but it is of little use during GC pauses or network delays, and enabling it causes frequent heartbeat exchanges that can flood the network. This parameter sets the acceptable heartbeat pause time. |
| spark.akka.failure-detector.threshold | 300.0 | Corresponds to Akka's akka.remote.transport-failure-detector.threshold |
| spark.akka.heartbeat.interval | 1000 | Heartbeat interval |
H: Scheduling
| Property name | Default | Description |
|---|---|---|
| spark.task.cpus | 1 | Number of cores allocated to each task. |
| spark.task.maxFailures | 4 | Number of times a task may fail before the job gives up on it. Must be greater than or equal to 1. |
| spark.scheduler.mode | FIFO | Scheduling mode between jobs submitted to the same SparkContext. FAIR mode is useful for multi-user services. |
| spark.cores.max | Not set | When an application runs on a standalone cluster, or on a Mesos cluster in coarse-grained sharing mode, the maximum number of CPU cores the application requests from the cluster as a whole (not from each machine). If not set, a standalone cluster uses the value of spark.deploy.defaultCores and Mesos uses all available cores in the cluster. |
| spark.mesos.coarse | false | If true, use coarse-grained sharing mode when running on a Mesos cluster. |
| spark.speculation | false | This and the following parameters relate to Spark's speculative execution mechanism. If set to true, Spark speculatively re-launches slow tasks of a stage on other nodes and takes the result of whichever copy finishes first. |
| spark.speculation.interval | 100 | How often, in milliseconds, Spark checks task progress for speculation. |
| spark.speculation.quantile | 0.75 | Fraction of tasks in a stage that must be complete before speculation is enabled. |
| spark.speculation.multiplier | 1.5 | How many times slower than the median completed task a task must be before speculation is triggered for it |
| spark.locality.wait | 3000 | This and the following parameters relate to data locality. How long, in milliseconds, to wait to launch a data-local task before falling back to the next locality level. The same wait applies to each locality level (process-local > node-local > rack-local > any); per-level waits can also be set with spark.locality.wait.node and the related parameters below. |
| spark.locality.wait.process | spark.locality.wait | Wait time for process-local tasks |
| spark.locality.wait.node | spark.locality.wait | Wait time for node-local tasks |
| spark.locality.wait.rack | spark.locality.wait | Wait time for rack-local tasks |
| spark.scheduler.revive.interval | 1000 | Maximum interval, in milliseconds, at which a task re-requests resources after being held back. This happens when a task cannot obtain local resources because they were allocated to other tasks; if enough resources become available again within the wait time, the computation continues. |
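A sketch combining the scheduling properties above: FAIR scheduling between jobs of a single SparkContext plus speculative re-execution of slow tasks (the three threshold values simply restate the defaults).

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.scheduler.mode", "FAIR")         // fair sharing between jobs of this SparkContext
  .set("spark.speculation", "true")            // re-launch slow tasks speculatively
  .set("spark.speculation.interval", "100")    // check task progress every 100 ms (default)
  .set("spark.speculation.quantile", "0.75")   // wait until 75% of the stage's tasks have finished (default)
  .set("spark.speculation.multiplier", "1.5")  // a task 1.5x slower than the median is a candidate (default)
```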
I: Security
| Property name | Default | Description |
|---|---|---|
| spark.authenticate | false | Whether Spark authenticates its internal connections. |
| spark.authenticate.secret | None | Secret key used for authentication between Spark components. It must be set if spark.authenticate is true and Spark is not running on YARN. |
| spark.core.connection.auth.wait.timeout | 30 | Number of seconds a connection waits for authentication to complete before timing out. |
| spark.ui.filters | None | Comma-separated list of filter class names to apply to the Spark web UI. The filters must be standard javax servlet Filters. Parameters for each filter can be specified with Java system properties of the form spark.<class name of filter>.params='param1=value1,param2=value2', for example: -Dspark.ui.filters=com.test.filter1 -Dspark.com.test.filter1.params='param1=foo,param2=testing' |
| spark.ui.acls.enable | false | Whether access control is enabled for the Spark web UI. If enabled, the system checks whether a user has permission when the web interface is viewed. |
| spark.ui.view.acls | Empty | Comma-separated list of users allowed to view the Spark web UI. By default only the user who started the Spark job has access. |
J: Spark Streaming
| Property name | Default | Description |
|---|---|---|
| spark.streaming.blockInterval | 200 | Interval, in milliseconds, at which Spark Streaming receivers coalesce received data into blocks before storing them in Spark. |
| spark.streaming.unpersist | true | If true, persisted RDDs generated by Spark Streaming are forcibly removed from Spark's memory, and the raw input data received by Spark Streaming is cleaned up automatically as well. If false, the raw input data and persisted RDDs remain accessible to external streaming applications, because they are not cleaned up automatically. |
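In a Spark Streaming application these properties go on the same SparkConf that is handed to the StreamingContext; the batch interval and block interval below are illustrative.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("streaming-demo")
  .setMaster("local[2]")
  .set("spark.streaming.blockInterval", "200")  // group received data into 200 ms blocks
  .set("spark.streaming.unpersist", "true")     // let Spark clean up old RDDs and raw input automatically

// With a 2-second batch, each receiver produces 2000 ms / 200 ms = 10 blocks per batch.
val ssc = new StreamingContext(conf, Seconds(2))
```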
3: Cluster-specific properties
A: Standalone-specific properties
In standalone mode, properties can also be set through the environment-variable file conf/spark-env.sh. The relevant configuration items are:
- SPARK_MASTER_OPTS: properties used by the master
- SPARK_WORKER_OPTS: properties used by workers
- SPARK_DAEMON_JAVA_OPTS: properties used by both the master and workers
They are configured with statements of the form:
export SPARK_MASTER_OPTS="-Dx1=y1 -Dx2=y2"
# where x1, x2 are property names and y1, y2 are the corresponding values
SPARK_MASTER_OPTS supports the following properties:
| Property name | Default | Description |
|---|---|---|
| spark.deploy.spreadOut | true | Whether the standalone cluster manager spreads applications across nodes or consolidates them onto as few nodes as possible. Spreading out gives better data locality; consolidating is more efficient for compute-intensive workloads. |
| spark.deploy.defaultCores | Unlimited | Maximum number of cores the standalone cluster assigns to an application when spark.cores.max is not set. If left unset, applications get all available cores. On a shared cluster, set this to a low value to prevent users from grabbing all cores by default and affecting others. |
| spark.worker.timeout | 60 | Number of seconds after which the master considers a worker lost because no heartbeat was received. |
SPARK_WORKER_OPTS supports the following properties:
| Property name | Default | Description |
|---|---|---|
| spark.worker.cleanup.enabled | false | Whether to periodically clean up the worker's application work directories. This applies only to standalone mode, not YARN, and only directories of applications that are no longer running are cleaned. |
| spark.worker.cleanup.interval | 1800 | Interval, in seconds, at which the worker cleans up old application work directories on the local machine. |
| spark.worker.cleanup.appDataTtl | 7*24*3600 | How long the worker retains each application's work directory. Set this according to available disk space (the directories contain application logs and application JARs) and how frequently applications are submitted. |
SPARK_DAEMON_JAVA_OPTS supports the following properties:
| Property name | Description |
|---|---|
| spark.deploy.recoveryMode | This and the following two parameters configure master HA via ZooKeeper; set to ZOOKEEPER to enable standby-master recovery. Default: NONE. |
| spark.deploy.zookeeper.url | ZooKeeper cluster URL |
| spark.deploy.zookeeper.dir | ZooKeeper directory in which recovery state is stored. Default: /spark. |
| spark.deploy.recoveryMode | Set to FILESYSTEM to enable single-node master recovery mode. Default: NONE. |
| spark.deploy.recoveryDirectory | Directory in which Spark stores recovery state |
B: YARN-specific properties
YARN-specific properties can be configured either through SparkConf or through the conf/spark-defaults.conf file.
| Property name | Default | Description |
|---|---|---|
| spark.yarn.applicationMaster.waitTries | 10 | Number of times the ApplicationMaster waits for the Spark master, and likewise for the SparkContext to be initialized. Startup fails if this is exceeded. |
| spark.yarn.submit.file.replication | 3 | HDFS replication factor for files the application uploads to HDFS |
| spark.yarn.preserve.staging.files | false | If true, staging files are preserved rather than deleted when the job finishes. |
| spark.yarn.scheduler.heartbeat.interval-ms | 5000 | Interval, in milliseconds, at which the Spark ApplicationMaster sends heartbeats to the YARN ResourceManager |
| spark.yarn.max.executor.failures | 2 times the number of executors | Maximum number of executor failures before the application is declared failed |
| spark.yarn.historyServer.address | None | Address of the Spark history server (it should not contain a scheme such as http://). The address is passed to the YARN ResourceManager when the application finishes, linking the RM UI to the history server UI. |