1: Spark 1.0.0 property configuration methods
Spark properties control most application settings and can be configured separately for each application.
Spark 1.0.0 provides three ways to configure properties (a combined sketch of all three follows the list below):
- SparkConf
  - SparkConf passes property values directly to the SparkContext;
  - Common properties have dedicated setters, such as setMaster() for the master URL and setAppName() for the application name;
  - Other properties are set as key-value pairs through the set() method, for example set("spark.executor.memory", "1g").
- Command-line parameters
  - Properties are passed as command-line parameters when an application is submitted with spark-submit or spark-shell;
  - This makes it easy to adjust the runtime environment of each application;
  - spark-submit --help or spark-shell --help shows the complete list of options.
- File configuration
  - Property settings are written to a text file as key-value pairs, one setting per line;
  - The default file is conf/spark-defaults.conf; when an application is submitted, spark-submit checks whether this file exists and, if so, loads the property settings it contains;
  - A different file can be specified with the spark-submit parameter --properties-file.
- Priority
  - SparkConf > command-line parameters > file configuration
- Viewing the property configuration
  - The application's web UI (http://<driver>:4040) shows the property configuration and can be used to check that properties are set correctly;
  - Only properties explicitly set through the three methods above are displayed; all other properties take their default values;
  - For most internal control properties, the system already provides reasonable defaults.
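A minimal sketch of the three methods, assuming a hypothetical application name, master URL, and application jar. The SparkConf calls are the first method; the comments show roughly equivalent spark-submit parameters and spark-defaults.conf entries.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Method 1: SparkConf -- properties are passed directly to the SparkContext.
val conf = new SparkConf()
  .setMaster("spark://master:7077")        // common property: cluster manager (placeholder URL)
  .setAppName("MyApp")                     // common property: application name (placeholder)
  .set("spark.executor.memory", "1g")      // any other property via set(key, value)
val sc = new SparkContext(conf)

// Method 2: command-line parameters at submit time, e.g.
//   spark-submit --master spark://master:7077 --name MyApp \
//     --executor-memory 1g my-app.jar
//
// Method 3: file configuration -- one key-value pair per line in
// conf/spark-defaults.conf (or a file passed with --properties-file):
//   spark.master           spark://master:7077
//   spark.app.name         MyApp
//   spark.executor.memory  1g
```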
2: Common properties in Spark 1.0.0
A: Application properties
| Property name | Default | Description |
|---|---|---|
| spark.app.name | None | The name of the application |
| spark.master | None | The cluster manager to connect to |
| spark.executor.memory | 512m | Amount of memory to use per executor |
| spark.serializer | org.apache.spark.serializer.JavaSerializer | Serializer used for network data transfer and caching. The default Java serializer works with any serializable Java object and offers good compatibility, but it is quite slow; when speed matters, org.apache.spark.serializer.KryoSerializer is recommended. Any subclass of org.apache.spark.serializer.Serializer can also be used. |
| spark.kryo.registrator | None | To use the Kryo serializer, create a class that extends KryoRegistrator and set spark.kryo.registrator to point to it. |
| spark.local.dir | /tmp | Directory used for scratch space, holding map output files and RDDs spilled to disk. It should be on a fast local disk; multiple directories on different disks can be listed, separated by commas. Note: from Spark 1.0 onward this property is overridden by the environment variable SPARK_LOCAL_DIRS (standalone, Mesos) or LOCAL_DIRS (YARN) set by the cluster manager. |
| spark.logConf | false | Log the effective SparkConf when the SparkContext starts. |
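As an illustration of spark.serializer and spark.kryo.registrator, here is a minimal sketch; the Person class and the MyRegistrator name are hypothetical stand-ins for application classes.

```scala
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoRegistrator

// Hypothetical application class whose instances will be serialized by Kryo.
case class Person(name: String, age: Int)

// A KryoRegistrator subclass that registers the application's classes with Kryo.
class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo) {
    kryo.register(classOf[Person])
  }
}

// Point spark.serializer at the Kryo serializer and spark.kryo.registrator at the class above.
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "MyRegistrator")
```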
B: Runtime environment
| Property name | Default | Description |
|---|---|---|
| spark.executor.memory | 512m | Amount of memory to allocate per executor process (in the same format as JVM memory strings, e.g. 512m or 2g) |
| spark.executor.extraJavaOptions | None | Extra JVM options to pass to executors. Note that this cannot be used to set Spark properties or heap size. |
| spark.executor.extraClassPath | None | Extra entries to prepend to the executor classpath, mainly for backward compatibility with older versions of Spark. |
| spark.executor.extraLibraryPath | None | Special library path to use when launching the executor JVM. |
| spark.files.userClassPathFirst | false | Whether executors give user-added JARs precedence over Spark's own JARs when loading classes. This can be used to resolve conflicts between Spark's dependencies and the application's dependencies. It is currently an experimental feature. |
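For example, extra JVM options for executors could be supplied as follows (the GC-logging flags are only illustrative); note that heap size still has to go through spark.executor.memory rather than an -Xmx option.

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.executor.memory", "2g")   // heap size: set here, not via -Xmx in extraJavaOptions
  .set("spark.executor.extraJavaOptions",
       "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")  // example GC-logging flags
```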
C: Shuffle operations
| Property name | Default | Description |
|---|---|---|
| spark.shuffle.consolidateFiles | false | If true, intermediate files created during a shuffle are consolidated. Consolidation can improve file-system performance for shuffles with a large number of reduce tasks. Setting this to true is recommended on ext4 or XFS file systems; on ext3, file-system limitations can degrade performance on machines with more than 8 cores. |
| spark.shuffle.spill | true | If true, total memory use is limited by spilling data to disk during shuffles. The spill threshold is specified by spark.shuffle.memoryFraction. |
| spark.shuffle.spill.compress | true | Whether to compress data spilled during shuffles. If enabled, spark.io.compression.codec is used. |
| spark.shuffle.compress | true | Whether to compress map output files, using spark.io.compression.codec. |
| spark.shuffle.file.buffer.kb | 100 | Size in KB of the in-memory buffer for each shuffle file output stream. These buffers reduce the number of disk seeks and system calls made when creating intermediate shuffle files. |
| spark.reducer.maxMbInFlight | 48 | Maximum size (in MB) of map output fetched simultaneously by each reduce task. Since each output requires a buffer to receive it, this represents a fixed memory overhead per reduce task, so keep it small unless memory is plentiful. |
D: Spark UI
| Property name | Default | Description |
|---|---|---|
| spark.ui.port | 4040 | Port of the application's web UI |
| spark.ui.retainedStages | 1000 | Number of stages the web UI retains before garbage collecting |
| spark.ui.killEnabled | true | Allow stages and their corresponding jobs to be killed from the web UI |
| spark.eventLog.enabled | false | Whether to log Spark events, used to reconstruct the web UI after the application has finished. |
| spark.eventLog.compress | false | Whether to compress logged Spark events, if spark.eventLog.enabled is true. |
| spark.eventLog.dir | file:///tmp/spark-events | If spark.eventLog.enabled is true, the base directory in which Spark events are logged. Within this base directory, Spark creates a subdirectory for each application and logs that application's events there. This can be set to an HDFS directory so that a history server can read the history files. |
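For instance, to let a history server rebuild the web UI after the application finishes, event logging might be enabled as follows (the HDFS path is a placeholder):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.eventLog.enabled", "true")                            // record Spark events
  .set("spark.eventLog.compress", "true")                           // optionally compress them
  .set("spark.eventLog.dir", "hdfs://namenode:8020/spark-events")   // placeholder event-log directory
```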
E: Compression and serialization
| Property name | Default | Description |
|---|---|---|
| spark.broadcast.compress | true | Whether to compress broadcast variables before sending them. |
| spark.rdd.compress | false | Whether to compress serialized RDD partitions. This can save substantial space at the cost of some extra CPU time. |
| spark.io.compression.codec | org.apache.spark.io.LZFCompressionCodec | Codec used to compress internal data such as RDD partitions and shuffle output. Spark provides two codecs: org.apache.spark.io.LZFCompressionCodec and org.apache.spark.io.SnappyCompressionCodec. Snappy compresses and decompresses faster, while LZF achieves a better compression ratio. |
| spark.io.compression.snappy.block.size | 32768 | Block size (in bytes) used by the Snappy codec. |
| spark.closure.serializer | org.apache.spark.serializer.JavaSerializer | Serializer used for closures. Currently only the Java serializer is supported. |
| spark.serializer.objectStreamReset | 10000 | When org.apache.spark.serializer.JavaSerializer is used, the serializer caches objects to avoid writing redundant data, which prevents those objects from being garbage collected. Calling reset on the serializer flushes that cache so old objects can be collected. Set this to <= 0 to disable the reset; by default the serializer is reset every 10000 objects. |
| spark.kryo.referenceTracking | true | Whether to track references to the same object when serializing data with Kryo. This is necessary if the object graph contains cycles, and useful if it contains multiple copies of the same object; otherwise it can be disabled to improve performance. |
| spark.kryoserializer.buffer.mb | 2 | Maximum object size that Kryo allows (Kryo creates a buffer at least as large as the largest single object to be serialized). Increase this value if Kryo reports a "buffer limit exceeded" exception. Note that there is one buffer per core on each worker. |
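A short sketch of how spark.rdd.compress relates to serialized storage: the property only affects partitions stored in serialized form, such as MEMORY_ONLY_SER (the application name, master URL, and data are illustrative).

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val conf = new SparkConf()
  .setAppName("compress-demo")          // illustrative name
  .setMaster("local[2]")
  .set("spark.rdd.compress", "true")    // compress serialized RDD partitions
val sc = new SparkContext(conf)

// Only serialized storage levels are affected; plain MEMORY_ONLY is not compressed.
val data = sc.parallelize(1 to 1000000)
data.persist(StorageLevel.MEMORY_ONLY_SER)
data.count()
```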
F: Execution behavior
| Property name | Default | Description |
|---|---|---|
| spark.default.parallelism | Local mode: number of cores on the local machine; Mesos fine-grained mode: 8; otherwise: total number of cores on all executors, or 2, whichever is larger | Default number of tasks to use for shuffle operations in the cluster (groupByKey, reduceByKey, etc.) when the user does not specify one. |
| spark.broadcast.factory | org.apache.spark.broadcast.HttpBroadcastFactory | Broadcast implementation class |
| spark.broadcast.blockSize | 4096 | Block size (in KB) used by TorrentBroadcastFactory. Too large a value reduces parallelism during a broadcast (making it slower); too small a value may hurt BlockManager performance. |
| spark.files.overwrite | false | Whether to overwrite a file added through SparkContext.addFile() when the target file already exists and its contents do not match. |
| spark.files.fetchTimeout | false | Whether to use a communication timeout when fetching files added by the driver through SparkContext.addFile(). |
| spark.storage.memoryFraction | 0.6 | Fraction of the Java heap to use for Spark's memory cache |
| spark.tachyonStore.baseDir | System.getProperty("java.io.tmpdir") | Directory in Tachyon used to store RDDs. The URL of the Tachyon file system is set by spark.tachyonStore.url. Multiple Tachyon directories can be listed, separated by commas. |
| spark.storage.memoryMapThreshold | 8192 | Block size, in bytes, above which Spark memory-maps blocks when reading them from disk. This prevents Spark from memory-mapping very small blocks; in general, memory mapping has high overhead for blocks near or below the page size of the operating system. |
| spark.tachyonStore.url | tachyon://localhost:19998 | URL of the underlying Tachyon file system. |
| spark.cleaner.ttl | Unlimited | How long, in seconds, Spark remembers any metadata (generated stages, generated tasks, and so on). Periodic cleanup ensures that stale metadata is forgotten, which is useful for long-running jobs such as 24/7 Spark Streaming applications. Note that RDDs persisted in memory are also cleared once they expire. |
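As an example of spark.default.parallelism, a shuffle operation such as groupByKey falls back to this value when no partition count is passed explicitly (the property value and data below are illustrative).

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._   // pair-RDD operations such as groupByKey

val conf = new SparkConf()
  .setAppName("parallelism-demo")
  .setMaster("local[4]")
  .set("spark.default.parallelism", "8")   // default number of shuffle tasks
val sc = new SparkContext(conf)

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
pairs.groupByKey().partitions.size        // 8: taken from spark.default.parallelism
pairs.groupByKey(2).partitions.size       // 2: an explicit partition count overrides the default
```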
G: Network Communication
| Property name | Default | Description |
|---|---|---|
| spark.driver.host | Local hostname | Hostname or IP address of the driver. |
| spark.driver.port | Random | Port on which the driver listens. |
| spark.akka.frameSize | 10 | Maximum size, in MB, of messages exchanged between the driver and executors. A larger value allows the driver to receive larger computation results. |
| spark.akka.threads | 4 | Number of actor threads used for communication. Can be increased for drivers with many CPU cores in large clusters. |
| spark.akka.timeout | 100 | Communication timeout between Spark nodes, in seconds. |
| spark.akka.heartbeat.pauses | 600 | This and the following two parameters configure Akka's built-in failure detector, in seconds. If they are hard to set correctly, the failure detector can simply be left disabled; it is usually enabled only for special requirements. A sensitive failure detector can help locate misbehaving executors, but it is of little use during GC pauses or network delays, and enabling it causes frequent heartbeat exchanges that can flood the network. This parameter sets the acceptable heartbeat pause time. |
| spark.akka.failure-detector.threshold | 300.0 | Corresponds to Akka's akka.remote.transport-failure-detector.threshold |
| spark.akka.heartbeat.interval | 1000 | Heartbeat interval |
H: Scheduling
| Property name | Default | Description |
|---|---|---|
| spark.task.cpus | 1 | Number of cores allocated to each task. |
| spark.task.maxFailures | 4 | Number of times a task may fail before the job gives up on it. Must be greater than or equal to 1. |
| spark.scheduler.mode | FIFO | Scheduling mode between jobs submitted to the same SparkContext. FAIR mode is useful for multi-user services. |
| spark.cores.max | Not set | When an application runs on a standalone cluster, or on a Mesos cluster in coarse-grained sharing mode, the maximum number of CPU cores the application requests from the cluster as a whole (not from each machine). If not set, a standalone cluster uses the value of spark.deploy.defaultCores and Mesos uses all available cores in the cluster. |
| spark.mesos.coarse | false | If true, use coarse-grained sharing mode when running on a Mesos cluster. |
| spark.speculation | false | This and the following parameters relate to Spark's speculative execution mechanism. If set to true, Spark speculatively re-launches slow tasks of a stage on other nodes and takes the result of whichever copy finishes first. |
| spark.speculation.interval | 100 | How often, in milliseconds, Spark checks task progress for speculation. |
| spark.speculation.quantile | 0.75 | Fraction of tasks in a stage that must be complete before speculation is enabled. |
| spark.speculation.multiplier | 1.5 | How many times slower than the median completed task a task must be before speculation is triggered for it |
| spark.locality.wait | 3000 | This and the following parameters relate to data locality. How long, in milliseconds, to wait to launch a data-local task before falling back to the next locality level. The same wait applies to each locality level (process-local > node-local > rack-local > any); per-level waits can also be set with spark.locality.wait.node and the related parameters below. |
| spark.locality.wait.process | spark.locality.wait | Wait time for process-local tasks |
| spark.locality.wait.node | spark.locality.wait | Wait time for node-local tasks |
| spark.locality.wait.rack | spark.locality.wait | Wait time for rack-local tasks |
| spark.scheduler.revive.interval | 1000 | Maximum interval, in milliseconds, at which a task re-requests resources after being held back. This happens when a task cannot obtain local resources because they were allocated to other tasks; if enough resources become available again within the wait time, the computation continues. |
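A sketch combining the scheduling properties above: FAIR scheduling between jobs of a single SparkContext plus speculative re-execution of slow tasks (the three threshold values simply restate the defaults).

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.scheduler.mode", "FAIR")         // fair sharing between jobs of this SparkContext
  .set("spark.speculation", "true")            // re-launch slow tasks speculatively
  .set("spark.speculation.interval", "100")    // check task progress every 100 ms (default)
  .set("spark.speculation.quantile", "0.75")   // wait until 75% of the stage's tasks have finished (default)
  .set("spark.speculation.multiplier", "1.5")  // a task 1.5x slower than the median is a candidate (default)
```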
I: Security
| Property name | Default | Description |
|---|---|---|
| spark.authenticate | false | Whether Spark authenticates its internal connections. |
| spark.authenticate.secret | None | Secret key used for authentication between Spark components. It must be set if spark.authenticate is true and Spark is not running on YARN. |
| spark.core.connection.auth.wait.timeout | 30 | Number of seconds a connection waits for authentication to complete before timing out. |
| spark.ui.filters | None | Comma-separated list of filter class names to apply to the Spark web UI. The filters must be standard javax servlet Filters. Parameters for each filter can be specified with Java system properties of the form spark.<class name of filter>.params='param1=value1,param2=value2', for example: -Dspark.ui.filters=com.test.filter1 -Dspark.com.test.filter1.params='param1=foo,param2=testing' |
| spark.ui.acls.enable | false | Whether access control is enabled for the Spark web UI. If enabled, the system checks whether a user has permission when the web interface is viewed. |
| spark.ui.view.acls | Empty | Comma-separated list of users allowed to view the Spark web UI. By default only the user who started the Spark job has access. |
J: Spark Streaming
| Property name | Default | Description |
|---|---|---|
| spark.streaming.blockInterval | 200 | Interval, in milliseconds, at which Spark Streaming receivers coalesce received data into blocks before storing them in Spark. |
| spark.streaming.unpersist | true | If true, persisted RDDs generated by Spark Streaming are forcibly removed from Spark's memory, and the raw input data received by Spark Streaming is cleaned up automatically as well. If false, the raw input data and persisted RDDs remain accessible to external streaming applications, because they are not cleaned up automatically. |
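In a Spark Streaming application these properties go on the same SparkConf that is handed to the StreamingContext; the batch interval and block interval below are illustrative.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("streaming-demo")
  .setMaster("local[2]")
  .set("spark.streaming.blockInterval", "200")  // group received data into 200 ms blocks
  .set("spark.streaming.unpersist", "true")     // let Spark clean up old RDDs and raw input automatically

// With a 2-second batch, each receiver produces 2000 ms / 200 ms = 10 blocks per batch.
val ssc = new StreamingContext(conf, Seconds(2))
```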
3: Cluster-specific properties
A: Standalone-specific properties
In standalone mode, properties can also be set through the environment-variable file conf/spark-env.sh. The relevant configuration items are:
- SPARK_MASTER_OPTS: properties used by the master
- SPARK_WORKER_OPTS: properties used by workers
- SPARK_DAEMON_JAVA_OPTS: properties used by both the master and workers
They are configured with statements of the form:
export SPARK_MASTER_OPTS="-Dx1=y1 -Dx2=y2"
# where x1, x2 are property names and y1, y2 are the corresponding values
SPARK_MASTER_OPTS supports the following properties:
| Property name | Default | Description |
|---|---|---|
| spark.deploy.spreadOut | true | Whether the standalone cluster manager spreads applications across nodes or consolidates them onto as few nodes as possible. Spreading out gives better data locality; consolidating is more efficient for compute-intensive workloads. |
| spark.deploy.defaultCores | Unlimited | Maximum number of cores the standalone cluster assigns to an application when spark.cores.max is not set. If left unset, applications get all available cores. On a shared cluster, set this to a low value to prevent users from grabbing all cores by default and affecting others. |
| spark.worker.timeout | 60 | Number of seconds after which the master considers a worker lost because no heartbeat was received. |
SPARK_WORKER_OPTS supports the following properties:
| Property name | Default | Description |
|---|---|---|
| spark.worker.cleanup.enabled | false | Whether to periodically clean up the worker's application work directories. This applies only to standalone mode, not YARN, and only directories of applications that are no longer running are cleaned. |
| spark.worker.cleanup.interval | 1800 | Interval, in seconds, at which the worker cleans up old application work directories on the local machine. |
| spark.worker.cleanup.appDataTtl | 7*24*3600 | How long the worker retains each application's work directory. Set this according to available disk space (the directories contain application logs and application JARs) and how frequently applications are submitted. |
SPARK_DAEMON_JAVA_OPTS supports the following properties:
| Property name | Description |
|---|---|
| spark.deploy.recoveryMode | This and the following two parameters configure master HA via ZooKeeper; set to ZOOKEEPER to enable standby-master recovery. Default: NONE. |
| spark.deploy.zookeeper.url | ZooKeeper cluster URL |
| spark.deploy.zookeeper.dir | ZooKeeper directory in which recovery state is stored. Default: /spark. |
| spark.deploy.recoveryMode | Set to FILESYSTEM to enable single-node master recovery mode. Default: NONE. |
| spark.deploy.recoveryDirectory | Directory in which Spark stores recovery state |
B: YARN-specific properties
YARN-specific properties can be configured either through SparkConf or through the conf/spark-defaults.conf file.
| Property name | Default | Description |
|---|---|---|
| spark.yarn.applicationMaster.waitTries | 10 | Number of times the ApplicationMaster waits for the Spark master, and likewise for the SparkContext to be initialized. Startup fails if this is exceeded. |
| spark.yarn.submit.file.replication | 3 | HDFS replication factor for files the application uploads to HDFS |
| spark.yarn.preserve.staging.files | false | If true, staging files are preserved rather than deleted when the job finishes. |
| spark.yarn.scheduler.heartbeat.interval-ms | 5000 | Interval, in milliseconds, at which the Spark ApplicationMaster sends heartbeats to the YARN ResourceManager |
| spark.yarn.max.executor.failures | 2 times the number of executors | Maximum number of executor failures before the application is declared failed |
| spark.yarn.historyServer.address | None | Address of the Spark history server (it should not contain a scheme such as http://). The address is passed to the YARN ResourceManager when the application finishes, linking the RM UI to the history server UI. |