Spark 1.0.0 Property Configuration


1: Spark 1.0.0 property configuration methods

Spark properties control most application settings and can be configured separately for each application.

Spark 1.0.0 provides three ways to configure properties:

  • SparkConf method
    • A SparkConf object can pass property values directly to the SparkContext;
    • SparkConf has dedicated setters for a few common properties, such as setMaster for the master URL and setAppName for the application name;
    • Any other property can be configured as a key-value pair with the set() method, for example set("spark.executor.memory", "1g") (see the sketch after this list).
  • Command-line parameters
    • Properties are passed as command-line parameters when an application is submitted with spark-submit or spark-shell;
    • This makes it easy to tailor the running environment of each individual application;
    • Run spark-submit --help or spark-shell --help to display the complete list of options.
  • File configuration
    • Property settings are written to a text file as key-value pairs, one configuration item per line;
    • The default file is conf/spark-defaults.conf; spark-submit checks for this file when an application is submitted and, if it exists, loads the property settings in it;
    • A different file location can be specified with the spark-submit parameter --properties-file.
  • Priority
    • SparkConf > command-line parameters > file configuration
  • Viewing the property configuration
    • The application's web UI (http://<driver>:4040) shows the property configuration, which is useful for checking that properties are set correctly;
    • Only properties explicitly specified through the three methods above are shown; all other properties use their default values;
    • For most internal control properties, the system already provides reasonable defaults.
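
As an illustration of the SparkConf method, here is a minimal sketch in Scala. The application name, master URL, and property values are placeholders chosen for this example, not values taken from this article.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical application settings, for illustration only.
val conf = new SparkConf()
  .setMaster("local[2]")                 // common property via a dedicated setter
  .setAppName("config-demo")             // common property via a dedicated setter
  .set("spark.executor.memory", "1g")    // any property via set(key, value)

val sc = new SparkContext(conf)
```

The same memory setting could instead be given on the command line (for example spark-submit --executor-memory 1g) or as the line spark.executor.memory 1g in conf/spark-defaults.conf; following the priority rule above, the SparkConf value wins if all three are present.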
 

2: Common properties in Spark 1.0.0

A: Application properties

Property name | Default | Description
spark.app.name | (none) | The name of the application.
spark.master | (none) | The cluster manager to connect to.
spark.executor.memory | 512m | Amount of memory to use per executor.
spark.serializer | org.apache.spark.serializer.JavaSerializer | Serializer used for network data transfer and for caching. The default Java serializer works with any serializable Java object and so has good compatibility, but it is quite slow; if speed matters, org.apache.spark.serializer.KryoSerializer is recommended. It can also be set to any subclass of org.apache.spark.serializer.Serializer.
spark.kryo.registrator | (none) | To use the Kryo serializer, create a class that extends KryoRegistrator and set spark.kryo.registrator to point to that class (see the sketch after this table).
spark.local.dir | /tmp | Directory used for scratch space, where map output files and RDDs spilled to disk are stored. It should be on a fast local disk, and it can also be a comma-separated list of directories on different disks. Note: in Spark 1.0 and later this property is overridden by the environment variable SPARK_LOCAL_DIRS (Standalone, Mesos) or LOCAL_DIRS (YARN) set by the cluster manager.
spark.logConf | false | Log the effective SparkConf settings when a SparkContext is started.
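
To make the spark.serializer and spark.kryo.registrator entries above concrete, here is a minimal sketch of switching to Kryo. MyClass and MyRegistrator are hypothetical names used only for this example; in a real application they would live in the application's own package.

```scala
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoRegistrator

// Hypothetical application class to be serialized with Kryo.
case class MyClass(id: Int, name: String)

// Registrator that tells Kryo about the application's classes.
class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo) {
    kryo.register(classOf[MyClass])
  }
}

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "MyRegistrator")
```

Registering classes lets Kryo write compact identifiers instead of full class names in the serialized output.
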
B: Runtime environment

Property name | Default | Description
spark.executor.memory | 512m | Amount of memory to use per executor process, in the same format as JVM memory strings (e.g. 512m, 2g).
spark.executor.extraJavaOptions | (none) | Extra JVM options to pass to executors. Note that this cannot be used to set Spark properties or heap size.
spark.executor.extraClassPath | (none) | Extra classpath entries to prepend to the executor classpath, mainly for backwards compatibility with older versions of Spark.
spark.executor.extraLibraryPath | (none) | Special library path to use when launching executor JVMs.
spark.files.userClassPathFirst | false | Whether executors give user-added jars precedence over Spark's own jars when loading classes. This can be used to work around conflicts between Spark's dependencies and user dependencies. It is currently an experimental feature.

C: Shuffle behavior

Property name | Default | Description
spark.shuffle.consolidateFiles | false | If set to true, intermediate files created during a shuffle are consolidated. Creating fewer files can improve filesystem performance for shuffles with a large number of reduce tasks. Setting it to true is recommended on ext4 or XFS filesystems; on ext3, due to filesystem limitations, it can degrade performance on machines with more than 8 cores.
spark.shuffle.spill | true | If set to true, total memory use during a shuffle is limited by spilling data to disk. The spill threshold is specified by spark.shuffle.memoryFraction.
spark.shuffle.spill.compress | true | Whether to compress data spilled during shuffles; if enabled, spark.io.compression.codec is used.
spark.shuffle.compress | true | Whether to compress map output files; spark.io.compression.codec is used for compression.
spark.shuffle.file.buffer.kb | 100 | Size of the in-memory buffer for each shuffle file output stream, in KB. These buffers reduce the number of disk seeks and system calls made when creating intermediate shuffle files.
spark.reducer.maxMbInFlight | 48 | Maximum size (in MB) of map output that each reduce task fetches simultaneously. Since each output requires a buffer to receive it, this represents a fixed memory overhead per reduce task, so keep it small unless there is plenty of memory.
D: Spark UI

Property name | Default | Description
spark.ui.port | 4040 | Port of the application's web UI.
spark.ui.retainedStages | 1000 | Number of stages the web UI retains before garbage collecting.
spark.ui.killEnabled | true | Allow stages and their corresponding jobs to be killed from the web UI.
spark.eventLog.enabled | false | Whether to log Spark events, used to reconstruct the web UI after the application has finished.
spark.eventLog.compress | false | Whether to compress logged Spark events, if spark.eventLog.enabled is true.
spark.eventLog.dir | file:///tmp/spark-events | If spark.eventLog.enabled is true, the base directory in which Spark events are logged. Spark creates a sub-directory for each application in this directory and logs that application's events there. This can be set to an HDFS directory so that history files can be read by the history server.
E: Compression and serialization

Property name | Default | Description
spark.broadcast.compress | true | Whether to compress broadcast variables before sending them.
spark.rdd.compress | false | Whether to compress serialized RDD partitions. This can save substantial space at the cost of some extra CPU time.
spark.io.compression.codec | org.apache.spark.io.LZFCompressionCodec | Codec used to compress internal data such as RDD partitions and shuffle output. Spark provides two codecs: org.apache.spark.io.LZFCompressionCodec and org.apache.spark.io.SnappyCompressionCodec. Snappy provides faster compression and decompression, while LZF provides a better compression ratio.
spark.io.compression.snappy.block.size | 32768 | Block size (in bytes) used by the Snappy codec.
spark.closure.serializer | org.apache.spark.serializer.JavaSerializer | Serializer used for closures. Currently only the Java serializer is supported.
spark.serializer.objectStreamReset | 10000 | When serializing with org.apache.spark.serializer.JavaSerializer, the serializer caches objects to avoid writing redundant data, which prevents those objects from being garbage collected. Calling reset on the serializer flushes the cache so old objects can be collected. Set this to a value <= 0 to disable resetting; by default the serializer is reset every 10000 objects.
spark.kryo.referenceTracking | true | Whether to track references to the same object when serializing with Kryo. This is necessary if the object graph contains cycles, and useful if it contains multiple copies of the same object; otherwise it can be disabled to improve performance.
spark.kryoserializer.buffer.mb | 2 | Maximum object size that Kryo allows (Kryo creates a buffer at least as large as the largest single object it serializes). Increase this value if a "buffer limit exceeded" exception occurs in Kryo. Note that there is one buffer per core on each worker.
F: Execution behavior

Property name | Default | Description
spark.default.parallelism | Local mode: number of cores on the local machine; Mesos fine-grained mode: 8; otherwise: total number of cores on all executors, or 2, whichever is larger | Default number of tasks used by shuffle operations in the cluster (groupByKey, reduceByKey, and so on) when the user does not specify one (see the sketch after this table).
spark.broadcast.factory | org.apache.spark.broadcast.HttpBroadcastFactory | Which broadcast implementation to use.
spark.broadcast.blockSize | 4096 | Size of each block (in KB) for TorrentBroadcastFactory. Too large a value reduces parallelism during a broadcast and slows it down; too small a value may hurt BlockManager performance.
spark.files.overwrite | false | Whether to overwrite a file added through SparkContext.addFile() when the target file already exists and its contents do not match those of the source.
spark.files.fetchTimeout | false | Whether to use a communication timeout when fetching files added through SparkContext.addFile() from the driver.
spark.storage.memoryFraction | 0.6 | Fraction of the Java heap to use for Spark's memory cache.
spark.tachyonStore.baseDir | System.getProperty("java.io.tmpdir") | Directory in Tachyon used to store RDDs. The Tachyon file system URL is set by spark.tachyonStore.url. This can also be a comma-separated list of multiple Tachyon directories.
spark.storage.memoryMapThreshold | 8192 | Block size in bytes above which Spark memory-maps blocks when reading them from disk. This prevents Spark from memory-mapping very small blocks; in general, memory mapping has high overhead for blocks close to or below the operating system's page size.
spark.tachyonStore.url | tachyon://localhost:19998 | URL of the Tachyon file system to use.
spark.cleaner.ttl | (infinite) | Duration (in seconds) for which Spark remembers any metadata (stages generated, tasks generated, and so on). Periodic cleanup ensures that stale metadata is forgotten, which is useful for long-running jobs such as 24/7 Spark Streaming applications. Note that RDDs persisted in memory are also cleared once they expire.
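
To illustrate how spark.default.parallelism is used, here is a small sketch. It assumes an existing SparkContext named sc whose configuration set spark.default.parallelism; the value 16 below is just an illustrative choice.

```scala
// Needed in Spark 1.0 for the pair-RDD operations below
// (spark-shell imports this automatically).
import org.apache.spark.SparkContext._

// sc: an existing SparkContext created with
//   new SparkConf().set("spark.default.parallelism", "16")
val pairs = sc.parallelize(1 to 1000).map(n => (n % 10, n))

// No partition count given: groupByKey falls back to spark.default.parallelism.
val grouped = pairs.groupByKey()

// An explicit partition count overrides the default.
val sums = pairs.reduceByKey(_ + _, 4)
```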
G: Networking

Property name | Default | Description
spark.driver.host | (local hostname) | Hostname or IP address of the driver.
spark.driver.port | (random) | Port on which the driver listens.
spark.akka.frameSize | 10 | Maximum size, in MB, of messages exchanged between the driver and executors. A larger value allows the driver to accept larger computation results.
spark.akka.threads | 4 | Number of actor threads used for communication. Worth increasing on drivers with many CPU cores in large clusters.
spark.akka.timeout | 100 | Communication timeout between Spark nodes, in seconds.
spark.akka.heartbeat.pauses | 600 | This and the following two parameters configure Akka's built-in failure detector. Disable it if configuring it properly is difficult; to enable it, set these parameters in seconds. The failure detector is usually enabled only for special needs: a sensitive failure detector helps locate misbehaving executors, but it should not be enabled when GC pauses or network delays are expected, and enabling it leads to frequent heartbeat exchanges that can flood the network. This parameter sets the acceptable heartbeat pause time.
spark.akka.failure-detector.threshold | 300.0 | Corresponds to Akka's akka.remote.transport-failure-detector.threshold setting.
spark.akka.heartbeat.interval | 1000 | Heartbeat interval.
H: Scheduling

Property name | Default | Description
spark.task.cpus | 1 | Number of cores allocated to each task.
spark.task.maxFailures | 4 | Number of task failures before a job is abandoned. The value must be greater than or equal to 1.
spark.scheduler.mode | FIFO | Scheduling mode between jobs submitted to the same SparkContext. FAIR mode can be useful when multiple users share the context.
spark.cores.max | (not set) | When an application runs on a standalone cluster or on a Mesos cluster in coarse-grained sharing mode, the maximum number of CPU cores the application requests from the cluster as a whole (not from each machine). If not set, standalone clusters use the value of spark.deploy.defaultCores, and Mesos uses all available cores in the cluster.
spark.mesos.coarse | false | If set to true, use the coarse-grained sharing mode when running on a Mesos cluster.
spark.speculation | false | This and the following parameters relate to Spark's speculative execution mechanism. This one controls whether speculation is enabled: if true, Spark re-launches slow tasks of a stage on other nodes and takes the result of the first copy to finish as the final result.
spark.speculation.interval | 100 | How often, in milliseconds, Spark checks running tasks for speculation.
spark.speculation.quantile | 0.75 | Percentage of tasks in a stage that must be complete before speculation is started.
spark.speculation.multiplier | 1.5 | How many times slower than the median completed task a task must be before it is considered for speculation.
spark.locality.wait | 3000 | This and the following parameters relate to data locality. This one is how long, in milliseconds, to wait to launch a data-local task before falling back to the next locality level. The same value is used for each locality level (process-local > node-local > rack-local > any); per-level wait times can also be set with spark.locality.wait.node and the related parameters below.
spark.locality.wait.process | spark.locality.wait | Locality wait time for process-local tasks.
spark.locality.wait.node | spark.locality.wait | Locality wait time for node-local tasks.
spark.locality.wait.rack | spark.locality.wait | Locality wait time for rack-local tasks.
spark.scheduler.revive.interval | 1000 | Maximum interval, in milliseconds, between attempts to re-acquire resources for pending tasks. This applies when a task could not be allocated resources because they were assigned to other tasks; if enough resources become available within the wait, computation continues.
I: Security

Property name | Default | Description
spark.authenticate | false | Whether Spark authenticates its internal connections.
spark.authenticate.secret | None | Secret key used for authentication between Spark components. It must be set if spark.authenticate is true and the application is not running on YARN.
spark.core.connection.auth.wait.timeout | 30 | How long, in seconds, Spark connections wait for authentication before timing out.
spark.ui.filters | None | Comma-separated list of filter class names to apply to the Spark web UI. The filters must comply with the standard javax.servlet Filter interface. Parameters for each filter can be specified with Java system properties of the form spark.<class name of filter>.params='param1=value1,param2=value2'. For example: -Dspark.ui.filters=com.test.filter1 -Dspark.com.test.filter1.params='param1=foo,param2=testing'
spark.ui.acls.enable | false | Whether access control is enabled for the Spark web UI. If enabled, the system checks whether the user has permission to view the web interface.
spark.ui.view.acls | (empty) | Comma-separated list of users allowed to view the Spark web UI. By default, only the user who started the Spark job has access.

J: Spark Streaming

Property name | Default | Description
spark.streaming.blockInterval | 200 | Interval, in milliseconds, at which Spark Streaming receivers coalesce received data into blocks before storing them in Spark.
spark.streaming.unpersist | true | If set to true, RDDs persisted by Spark Streaming are forcibly removed from Spark's memory, and the raw input data received by Spark Streaming is also cleared automatically. If set to false, the raw input data and persisted RDDs remain accessible to external streaming applications because they are not cleared automatically.

3: Cluster-specific properties

A: Standalone properties

In standalone mode, properties can also be set through the environment variable file conf/spark-env.sh. The relevant configuration items are:

  • SPARK_MASTER_OPTS: properties used by the master
  • SPARK_WORKER_OPTS: properties used by workers
  • SPARK_DAEMON_JAVA_OPTS: properties used by both the master and workers

They are configured with statements of the form:

export SPARK_MASTER_OPTS="-Dx1=y1 -Dx2=y2"

# where x is the property name and y is the property value

SPARK_MASTER_OPTS supports the following properties:

Property name | Default | Description
spark.deploy.spreadOut | true | Whether the standalone cluster manager spreads applications out across nodes or consolidates them onto as few nodes as possible. Spreading out usually gives better data locality, while consolidating is more effective for compute-intensive workloads.
spark.deploy.defaultCores | (infinite) | Maximum number of cores the standalone cluster gives an application when spark.cores.max is not set. If not set, applications get all available cores. Note that on a shared cluster, setting a low value here prevents all cores from being grabbed by default and affecting other users.
spark.worker.timeout | 60 | Number of seconds after which the master considers a worker lost because it has received no heartbeat from it.

SPARK_WORKER_OPTS supports the following properties:

Property name | Default | Description
spark.worker.cleanup.enabled | false | Whether to periodically clean up the worker's application work directories. This applies only to standalone mode, not YARN, and only directories of applications that are no longer running are cleaned.
spark.worker.cleanup.interval | 1800 | Interval, in seconds, at which the worker cleans up expired application work directories on the local machine.
spark.worker.cleanup.appDataTtl | 7*24*3600 | How long, in seconds, the worker retains each application's work directory. Choose this based on available disk space, since the directory holds application logs and jars, and on how frequently applications are submitted.

SPARK_DAEMON_JAVA_OPTS supports the following properties:

Property name | Description
spark.deploy.recoveryMode | This and the following two parameters configure ZooKeeper-based master HA. Set it to ZOOKEEPER to enable master standby recovery; the default is NONE.
spark.deploy.zookeeper.url | ZooKeeper cluster URL.
spark.deploy.zookeeper.dir | ZooKeeper directory used to store recovery state; the default is /spark.
spark.deploy.recoveryMode | Set to FILESYSTEM to enable single-node master recovery mode; the default is NONE.
spark.deploy.recoveryDirectory | Directory in which Spark stores recovery state.
 

B: YARN-specific properties

YARN-specific properties can be configured with the SparkConf method or through the conf/spark-defaults.conf file.

Property name | Default | Description
spark.yarn.applicationMaster.waitTries | 10 | Number of times the ApplicationMaster waits for the SparkContext to be initialized; if this number is exceeded, startup fails.
spark.yarn.submit.file.replication | 3 | HDFS replication factor for files uploaded to HDFS by the application.
spark.yarn.preserve.staging.files | false | If set to true, staging files are preserved at the end of the job instead of being deleted.
spark.yarn.scheduler.heartbeat.interval-ms | 5000 | Interval at which the Spark ApplicationMaster sends heartbeats to the YARN ResourceManager.
spark.yarn.max.executor.failures | 2 * number of executors | Maximum number of executor failures before the application is declared failed.
spark.yarn.historyServer.address | None | Address of the Spark history server (including http://). This address is passed to the YARN ResourceManager when the application finishes, so that the RM UI links to the history server UI.
