Spark 2.3.0 + Kubernetes Application Deployment
Spark can run on clusters managed by Kubernetes, using native Kubernetes scheduling support that has been added to Spark. At present, the Kubernetes scheduler is experimental; in future versions there may be behavioral changes around configuration, container images, and entrypoints.
(1) Prerequisites.
Spark 2.3 or later is required, together with a running Kubernetes cluster at version 1.6 or above and access configured through kubectl. If you do not already have a working Kubernetes cluster, you can use Minikube to set up a test cluster locally; it is recommended to use the latest version of Minikube with the DNS addon enabled. Note that the default Minikube configuration is not enough to run a Spark application: 3 CPUs and 4 GB of memory are recommended, which is enough to launch a Spark application with a single executor. You must have permission to list, create, edit, and delete pods in the cluster, which can be verified with kubectl auth can-i <list|create|edit|delete> pods. The service account credentials used by the driver pod must allow creating pods, services, and ConfigMaps. Kubernetes DNS must also be configured in the cluster.
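For example, the permission check mentioned above can be run once per verb against whichever namespace you intend to submit into; kubectl auth can-i simply answers yes or no for each query:
$ kubectl auth can-i create pods
$ kubectl auth can-i list pods
$ kubectl auth can-i edit pods
$ kubectl auth can-i delete pods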
(2) How it works. Spark on Kubernetes works as shown in Figure 2-1.
Figure 2-1 Kubernetes schematic diagram
spark-submit can be used to submit a Spark application directly to a Kubernetes cluster. The submission mechanism works as follows:
• Spark creates a Spark driver running within a Kubernetes pod.
• The driver creates executors, which also run within Kubernetes pods, and executes the application code.
• When the application completes, the executor pods terminate and are cleaned up, but the driver pod persists its logs and remains in "completed" state in the Kubernetes API until it is eventually garbage collected or manually cleaned up. Note: in the completed state, the driver pod does not use any computational or memory resources.
The driver and executor pods are scheduled by Kubernetes. It is possible to schedule the driver and executor pods onto a subset of the available nodes through a node selector, using the relevant configuration properties. Future releases will probably support more advanced scheduling hints, such as node/pod affinities.
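As a minimal sketch of the node-selector approach, assuming the spark.kubernetes.node.selector.[labelKey] configuration property from the Spark on Kubernetes configuration and a hypothetical node label disktype=ssd, the driver and executor pods can be restricted to nodes carrying that label by adding the following to the spark-submit command:
--conf spark.kubernetes.node.selector.disktype=ssd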
(3) Submitting applications to Kubernetes.
Docker images: Kubernetes requires users to supply images that can be deployed into containers within pods. The image is built for a container runtime environment that Kubernetes supports; Docker is a container runtime frequently used with Kubernetes. Spark (starting with version 2.3) ships with a Dockerfile that can be used for this purpose, or customized to match an individual application's needs; it can be found in the directory kubernetes/dockerfiles/. Spark also ships with a bin/docker-image-tool.sh script that can be used to build and publish Docker images for use with the Kubernetes backend.
For example:
$ ./bin/docker-image-tool.sh -r <repo> -t my-tag build
$ ./bin/docker-image-tool.sh -r <repo> -t my-tag push
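The resulting image can then be referenced from spark-submit. The repository and tag below are the ones chosen above, and the image name spark is assumed to be the script's default:
--conf spark.kubernetes.container.image=<repo>/spark:my-tag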
(4) Cluster mode.
Submit the Spark Pi example program in cluster mode:
$ bin/spark-submit \
--master k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port> \
--deploy-mode cluster \
--name spark-pi \
--class org.apache.spark.examples.SparkPi \
--conf spark.executor.instances=5 \
--conf spark.kubernetes.container.image=<spark-image> \
local:///path/to/examples.jar
The Spark master, specified either through the --master command-line argument to spark-submit or by setting spark.master in the application's configuration, must be a URL of the form k8s://<api_server_url>. The k8s:// prefix causes the Spark application to launch on the Kubernetes cluster, with the API server contacted at api_server_url. If no HTTP protocol is specified in the URL, it defaults to https. For example, setting the master to k8s://example.com:443 is equivalent to setting it to k8s://https://example.com:443, but to connect without TLS on a different port, the master would be set to k8s://http://example.com:8080.
In Kubernetes mode, the Spark application name, specified by the spark.app.name property or the --name argument to spark-submit, is used by default to name the Kubernetes resources created, such as drivers and executors. The application name must consist of lowercase alphanumeric characters, "-", and ".", and must start and end with an alphanumeric character.
If you have a Kubernetes cluster set up, one way to discover the API server URL is by executing kubectl cluster-info:
$ kubectl cluster-info
Kubernetes master is running at http://127.0.0.1:6443
In the above example, the Spark application can be submitted to the Kubernetes cluster with spark-submit by specifying --master k8s://http://127.0.0.1:6443. In addition, you can also use the authenticating proxy, kubectl proxy, to communicate with the Kubernetes API.
The local proxy can be started by running:
$ kubectl proxy
If the local proxy is running at localhost:8001, --master k8s://http://127.0.0.1:8001 can be used as the spark-submit argument to submit the application. Finally, note that in the above example we specify a jar with the URI scheme local://, meaning that this example jar is already present in the Docker image.
(5) Dependency management.
If the application's dependencies are hosted in remote locations such as HDFS or HTTP servers, they can be referenced by their appropriate remote URIs. In addition, application dependencies can be pre-mounted into a custom-built Docker image. Those dependencies can be added to the classpath by referencing them with local:// URIs and/or by setting the SPARK_EXTRA_CLASSPATH environment variable in the Dockerfiles. The local:// scheme also requires a custom-built Docker image containing the dependencies. Note that using application dependencies from the submitting client's local file system is currently not supported.
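As a minimal sketch of the pre-mounted-dependency approach (the base image tag, target path, and jar name below are illustrative, not Spark defaults), a custom Dockerfile might copy a dependency into the image and expose it through SPARK_EXTRA_CLASSPATH:
# Hypothetical custom image built on top of a Spark image produced by docker-image-tool.sh
FROM my-repo/spark:my-tag
# Copy the application dependency into the image (path is illustrative)
COPY dependency1.jar /opt/spark/extra-jars/dependency1.jar
# Make it visible on the driver and executor classpath
ENV SPARK_EXTRA_CLASSPATH=/opt/spark/extra-jars/dependency1.jar
A dependency placed in the image this way can then be referenced from spark-submit as local:///opt/spark/extra-jars/dependency1.jar.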
(6) Using remote dependencies.
When there are application dependencies hosted in remote locations like HDFS or HTTP servers, the driver and executor pods need a Kubernetes init-container to download the dependencies so that the driver and executor containers can use them locally. The init-container handles remote dependencies specified in spark.jars (or the --jars option of spark-submit) and spark.files (or the --files option of spark-submit), as well as a remotely hosted main application resource, e.g. the main application jar. The following shows an example of using remote dependencies with spark-submit:
$ bin/spark-submit \
--master k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port> \
--deploy-mode cluster \
--name spark-pi \
--class org.apache.spark.examples.SparkPi \
--jars https://path/to/dependency1.jar,https://path/to/dependency2.jar \
--files hdfs://host:port/path/to/file1,hdfs://host:port/path/to/file2 \
--conf spark.executor.instances=5 \
--conf spark.kubernetes.container.image=<spark-image> \
https://path/to/examples.jar
(7) Secret management.
Kubernetes secrets can be used to provide credentials for a Spark application to access secured services. To mount a user-specified secret into the driver container, use a configuration property of the form spark.kubernetes.driver.secrets.[SecretName]=<mount path>. Similarly, to mount a user-specified secret into the executor containers, use a configuration property of the form spark.kubernetes.executor.secrets.[SecretName]=<mount path>. Note that the secret to be mounted is assumed to be in the same namespace as the driver and executor pods. For example, to mount a secret named spark-secret onto the path /etc/secrets in both the driver and executor containers, add the following options to the spark-submit command:
--conf spark.kubernetes.driver.secrets.spark-secret=/etc/secrets
--conf spark.kubernetes.executor.secrets.spark-secret=/etc/secrets
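The secret itself must already exist in the target namespace before the application is submitted; a minimal sketch of creating it (the credentials file name is purely illustrative) is:
$ kubectl create secret generic spark-secret --from-file=./credentials.txt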
Note that if an init-container is used, any secret mounted into the driver container will also be mounted into its init-container. Similarly, any secret mounted into an executor container will also be mounted into its init-container.
(8) Introspection and debugging.
The following are the different ways in which you can observe a running or completed Spark application and monitor its progress.
• Accessing logs. Logs can be accessed using the Kubernetes API or the kubectl CLI. When a Spark application is running, its logs can be streamed with:
$ kubectl -n=<namespace> logs -f <driver-pod-name>
The same logs can also be accessed through the Kubernetes dashboard if it is installed on the cluster.
• Accessing the driver UI. The UI associated with the application can be accessed locally using kubectl port-forward:
$ kubectl port-forward <driver-pod-name> 4040:4040
• Debugging. There may be several kinds of failures. If the Kubernetes API server rejects the request from spark-submit, or the connection is refused for a different reason, the submission logic should indicate the error encountered. However, if the failure happens while the application is running, often the best way to investigate is through the Kubernetes CLI.
To get some basic information about the scheduling decisions made for the driver pod, you can run:
$ kubectl describe pod <spark-driver-pod>
If the pod has encountered a runtime error, the status can be probed further using:
$ kubectl logs <spark-driver-pod>
The status and logs of failed executor pods can be checked in a similar way. Finally, deleting the driver pod will clean up the entire Spark application, including all executors, associated services, and so on. The driver pod can be thought of as the Kubernetes representation of the Spark application.
(9) Kubernetes features.
• Namespaces. Kubernetes has the concept of namespaces. Namespaces are a way to divide cluster resources between multiple users (via resource quota). Spark on Kubernetes can use namespaces to launch Spark applications; this is done with the spark.kubernetes.namespace configuration. Kubernetes allows using ResourceQuota to set limits on resources, number of objects, and so on for individual namespaces. Namespaces and ResourceQuota can be used in combination by an administrator to control how resources are allocated to the Spark applications running in the Kubernetes cluster.
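As an illustration (the namespace name and quota values below are hypothetical), a dedicated namespace with a resource quota can be created and then targeted from spark-submit:
$ kubectl create namespace spark-jobs
$ kubectl create quota spark-quota --hard=cpu=12,memory=24Gi --namespace=spark-jobs
The application is then submitted into that namespace by adding --conf spark.kubernetes.namespace=spark-jobs to the spark-submit command.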
• Role-based access control. In Kubernetes clusters with role-based access control (RBAC) enabled, users can configure Kubernetes RBAC roles and service accounts used by the various Spark on Kubernetes components to access the Kubernetes API server.
The Spark driver pod uses a Kubernetes service account to access the Kubernetes API server to create and watch executor pods. The service account used by the driver pod must have the appropriate permissions for the driver to do its work; specifically, at a minimum it must be granted a Role or ClusterRole that allows driver pods to create pods and services. By default, if no service account is specified when the pod is created, the driver pod is automatically assigned the default service account in the namespace specified by spark.kubernetes.namespace.
Depending on the version and setup of the Kubernetes deployment, this default service account may or may not have the role that allows driver pods to create pods and services under the default Kubernetes RBAC policies. Sometimes users may need to specify a custom service account that has been granted the right role. Spark on Kubernetes supports specifying a custom service account for the driver pod through the configuration property spark.kubernetes.authenticate.driver.serviceAccountName=<service account name>. For example, to make the driver pod use the spark service account, the user simply adds the following option to the spark-submit command:
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark
To create a custom service account, a user can use the kubectl create serviceaccount command. For example, the following command creates a service account named spark:
$ kubectl create serviceaccount spark
To grant a service account a Role or ClusterRole, a RoleBinding or ClusterRoleBinding is needed. A RoleBinding or ClusterRoleBinding can be created with the kubectl create rolebinding (or clusterrolebinding) command. For example, the following command creates a ClusterRoleBinding in the default namespace that grants the edit ClusterRole to the spark service account created above:
$ kubectl create clusterrolebinding spark-role --clusterrole=edit --serviceaccount=default:spark --namespace=default
Note that a Role can only be used to grant access to resources (like pods) within a single namespace, whereas a ClusterRole can be used to grant access to cluster-scoped resources (like nodes) as well as namespaced resources (like pods) across all namespaces. For Spark on Kubernetes, since the driver always creates executor pods in its own namespace, a Role is sufficient, although users may use a ClusterRole instead. For more information on RBAC authorization and how to configure Kubernetes service accounts for pods, see Using RBAC Authorization (https://kubernetes.io/docs/admin/authorization/rbac/) and Configure Service Accounts for Pods (https://kubernetes.io/docs/tasks/configure-pod-container/configure-service-account/).
(10) Client mode. Client mode is not currently supported.
(11) Future work. Spark on Kubernetes functionality is being incubated in the apache-spark-on-k8s/spark fork (https://github.com/apache-spark-on-k8s/spark), and will eventually make its way into the Spark-Kubernetes integration.
Some of the features expected to be included are:
• PySpark
• R
• Dynamic executor scaling
• Dependency management for local files
• Spark application management
• Work queues and resource management
You can refer to the documentation (https://apache-spark-on-k8s.github.io/userdocs/) to try these features and provide feedback to the development team.
(12) Configuration.
General Spark configuration information can be viewed on the page http://spark.apache.org/docs/latest/configuration.html. Spark on Kubernetes configuration information can be viewed on the page http://spark.apache.org/docs/latest/running-on-kubernetes.html.
Spark Properties

Property Name | Default | Meaning
spark.kubernetes.namespace | default | The namespace that will be used for running the driver and executor pods.
spark.kubernetes.container.image | (none) | Container image to use for the Spark application. This is usually of the form example.com/repo/spark:v1.0.0. This configuration is required and must be provided by the user, unless explicit images are provided for each different container type.
spark.kubernetes.driver.container.image | (value of spark.kubernetes.container.image) | Custom container image to use for the driver.
spark.kubernetes.executor.container.image | (value of spark.kubernetes.container.image) | Custom container image to use for executors.
spark.kubernetes.container.image.pullPolicy | IfNotPresent | Container image pull policy used when pulling images within Kubernetes.
spark.kubernetes.allocation.batch.size | 5 | Number of pods to launch at once in each round of executor pod allocation.
spark.kubernetes.allocation.batch.delay | 1s | Time to wait between each round of executor pod allocation. Specifying values less than 1 second may lead to excessive CPU usage on the Spark driver.
spark.kubernetes.authenticate.submission.caCertFile | (none) | Path to the CA cert file for connecting to the Kubernetes API server over TLS when starting the driver. This file must be located on the submitting machine's disk. Specify this as a path as opposed to a URI (i.e. do not provide a scheme).
spark.kubernetes.authenticate.submission.clientKeyFile | (none) | Path to the client key file for authenticating against the Kubernetes API server when starting the driver. This file must be located on the submitting machine's disk. Specify this as a path as opposed to a URI (i.e. do not provide a scheme).
spark.kubernetes.authenticate.submission.clientCertFile | (none) | Path to the client cert file for authenticating against the Kubernetes API server when starting the driver. This file must be located on the submitting machine's disk. Specify this as a path as opposed to a URI (i.e. do not provide a scheme).
spark.kubernetes.authenticate.submission.oauthToken | (none) | OAuth token to use when authenticating against the Kubernetes API server when starting the driver. Note that unlike the other authentication options, this is expected to be the exact string value of the token to use for the authentication.
spark.kubernetes.authenticate.submission.oauthTokenFile | (none) | Path to the OAuth token file containing the token to use when authenticating against the Kubernetes API server when starting the driver. This file must be located on the submitting machine's disk. Specify this as a path as opposed to a URI (i.e. do not provide a scheme).
spark.kubernetes.authenticate.driver.caCertFile | (none) | Path to the CA cert file for connecting to the Kubernetes API server over TLS from the driver pod when requesting executors. This file must be located on the submitting machine's disk, and will be uploaded to the driver pod. Specify this as a path as opposed to a URI (i.e. do not provide a scheme).
spark.kubernetes.authenticate.driver.clientKeyFile | (none) | Path to the client key file for authenticating against the Kubernetes API server from the driver pod when requesting executors. This file must be located on the submitting machine's disk, and will be uploaded to the driver pod. Specify this as a path as opposed to a URI (i.e. do not provide a scheme). If this is specified, it is highly recommended to set up TLS for the driver submission server, as this value is sensitive information that would be passed to the driver pod in plaintext otherwise.
spark.kubernetes.authenticate.driver.clientCertFile | (none) | Path to the client cert file for authenticating against the Kubernetes API server from the driver pod when requesting executors. This file must be located on the submitting machine's disk, and will be uploaded to the driver pod. Specify this as a path as opposed to a URI (i.e. do not provide a scheme).
spark.kubernetes.authenticate.driver.oauthToken | (none) | OAuth token to use when authenticating against the Kubernetes API server from the driver pod when requesting executors. Note that unlike the other authentication options, this must be the exact string value of the token to use for the authentication. This token value is uploaded to the driver pod. If this is specified, it is highly recommended to set up TLS for the driver submission server, as this value is sensitive information that would be passed to the driver pod in plaintext otherwise.
spark.kubernetes.authenticate.driver.oauthTokenFile | (none) | Path to the OAuth token file containing the token to use when authenticating against the Kubernetes API server from the driver pod when requesting executors. Note that unlike the other authentication options, this file must contain the exact string value of the token to use for the authentication. This token value is uploaded to the driver pod. If this is specified, it is highly recommended to set up TLS for the driver submission server, as this value is sensitive information that would be passed to the driver pod in plaintext otherwise.
spark.kubernetes.authenticate.driver.mounted.caCertFile | (none) | Path to the CA cert file for connecting to the Kubernetes API server over TLS from the driver pod when requesting executors. This path must be accessible from the driver pod. Specify this as a path as opposed to a URI (i.e. do not provide a scheme).
spark.kubernetes.authenticate.driver.mounted.clientKeyFile | (none) | Path to the client key file for authenticating against the Kubernetes API server from the driver pod when requesting executors. This path must be accessible from the driver pod. Specify this as a path as opposed to a URI (i.e. do not provide a scheme).
spark.kubernetes.authenticate.driver.mounted.clientCertFile | (none) | Path to the client cert file for authenticating against the Kubernetes API server from the driver pod when requesting executors. This path must be accessible from the driver pod. Specify this as a path as opposed to a URI (i.e. do not provide a scheme).
spark.kubernetes.authenticate.driver.mounted.oauthTokenFile | (none) | Path to the file containing the OAuth token to use when authenticating against the Kubernetes API server from the driver pod when requesting executors. This path must be accessible from the driver pod. Note that unlike the other authentication options, this file must contain the exact string value of the token to use for the authentication.
spark.kubernetes.authenticate.driver.serviceAccountName | default | Service account that is used when running the driver pod. The driver pod uses this service account when requesting executor pods from the API server. Note that this cannot be specified alongside a CA cert file, client key file, client cert file, and/or OAuth token.
spark.kubernetes.driver.label.[LabelName] | (none) | Add the label specified by LabelName to the driver pod. For example, spark.kubernetes.driver.label.something=true. Note that Spark also adds its own labels to the driver pod for bookkeeping purposes.
spark.kubernetes.driver.annotation.[AnnotationName] | (none) | Add the annotation specified by AnnotationName to the driver pod. For example, spark.kubernetes.driver.annotation.something=true.
spark.kubernetes.executor.label.[LabelName] | (none) | Add the label specified by LabelName to the executor pods. For example, spark.kubernetes.executor.label.something=true. Note that Spark also adds its own labels to the driver pod for bookkeeping purposes.
spark.kubernetes.executor.annotation.[AnnotationName] | (none) | Add the annotation specified by AnnotationName to the executor pods. For example, spark.kubernetes.executor.annotation.something=true.
spark.kubernetes.driver.pod.name | (none) | Name of the driver pod. If not set, the driver pod name is set to "spark.app.name" suffixed by the current timestamp to avoid name conflicts.
spark.kubernetes.executor.lostCheck.maxAttempts | 10 | Number of times that the driver will try to ascertain the loss reason for a specific executor. The loss reason is used to ascertain whether the executor failure is due to a framework or an application error, which in turn decides whether the executor is removed and replaced, or placed into a failed state for debugging.
spark.kubernetes.submission.waitAppCompletion