How to run a Spark cluster in a Kubernetes environment


When data volumes are this large there may be thousands of machines behind a single job, and it is impossible to monitor the state of every machine by hand. This article therefore introduces the Kubernetes container management tool and, through a simple example, shows how to build a Spark cluster on top of it.

Preparation Phase

1. You need a running Kubernetes cluster and access to it configured via kubectl. If you do not have a Kubernetes cluster available, you can use Minikube to set up a test cluster on your local computer.

We recommend updating Minikube to the latest version (0.19.0 at the time of writing), because some earlier versions may not be able to start a Kubernetes cluster with all of the necessary components.

2. You must have the appropriate permissions to create and list pods, ConfigMaps and Secrets in the cluster. To confirm this, you can list these resources by running kubectl get pods, kubectl get configmaps and kubectl get secrets.

The service account or credentials that you use must also have the appropriate permissions to edit the Pod template.

3. A Spark distribution with Kubernetes support is required. This can be obtained from a stable release tarball or by building Spark with Kubernetes support yourself.

4. You need to configure Kubernetes DNS in the cluster.

Preparing the images and the driver

Kubernetes requires user-supplied images that can be deployed into containers within pods. An image is built for a runtime environment that Kubernetes containers support. Docker is a container runtime that is frequently used with Kubernetes, so Spark provides some support for getting started with Docker.

If you want to use Docker images that have already been built, you can use the images published under the kubespark organization.
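These follow roughly the naming scheme below; the exact tags depend on the release you use, so treat the entries as placeholders rather than a definitive list:

    kubespark/spark-driver:<tag>      (driver image)
    kubespark/spark-executor:<tag>    (executor image)
    kubespark/spark-init:<tag>        (init-container image)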

You can also build these Docker images from source, or customize them as needed. The Spark distribution includes Dockerfiles for the base image, the driver, the executor, and the init-container: dockerfiles/spark-base/Dockerfile, dockerfiles/driver/Dockerfile, dockerfiles/executor/Dockerfile, and dockerfiles/init-container/Dockerfile.

Use these Dockerfiles to build the images, tag them for your registry, and finally push them to the registry.

For example, suppose the registry host is registry-host and the registry listens on port 5000.
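The build-and-push sequence, run from the root of the unpacked Spark distribution, might then look roughly like this (the image names and tags are only illustrative):

    docker build -t registry-host:5000/spark-base:latest -f dockerfiles/spark-base/Dockerfile .
    docker build -t registry-host:5000/spark-driver:latest -f dockerfiles/driver/Dockerfile .
    docker build -t registry-host:5000/spark-executor:latest -f dockerfiles/executor/Dockerfile .
    docker build -t registry-host:5000/spark-init:latest -f dockerfiles/init-container/Dockerfile .
    docker push registry-host:5000/spark-base:latest
    docker push registry-host:5000/spark-driver:latest
    docker push registry-host:5000/spark-executor:latest
    docker push registry-host:5000/spark-init:latest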

Note that spark-base is the base image for the other images. It must be built first; the remaining images can then be built in any order.

Submitting applications to Kubernetes

Applications can be submitted to Kubernetes through spark-submit.

For example, to calculate the value of Pi, assuming the images are set up as described above, the submission looks roughly like the sketch below.
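In this sketch the API server address, the image tags and the examples-jar path are placeholders to adjust for your setup:

    bin/spark-submit \
      --deploy-mode cluster \
      --class org.apache.spark.examples.SparkPi \
      --master k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port> \
      --conf spark.executor.instances=5 \
      --conf spark.app.name=spark-pi \
      --conf spark.kubernetes.driver.docker.image=kubespark/spark-driver:<tag> \
      --conf spark.kubernetes.executor.docker.image=kubespark/spark-executor:<tag> \
      local:///opt/spark/examples/jars/spark-examples_2.11-2.2.0.jar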

The Spark master, passed to spark-submit through the --master command-line argument or set via spark.master in the application's configuration, must be a URL of the form k8s://<api_server_url>.

Prefixing the master string with k8s:// causes the Spark application to launch on the Kubernetes cluster, communicating with the API server at api_server_url.

If no protocol is specified in the URL, HTTPS is the default.

For example, setting the master to k8s://example.com:443 is equivalent to setting it to k8s://https://example.com:443. To connect without TLS on a different port, the master would be set to k8s://http://example.com:8443.

If you already have a Kubernetes cluster set up, you can discover the API server URL by running kubectl cluster-info.
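The output referenced in the next sentence looks something like this (the address is whatever your cluster reports):

    $ kubectl cluster-info
    Kubernetes master is running at http://127.0.0.1:8080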

In the above example, that specific Kubernetes cluster can be used with spark-submit by specifying --master k8s://http://127.0.0.1:8080.

Note that applications can currently only be executed in cluster mode, where the driver and its executors run inside the cluster.

Finally, note that in the above example we specified a jar with a local:// URI scheme. This URI is the location of the example jar that is already inside the Docker image. Using dependencies that reside on your local disk is discussed below.

When Kubernetes RBAC is enabled, the service account used by the driver by default may not have the pod-editing permissions needed to launch executor pods. We recommend creating a dedicated service account, for example one named spark, with the necessary permissions.
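For example, such an account could be created with kubectl as in the following sketch, assuming the default namespace; the binding name and the edit cluster role are illustrative choices:

    kubectl create serviceaccount spark
    kubectl create clusterrolebinding spark-role --clusterrole=edit \
      --serviceaccount=default:spark --namespace=default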

You can then add --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark to the spark-submit command line above to specify the service account to use.

Dependency Management

Application dependencies submitted from your local machine need to be uploaded to a resource staging server, from which the driver and executors can then retrieve them.

A YAML file describing the minimal Kubernetes resources needed to run this service is provided at conf/kubernetes-resource-staging-server.yaml. It configures a ConfigMap and a pod for the resource staging server, and exposes the server through a Service with a fixed NodePort.

Deploying the resource staging server using the included YAML file requires permission to create Deployments, Services, and ConfigMaps.

To run the resource staging server with its default configuration, you only need to create the provided Kubernetes resources.
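Assuming you run this from the root of the Spark distribution:

    kubectl create -f conf/kubernetes-resource-staging-server.yaml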

You can then calculate the Pi value as shown in the following example:
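The sketch below points the submission at the staging server's NodePort and passes a jar from the local disk rather than a local:// URI; addresses, image tags and paths are placeholders, and the property names follow the apache-spark-on-k8s integration this guide is based on:

    bin/spark-submit \
      --deploy-mode cluster \
      --class org.apache.spark.examples.SparkPi \
      --master k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port> \
      --conf spark.executor.instances=5 \
      --conf spark.app.name=spark-pi \
      --conf spark.kubernetes.driver.docker.image=kubespark/spark-driver:<tag> \
      --conf spark.kubernetes.executor.docker.image=kubespark/spark-executor:<tag> \
      --conf spark.kubernetes.initcontainer.docker.image=kubespark/spark-init:<tag> \
      --conf spark.kubernetes.resourceStagingServer.uri=http://<address-of-any-cluster-node>:31000 \
      examples/jars/spark-examples_2.11-2.2.0.jar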

The Docker image for the resource staging server can also be built from source, similar to the driver and executor images.

Its Dockerfile is provided at dockerfiles/resource-staging-server/Dockerfile.

The provided YAML file explicitly sets the nodePort in the Service specification to 31000. If port 31000 is not available on every node of the cluster, remove the nodePort field from the Service specification and let the Kubernetes cluster choose a NodePort itself.

In that case, make sure that the NodePort chosen by the Kubernetes cluster is the port you use in the resource staging server URI when you submit your application.

Dependency management without the resource staging server

Note that the resource staging server is only required for submitting local dependencies. If all of your application's dependencies are hosted in remote locations (such as HDFS or an HTTP server), they can be referenced by their remote URIs. Application dependencies can also be pre-installed in custom Docker images; such dependencies can be added to the classpath by referencing them with local:// URIs and/or by setting the SPARK_EXTRA_CLASSPATH environment variable in the Dockerfiles.
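For example, a minimal sketch of a customized driver image that bakes in a dependency, where the base image tag and the jar name are hypothetical:

    # Hypothetical custom driver image with a pre-installed dependency
    FROM kubespark/spark-driver:<tag>
    # Copy a locally built dependency into the image
    COPY my-app-deps.jar /opt/spark/extra/my-app-deps.jar
    # Expose it on the driver classpath
    ENV SPARK_EXTRA_CLASSPATH /opt/spark/extra/my-app-deps.jar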

Accessing the Kubernetes cluster

spark-submit also supports submitting through a local kubectl proxy.

You can use the authenticating proxy to communicate with the API server directly, without having to pass credentials to spark-submit. The local proxy can be started by running the following command.
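A minimal invocation, which by default listens on localhost port 8001 (add flags such as --port if you need a different port):

    kubectl proxy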

If the local proxy is listening on port 8001, the submission would look like the following.
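Only the master URL changes relative to the earlier SparkPi sketch; the remaining options are placeholders as before:

    bin/spark-submit \
      --deploy-mode cluster \
      --class org.apache.spark.examples.SparkPi \
      --master k8s://http://127.0.0.1:8001 \
      --conf spark.executor.instances=5 \
      --conf spark.app.name=spark-pi \
      --conf spark.kubernetes.driver.docker.image=kubespark/spark-driver:<tag> \
      --conf spark.kubernetes.executor.docker.image=kubespark/spark-executor:<tag> \
      local:///opt/spark/examples/jars/spark-examples_2.11-2.2.0.jar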

Communication between Spark and the Kubernetes cluster is performed using the fabric8 kubernetes-client library. The proxy mechanism above can be used when your authentication provider is not supported by the fabric8 kubernetes-client library; authentication with X509 client certificates and OAuth tokens is currently supported directly.

Running PySpark

Running PySpark on Kubernetes uses the same spark-submit logic as launching on YARN or Mesos. Additional Python files can be passed with --py-files.

The following is an example submission:
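The sketch below assumes Python-enabled driver and executor images (the image names are placeholders) and uses the example scripts that ship with the distribution:

    bin/spark-submit \
      --deploy-mode cluster \
      --master k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port> \
      --conf spark.executor.instances=5 \
      --conf spark.app.name=spark-pi \
      --conf spark.kubernetes.driver.docker.image=<python-driver-image> \
      --conf spark.kubernetes.executor.docker.image=<python-executor-image> \
      --py-files local:///opt/spark/examples/src/main/python/sort.py \
      local:///opt/spark/examples/src/main/python/pi.py 10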

Dynamic allocation on Kubernetes

Spark on Kubernetes supports dynamic allocation in cluster mode. This mode requires an external shuffle service to be running, typically a DaemonSet with a provisioned hostPath volume. The shuffle service can be shared by executors belonging to different Spark jobs.

A sample configuration file, conf/kubernetes-shuffle-service.yaml, is provided and can be tailored to your cluster as needed. It is important to set spec.template.metadata.labels appropriately for the shuffle service, because several shuffle service instances may run in the same cluster; these labels give a Spark application a way to locate a particular shuffle service.
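Once the file is adjusted, the DaemonSet can be created like any other resource (a sketch, assuming you run it from the Spark distribution directory):

    kubectl create -f conf/kubernetes-shuffle-service.yaml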

For example, if we want to use a shuffle service in the default namespace whose pods carry the labels app=spark-shuffle-service and spark-version=2.2.0, we can use those labels to locate that shuffle service at job startup. A command that runs a job with dynamic allocation enabled might look like the following.
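In this sketch the shuffle-locating properties follow the apache-spark-on-k8s integration; the API server address and image tags are placeholders, while the namespace and label values match the example above:

    bin/spark-submit \
      --deploy-mode cluster \
      --class org.apache.spark.examples.GroupByTest \
      --master k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port> \
      --conf spark.app.name=group-by-test \
      --conf spark.local.dir=/tmp/spark-local \
      --conf spark.kubernetes.driver.docker.image=kubespark/spark-driver:<tag> \
      --conf spark.kubernetes.executor.docker.image=kubespark/spark-executor:<tag> \
      --conf spark.dynamicAllocation.enabled=true \
      --conf spark.shuffle.service.enabled=true \
      --conf spark.kubernetes.shuffle.namespace=default \
      --conf spark.kubernetes.shuffle.labels="app=spark-shuffle-service,spark-version=2.2.0" \
      local:///opt/spark/examples/jars/spark-examples_2.11-2.2.0.jar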

The external shuffle service must have a directory that can be shared with the executor pods. The provided sample YAML specification mounts a hostPath volume into the external shuffle service pods, but that hostPath volume must also be mounted in the executors.

When the external shuffle service is used, the directories specified in the spark.local.dir configuration are mounted as hostPath volumes into all executor containers.

To ensure that the wrong hostPath volumes are not mounted by accident, spark.local.dir must be given an explicit value in the application configuration when using Kubernetes, even though it defaults to the JVM's temporary directory under other cluster managers.

Advanced

Securing the resource staging server with TLS

The default configuration of the resource staging server is not secured with TLS. It is strongly recommended to configure TLS in order to protect the secrets and jar files submitted through the staging server.

The YAML file conf/kubernetes-resource-staging-server.yaml contains a ConfigMap resource that holds the resource staging server's configuration.

You can adjust the properties there so that the resource staging server listens over TLS.

See the Security page for available settings related to TLS.

The SSL namespace of the resource staging server is kubernetes.resourceStagingServer; for example, the path to the server's keystore is set with spark.ssl.kubernetes.resourceStagingServer.keyStore.
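As an illustrative sketch, the TLS-related entries in the ConfigMap might look like this; the paths and passwords are placeholders, the keyStore property is the one named above, and the password properties follow Spark's generic spark.ssl options documented on the Security page:

    spark.ssl.kubernetes.resourceStagingServer.enabled=true
    spark.ssl.kubernetes.resourceStagingServer.keyStore=/mnt/secrets/spark-staging/keyStore.jks
    spark.ssl.kubernetes.resourceStagingServer.keyStorePassword=<keystore-password>
    spark.ssl.kubernetes.resourceStagingServer.keyPassword=<key-password>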

In addition to the settings described on the Security page linked above, the resource staging server supports a number of additional configuration options.

Note that although the properties can be set in the ConfigMap, you still need to arrange for the appropriate key files to be mounted into the resource staging server's container.

A common mechanism is to mount Kubernetes Secrets as secret volumes. Refer to the relevant Kubernetes documentation for guidance and adjust the resource staging server specification in the provided YAML file accordingly.

Finally, when submitting an application you must provide either a truststore or a PEM-encoded certificate file so that it can communicate with the resource staging server over TLS.

The truststore can be set with spark.ssl.kubernetes.resourceStagingServer.trustStore, or the certificate file with spark.ssl.kubernetes.resourceStagingServer.clientCertPem.

For example, our SparkPi submission then looks roughly like the sketch below.
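Addresses, image tags and the certificate path are placeholders; the TLS-specific settings are the clientCertPem option named above, an enabled flag following Spark's generic spark.ssl naming, and a staging server URI that now uses https:

    bin/spark-submit \
      --deploy-mode cluster \
      --class org.apache.spark.examples.SparkPi \
      --master k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port> \
      --conf spark.executor.instances=5 \
      --conf spark.app.name=spark-pi \
      --conf spark.kubernetes.driver.docker.image=kubespark/spark-driver:<tag> \
      --conf spark.kubernetes.executor.docker.image=kubespark/spark-executor:<tag> \
      --conf spark.kubernetes.initcontainer.docker.image=kubespark/spark-init:<tag> \
      --conf spark.kubernetes.resourceStagingServer.uri=https://<address-of-any-cluster-node>:31000 \
      --conf spark.ssl.kubernetes.resourceStagingServer.enabled=true \
      --conf spark.ssl.kubernetes.resourceStagingServer.clientCertPem=/path/to/cert.pem \
      examples/jars/spark-examples_2.11-2.2.0.jar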

Video link: https://www.youtube.com/watch?v=UywgL70FQ3s

Author: Ziang Yang, research and development engineer at JFrog

He has many years of software development experience, with broad experience in mainstream Java technologies and cutting-edge frameworks; he has worked in depth on Linux server optimization and deployment, and is familiar with Jenkins, continuous integration and delivery, DevOps, and related practices.

Reprinting is welcome, but please credit the author and the source. Thank you.
