Guide to Questions
1. What is Kubernetes?
2. How to try the new native Kubernetes features and run a Spark application on the cluster.
3. How to watch and operate on the Spark resources created on the cluster.
Before we start, here is what we need to know.
What is Kubernetes
Kubernetes (commonly abbreviated "k8s") is an open source container cluster management project, originally designed and developed at Google and later contributed to the Cloud Native Computing Foundation. It provides a platform that automates the deployment, scaling, and operation of application containers across clusters of hosts. Kubernetes typically works with the Docker container tool and coordinates multiple hosts running Docker containers.
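As a minimal illustration of that automation (using the Deployment-creating kubectl run of that era; the application name and nginx image are just examples), deploying and scaling a containerized application takes only a couple of commands:

$ kubectl run my-app --image=nginx --replicas=3    # create a Deployment running 3 pods
$ kubectl scale deployment my-app --replicas=10    # scale the application out
$ kubectl get pods -o wide                         # pods are scheduled across the cluster's hosts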
Introduction
Over the past year, the open source community has been working to support data processing, analytics, and machine learning workloads on Kubernetes. New extension points in Kubernetes, such as custom resources and custom controllers, can be used to build deep integrations with individual applications and frameworks.
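To give a concrete sense of this extension point, the sketch below registers a custom resource type with a Kubernetes 1.7+ cluster; the group and kind names are invented for illustration:

$ cat <<EOF | kubectl apply -f -
apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  name: sparkjobs.example.com      # must be <plural>.<group>
spec:
  group: example.com
  version: v1
  scope: Namespaced
  names:
    plural: sparkjobs
    singular: sparkjob
    kind: SparkJob
EOF
$ kubectl get sparkjobs            # the new type can now be queried like a built-in resource

A custom controller would then watch objects of this type and create pods or other resources in response, which is the pattern the native Spark integration follows.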
Traditionally, data processing workloads have run in dedicated setups such as the YARN/Hadoop stack. However, unifying the control plane for all workloads on Kubernetes simplifies cluster management and improves resource utilization.
Apache Spark 2.3, with its native Kubernetes support, combines two famous open source projects: Apache Spark, a framework for large-scale data processing, and Kubernetes.
Apache Spark is an essential tool for data scientists, offering a robust platform for applications ranging from large-scale data transformation to analytics to machine learning. Data scientists are increasingly adopting containers to improve their workflows, gaining benefits such as packaged dependencies and reproducible artifacts. Given that Kubernetes is the de facto standard for managing containerized environments, it is a natural fit for Spark to support the Kubernetes API.
Specifically, a native Spark application on Kubernetes acts as a custom controller that creates Kubernetes resources in response to requests made by the Spark scheduler. In contrast to deploying Apache Spark in standalone mode on Kubernetes, the native approach offers fine-grained management of Spark applications, improved elasticity, and seamless integration with logging and monitoring solutions. The community is also exploring advanced use cases, such as managing streaming workloads and leveraging service meshes like Istio.
To try it on your Kubernetes cluster, simply download the official Apache Spark 2.3 release binaries. For example, below we describe running a simple Spark application that computes the mathematical constant Pi across five Spark executors, each running in a separate pod. Note that this requires a cluster running Kubernetes 1.7 or above, a kubectl client configured to access it, and the appropriate RBAC rules for the default namespace and service account.
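If your cluster does not yet have suitable RBAC rules, one minimal setup, following the Spark documentation, is to create a dedicated service account bound to the edit role (and, for a non-default account, point the driver at it via spark.kubernetes.authenticate.driver.serviceAccountName):

$ kubectl create serviceaccount spark
$ kubectl create clusterrolebinding spark-role --clusterrole=edit \
    --serviceaccount=default:spark --namespace=default

With access configured, the submission looks like this: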
$ kubectl cluster-info
Kubernetes master is running at https://xx.yy.zz.ww

$ bin/spark-submit \
    --master k8s://https://xx.yy.zz.ww \
    --deploy-mode cluster \
    --name spark-pi \
    --class org.apache.spark.examples.SparkPi \
    --conf spark.executor.instances=5 \
    --conf spark.kubernetes.container.image=<spark-image> \
    --conf spark.kubernetes.driver.pod.name=spark-pi-driver \
    local:///opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar
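Here, <spark-image> is a placeholder for a Spark container image. The 2.3 release ships with a bin/docker-image-tool.sh script for building one from the bundled Dockerfile and pushing it to a registry of your choice (the repository name and tag below are up to you):

$ ./bin/docker-image-tool.sh -r <repo> -t v2.3.0 build
$ ./bin/docker-image-tool.sh -r <repo> -t v2.3.0 push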
To watch the Spark resources created on the cluster, you can use the following kubectl command in a separate terminal window.
$ kubectl get pods -l 'spark-role in (driver, executor)' -w
NAME                                               READY   STATUS    RESTARTS   AGE
spark-pi-driver                                    1/1     Running   0          14s
spark-pi-da1968a859653d6bab93f8e6503935f2-exec-1   0/1     Pending   0          0s
...
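While the job is running, the Spark driver UI can also be reached locally by port-forwarding to the driver pod named above:

$ kubectl port-forward spark-pi-driver 4040:4040   # UI becomes available at http://localhost:4040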
The results can be streamed during job execution by running:
$ kubectl logs -f spark-pi-driver
When the application completes, you should see the calculated value of pi in the driver log.
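The SparkPi example prints a line beginning with "Pi is roughly". One way to extract it, and then remove the completed driver pod (which persists after the job finishes so its logs stay available), is:

$ kubectl logs spark-pi-driver | grep "Pi is roughly"
$ kubectl delete pod spark-pi-driver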
In Spark 2.3, we begin with support for Spark applications written in Java and Scala, with resource localization from a variety of data sources including HTTP, GCS, and HDFS. We have also paid close attention to the failure and recovery semantics of Spark executors, laying a solid foundation for future development. Get started right away with the open source documentation (https://spark.apache.org/docs/latest/running-on-kubernetes.html).
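Resource localization means the application artifact does not have to be baked into the container image. As a sketch (the URL below is a placeholder), a jar served over HTTP can be passed to spark-submit directly and downloaded before the application starts:

$ bin/spark-submit \
    --master k8s://https://xx.yy.zz.ww \
    --deploy-mode cluster \
    --name spark-pi \
    --class org.apache.spark.examples.SparkPi \
    --conf spark.kubernetes.container.image=<spark-image> \
    https://example.com/jars/spark-examples_2.11-2.3.0.jar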
Participate
There is a lot of exciting work to do in the near future. We are actively working on features such as dynamic resource allocation, in-cluster staging of dependencies, support for PySpark and SparkR, support for Kerberized HDFS clusters, as well as client mode and interactive execution environments for popular notebooks. For those who love the declarative way Kubernetes manages applications, we are also working on a Kubernetes Operator for spark-submit, which allows users to declaratively specify and submit Spark applications.
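To give a flavor of that declarative style, a specification handled by such an operator might look roughly like the sketch below; the API group, version, and field names are hypothetical and will depend on the operator's actual schema:

$ cat <<EOF | kubectl apply -f -
apiVersion: sparkoperator.k8s.io/v1alpha1    # hypothetical group/version
kind: SparkApplication
metadata:
  name: spark-pi
spec:
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: local:///opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar
  image: <spark-image>
  executor:
    instances: 5
EOF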
We are just getting started. We hope you will get involved and help us develop the project further:
Join the spark-dev and spark-user mailing lists [https://spark.apache.org/community.html].
File issues under the Kubernetes component in the Apache Spark JIRA [https://issues.apache.org/jira/issues/?jql=project+%3d+spark+and+component+%3d+kubernetes].
Attend our SIG meetings at 10 AM on Wednesdays [https://github.com/kubernetes/community/tree/master/sig-big-data].
Many thanks to the Apache Spark and Kubernetes contributors, spread across multiple organizations (Google, Databricks, Red Hat, Palantir, Bloomberg, Cloudera, Pepperdata, Datalayer, HyperPilot, and others), who spent many hundreds of hours on this work. We look forward to seeing more people contribute to the project and help it develop further.
Document download: PDF, 57 pages.
Link: https://pan.baidu.com/s/1y4P2jYZ3aFHxSk3MwWa1MA Password: q1y7