# Using Kubernetes to Build a Spark Cluster

I have recently been building a Spark cluster on Kubernetes, ran into a few pitfalls along the way, and am sharing them here.
For an introduction to Spark's components, refer to the official documentation and "A Brief Introduction to the Big Data Ecosystem". This article is based on the official Kubernetes Spark example; see the Kubernetes repository on GitHub for details.

## FAQ
### Image pull problem

This method requires access to gcr.io to pull images (in China a VPN is usually needed). Note that the gcr.io/google_containers/spark:1.5.2_v1 image cannot be replaced with index.tenxcloud.com/google_containers/spark; after that replacement, pulling the image fails with a "Docker: filesystem layer verification failed" error.
The image used by zeppelin-controller.yaml, however, can be changed to index.tenxcloud.com/google_containers/zeppelin:v0.5.6_v1.
### WebUI service usage issues

The `kubectl proxy --port=8001` command given in the documentation listens only on 127.0.0.1, so it cannot serve requests from a test environment or a virtual machine, since those requests do not originate from 127.0.0.1.
Instead, add the `--address` flag, e.g. `kubectl proxy --port=8001 --address=<host-ip>`, so that the proxy listens on an externally reachable address.
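The loopback-only behavior can be illustrated with a short Python sketch using plain sockets (independent of kubectl itself):

```python
import socket

# A server bound to 127.0.0.1 is reachable only via the loopback interface,
# so browsers on other machines (or a VM with its own IP) cannot connect.
# Binding to 0.0.0.0 listens on every interface instead, which is what the
# kubectl proxy --address flag controls.
loopback_only = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
loopback_only.bind(("127.0.0.1", 0))   # default kubectl proxy behavior
loopback_addr = loopback_only.getsockname()[0]

all_interfaces = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
all_interfaces.bind(("0.0.0.0", 0))    # like passing --address=0.0.0.0
wildcard_addr = all_interfaces.getsockname()[0]

print(loopback_addr, wildcard_addr)    # -> 127.0.0.1 0.0.0.0

loopback_only.close()
all_interfaces.close()
```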
### PySpark example run error

The data source in the example is problematic; you can run against local files instead, e.g. `sc.textFile("/opt/spark/licenses/*").map(lambda s: len(s.split())).sum()`.
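What that line computes can be sketched in plain Python, with no Spark required; the sample lines below are illustrative:

```python
# Plain-Python sketch of the word count: split each line on whitespace,
# count the words, and sum over all lines -- the same logic as
# sc.textFile(...).map(lambda s: len(s.split())).sum().
sample_lines = ["Apache License", "Version 2.0, January 2004", ""]
word_count = sum(len(line.split()) for line in sample_lines)
print(word_count)  # -> 6
```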
### Zeppelin WebUI usage issues

Likewise, the Zeppelin WebUI can only be accessed through localhost or 127.0.0.1 by default; fix this by configuring the Zeppelin service type as NodePort. Refer to the zeppelin-service.yaml in spark-20160427.zip.
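Once the service is exposed as a NodePort, the assigned port can be pulled out of the `kubectl describe svc zeppelin` output programmatically; a sketch, where the sample output is hypothetical and the real output depends on your cluster:

```python
import re

# Hypothetical output of `kubectl describe svc zeppelin`.
describe_output = """\
Name:            zeppelin
Type:            NodePort
Port:            <unset>  80/TCP
NodePort:        <unset>  31576/TCP
Endpoints:       10.244.1.12:8080
"""

# Grab the numeric port from the NodePort line, equivalent to
# `kubectl describe svc zeppelin | grep NodePort`.
match = re.search(r"NodePort:\s+\S+\s+(\d+)/TCP", describe_output)
print(match.group(1))  # -> 31576
```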
When creating the Zeppelin service with zeppelin-service.yaml, you can specify the port via spec.ports.nodePort; if it is not specified, a random port is assigned. Use the `kubectl describe svc zeppelin | grep NodePort` command to look up the port, then open `<node-ip>:<nodePort>` for any node in a browser to reach the Zeppelin WebUI. Click "Create new note" and enter a note name.
In the new page, do the following:
```
%pyspark
print sc.textFile("/opt/spark/licenses/*").map(lambda s: len(s.split())).sum()
```
This example counts the words in all files under the local /opt/spark/licenses/ directory; after a few seconds Zeppelin displays the result.

## Building from the TenxCloud image library
Build from the YAML files under examples/spark/ in the Kubernetes source tree, copying all the YAML files to a working directory.
Modify spark-master-controller.yaml and spark-worker-controller.yaml:

* change spec.template.spec.containers.command to "/start.sh"
* change spec.template.spec.containers.image to index.tenxcloud.com/google_containers/spark-master:1.5.2_v1 and index.tenxcloud.com/google_containers/spark-worker:1.5.2_v1, respectively

Also change the image used by zeppelin-controller.yaml to index.tenxcloud.com/google_containers/zeppelin:v0.5.6_v1.
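The edits above can also be scripted. A minimal sketch, assuming the example YAML files reference the corresponding gcr.io images (the exact image names in your copy of the examples may differ):

```python
# Rewrite gcr.io image references in the example YAML text to the TenxCloud
# mirror, instead of editing each file by hand. The mapping reflects the
# images mentioned in this article and is an assumption about the YAML.
REPLACEMENTS = {
    "gcr.io/google_containers/spark-master:1.5.2_v1":
        "index.tenxcloud.com/google_containers/spark-master:1.5.2_v1",
    "gcr.io/google_containers/spark-worker:1.5.2_v1":
        "index.tenxcloud.com/google_containers/spark-worker:1.5.2_v1",
    "gcr.io/google_containers/zeppelin:v0.5.6_v1":
        "index.tenxcloud.com/google_containers/zeppelin:v0.5.6_v1",
}

def rewrite_images(yaml_text):
    """Return yaml_text with known gcr.io images swapped for mirror images."""
    for old, new in REPLACEMENTS.items():
        yaml_text = yaml_text.replace(old, new)
    return yaml_text

print(rewrite_images("image: gcr.io/google_containers/zeppelin:v0.5.6_v1"))
# -> image: index.tenxcloud.com/google_containers/zeppelin:v0.5.6_v1
```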
Once the modifications are complete, start the cluster by following the steps in the official Kubernetes example.

## A simple spark-driver
Because the Zeppelin image is very large, pulling it takes a long time. You can instead use the following spark-driver.yaml to create a simple spark-driver:
```yaml
kind: ReplicationController
apiVersion: v1
metadata:
  name: spark-driver
spec:
  replicas: 1
  selector:
    component: spark-driver
  template:
    metadata:
      labels:
        component: spark-driver
    spec:
      containers:
        - name: spark-driver
          image: index.tenxcloud.com/google_containers/spark-driver:1.5.2_v1
          resources:
            requests:
              cpu: 100m
```
Once it is created, you can open a PySpark shell with `kubectl exec <spark-driver-pod-name> -it pyspark`.
The YAML configuration files can be referenced here.