Learning notes TF064: TensorFlow on Kubernetes
AlphaGo: each experiment uses 1,000 nodes, each node with 4 GPUs, 4,000 GPUs in total. Siri: each experiment uses 2 nodes with 8 GPUs. AI research relies on computation over massive data and cannot do without high-performance computing resources. Running a model on a larger cluster shortens training time from weeks to days or hours. Kubernetes, the most widely used container cluster management tool, can handle monitoring, scheduling, and lifecycle management for distributed TensorFlow. It is an open-source platform for automatic deployment, scaling, and operation of container clusters, providing task scheduling, monitoring, and failure restart. TensorFlow and Kubernetes are both open-source projects from Google: https://kubernetes.io/. There are also solutions based on Google Cloud Platform: https://cloud.google.com/.
Distributed TensorFlow can run on Kubernetes.
Deployment and running. Install Kubernetes. Minikube creates a local Kubernetes cluster. On a Mac, first install the VirtualBox virtual machine: https://www.virtualbox.org/. Minikube is written in Go and released as a standalone binary; download it and move it to the appropriate directory. Command:
curl -Lo minikube https://storage.googleapis.com/minikube/releases/v0.14.0/minikube-darwin-amd64 && chmod +x minikube && sudo mv minikube /usr/local/bin/
kubectl is the command-line client for interacting with the cluster. Installation:
curl -LO http://storage.googleapis.com/kubernetes-release/release/v1.5.1/bin/darwin/amd64/kubectl && chmod +x kubectl && sudo mv kubectl /usr/local/bin/
Minikube starts a Kubernetes cluster:
minikube start
Use the latest tensorflow/tensorflow image (version 1.0) from Docker Hub: https://hub.docker.com/r/tensorflow/tensorflow. Configure the parameter server Deployment file, named tf-ps-deployment.json:
{
  "apiVersion": "extensions/v1beta1",
  "kind": "Deployment",
  "metadata": {
    "name": "tensorflow-ps2"
  },
  "spec": {
    "replicas": 2,
    "template": {
      "metadata": {
        "labels": {
          "name": "tensorflow-ps2",
          "role": "ps"
        }
      },
      "spec": {
        "containers": [
          {
            "name": "ps",
            "image": "tensorflow/tensorflow",
            "ports": [
              {
                "containerPort": 2222
              }
            ]
          }
        ]
      }
    }
  }
}
Configure the parameter server Service file, named tf-ps-service.json:
{
  "apiVersion": "v1",
  "kind": "Service",
  "metadata": {
    "name": "tensorflow-ps2-service",
    "labels": {
      "name": "tensorflow",
      "role": "service"
    }
  },
  "spec": {
    "ports": [
      {
        "port": 2222,
        "targetPort": 2222
      }
    ],
    "selector": {
      "name": "tensorflow-ps2"
    }
  }
}
Configure the computing server (worker) Deployment file, named tf-worker-deployment.json:
{
  "apiVersion": "extensions/v1beta1",
  "kind": "Deployment",
  "metadata": {
    "name": "tensorflow-worker2"
  },
  "spec": {
    "replicas": 2,
    "template": {
      "metadata": {
        "labels": {
          "name": "tensorflow-worker2",
          "role": "worker"
        }
      },
      "spec": {
        "containers": [
          {
            "name": "worker",
            "image": "tensorflow/tensorflow",
            "ports": [
              {
                "containerPort": 2222
              }
            ]
          }
        ]
      }
    }
  }
}
Configure the computing server Service file, named tf-worker-service.json:
{
  "apiVersion": "v1",
  "kind": "Service",
  "metadata": {
    "name": "tensorflow-wk2-service",
    "labels": {
      "name": "tensorflow-worker2",
      "role": "service"
    }
  },
  "spec": {
    "ports": [
      {
        "port": 2222,
        "targetPort": 2222
      }
    ],
    "selector": {
      "name": "tensorflow-worker2"
    }
  }
}
Run the following commands:
kubectl create -f tf-ps-deployment.json
kubectl create -f tf-ps-service.json
kubectl create -f tf-worker-deployment.json
kubectl create -f tf-worker-service.json
Run kubectl get pod to check whether all the parameter servers and computing servers have been created.
Enter each server (Pod) and deploy the mnist_replica.py file. Run the following commands to view the IP addresses of the ps hosts and worker hosts:
kubectl describe service tensorflow-ps2-service
kubectl describe service tensorflow-wk2-service
Open four terminals and enter the four Pods respectively:
kubectl exec -ti tensorflow-ps2-3073558082-3b08h /bin/bash
kubectl exec -ti tensorflow-ps2-3073558082-4x3j2 /bin/bash
kubectl exec -ti tensorflow-worker2-3070479207-k6z8f /bin/bash
kubectl exec -ti tensorflow-worker2-3070479207-6hvsk /bin/bash
Download mnist_replica.py into each of the four Pods:
curl https://raw.githubusercontent.com/tensorflow/tensorflow/master/tensorflow/tools/dist_test/python/mnist_replica.py -o mnist_replica.py
Run the following commands in the parameter server containers:
python mnist_replica.py --ps_hosts=172.17.0.16:2222,172.17.0.17:2222 --worker_hosts=172.17.0.3:2222,172.17.0.8:2222 --job_name="ps" --task_index=0
python mnist_replica.py --ps_hosts=172.17.0.16:2222,172.17.0.17:2222 --worker_hosts=172.17.0.3:2222,172.17.0.8:2222 --job_name="ps" --task_index=1
Run the following commands in the computing server containers:
python mnist_replica.py --ps_hosts=172.17.0.16:2222,172.17.0.17:2222 --worker_hosts=172.17.0.3:2222,172.17.0.8:2222 --job_name="worker" --task_index=0
python mnist_replica.py --ps_hosts=172.17.0.16:2222,172.17.0.17:2222 --worker_hosts=172.17.0.3:2222,172.17.0.8:2222 --job_name="worker" --task_index=1
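For reference, these flags follow the standard TensorFlow 1.x distributed pattern: the script builds a cluster description from ps_hosts and worker_hosts, then starts a server for its own job_name and task_index. The sketch below illustrates that pattern only; it is not the actual mnist_replica.py, and the flag handling and device placement shown here are illustrative assumptions.

import tensorflow as tf

flags = tf.app.flags
flags.DEFINE_string("ps_hosts", "", "comma-separated list of ps host:port pairs")
flags.DEFINE_string("worker_hosts", "", "comma-separated list of worker host:port pairs")
flags.DEFINE_string("job_name", "", "'ps' or 'worker'")
flags.DEFINE_integer("task_index", 0, "index of this task within its job")
FLAGS = flags.FLAGS

def main(_):
    # Cluster description built from the Pod IPs shown by kubectl describe service
    cluster = tf.train.ClusterSpec({
        "ps": FLAGS.ps_hosts.split(","),
        "worker": FLAGS.worker_hosts.split(","),
    })
    server = tf.train.Server(cluster, job_name=FLAGS.job_name,
                             task_index=FLAGS.task_index)
    if FLAGS.job_name == "ps":
        server.join()  # parameter servers block here and serve variables
    else:
        # Pin variables to ps tasks and ops to this worker, then build the
        # model and training loop (as mnist_replica.py does).
        with tf.device(tf.train.replica_device_setter(
                worker_device="/job:worker/task:%d" % FLAGS.task_index,
                cluster=cluster)):
            pass  # model definition and training loop go here

if __name__ == "__main__":
    tf.app.run()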
In practice, put the source code to be executed, the training data, and the test data into a persistent volume, so that they are shared among multiple Pods and do not have to be deployed separately in each Pod.
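One way to do this is with a PersistentVolumeClaim that the Pods mount. The sketch below follows the same JSON style as the files above; the claim name tf-data-pvc, the volume name tf-data, the storage size, and the /data mount path are placeholder assumptions, not values from the original setup.

{
  "apiVersion": "v1",
  "kind": "PersistentVolumeClaim",
  "metadata": {
    "name": "tf-data-pvc"
  },
  "spec": {
    "accessModes": ["ReadWriteMany"],
    "resources": {
      "requests": {
        "storage": "10Gi"
      }
    }
  }
}

In the worker Deployment, the Pod template spec would then reference the claim and mount it:

"volumes": [
  {
    "name": "tf-data",
    "persistentVolumeClaim": { "claimName": "tf-data-pvc" }
  }
],
"containers": [
  {
    "name": "worker",
    "image": "tensorflow/tensorflow",
    "volumeMounts": [
      { "name": "tf-data", "mountPath": "/data" }
    ]
  }
]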
TensorFlow GPU Docker cluster deployment. NVIDIA provides nvidia-docker, which maps the host machine's GPU devices into containers: https://github.com/NVIDIA/nvidia-docker.
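For example, a GPU container can be started from the GPU-enabled TensorFlow image; the latest-gpu tag below is an assumption based on the Docker Hub naming convention, not a value from the original notes:

# nvidia-docker wraps docker run and exposes the host GPUs inside the container
nvidia-docker run -it tensorflow/tensorflow:latest-gpu /bin/bash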
After models are trained, package them into standalone images so that testers can deploy consistent environments, tag model images by version, and compare the accuracy of different model versions. This reduces the overall complexity of testing, deployment, and release.
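A minimal sketch of this workflow with the Docker CLI; the container ID and the image name my-registry/mnist-model are placeholders:

# snapshot the container that holds the trained model into a new image
docker commit <container-id> my-registry/mnist-model:v1.0
# tag and push it so testers can pull a consistent, versioned environment
docker tag my-registry/mnist-model:v1.0 my-registry/mnist-model:latest
docker push my-registry/mnist-model:v1.0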
References:
Analysis and Practice of TensorFlow Technology
Referrals for Shanghai machine learning job opportunities are welcome; my contact: qingxingfengzi.