AlphaGo used 1,000 nodes per experiment, with 4 GPUs per node, 4,000 GPUs in total. Siri used 2 nodes per experiment, with 8 GPUs. AI research relies on massive data computation and cannot do without high-performance computing resources. Running a model on a larger cluster shortens training time from weeks to days or even hours. Kubernetes, the most widely used container cluster management tool, provides monitoring, scheduling, and lifecycle management for distributed TensorFlow. It is an open-source platform for automated deployment, scaling, and operation of container clusters, offering task scheduling, monitoring, and failure restart. Both TensorFlow and Kubernetes are open-source projects from Google. https://kubernetes.io/. Google Cloud Platform offers a cloud-based solution. https://cloud.google.com/.
Running distributed TensorFlow on Kubernetes.
Deploy and run. Install Kubernetes. Minikube creates a local Kubernetes cluster. On macOS, install the VirtualBox virtual machine first. https://www.virtualbox.org/. Minikube is written in Go and is released as a standalone binary; download it and move it into the corresponding directory. Command:
curl -Lo minikube https://storage.googleapis.com/minikube/releases/v0.14.0/minikube-darwin-amd64 && chmod +x minikube && sudo mv minikube /usr/local/bin/
The client is kubectl, the command-line tool for interacting with the cluster. Installation:
curl -Lo kubectl http://storage.googleapis.com/kubernetes-release/release/v1.5.1/bin/darwin/amd64/kubectl && chmod +x kubectl && sudo mv kubectl /usr/local/bin/
Start the Kubernetes cluster with Minikube:
minikube start
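As a quick check (not part of the original notes), you can confirm the local cluster is up before continuing; these are standard minikube and kubectl commands:
minikube status
kubectl cluster-info
kubectl get nodes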
Use the latest Docker Hub image tensorflow/tensorflow (version 1.0), https://hub.docker.com/r/tensorflow/tensorflow/. Configure the parameter server Deployment file, named tf-ps-deployment.json:
{
  "apiVersion": "extensions/v1beta1",
  "kind": "Deployment",
  "metadata": {
    "name": "tensorflow-ps2"
  },
  "spec": {
    "replicas": 2,
    "template": {
      "metadata": {
        "labels": {
          "name": "tensorflow-ps2",
          "role": "ps"
        }
      },
      "spec": {
        "containers": [
          {
            "name": "ps",
            "image": "tensorflow/tensorflow",
            "ports": [
              {
                "containerPort": 2222
              }
            ]
          }
        ]
      }
    }
  }
}
Configure the parameter server Service file, named tf-ps-service.json:
{
  "apiVersion": "v1",
  "kind": "Service",
  "metadata": {
    "name": "tensorflow-ps2-service",
    "labels": {
      "name": "tensorflow",
      "role": "service"
    }
  },
  "spec": {
    "ports": [
      {
        "port": 2222,
        "targetPort": 2222
      }
    ],
    "selector": {
      "name": "tensorflow-ps2"
    }
  }
}
Configure the compute server (worker) Deployment file, named tf-worker-deployment.json:
{
  "apiVersion": "extensions/v1beta1",
  "kind": "Deployment",
  "metadata": {
    "name": "tensorflow-worker2"
  },
  "spec": {
    "replicas": 2,
    "template": {
      "metadata": {
        "labels": {
          "name": "tensorflow-worker2",
          "role": "worker"
        }
      },
      "spec": {
        "containers": [
          {
            "name": "worker",
            "image": "tensorflow/tensorflow",
            "ports": [
              {
                "containerPort": 2222
              }
            ]
          }
        ]
      }
    }
  }
}
Configure the compute server Service file, named tf-worker-service.json:
{
  "apiVersion": "v1",
  "kind": "Service",
  "metadata": {
    "name": "tensorflow-wk2-service",
    "labels": {
      "name": "tensorflow-worker2",
      "role": "service"
    }
  },
  "spec": {
    "ports": [
      {
        "port": 2222,
        "targetPort": 2222
      }
    ],
    "selector": {
      "name": "tensorflow-worker2"
    }
  }
}
Execute the commands:
kubectl create -f tf-ps-deployment.json
kubectl create -f tf-ps-service.json
kubectl create -f tf-worker-deployment.json
kubectl create -f tf-worker-service.json
Run kubectl get pod to confirm that the parameter server and compute server pods have all been created.
Go into each server (pod) and deploy the mnist_replica.py file. Run the following commands to look up the ps_hosts and worker_hosts IP addresses.
kubectl describe service tensorflow-ps2-service
kubectl describe service tensorflow-wk2-service
Open four terminals and enter the four pods respectively.
kubectl exec -ti tensorflow-ps2-3073558082-3b08h /bin/bash
kubectl exec -ti tensorflow-ps2-3073558082-4x3j2 /bin/bash
kubectl exec -ti tensorflow-worker2-3070479207-k6z8f /bin/bash
kubectl exec -ti tensorflow-worker2-3070479207-6hvsk /bin/bash
Deploy mnist_replica.py to all four pods:
curl https://raw.githubusercontent.com/tensorflow/tensorflow/master/tensorflow/tools/dist_test/python/mnist_replica.py -o mnist_replica.py
In the two parameter server containers, execute (one command in each):
python mnist_replica.py --ps_hosts=172.17.0.16:2222,172.17.0.17:2222 --worker_hosts=172.17.0.3:2222,172.17.0.8:2222 --job_name="ps" --task_index=0
python mnist_replica.py --ps_hosts=172.17.0.16:2222,172.17.0.17:2222 --worker_hosts=172.17.0.3:2222,172.17.0.8:2222 --job_name="ps" --task_index=1
In the two compute server (worker) containers, execute (one command in each):
python mnist_replica.py --ps_hosts=172.17.0.16:2222,172.17.0.17:2222 --worker_hosts=172.17.0.3:2222,172.17.0.8:2222 --job_name="worker" --task_index=0
python mnist_replica.py --ps_hosts=172.17.0.16:2222,172.17.0.17:2222 --worker_hosts=172.17.0.3:2222,172.17.0.8:2222 --job_name="worker" --task_index=1
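For reference, this is roughly how such flags are consumed inside a dist_test-style script (a simplified sketch using the TensorFlow 1.x API, not the exact contents of mnist_replica.py; the host lists below are the example addresses above):
import tensorflow as tf

# Illustrative values; in practice they come from --ps_hosts, --worker_hosts,
# --job_name and --task_index on the command line.
ps_hosts = "172.17.0.16:2222,172.17.0.17:2222".split(",")
worker_hosts = "172.17.0.3:2222,172.17.0.8:2222".split(",")
job_name = "worker"
task_index = 0

# Describe the whole cluster, then start the gRPC server for this task.
cluster = tf.train.ClusterSpec({"ps": ps_hosts, "worker": worker_hosts})
server = tf.train.Server(cluster, job_name=job_name, task_index=task_index)

if job_name == "ps":
    # Parameter servers host the variables and simply wait for workers.
    server.join()
else:
    # Workers place variables on the ps tasks and run computation locally.
    with tf.device(tf.train.replica_device_setter(
            worker_device="/job:worker/task:%d" % task_index,
            cluster=cluster)):
        pass  # build the model graph here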
Put the required code, training data, and test data into a persistent volume and share it among the pods, to avoid deploying them to each pod separately.
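A minimal sketch of such a shared volume, assuming a hostPath-backed PersistentVolume and a matching claim (the names, path, and size here are illustrative, not from the original notes); the claim would then be mounted into the pods through spec.volumes and volumeMounts in the Deployment templates:
{
  "apiVersion": "v1",
  "kind": "PersistentVolume",
  "metadata": { "name": "tf-data-pv" },
  "spec": {
    "capacity": { "storage": "10Gi" },
    "accessModes": ["ReadWriteMany"],
    "hostPath": { "path": "/data/tf" }
  }
}
{
  "apiVersion": "v1",
  "kind": "PersistentVolumeClaim",
  "metadata": { "name": "tf-data-pvc" },
  "spec": {
    "accessModes": ["ReadWriteMany"],
    "resources": { "requests": { "storage": "10Gi" } }
  }
}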
For TensorFlow GPU Docker cluster deployment, NVIDIA provides nvidia-docker, which maps the host GPU devices into containers. https://github.com/NVIDIA/nvidia-docker.
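As a rough illustration (assuming nvidia-docker 1.x and the GPU image tag published on Docker Hub), running the GPU image interactively might look like:
nvidia-docker run -it tensorflow/tensorflow:latest-gpu /bin/bash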
After training a good model, package it into a standalone image for the production environment. This makes it convenient for testers to deploy a consistent environment, tag different versions of the model, and compare the accuracy of different models, reducing the overall complexity of testing and production deployment.
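For example (illustrative only; the container ID, registry, and image names below are hypothetical), a container holding the trained model can be committed, tagged by model version, and pushed:
docker commit <container-id> my-registry/mnist-serving:v1    # snapshot the container with the model inside
docker tag my-registry/mnist-serving:v1 my-registry/mnist-serving:latest
docker push my-registry/mnist-serving:v1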
Resources:
"TensorFlow Technical Analysis and actual combat"
Recommendations for Shanghai machine learning jobs are welcome; contact me: qingxingfengzi
Study Notes TF064: TensorFlow Kubernetes