Learning notes TF064: TensorFlow on Kubernetes
AlphaGo: each experiment uses 1,000 nodes, each node with 4 GPUs, 4,000 GPUs in total. Siri: each experiment uses 2 nodes with 8 GPUs. AI research relies on computation over massive data and cannot do without high-performance computing resources. Running a model on a larger cluster shortens training time from weeks to days or hours. Kubernetes, the most widely used container cluster management tool, can handle monitoring, scheduling, and lifecycle management for distributed TensorFlow. It is an open-source platform for automatic deployment, scaling, and operation of container clusters, providing task scheduling, monitoring, and failure restart. TensorFlow and Kubernetes are both open-source projects from Google: https://kubernetes.io/. There are also solutions based on Google Cloud Platform: https://cloud.google.com/.
Distributed TensorFlow can run on Kubernetes.
Deployment and running. Install Kubernetes. Minikube creates a local Kubernetes cluster. On a Mac, first install the VirtualBox virtual machine: https://www.virtualbox.org/. Minikube is written in Go and released as a standalone binary; download it and move it to the appropriate directory. Command:
curl -Lo minikube https://storage.googleapis.com/minikube/releases/v0.14.0/minikube-darwin-amd64 && chmod +x minikube && sudo mv minikube /usr/local/bin/
kubectl is the command-line client for interacting with the cluster. Installation:
curl -LO http://storage.googleapis.com/kubernetes-release/release/v1.5.1/bin/darwin/amd64/kubectl && chmod +x kubectl && sudo mv kubectl /usr/local/bin/
Minikube starts a Kubernetes cluster:
minikube start
Use the latest tensorflow/tensorflow image (version 1.0) from Docker Hub: https://hub.docker.com/r/tensorflow/tensorflow. Configure the parameter server Deployment file, named tf-ps-deployment.json:
{
  "apiVersion": "extensions/v1beta1",
  "kind": "Deployment",
  "metadata": {
    "name": "tensorflow-ps2"
  },
  "spec": {
    "replicas": 2,
    "template": {
      "metadata": {
        "labels": {
          "name": "tensorflow-ps2",
          "role": "ps"
        }
      },
      "spec": {
        "containers": [
          {
            "name": "ps",
            "image": "tensorflow/tensorflow",
            "ports": [
              {
                "containerPort": 2222
              }
            ]
          }
        ]
      }
    }
  }
}
Configure the parameter server Service file, named tf-ps-service.json:
{
  "apiVersion": "v1",
  "kind": "Service",
  "metadata": {
    "name": "tensorflow-ps2-service",
    "labels": {
      "name": "tensorflow",
      "role": "service"
    }
  },
  "spec": {
    "ports": [
      {
        "port": 2222,
        "targetPort": 2222
      }
    ],
    "selector": {
      "name": "tensorflow-ps2"
    }
  }
}
Configure the computing server (worker) Deployment file, named tf-worker-deployment.json:
{
  "apiVersion": "extensions/v1beta1",
  "kind": "Deployment",
  "metadata": {
    "name": "tensorflow-worker2"
  },
  "spec": {
    "replicas": 2,
    "template": {
      "metadata": {
        "labels": {
          "name": "tensorflow-worker2",
          "role": "worker"
        }
      },
      "spec": {
        "containers": [
          {
            "name": "worker",
            "image": "tensorflow/tensorflow",
            "ports": [
              {
                "containerPort": 2222
              }
            ]
          }
        ]
      }
    }
  }
}
Configure the computing server Service file, named tf-worker-service.json:
{
  "apiVersion": "v1",
  "kind": "Service",
  "metadata": {
    "name": "tensorflow-wk2-service",
    "labels": {
      "name": "tensorflow-worker2",
      "role": "service"
    }
  },
  "spec": {
    "ports": [
      {
        "port": 2222,
        "targetPort": 2222
      }
    ],
    "selector": {
      "name": "tensorflow-worker2"
    }
  }
}
Run the following commands:
kubectl create -f tf-ps-deployment.json
kubectl create -f tf-ps-service.json
kubectl create -f tf-worker-deployment.json
kubectl create -f tf-worker-service.json
Run kubectl get pod to check whether all the parameter servers and computing servers have been created.
Enter each server (Pod) and deploy the mnist_replica.py file. Run the following commands to view the IP addresses of the ps hosts and worker hosts:
kubectl describe service tensorflow-ps2-service
kubectl describe service tensorflow-wk2-service
Open four terminals and enter the four Pods respectively:
kubectl exec -ti tensorflow-ps2-3073558082-3b08h /bin/bash
kubectl exec -ti tensorflow-ps2-3073558082-4x3j2 /bin/bash
kubectl exec -ti tensorflow-worker2-3070479207-k6z8f /bin/bash
kubectl exec -ti tensorflow-worker2-3070479207-6hvsk /bin/bash
Download mnist_replica.py into each of the four Pods:
curl https://raw.githubusercontent.com/tensorflow/tensorflow/master/tensorflow/tools/dist_test/python/mnist_replica.py -o mnist_replica.py
Run the following commands in the parameter server containers:
python mnist_replica.py --ps_hosts=172.17.0.16:2222,172.17.0.17:2222 --worker_hosts=172.17.0.3:2222,172.17.0.8:2222 --job_name="ps" --task_index=0
python mnist_replica.py --ps_hosts=172.17.0.16:2222,172.17.0.17:2222 --worker_hosts=172.17.0.3:2222,172.17.0.8:2222 --job_name="ps" --task_index=1
Run the following commands in the computing server containers:
python mnist_replica.py --ps_hosts=172.17.0.16:2222,172.17.0.17:2222 --worker_hosts=172.17.0.3:2222,172.17.0.8:2222 --job_name="worker" --task_index=0
python mnist_replica.py --ps_hosts=172.17.0.16:2222,172.17.0.17:2222 --worker_hosts=172.17.0.3:2222,172.17.0.8:2222 --job_name="worker" --task_index=1
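For reference, these flags follow the standard TensorFlow 1.x distributed pattern: the script builds a cluster description from ps_hosts and worker_hosts, then starts a server for its own job_name and task_index. The sketch below illustrates that pattern only; it is not the actual mnist_replica.py, and the flag handling and device placement shown here are illustrative assumptions.

import tensorflow as tf

flags = tf.app.flags
flags.DEFINE_string("ps_hosts", "", "comma-separated list of ps host:port pairs")
flags.DEFINE_string("worker_hosts", "", "comma-separated list of worker host:port pairs")
flags.DEFINE_string("job_name", "", "'ps' or 'worker'")
flags.DEFINE_integer("task_index", 0, "index of this task within its job")
FLAGS = flags.FLAGS

def main(_):
    # Cluster description built from the Pod IPs shown by kubectl describe service
    cluster = tf.train.ClusterSpec({
        "ps": FLAGS.ps_hosts.split(","),
        "worker": FLAGS.worker_hosts.split(","),
    })
    server = tf.train.Server(cluster, job_name=FLAGS.job_name,
                             task_index=FLAGS.task_index)
    if FLAGS.job_name == "ps":
        server.join()  # parameter servers block here and serve variables
    else:
        # Pin variables to ps tasks and ops to this worker, then build the
        # model and training loop (as mnist_replica.py does).
        with tf.device(tf.train.replica_device_setter(
                worker_device="/job:worker/task:%d" % FLAGS.task_index,
                cluster=cluster)):
            pass  # model definition and training loop go here

if __name__ == "__main__":
    tf.app.run()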
In practice, put the source code to be executed, the training data, and the test data into a persistent volume, so that they are shared among multiple Pods and do not have to be deployed separately in each Pod.
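One way to do this is with a PersistentVolumeClaim that the Pods mount. The sketch below follows the same JSON style as the files above; the claim name tf-data-pvc, the volume name tf-data, the storage size, and the /data mount path are placeholder assumptions, not values from the original setup.

{
  "apiVersion": "v1",
  "kind": "PersistentVolumeClaim",
  "metadata": {
    "name": "tf-data-pvc"
  },
  "spec": {
    "accessModes": ["ReadWriteMany"],
    "resources": {
      "requests": {
        "storage": "10Gi"
      }
    }
  }
}

In the worker Deployment, the Pod template spec would then reference the claim and mount it:

"volumes": [
  {
    "name": "tf-data",
    "persistentVolumeClaim": { "claimName": "tf-data-pvc" }
  }
],
"containers": [
  {
    "name": "worker",
    "image": "tensorflow/tensorflow",
    "volumeMounts": [
      { "name": "tf-data", "mountPath": "/data" }
    ]
  }
]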
TensorFlow GPU Docker cluster deployment. NVIDIA provides nvidia-docker, which maps the host machine's GPU devices into containers: https://github.com/NVIDIA/nvidia-docker.
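For example, a GPU container can be started from the GPU-enabled TensorFlow image; the latest-gpu tag below is an assumption based on the Docker Hub naming convention, not a value from the original notes:

# nvidia-docker wraps docker run and exposes the host GPUs inside the container
nvidia-docker run -it tensorflow/tensorflow:latest-gpu /bin/bash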
After models are trained, package them into standalone images so that testers can deploy consistent environments, tag model images by version, and compare the accuracy of different model versions. This reduces the overall complexity of testing, deployment, and release.
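A minimal sketch of this workflow with the Docker CLI; the container ID and the image name my-registry/mnist-model are placeholders:

# snapshot the container that holds the trained model into a new image
docker commit <container-id> my-registry/mnist-model:v1.0
# tag and push it so testers can pull a consistent, versioned environment
docker tag my-registry/mnist-model:v1.0 my-registry/mnist-model:latest
docker push my-registry/mnist-model:v1.0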
References:
Analysis and Practice of TensorFlow Technology
Referrals for Shanghai machine learning job opportunities are welcome; my contact: qingxingfengzi.