Learning Notes TF064: TensorFlow on Kubernetes


AlphaGo used about 1,000 nodes per experiment, with 4 GPUs per node, for roughly 4,000 GPUs in total; Siri used 2 nodes with 8 GPUs per experiment. AI research depends on computation over massive data, which calls for cluster-scale rather than single-machine computing resources. Running a model on a larger cluster shortens training time from weeks to days or hours. Kubernetes, the most widely used container cluster management tool, provides monitoring, scheduling, and lifecycle management for distributed TensorFlow. It is an open-source platform for automated deployment, scaling, and operation of container clusters, offering task scheduling, monitoring, and restart on failure. TensorFlow and Kubernetes are both open-source projects from Google: https://kubernetes.io/. Google Cloud Platform offers solutions based on them: https://cloud.google.com/.

Running distributed TensorFlow on Kubernetes.
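
Each process in a distributed TensorFlow job plays one of two roles: a parameter server ("ps") that holds the model variables, or a worker that runs the computation. As a minimal sketch using the TensorFlow 1.0 API (the host addresses below are placeholders; on Kubernetes they become pod IPs):

import tensorflow as tf

# Describe the cluster: parameter servers hold variables, workers compute.
# The addresses are placeholders; on Kubernetes they become pod IPs.
cluster = tf.train.ClusterSpec({
    "ps": ["ps0:2222", "ps1:2222"],
    "worker": ["worker0:2222", "worker1:2222"],
})

# Each process starts a server for its own role and index.
server = tf.train.Server(cluster, job_name="worker", task_index=0)

# A parameter server process would instead block and serve variables:
# server.join()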

Deployment and running. Install Kubernetes. Minikube creates a local Kubernetes cluster. On macOS, first install the VirtualBox virtual machine: https://www.virtualbox.org/. Minikube is written in Go and released as a standalone binary; download it and move it into the appropriate directory. Command:

curl -Lo minikube https://storage.googleapis.com/minikube/releases/v0.14.0/minikube-darwin-amd64 && chmod +x minikube && sudo mv minikube /usr/local/bin/

kubectl is the client command-line tool for interacting with the cluster. Installation:

curl -Lo kubectl http://storage.googleapis.com/kubernetes-release/release/v1.5.1/bin/darwin/amd64/kubectl && chmod +x kubectl && sudo mv kubectl /usr/local/bin/

Start a local Kubernetes cluster with Minikube:

minikube start

Use the latest tensorflow/tensorflow image (version 1.0) from Docker Hub: https://hub.docker.com/r/tensorflow/tensorflow/. Configure the parameter server deployment file, named tf-ps-deployment.json:

{
  "apiVersion": "extensions/v1beta1",
  "kind": "Deployment",
  "metadata": {
    "name": "tensorflow-ps2"
  },
  "spec": {
    "replicas": 2,
    "template": {
      "metadata": {
        "labels": {
          "name": "tensorflow-ps2",
          "role": "ps"
        }
      },
      "spec": {
        "containers": [
          {
            "name": "ps",
            "image": "tensorflow/tensorflow",
            "ports": [
              {
                "containerPort": 2222
              }
            ]
          }
        ]
      }
    }
  }
}

Configure the parameter server service file, named tf-ps-service.json:

{
  "apiVersion": "v1",
  "kind": "Service",
  "metadata": {
    "name": "tensorflow-ps2-service",
    "labels": {
      "name": "tensorflow",
      "role": "service"
    }
  },
  "spec": {
    "ports": [
      {
        "port": 2222,
        "targetPort": 2222
      }
    ],
    "selector": {
      "name": "tensorflow-ps2"
    }
  }
}

Configure the worker (computing server) deployment file, named tf-worker-deployment.json:

{
  "apiVersion": "extensions/v1beta1",
  "kind": "Deployment",
  "metadata": {
    "name": "tensorflow-worker2"
  },
  "spec": {
    "replicas": 2,
    "template": {
      "metadata": {
        "labels": {
          "name": "tensorflow-worker2",
          "role": "worker"
        }
      },
      "spec": {
        "containers": [
          {
            "name": "worker",
            "image": "tensorflow/tensorflow",
            "ports": [
              {
                "containerPort": 2222
              }
            ]
          }
        ]
      }
    }
  }
}

Configure the worker service file, named tf-worker-service.json:

{
  "apiVersion": "v1",
  "kind": "Service",
  "metadata": {
    "name": "tensorflow-wk2-service",
    "labels": {
      "name": "tensorflow-worker2",
      "role": "service"
    }
  },
  "spec": {
    "ports": [
      {
        "port": 2222,
        "targetPort": 2222
      }
    ],
    "selector": {
      "name": "tensorflow-worker2"
    }
  }
}

Run the following commands:

kubectl create -f tf-ps-deployment.json
kubectl create -f tf-ps-service.json
kubectl create -f tf-worker-deployment.json
kubectl create -f tf-worker-service.json

Run kubectl get pod to check whether all the parameter server and worker pods have been created.
Enter each server (pod) to deploy the mnist_replica.py file. Run the following commands to look up the IP addresses to use for ps_hosts and worker_hosts:

kubectl describe service tensorflow-ps2-service
kubectl describe service tensorflow-wk2-service

Open four terminals, one for each of the four pods:

kubectl exec -ti tensorflow-ps2-3073558082-3b08h /bin/bash
kubectl exec -ti tensorflow-ps2-3073558082-4x3j2 /bin/bash
kubectl exec -ti tensorflow-worker2-3070479207-k6z8f /bin/bash
kubectl exec -ti tensorflow-worker2-3070479207-6hvsk /bin/bash

Deploy mnist_replica.py in each of the four pods:

curl https://raw.githubusercontent.com/tensorflow/tensorflow/master/tensorflow/tools/dist_test/python/mnist_replica.py -o mnist_replica.py

Run the following commands in the two parameter server containers, one per container:

python mnist_replica.py --ps_hosts=172.17.0.16:2222,172.17.0.17:2222 --worker_hosts=172.17.0.3:2222,172.17.0.8:2222 --job_name="ps" --task_index=0
python mnist_replica.py --ps_hosts=172.17.0.16:2222,172.17.0.17:2222 --worker_hosts=172.17.0.3:2222,172.17.0.8:2222 --job_name="ps" --task_index=1

Run the following commands in the two worker containers, one per container:

python mnist_replica.py --ps_hosts=172.17.0.16:2222,172.17.0.17:2222 --worker_hosts=172.17.0.3:2222,172.17.0.8:2222 --job_name="worker" --task_index=0
python mnist_replica.py --ps_hosts=172.17.0.16:2222,172.17.0.17:2222 --worker_hosts=172.17.0.3:2222,172.17.0.8:2222 --job_name="worker" --task_index=1
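
For reference, mnist_replica.py follows TensorFlow's standard between-graph replication pattern. The following is a condensed sketch of that pattern, not the actual script; the model here is a plain softmax regression used only for illustration:

import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data

flags = tf.app.flags
flags.DEFINE_string("ps_hosts", "", "comma-separated host:port list of ps tasks")
flags.DEFINE_string("worker_hosts", "", "comma-separated host:port list of workers")
flags.DEFINE_string("job_name", "", "'ps' or 'worker'")
flags.DEFINE_integer("task_index", 0, "index of this task within its job")
FLAGS = flags.FLAGS

def main(_):
    cluster = tf.train.ClusterSpec({"ps": FLAGS.ps_hosts.split(","),
                                    "worker": FLAGS.worker_hosts.split(",")})
    server = tf.train.Server(cluster, job_name=FLAGS.job_name,
                             task_index=FLAGS.task_index)
    if FLAGS.job_name == "ps":
        server.join()  # parameter servers block here and serve variables
        return
    # Between-graph replication: variables are pinned to ps tasks,
    # computation to this worker.
    with tf.device(tf.train.replica_device_setter(
            worker_device="/job:worker/task:%d" % FLAGS.task_index,
            cluster=cluster)):
        x = tf.placeholder(tf.float32, [None, 784])
        y_ = tf.placeholder(tf.float32, [None, 10])
        W = tf.Variable(tf.zeros([784, 10]))
        b = tf.Variable(tf.zeros([10]))
        loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(
            labels=y_, logits=tf.matmul(x, W) + b))
        global_step = tf.Variable(0, trainable=False)
        train_op = tf.train.GradientDescentOptimizer(0.5).minimize(
            loss, global_step=global_step)
    mnist = input_data.read_data_sets("/tmp/mnist-data", one_hot=True)
    # The chief (task 0) initializes variables; others wait for it.
    sv = tf.train.Supervisor(is_chief=(FLAGS.task_index == 0))
    with sv.managed_session(server.target) as sess:
        step = 0
        while not sv.should_stop() and step < 10000:
            xs, ys = mnist.train.next_batch(100)
            _, step = sess.run([train_op, global_step], {x: xs, y_: ys})

if __name__ == "__main__":
    tf.app.run()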

Put the source code to be executed, the training data, and the test data into a persistent volume, so that they are shared among pods rather than deployed separately into each pod.
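
For instance, such a shared volume could be requested with a persistent volume claim along these lines (a hypothetical example: the claim name and size are placeholders, a matching persistent volume must exist, and the claim still has to be mounted through volumes/volumeMounts in each deployment):

{
  "apiVersion": "v1",
  "kind": "PersistentVolumeClaim",
  "metadata": {
    "name": "tf-shared-data"
  },
  "spec": {
    "accessModes": ["ReadWriteMany"],
    "resources": {
      "requests": {
        "storage": "10Gi"
      }
    }
  }
}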
TensorFlow GPU Docker cluster deployment: NVIDIA provides nvidia-docker, which maps the host's GPU devices into containers: https://github.com/NVIDIA/nvidia-docker.
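
For example, following the nvidia-docker project's usage, GPU visibility can be checked and a GPU TensorFlow container started like this (the TensorFlow image tag is an assumption):

nvidia-docker run --rm nvidia/cuda nvidia-smi
nvidia-docker run -it tensorflow/tensorflow:latest-gpu /bin/bash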

After models are trained, package them into standalone images so that testers can deploy consistent environments, tag models of different versions, and compare the accuracy of different models. This reduces the overall complexity of testing, deployment, and release.
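
One possible way to package a trained model, sketched as a minimal Dockerfile (the base image tag, file names, and paths are all hypothetical):

FROM tensorflow/tensorflow:1.0.0
# Copy the exported model and a serving/evaluation script into the image
COPY saved_model/ /model/
COPY serve_model.py /app/serve_model.py
CMD ["python", "/app/serve_model.py"]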

References:
Analysis and Practice of TensorFlow Technology

Recommendations for machine learning job opportunities in Shanghai are welcome. Contact: qingxingfengzi.
