AlphaGo used 1,000 nodes per experiment, with 4 GPUs per node, 4,000 GPUs in total. Siri used 2 nodes per experiment, with 8 GPUs. AI research relies on massive data computation and cannot do without high-performance computing resources. Running a model on a larger cluster shortens training time from weeks to days or even hours. Kubernetes, the most widely used container cluster management tool, provides monitoring, scheduling, and lifecycle management for distributed TensorFlow. It is an open-source platform for automated deployment, scaling, and operation of container clusters, offering task scheduling, monitoring, and failure restart. Both TensorFlow and Kubernetes are open-source projects from Google. https://kubernetes.io/. Google Cloud Platform offers a cloud-based solution. https://cloud.google.com/.
Running distributed TensorFlow on Kubernetes.
Deploy and run. Install Kubernetes. Minikube creates a local Kubernetes cluster. On macOS, install the VirtualBox virtual machine first. https://www.virtualbox.org/. Minikube is written in Go and is released as a standalone binary; download it and move it into the corresponding directory. Command:
curl -Lo minikube https://storage.googleapis.com/minikube/releases/v0.14.0/minikube-darwin-amd64 && chmod +x minikube && sudo mv minikube /usr/local/bin/
The client is kubectl, the command-line tool for interacting with the cluster. Installation:
curl -Lo kubectl http://storage.googleapis.com/kubernetes-release/release/v1.5.1/bin/darwin/amd64/kubectl && chmod +x kubectl && sudo mv kubectl /usr/local/bin/
Start the Kubernetes cluster with Minikube:
minikube start
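As a quick check (not part of the original notes), you can confirm the local cluster is up before continuing; these are standard minikube and kubectl commands:
minikube status
kubectl cluster-info
kubectl get nodes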
Use the latest Docker Hub image tensorflow/tensorflow (version 1.0), https://hub.docker.com/r/tensorflow/tensorflow/. Configure the parameter server Deployment file, named tf-ps-deployment.json:
{
  "apiVersion": "extensions/v1beta1",
  "kind": "Deployment",
  "metadata": {
    "name": "tensorflow-ps2"
  },
  "spec": {
    "replicas": 2,
    "template": {
      "metadata": {
        "labels": {
          "name": "tensorflow-ps2",
          "role": "ps"
        }
      },
      "spec": {
        "containers": [
          {
            "name": "ps",
            "image": "tensorflow/tensorflow",
            "ports": [
              {
                "containerPort": 2222
              }
            ]
          }
        ]
      }
    }
  }
}
Configure the parameter server Service file, named tf-ps-service.json:
{
  "apiVersion": "v1",
  "kind": "Service",
  "metadata": {
    "name": "tensorflow-ps2-service",
    "labels": {
      "name": "tensorflow",
      "role": "service"
    }
  },
  "spec": {
    "ports": [
      {
        "port": 2222,
        "targetPort": 2222
      }
    ],
    "selector": {
      "name": "tensorflow-ps2"
    }
  }
}
Configure the compute server (worker) Deployment file, named tf-worker-deployment.json:
{
  "apiVersion": "extensions/v1beta1",
  "kind": "Deployment",
  "metadata": {
    "name": "tensorflow-worker2"
  },
  "spec": {
    "replicas": 2,
    "template": {
      "metadata": {
        "labels": {
          "name": "tensorflow-worker2",
          "role": "worker"
        }
      },
      "spec": {
        "containers": [
          {
            "name": "worker",
            "image": "tensorflow/tensorflow",
            "ports": [
              {
                "containerPort": 2222
              }
            ]
          }
        ]
      }
    }
  }
}
Configure the compute server Service file, named tf-worker-service.json:
{
  "apiVersion": "v1",
  "kind": "Service",
  "metadata": {
    "name": "tensorflow-wk2-service",
    "labels": {
      "name": "tensorflow-worker2",
      "role": "service"
    }
  },
  "spec": {
    "ports": [
      {
        "port": 2222,
        "targetPort": 2222
      }
    ],
    "selector": {
      "name": "tensorflow-worker2"
    }
  }
}
Execute the commands:
kubectl create -f tf-ps-deployment.json
kubectl create -f tf-ps-service.json
kubectl create -f tf-worker-deployment.json
kubectl create -f tf-worker-service.json
Run kubectl get pod to confirm that the parameter server and compute server pods have all been created.
Go into each server (pod) and deploy the mnist_replica.py file. Run the following commands to look up the ps_hosts and worker_hosts IP addresses.
kubectl describe service tensorflow-ps2-service
kubectl describe service tensorflow-wk2-service
Open four terminals and enter the four pods respectively.
kubectl exec -ti tensorflow-ps2-3073558082-3b08h /bin/bash
kubectl exec -ti tensorflow-ps2-3073558082-4x3j2 /bin/bash
kubectl exec -ti tensorflow-worker2-3070479207-k6z8f /bin/bash
kubectl exec -ti tensorflow-worker2-3070479207-6hvsk /bin/bash
Deploy mnist_replica.py to all four pods:
curl https://raw.githubusercontent.com/tensorflow/tensorflow/master/tensorflow/tools/dist_test/python/mnist_replica.py -o mnist_replica.py
In the two parameter server containers, execute (one command in each):
python mnist_replica.py --ps_hosts=172.17.0.16:2222,172.17.0.17:2222 --worker_hosts=172.17.0.3:2222,172.17.0.8:2222 --job_name="ps" --task_index=0
python mnist_replica.py --ps_hosts=172.17.0.16:2222,172.17.0.17:2222 --worker_hosts=172.17.0.3:2222,172.17.0.8:2222 --job_name="ps" --task_index=1
In the two compute server (worker) containers, execute (one command in each):
python mnist_replica.py --ps_hosts=172.17.0.16:2222,172.17.0.17:2222 --worker_hosts=172.17.0.3:2222,172.17.0.8:2222 --job_name="worker" --task_index=0
python mnist_replica.py --ps_hosts=172.17.0.16:2222,172.17.0.17:2222 --worker_hosts=172.17.0.3:2222,172.17.0.8:2222 --job_name="worker" --task_index=1
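For reference, this is roughly how such flags are consumed inside a dist_test-style script (a simplified sketch using the TensorFlow 1.x API, not the exact contents of mnist_replica.py; the host lists below are the example addresses above):
import tensorflow as tf

# Illustrative values; in practice they come from --ps_hosts, --worker_hosts,
# --job_name and --task_index on the command line.
ps_hosts = "172.17.0.16:2222,172.17.0.17:2222".split(",")
worker_hosts = "172.17.0.3:2222,172.17.0.8:2222".split(",")
job_name = "worker"
task_index = 0

# Describe the whole cluster, then start the gRPC server for this task.
cluster = tf.train.ClusterSpec({"ps": ps_hosts, "worker": worker_hosts})
server = tf.train.Server(cluster, job_name=job_name, task_index=task_index)

if job_name == "ps":
    # Parameter servers host the variables and simply wait for workers.
    server.join()
else:
    # Workers place variables on the ps tasks and run computation locally.
    with tf.device(tf.train.replica_device_setter(
            worker_device="/job:worker/task:%d" % task_index,
            cluster=cluster)):
        pass  # build the model graph here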
Put the required code, training data, and test data into a persistent volume and share it among the pods, to avoid deploying them to each pod separately.
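A minimal sketch of such a shared volume, assuming a hostPath-backed PersistentVolume and a matching claim (the names, path, and size here are illustrative, not from the original notes); the claim would then be mounted into the pods through spec.volumes and volumeMounts in the Deployment templates:
{
  "apiVersion": "v1",
  "kind": "PersistentVolume",
  "metadata": { "name": "tf-data-pv" },
  "spec": {
    "capacity": { "storage": "10Gi" },
    "accessModes": ["ReadWriteMany"],
    "hostPath": { "path": "/data/tf" }
  }
}
{
  "apiVersion": "v1",
  "kind": "PersistentVolumeClaim",
  "metadata": { "name": "tf-data-pvc" },
  "spec": {
    "accessModes": ["ReadWriteMany"],
    "resources": { "requests": { "storage": "10Gi" } }
  }
}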
For TensorFlow GPU Docker cluster deployment, NVIDIA provides nvidia-docker, which maps the host GPU devices into containers. https://github.com/NVIDIA/nvidia-docker.
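As a rough illustration (assuming nvidia-docker 1.x and the GPU image tag published on Docker Hub), running the GPU image interactively might look like:
nvidia-docker run -it tensorflow/tensorflow:latest-gpu /bin/bash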
After training a good model, package it into a standalone image for the production environment. This makes it convenient for testers to deploy a consistent environment, tag different versions of the model, and compare the accuracy of different models, reducing the overall complexity of testing and production deployment.
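For example (illustrative only; the container ID, registry, and image names below are hypothetical), a container holding the trained model can be committed, tagged by model version, and pushed:
docker commit <container-id> my-registry/mnist-serving:v1    # snapshot the container with the model inside
docker tag my-registry/mnist-serving:v1 my-registry/mnist-serving:latest
docker push my-registry/mnist-serving:v1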
Resources:
"TensorFlow Technical Analysis and actual combat"
Recommendations for Shanghai machine learning jobs are welcome; contact me: qingxingfengzi
Study Notes TF064: TensorFlow Kubernetes