Image classification with Hadoop Streaming


Note: this article was originally posted on a previous version of the 500px engineering blog. A lot has changed since it was originally posted on Feb 1, 2015. In future posts, we'll cover how our image classification solution has evolved and what other interesting machine learning projects we have.

TL;DR: this post provides an overview of how to perform large-scale image classification using Hadoop Streaming. We first examine each component individually and identify the things that need to be taken into account for the final integration. Then we go over how to 'glue' all the pieces together to produce the desired results.

Introduction

Until recently, image search and discovery at 500px was based mainly on meta information provided by users, such as tags, title, and description. Obviously, the quality of such search and discovery directly depends on how well users describe their photos. Since this is such a tedious process, most photos are simply left untitled or untagged. A recent update from Facebook illustrates the scale of the problem: ~2 billion photos are shared on Facebook each day!

Luckily, due to recent advancements in deep neural networks (NN) and their efficient implementations using GPUs, it is now possible to perform automatic image classification (tagging) with nearly human accuracy. Getting close to human capabilities, or even beating them, does not mean we've achieved real Artificial Intelligence (AI). Deep NN algorithms do not 'understand' the content of images nor the concepts that describe it. Instead, they learn invariant representations of patterns (features) found in images and create a mapping function that maps the pixels of an image to a combination of patterns found in that image. This mapping function is essentially a specially crafted mathematical function that takes a set of numbers (pixels) and outputs another set of numbers (features).

Despite not being real AI, deep NN algorithms applied to image classification tasks have improved significantly over the last few years. This is best illustrated by looking at the results of the ImageNet competition.

Improvements of image classification algorithms on ImageNet

Additional advantages of using NN algorithms include: calculating keyword weights in order to better control search results; calculating similarity between images for reverse image search; and extracting aesthetic qualities of images.

There are many open source projects, like Theano, Torch, Caffe, DL4J, and Toronto DeepNet, that provide tools for implementing NNs. For our approach we chose Caffe because it is one of the easiest to start with and also satisfied our requirements: availability of pre-trained models, which is important considering that it takes weeks to train a model and requires a large dataset like ImageNet; a Python wrapper, so we can use a lot of existing machine learning libraries, like scikit-learn, in the same environment; and an active community, decent documentation, and examples.

So we have the library and models for automatic image recognition. Let's look at the steps we need to perform to classify one image: download the image; resize and crop it to fit the model's input dimensions; convert RGB pixels into an array of real numbers; call Caffe's APIs; extract predicted labels and features; save the results.
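As an illustration of these steps, here is a minimal sketch of single-image classification using Caffe's Python wrapper (Python 2 era). The model, weights, mean file, and output path are placeholders, and the preprocessing parameters are assumptions rather than the values used in production; the predicted indices would still need to be mapped to human-readable labels via the model's label file.

import json
import urllib
import numpy as np
import caffe

# Placeholder paths: substitute a real model definition, weights and mean file.
MODEL_DEF = 'deploy.prototxt'
MODEL_WEIGHTS = 'bvlc_reference_caffenet.caffemodel'
MEAN_FILE = 'ilsvrc_2012_mean.npy'

# caffe.Classifier wraps resizing, cropping and pixel-to-float conversion.
classifier = caffe.Classifier(
    MODEL_DEF, MODEL_WEIGHTS,
    mean=np.load(MEAN_FILE).mean(1).mean(1),
    image_dims=(256, 256),
    raw_scale=255,
    channel_swap=(2, 1, 0))  # the reference models expect BGR channel order

def classify_image(image_url, output_path):
    # 1. Download the image to a temporary local file.
    local_path, _ = urllib.urlretrieve(image_url)

    # 2-4. Load the image and run the forward pass to get class probabilities.
    image = caffe.io.load_image(local_path)
    predictions = classifier.predict([image], oversample=False)[0]

    # 5-6. Extract the top predicted label indices and save the result as JSON.
    top = predictions.argsort()[::-1][:5]
    result = {'labels': top.tolist(), 'scores': predictions[top].tolist()}
    with open(output_path, 'w') as f:
        json.dump(result, f)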

When operating on hundreds of millions of images, just the download step could take weeks to complete on a single machine. A faster approach involves performing these steps in parallel. We achieved this by parallelizing execution both within one machine, using the GPU and multiprocessing, and across many machines, relying on Hadoop.

Hadoop Streaming for image classification

Hadoop is a framework for distributed processing and storage. Normally, it requires you to write Java code to run your jobs. Conveniently, it also offers Hadoop Streaming, which allows you to use any scripting language to create custom jobs. Hadoop Streaming is essentially a Hadoop job that runs your script in a child process. stdin and stdout are used to pass data to and from your code, while stderr is used for logging and debugging. Hadoop Streaming assumes that your input and output data are textual and have one item per line. This means you cannot simply point Hadoop to a folder containing images: it would try to read each image as a text file and eventually fail.
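To make the stdin/stdout/stderr contract concrete, here is a minimal, hypothetical Hadoop Streaming mapper in plain Python (not the mrjob-based job described later). It reads one tab-separated record per line from stdin, writes results to stdout, and logs to stderr:

#!/usr/bin/env python
# minimal_mapper.py -- a toy Hadoop Streaming mapper (illustrative only)
import sys

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    try:
        # Each input line is a textual record, e.g. "photo_id<TAB>s3_input<TAB>s3_output".
        photo_id, s3_input, s3_output = line.split('\t')
        # ... real work (download, classify, upload) would happen here ...
        sys.stdout.write('%s\tok\n' % photo_id)             # results go to stdout
    except Exception as e:
        sys.stderr.write('failed on %r: %s\n' % (line, e))  # logs go to stderr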

The "big challenge" with the Hadoop is installing and configuring the cluster. It could take a full-time devops position just to do. We shortcut to the problem be to-elastic MapReduce (EMR) which is Amazon ' s Web service that provides APIs to Configu Re a Hadoop cluster and run jobs. Another advantage of the EMR is seamless integration with Amazon S3. Essentially, when running a EMR job you specify where to read data from and save results to on S3. The only limitation are the input param needs to a file or a folder with files on S3, it would not traverse Subfold ERs.

So the input to EMR should be either a text file or a folder with text files on S3. Since we are dealing with a very large collection of photos, we generated a list of files, where each file contained around 1000 records (lines) in the following format:

photo_id  s3://input_photo.jpg  s3://output_classification.json
photo_id  s3://input_photo.jpg  s3://output_classification.json
photo_id  s3://input_photo.jpg  s3://output_classification.json
...
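A small, hypothetical script for producing such input files might look like the following; the record layout mirrors the format above, while the photo list source and output directory (which would then be uploaded to S3) are placeholders:

import os

RECORDS_PER_FILE = 1000  # roughly 1000 records (lines) per input file

def write_input_files(photos, out_dir='emr_input'):
    """photos: iterable of (photo_id, s3_input_url, s3_output_url) tuples."""
    if not os.path.isdir(out_dir):
        os.makedirs(out_dir)
    chunk, file_no = [], 0
    for record in photos:
        chunk.append('\t'.join(record))
        if len(chunk) == RECORDS_PER_FILE:
            _flush(chunk, out_dir, file_no)
            chunk, file_no = [], file_no + 1
    if chunk:
        _flush(chunk, out_dir, file_no)

def _flush(lines, out_dir, file_no):
    path = os.path.join(out_dir, 'input_%05d.txt' % file_no)
    with open(path, 'w') as f:
        f.write('\n'.join(lines) + '\n')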

By default, EMR outputs data as one large list split among many part-xxxxx files. These split files can be very large and are not aligned with newline boundaries, which makes parsing them difficult. For our classification pipeline we saved the predictions for each image as a separate JSON file. This is why we explicitly specified the destination location in the input for Hadoop, so a worker can save the results for each image separately.
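For instance, uploading one JSON result per photo to the output location given in the input record could look like this sketch; it assumes boto (the AWS SDK for Python of that era) and simplifies connection handling:

import json
import boto
from boto.s3.key import Key

conn = boto.connect_s3()

def save_result(s3_output_url, result):
    # s3_output_url looks like 's3://bucket/path/photo_id.json'
    bucket_name, key_name = s3_output_url[len('s3://'):].split('/', 1)
    bucket = conn.get_bucket(bucket_name, validate=False)
    key = Key(bucket, key_name)
    key.set_contents_from_string(json.dumps(result))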

We performed image classification on GPU-based EC2 instances. To our surprise, we found out that these g2.2xlarge instances are highly unreliable. The most common problem we observed was the failure of CUDA to find the GPU card. From the Hadoop side it looked like a stuck worker whose containers were blocked by the GPU. The workaround was to kill the failed instance and let EMR bring up a new node and restore the HDFS cluster.

There were also problems resulting in failed jobs that Hadoop/EMR could not recover from. Those included exceeding memory configuration limits, running out of disk space, S3 connectivity issues, bugs in the code, and so on. This is why we needed the ability to resume long-running jobs that failed. Our approach was to write the photo_id into Hadoop's output for successfully processed images. In case of a failed job, we would parse the part-xxxxx files and exclude the successfully processed photo IDs from the input files.
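A simplified sketch of that resume step might look like this; the file locations are placeholders, and it assumes the part-xxxxx output and the input files have already been copied locally:

import glob

def load_processed_ids(parts_dir):
    """Collect photo IDs that a previous (failed) run already processed."""
    processed = set()
    for path in glob.glob(parts_dir + '/part-*'):
        with open(path) as f:
            for line in f:
                # Each successfully processed photo was written out as its photo_id.
                photo_id = line.strip().split('\t')[0]
                if photo_id:
                    processed.add(photo_id)
    return processed

def filter_input_file(input_path, output_path, processed):
    """Rewrite an input file, dropping records for already-processed photos."""
    with open(input_path) as src, open(output_path, 'w') as dst:
        for line in src:
            photo_id = line.split('\t', 1)[0]
            if photo_id not in processed:
                dst.write(line)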

Bootstrapping and configuring EMR with mrjob

Before we can schedule a job on EMR, we need to bootstrap the cluster and then Hadoop. Amazon provides command line tools to perform both actions. However, we took an easier path and used mrjob, a Python library that provides abstractions over the EMR APIs and makes developing and testing MapReduce jobs very easy. It's worth mentioning that mrjob isn't the slowest Python framework for Hadoop, contrary to what a blog post by Cloudera claims. Instead, it is as fast as any Hadoop Streaming job can get. It just does more than other frameworks by offering protocols that serialize and deserialize raw text into native Python objects. Of course, this feature can be disabled in the code.
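For example, switching a job to mrjob's raw protocols avoids the JSON serialization overhead; a minimal sketch (the class name is a placeholder) could look like:

from mrjob.job import MRJob
from mrjob.protocol import RawValueProtocol

class ClassifierJob(MRJob):
    # Read and emit plain text lines instead of JSON-encoded key/value pairs.
    INPUT_PROTOCOL = RawValueProtocol
    INTERNAL_PROTOCOL = RawValueProtocol
    OUTPUT_PROTOCOL = RawValueProtocol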

To bootstrap and configure EMR via mrjob you create a YAML config file; in our case it was called classifier_job.conf.
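A minimal sketch of such a config, assuming an mrjob version from that era, might look like the following; the instance count, bucket path, bootstrap setup, and memory values simply mirror the numbers discussed below and are not the exact production settings:

# classifier_job.conf (illustrative sketch)
runners:
  emr:
    ec2_instance_type: g2.2xlarge
    num_ec2_instances: 20
    bootstrap_actions:
      - s3://our-bucket/classifier_job.sh   # installs NumPy, Caffe, etc.
    jobconf:
      # ~3GB per map task: ~1.7GB for the Caffe model + ~1GB for image preparation
      mapreduce.map.memory.mb: 3072
      mapreduce.map.java.opts: -Xmx2700m
      # 12GB of each 15GB worker goes to YARN containers, 3GB is left to the OS
      yarn.nodemanager.resource.memory-mb: 12288
      yarn.scheduler.maximum-allocation-mb: 3072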

It's important to highlight the memory-specific settings, both for YARN (Hadoop's resource manager) and for MapReduce. One map operation required around 2.7GB of RAM: 1.7GB to keep the Caffe model in memory plus 1GB for downloading and preparing 5 images in parallel. This is why we allocated 3GB per YARN container. Each g2.2xlarge worker had 15GB of RAM, out of which 12GB was allocated to MapReduce and 3GB was left to the OS. This gave us 4 containers per node, or 80 containers for the 20-node cluster.

The bash script classifier_job.sh that installed NumPy, Caffe, and the other libraries would run on both the master and worker nodes. To make sure it did not fail on the master node, which has no GPU, we had the following switch:
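A minimal sketch of such a guard, assuming the GPU is detected via lspci and that CUDA and the GPU build of Caffe are only needed on worker nodes:

# excerpt of a hypothetical classifier_job.sh guard
if lspci | grep -qi nvidia; then
    # GPU present (worker node): CUDA and GPU-enabled Caffe installation would go here.
    echo "GPU found, installing CUDA and Caffe with GPU support" >&2
else
    # No GPU (master node): skip GPU setup so the bootstrap script does not fail.
    echo "No GPU found, skipping CUDA setup" >&2
fi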

Running the classification job

The image classification job consisted of a single map operation that included downloading images, preparing batch data, performing classification on the GPU, and saving the results. To justify the overhead of loading data into and out of the GPU, images needed to be processed in batches. Unfortunately, mrjob does not support batching out of the box. The workaround was to override the run_mapper method and manually control reading lines from stdin.

The batch size was set based on the available GPU memory of g2.2xlarge. Below are the two main methods that handled batch operation: run_mapper for preparing a batch and process_batch for performing the classification task.
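A simplified sketch of how run_mapper can take over reading stdin and hand fixed-size batches to process_batch is shown below; the batch size, record handling, and classification call are placeholders rather than the production implementation:

from mrjob.job import MRJob

BATCH_SIZE = 64  # placeholder: in practice chosen to fit the g2.2xlarge GPU memory

class ClassifierJob(MRJob):

    def run_mapper(self, step_num=0):
        # Bypass mrjob's one-line-at-a-time mapper and read stdin ourselves
        # so images can be grouped into GPU-sized batches.
        batch = []
        for line in self.stdin:
            batch.append(line.rstrip('\n'))
            if len(batch) >= BATCH_SIZE:
                self.process_batch(batch)
                batch = []
        if batch:                      # don't forget the last, partial batch
            self.process_batch(batch)

    def process_batch(self, batch):
        # Placeholder: download and preprocess the images, run Caffe on the GPU,
        # upload one JSON result per photo, then report the processed IDs.
        for record in batch:
            photo_id = record.split('\t', 1)[0]
            self.stdout.write(photo_id + '\n')   # used later to resume failed jobs

if __name__ == '__main__':
    ClassifierJob.run()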

Among all the major steps of process_batch, downloading images is the slowest one.

download_preprocess_for_caffe_batch is parallelized to perform downloading using 5 processes. Normally, this download step would be a good candidate for a separate map operation. However, the limitation is the need to store all resized images in HDFS (Hadoop Distributed File System), which would require a very large EMR cluster. Alternatively, it is possible to increase the number of parallel processes from 5 to ten, for example. But doing so would require increasing the map task's memory footprint and thus reducing the number of containers per node (e.g. from 4 containers to 3).
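A rough sketch of that kind of parallel download step, using a multiprocessing pool of 5 worker processes; the function and record names here are illustrative and not the actual download_preprocess_for_caffe_batch implementation:

import urllib
from multiprocessing import Pool

NUM_DOWNLOAD_PROCESSES = 5  # bounded by the ~1GB budgeted for image preparation

def download_one(record):
    """Download a single photo; record is (photo_id, input_url, output_url)."""
    photo_id, input_url, _ = record
    local_path, _ = urllib.urlretrieve(input_url)
    # Resizing/cropping for Caffe would happen here before returning.
    return photo_id, local_path

def download_batch(records):
    pool = Pool(processes=NUM_DOWNLOAD_PROCESSES)
    try:
        return pool.map(download_one, records)
    finally:
        pool.close()
        pool.join()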

Here are the stats for the final run: a 20-node EMR cluster with 80 containers (4 containers per node); ~600 images/s (~30 images/s per node); ~80% average CPU load across the cluster; several days to complete.

General tips

Just a compilation of random tips we found useful while working on this project.

Use EMR job flows to start a persistent cluster so you can avoid provisioning and bootstrapping every time you want to run a job. An example of a possible workflow using mrjob:

# Start the cluster. It'll return a job flow ID
python -m mrjob.tools.emr.create_job_flow \
  --conf-path=classifier_job.conf

# Schedule a job using the job flow ID
python classifier_job.py -r emr \
  --emr-job-flow-id=j-396IAJN5Y6BCB \
  --conf-path=classifier_job.conf \
  --no-output --output-dir=s3://output s3://input
Use EC2 spot instances to save EMR costs by specifying the bid price in the config file:
ec2_master_instance_bid_price: '0.07'
ec2_core_instance_bid_price: '0.07'
Use Hadoop's counters and stderr to output debug and profiling info. For example:
...
except Exception as e:
    self.increment_counter('Error', str(e))
    self.increment_counter('Count', 'failed photos', count)
    stderr.write(str(e) + '\n')
...
SSH to the EMR cluster and run:
# to view worker nodes and # of running containers
yarn node -list

# to view running jobs and get URLs for the monitoring dashboard
yarn application -list
Specify the S3 location where to store EMR logs in mrjob's config file:
s3_log_uri: s3n://path_to_store_emr_logs

These logs contain when and where each map/reduce attempt was started and whether it succeeded or failed. For the failed tasks you can look at the corresponding container logs, which contain your code's stderr output and other possible exceptions.

Summary

We showed how to perform a large-scale image classification task using Caffe, Elastic MapReduce (EMR), and mrjob. In our approach we optimized for simplicity and speed of development.

The combination of Hadoop Streaming and the machine learning libraries available in Python opens up interesting possibilities for large-scale data processing.


From:https://developers.500px.com/image-classification-with-hadoop-streaming-1aa18b81e22b#.b27jerwf8
