Machine Learning Week 11 Notes: Photo OCR

This blog has migrated to Marcovaldo's blog (http://marcovaldong.github.io/).

I just completed the last week of Coursera's Machine Learning course. This week introduced an end-to-end application of machine learning: Photo OCR (optical character recognition). My notes follow.

Photo OCR Problem Description and Pipeline

The last few sections introduce an application of machine learning: photo OCR (optical character recognition). The techniques in this example also carry over to computer vision more broadly. The photo OCR problem is how to make a computer recognize the text in an image. Given an image, the first thing photo OCR does is determine where in the image the text is located.

The text is then transcribed correctly. Photo OCR remains one of the harder problems in machine learning. Solving it could help blind people "see" what is in front of them, and could help cars automatically identify objects on the road, advancing autonomous driving.

To implement photo OCR, these are the steps we take (a code sketch of the full pipeline follows the list):
- Text detection: determine the positions of the text in the image
- Character segmentation: cut each detected text region into fragments, one character per fragment
- Character classification: recognize the character in each fragment
The actual system may be much more complicated, but in broad strokes these are the steps, and together they are called the photo OCR pipeline.
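A minimal sketch of how the three stages might be chained together. All function names here are hypothetical placeholders, not from the course; each stage would be backed by its own trained classifier in a real system.

```python
# Minimal sketch of the three-stage photo OCR pipeline.
# Every function body is a placeholder for a trained component.

def detect_text_regions(image):
    """Stage 1 (text detection): return patches likely to contain text."""
    return []  # placeholder: a real system runs a trained detector here

def segment_characters(region):
    """Stage 2 (character segmentation): split a text region into
    single-character patches."""
    return []  # placeholder: a real system runs a trained splitter here

def classify_character(patch):
    """Stage 3 (character classification): label one character patch."""
    return "?"  # placeholder: a real system runs a trained classifier here

def photo_ocr(image):
    """Chain the stages: each stage's output is the next stage's input."""
    words = []
    for region in detect_text_regions(image):
        chars = [classify_character(p) for p in segment_characters(region)]
        words.append("".join(chars))
    return words
```

The point of the pipeline view is organizational: each stage can be built, evaluated, and assigned to a team independently.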

Sliding Windows

This section describes one detail of photo OCR: sliding windows. We use pedestrian detection to introduce the idea. An image may contain multiple pedestrians at different distances from the camera, so their bounding rectangles differ in size, but the principle is the same.

We treat pedestrian detection as supervised learning. Each training example is an image patch of size 82 x 36 pixels (the actual size is chosen as needed), labeled y = 1 (contains a pedestrian) or y = 0 (does not). With thousands of such patches, we can train a classifier to decide whether a new patch contains a pedestrian. To find the pedestrians in a test image, we take an 82 x 36 window starting at the top-left corner of the image and ask the classifier whether it contains a pedestrian. We then slide the window to the right by a step size of, say, 4 pixels (a step of 1 gives the highest accuracy but is slower; adjust to the situation), classifying each window. When the window reaches the right edge, we move it back to the left edge, shift it down one step, and sweep right again, repeating until the window reaches the bottom-right corner and the whole image has been scanned.
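A sketch of the scan just described, assuming a grayscale image as a 2-D NumPy array and any callable classifier that maps a patch to 0 or 1 (both are assumptions of this sketch, not course code):

```python
import numpy as np

def sliding_window_detect(image, classifier, win_h=36, win_w=82, step=4):
    """Scan the image with a fixed-size window and collect the windows
    the classifier labels positive (y = 1). Returns (top, left, h, w)
    boxes for each positive window."""
    detections = []
    rows, cols = image.shape
    for top in range(0, rows - win_h + 1, step):
        for left in range(0, cols - win_w + 1, step):
            patch = image[top:top + win_h, left:left + win_w]
            if classifier(patch) == 1:
                detections.append((top, left, win_h, win_w))
    return detections

# Toy usage: a dummy classifier that flags unusually bright patches.
if __name__ == "__main__":
    img = np.random.rand(120, 200)
    bright = lambda patch: int(patch.mean() > 0.7)
    print(sliding_window_detect(img, bright))
```

A real detector also rescans the image at several scales, for example by shrinking the image and sweeping again, so that larger pedestrians (or larger text) are still caught by the fixed-size window.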

Returning to photo OCR: the positive examples for text detection are patches that contain text, and the negative examples are patches that do not. We then use the same sliding window to scan the entire image and find the patches where text is located.

The detection output can be visualized as two maps: the white regions correspond to the positions of text in the original image, and the right-hand map is an "expanded" version of the left one, in which nearby white fragments are merged into larger blocks (a sketch of this expansion step follows). Next comes character segmentation, where we again use a sliding window trained on its own positive and negative examples. Note that a positive example here is a patch whose center falls exactly in the gap between two adjacent characters, i.e., a position where we can correctly split the text.
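The "expansion" of the detection map can be implemented as a simple dilation: a pixel becomes white if any pixel near it is white, which merges nearby fragments into contiguous blocks. A minimal, unoptimized sketch; the radius parameter is an assumption for illustration:

```python
import numpy as np

def expand_mask(mask, radius=5):
    """Mark a pixel white (1) if any pixel within `radius` of it is
    white, merging nearby detections into contiguous text blocks."""
    rows, cols = mask.shape
    out = np.zeros_like(mask)
    for r in range(rows):
        for c in range(cols):
            r0, r1 = max(0, r - radius), min(rows, r + radius + 1)
            c0, c1 = max(0, c - radius), min(cols, c + radius + 1)
            out[r, c] = mask[r0:r1, c0:c1].max()
    return out
```

After expansion, blocks whose aspect ratio is wrong for text (for instance, tall and narrow) can be discarded before segmentation.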

The final step is character recognition: classify each segmented patch as one of the possible characters.

Getting Lots of Data and Artificial Data

This section describes artificial data synthesis. Given real image fragments, we want to accurately recognize the characters in them (we use grayscale images, which work a bit better than color here). Characters can appear in many different fonts, so how do we get more training examples? We can render characters in a variety of fonts and paste them onto different random backgrounds, producing synthetic training examples that look very similar to the real data (a sketch follows).
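One way this compositing step might look. Both inputs are assumptions for illustration: in practice the glyphs would come from rendering many real fonts and the backgrounds from crops of natural images.

```python
import numpy as np

rng = np.random.default_rng(0)

def synthesize_example(glyph, background, contrast=0.8):
    """Paste a character glyph (2-D array in [0, 1], 1 = ink) at a
    random position on a background texture, producing one synthetic
    grayscale training image."""
    gh, gw = glyph.shape
    bh, bw = background.shape
    top = rng.integers(0, bh - gh + 1)
    left = rng.integers(0, bw - gw + 1)
    patch = background.copy()
    region = patch[top:top + gh, left:left + gw]
    # Darken the background wherever the glyph has ink.
    patch[top:top + gh, left:left + gw] = region * (1 - contrast * glyph)
    return patch

# Toy usage with a fake "T" glyph and a noise background.
glyph = np.zeros((12, 12)); glyph[1, :] = 1.0; glyph[1:, 6] = 1.0
background = rng.random((36, 36))
sample = synthesize_example(glyph, background)
```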

Another method is to obtain new training examples by distorting existing image fragments, for example by smoothly warping a character image (a sketch follows).
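One simple stand-in for the warping distortions shown in the lecture: shift each row of the character image sideways by a sinusoidal offset. The amplitude and wavelength values are assumptions for illustration.

```python
import numpy as np

def warp_distort(image, amplitude=2.0, wavelength=12.0):
    """Create a new training example by applying a smooth horizontal
    warp: each row is shifted sideways by a sinusoidal offset."""
    rows, cols = image.shape
    out = np.empty_like(image)
    for r in range(rows):
        shift = int(round(amplitude * np.sin(2 * np.pi * r / wavelength)))
        out[r] = np.roll(image[r], shift)
    return out
```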

The video also gives a speech recognition example of synthesizing data by introducing distortions: from an original audio clip, one can generate versions that sound like audio over a bad cellphone connection, audio against a crowd background, or audio against a machinery background. Finally, note that synthetic data should come from meaningful transformations of the original data (distortions that could plausibly occur in the test set); adding purely random, meaningless noise to the data set does not help.

Before collecting more training data, we should make sure the model has low bias, because only a low-bias model can improve its performance from a larger training set. For a neural network, for example, we can first increase the number of features or hidden units until the model has low bias, and only then grow the training set (the learning-curve sketch below is one way to check this).
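A generic sketch of the check, in the spirit of the learning curves covered earlier in the course: train on growing subsets and compare training error with validation error. `train_fn` and `error_fn` are assumptions standing in for whatever model and error metric you use.

```python
def learning_curve(train_fn, error_fn, X, y, X_val, y_val, sizes):
    """Train on growing subsets and compare training vs. validation error.
    High, converging errors        -> high bias: more data won't help.
    Low train / high val error gap -> low bias, high variance: more
                                      data is worth collecting."""
    curve = []
    for m in sizes:
        model = train_fn(X[:m], y[:m])
        curve.append((m,
                      error_fn(model, X[:m], y[:m]),
                      error_fn(model, X_val, y_val)))
    return curve
```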

The last point in the video is the time cost of acquiring data: in practical applications we should treat data collection as a cost to be estimated up front (for example, "how many days of work would it take to get ten times the current data?"), not an afterthought. (I don't think there is much more to write here; the lecture just shows a slide.)

Ceiling Analysis: What Part of the Pipeline to Work on Next

This section describes ceiling analysis, which helps us decide which stage of the whole pipeline is most worth optimizing for better overall performance. Suppose our system achieves 72% accuracy on the test set. First, we substitute manually produced ground truth for the output of the text detection stage (making that stage effectively 100% accurate); overall accuracy rises to 89%. Second, we additionally substitute ground truth for character segmentation; overall accuracy reaches 90%. Third, we also substitute ground truth for character recognition; accuracy reaches 100%. We get the following table:

| Component | Accuracy |
| --- | --- |
| Overall system | 72% |
| + Perfect text detection | 89% |
| + Perfect character segmentation | 90% |
| + Perfect character recognition | 100% |

Reading the table, perfecting the three pipeline stages would add at most 17%, 1%, and 10% to the model's accuracy, respectively. Because each stage's output was replaced with ground truth (100% performance, which cannot be beaten), these gains are upper bounds; this is the ceiling analysis. We conclude that optimizing text detection and character recognition can improve the whole pipeline the most, so those two stages should be prioritized, while character segmentation (at most +1%) deserves little effort. The arithmetic is sketched below.
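The per-stage gains are just differences of consecutive rows in the table; a tiny worked example using the numbers from this section:

```python
# Ceiling analysis arithmetic: each stage's potential gain is the
# accuracy jump when that stage's output is replaced with ground truth.
stages = [("overall system", 0.72),
          ("text detection", 0.89),
          ("character segmentation", 0.90),
          ("character recognition", 1.00)]

for (_, prev_acc), (name, acc) in zip(stages, stages[1:]):
    print(f"perfecting {name}: +{(acc - prev_acc) * 100:.0f}% potential gain")
# -> text detection +17%, segmentation +1%, recognition +10%
```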

The lecture also gives the pipeline of a face recognition system as a second example, to deepen our understanding.

Applying ceiling analysis to that pipeline shows that face detection is the stage most worth optimizing.

Conclusion: Summary and Thank You

The last video summarizes everything covered in this course.

Having worked through the course carefully, we have learned some basic machine learning algorithms and practical skills, which barely gets us through the door. There is much more ahead waiting for us to learn.
