Blog has migrated to Marcovaldo's blog (http://marcovaldong.github.io/)
I just finished the last week of the Coursera machine learning course. This week introduced one of the applications of machine learning: Photo OCR (optical character recognition). The notes are organized below.
Photo OCR Problem Description and Pipeline
The last few sections introduce an application of machine learning: photo OCR (optical character recognition). The same ideas also apply to computer vision. The problem photo OCR solves is how to let a computer recognize the text in an image. Given an image, the first thing photo OCR does is determine where the text is located, for example:
The text is then transcribed correctly.
Photo OCR is still one of the hard problems in machine learning. It can help blind people "see" what is in front of them, help cars automatically identify objects on the road, and push forward autonomous driving technology.
To implement photo OCR, we carry out the following steps:
- Text detection: determine where the text is located in the image
- Character segmentation: split the image fragments that contain text into individual characters
- Character classification (recognition): accurately identify each character in the image
Of course, the actual implementation can be much more complex, but these are the overall steps, and together they are called the photo OCR pipeline.
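As a minimal sketch of how the three stages might be chained together (the function bodies are placeholders and the names are my own, not from the course):

```python
import numpy as np

def text_detection(image):
    """Return bounding boxes (x, y, w, h) of regions likely to contain text.
    In practice this runs a trained text/no-text classifier with a sliding
    window; here it is only a placeholder."""
    raise NotImplementedError

def character_segmentation(text_patch):
    """Split one text region into a list of single-character image patches."""
    raise NotImplementedError

def character_classification(char_patch):
    """Classify one character patch and return the predicted character."""
    raise NotImplementedError

def photo_ocr(image):
    """Run the full photo OCR pipeline on one image and return the strings found."""
    results = []
    for (x, y, w, h) in text_detection(image):
        patch = image[y:y + h, x:x + w]
        chars = [character_classification(c) for c in character_segmentation(patch)]
        results.append("".join(chars))
    return results
```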
Sliding Windows
This section describes one of the details of photo OCR: the sliding window. We use a pedestrian detection example to introduce sliding windows. The image contains multiple pedestrians, and because each pedestrian is at a different distance from the camera, the rectangles representing them have different sizes, but the principle is the same.
We use supervised learning for pedestrian detection. Each training example is an image patch of size 82 × 36 pixels (the actual size depends on the application). Patches containing a pedestrian are labelled y = 1 (positive examples) and patches without one are labelled y = 0 (negative examples).
With thousands of such image patches we can train a hypothesis that predicts whether a new patch contains a pedestrian. We then take the image above as the test image and look for pedestrians in it: starting from the top-left corner, we select an 82 × 36 patch as the window and predict whether it contains a pedestrian.
Then we slide the window to the right with a step size (stride) of, say, 4 pixels (a step of 1 gives the highest precision but is slower, so the value is adjusted to the situation) and make a prediction at each position. When the window reaches the right edge, we move it back to the left, shift it down one step, and slide right again, continuing until the window reaches the bottom-right corner and the whole image has been scanned.
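A rough sketch of this scan in code (the classifier is assumed to be already trained; the 82 × 36 patch size and stride of 4 follow the numbers above):

```python
import numpy as np

def sliding_window(image, classifier, patch_h=82, patch_w=36, step=4):
    """Scan the image left-to-right, top-to-bottom with a fixed-size window.

    `classifier(patch)` is assumed to return 1 if the patch contains a
    pedestrian (or text) and 0 otherwise. Detections are returned as
    (row, col) coordinates of the window's top-left corner.
    """
    H, W = image.shape[:2]
    detections = []
    for row in range(0, H - patch_h + 1, step):
        for col in range(0, W - patch_w + 1, step):
            patch = image[row:row + patch_h, col:col + patch_w]
            if classifier(patch) == 1:
                detections.append((row, col))
    return detections
```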
Returning to photo OCR, the positive and negative examples for text detection look like this:
We then use the sliding window described above to scan the whole image and locate the patches where text appears.
The figure below shows the whole process. The white regions in the two detection maps correspond to the positions of text in the original image; the right-hand map is an expanded (integrated) version of the left one, in which nearby white fragments are merged into larger blocks. Next comes character segmentation, for which we again use a sliding window. Positive and negative examples are shown below; note that a positive example is a patch centred on the gap between two adjacent characters, because that is exactly the position at which we can split the text into characters.
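The "integration" of the white fragments can be approximated with a simple morphological expansion of the detection map; the threshold and radius below are illustrative values, not from the course:

```python
import numpy as np
from scipy import ndimage

def expand_detections(prob_map, threshold=0.5, radius=5):
    """Turn a per-pixel text-probability map into expanded white regions.

    Pixels above `threshold` are marked as text, then every marked pixel is
    dilated so that nearby detections merge into one connected block,
    roughly matching the expansion step described above.
    """
    mask = prob_map > threshold
    structure = np.ones((2 * radius + 1, 2 * radius + 1), dtype=bool)
    return ndimage.binary_dilation(mask, structure=structure)
```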
The final step is to identify the characters:
Getting Lots of Data and Artificial Data
This section describes artificial data synthesis. Given the data we actually encounter, we want to be able to accurately identify characters from these image patches (we use grayscale images, which work a little better than color here).
The characters may appear in many different fonts, so how do we get a large number of additional training examples? We can randomly paste characters in different fonts onto different backgrounds to obtain synthesized training examples; the second picture below illustrates this.
In this way, we can get a lot of synthetic data that is very similar to the original data.
The second method is to obtain new training examples by distorting the original image patches, for example as shown below:
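A small sketch of this distortion idea, assuming each patch is a 2-D grayscale array; the rotation and shift amounts are illustrative, and real distortions should mimic variation that actually occurs in the data rather than arbitrary noise:

```python
import numpy as np
from scipy import ndimage

def distort(patch, rng=None, max_angle=10.0, max_shift=2.0):
    """Create one synthetic training example by warping an original patch."""
    rng = np.random.default_rng() if rng is None else rng
    angle = rng.uniform(-max_angle, max_angle)          # small random rotation
    shift = rng.uniform(-max_shift, max_shift, size=2)  # small random translation
    warped = ndimage.rotate(patch, angle, reshape=False, mode="nearest")
    return ndimage.shift(warped, shift, mode="nearest")

# Usage: grow the training set by distorting each labelled patch several times.
# augmented = [distort(p) for p in original_patches for _ in range(10)]
```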
The video also gives a speech recognition example of synthesizing data by introducing distortions: starting from an original audio recording, one can produce a recording as heard over a bad cellphone connection, a recording with crowd noise in the background, and a recording with machinery noise in the background.
Finally, note that all synthetic data should be based on the original data (i.e., it must preserve the meaningful information in the original data); simply adding meaningless noise to the data set does not help.
Before adding a lot of extra training data, we should make sure the model has low bias, because only such a model can improve its performance as the training set grows.
For example, for a neural network, we can ensure low bias by increasing the number of features or the number of hidden units and layers, and only then enlarge the training set.
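One way to check this is a learning curve, sketched below under the assumption that `train_fn` fits a model and `error_fn` measures its error (both are placeholders): if training and validation error are both high and close together, the model has high bias and more data will not help; if training error is low and validation error is much higher, more data is worthwhile.

```python
import numpy as np

def learning_curve(train_fn, error_fn, X, y, X_val, y_val, sizes):
    """Compute training and validation error for increasing training-set sizes."""
    train_err, val_err = [], []
    for m in sizes:
        model = train_fn(X[:m], y[:m])               # fit on the first m examples
        train_err.append(error_fn(model, X[:m], y[:m]))
        val_err.append(error_fn(model, X_val, y_val))
    return np.array(train_err), np.array(val_err)
```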
The last issue discussed in the video is the time spent acquiring data, which we should treat as a cost in practical applications. I won't go into detail here (in my view there is not much more to write), so I'll just give the slide.
Ceiling Analysis: What Part of the Pipeline to Work on Next
This section describes ceiling analysis. Ceiling analysis helps us decide which stage of the pipeline is most worth optimizing to get better overall performance. Suppose testing our model on the test set gives 72% accuracy. In the first step, we do the work of the text detection stage manually (so that stage's accuracy becomes 100%), and the overall accuracy rises to 89%. In the second step, we also do the character segmentation stage manually (again 100% for that stage), and the overall accuracy reaches 90%. In the third step, we do the character recognition stage manually as well, and the overall accuracy finally reaches 100%. We get the following table:

| Component made perfect | Overall accuracy |
| --- | --- |
| None (baseline system) | 72% |
| Text detection | 89% |
| + Character segmentation | 90% |
| + Character recognition | 100% |
Analyzing this table, we find that perfecting the three stages of the pipeline increases overall accuracy by 17%, 1%, and 10% respectively. Because each stage in turn was given its best possible performance (optimized to 100%, which cannot be beaten), these three numbers are upper bounds on the gain from improving each stage; this is ceiling analysis. From it we know that optimizing the text detection and character recognition stages can improve the performance of the whole pipeline the most, so we should prioritize those two stages.
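The per-stage gains above are just successive differences of the accuracies in the table; a tiny sketch using the numbers from these notes:

```python
# Ceiling analysis: accuracy after making each successive stage perfect.
stages = ["text detection", "character segmentation", "character recognition"]
accuracy = [0.72, 0.89, 0.90, 1.00]  # baseline, then one more perfect stage each time

# The gain for a stage is the jump obtained when that stage is replaced by
# ground-truth output; the largest jumps show where to focus effort.
for name, prev, curr in zip(stages, accuracy[:-1], accuracy[1:]):
    print(f"{name}: +{(curr - prev) * 100:.0f}% potential improvement")
# text detection: +17%, character segmentation: +1%, character recognition: +10%
```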
A face recognition pipeline is also given to deepen our understanding.
From the ceiling analysis of this pipeline shown in the diagram, we can see that the stage most worth optimizing is face detection.
Conclusion: Summary and Thank You
The last section summarizes all the contents of this course, as shown below.
After studying the course carefully, we have learned some basic algorithms and techniques of machine learning, which barely gets us through the door; there is still a great deal more waiting for us to learn.