Generating a Chinese character image library with Python

Source: Internet
Author: User

I recently worked on a document recognition project and needed to build a character library for Chinese character recognition. I looked at all kinds of OCR material on the internet and was not impressed. The technology should be quite mature, and there is plenty of OCR software, but I found few papers of real substance and no expert who had published a character library. So I built one in two ways: by rendering a font with pygame, and by using PIL to cut up neatly laid-out images.

Generating the library by rendering a font with pygame

For rendering a font with pygame I referred to this article. Going by the GB2312-80 standard, the 3,500 commonly used Chinese characters cover 99.7% of usage; adding the less common ones gives 6,763 characters in total, covering 99.99% of usage. To build the character images, find a list of the 3,500 commonly used characters online and render each one with the font.

    import io
    import os

    import pygame
    from PIL import Image

    def pasteWord(word):
        '''Take one character and write out an image containing it'''
        pygame.init()
        # The font file name is as given in the source, where only ".ttf" survives
        font = pygame.font.Font(os.path.join("./fonts", ".ttf"), 22)
        imgName = "E:/dataset/chinesedb/chinese/" + word + ".png"
        paste(word, font, imgName)

    def paste(text, font, imgName, area=(0, -9)):
        '''Render the text with the font, paste it onto a 32x32 image, and save it'''
        im = Image.new("RGB", (32, 32), (255, 255, 255))
        # Black text, antialiased, on a white background
        rtext = font.render(text, True, (0, 0, 0), (255, 255, 255))
        # Pass the rendered surface from pygame to PIL through an in-memory buffer
        sio = io.BytesIO()
        pygame.image.save(rtext, sio)
        sio.seek(0)
        line = Image.open(sio)
        im.paste(line, area)
        # im.show()
        im.save(imgName)

An error is always reported when a large number of images is rendered. I retried the characters that failed to render and finally got a library of 3,510 characters (the 3,500 plus the 10 digits).

Generating the library by character segmentation

Another way to generate the library is to lay out the 3,500 characters in Word, convert the document to PDF, and save it as an image. The characters are dense but very neat, so no image processing algorithms are required.
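To see why a neat layout makes cutting easy, here is a minimal sketch on a toy page. It is not the author's code: the 0/1 grid, `ink_runs`, and `split_grid` are hypothetical names made up for illustration. On a regular grid, every row or column that separates two characters contains no ink, so the character boxes can be read straight off the row and column ink counts.

```python
def ink_runs(profile):
    """Return (start, end) pairs, end exclusive, for each run of non-zero counts."""
    runs = []
    start = None
    for i, value in enumerate(profile):
        if value and start is None:
            start = i                      # a run of ink begins
        elif not value and start is not None:
            runs.append((start, i))        # a blank line closes the run
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    return runs


def split_grid(page):
    """Cut a 0/1 page (1 = ink) into boxes, row by row, left to right.

    Boxes are (left, upper, right, lower), the same order PIL's crop() expects.
    """
    row_profile = [sum(row) for row in page]        # ink count per row
    col_profile = [sum(col) for col in zip(*page)]  # ink count per column
    return [(x0, y0, x1, y1)
            for (y0, y1) in ink_runs(row_profile)
            for (x0, x1) in ink_runs(col_profile)]


# Hypothetical toy page: four 2x2 "characters" separated by one blank row and column
page = [
    [1, 1, 0, 1, 1],
    [1, 1, 0, 1, 1],
    [0, 0, 0, 0, 0],
    [1, 1, 0, 1, 1],
    [1, 1, 0, 1, 1],
]
boxes = split_grid(page)
print(boxes)  # [(0, 0, 2, 2), (3, 0, 5, 2), (0, 3, 2, 5), (3, 3, 5, 5)]
```

Because the boxes come out in reading order, the n-th box can be paired with the n-th character of the list that was typed into the page.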
You only need to find the blank rows and columns and cut along them. As long as the cuts are saved in order, each cut image still corresponds to its character. Here is the cutting code:

    # encoding: utf-8
    from PIL import Image

    # In a mode "1" image, a pixel value of 0 is black (ink)

    def yStart(gray):
        '''First row that contains ink'''
        m, n = gray.size
        for j in range(n):
            for i in range(m):
                if gray.getpixel((i, j)) == 0:
                    return j

    def yEnd(gray):
        '''Last row that contains ink'''
        m, n = gray.size
        for j in range(n - 1, -1, -1):
            for i in range(m):
                if gray.getpixel((i, j)) == 0:
                    return j

    def xStart(gray):
        '''First column that contains ink'''
        m, n = gray.size
        for i in range(m):
            for j in range(n):
                if gray.getpixel((i, j)) == 0:
                    return i

    def xEnd(gray):
        '''Last column that contains ink'''
        m, n = gray.size
        for i in range(m - 1, -1, -1):
            for j in range(n):
                if gray.getpixel((i, j)) == 0:
                    return i

    def xBlank(gray):
        '''Indices of the columns that contain no ink'''
        m, n = gray.size
        blanks = []
        for i in range(m):
            for j in range(n):
                if gray.getpixel((i, j)) == 0:
                    break
            else:
                # No ink anywhere in this column
                blanks.append(i)
        return blanks

    def yBlank(gray):
        '''Indices of the rows that contain no ink'''
        m, n = gray.size
        blanks = []
        for j in range(n):
            for i in range(m):
                if gray.getpixel((i, j)) == 0:
                    break
            else:
                # No ink anywhere in this row
                blanks.append(j)
        return blanks

    def getWordsList():
        '''Read the whitespace-separated character list'''
        with open('3500.txt', encoding='utf-8') as f:
            return f.read().split()

    count = 0
    wordslist = []

    def getWordsByBlank(img, path):
        '''Cut out the characters using the blank rows and columns; works well'''
        global count
        global wordslist
        gray = img.split()[0]
        xblank = xBlank(gray)
        yblank = yBlank(gray)
        # Blank pixels come in consecutive runs; keep only the first and last
        # index of each run, as the end of one character and the start of the next
        xblank = [xblank[i] for i in range(len(xblank))
                  if i == 0 or i == len(xblank) - 1
                  or not (xblank[i] == xblank[i - 1] + 1 and xblank[i] == xblank[i + 1] - 1)]
        yblank = [yblank[i] for i in range(len(yblank))
                  if i == 0 or i == len(yblank) - 1
                  or not (yblank[i] == yblank[i - 1] + 1 and yblank[i] == yblank[i + 1] - 1)]
        for j in range(len(yblank) // 2):
            for i in range(len(xblank) // 2):
                # The character size is fixed at 32 pixels here
                area = (xblank[i * 2], yblank[j * 2], xblank[i * 2] + 32, yblank[j * 2] + 32)
                # area = (xblank[i * 2], yblank[j * 2], xblank[i * 2 + 1], yblank[j * 2 + 1])
                word = img.crop(area)
                word.save(path + wordslist[count] + '.png')
                count += 1
                if count >= len(wordslist):
                    return

    def getWordsFormImg(imgName, path):
        png = Image.open(imgName, 'r')
        img = png.convert('1')
        gray = img.split()[0]
        # First crop to the text area, leaving a one-pixel margin
        area = (xStart(gray) - 1, yStart(gray) - 1, xEnd(gray) + 2, yEnd(gray) + 2)
        img = img.crop(area)
        getWordsByBlank(img, path)

    def getWrods():
        global wordslist
        wordslist = getWordsList()
        imgs = ["l1.png", "l2.png", "l3.png"]
        for img in imgs:
            getWordsFormImg(img, 'words/')

    if __name__ == "__main__":
        getWrods()

This also produces good results. I am not familiar with image processing, so I used a crude, home-made method. Recognizing Chinese characters is relatively difficult. For neat images, using DTW to match against similar entries in the library works reasonably well, but for text cut from pages captured by scanners and cameras the results are poor. I also used a back-propagation neural network, but 3,500 Chinese characters amount to 3,500 classes.
A classification problem with that many classes is hard to handle, mainly because there is too little training data: I have only one font on hand.
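The DTW matching mentioned above can be sketched roughly as follows. The source does not say how the glyph images were turned into sequences, so this sketch assumes one simple option: the ink count of each column, computed on a 0/1 grid. The names `dtw_distance`, `column_profile`, `nearest_glyph`, and the toy library are all hypothetical.

```python
def dtw_distance(a, b):
    """Classic dynamic time warping distance between two numeric sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = cheapest cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # a[i-1] repeats b[j-1]
                                 cost[i][j - 1],      # b[j-1] repeats a[i-1]
                                 cost[i - 1][j - 1])  # both advance
    return cost[n][m]


def column_profile(glyph):
    """Ink count per column of a 0/1 grid (1 = ink). An assumed feature."""
    return [sum(col) for col in zip(*glyph)]


def nearest_glyph(query, library):
    """Return the label whose column profile is closest to the query under DTW."""
    q = column_profile(query)
    return min(library, key=lambda label: dtw_distance(q, column_profile(library[label])))


# Hypothetical two-entry library: a vertical stroke and a horizontal stroke
library = {
    "vertical": [[0, 1, 0], [0, 1, 0], [0, 1, 0]],
    "horizontal": [[0, 0, 0], [1, 1, 1], [0, 0, 0]],
}
# A vertical stroke shifted right inside a wider box
query = [[0, 0, 1, 0], [0, 0, 1, 0], [0, 0, 1, 0]]
print(nearest_glyph(query, library))  # vertical
```

The warping absorbs the shift and the extra blank column, which is consistent with the method working on neat renderings. It offers no protection against the noise and distortion of scanned or photographed pages.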
