Crawling font files with a Python script

Last Update: 2017-05-09 Source: Internet

Author: User


Preface

Anyone who has worked on verification code (CAPTCHA) recognition knows that improving accuracy starts with obtaining enough test data. Downloading verification codes is easy, but labeling them by hand is impractical, so I settled on a compromise: generating my own verification codes.

To ensure diversity, we first need a variety of fonts. Font files in TTF or a similar format can be used directly, and many of them are available for download on the Internet. I was not going to download and unzip them all by hand, so I wrote a crawler.

Implementation Method

Website 1: fontsquirrel.com

The fonts on this website can be downloaded for free, but many of the download links point to other websites. Those off-site links are ignored here.
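The filtering rule just described (keep links whose text says "Download", skip the ones marked "offsite") can be sketched on its own. This is a minimal illustration written for Python 3, whereas the script in this article targets Python 2 (urllib2). The function name `local_download_links`, the sample HTML, and the simplified anchor pattern are all assumptions made for the example, not Font Squirrel's real markup.

```python
import re

# Simplified anchor pattern (an assumption, not Font Squirrel's actual markup).
ANCHOR = re.compile(r'<a\s+href="(.*?)"[^>]*>(.*?)</a>', re.IGNORECASE | re.DOTALL)

def local_download_links(html):
    """Return unique hrefs whose link text says 'Download' but not 'offsite'."""
    links = []
    for url, inner in ANCHOR.findall(html):
        if 'Download' in inner and 'offsite' not in inner and url not in links:
            links.append(url)
    return links

# Hypothetical page fragment: one local link (listed twice), one off-site link,
# and one ordinary navigation link.
sample = (
    '<a href="/fonts/download/alpha">Download TTF</a>'
    '<a href="http://example.com/beta">Download offsite</a>'
    '<a href="/fonts/alpha">Alpha</a>'
    '<a href="/fonts/download/alpha">Download TTF</a>'
)
print(local_download_links(sample))  # ['/fonts/download/alpha']
```

Checking `url not in links` before appending keeps the first occurrence of each link, so a font that appears on several listing pages is downloaded only once.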

    # coding: utf-8
    import urllib2, cookielib, re, os, shutil, zipfile

    # set up a cookie-aware opener that sends a browser User-Agent
    cj = cookielib.CookieJar()
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
    opener.addheaders = [('User-agent', 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36')]
    urllib2.install_opener(opener)

    items = []  # collected download links

    # search a listing page for downloadable links
    def search(path):
        request = urllib2.Request(path)
        response = urllib2.urlopen(request)
        html = response.read()
        html = html.replace('\n', '')  # remove all newlines because the regular expression matches a single line
        urls = re.findall(r'<a href="(.*?)" rel="external nofollow">(.*?)</a>', html)
        for i in urls:
            url, inner = i
            # keep "Download" links that are not marked "offsite" and are not yet collected
            if not re.findall(r'Download', inner) == [] and re.findall(r'offsite', inner) == [] and url not in items:
                items.append(url)

    for i in xrange(15):
        host = 'http://www.fontsquirrel.com/fonts/list/find_fonts/' + str(i * 50) + '?filter%5Bdownload%5D=local'
        search(host)

    if not os.path.exists('ttf'):
        os.mkdir('ttf')
    os.chdir('ttf')

    def unzip(rawfile, outputdir):
        if zipfile.is_zipfile(rawfile):
            print 'yes'
            fz = zipfile.ZipFile(rawfile, 'r')
            for files in fz.namelist():
                print(files)  # print each entry in the zip archive
                fz.extract(files, outputdir)  # extract the file
            fz.close()
        else:
            print 'no'

    for i in items:
        print i
        request = urllib2.Request('http://www.fontsquirrel.com' + i)
        response = urllib2.urlopen(request)
        html = response.read()
        name = i.split('/')[-1] + '.zip'
        f = open(name, 'wb')  # binary mode: the response body is a zip archive
        f.write(html)
        f.close()  # close the file first; otherwise unzip cannot read it reliably
        unzip(name, './')
        os.remove(name)

    os.chdir('../')
    for i in os.listdir('ttf'):  # delete everything that is not a font file
        path = os.path.join('ttf', i)
        if not (i.split('.')[-1] == 'ttf' or i.split('.')[-1] == 'otf'):
            if os.path.isdir(path):
                shutil.rmtree(path)  # extracted folders are removed together with their contents
            else:
                os.remove(path)

    print len(os.listdir('ttf'))
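The download-then-unzip step can also be done without writing the archive to disk first. The following is a minimal sketch for Python 3 (the article's script targets Python 2), not the author's method: the helper name `extract_fonts` and the in-memory test archive are hypothetical, and the sketch assumes that only `.ttf` and `.otf` members are wanted.

```python
import io
import os
import tempfile
import zipfile

FONT_EXTENSIONS = ('.ttf', '.otf')

def extract_fonts(zip_bytes, output_dir):
    """Extract only .ttf/.otf members of a zip archive, flattened into output_dir."""
    saved = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for member in archive.namelist():
            name = os.path.basename(member)  # drop any folder prefix inside the archive
            if not name.lower().endswith(FONT_EXTENSIONS):
                continue  # skip licenses, readmes, and directory entries
            with open(os.path.join(output_dir, name), 'wb') as out:
                out.write(archive.read(member))
            saved.append(name)
    return sorted(saved)

# Build a small in-memory archive with fake contents to exercise the function.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as archive:
    archive.writestr('Alpha/Alpha-Regular.ttf', b'fake font data')
    archive.writestr('Alpha/license.txt', b'license text')
    archive.writestr('Alpha-Bold.OTF', b'fake font data')

demo_dir = tempfile.mkdtemp()
extracted = extract_fonts(buffer.getvalue(), demo_dir)
print(extracted)  # ['Alpha-Bold.OTF', 'Alpha-Regular.ttf']
```

Filtering while extracting means no separate cleanup pass is needed afterwards, and taking only the base name of each member keeps every font in one flat folder.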

This yields more than 2,000 fonts in a wide range of styles.

Website 2: dafont.com

This website offers a wide variety of fonts and is easy to download from, although the encoding of the file names seems to be a problem.

    # coding: utf-8
    import urllib2, cookielib, re, os, zipfile

    # set up a cookie-aware opener that sends a browser User-Agent
    cj = cookielib.CookieJar()
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
    opener.addheaders = [('User-agent', 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36')]
    urllib2.install_opener(opener)

    items = []  # collected download links

    def search(path):
        request = urllib2.Request(path)
        response = urllib2.urlopen(request)
        html = response.read()
        html = html.replace('\n', '')
        urls = re.findall(r'href=\"(http://dl.dafont.com/dl/\?f=.*?)\">', html)
        items.extend(urls)

    for i in xrange(117):
        host = 'http://www.dafont.com/new.php?page=' + str(i + 1)
        search(host)
        print 'page ' + str(i + 1) + ' done'

    items = list(set(items))  # remove duplicate links
    print len(items)

    if not os.path.exists('ttf2'):
        os.mkdir('ttf2')
    os.chdir('ttf2')

    def unzip(rawfile, outputdir):
        if zipfile.is_zipfile(rawfile):
            print 'yes'
            fz = zipfile.ZipFile(rawfile, 'r')
            for files in fz.namelist():
                print(files)  # print each entry in the zip archive
                fz.extract(files, outputdir)
            fz.close()
        else:
            print 'no'

    for i in items:
        print i
        request = urllib2.Request(i)
        response = urllib2.urlopen(request)
        html = response.read()
        name = i.split('=')[-1] + '.zip'  # the font name follows "f=" in the URL
        f = open(name, 'wb')  # binary mode: the response body is a zip archive
        f.write(html)
        f.close()
        unzip(name, './')
        os.remove(name)

    print os.listdir(os.getcwd())

    for root, dirs, fs in os.walk('./'):  # recursively traverse the folder
        for i in fs:
            if not (i.split('.')[-1] == 'ttf' or i.split('.')[-1] == 'otf'):
                os.remove(os.path.join(root, i))
                print i

    for i in os.listdir('./'):
        if os.path.isdir(i) and not os.listdir(i):  # os.rmdir only works on empty folders
            os.rmdir(i)

    os.chdir('../')
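Two small steps in this script, removing duplicate links and naming each archive after the `f` parameter of its URL, can be shown in isolation. The following is a minimal Python 3 sketch (the article's script targets Python 2); the helper names `archive_name` and `unique` and the sample font names are hypothetical.

```python
from urllib.parse import urlparse, parse_qs

def archive_name(download_url):
    """Derive a local zip file name from the 'f' query parameter of a download URL."""
    query = parse_qs(urlparse(download_url).query)
    return query['f'][0] + '.zip'

def unique(urls):
    """Remove duplicates while keeping first-seen order (a plain set() would not)."""
    return list(dict.fromkeys(urls))

# Hypothetical links in the shape matched by the regular expression in the script.
links = [
    'http://dl.dafont.com/dl/?f=alpha_font',
    'http://dl.dafont.com/dl/?f=beta_font',
    'http://dl.dafont.com/dl/?f=alpha_font',
]
names = [archive_name(u) for u in unique(links)]
print(names)  # ['alpha_font.zip', 'beta_font.zip']
```

Parsing the query string instead of splitting on `=` still works if the URL ever carries additional parameters, and an order-preserving deduplication makes repeated runs download in the same sequence.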

The overall procedure is similar to the previous one. It took a few dozen minutes to download more than 4,000 fonts.
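Both scripts end with a cleanup pass that keeps only `.ttf` and `.otf` files. One possible variant, sketched here for Python 3 (the article's scripts target Python 2), also moves fonts out of nested folders so they are not lost when the folders are removed. The function name `keep_only_fonts` and the demo folder layout are hypothetical, and the sketch assumes font file names do not collide across folders.

```python
import os
import shutil
import tempfile

FONT_EXTENSIONS = ('.ttf', '.otf')

def keep_only_fonts(folder):
    """Move font files from subfolders up into `folder`, then delete everything else."""
    # topdown=False visits the deepest folders first, so each subfolder is
    # already empty by the time os.rmdir() is called on it.
    for root, _dirs, files in os.walk(folder, topdown=False):
        for name in files:
            path = os.path.join(root, name)
            if not name.lower().endswith(FONT_EXTENSIONS):
                os.remove(path)
            elif root != folder:
                # Assumes no name collisions between folders.
                shutil.move(path, os.path.join(folder, name))
        if root != folder:
            os.rmdir(root)
    return sorted(os.listdir(folder))

# Hypothetical layout: fonts and other files at several nesting levels.
demo = tempfile.mkdtemp()
os.makedirs(os.path.join(demo, 'pack', 'extras'))
for rel in ('a.ttf', 'readme.txt', 'pack/b.otf', 'pack/extras/c.TTF', 'pack/extras/notes.html'):
    with open(os.path.join(demo, *rel.split('/')), 'wb') as handle:
        handle.write(b'x')

remaining = keep_only_fonts(demo)
print(remaining)  # ['a.ttf', 'b.otf', 'c.TTF']
```

Comparing the lower-cased name against the extensions also keeps fonts whose extension is written in capitals, which a check like `split('.')[-1] == 'ttf'` would delete.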

Summary

That is all for this article. I hope it helps you learn or use Python. If you have any questions, please leave a message. Thank you for your support.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
