Crawling font files with a Python script
Preface
Anyone who has worked on CAPTCHA recognition has probably run into this: to improve recognition accuracy, you first need enough test data. Downloading CAPTCHA images is easy, but labeling them by hand is unacceptably tedious, so I settled on a compromise: generating the CAPTCHAs myself.
To keep the generated images diverse, we first need a variety of fonts. Font files in formats such as TTF can be used directly, and plenty of them are available for download online. Of course, I am not going to download and unzip them one by one by hand, so I wrote a crawler.
Implementation Method
Website 1: fontsquirrel.com
The fonts on this site can be downloaded for free, but many of the download links are external links to other websites. Those offsite links are ignored here.
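The filtering rule described here can be tried out in isolation. The following is only a minimal sketch in Python 3 (the article's own script targets Python 2). The sample HTML, its paths, and the function name `local_download_links` are made up for illustration, and the real page markup may well differ.

```python
import re

# Hypothetical sample of listing-page HTML; the real markup may differ.
SAMPLE_HTML = (
    '<a href="/fonts/download/open-sans">Download OTF</a>'
    '<a href="http://example.com/font">Download offsite</a>'
    '<a href="/fonts/open-sans">Open Sans</a>'
    '<a href="/fonts/download/open-sans">Download OTF</a>'
)

# Captures the href and the link text of each simple anchor tag.
ANCHOR_RE = re.compile(r'<a href="(.*?)">(.*?)</a>')


def local_download_links(html):
    """Return hrefs whose link text mentions Download but not offsite."""
    links = []
    for url, text in ANCHOR_RE.findall(html):
        is_local_download = "Download" in text and "offsite" not in text
        if is_local_download and url not in links:  # skip duplicates
            links.append(url)
    return links


print(local_download_links(SAMPLE_HTML))
```

Only the first link survives: the second is marked offsite, the third is not a download link, and the fourth is a duplicate.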
# coding: utf-8
# Python 2 script (uses urllib2 and cookielib)
import urllib2, cookielib, re, os, shutil, zipfile

# Set up an opener with cookie support and a browser-like User-Agent
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
opener.addheaders = [('User-agent', 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36')]
urllib2.install_opener(opener)

# Search a listing page for downloadable links
def search(path):
    request = urllib2.Request(path)
    response = urllib2.urlopen(request)
    html = response.read()
    html = html.replace('\n', '')  # remove all newlines, because the regex matches within a single line
    urls = re.findall(r'<a href="(.*?)" rel="external nofollow">(.*?)</a>', html)
    for i in urls:
        url, inner = i
        # keep links labeled "Download" that are not marked "offsite"
        if not re.findall(r'Download', inner) == [] and re.findall(r'offsite', inner) == [] and url not in items:
            items.append(url)

items = []  # collected download paths
for i in xrange(15):
    host = 'http://www.fontsquirrel.com/fonts/list/find_fonts/' + str(i * 50) + '?filter%5Bdownload%5D=local'
    search(host)

if not os.path.exists('ttf'):
    os.mkdir('ttf')
os.chdir('ttf')

def unzip(rawfile, outputdir):
    if zipfile.is_zipfile(rawfile):
        print 'yes'
        fz = zipfile.ZipFile(rawfile, 'r')
        for files in fz.namelist():
            print(files)  # print each entry in the zip archive
            fz.extract(files, outputdir)  # extract the file
        fz.close()
    else:
        print 'no'

for i in items:
    print i
    request = urllib2.Request('http://www.fontsquirrel.com' + i)
    response = urllib2.urlopen(request)
    html = response.read()
    name = i.split('/')[-1] + '.zip'
    f = open(name, 'wb')  # the archive is binary data
    f.write(html)
    f.close()  # close the file before unzipping it
    unzip(name, './')
    os.remove(name)

os.chdir('../')
files = os.listdir('ttf/')
for i in files:  # delete useless files
    if not (i.split('.')[-1] == 'ttf' or i.split('.')[-1] == 'otf'):
        if os.path.isdir('ttf/' + i):
            shutil.rmtree('ttf/' + i)  # directories may not be empty
        else:
            os.remove('ttf/' + i)
print len(os.listdir('ttf/'))
This yields more than 2,000 fonts in a wide variety of styles.
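The script above extracts every archive in full and deletes the non-font files afterwards. As an alternative, here is a hedged sketch in Python 3 that extracts only the `.ttf` and `.otf` members in the first place. The names `extract_fonts` and `demo` are hypothetical, and the archive is built in memory with fake data purely for demonstration.

```python
import io
import os
import tempfile
import zipfile

FONT_EXTENSIONS = (".ttf", ".otf")


def extract_fonts(zip_bytes, output_dir):
    """Extract only the font members of a zip archive (given as bytes).

    Members are written flat under their base name, so folders inside
    the archive do not create subdirectories in output_dir.
    """
    extracted = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for member in archive.namelist():
            base = os.path.basename(member)  # empty for directory entries
            if not base.lower().endswith(FONT_EXTENSIONS):
                continue
            target = os.path.join(output_dir, base)
            with archive.open(member) as src, open(target, "wb") as dst:
                dst.write(src.read())
            extracted.append(base)
    return extracted


def demo():
    """Build a small fake archive in memory and extract its fonts."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("Demo/Demo-Regular.ttf", b"fake font data")
        archive.writestr("Demo/license.txt", b"license text")
        archive.writestr("Demo-Bold.OTF", b"fake font data")
    with tempfile.TemporaryDirectory() as out:
        names = extract_fonts(buffer.getvalue(), out)
        return sorted(names), sorted(os.listdir(out))


print(demo())
```

Flattening the paths is a design choice of this sketch: two archives containing a font with the same base name would overwrite each other, so a real crawler might prefix the archive name.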
Website 2: dafont.com
This site has a large selection of font styles and is easy to download from, although the encoding of the file names seems to be a problem.
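One possible way to sidestep awkward characters in the archive names is to derive the local file name from the download URL and reduce it to plain ASCII. This is only a sketch in Python 3, and it assumes the download links have the form `http://dl.dafont.com/dl/?f=<name>`, as in the regular expression used by the script. The helper `archive_name` and its `default` parameter are hypothetical.

```python
import re
from urllib.parse import parse_qs, urlparse


def archive_name(download_url, default="font"):
    """Build an ASCII-only .zip file name from the URL's 'f' parameter."""
    query = parse_qs(urlparse(download_url).query)
    raw = query.get("f", [default])[0]
    # Replace anything outside a conservative ASCII set with an underscore.
    safe = re.sub(r"[^A-Za-z0-9._-]", "_", raw)
    return (safe or default) + ".zip"


print(archive_name("http://dl.dafont.com/dl/?f=some_font"))
print(archive_name("http://dl.dafont.com/dl/?f=caf%C3%A9"))
```

This only addresses the name of the downloaded archive. File names stored inside the archive are a separate matter and are not handled here.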
# coding: utf-8
# Python 2 script (uses urllib2 and cookielib)
import urllib2, cookielib, re, os, zipfile

cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
opener.addheaders = [('User-agent', 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36')]
urllib2.install_opener(opener)

items = []  # collected download URLs

def search(path):
    request = urllib2.Request(path)
    response = urllib2.urlopen(request)
    html = response.read()
    html = html.replace('\n', '')
    urls = re.findall(r'href=\"(http://dl.dafont.com/dl/\?f=.*?)\">', html)
    items.extend(urls)

for i in xrange(117):
    host = 'http://www.dafont.com/new.php?page=' + str(i + 1)
    search(host)
    print 'page' + str(i + 1) + 'done'
items = list(set(items))  # remove duplicates
print len(items)

if not os.path.exists('ttf2'):
    os.mkdir('ttf2')
os.chdir('ttf2')

def unzip(rawfile, outputdir):
    if zipfile.is_zipfile(rawfile):
        print 'yes'
        fz = zipfile.ZipFile(rawfile, 'r')
        for files in fz.namelist():
            print(files)  # print each entry in the zip archive
            fz.extract(files, outputdir)
        fz.close()
    else:
        print 'no'

for i in items:
    print i
    request = urllib2.Request(i)
    response = urllib2.urlopen(request)
    html = response.read()
    name = i.split('=')[-1] + '.zip'
    f = open(name, 'wb')  # the archive is binary data
    f.write(html)
    f.close()
    unzip(name, './')
    os.remove(name)

print os.listdir(os.getcwd())
for root, dire, fs in os.walk('./'):  # recursively traverse the folder
    for i in fs:
        if not (i.split('.')[-1] == 'ttf' or i.split('.')[-1] == 'otf'):
            os.remove(os.path.join(root, i))
            print i
for i in os.listdir('./'):
    if os.path.isdir(i):
        try:
            os.rmdir(i)  # only empty directories can be removed
        except OSError:
            pass  # the directory still contains font files
os.chdir('../')
The overall procedure is similar to the previous one. It took a few dozen minutes to collect more than 4,000 fonts.
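The cleanup step at the end of the script can also be written as a single bottom-up walk. The following is a rough sketch in Python 3, not the article's code: `prune_non_fonts` and `demo` are hypothetical names, and the demo runs against a throwaway temporary directory with fake files.

```python
import os
import tempfile

FONT_EXTENSIONS = (".ttf", ".otf")


def prune_non_fonts(top):
    """Delete non-font files under top and remove directories left empty."""
    removed = []
    # topdown=False visits subdirectories before their parents, so a
    # directory emptied here can be removed during the same pass.
    for root, _dirs, files in os.walk(top, topdown=False):
        for name in files:
            if not name.lower().endswith(FONT_EXTENSIONS):
                os.remove(os.path.join(root, name))
                removed.append(name)
        if root != top and not os.listdir(root):
            os.rmdir(root)
    return removed


def demo():
    """Create a small fake tree, prune it, and report what is left."""
    with tempfile.TemporaryDirectory() as top:
        os.makedirs(os.path.join(top, "pack", "docs"))
        samples = (
            "a.ttf",
            "readme.txt",
            os.path.join("pack", "b.otf"),
            os.path.join("pack", "docs", "license.txt"),
        )
        for rel in samples:
            with open(os.path.join(top, rel), "wb") as f:
                f.write(b"x")
        removed = prune_non_fonts(top)
        remaining = sorted(
            os.path.relpath(os.path.join(root, name), top)
            for root, _dirs, files in os.walk(top)
            for name in files
        )
        docs_exists = os.path.isdir(os.path.join(top, "pack", "docs"))
        return sorted(removed), remaining, docs_exists


print(demo())
```

Using `os.path.join(root, name)` avoids the path errors that plain string concatenation causes for files in subdirectories, and directories that still hold fonts are left in place.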
Summary
That covers both crawlers. I hope this article is useful to you when learning or using Python. If you have any questions, please leave a comment.