Crawling font files with a Python script

Source: Internet
Author: User

Preface

As anyone who has tried it knows, improving CAPTCHA recognition accuracy first requires enough test data. Downloading CAPTCHA images is easy, but labeling them by hand is impractical, so I settled on a compromise: generating my own CAPTCHAs.

To ensure diversity, we first need a variety of fonts. Font files in formats such as TTF can be used directly, and many of them are available for download online. Rather than downloading and unzipping them by hand, I wrote a crawler.

Implementation Method

Website 1: fontsquirrel.com

Fonts on this site can be downloaded for free, but many of the download links point off-site to other websites. Those links are skipped.
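The filtering rule described above (keep links whose text offers a download, skip the ones marked as off-site, and drop duplicates) can be sketched on its own. This is a minimal Python 3 sketch under stated assumptions, not the site's actual markup: the function name `local_download_links` and the sample link texts are made up for illustration.

```python
def local_download_links(links):
    """Return de-duplicated URLs whose link text offers a local download.

    `links` is an iterable of (url, link_text) pairs, such as the tuples
    returned by re.findall with two capture groups.
    """
    kept = []
    for url, text in links:
        is_download = 'Download' in text
        is_offsite = 'offsite' in text
        if is_download and not is_offsite and url not in kept:
            kept.append(url)
    return kept


# Hypothetical sample data, invented for this example
sample = [
    ('/fonts/download/font-a', 'Download TTF'),
    ('/fonts/offsite/font-b', 'Download offsite'),
    ('/fonts/download/font-a', 'Download TTF'),  # duplicate entry
    ('/fonts/font-c', 'View details'),
]
print(local_download_links(sample))
```

Keeping the rule in a small function like this makes it easy to check against a handful of sample links before running the full crawl.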

    # coding: utf-8
    # Python 2 script
    import urllib2, cookielib, re, os, shutil, zipfile

    # Set up cookie handling and a browser-like User-Agent
    cj = cookielib.CookieJar()
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
    opener.addheaders = [('User-agent', 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36')]
    urllib2.install_opener(opener)

    # Search a listing page for downloadable links
    def search(path):
        request = urllib2.Request(path)
        response = urllib2.urlopen(request)
        html = response.read()
        html = html.replace('\n', '')  # strip newlines, because the regex matches within a single line
        urls = re.findall(r'<a href="(.*?)" rel="external nofollow">(.*?)</a>', html)
        for i in urls:
            url, inner = i
            # keep "Download" links that are not marked "offsite" and not yet collected
            if not re.findall(r'Download', inner) == [] and re.findall(r'offsite', inner) == [] and url not in items:
                items.append(url)

    items = []  # collected download links
    for i in xrange(15):
        host = 'http://www.fontsquirrel.com/fonts/list/find_fonts/' + str(i * 50) + '?filter%5Bdownload%5D=local'
        search(host)

    if not os.path.exists('ttf'):
        os.mkdir('ttf')
    os.chdir('ttf')

    def unzip(rawfile, outputdir):
        if zipfile.is_zipfile(rawfile):
            print 'yes'
            fz = zipfile.ZipFile(rawfile, 'r')
            for files in fz.namelist():
                print(files)  # print each entry in the zip archive
                fz.extract(files, outputdir)  # extract the file
        else:
            print 'no'

    for i in items:
        print i
        request = urllib2.Request('http://www.fontsquirrel.com' + i)
        response = urllib2.urlopen(request)
        html = response.read()
        name = i.split('/')[-1] + '.zip'
        f = open(name, 'wb')  # the archive is binary data
        f.write(html)
        f.close()  # close the file first, otherwise unzip fails
        unzip(name, './')
        os.remove(name)

    os.chdir('../')
    for i in os.listdir('ttf/'):  # delete everything that is not a font file
        path = os.path.join('ttf', i)
        if not (i.split('.')[-1] == 'ttf' or i.split('.')[-1] == 'otf'):
            if os.path.isdir(path):
                shutil.rmtree(path)  # removes the directory and its contents
            else:
                os.remove(path)

    print len(os.listdir('ttf/'))
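The script above extracts every member of each archive and deletes the non-font files afterwards. As an alternative, the selection can happen at extraction time. The following is a minimal Python 3 sketch of that idea, not the author's method: `extract_fonts` and `FONT_EXTENSIONS` are hypothetical names, and the demo archive contains dummy bytes rather than real fonts.

```python
import io
import os
import tempfile
import zipfile

FONT_EXTENSIONS = ('.ttf', '.otf')


def extract_fonts(zip_bytes, output_dir):
    """Extract only the .ttf/.otf members of a zip archive held in memory."""
    extracted = []
    if not zipfile.is_zipfile(io.BytesIO(zip_bytes)):
        return extracted  # not a zip archive, nothing to extract
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for member in archive.namelist():
            if member.lower().endswith(FONT_EXTENSIONS):
                archive.extract(member, output_dir)
                extracted.append(member)
    return extracted


# Build a small archive in memory for demonstration (dummy contents)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as demo:
    demo.writestr('Example-Regular.ttf', b'dummy font data')
    demo.writestr('License.txt', b'dummy license text')

with tempfile.TemporaryDirectory() as tmp:
    names = extract_fonts(buffer.getvalue(), tmp)
    print(names, os.listdir(tmp))
```

Working on the downloaded bytes directly also avoids writing the temporary `.zip` file to disk, at the cost of holding each archive in memory.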

This yields more than 2,000 fonts in a wide range of styles.

Website 2: dafont.com

This site offers a wide variety of font styles and is easy to download from. The only apparent problem is the encoding of the file names.
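The article does not say what exactly goes wrong with the file names. One common cause with zip archives, offered here only as an assumption, is that member names stored as UTF-8 without the UTF-8 flag are decoded by Python 3's `zipfile` as CP437, which produces garbled names. The sketch below tries to reverse that; `fix_zip_name` is a hypothetical helper, and the font name is made up.

```python
def fix_zip_name(name):
    """Try to recover a UTF-8 file name that was decoded as CP437.

    Names that cannot be round-tripped are returned unchanged.
    """
    try:
        return name.encode('cp437').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        return name


# Simulate the mis-decoding of a made-up font name
original = 'Café Sans.ttf'
garbled = original.encode('utf-8').decode('cp437')
print(repr(garbled))
print(fix_zip_name(garbled))
```

Plain ASCII names pass through unchanged, so the helper can be applied to every member name before it is written to disk.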

    # coding: utf-8
    # Python 2 script
    import urllib2, cookielib, re, os, zipfile

    cj = cookielib.CookieJar()
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
    opener.addheaders = [('User-agent', 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36')]
    urllib2.install_opener(opener)

    items = []  # collected download links

    def search(path):
        request = urllib2.Request(path)
        response = urllib2.urlopen(request)
        html = response.read()
        html = html.replace('\n', '')
        urls = re.findall(r'href="(http://dl.dafont.com/dl/\?f=.*?)"\s*>', html)
        items.extend(urls)

    for i in xrange(117):
        host = 'http://www.dafont.com/new.php?page=' + str(i + 1)
        search(host)
        print 'page ' + str(i + 1) + ' done'

    items = list(set(items))  # remove duplicate links
    print len(items)

    if not os.path.exists('ttf2'):
        os.mkdir('ttf2')
    os.chdir('ttf2')

    def unzip(rawfile, outputdir):
        if zipfile.is_zipfile(rawfile):
            print 'yes'
            fz = zipfile.ZipFile(rawfile, 'r')
            for files in fz.namelist():
                print(files)  # print each entry in the zip archive
                fz.extract(files, outputdir)
        else:
            print 'no'

    for i in items:
        print i
        request = urllib2.Request(i)
        response = urllib2.urlopen(request)
        html = response.read()
        name = i.split('=')[-1] + '.zip'  # the font name after "f=" in the URL
        f = open(name, 'wb')  # the archive is binary data
        f.write(html)
        f.close()
        unzip(name, './')
        os.remove(name)

    print os.listdir(os.getcwd())

    for root, dire, fis in os.walk('./'):  # recursively traverse the folder
        for i in fis:
            if not (i.split('.')[-1] == 'ttf' or i.split('.')[-1] == 'otf'):
                os.remove(os.path.join(root, i))
                print i

    for i in os.listdir('./'):
        if os.path.isdir(i):
            try:
                os.rmdir(i)  # only succeeds for empty directories
            except OSError:
                pass  # directory still contains files, keep it

    os.chdir('../')
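The script collects links of the form `http://dl.dafont.com/dl/?f=...` and removes duplicates with `set` before downloading. One way to derive a local archive name from such a link is to read the `f` query parameter. This is a minimal Python 3 sketch under that assumption: `archive_name` is a hypothetical helper, and the font identifiers in the sample links are made up.

```python
from urllib.parse import urlparse, parse_qs


def archive_name(download_url):
    """Derive a local zip file name from the `f` query parameter."""
    query = parse_qs(urlparse(download_url).query)
    font_id = query['f'][0]
    return font_id + '.zip'


# Hypothetical links; 'example_font' and 'other_font' are invented names
links = [
    'http://dl.dafont.com/dl/?f=example_font',
    'http://dl.dafont.com/dl/?f=other_font',
    'http://dl.dafont.com/dl/?f=example_font',  # same font found on a second page
]
unique_links = sorted(set(links))  # sorted only to make the order predictable
print([archive_name(u) for u in unique_links])
```

Parsing the query string is a little more robust than splitting the URL on a single character, because it keeps working if further parameters are appended to the link.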

The overall procedure is similar to the previous one. It took a few dozen minutes to download more than 4,000 fonts.

Summary

That is all for this article. I hope it helps you learn or use Python. If you have any questions, please leave a comment. Thank you for your support.
