Python Crawler Pragmatic Series IV




Description:

In the previous post, we crawled all of the company links from a single Ganji.com listing page, downloaded those pages, and wrote the analyzed results into Excel.

The objective of this session:
In this part we will use Python multithreading to crawl and analyze links from Ganji.com. Note that the number of links we can capture this time can be far greater than in the previous post.
Analysis:

For crawler statistics, the more data the better, naturally. To get more data, we first look at how to reach the thousands of company links on Ganji.com.

Open the listing page (http://bj.ganji.com/danbaobaoxian/o1/) and you can see a row of page links at the bottom, such as:

A little analysis shows that each paging link has the form a+b, where a is http://bj.ganji.com/danbaobaoxian/ and b is oi, with i a page number. After checking, the range of i is roughly [1, 300+]. Using these links we can visit each listing page and collect the company page links it contains. But here is the problem: each listing page carries more than 90 companies. Suppose we crawl the company links from 10 listing pages, and each company takes about 0.2 s to download, analyze, and write to Excel; the total is then 180 s (= 0.2 * 10 * 90), and it gets even longer when the connection is slow. Consequently, we need multiple threads to handle the problem.
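To make the arithmetic concrete, here is a minimal sketch (not part of the original program) that builds the paging URLs in the a+b form described above and reproduces the 180 s estimate; the page count of 10, the 90 companies per page, and the 0.2 s per company are simply the figures assumed in the paragraph:

# -*- coding: utf-8 -*-
# Sketch only: a is the fixed prefix, b is 'o' plus the page number i.
base = 'http://bj.ganji.com/danbaobaoxian/'
page_urls = [base + 'o%d/' % i for i in range(1, 11)]   # the first 10 listing pages

# Rough serial-time estimate using the figures from the text above.
pages, companies_per_page, seconds_per_company = 10, 90, 0.2
print 'estimated serial time: %.0f s' % (pages * companies_per_page * seconds_per_company)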

For an introduction to Python multithreading, see the W3Cschool Python multithreading tutorial.
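As a quick refresher, here is a minimal sketch of the pattern the crawler below relies on: subclass threading.Thread, override __init__() and run(), then call start() and join(). The DemoThread class and its fake workload are made up purely for illustration (Python 2, like the rest of the post):

import threading
from time import sleep

class DemoThread(threading.Thread):
    def __init__(self, mark):
        threading.Thread.__init__(self)   # initialize the base Thread class first
        self.mark = mark

    def run(self):
        # run() holds the work each thread performs once start() is called
        sleep(0.1)
        print 'thread %d finished' % self.mark

threads = [DemoThread(i) for i in range(2)]
for t in threads:
    t.start()    # launch the threads concurrently
for t in threads:
    t.join()     # wait for every thread to finish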

To meet the needs of this crawler, I made the following changes on the basis of the original code.

    • Multithreading

With multithreading, each thread handles downloading the company links on one listing page and extracting and writing their information; this concurrent processing makes the program more efficient and able to crawl more information.

    • Crawler class

In the previous posts we used a separate downloader class and analyzer class: we had to run the downloader first and then the analyzer. These two operations can really be abstracted into a single crawl of the Ganji.com information, and we would also like both to run from one program, which reduces the complexity of operating the crawler.

So we build a Ganji.com crawler class that aggregates the download and analysis functionality. To accommodate multithreading, the class inherits from threading.Thread and overrides the __init__() and run() methods to meet the needs of concurrent downloading.

    • Reuse of code

While designing the crawler, we found that many functions in the original code could not simply be pasted in for reuse; their reusability was poor, so several functions had to be refactored.

For downloading, we used to call getPages() to open the URL and keep the fetched page in memory, and then savePages() to save that page to a specified location on disk. Since urlretrieve() can download a page directly to a given location on disk, a single download_pages() now gets the whole job done.
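In other words, one urlretrieve() call now covers what getPages() plus savePages() used to do. A minimal sketch of that idea (the URL, file name, and folder here are only examples, not the program's real configuration):

from urllib import urlretrieve   # Python 2; Python 3 moved it to urllib.request

def download_pages(url, fname, location):
    # Fetch the URL and write it straight to disk in a single step.
    try:
        urlretrieve(url, location + fname)
    except Exception, e:
        print 'download page error:', url

download_pages('http://bj.ganji.com/danbaobaoxian/o1/', 'main_demo.txt', r'pagestroage\\')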
Code:
# -*- coding: utf-8 -*-
# Note: the Ganji.com listing page is called the "main page" below; the company
# pages it links to are called "sub-pages".
import os
import re
import sys
import xlwt
import xlrd
import threading
from bs4 import BeautifulSoup
from time import sleep, ctime
from urllib import urlopen, urlretrieve

reload(sys)
sys.setdefaultencoding('utf-8')


class GanjiwangCrawler(threading.Thread):
    # url is the main page this thread downloads, mark identifies the thread,
    # location is the folder the files are downloaded to, exname is the Excel
    # file name, wb is the Excel workbook object and ws its sheet object.
    def __init__(self, url, mark, location, exname, ws, wb):
        threading.Thread.__init__(self)
        self.url = url
        self.mark = mark
        self.location = location
        self.subUrls = []
        self.exname = exname
        self.wb = wb
        self.ws = ws

    def run(self):
        # Download the main page first.
        self.download_pages(self.url, 'main%s.txt' % str(self.mark), self.location)
        # Analyze the main page and collect the company URLs it contains.
        self.subUrls = self.analysis_main_pages('main%s.txt' % str(self.mark), self.location)
        # Download each sub-page, then analyze it and write the result to Excel.
        for i, su in enumerate(self.subUrls):
            self.download_pages(su, r'file%s%s.txt' % (str(self.mark), str(i)), self.location)
            self.analysis_sub_pages(r'file%s%s.txt' % (str(self.mark), str(i)), self.location)

    def analysis_main_pages(self, fname, location):
        subUrls = []
        filepath = location + fname
        if os.path.exists(filepath):
            fobj = open(filepath, 'r')
            lines = fobj.readlines()
            fobj.close()
            soup = BeautifulSoup(''.join(lines))
            leftBox = soup.find(attrs={'class': 'leftBox'})
            list_ = leftBox.find(attrs={'class': 'list'})
            li = list_.find_all('li')
            href_regex = r'href="(.*?)"'
            for l in li:
                subUrls.append('http://bj.ganji.com' + re.search(href_regex, str(l)).group(1))
        else:
            print('The file is missing')
        # Too many requests get rejected by Ganji.com, so limit the number of
        # companies crawled per main page (take 10 here).
        return subUrls if len(subUrls) < 10 else subUrls[0:10]

    def download_pages(self, url, fname, location):
        try:
            urlretrieve(url, location + fname)
        except Exception, e:
            print 'download page error:', url

    def write_to_excel(self, record, row):
        '''Store every value of the record dictionary in the given row of the Excel sheet.'''
        # Company name
        companyName = record['companyName']
        self.ws.write(row, 0, companyName)
        # Service feature
        serviceFeature = record['serviceFeature']
        self.ws.write(row, 1, serviceFeature)
        # Service scope
        serviceScope = ','.join(record['serviceScope'])
        self.ws.write(row, 2, serviceScope)
        # Contact person
        contacts = record['contacts']
        self.ws.write(row, 3, contacts.decode("utf-8"))
        # Merchant address
        address = record['address']
        self.ws.write(row, 4, address.decode("utf-8"))
        # QQ number
        qqNum = record['qqNum']
        self.ws.write(row, 5, qqNum)
        # Contact phone
        phoneNum = record['phoneNum']
        phoneNum = str(phoneNum).encode("utf-8")
        self.ws.write(row, 6, phoneNum.decode("utf-8"))
        # Company website
        companySite = record['companySite']
        self.ws.write(row, 7, companySite)
        self.wb.save(self.exname)

    def analysis_sub_pages(self, subfname, location):
        filepath = location + subfname
        f = open(filepath, 'r')
        lines = f.readlines()
        f.close()
        # Build a BeautifulSoup parse tree and locate the contact-info list (li).
        try:
            soup = BeautifulSoup(''.join(lines))
            body = soup.body
            wrapper = soup.find(id="wrapper")
            clearfix = wrapper.find_all(attrs={'class': 'd-left-box'})[0]
            dzcontactus = clearfix.find(id="dzcontactus")
            con = dzcontactus.find(attrs={'class': 'con'})
            ul = con.find('ul')
            li = ul.find_all('li')
        except Exception, e:
            # The page does not follow the usual layout, so ignore it.
            return None
        # If the page does not match our generic model, cancel the analysis.
        if len(li) != 10:
            return None
        # All information about one company is kept in a dictionary and accessed
        # by key; a list would work as well.
        record = {}
        # Company name
        companyName = li[1].find('h1').contents[0]
        record['companyName'] = companyName
        # Service feature
        serviceFeature = li[2].find('p').contents[0]
        record['serviceFeature'] = serviceFeature
        # Services provided
        serviceProvider = []
        serviceProviderResultSet = li[3].find_all('a')
        for service in serviceProviderResultSet:
            serviceProvider.append(service.contents[0])
        record['serviceProvider'] = serviceProvider
        # Service scope
        serviceScope = []
        serviceScopeResultSet = li[4].find_all('a')
        for scope in serviceScopeResultSet:
            serviceScope.append(scope.contents[0])
        record['serviceScope'] = serviceScope
        # Contact person
        contacts = li[5].find('p').contents[0]
        contacts = str(contacts).strip().encode("utf-8")
        record['contacts'] = contacts
        # Merchant address
        addressResultSet = li[6].find('p')
        re_h = re.compile('</?\w+[^>]*>')  # strips HTML tags
        address = re_h.sub('', str(addressResultSet))
        record['address'] = address.encode("utf-8")
        restli = ''
        for l in range(8, len(li) - 1):
            restli += str(li[l])
        # QQ number
        qqNumResultSet = restli
        qq_regex = '(\d{5,10})'
        qqNum = re.search(qq_regex, qqNumResultSet).group()
        record['qqNum'] = qqNum
        # Contact phone
        phone_regex = '1[3|5|7|8|][0-9]{9}'
        phoneNum = re.search(phone_regex, restli).group()
        record['phoneNum'] = phoneNum
        # Company website
        companySite = li[len(li) - 1].find('a').contents[0]
        record['companySite'] = companySite
        # Append the company record to the Excel file.
        openExcel = xlrd.open_workbook(self.exname)
        table = openExcel.sheet_by_name(r'companyInfoSheet')
        self.write_to_excel(record, table.nrows)


def init_excel(exname):
    '''Create the workbook and write a header row with its own font.'''
    wb = xlwt.Workbook()
    ws = wb.add_sheet(r'companyInfoSheet')
    # Initialize the header style.
    style = xlwt.XFStyle()
    # Create a font for the style.
    font = xlwt.Font()
    font.name = 'Times New Roman'
    font.bold = True
    # Attach the font to the style, then write the header cells with it.
    style.font = font
    ws.write(0, 0, u'company name', style)
    ws.write(0, 1, u'service feature', style)
    ws.write(0, 2, u'service scope', style)
    ws.write(0, 3, u'contacts', style)
    ws.write(0, 4, u'merchant address', style)
    ws.write(0, 5, u'QQ', style)
    ws.write(0, 6, u'contact phone', style)
    ws.write(0, 7, u'company site', style)
    wb.save(exname)
    return [ws, wb]


def main():
    '''Start the crawler threads.'''
    exname = r'info.xls'
    print 'start crawler'
    excels = init_excel(exname)
    # Build the main-page URLs. pages is the number of Ganji.com listing pages
    # to download (can be set up to 300+); it is also the number of threads.
    urls = []
    pages = 2
    nloops = xrange(pages)
    for i in nloops:
        url = 'http://bj.ganji.com/danbaobaoxian/o%s/' % str(i + 1)
        urls.append(url)
    threads = []
    for i in nloops:
        t = GanjiwangCrawler(urls[i], mark=i, location=r'pagestroage\\',
                             exname=exname, ws=excels[0], wb=excels[1])
        threads.append(t)
    for i in nloops:
        threads[i].start()
    for i in nloops:
        threads[i].join()
    print 'OK, everything was done'


if __name__ == '__main__':
    main()
Operation Result:

Two files, main0.txt and main1.txt, are downloaded into the pagestroage folder, corresponding to the two threads. Files file0i.txt and file1j.txt are downloaded as well, where i runs from 0 to 9 and j from 0 to 9. In other words, the two threads resolved the company URLs from their main files and each downloaded the 10 company pages I configured. info.xls ends up holding 15 company records (pages that did not match the expected layout are skipped by the analyzer).


My Files directory:



An aside:

When I ran the multithreaded download myself, I found that the program often exited right away. It turned out the URL requests it issued were being rejected by Ganji.com, which answered with a robot-check page, such as:

Clearly Ganji.com puts some limit on crawl intensity; we can add sleep calls to the program to slow down the page downloads.
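For example, a throttled download loop might look like the following sketch (not taken from the original program; the one-second pause and the file names are arbitrary illustrations):

from time import sleep
from urllib import urlretrieve

urls = ['http://bj.ganji.com/danbaobaoxian/o%d/' % i for i in range(1, 4)]
for i, url in enumerate(urls):
    urlretrieve(url, r'pagestroage\\demo_main%d.txt' % i)
    sleep(1)   # arbitrary pause between requests to keep the crawl rate low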


Afterthoughts:

This is the first program I have written since exams ended at school, and it finally feels like there is a chance to do some proper programming again. The development process kept running into all sorts of strange problems: encoding issues, mixed tabs and spaces, and so on. Fortunately, all of them were sorted out in the end.

To be continued.


