Python Crawler Pragmatic Series III


Description:

In the previous blog in this series, we learned to extract one company's information from Ganji.com and store it in Excel.

The objective of this session:

In this section, we will bulk download all of the company pages linked from the Ganji.com category homepage (note that this is not every company page on Ganji; we leave the rest for a later task), process the company information in batch, and save it all to Excel.

Note:

In the previous blog we only matched the information on a single company page from Ganji. Unfortunately, the number of entries in the contact module varies from company to company: some pages have 10 li elements, others 9 or 12, and not every company provides a QQ number or even a company name. So I have adjusted the code a little to make it fit most of the pages.
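
For example, a minimal sketch of this defensive style (extract_or_default is a hypothetical helper written for illustration, not the code used below): when a field such as the QQ number is missing, we return a default value instead of letting re.search return None and crash on .group().

# -*- coding: utf-8 -*-
# A hypothetical helper illustrating defensive extraction: pages that lack
# a field (QQ number, company name, ...) yield a default instead of a crash.
import re

def extract_or_default(pattern, text, default=u'N/A'):
    """Return the first captured group found in text, or a default if absent."""
    match = re.search(pattern, text)
    return match.group(1) if match else default

# QQ numbers are 5-10 digit strings; a page without one yields u'N/A'
print(extract_or_default(r'(\d{5,10})', '<li>QQ: 123456789</li>'))
print(extract_or_default(r'(\d{5,10})', '<li>no QQ on this page</li>'))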

Bulk Download Pages:

[Screenshot: the Ganji.com category listing page whose company links we will extract.]

We will extract the company links contained in this page and download them in bulk. This time we write a Download class that encapsulates the methods we need: we first give it the Ganji homepage link and download the homepage, then we parse the homepage for the company links it contains and save them, and finally we download each of those links into the Pagestroage folder.

Code:

# -*- coding: utf-8 -*-
import re
import os
from urllib import urlretrieve
from bs4 import BeautifulSoup


class Download(object):
    """Downloads the given URLs and saves them as the corresponding files."""

    def __init__(self):
        self.index = 0  # ordinal used to name the saved company pages

    def getPages(self, urls, isMain=False):
        """Download the given URLs; isMain marks the Ganji homepage."""
        for u in urls:
            try:
                retval = urlretrieve(u)[0]
            except IOError:
                retval = None
            if retval is not None and isMain:
                # this is the Ganji homepage
                self.savePages(retval, isMain=True)
            elif retval is not None:
                self.savePages(retval)
            else:
                print('Open URL error')

    def savePages(self, webpage, isMain=False):
        """Save the given (already downloaded) page."""
        f = open(webpage)
        lines = f.readlines()
        f.close()
        if not isMain:
            # company pages are stored in order as file<i>.txt, i is the ordinal
            fobj = open('Pagestroage\\file%s.txt' % str(self.index), 'w')
            self.index += 1
        else:
            # the Ganji homepage is stored as main.txt
            fobj = open('Pagestroage\\main.txt', 'w')
        fobj.writelines(lines)
        fobj.close()

    def getAllComUrls(self):
        """Parse the saved Ganji homepage and extract all company links."""
        if os.path.exists('Pagestroage\\main.txt'):  # make sure the file exists
            fobj = open('Pagestroage\\main.txt', 'r')
            lines = fobj.readlines()
            fobj.close()
            soup = BeautifulSoup(''.join(lines))
            body = soup.body
            # wrapper = soup.find(id="wrapper")
            leftBox = soup.find(attrs={'class': 'leftBox'})
            list_ = leftBox.find(attrs={'class': 'list'})
            ul = list_.find('ul')
            li = ul.find_all('li')
            href_regex = r'href="(.*?)"'
            urls = []
            for l in li:
                urls.append('http://bj.ganji.com' + re.search(href_regex, str(l)).group(1))
            # print urls
            return urls
        else:
            print('The file is missing')
            return None


if __name__ == '__main__':
    # initially set url to the Ganji homepage for this category
    url = ['http://bj.ganji.com/danbaobaoxian/o1/']
    # instantiate the download class
    download = Download()
    # download the Ganji homepage first
    download.getPages(url, True)
    # parse the downloaded homepage and extract all company URLs
    urls = download.getAllComUrls()
    # download the extracted URLs
    download.getPages(urls)
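
One detail worth noting: urlretrieve downloads a page into a temporary local file and returns that file's path, which is why savePages re-opens the returned path rather than a URL. When fetching many pages in a row it is also polite to pause between requests; a minimal sketch (fetch_politely and its 1-second delay are assumptions for illustration, not part of this series' code):

import time
from urllib import urlretrieve  # Python 2; in Python 3 it lives in urllib.request

def fetch_politely(urls, delay=1.0):
    """Download each URL to a temp file via urlretrieve, pausing in between.

    fetch_politely and the 1-second delay are illustrative assumptions.
    """
    paths = []
    for u in urls:
        try:
            paths.append(urlretrieve(u)[0])  # local path of the downloaded copy
        except IOError:
            paths.append(None)
        time.sleep(delay)  # avoid hammering the server
    return paths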

Operation Result:

We get more than 10 text files, each containing one company's web page. [Screenshot of the resulting files omitted.]

Analyzing the Web Pages:

Through the steps above we now have the HTML text of all of the company pages from Ganji. Next we use an Analysiser class to process this data. Note that the methods in the Analysiser class were essentially described in the previous blog; here they are just encapsulated in a class so the pages can be processed in batch.

Code:

# -*- coding: utf-8 -*-
import re
from bs4 import BeautifulSoup
import xlwt
import os
import sys

reload(sys)
sys.setdefaultencoding('utf-8')


class Analysiser(object):
    """Stores the downloaded company information into an Excel sheet."""

    def __init__(self):
        """Initialize the Excel workbook."""
        self.wb = xlwt.Workbook()
        self.ws = self.wb.add_sheet('companyInfoSheet')
        self.initExcel()

    def initExcel(self):
        """Give the sheet a header row, written in a distinct font."""
        # initialize a style
        style = xlwt.XFStyle()
        # create a font for the style
        font = xlwt.Font()
        font.name = 'Times New Roman'
        font.bold = True
        # attach the font to the style
        style.font = font
        # write the header cells
        self.ws.write(0, 0, u'company name', style)
        self.ws.write(0, 1, u'service feature', style)
        self.ws.write(0, 2, u'service range', style)
        self.ws.write(0, 3, u'contacts', style)
        self.ws.write(0, 4, u'merchant address', style)
        self.ws.write(0, 5, u'QQ', style)
        self.ws.write(0, 6, u'contact phone', style)
        self.ws.write(0, 7, u'company URL', style)
        self.wb.save('xinrui.xls')

    def analysAllFiles(self):
        """Parse the source of every saved page and extract the company info."""
        # list every file under Pagestroage (the folder of downloaded pages)
        fileNames = os.listdir('Pagestroage')
        # number of stored companies (minus main.txt, the Ganji homepage)
        counts = len(fileNames) - 1
        for i in range(counts):
            # open the file, read it into lines, close the file object
            f = open('Pagestroage\\file%s.txt' % i, 'r')
            lines = f.readlines()
            f.close()
            # the contact module of these two pages differs from the others;
            # matching them would need dedicated code, so they are skipped
            # (NOTE: the first index was lost in the original text; 5 is a
            # placeholder assumption)
            if i == 5 or i == 7:
                continue
            # build a BeautifulSoup parse tree and walk it:
            # soup --> body --> (div with id "wrapper")
            #      --> (div whose class is "clearfix")
            #      --> (div with id "dzcontactus") --> (div whose class is "con")
            #      --> ul --> (each li under the ul)
            soup = BeautifulSoup(''.join(lines))
            body = soup.body
            # body2 = soup.find('body')
            wrapper = soup.find(id='wrapper')
            clearfix = wrapper.find_all(attrs={'class': 'd-left-box'})[0]
            dzcontactus = clearfix.find(id='dzcontactus')
            con = dzcontactus.find(attrs={'class': 'con'})
            ul = con.find('ul')
            li = ul.find_all('li')
            # all information about one company, stored in a dict so it can
            # be accessed by key (a list would work as well)
            record = {}
            # company name
            companyName = li[1].find('h1').contents[0]
            record['companyName'] = companyName
            # service feature
            serviceFeature = li[2].find('p').contents[0]
            record['serviceFeature'] = serviceFeature
            # services provided
            serviceProvider = []
            serviceProviderResultSet = li[3].find_all('a')
            for service in serviceProviderResultSet:
                serviceProvider.append(service.contents[0])
            record['serviceProvider'] = serviceProvider
            # service range
            serviceScope = []
            serviceScopeResultSet = li[4].find_all('a')
            for scope in serviceScopeResultSet:
                serviceScope.append(scope.contents[0])
            record['serviceScope'] = serviceScope
            # contacts
            contacts = li[5].find('p').contents[0]
            contacts = str(contacts).strip().encode('utf-8')
            record['contacts'] = contacts
            # merchant address
            addressResultSet = li[6].find('p')
            re_h = re.compile(r'</?\w+[^>]*>')  # regex matching HTML tags
            address = re_h.sub('', str(addressResultSet))
            record['address'] = address.encode('utf-8')
            # concatenate the remaining li elements for regex matching
            restLi = ''
            for l in range(8, len(li) - 1):
                restLi += str(li[l])
            # merchant QQ
            qq_regex = r'(\d{5,10})'
            qqNum = re.search(qq_regex, restLi).group()
            record['qqNum'] = qqNum
            # contact phone (13x/15x/17x/18x mobile numbers)
            phone_regex = r'1[3578][0-9]{9}'
            phoneNum = re.search(phone_regex, restLi).group()
            record['phoneNum'] = phoneNum
            # company URL
            companySite = li[len(li) - 1].find('a').contents[0]
            record['companySite'] = companySite
            self.writeToExcel(record, i + 1)

    def writeToExcel(self, record, index):
        """Write every value of the given record dict into row `index`."""
        # company name
        companyName = record['companyName']
        self.ws.write(index, 0, companyName)
        # service feature
        serviceFeature = record['serviceFeature']
        self.ws.write(index, 1, serviceFeature)
        # service range
        serviceScope = ','.join(record['serviceScope'])
        self.ws.write(index, 2, serviceScope)
        # contacts
        contacts = record['contacts']
        self.ws.write(index, 3, contacts.decode('utf-8'))
        # merchant address
        address = record['address']
        self.ws.write(index, 4, address.decode('utf-8'))
        # QQ
        qqNum = record['qqNum']
        self.ws.write(index, 5, qqNum)
        # contact phone
        phoneNum = record['phoneNum']
        phoneNum = str(phoneNum).encode('utf-8')
        self.ws.write(index, 6, phoneNum.decode('utf-8'))
        # company URL
        companySite = record['companySite']
        self.ws.write(index, 7, companySite)
        self.wb.save('xinrui.xls')


if __name__ == '__main__':
    ana = Analysiser()
    ana.analysAllFiles()

Operation Result:

We get an Excel file, xinrui.xls, containing the information of every company linked from the Ganji homepage. [Screenshot of the spreadsheet omitted.]
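
If you want to sanity-check the spreadsheet from Python rather than opening it by hand, a minimal sketch using the xlrd package (an extra dependency, not used by the series itself) could look like this:

# Quick sanity check of the generated workbook, assuming the Analysiser
# above saved it as xinrui.xls in the current directory.
import xlrd

wb = xlrd.open_workbook('xinrui.xls')
ws = wb.sheet_by_index(0)
print('rows: %d, cols: %d' % (ws.nrows, ws.ncols))
# print the header row and the first data row, if present
for r in range(min(2, ws.nrows)):
    print([ws.cell_value(r, c) for c in range(ws.ncols)])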



Afterthoughts:

Seeing this Excel file feels pretty satisfying; we can finally do something practical with the crawler.

Still, we can do better and smarter.

To be continued.

