[This article is from Sky Cloud's blog on Blog Park.]
The goal is to collect lawyers' phone numbers for cities across the country from the 64365 website (www.64365.com), using Python's lxml library to parse the HTML page content and crawl the "name + phone" pair for each lawyer on the city listing pages.
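To make the parsing step concrete, here is a small self-contained sketch. The HTML fragment is invented to match the XPath expressions used in the script below (the real 64365 markup may differ); it only shows how lxml plus XPath pull out the name and phone pairs.

# Minimal sketch: the HTML below is an assumption reconstructed from the
# XPath expressions in the full script, not the site's actual markup.
import lxml.html

sample_html = """
<div class="fl">
    <p><a href="/lawyer/1.aspx">Lawyer A</a></p>
</div>
<span class="law-tel">13800000000</span>
<div class="fl">
    <p><a href="/lawyer/2.aspx">Lawyer B</a></p>
</div>
<span class="law-tel"></span>
"""

html = lxml.html.fromstring(sample_html)
names = html.xpath('//div[@class="fl"]/p/a')      # lawyer name links
phones = html.xpath('//span[@class="law-tel"]')   # phone number spans
for name, phone in zip(names, phones):
    print(name.text, phone.text_content())        # empty text_content() means no phone listed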
The code is as follows:
# coding: utf-8
import os

import requests
import lxml.html


class MyError(Exception):
    def __init__(self, value):
        self.value = value

    def __str__(self):
        return repr(self.value)


def get_lawyers_info(url):
    """Parse one listing page and return a list of 'name: phone' lines."""
    r = requests.get(url)
    html = lxml.html.fromstring(r.content)
    phones = html.xpath('//span[@class="law-tel"]')
    names = html.xpath('//div[@class="fl"]/p/a')
    if len(phones) == len(names):
        phone_infos = [(names[i].text, phones[i].text_content()) for i in range(len(names))]
    else:
        error = "Lawyers amount is not equal to the amount of phone_nums: " + url
        raise MyError(error)
    phone_infos_list = []
    for phone_info in phone_infos:
        if phone_info[1] == "":
            info = phone_info[0] + ": " + "did not leave a phone number\r\n"
        else:
            info = phone_info[0] + ": " + phone_info[1] + "\r\n"
        print(info)
        phone_infos_list.append(info)
    return phone_infos_list


def get_pages_num(url):
    """Read the pagination bar and return the total number of pages (as a string)."""
    r = requests.get(url)
    html = lxml.html.fromstring(r.content)
    result = html.xpath('//div[@class="u-page"]/a[last()-1]')
    pages_num = result[0].text
    if pages_num.isdigit():
        return pages_num


def get_all_lawyers(cities):
    """Crawl every page of every city and append the results to lawyers_info.txt."""
    dir_path = os.path.abspath(os.path.dirname(__file__))
    print(dir_path)
    file_path = os.path.join(dir_path, "lawyers_info.txt")
    print(file_path)
    if os.path.exists(file_path):
        os.remove(file_path)  # start from a clean file on every run
    with open("lawyers_info.txt", "ab") as f:
        for city in cities:
            pages_num = get_pages_num("http://www.64365.com/" + city + "/lawyer/page_1.aspx")
            if pages_num:
                for i in range(int(pages_num)):
                    url = "http://www.64365.com/" + city + "/lawyer/page_" + str(i + 1) + ".aspx"
                    info = get_lawyers_info(url)
                    for each in info:
                        f.write(each.encode("gbk"))


if __name__ == '__main__':
    cities = ['beijing', 'shanghai', 'guangdong', 'guangzhou', 'shenzhen', 'wuhan',
              'hangzhou', 'ningbo', 'tianjin', 'nanjing', 'jiangsu', 'zhengzhou',
              'jinan', 'changsha', 'shenyang', 'chengdu', 'chongqing', 'xian']
    get_all_lawyers(cities)
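One detail worth a note is the a[last()-1] step in get_pages_num(): on a pagination bar whose last link is a "next page" button, the second-to-last link holds the highest page number. A small self-contained sketch of that idea (the pagination HTML below is invented for illustration, not taken from the site):

# Demonstrates the a[last()-1] selection used in get_pages_num().
import lxml.html

pager = lxml.html.fromstring(
    '<div class="u-page">'
    '<a href="#">1</a><a href="#">2</a><a href="#">15</a><a href="#">Next</a>'
    '</div>'
)
# last() is the "Next" link, so last()-1 is the highest page number.
last_page = pager.xpath('//div[@class="u-page"]/a[last()-1]')[0].text
print(last_page)  # -> 15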
This crawls the cities listed above; the results are saved to the lawyers_info.txt file in the current directory:
[Screenshot: crawl results, lawyer phone numbers from across the country]
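Since each line is written to the file as GBK-encoded bytes, reading the results back in Python 3 needs the encoding spelled out; a minimal example:

# lawyers_info.txt is written as GBK bytes by the script above, so pass the
# encoding explicitly when reading it back.
with open("lawyers_info.txt", encoding="gbk") as f:
    for line in f:
        print(line.rstrip())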