As a beginner I put this off for a long time before writing it up; recording it here so I don't have to re-learn everything the next time I forget. ~
Before learning to crawl, I first took a Python course, and then went through the crawler tutorials on imooc and NetEase Cloud Classroom. You can look those two up yourself.
It was hard to get started; after all, getting familiar with things takes time, and Python was new to me.
About the Python version: when I started, a lot of what I read said Python 2 was better because many libraries still did not support 3, but so far I find Python 3 more useful, mainly because of encoding issues; 2 is not as convenient as 3 there. And much of the Python 2 material found online can still be used with minor changes.
OK, on to crawling Baidu Baike.
The requirement here is to crawl all the information for N attractions in Beijing, where the names of the N attractions are given in a text file. No API is used; we just crawl the page content.
1. Get URLs based on keywords
Because we only need to crawl information and no interaction is involved, a simple approach is enough; there is no need to emulate a browser.
You can simply request

    http://baike.baidu.com/search/word?word=<keyword>

For each attraction name l in view_names, the URL is built like this:

    name = urllib.parse.quote(l)
    url = 'http://baike.baidu.com/search/word?word=' + name
Note that the keywords here are Chinese, so pay attention to encoding: characters such as spaces and Chinese cannot appear directly in a URL, so the quote function is needed.
About quote():
In Python 2.x the usage is urllib.quote(text); in Python 3.x it is urllib.parse.quote(text). By the standard, URLs may contain only a subset of ASCII characters (letters, digits and a few symbols); other characters (such as Chinese) do not conform to the URL standard, so using them in a URL requires URL encoding. The query-string part of a URL has the format name1=value1&name2=value2&name3=value3; if a name or value itself contains an "&" or "=", there will obviously be a problem, so those characters in the parameter string must be encoded as well. URL encoding converts each character that needs encoding into the form %XX. URL encoding is usually based on UTF-8 (this is, of course, platform and browser dependent).
Example:
For example, "I, Unicode is 0X6211,UTF-8 encoded as 0xe60x880x91,url encoding is%e6%88%91."
The Python urllib library provides two functions, quote and quote_plus, which differ in which characters they encode. No need to dig into the details; quote is enough here.
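Roughly like this (a minimal sketch; the attraction name 故宫 is just an example keyword, not one taken from the input file):

    import urllib.parse

    keyword = '故宫'                       # example keyword
    name = urllib.parse.quote(keyword)     # percent-encodes the UTF-8 bytes
    url = 'http://baike.baidu.com/search/word?word=' + name
    print(url)  # http://baike.baidu.com/search/word?word=%E6%95%85%E5%AE%AB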
2. Download URL
Easily implemented with the urllib library; see def download(self, url) in the code below.
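A minimal sketch of that step (it mirrors the download method in the full code at the end):

    import urllib.request

    def download(url):
        # return the raw HTML bytes, or None if the request fails
        if url is None:
            return None
        response = urllib.request.urlopen(url)
        if response.getcode() != 200:
            return None
        return response.read()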
3. Use BeautifulSoup to parse the HTML
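For example (a minimal sketch; the example URL is the one built in step 1, and the class names main-content / lemma-summary are the ones used in the full code below, reflecting the Baidu Baike page layout at the time):

    import urllib.request
    from bs4 import BeautifulSoup

    url = 'http://baike.baidu.com/search/word?word=%E6%95%85%E5%AE%AB'  # example from step 1
    html_count = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(html_count, 'html.parser')
    view_name = soup.find('div', class_="main-content").find('h1').get_text()  # attraction name
    summary = soup.find('div', class_="lemma-summary").get_text()              # short introduction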
4. Data analysis
The body of an encyclopedia entry is a series of parallel sections, so when crawling it cannot naturally be stored section by section (everything is tied together in the HTML). So regular expressions have to be used.
For the basics of regular expressions, see: http://www.cnblogs.com/huxi/archive/2010/07/04/1771073.html
The basic idea is to treat the entire HTML document as a string (str), use a regular expression to cut out the part you want, turn that fragment back into a BeautifulSoup object, and then process it further.
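A minimal sketch of that idea, continuing from the soup object in step 3 (the div class names are the ones used in the full code below):

    import re
    from bs4 import BeautifulSoup

    html_str = str(soup)  # the whole parsed page as one string
    # cut from the first level-2 section heading down to the album list at the bottom
    start = re.search(r'<div class="para-title level-2"', html_str)
    body = html_str[start.start():]
    end = re.search(r'<div class="album-list">', body)
    body = body[:end.start()]
    # split into level-2 sections and hand each fragment back to BeautifulSoup
    for block in body.split('<div class="para-title level-2">'):
        block_soup = BeautifulSoup(block, 'html.parser')
        # ... process each section further, as in the full code below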
It may take a while to get comfortable with the regular expressions.
There are a lot of details in the code that I had already forgotten and had to look up again; next time I should definitely write things up as I go, or at least right after finishing...
Here is the code:
    # coding:utf-8
    '''
    function: crawl the Baidu Baike pages of all Beijing attractions
    author: yi
    '''
    import urllib.request
    from urllib.request import urlopen
    from urllib.error import HTTPError
    import urllib.parse
    from bs4 import BeautifulSoup
    import re
    import codecs
    import json


    class BaikeCraw(object):
        def __init__(self):
            self.urls = set()
            self.view_datas = {}

        def craw(self, filename):
            urls = self.getUrls(filename)
            if urls is None:
                print("not found")
            else:
                for urll in urls:
                    print(urll)
                    try:
                        html_count = self.download(urll)
                        self.passer(urll, html_count)
                    except:
                        print("view does not exist")
                    '''file = self.view_datas["view_name"]
                    self.craw_pic(urll, file, html_count)
                    print(file)'''

        def getUrls(self, filename):
            new_urls = set()
            file_object = codecs.open(filename, encoding='utf-16')
            try:
                all_text = file_object.read()
            except:
                print("file open exception!")
                file_object.close()
            file_object.close()
            view_names = all_text.split(" ")
            for l in view_names:
                if '?' in l:
                    view_names.remove(l)
            for l in view_names:
                # build http://baike.baidu.com/search/word?word=<keyword> for each attraction name
                name = urllib.parse.quote(l)
                url = 'http://baike.baidu.com/search/word?word=' + name
                new_urls.add(url)
            print(new_urls)
            return new_urls

        def manger(self):
            pass

        def passer(self, urll, html_count):
            soup = BeautifulSoup(html_count, 'html.parser', from_encoding='utf_8')
            self._get_new_data(urll, soup)
            return

        def download(self, url):
            if url is None:
                return None
            response = urllib.request.urlopen(url)
            if response.getcode() != 200:
                return None
            return response.read()

        def _get_new_data(self, url, soup):  # extract the data
            if soup.find('div', class_="main-content").find('h1') is not None:
                self.view_datas["view_name"] = soup.find('div', class_="main-content").find('h1').get_text()  # attraction name
                print(self.view_datas["view_name"])
            else:
                self.view_datas["view_name"] = soup.find("div", class_="feature_poster").find("h1").get_text()
            self.view_datas["view_message"] = soup.find('div', class_="lemma-summary").get_text()  # short introduction
            self.view_datas["basic_message"] = soup.find('div', class_="basic-info cmn-clearfix").get_text()  # basic info box
            self.view_datas["basic_message"] = self.view_datas["basic_message"].split("\n")
            get = []
            for line in self.view_datas["basic_message"]:
                if line != "":
                    get.append(line)
            self.view_datas["basic_message"] = get
            i = 1
            get2 = []
            tmp = "percent"
            for line in self.view_datas["basic_message"]:
                # pair up the info-box lines: odd lines are field names, even lines are values
                if i % 2 == 1:
                    tmp = line
                else:
                    a = tmp + ":" + line
                    get2.append(a)
                i = i + 1
            self.view_datas["basic_message"] = get2
            self.view_datas["catalog"] = soup.find('div', class_="lemma-catalog").get_text().split("\n")  # table of contents
            get = []
            for line in self.view_datas["catalog"]:
                if line != "":
                    get.append(line)
            self.view_datas["catalog"] = get
            # ------------------ encyclopedia body content ------------------
            view_name = self.view_datas["view_name"]
            html = urllib.request.urlopen(url)
            soup2 = BeautifulSoup(html.read(), 'html.parser').decode('utf-8')
            p = re.compile(r'<div class="para-title level-2"', re.DOTALL)
            r = p.search(soup2)
            content_data_node = soup2[r.span(0)[0]:]  # first h2 (head)
            p = re.compile(r'<div class="album-list">', re.DOTALL)  # tail
            r = p.search(content_data_node)
            content_data = content_data_node[0:r.span(0)[0]]
            lists = content_data.split('<div class="para-title level-2">')
            i = 1
            for list in lists:  # each level-2 block
                final_soup = BeautifulSoup(list, "html.parser")
                name_list = None
                try:
                    part_name = final_soup.find('h2', class_="title-text").get_text().replace(view_name, '').strip()
                    part_data = final_soup.get_text().replace(view_name, '').replace(part_name, '').replace('编辑', '')  # e.g. the history section
                    name_list = final_soup.find_all('h3', class_="title-text")
                    all_name_list = {}
                    na = "part_name" + str(i)
                    all_name_list[na] = part_name
                    final_name_list = []
                    for nlist in name_list:
                        nlist = nlist.get_text().replace(view_name, '').strip()
                        final_name_list.append(nlist)
                    fin = "final_name_list" + str(i)
                    all_name_list[fin] = final_name_list
                    print(all_name_list)
                    i = i + 1
                    # body text
                    try:
                        p = re.compile(r'<div class="para-title level-3">', re.DOTALL)
                        final_soup = final_soup.decode('utf-8')
                        r = p.search(final_soup)
                        final_part_data = final_soup[r.span(0)[0]:]
                        part_lists = final_part_data.split('<div class="para-title level-3">')
                        for part_list in part_lists:
                            final_part_soup = BeautifulSoup(part_list, "html.parser")
                            content_lists = final_part_soup.find_all("div", class_="para")
                            for content_list in content_lists:  # each smallest paragraph
                                try:
                                    pic_word = content_list.find("div", class_="lemma-picture text-pic layout-right").get_text()  # remove picture captions from the text
                                    try:
                                        pic_word2 = content_list.find("div", class_="description").get_text()  # remove picture descriptions from the text
                                        content_list = content_list.get_text().replace(pic_word, '').replace(pic_word2, '')
                                    except:
                                        content_list = content_list.get_text().replace(pic_word, '')
                                except:
                                    try:
                                        pic_word2 = content_list.find("div", class_="description").get_text()  # remove picture descriptions from the text
                                        content_list = content_list.get_text().replace(pic_word2, '')
                                    except:
                                        content_list = content_list.get_text()
                                r_part = re.compile(r'\[\d.\]|\[\d\]')  # strip footnote markers like [1]
                                part_result, number = re.subn(r_part, "", content_list)
                                part_result = "".join(part_result.split())
                                # print(part_result)
                    except:
                        # no level-3 subsections: process the level-2 block directly
                        final_part_soup = BeautifulSoup(list, "html.parser")
                        content_lists = final_part_soup.find_all("div", class_="para")
                        for content_list in content_lists:
                            try:
                                pic_word = content_list.find("div", class_="lemma-picture text-pic layout-right").get_text()  # remove picture captions from the text
                                try:
                                    pic_word2 = content_list.find("div", class_="description").get_text()  # remove picture descriptions from the text
                                    content_list = content_list.get_text().replace(pic_word, '').replace(pic_word2, '')
                                except:
                                    content_list = content_list.get_text().replace(pic_word, '')
                            except:
                                try:
                                    pic_word2 = content_list.find("div", class_="description").get_text()  # remove picture descriptions from the text
                                    content_list = content_list.get_text().replace(pic_word2, '')
                                except:
                                    content_list = content_list.get_text()
                            r_part = re.compile(r'\[\d.\]|\[\d\]')  # strip footnote markers like [1]
                            part_result, number = re.subn(r_part, "", content_list)
                            part_result = "".join(part_result.split())
                            # print(part_result)
                except:
                    print("error")
            return

        def output(self, filename):
            json_data = json.dumps(self.view_datas, ensure_ascii=False, indent=2)
            fout = codecs.open(filename + '.json', 'a', encoding='utf-16')
            fout.write(json_data)
            # print(json_data)
            return

        def craw_pic(self, url, filename, html_count):
            soup = BeautifulSoup(html_count, 'html.parser', from_encoding='utf_8')
            node_pic = soup.find('div', class_='banner').find("a", href=re.compile("/photo/poi/....\."))
            if node_pic is None:
                return None
            else:
                part_url_pic = node_pic['href']
                full_url_pic = urllib.parse.urljoin(url, part_url_pic)
                # print(full_url_pic)
            try:
                html_pic = urlopen(full_url_pic)
            except HTTPError as e:
                return None
            soup_pic = BeautifulSoup(html_pic.read())
            pic_node = soup_pic.find('div', class_='album-list')
            print(pic_node)
            return


    if __name__ == "__main__":
        spider = BaikeCraw()
        filename = "D:\PyCharm\\view_spider\\view_points_part.txt"
        spider.craw(filename)