Python 3 Crawler Learning: Crawling Baidu Encyclopedia Content by Keyword


As a beginner it took me a long time to piece this together, so I'm writing it down to avoid having to relearn everything the next time I need it ~

To learn crawling I first took a Python course, and then followed the crawler tutorials on MOOC sites and NetEase Cloud Classroom. Both are worth checking out yourself.

The beginning is the hard part; after all, familiarity takes time, and Python was unfamiliar to me.

About the Python version: a lot of what I read at first said Python 2 was the better choice because many libraries did not yet support Python 3, but so far I find Python 3 more convenient, mainly because of encoding: Python 2 is not as convenient as Python 3 in that respect. And most of the Python 2 examples found online can still be used with minor changes.


OK, on to crawling Baidu Encyclopedia.

The requirement here is to crawl all the information for N attractions in Beijing, with the names of the N attractions given in a document. No API is used; we just crawl the page content.


1. Get URLs based on keywords

Because we only need to fetch information and there is no interaction involved, a simple approach works; there is no need to emulate a browser.

You can simply request

http://baike.baidu.com/search/word?word=<keyword>

The corresponding snippet (from get_urls in the full code below):

for l in view_names:
    # build http://baike.baidu.com/search/word?word=<keyword>
    name = urllib.parse.quote(l)
    url = 'http://baike.baidu.com/search/word?word=' + name

Note that the keywords here are Chinese, so pay attention to the encoding problem: a URL cannot contain spaces or raw Chinese characters, which is why the quote function is needed.

About quote():

In Python 2.x the usage is urllib.quote(text); in Python 3.x it is urllib.parse.quote(text). By the standard, a URL may only contain a subset of ASCII characters (letters, digits and a few symbols); other characters, such as Chinese characters, do not conform to the URL standard, so using them in a URL requires URL encoding. The query-string part of a URL has the format name1=value1&name2=value2&name3=value3. If a name or value itself contains "&" or "=", there is obviously a problem, so those characters must be encoded as well. URL encoding converts each character that needs encoding into the form %XX. URL encoding is usually based on UTF-8 (although this is, of course, related to the browser/platform).
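As a side note (not used in the crawler itself, just to illustrate the query-string encoding described above), urllib.parse.urlencode percent-encodes parameter names and values, including any "&" or "=" they contain:

import urllib.parse

params = {"word": "故宫", "a&b": "x=y"}
print(urllib.parse.urlencode(params))
# word=%E6%95%85%E5%AE%AB&a%26b=x%3Dy  ('&' and '=' inside names/values get encoded)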

Example:

For example, "我" (Unicode code point 0x6211) is UTF-8 encoded as 0xE6 0x88 0x91, so its URL encoding is %E6%88%91.

Python's urllib library provides two such functions, quote and quote_plus, which differ in which characters they encode. There is no need to dig into the details here; quote is enough for our purpose.
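A quick way to see the difference (a small check added here, not from the original post):

import urllib.parse

print(urllib.parse.quote("我"))          # %E6%88%91
print(urllib.parse.quote("a b/c"))       # a%20b/c  (space -> %20, '/' left alone)
print(urllib.parse.quote_plus("a b/c"))  # a+b%2Fc  (space -> '+', '/' encoded too)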


2. Download URL

This is easy with the urllib library; see def download(self, url) in the code below.
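The core of that method is just urlopen plus a status-code check. A minimal standalone sketch of the same idea (the function body mirrors download in the full code below):

import urllib.request

def download(url):
    """Fetch a page and return its raw bytes, or None on failure."""
    if url is None:
        return None
    response = urllib.request.urlopen(url)
    if response.getcode() != 200:   # treat anything other than HTTP 200 as a failure
        return None
    return response.read()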


3. Parse the HTML with BeautifulSoup
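This step simply turns the downloaded bytes into a BeautifulSoup object that can be searched by tag and CSS class. A minimal sketch (故宫, the Forbidden City, is just an example keyword; the main-content/h1 selectors are the ones the full code below relies on, assuming the page markup has not changed):

import urllib.parse
import urllib.request
from bs4 import BeautifulSoup

url = 'http://baike.baidu.com/search/word?word=' + urllib.parse.quote('故宫')
html_count = urllib.request.urlopen(url).read()                         # raw bytes, as in step 2
soup = BeautifulSoup(html_count, 'html.parser', from_encoding='utf-8')
print(soup.find('div', class_='main-content').find('h1').get_text())    # attraction name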


4. Data analysis

The body of an encyclopedia entry is a sequence of parallel sections, so the crawled content does not naturally fall into per-section pieces (it all runs together). Regular expressions are therefore needed to split it up.

Basics of regular expressions: http://www.cnblogs.com/huxi/archive/2010/07/04/1771073.html


The basic idea is to treat the entire HTML document as a str, use regular expressions to cut out the part we want, turn that part back into a BeautifulSoup object, and then process it further.

It may take some time to get comfortable with regular expressions.
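Concretely, the full code below finds the first level-2 section header and the album list with re.search, slices the page string between them, and re-parses each slice with BeautifulSoup. A minimal sketch of that idea (it assumes soup is the object from step 3 and that both markers are present in the page; the class names are the ones used in the code below):

import re
from bs4 import BeautifulSoup

html_text = soup.decode()                                      # whole page as one str
start = re.search(r'<div class="para-title level-2"', html_text)
end = re.search(r'<div class="album-list"', html_text)
body = html_text[start.span(0)[0]:end.span(0)[0]]              # body between first h2 and the album list

for block in body.split('<div class="para-title level-2">'):   # one piece per level-2 section
    block_soup = BeautifulSoup(block, 'html.parser')
    h2 = block_soup.find('h2', class_='title-text')
    if h2 is not None:
        print(h2.get_text())                                   # section title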

There are a lot of details in the code that I only remember after checking back, so next time I should really write the documentation as I go, or at least right after finishing...

Here is the code:

# coding:utf-8
"""function: crawl Baidu Encyclopedia pages for Beijing attractions; author: yi"""
import urllib.request
from urllib.request import urlopen
from urllib.error import HTTPError
import urllib.parse
from bs4 import BeautifulSoup
import re
import codecs
import json


class BaikeCraw(object):
    def __init__(self):
        self.urls = set()
        self.view_datas = {}

    def craw(self, filename):
        urls = self.get_urls(filename)
        if urls is None:
            print("not found")
        else:
            for urll in urls:
                print(urll)
                try:
                    html_count = self.download(urll)
                    self.passer(urll, html_count)
                except:
                    print("view does not exist")
                '''
                file = self.view_datas["view_name"]
                self.craw_pic(urll, file, html_count)
                print(file)
                '''

    def get_urls(self, filename):
        # Read the attraction names from the (UTF-16) file and build one search URL per name.
        new_urls = set()
        file_object = codecs.open(filename, encoding='utf-16')
        try:
            all_text = file_object.read()
        except:
            print("file open exception!")
            file_object.close()
        file_object.close()
        view_names = all_text.split(" ")
        view_names = [l for l in view_names if '?' not in l]  # drop malformed entries
        for l in view_names:
            # http://baike.baidu.com/search/word?word=<keyword>
            name = urllib.parse.quote(l)
            url = 'http://baike.baidu.com/search/word?word=' + name
            new_urls.add(url)
        print(new_urls)
        return new_urls

    def manger(self):
        pass

    def passer(self, urll, html_count):
        soup = BeautifulSoup(html_count, 'html.parser', from_encoding='utf-8')
        self._get_new_data(urll, soup)
        return

    def download(self, url):
        if url is None:
            return None
        response = urllib.request.urlopen(url)
        if response.getcode() != 200:
            return None
        return response.read()

    def _clean_para(self, para_node):
        # Strip picture captions and [n]-style reference markers from one paragraph node.
        text = para_node.get_text()
        for cls in ("lemma-picture text-pic layout-right", "description"):
            pic_node = para_node.find("div", class_=cls)
            if pic_node is not None:
                text = text.replace(pic_node.get_text(), '')
        text, _ = re.subn(r'\[\d.\]|\[\d\]', "", text)
        return "".join(text.split())

    def _get_new_data(self, url, soup):
        # attraction name
        if soup.find('div', class_="main-content").find('h1') is not None:
            self.view_datas["view_name"] = soup.find('div', class_="main-content").find('h1').get_text()
            print(self.view_datas["view_name"])
        else:
            self.view_datas["view_name"] = soup.find("div", class_="feature_poster").find("h1").get_text()
        # summary
        self.view_datas["view_message"] = soup.find('div', class_="lemma-summary").get_text()
        # basic-info box: keep non-empty lines, which alternate key / value, and join them as "key:value"
        basic = soup.find('div', class_="basic-info cmn-clearfix").get_text()
        lines = [line for line in basic.split("\n") if line != ""]
        pairs = []
        for i, line in enumerate(lines):
            if i % 2 == 0:
                key = line
            else:
                pairs.append(key + ":" + line)
        self.view_datas["basic_message"] = pairs
        # table of contents
        catalog = soup.find('div', class_="lemma-catalog").get_text().split("\n")
        self.view_datas["catalog"] = [line for line in catalog if line != ""]
        # encyclopedia body: slice the raw HTML between the first level-2 header and the album list
        view_name = self.view_datas["view_name"]
        html = urllib.request.urlopen(url)
        soup2 = BeautifulSoup(html.read(), 'html.parser').decode()
        p = re.compile(r'<div class="para-title level-2"', re.DOTALL)
        r = p.search(soup2)
        content_data_node = soup2[r.span(0)[0]:]                        # from the first h2 (head)
        p = re.compile(r'<div class="album-list">', re.DOTALL)          # tail
        r = p.search(content_data_node)
        content_data = content_data_node[0:r.span(0)[0]]
        blocks = content_data.split('<div class="para-title level-2">')
        i = 1
        for block in blocks:                                            # one block per level-2 section
            final_soup = BeautifulSoup(block, "html.parser")
            try:
                part_name = final_soup.find('h2', class_="title-text").get_text().replace(view_name, '').strip()
                part_data = final_soup.get_text().replace(view_name, '').replace(part_name, '').replace('编辑', '')  # section text, "edit" links removed
                # the h2 title plus its h3 sub-titles
                name_list = final_soup.find_all('h3', class_="title-text")
                all_name_list = {"part_name" + str(i): part_name}
                final_name_list = []
                for nlist in name_list:
                    final_name_list.append(nlist.get_text().replace(view_name, '').strip())
                all_name_list["final_name_list" + str(i)] = final_name_list
                print(all_name_list)
                i = i + 1
                # main text: split further by level-3 sub-sections if present
                try:
                    p = re.compile(r'<div class="para-title level-3">', re.DOTALL)
                    final_html = final_soup.decode()
                    r = p.search(final_html)
                    final_part_data = final_html[r.span(0)[0]:]
                    part_lists = final_part_data.split('<div class="para-title level-3">')
                    for part_list in part_lists:
                        final_part_soup = BeautifulSoup(part_list, "html.parser")
                        for content_list in final_part_soup.find_all("div", class_="para"):
                            part_result = self._clean_para(content_list)   # each smallest paragraph
                            # print(part_result)
                except:
                    # no level-3 sub-sections: take the paragraphs of the whole block
                    final_part_soup = BeautifulSoup(block, "html.parser")
                    for content_list in final_part_soup.find_all("div", class_="para"):
                        part_result = self._clean_para(content_list)
                        # print(part_result)
            except:
                print("error")
        return

    def output(self, filename):
        # Append everything collected for one attraction to <filename>.json (UTF-16).
        json_data = json.dumps(self.view_datas, ensure_ascii=False, indent=2)
        fout = codecs.open(filename + '.json', 'a', encoding='utf-16')
        fout.write(json_data)
        # print(json_data)
        return

    def craw_pic(self, url, filename, html_count):
        # Follow the banner link to the attraction's photo album page.
        soup = BeautifulSoup(html_count, 'html.parser', from_encoding='utf-8')
        node_pic = soup.find('div', class_='banner').find("a", href=re.compile(r"/photo/poi/....\."))
        if node_pic is None:
            return None
        part_url_pic = node_pic['href']
        full_url_pic = urllib.parse.urljoin(url, part_url_pic)
        # print(full_url_pic)
        try:
            html_pic = urlopen(full_url_pic)
        except HTTPError:
            return None
        soup_pic = BeautifulSoup(html_pic.read(), 'html.parser')
        pic_node = soup_pic.find('div', class_='album-list')
        print(pic_node)
        return


if __name__ == "__main__":
    spider = BaikeCraw()
    filename = "D:\\PyCharm\\view_spider\\view_points_part.txt"
    spider.craw(filename)




