Crawling Baidu Encyclopedia content by keyword with Python 3

Preface

About the Python version: much of what I read at first said Python 2 was the better choice, because many libraries still did not support Python 3. So far, though, I find Python 3 more convenient, mainly because of how it handles encodings, which I think Python 2 does less well. And most of the Python 2 material found online can still be used with small changes.

OK, let's get to crawling Baidu Encyclopedia.

The requirement here is to crawl all the information for N attractions in Beijing, where the names of the N attractions are given in a document. No API is used; we simply crawl the page content.

1. Get URLs based on keywords

Because we only need to fetch information and no interaction is involved, a simple approach is enough; there is no need to emulate a browser.

You can simply request a URL of the form

http://baike.baidu.com/search/word?word=<keyword>

for each attraction name:

for l in view_names:
    # get URL method: http://baike.baidu.com/search/word?word=<keyword>
    name = urllib.parse.quote(l)
    name.encode('utf-8')
    url = 'http://baike.baidu.com/search/word?word=' + name

Note that the keywords here are Chinese, so pay attention to encoding: a URL cannot contain spaces or non-ASCII characters such as Chinese, so the keyword has to be processed with the quote() function first.

About quote ():

In Python 2.x the usage is urllib.quote(text); in Python 3.x it is urllib.parse.quote(text). By the standard, a URL may contain only a subset of ASCII characters (letters, digits and a few symbols); other characters, such as Chinese, do not conform to the URL standard, so using them in a URL requires URL encoding. The part of the URL that passes parameters (the query string) has the format name1=value1&name2=value2&name3=value3. If a name or value itself contains "&" or "=", that causes problems too, so those characters in the parameter string also need to be encoded. URL encoding converts each character that needs encoding into the %xx form. URL encoding is usually based on UTF-8 (which, of course, also depends on the browser and platform).

Example:

For example, "I, Unicode is 0X6211,UTF-8 encoded as 0xe60x880x91,url encoding is%e6%88%91."

Python's urllib library provides two functions, quote and quote_plus, which differ in the range of characters they encode. We won't go into the details here; quote is enough for our purposes.
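For instance, a quick sketch of what quote() produces in Python 3 (the keyword 故宫 here is only an illustration, not necessarily one of the attractions in the document):

from urllib.parse import quote, quote_plus

keyword = "故宫"  # an illustrative Chinese keyword
print(quote(keyword))  # %E6%95%85%E5%AE%AB
url = "http://baike.baidu.com/search/word?word=" + quote(keyword)
print(url)

# quote_plus additionally escapes "/" and encodes spaces as "+"
print(quote("a b/c"))       # a%20b/c
print(quote_plus("a b/c"))  # a+b%2Fc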

2. Download URL

This is easy to implement with the urllib library; see the download(self, url) method in the code below.
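For reference, a stripped-down standalone version of that method (it mirrors the download() method in the full listing below) looks roughly like this:

import urllib.request

def download(url):
    # Return the raw bytes of the page, or None if the URL is missing
    # or the server does not answer with HTTP 200.
    if url is None:
        return None
    response = urllib.request.urlopen(url)
    if response.getcode() != 200:
        return None
    return response.read()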

3. Parse the HTML with BeautifulSoup
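The original post gives no separate snippet for this step; a minimal sketch, assuming the same html.parser backend used in the full listing and an example keyword, might look like this (the exact page structure of Baidu Baike may of course have changed since the post was written):

from urllib.parse import quote
import urllib.request
from bs4 import BeautifulSoup

url = "http://baike.baidu.com/search/word?word=" + quote("故宫")
html_bytes = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html_bytes, "html.parser")
h1 = soup.find("h1")  # the entry title, i.e. the attraction name
if h1 is not None:
    print(h1.get_text())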

4. Data analysis

The content of an encyclopedia entry is a series of parallel paragraphs, so the crawled page does not naturally break into logical segments (everything runs together). That means regular expressions have to be used.

The basic idea is to treat the entire HTML file as a string, use a regular expression to cut out the desired content, convert that fragment back into a BeautifulSoup object, and then process it further.
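A minimal sketch of that idea is below. The "para" class name is taken from the full listing further down; the regular expression itself is only illustrative, and a real page with nested tags would need a more careful pattern:

import re
from bs4 import BeautifulSoup

def cut_and_reparse(html_text):
    # Treat the whole HTML document as one string, cut out a block with a
    # regular expression, then turn the fragment back into a soup object.
    m = re.search(r'<div class="para"[^>]*>.*?</div>', html_text, re.DOTALL)
    if m is None:
        return None
    fragment = BeautifulSoup(m.group(0), "html.parser")
    return fragment.get_text()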

It may take some time to get comfortable with regular expressions.
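As one concrete example, the reference-stripping regex used in the listing below can be tried on its own like this (the sentence is made up for illustration):

import re

# Removes citation markers such as [1] or [12] that Baidu Baike inserts into body text.
r_part = re.compile(r'\[\d.\]|\[\d\]')
text = "示例段落[1]包含引用标记[12]。"  # an illustrative sentence, not from a real crawl
part_result, number = re.subn(r_part, "", text)
part_result = "".join(part_result.split())  # collapse whitespace, as the listing does
print(part_result)  # 示例段落包含引用标记。
print(number)       # 2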

There are a lot of details in the code that I forgot and had to check again; next time I should definitely write documentation as I go, or right after finishing...

Here is the code:

# coding:utf-8
"""function: crawl all Beijing attractions from Baidu Encyclopedia, author: yi"""
import urllib.request
from urllib.request import urlopen
from urllib.error import HTTPError
import urllib.parse
from bs4 import BeautifulSoup
import re
import codecs
import json


class BaikeCraw(object):
    def __init__(self):
        self.urls = set()
        self.view_datas = {}

    def craw(self, filename):
        urls = self.geturls(filename)
        if urls == None:
            print("not found")
        else:
            for urll in urls:
                print(urll)
                try:
                    html_count = self.download(urll)
                    self.passer(urll, html_count)
                except:
                    print("view does not exist")
                '''file = self.view_datas["view_name"]
                self.craw_pic(urll, file, html_count)
                print(file)'''

    def geturls(self, filename):
        new_urls = set()
        file_object = codecs.open(filename, encoding='utf-16', )
        try:
            all_text = file_object.read()
        except:
            print("file open exception!")
            file_object.close()
        file_object.close()
        view_names = all_text.split(" ")
        for l in view_names:
            if '?' in l:
                view_names.remove(l)
        for l in view_names:
            # get URL method: http://baike.baidu.com/search/word?word=<keyword>
            name = urllib.parse.quote(l)
            name.encode('utf-8')
            url = 'http://baike.baidu.com/search/word?word=' + name
            new_urls.add(url)
        print(new_urls)
        return new_urls

    def manger(self):
        pass

    def passer(self, urll, html_count):
        soup = BeautifulSoup(html_count, 'html.parser', from_encoding='utf_8')
        self._get_new_data(urll, soup)
        return

    def download(self, url):
        if url is None:
            return None
        response = urllib.request.urlopen(url)
        if response.getcode() != 200:
            return None
        return response.read()

    def _get_new_data(self, url, soup):  # 得到数据 (extract the data)
        if soup.find('div', class_="main-content").find('h1') is not None:
            self.view_datas["view_name"] = soup.find('div', class_="main-content").find('h1').get_text()  # 景点名 (attraction name)
            print(self.view_datas["view_name"])
        else:
            self.view_datas["view_name"] = soup.find("div", class_="feature_poster").find("h1").get_text()
        self.view_datas["view_message"] = soup.find('div', class_="lemma-summary").get_text()  # 简介 (summary)
        self.view_datas["basic_message"] = soup.find('div', class_="basic-info cmn-clearfix").get_text()  # 基本信息 (basic info)
        self.view_datas["basic_message"] = self.view_datas["basic_message"].split("\n")
        get = []
        for line in self.view_datas["basic_message"]:
            if line != "":
                get.append(line)
        self.view_datas["basic_message"] = get
        i = 1
        get2 = []
        tmp = "percent"
        for line in self.view_datas["basic_message"]:
            if i % 2 == 1:
                tmp = line
            else:
                a = tmp + ":" + line
                get2.append(a)
            i = i + 1
        self.view_datas["basic_message"] = get2
        self.view_datas["catalog"] = soup.find('div', class_="lemma-catalog").get_text().split("\n")  # 目录整体 (table of contents)
        get = []
        for line in self.view_datas["catalog"]:
            if line != "":
                get.append(line)
        self.view_datas["catalog"] = get
        # ######################### 百科内容 (encyclopedia body)
        view_name = self.view_datas["view_name"]
        html = urllib.request.urlopen(url)
        soup2 = BeautifulSoup(html.read(), 'html.parser').decode('utf-8')
        p = re.compile(r'', re.DOTALL)  # 尾部 (tail) -- regex pattern missing in the original listing
        r = p.search(content_data_node)  # content_data_node is also missing from the original listing
        content_data = content_data_node[0:r.span(0)[0]]
        lists = content_data.split('')  # split marker missing in the original listing
        i = 1
        for list in lists:  # 每一大块 (each large block)
            final_soup = BeautifulSoup(list, "html.parser")
            name_list = None
            try:
                part_name = final_soup.find('h2', class_="title-text").get_text().replace(view_name, '').strip()
                part_data = final_soup.get_text().replace(view_name, '').replace(part_name, '').replace('编辑', '')  # strip the "edit" link text
                name_list = final_soup.find_all('h3', class_="title-text")
                all_name_list = {}
                na = "part_name" + str(i)
                all_name_list[na] = part_name
                final_name_list = []
                for nlist in name_list:
                    nlist = nlist.get_text().replace(view_name, '').strip()
                    final_name_list.append(nlist)
                fin = "final_name_list" + str(i)
                all_name_list[fin] = final_name_list
                print(all_name_list)
                i = i + 1
                # 正文 (main text)
                try:
                    p = re.compile(r'', re.DOTALL)  # regex pattern missing in the original listing
                    final_soup = final_soup.decode('utf-8')
                    r = p.search(final_soup)
                    final_part_data = final_soup[r.span(0)[0]:]
                    part_lists = final_part_data.split('')  # split marker missing in the original listing
                    for part_list in part_lists:
                        final_part_soup = BeautifulSoup(part_list, "html.parser")
                        content_lists = final_part_soup.find_all("div", class_="para")
                        for content_list in content_lists:  # each smallest segment
                            try:
                                pic_word = content_list.find("div", class_="lemma-picture text-pic layout-right").get_text()  # remove the picture caption from the text
                                try:
                                    pic_word2 = content_list.find("div", class_="description").get_text()  # remove the picture description from the text
                                    content_list = content_list.get_text().replace(pic_word, '').replace(pic_word2, '')
                                except:
                                    content_list = content_list.get_text().replace(pic_word, '')
                            except:
                                try:
                                    pic_word2 = content_list.find("div", class_="description").get_text()  # remove the picture description from the text
                                    content_list = content_list.get_text().replace(pic_word2, '')
                                except:
                                    content_list = content_list.get_text()
                            r_part = re.compile(r'\[\d.\]|\[\d\]')
                            part_result, number = re.subn(r_part, "", content_list)  # strip reference markers such as [1]
                            part_result = "".join(part_result.split())
                            # print(part_result)
                except:
                    final_part_soup = BeautifulSoup(list, "html.parser")
                    content_lists = final_part_soup.find_all("div", class_="para")
                    for content_list in content_lists:
                        try:
                            pic_word = content_list.find("div", class_="lemma-picture text-pic layout-right").get_text()  # remove the picture caption from the text
                            try:
                                pic_word2 = content_list.find("div", class_="description").get_text()  # remove the picture description from the text
                                content_list = content_list.get_text().replace(pic_word, '').replace(pic_word2, '')
                            except:
                                content_list = content_list.get_text().replace(pic_word, '')
                        except:
                            try:
                                pic_word2 = content_list.find("div", class_="description").get_text()  # remove the picture description from the text
                                content_list = content_list.get_text().replace(pic_word2, '')
                            except:
                                content_list = content_list.get_text()
                        r_part = re.compile(r'\[\d.\]|\[\d\]')
                        part_result, number = re.subn(r_part, "", content_list)  # strip reference markers such as [1]
                        part_result = "".join(part_result.split())
                        # print(part_result)
            except:
                print("error")
        return

    def output(self, filename):
        json_data = json.dumps(self.view_datas, ensure_ascii=False, indent=2)
        fout = codecs.open(filename + '.json', 'a', encoding='utf-16', )
        fout.write(json_data)
        # print(json_data)
        return

    def craw_pic(self, url, filename, html_count):
        soup = BeautifulSoup(html_count, 'html.parser', from_encoding='utf_8')
        node_pic = soup.find('div', class_='banner').find("a", href=re.compile(r"/photo/poi/....\."))
        if node_pic is None:
            return None
        else:
            part_url_pic = node_pic['href']
            full_url_pic = urllib.parse.urljoin(url, part_url_pic)
            # print(full_url_pic)
        try:
            html_pic = urlopen(full_url_pic)
        except HTTPError as e:
            return None
        soup_pic = BeautifulSoup(html_pic.read(), 'html.parser')
        pic_node = soup_pic.find('div', class_="album-list")
        print(pic_node)
        return


if __name__ == "__main__":
    spider = BaikeCraw()
    filename = "D:\PyCharm\\view_spider\\view_points_part.txt"
    spider.craw(filename)

Summary

That is basically all there is to crawling Baidu Encyclopedia content by keyword with Python 3. I hope this article is helpful to everyone learning Python.
