A summary of using Python's Beautiful Soup and MongoDB to crawl Himalaya Radio

Source: Internet
Author: User
Tags: webp
Beautiful Soup is a Python library that extracts data from HTML/XML files. It lets you navigate, search, and modify the document through your favorite parser, and it will save you hours of work. The PyMongo library is the bridge between the MongoDB NoSQL database and the Python language; data is saved to MongoDB through PyMongo. This article uses both to crawl the data of the Himalaya (Ximalaya) stations ...
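To make that division of labour concrete, here is a minimal, self-contained sketch (not the crawler itself): Beautiful Soup pulls a link, a title, and an image URL out of a small HTML snippet, and PyMongo writes the result to MongoDB. The HTML, the database name "demo", and the collection name "albums" are invented for illustration, and a MongoDB instance is assumed to be running locally.

# Minimal sketch: extract fields with Beautiful Soup, store them with PyMongo.
from bs4 import BeautifulSoup
import pymongo

html = '<div class="album"><a href="/album/1"><img src="cover.jpg" alt="Demo album"></a></div>'
soup = BeautifulSoup(html, "html.parser")   # parser bundled with the standard library

item = soup.find(class_="album")
doc = {
    "href": item.a["href"],
    "title": item.img["alt"],
    "img_url": item.img["src"],
}

client = pymongo.MongoClient("localhost", 27017)   # assumes a local MongoDB instance
client["demo"]["albums"].insert_one(doc)           # hypothetical db/collection names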

Beautiful Soup supports the HTML parser in the Python standard library and also supports several third-party parsers, one of which is lxml. This article uses lxml; for its installation, see "Python 3.6 lxml standard library lxml installation and etree use note".
At the same time, this article uses XPath to extract the parts we want; for an introduction to XPath and Beautiful Soup, see the Beautiful Soup 4.4.0 documentation and an XPath introduction.
The Beautiful Soup and XPath knowledge involved here is not very deep; reading the official documentation is enough to understand it, and I have also added comments to the code ...
As for PyMongo, I won't go on at length; please see "Python standard library Pymongo module secondary experience".
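Since the crawler below leans on lxml's etree and XPath, here is a tiny illustration of that pattern. The HTML snippet is invented for demonstration; the class name pagingBar_wrapper simply mirrors the one used later in the script.

# Parse an HTML string into a tree and pull an attribute out with XPath.
from lxml import etree

html = '''
<div class="pagingBar_wrapper">
    <a data-page="1">1</a>
    <a data-page="2">2</a>
    <a data-page="next">Next</a>
</div>
'''

tree = etree.HTML(html)
# a[last()-1] selects the second-to-last <a>, i.e. the highest numbered page link
pages = tree.xpath('//div[@class="pagingBar_wrapper"]/a[last()-1]/@data-page')
print(pages)   # ['2'] -- xpath() always returns a list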

Sometimes a server needs to determine the type of client that is currently making requests, that is, what is commonly referred to as the user agent, or UA. The browser we use when browsing the web is one kind of UA; in other words, to the server the UA is the browser. In the HTTP protocol, the User-Agent request header describes the user's browser type, operating system, browser engine, and other identifying information. With this identification, the websites you visit can serve different versions of a page, providing a better experience for the user or gathering statistics. Some websites also use the UA to block hackers, or bored people like us, from crawling their data.
As a result, the code first collects a pool of UA strings in a list, so that a random one can be chosen for each subsequent request.
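As a hedged sketch of that idea, the snippet below keeps a small pool of User-Agent strings and picks one at random per request. The two strings are just examples (the list in the full script is much longer), and the URL is the listing page crawled later.

# Rotate the User-Agent header by choosing a random string from a pool.
import random
import requests

ua_pool = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
]

headers = {"User-Agent": random.choice(ua_pool)}   # a different UA may be picked each run
resp = requests.get("http://www.ximalaya.com/dq/all/1", headers=headers)
print(resp.status_code)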

OK, here is the page whose data we're going to crawl:


What we need from it is each album's link, the cover picture, its alt text, and so on.

Then we follow each album link to get the details inside. If an album is split across multiple pages, we use XPath to find the page count and visit each page. At the same time we read the sound_ids of the sound module on the album page ... A focused sketch of these two steps follows; the complete script is given further below.
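The helpers below sketch the pagination check and the sound_ids extraction in isolation. The class names pagingBar_wrapper and personal_body, the tracks JSON endpoint, and the headers parameter are taken from the full script and may have changed on the live site.

# Sketch: how many pages does an album have, and which sound ids does one page list?
import requests
from lxml import etree

def album_pages(album_url, headers):
    """Return the number of pages an album is split into (1 if not paginated)."""
    html = requests.get(album_url, headers=headers).text
    pages = etree.HTML(html).xpath('//div[@class="pagingBar_wrapper"]/a[last()-1]/@data-page')
    return int(pages[0]) if pages else 1

def sound_ids(page_url, headers):
    """Return the sound ids on one album page; each maps to /tracks/<id>.json."""
    html = requests.get(page_url, headers=headers).text
    ids = etree.HTML(html).xpath('//div[@class="personal_body"]/@sound_ids')
    return ids[0].split(',') if ids else []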

The procedure is as follows:

import random
import requests
from bs4 import BeautifulSoup
import json
from lxml import etree
import pymongo

clients = pymongo.MongoClient("localhost", 27017)
db = clients["ximalaya"]
collection_1 = db["album"]    # album list: link, title, cover image URL
collection_2 = db["detail"]   # per-track detail documents

ua_list = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
    "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
    "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
    "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
    "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
]

headers1 = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, sdch',
    'Accept-Language': 'zh-CN,zh;q=0.8,en;q=0.6',
    'Cache-Control': 'max-age=0',
    'Proxy-Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': random.choice(ua_list)   # pick a random User-Agent from the pool above
}

headers2 = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, sdch',
    'Accept-Language': 'zh-CN,zh;q=0.8,en;q=0.6',
    'Cache-Control': 'max-age=0',
    'Proxy-Connection': 'keep-alive',
    'Referer': 'http://www.ximalaya.com/dq/all/2',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': random.choice(ua_list)
}

# Beautiful Soup is used to process XML and HTML...
# The BeautifulSoup module mainly processes the HTML source obtained with requests;
# the lxml module parses the HTML source into a tree structure, and XPath works on the tree nodes.
def get_url():
    start_urls = ["http://www.ximalaya.com/dq/all/{}".format(num) for num in range(1, 85)]
    # start_urls = ["http://www.ximalaya.com/dq/all/1"]
    for start_url in start_urls:
        html = requests.get(start_url, headers=headers1).text
        soup = BeautifulSoup(html, "lxml")                    # parse with lxml
        for item in soup.find_all(class_="albumfaceOutter"):  # parse and find the album nodes
            content = {
                'href': item.a["href"],
                'title': item.img['alt'],
                'img_url': item.img['src']
            }
            collection_1.insert(content)   # Collection.insert() is the old PyMongo call; insert_one() is the modern equivalent
            # another(item.a["href"])      # uncomment to also crawl each album's detail pages
        print('Write complete...')


# Enter an album's page, e.g. http://www.ximalaya.com/15836959/album/303085, and handle its paging...
def another(url):
    html = requests.get(url, headers=headers1).text
    # /  : select from the root node
    # // : select matching nodes anywhere in the document, regardless of their position
    ifanother = etree.HTML(html).xpath('//div[@class="pagingBar_wrapper"]/a[last()-1]/@data-page')  # page links; ifanother is a list
    if len(ifanother):                 # determine whether the album is split into multiple pages
        num = ifanother[0]             # get the number of pages
        print('This channel is saved in ' + num + ' pages')
        for n in range(1, int(num)):   # visit pages 1 .. num-1; the un-suffixed URL is fetched below as well
            url2 = url + '?page={}'.format(n)
            get_m4a(url2)
    get_m4a(url)


# Get the detailed data for one album page
def get_m4a(url):
    html = requests.get(url, headers=headers2).text
    numlist = etree.HTML(html).xpath('//div[@class="personal_body"]/@sound_ids')[0].split(',')
    for i in numlist:
        murl = 'http://www.ximalaya.com/tracks/{}.json'.format(i)
        html = requests.get(murl, headers=headers1).text
        dic = json.loads(html)
        collection_2.insert(dic)


if __name__ == "__main__":
    get_url()
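Once the crawl has run, the two collections can be inspected straight from a Python shell. A quick sanity check might look like this, again assuming a local MongoDB and the database/collection names used above (the detail collection only fills up if the album detail crawl was enabled).

# Verify what the crawler stored.
import pymongo

client = pymongo.MongoClient("localhost", 27017)
db = client["ximalaya"]

print(db["album"].count_documents({}))    # number of albums collected
print(db["detail"].count_documents({}))   # number of track detail documents

# Show one stored album document to check the extracted fields.
print(db["album"].find_one())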
