Beautiful Soup is a Python library that extracts data from HTML and XML files: it lets you pick your favorite parser, then navigate, search, and modify the document, saving hours of work. PyMongo is the bridge between the Python language and the MongoDB NoSQL database; it is what actually saves our data into MongoDB. Here we use the two together to crawl album data from Ximalaya (ximalaya.com).
Beautiful Soup supports the HTML parser in the Python standard library and also supports several third-party parsers, one of which is lxml. This article uses lxml; for installation, see my earlier post "Python 3.6 lxml installation and etree usage notes".
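The parser is chosen when the soup object is constructed. A minimal sketch (the HTML string here is made up purely for illustration):

    from bs4 import BeautifulSoup

    html = "<html><body><a href='/album/1'><img alt='demo' src='cover.jpg'/></a></body></html>"  # toy input

    soup = BeautifulSoup(html, "html.parser")  # parser from the standard library
    soup = BeautifulSoup(html, "lxml")         # the faster third-party lxml parser used in this article
    print(soup.a["href"])                      # -> /album/1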
At the same time, this article uses XPath to pick out the parts we want. For an introduction to XPath and to Beautiful Soup itself, see the Beautiful Soup 4.4.0 documentation and my "XPath Introduction" post.
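As a quick taste of the XPath syntax used later, here is a sketch against a made-up stand-in for Ximalaya's pager markup:

    from lxml import etree

    html = "<div class='pagingBar_wrapper'><a data-page='1'>1</a><a data-page='2'>2</a><a>next</a></div>"  # toy markup
    tree = etree.HTML(html)  # parse the fragment into an element tree
    # // selects matching nodes anywhere in the document; @ reads an attribute
    pages = tree.xpath("//div[@class='pagingBar_wrapper']/a[last()-1]/@data-page")
    print(pages)  # -> ['2'], the number of the last page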
The Beautiful Soup and XPath knowledge this article relies on is not deep; a look at the official documentation is enough to follow along, and I have also added comments to the code.
As for PyMongo, I will not waste words here; please see my post "Python PyMongo module, a second experience".
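The part of PyMongo we actually need is tiny: connect, pick a database and a collection, insert. A minimal sketch, assuming a MongoDB server running on localhost:27017:

    import pymongo

    client = pymongo.MongoClient("localhost", 27017)  # assumes a local mongod is running
    db = client["Ximalaya"]                           # databases and collections are created lazily
    db["album"].insert_one({"title": "demo", "href": "/album/1"})  # insert one document
    print(db["album"].find_one({"title": "demo"}))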
Sometimes the server needs to determine what kind of client is making a request; this is what is commonly called the User-Agent, or UA. The browser we use to browse the Web is one UA; in other words, the UA identifies the browser. In the HTTP protocol, the User-Agent request header carries the user's browser type, operating system, browser engine, and other identifying information. With this identifier, a site can serve different versions of a page, either to give users a better experience or to gather statistics. Some sites also use the UA to block hackers, or bored people like us who crawl their data.
As a result, the code first gathers a batch of UA strings into a list, so one can be picked at random for the crawling work that follows.
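Concretely, rotating the UA with requests looks like this (the list is abbreviated to two entries here; the full list appears in the program below):

    import random
    import requests

    ua_list = [
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
    ]
    headers = {"User-Agent": random.choice(ua_list)}  # pose as an ordinary browser
    resp = requests.get("http://www.ximalaya.com/dq/all/1", headers=headers)
    print(resp.status_code)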
OK, here is the listing page we are going to crawl for data:
What we need from it is each album's link, cover image, alt text (the title), and so on.
Then we follow the album link to the detail page. If an album's track list spans several pages, we use XPath to walk the pagination. On each page we also read the sound_ids attribute of the album's sound module, which gives the ID of every track on that page.
The procedure is as follows:
    import json
    import random

    import pymongo
    import requests
    from bs4 import BeautifulSoup
    from lxml import etree

    client = pymongo.MongoClient("localhost", 27017)
    db = client["Ximalaya"]
    collection_1 = db["album"]   # album-level records from the listing pages
    collection_2 = db["detail"]  # one JSON document per track

    ua_list = [  # duplicates in the original list have been removed
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
        "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    ]

    headers1 = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Accept-Encoding': 'gzip, deflate, sdch',
        'Accept-Language': 'zh-CN,zh;q=0.8,en;q=0.6',
        'Cache-Control': 'max-age=0',
        'Proxy-Connection': 'keep-alive',
        'Upgrade-Insecure-Requests': '1',
        'User-Agent': random.choice(ua_list),  # pose as a random browser from the list
    }
    headers2 = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Accept-Encoding': 'gzip, deflate, sdch',
        'Accept-Language': 'zh-CN,zh;q=0.8,en;q=0.6',
        'Cache-Control': 'max-age=0',
        'Proxy-Connection': 'keep-alive',
        'Referer': 'http://www.ximalaya.com/dq/all/2',  # album pages get a listing-page Referer
        'Upgrade-Insecure-Requests': '1',
        'User-Agent': random.choice(ua_list),
    }

    # BeautifulSoup processes the HTML source that requests fetches;
    # lxml parses the same source into a tree so XPath can select nodes from it.
    def get_url():
        start_urls = ["http://www.ximalaya.com/dq/all/{}".format(num) for num in range(1, 85)]
        # start_urls = ["http://www.ximalaya.com/dq/all/1"]  # single page, handy for testing
        for start_url in start_urls:
            html = requests.get(start_url, headers=headers1).text
            soup = BeautifulSoup(html, "lxml")  # parse with the lxml parser
            for item in soup.find_all(class_="albumfaceOutter"):  # one node per album card
                content = {
                    'href': item.a["href"],      # link to the album page
                    'title': item.img["alt"],    # album title, taken from the alt text
                    'img_url': item.img["src"],  # cover image URL
                }
                collection_1.insert_one(content)  # insert_one() replaces the insert() removed in PyMongo 4
                another(item.a["href"])
            print('Write complete ...')


    # Open an album page such as http://www.ximalaya.com/15836959/album/303085
    # and handle albums whose track list is split across several pages.
    def another(url):
        html = requests.get(url, headers=headers1).text
        # /  selects from the root node; // selects matching nodes anywhere in the document
        ifanother = etree.HTML(html).xpath('//div[@class="pagingBar_wrapper"]/a[last()-1]/@data-page')
        if len(ifanother):      # the pager exists, so this album spans several pages
            num = ifanother[0]  # number of the last page
            print('This album is spread over ' + num + ' pages')
            for n in range(2, int(num) + 1):  # page 1 is the bare URL, fetched below
                url2 = url + '?page={}'.format(n)
                get_m4a(url2)
        get_m4a(url)


    # Pull the per-track JSON for every sound on one page of an album.
    def get_m4a(url):
        html = requests.get(url, headers=headers2).text
        numlist = etree.HTML(html).xpath('//div[@class="personal_body"]/@sound_ids')[0].split(',')
        for i in numlist:
            murl = 'http://www.ximalaya.com/tracks/{}.json'.format(i)
            html = requests.get(murl, headers=headers1).text
            dic = json.loads(html)
            collection_2.insert_one(dic)


    if __name__ == "__main__":
        get_url()
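Once the crawl has run for a while, a quick sanity check from another Python session (assuming the same defaults as above) confirms the data is landing in both collections:

    import pymongo

    client = pymongo.MongoClient("localhost", 27017)
    db = client["Ximalaya"]
    print(db["album"].count_documents({}))   # albums collected from the listing pages
    print(db["detail"].count_documents({}))  # per-track JSON documents saved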
Summary: crawling Ximalaya with Python, Beautiful Soup, and MongoDB.