Bilibili (B Station) Title/Subtitle/URL Crawl Example (requests + re)

Source: Internet
Author: User
Tags: gettext

    # coding: utf-8
    __author__ = "Zhoumi"

    import requests
    import re
    import urllib

    """
    The purpose of this script is to obtain:
    1. A dictionary of first-level categories and their links, in the form
       dictinfo = {first-level category: link}
    2. A dictionary of second-level categories and their links, in the form
       dict2info = {second-level category: link}
    3. A dictionary mapping each first-level category to its second-level
       categories, in the form
       dict3info = {first-level category: [second-level categories]}
    """

    # Get the page you want to parse.
    # raise_for_status handles the failure case: if the request was
    # unsuccessful, it throws an exception.
    def getText(url):
        source = requests.get(url)
        source.raise_for_status()
        source.encoding = source.apparent_encoding
        return source.text

    # Returns a dictionary of category names (keys) and corresponding links (values)
    # dictinfo = {name1list: html1list}
    # For example: Animation: www.bilibili.donghua.com, ...
    def getfirsttitle(source):
        text = re.findall(r'a class.*?div class', source)
        namelist = []
        htmllist = []
        dictinfo = {}
        for i in text:
            namelist.append(i.split("><em>")[1].split("</em>")[0])
            htmllist.append(i.split('href="//')[1].split('"><em>')[0])
        # As in the original, the last match is skipped.
        for i in range(len(namelist) - 1):
            dictinfo[namelist[i]] = htmllist[i]
        return dictinfo

    # Returns a dictionary of keys (category names) and values (corresponding
    # links) for second-level categories
    # dict2info = {name2list: html2list}
    def getsecondtitle(source):
        text2 = re.findall(r'a href.*?<em></em></b></a></li>', source)
        name2list = []
        html2list = []
        dict2info = {}
        for i in text2:
            name2list.append(i.split('><b>')[1].split('<em>')[0])
            html2list.append(i.split('a href="//')[1].split('"><b>')[0])
        # As in the original, the last match is skipped.
        for i in range(len(name2list) - 1):
            dict2info[name2list[i]] = html2list[i]
        return dict2info

    # Get a dictionary mapping first-level category names to second-level names
    # dict3info = {name1list: [name2list]}
    def getfirst2second(source):
        text3 = re.findall(r'"m-i".*?</ul', source, re.S)
        dict3info = {}
        middletitle = []
        for i in text3:
            # Get the first-level title
            title = i.split('><b>')[0].split('</em>')[0].split('<em>')[1]
            # Get the sub-headings under this title
            childtitle = i.split('><b>')
            dict3info[title] = childtitle
            # Shift every element left by one to drop the leading chunk
            for j in range(len(childtitle) - 1):
                childtitle[j] = childtitle[j + 1]
            # Remove the redundant last element
            childtitle.pop()
            for k in childtitle:
                middletitle.append(k.split('<em>')[0])
            # Store the result once all children of this title are processed
            dict3info[title] = middletitle
            # Reset the temporary list
            middletitle = []
        return dict3info


    # ----------------------------------------------
    ## Import the dictionary {second-level category name: urls2};
    ## plan to use the urllib library
    """
    The URL is the url2 inside the dict_2_url2 dictionary.
    This block is intended to obtain the source video link and video name
    from the second-level category pages and generate the final callable
    dictionary {source_name: source_url}

    url = dict_2_urls.values()
    """

    def gettext(url):
        source = requests.get(url)
        source.raise_for_status()
        source.encoding = source.apparent_encoding
        return source.text

    # Unfinished: extracts the first blob: link from a <video> tag, if any.
    def download(source):
        # re.findall returns a list, so take the first match before splitting
        text = re.findall(r'<video src="blob:.*?"></video>', source)
        if not text:
            return None
        html = text[0].split('<video src="')[1].split('"></video>')[0]
        return html
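The functions above find a chunk of markup with re.findall and then cut the name and link out of it with chained split calls. A minimal sketch of the same idea using capture groups follows. The markup, names, and URLs in SAMPLE_HTML are made up for illustration and are not Bilibili's actual page source, and first_level_titles is a hypothetical helper name.

```python
import re

# Hypothetical navigation markup; the real page structure may differ.
SAMPLE_HTML = '''
<li class="m-i"><a class="i-link" href="//www.example.com/douga"><em>Animation</em></a><div class="sub">
<li class="m-i"><a class="i-link" href="//www.example.com/music"><em>Music</em></a><div class="sub">
'''

# One pattern captures the link and the name together,
# so no chain of split() calls is needed.
PATTERN = re.compile(r'href="//([^"]+)"><em>(.*?)</em>')


def first_level_titles(source):
    """Return {category name: link} for every match in the page source."""
    return {name: link for link, name in PATTERN.findall(source)}


titles = first_level_titles(SAMPLE_HTML)
print(titles)
```

With two groups in the pattern, re.findall returns a list of (link, name) tuples, which maps directly onto the dictionary the script is trying to build.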

This code is the product of two days of blind tinkering, and the function and variable names are not well chosen.

When I first fetched the page with requests.get(url), I did not understand why the call to raise_for_status() was needed. I later learned that it handles failed requests: it raises an exception when the response to the URL indicates an error. Exactly which errors it covers is still not quite clear to me.
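For reference, raise_for_status() in requests raises requests.HTTPError when the response status is a 4xx (client error) or 5xx (server error) code, and does nothing otherwise. The sketch below imitates that rule without touching the network. StatusError, check_status, and describe are hypothetical names that are not part of requests, and the URL is a placeholder.

```python
class StatusError(Exception):
    """Hypothetical stand-in for requests.HTTPError."""


def check_status(status_code, url):
    """Raise StatusError for 4xx/5xx codes; return None for anything else."""
    if 400 <= status_code < 500:
        raise StatusError(f"{status_code} Client Error for url: {url}")
    if 500 <= status_code < 600:
        raise StatusError(f"{status_code} Server Error for url: {url}")


def describe(status_code, url="https://www.example.com/"):
    """Show how a caller would react to the exception."""
    try:
        check_status(status_code, url)
    except StatusError as err:
        return f"failed: {err}"
    return "ok"


print(describe(200))
print(describe(404))
print(describe(503))
```

Calling the check right after the request means a failed download stops the script at once, instead of passing an error page on to the parsing functions.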

I also have not dug into how text.encoding = text.apparent_encoding works; that understanding will have to build up slowly.
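The encoding line matters because the same bytes decode to different text under different encodings. In requests, apparent_encoding is a guess made from the body bytes by a character-detection library, and assigning it to encoding changes how .text decodes the body. The sketch below shows only the underlying effect with plain bytes. The sample string is an assumption, and ISO-8859-1 stands in for a wrong default charset.

```python
# A page body as raw bytes, here UTF-8 encoded Chinese text (sample data).
body = "动画".encode("utf-8")

# Decoding with the wrong charset "succeeds" but yields mojibake.
wrong = body.decode("iso-8859-1")

# Decoding with the charset the bytes were written in restores the text.
right = body.decode("utf-8")

print(repr(wrong))
print(right)
```

Nothing is lost when the wrong charset is used, since the bytes are unchanged. Only the interpretation differs, which is why setting the encoding before reading .text is enough to fix garbled output.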

As a third-party library, requests provides convenient functions, but after studying it for these few days I feel it is not the best starting point for a beginner. The fundamentals lie deeper, so I think it is necessary to understand the urllib module as well.

Next, I am going to try using the urllib module to process the downloaded text, including urllib.request.urlopen, urllib.request.urlretrieve, and so on.
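A minimal sketch of those two urllib functions follows. To stay independent of any real website, it starts a throwaway HTTP server on the local machine. DemoHandler, the page content, and the file name page.html are all assumptions made for the demo.

```python
import os
import tempfile
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class DemoHandler(BaseHTTPRequestHandler):
    """Tiny local server so the example does not depend on a real site."""

    def do_GET(self):
        body = b"<html><body>hello</body></html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


server = HTTPServer(("127.0.0.1", 0), DemoHandler)  # port 0: pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_port}/"

# urlopen returns a response object; read() gives bytes that must be decoded.
with urllib.request.urlopen(url, timeout=30) as response:
    page = response.read().decode("utf-8")

# urlretrieve saves the response body straight to a local file.
target = os.path.join(tempfile.mkdtemp(), "page.html")
saved_path, headers = urllib.request.urlretrieve(url, target)

server.shutdown()
server.server_close()
print(page)
print(saved_path)
```

urlopen suits pages that will be parsed in memory, while urlretrieve is the shorter route when the goal is just to store a file on disk.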

I also ran into a problem: when I tried to use a video link from the dictionary to download a video from Bilibili, I got the following output:

    b'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03\xc5}{\x93\x1be\xb2\xef\xdf8\xe2|\x87^\xb1\xc1\x8c\x03\xeb9\x9a\x97\xf1\x0c\x07\x0c\xdcc\x1cx\xd8\xc5

My source code is:

    import urllib.request
    import urllib.parse

    def gettext(url):
        source = urllib.request.urlopen(url, timeout=30)
        return source.read()

    url = 'https://www.bilibili.com/video/av11138658/'
    text = gettext(url)
    print(text)

In the end, I put this down to Bilibili encrypting its videos. As a beginner with less than a month of experience, I have not been able to solve the problem yet.
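One thing worth checking before settling on encryption: the output shown above starts with the bytes \x1f\x8b\x08, which is the signature of gzip data, so the body may simply be a gzip-compressed page. The sketch below detects and undoes that with the standard library. The sample payload is made up, decode_body is a hypothetical helper, and whether this fully explains the Bilibili response is an assumption.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def decode_body(raw, encoding="utf-8"):
    """Decompress raw bytes if they look like gzip, then decode to text."""
    if raw.startswith(GZIP_MAGIC):
        raw = gzip.decompress(raw)
    return raw.decode(encoding)


# Stand-in for what urlopen(...).read() might return from a server
# that answers with Content-Encoding: gzip.
compressed = gzip.compress("<title>sample page</title>".encode("utf-8"))

print(compressed[:3])
print(decode_body(compressed))
print(decode_body(b"<p>plain</p>"))
```

urllib.request.urlopen hands back the body exactly as the server sent it, so any decompression has to be done by the caller, as in decode_body.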
