Python crawler: Crawl a website's movie download addresses

Source: Internet
Author: User

Preface: since I am still a beginner in the Python world with a long way to go, this article is meant only as a practical guide to reaching the goal. For the principles I do not yet fully understand, I will not give much explanation, to avoid misleading anyone; you can look them up online.

Friendly reminder: the URL used in this code is for learning and exchange purposes only; if anything here is inappropriate, please contact me and it will be removed.

Background: I set up a computer for my dad, who likes to watch movies, but the home network is poor, so I wanted to bulk-download some films and save them to the computer. However, most movie sites at the moment look like this:

You have to click into each movie's page to see the download address.

If I wanted to download 100 movies that way, my fingers would give out, so I decided to crawl out these addresses and batch-download them with Thunder (Xunlei).

Tools: Python (version 3.x)

Crawler principle: the download addresses are embedded in the page source code; the crawler extracts these scattered addresses in bulk and saves them to a file, which is easy to use afterwards.

The code comes first; if you can't wait, run it, then read the detailed walkthrough below.

import requests
import re

# changepage: generate the links for the different list pages
def changepage(url, total_page):
    page_group = ['https://www.dygod.net/html/gndy/jddy/index.html']
    for i in range(2, total_page + 1):
        link = re.sub('jddy/index', 'jddy/index_' + str(i), url)
        page_group.append(link)
    return page_group

# pagelink: generate the movie detail-page links contained in one list page
def pagelink(url):
    base_url = 'https://www.dygod.net/html/gndy/jddy/'
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36'}
    req = requests.get(url, headers=headers)
    req.encoding = 'gbk'  # specify the encoding, otherwise the text is garbled
    pat = re.compile('<a href="/html/gndy/jddy/(.*?)" class="ulink" title=(.*?)/a>', re.S)  # get the movie list URLs
    reslist = re.findall(pat, req.text)

    finalurl = []
    for i in range(1, 25):
        xurl = reslist[i][0]
        finalurl.append(base_url + xurl)
    return finalurl  # return all the movie detail-page addresses on this page

# getdownurl: get the download address from a movie detail page
def getdownurl(url):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36'}
    req = requests.get(url, headers=headers)
    req.encoding = 'gbk'  # specify the encoding, otherwise the text is garbled
    pat = re.compile('<a href="ftp(.*?)">ftp', re.S)  # get the ftp download address
    reslist = re.findall(pat, req.text)
    if not reslist:       # no ftp link found on this page
        return ''
    furl = 'ftp' + reslist[0]
    return furl

if __name__ == "__main__":
    html = "https://www.dygod.net/html/gndy/jddy/index.html"
    print('The site you are about to crawl is: https://www.dygod.net/html/gndy/jddy/index.html')
    pages = input('Please enter the number of pages to crawl: ')
    p1 = changepage(html, int(pages))
    with open('movie_heaven.lst', 'w') as f:
        j = 0
        for p1i in p1:
            j = j + 1
            print('Crawling page %d, url is %s ...' % (j, p1i))
            p2 = pagelink(p1i)
            for p2i in p2:
                p3 = getdownurl(p2i)
                if len(p3) == 0:
                    pass
                else:
                    finalurl = p3
                    f.write(finalurl + '\n')
    print('All page addresses crawled!')

Core module, the getdownurl function: requests is used to fetch the page, and you can treat the text it returns as the page's source code (almost every browser has a right-click option to view the page source). re.compile then builds the regular expression that matches the URL portion of that source code.

How do we extract this part? With a regular expression. How do you write that regular expression? Here is a simple, brute-force approach:

<a href= "ftp (. *?)" >ftp

Crawlers often use .*? for non-greedy matching (look up the term if you are curious); you can simply think of (.*?) as standing for the thing you want to extract, which in every page source is sandwiched between <a href="ftp and ">ftp. Some may ask: isn't what gets matched then not the full URL? For example, the match comes out as ://d:[email protected]:12311/[Movie Heaven www.dy2018.com]Call Me by Your Name BD Chinese-English dual subtitles.mp4, missing the little ftp in front?

Yes, but that is intentional. If the regular expression were written as <a href="(.*?)">ftp, far too many things could end up sandwiched between <a href=" and ">ftp, and cleaning those matches up in a second pass would cost more than the most direct approach: extract only the useful part and then stitch the ftp prefix back on, which is fast.
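To make that match-then-stitch idea concrete, here is a minimal sketch; the HTML snippet and the example.com address below are made up purely for illustration and are not taken from the real site:

import re

# A made-up fragment shaped like the site's download row
sample_html = '<td><a href="ftp://example.com:12311/some_movie.mp4">ftp://example.com:12311/some_movie.mp4</a></td>'

pat = re.compile('<a href="ftp(.*?)">ftp', re.S)
reslist = re.findall(pat, sample_html)   # ['://example.com:12311/some_movie.mp4']

# the captured group is missing the leading "ftp", so prepend it ourselves
furl = 'ftp' + reslist[0]
print(furl)                              # ftp://example.com:12311/some_movie.mp4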

Code Explanation:

First, Getdownurl

# getdownurl: get the download address from a movie detail page
def getdownurl(url):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36'}
    req = requests.get(url, headers=headers)
    req.encoding = 'gbk'  # specify the encoding, otherwise the text is garbled
    pat = re.compile('<a href="ftp(.*?)">ftp', re.S)  # get the ftp download address
    reslist = re.findall(pat, req.text)
    if not reslist:       # no ftp link found on this page
        return ''
    furl = 'ftp' + reslist[0]
    return furl

Here headers is used to disguise the script's requests as ordinary browser requests, in case the site has anti-crawler measures. You can easily find a suitable headers value in most browsers: in Firefox, for example, press F12 (or Inspect Element), open the Network tab, and the request headers are shown on the right.

requests module: requests.get(url, headers=headers) fetches the page as if the request came from Firefox.
re module: see any Python regular-expression reference; here re.compile builds the matching pattern and re.findall finds everything in the page source that matches it.
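If you want to check by hand what the crawler actually sees, a quick sketch like the following (reusing the same User-Agent string as the script; any reasonably recent browser string should work) prints the status code and the start of the page source that the regexes run against:

import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36'}
req = requests.get('https://www.dygod.net/html/gndy/jddy/index.html', headers=headers)
req.encoding = 'gbk'          # the site is GBK-encoded; without this the text is garbled
print(req.status_code)        # 200 if the request succeeded
print(req.text[:200])         # beginning of the page source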
Second, Pagelink
# pagelink: generate the movie detail-page links contained in one list page
def pagelink(url):
    base_url = 'https://www.dygod.net/html/gndy/jddy/'
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36'}
    req = requests.get(url, headers=headers)
    req.encoding = 'gbk'  # specify the encoding, otherwise the text is garbled
    pat = re.compile('<a href="/html/gndy/jddy/(.*?)" class="ulink" title=(.*?)/a>', re.S)  # get the movie list URLs
    reslist = re.findall(pat, req.text)

    finalurl = []
    for i in range(1, 25):
        xurl = reslist[i][0]
        finalurl.append(base_url + xurl)
    return finalurl  # return all the movie detail-page addresses on this page

The URLs that getdownurl from the first step crawls come from here: pagelink takes one list page and returns the URLs of all the movie pages it contains, i.e. a page like the one below that holds many movie links.

The source code is this:

You can probably already see which information is needed: the body of this page contains 25 movie links, and I use a list to store their URLs. Note that range(1, 25) does not include 25, so I actually store only 24 URLs; because my regular expression is not well written, the first URL it captures is broken, so I skip index 0. If you are interested, you can work out how to improve it.

It is worth mentioning that this regular expression contains two (.*?) groups, so the reslist returned by findall is two-dimensional (a list of tuples).
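As a quick illustration of that (the sample link and title below are invented, not taken from the site), a pattern with two capture groups makes re.findall return 2-tuples, which is why the code indexes reslist[i][0]:

import re

sample = '<a href="/html/gndy/jddy/20230101/12345.html" class="ulink" title=Some Movie</a>'
pat = re.compile('<a href="/html/gndy/jddy/(.*?)" class="ulink" title=(.*?)/a>', re.S)
reslist = re.findall(pat, sample)
print(reslist)        # [('20230101/12345.html', 'Some Movie<')]  (note the stray '<': the pattern is imperfect, as admitted above)
print(reslist[0][0])  # 20230101/12345.html  -> appended to base_url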

Third, Changepage

# changepage: generate the links for the different list pages
def changepage(url, total_page):
    page_group = ['https://www.dygod.net/html/gndy/jddy/index.html']
    for i in range(2, total_page + 1):
        link = re.sub('jddy/index', 'jddy/index_' + str(i), url)
        page_group.append(link)
    return page_group

This one is simple too: click "next page" and look at what shows up in the URL bar. Here it is index / index_2 / index_3 ..., which is easy to stitch together.
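To check the stitching, you can call changepage directly (assuming the changepage function defined above has been loaded); for 3 pages it produces:

# requires: import re, plus the changepage function from above
for link in changepage('https://www.dygod.net/html/gndy/jddy/index.html', 3):
    print(link)
# https://www.dygod.net/html/gndy/jddy/index.html
# https://www.dygod.net/html/gndy/jddy/index_2.html
# https://www.dygod.net/html/gndy/jddy/index_3.html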

Fourth, Main

if __name__ == "__main__":
    html = "https://www.dygod.net/html/gndy/jddy/index.html"
    print('The site you are about to crawl is: https://www.dygod.net/html/gndy/jddy/index.html')
    pages = input('Please enter the number of pages to crawl: ')
    p1 = changepage(html, int(pages))
    with open('movie_heaven.lst', 'w') as f:
        j = 0
        for p1i in p1:
            j = j + 1
            print('Crawling page %d, url is %s ...' % (j, p1i))
            p2 = pagelink(p1i)
            for p2i in p2:
                p3 = getdownurl(p2i)
                if len(p3) == 0:
                    pass
                else:
                    finalurl = p3
                    f.write(finalurl + '\n')
    print('All page addresses crawled!')

There is not much to say about main: it just loops over the pages and writes each download address to the file.

Fifth, Operation and Results

The resulting file can then be imported directly into Thunder (Xunlei): files with a .downlist or .lst suffix can be imported as a batch download task.

Postscript: some may feel that downloading movies this indiscriminately is wasteful, since some of them may be poor and downloading them wastes time and bandwidth, while screening them manually is too much trouble. A follow-up will store the movie information in a database so that only the desired addresses are filtered out.
