Crawling a specific TV show or movie from Movie Heaven with Python

Source: Internet
Author: User

1. Analyzing Search Requests

An expert once said: if you want to crawl a site's data, first analyze the site.

Today we are going to crawl Movie Heaven (ygdy8.com). Every good American TV series I have looked for can be found there; its catalogue is very complete.

The site carries a surprising number of ads. Regular users will know that clicking search opens a pop-up window, complete with silly music. Playful Blue Moon, anyone?

With Python, we can avoid ads and get what we want directly.

I'm using Firefox: press F12 to open the developer tools and select the Network tab.

Walking through the normal sequence of operations in the browser shows what Python has to do: it simply simulates the same web requests, so Python is just freeing up our hands.

In the search box, enter "傲骨贤妻" (The Good Wife), or any other show name you like, and watch the developer tools.

As you have probably spotted, the first request is the one we want. Open it.

Look at the parameters. keyword is, as the name says, the search keyword, and we can see that "傲骨贤妻" has been encoded into something unreadable. The parameters kwtype and searchtype do not seem to matter and I do not know exactly what they do, but since we are mimicking the request we send them along anyway to avoid problems.
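To see why the keyword looks unreadable in the Network panel, the sketch below percent-encodes the show name from GB2312 bytes and, for comparison, from UTF-8 bytes. It is written for Python 3 (an assumption; the article's own code is Python 2) and only illustrates the encoding step, not the site's exact behavior.

```python
from urllib.parse import quote, unquote

keyword = "傲骨贤妻"  # the show name typed into the search box

# Percent-encode the keyword from its GB2312 bytes.
gb2312_encoded = quote(keyword, encoding="gb2312")
# For comparison: the same keyword percent-encoded from UTF-8 bytes.
utf8_encoded = quote(keyword, encoding="utf-8")

print(gb2312_encoded)
print(utf8_encoded)

# Decoding with the matching charset recovers the original text.
roundtrip = unquote(gb2312_encoded, encoding="gb2312")
print(roundtrip == keyword)
```

The two encoded strings differ in length because GB2312 uses two bytes per Chinese character while UTF-8 uses three.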

Now we can open our development tool and get started. I use IntelliJ IDEA with the Python plugin installed; it is not far off PyCharm and works very well. Since I mostly develop in Java, I did not bother downloading another tool. Of course, if you want to use Notepad, I have no objection.

I first created film.py to hold the name of the TV show. This is a good habit: sensitive data is often kept in a dedicated file that is encrypted, or listed in .gitignore so it is never committed to GitHub, which avoids unnecessary trouble.

    # coding=utf-8
    filmname = '傲骨贤妻'

2. Simulating a search request with Python

Create __init__.py

Import the required modules urllib2, re and film. The comments in the code should be clear enough, so I will only explain % (film.filmname).decode("utf-8").encode('gb2312'). The % substitutes the value I stored in film.py into the request string. Why decode and then encode? Right-click the page and view its source, and you will find that Movie Heaven is encoded not in UTF-8 but in gb2312, so we have to encode to match. The keyword we saw earlier looked like Martian text; now we know it was simply gb2312. So we first decode filmname from UTF-8 into a Unicode string, the readable "傲骨贤妻", and then encode it as gb2312.

The Movie Heaven backend can then read "傲骨贤妻", and so.php can run our query. We send kwtype=0&searchtype=title along as well; it costs nothing.
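The decode-then-encode step can be tried in isolation. The following is a minimal sketch in Python 3 (an assumption; in Python 2 the name read from film.py is already a UTF-8 byte string, which is why the article calls decode first). The variable names are illustrative only.

```python
text = "傲骨贤妻"

utf8_bytes = text.encode("utf-8")      # how a UTF-8 source file stores the name
gb2312_bytes = text.encode("gb2312")   # the byte form a gb2312 page works with

# Python 2's  s.decode("utf-8").encode("gb2312")  corresponds to this:
converted = utf8_bytes.decode("utf-8").encode("gb2312")

print(len(utf8_bytes), len(gb2312_bytes))
print(converted == gb2312_bytes)
```

The same four characters take three bytes each in UTF-8 and two bytes each in GB2312, so sending the UTF-8 bytes unchanged would give the server a different byte sequence from the one it expects.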

Regular-expression syntax is a Python basic that you can learn from an online course, so I will not explain it here. Our goal is to look at what the result hyperlinks in the HTML have in common and match them with a regular expression.
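As a rough illustration of matching hyperlinks by their shared shape, the sketch below runs the article's link pattern over a small piece of HTML. The HTML is made up for the example and is not real search output; Python 3 is assumed.

```python
import re

# Hypothetical search-result HTML, for illustration only.
sample_html = """
<a href='/html/tv/oumeitv/20140922/46154.html'>result one</a>
<a href='/html/gndy/dyzz/20140101/12345.html'>a different category</a>
<a href='/html/tv/oumeitv/20130930/43210.html'>result two</a>
"""

# Fixed prefix, an 8-digit date folder, then a 9-10 character file name.
detail_pattern = r'/html/tv/oumeitv/[0-9]{8}/[0-9a-zA-Z.]{9,10}'

paths = re.findall(detail_pattern, sample_html)
print(paths)
```

Only the two links under /html/tv/oumeitv/ are returned; the link in the other folder does not fit the pattern and is skipped.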

    # coding=utf-8
    import urllib2
    import film
    import re

    opener = urllib2.build_opener()  # build an opener object

    def search():
        req = urllib2.Request('http://s.ygdy8.com/plus/so.php')
        # so.php URL-encodes the Chinese request parameter, so the Chinese
        # text must first be encoded as gb2312
        req.add_data('kwtype=0&searchtype=title&keyword=%s' % (film.filmname).decode("utf-8").encode('gb2312'))
        html = opener.open(req).read().decode('gb2312')
        reg = r'/html/tv/oumeitv/[0-9]{8}/[0-9a-zA-Z.]{9,10}'
        return re.findall(reg, html)

    search()

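urllib2 exists only in Python 2. For readers on Python 3, a possible port is sketched below under some assumptions: build_search_request is a hypothetical helper name, the keyword is passed in as an argument instead of being read from film.py, and the form body is percent-encoded with urlencode, which differs slightly from the article's code (it sends the raw gb2312 bytes). Whether the site still answers at this URL is not verified here, so the example only builds the request and does not send it.

```python
import re
import urllib.request
from urllib.parse import urlencode

SEARCH_URL = 'http://s.ygdy8.com/plus/so.php'   # URL taken from the article
DETAIL_PATTERN = r'/html/tv/oumeitv/[0-9]{8}/[0-9a-zA-Z.]{9,10}'


def build_search_request(keyword):
    """Build a POST request carrying the three form parameters."""
    body = urlencode(
        {'kwtype': '0', 'searchtype': 'title', 'keyword': keyword},
        encoding='gb2312',          # percent-encode from GB2312 bytes
    ).encode('ascii')
    return urllib.request.Request(SEARCH_URL, data=body)


def search(keyword, timeout=10):
    """Send the search and return the detail-page paths found in the reply."""
    request = build_search_request(keyword)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        html = response.read().decode('gb2312', 'ignore')
    return re.findall(DETAIL_PATTERN, html)


# Build the request without touching the network and inspect it.
request = build_search_request('傲骨贤妻')
print(request.get_method(), request.data)
```

Splitting request building from sending keeps the encoding logic checkable offline; calling search('傲骨贤妻') would perform the actual network request.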
3. Analysis

Let's carry on analyzing the site. We have just completed the search.

The results page now looks like this. For the moment we take only the first result, "2014 flagship American drama 傲骨贤妻 (The Good Wife), season 6".

Open the first link and we reach a familiar page, where we finally find what we are after.

4. Get the download link

There are a surprising number of ads here too... Fortunately I have Flash disabled.

Now it is time to open IDEA and write code. list holds the search results. The search returned two results, but to keep the output easy to read I did not loop over them and took only the first, the "2014 flagship..." entry. The u prefix marks a Unicode string, which we need because the pattern contains Chinese characters.

Decode the HTML, then use a regular expression to match Movie Heaven's download-link format.

defopensearchresult (): List=search () req= Urllib2. Request ('http://www.ygdy8.com'+list[0]) HTML= Opener.open (req). Read (). Decode ('gb2312','Ignore') Reg= u'Ftp://[a-z0-9]+:[a-z0-9][email protected][a-z0-9]+. [A-z] {1,8}. [A-z] {3}:[\d]{4}/[\u4e00-\u9fa5]{0,10}[\w]*\[Sunshine Movie www.ygdy8.com\][\u4e00-\u9fa5]*[\d]+[\u4e00-\u9fa5]\[[\u4e00-\ U9FA5]+\].RMVB'    returnRe.findall (reg,html) opensearchresult ()

Then loop over the list returned by opensearchresult(). The Unicode strings have to be printed one by one for the Chinese to display properly.

    def getList():
        for i in opensearchresult():
            print i

    getList()
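The article takes only the first search result. If every result should be processed, one possible shape for the loop is sketched below in Python 3 (an assumption). collect_links and its fetch parameter are hypothetical names: fetch is any callable that returns the decoded HTML for a path, which lets the loop be tried with canned pages instead of the live site. The link pattern is a simplified stand-in, not the article's.

```python
import re

LINK_PATTERN = r'ftp://[^"\'<>\s]+\.rmvb'   # simplified pattern, an assumption


def collect_links(paths, fetch):
    """Return {path: [download links]} for every search-result path."""
    return {path: re.findall(LINK_PATTERN, fetch(path)) for path in paths}


# Offline stand-in for the real detail pages (made-up content).
fake_pages = {
    '/html/tv/oumeitv/20140922/46154.html':
        '<a href="ftp://u:p@dl.example.com:8080/s06e01.rmvb">',
    '/html/tv/oumeitv/20130930/43210.html':
        '<a href="ftp://u:p@dl.example.com:8080/s05e01.rmvb">',
}

results = collect_links(fake_pages, fake_pages.get)
for path, links in results.items():
    for link in links:
        print(path, link)
```

In real use, fetch would download 'http://www.ygdy8.com' + path and decode the response as gb2312, as the article's opensearchresult() does for the first result.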

The result is as follows. Copy the links into Thunder (Xunlei) to download them.

For this run I changed filmname to The Walking Dead.

5. Source code

Basic regular-expression syntax examples: https://github.com/cjy513203427/pachong/tree/master/regularExpression

Source code for this post: https://github.com/cjy513203427/pachong/tree/master/downloadDytt
