Installing and using pyquery, a helper module for Python crawlers

Source: Internet
Author: User
This article describes how to install and use pyquery, a helper module for Python crawlers. pyquery makes it easy to parse HTML content, which has made it popular with many crawler developers. Readers who need it can use this article as a reference.

Installation under Windows:
Download Address: https://pypi.python.org/pypi/pyquery/#downloads

Install after download:


C:\python27>easy_install E:\python\pyquery-1.2.4.zip


It can also be installed directly online:


C:\python27>easy_install pyquery


pyquery is a jQuery-like Python library. It uses jQuery-style syntax to extract data from a web page, which makes it a good third-party library for extracting and mining data from HTML pages. Let's take a look at how to use it.

Extracting information from an HTML string


#!/usr/bin/python
# -*- coding: utf-8 -*-
from pyquery import PyQuery as pq
html = '''


The code snippet above shows the common pyquery operations. We first define a piece of HTML and then use pyquery's methods to work on it, mainly to get specific elements and their text. pyquery can do more than get elements: it can also set element attributes, add elements, and so on. Since the methods used in the code above are the ones we use most often, the other methods are not covered here.

Extracting information from a URL or local HTML file

pyquery can parse more than HTML strings like the one above. It can also load a URL:

d = pq(url='http://www.baidu.com/')

Here we load a URL directly, and everything else works the same as before. By default this uses the urllib module for the HTTP request, but if requests is installed on your system, pyquery uses requests instead, which means you can pass requests parameters, for example:

pq('http://www.baidu.com/', headers={'user-agent': 'pyquery'})

Or, if you already have the HTML file stored locally, you can do the following:

d = pq(filename=path_to_html_file)

This form points pyquery directly at the local HTML file, and the rest of the operations are the same as above.
As you can see, pyquery makes it very convenient to select any element, just like jQuery.

Use pyquery to crawl the Douban Movie Top 250

Now that we have seen the pyquery syntax, let's look at an example: grabbing the Douban Movie Top 250.
Because the site has strong anti-crawler measures, loading the URL directly stopped working after a few runs. I had to download the page with requests and then have pyquery parse the saved page to extract the information:

from pyquery import PyQuery as pq
import requests

# Browser-like request headers, to get past the anti-crawler check
head_req = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36',
    'Referer': 'https://movie.douban.com/top250?start=0',
}
r = requests.get("https://movie.douban.com/top250?start=0", headers=head_req)

# Save the downloaded page, then parse the local copy with pyquery
with open("1.html", "wb") as html:
    html.write(r.content)
d = pq(filename="1.html")

# print(d('ol').find('li').html())
for data in d('ol').items('li'):
    print(data.find('.hd').find('.title').eq(0).text())        # title
    print(data.find('.star').find('.rating_num').text())       # rating
    print(data.find('.quote').find('.inq').text())             # one-line quote
    print('')


Running it gives the following results:


The Shawshank Redemption
9.6
Hope sets people free.

Léon: The Professional
9.4
The story that had to be told about a strange uncle and a little girl.

Forrest Gump
9.4
A modern American history.

Farewell My Concubine
9.4
Peerless elegance.

Life Is Beautiful
9.5
The most beautiful lie.

Spirited Away
9.2
The best Hayao Miyazaki, the best Joe Hisaishi.

Schindler's List
9.4
To save one person is to save the world.

The Legend of 1900
9.2
Everyone has to walk their own firm path, even if it breaks them to pieces.

WALL·E
9.3
Little WALL·E, big life.

Inception
9.2
Nolan gave us a dream that cannot be stolen.

Titanic
9.1
What is lost is eternal.

3 Idiots
9.1
A handsome Mr. Bean, a high-EQ Sheldon.

The Chorus
9.2
Heavenly boys' voices are the closest thing to God.

Hachi: A Dog's Tale
9.2
Never forget the one you love.

My Neighbor Totoro
9.1
There is a Totoro in everyone's heart; childhood never disappears.

A Chinese Odyssey Part Two: Cinderella
9.1
The love of a lifetime.

The Godfather
9.2
Never bear grudges against your opponents; it will make you lose your mind.

Gone with the Wind
9.2
Tomorrow is another day.

Cinema Paradiso
9.1
Those kissing scenes, those youthful years, washed clear by tears in the dark of the cinema.

The Pursuit of Happyness
8.9
An inspirational movie for ordinary people.

Fight Club
9.0
Evil and mediocrity lie dormant in the same matrix, confronting each other at a particular time.

The Truman Show
9.0
If I can't see you again, good morning, good afternoon.

The Intouchables
9.1
An elegant comedy full of warmth.

The Lord of the Rings 3: The Return of the King
9.1
The final chapter of the epic.

Roman Holiday
8.9
Love that lasts only a day.

Of course, this is only the first page, with 25 entries. We already know that the Douban Movie Top 250 URL is

https://movie.douban.com/top250?start=0
The start parameter begins at 0 and increases by 25 each time, up to

https://movie.douban.com/top250?start=225
So you can write a loop and grab them all.
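A minimal sketch of that loop, using only the URL pattern given above. The names page_urls and crawl, the fetch_page callback, and the one-second delay are assumptions for illustration; fetch_page stands in for whatever code downloads and parses a single page.

```python
import time

BASE_URL = "https://movie.douban.com/top250?start={}"


def page_urls(total=250, page_size=25):
    """Build the page URLs: start=0, 25, 50, ... up to start=225."""
    return [BASE_URL.format(start) for start in range(0, total, page_size)]


def crawl(fetch_page, delay=1.0):
    """Call fetch_page(url) for every page and collect the results.

    fetch_page is a hypothetical callable that takes a URL and returns
    a list of records for that page. The delay between requests is an
    assumption, added to be gentle with the site.
    """
    results = []
    for url in page_urls():
        results.extend(fetch_page(url))
        time.sleep(delay)
    return results


urls = page_urls()
for url in urls:
    print(url)
```

Passing the page handler in as a function keeps the loop independent of how each page is downloaded, so the requests and pyquery code from the example above can be plugged in unchanged.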
