pyquery: A Python Library for Parsing HTML

Source: Internet
Author: User

PyQuery is a Python library modeled on jQuery; it can be described as a Python implementation of the jQuery API. It lets you query HTML documents with jQuery-style syntax, is easy to use, and parses quickly.

For example, take this HTML fragment from a Douban movie page, http://movie.douban.com/subject/3530403/:

Director: Tom Tykwer / Lana Wachowski / Andy Wachowski
Screenwriter: Tom Tykwer / Andy Wachowski / Lana Wachowski
Starring: Tom Hanks / Halle Berry / Jim Broadbent / Hugo Weaving / Jim Sturgess / Bae Doona / Ben Whishaw / James D'Arcy / Zhou Xun / Keith David / David Gyasi / Susan Sarandon / Hugh Grant
Official website: cloudatlas.warnerbros.com
Production country/region: Germany / US / Hong Kong / Singapore
Language: English
Release date: (Mainland China) / 2012-10-26 (United States)
Length: 134 minutes (Mainland China) / 172 minutes (United States)
IMDb link: tt1371111
Official website: Movie cloud
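from pyquery import PyQuery as pq

# Fetch the page and select every element whose class is 'pl'
doc = pq(url='http://movie.douban.com/subject/3530403/')
data = doc('.pl')
for i in data:
    # Each item is a raw lxml element, so wrap it in pq() to call text()
    print(pq(i).text())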

Output

Director
Screenwriter
Starring
Type:
Official website:
Production country/region:
Language:
Release date:
Length:
IMDb link:
Official website:

As you can see, the code looks very much like jQuery.

Usage

You can use the PyQuery class to load an XML document from a string, an lxml object, a file, or a URL:

>>> from pyquery import PyQuery as pq
>>> from lxml import etree
>>> doc=pq("

You can then select elements just as you would in jQuery:

>>> doc('.pl')
[<span.pl>, <span.pl>, <span.pl>, <span.pl>, <span.pl>, <span.pl>, <span.pl>, <span.pl>, <span.pl>, <span.pl>, <span.pl>, <span#rateword.pl>, <span.pl>, <span.pl>, <span.pl>, <span.pl>, <span.pl>, <span.pl>, <span.pl>, <p.pl>]

This selects every element whose class is 'pl'.

However, when you iterate over the result, each item is a raw lxml element, so it must be wrapped in pq() again before you can call methods such as text():

>>> for para in doc('.pl'):
...     para = pq(para)
...     print(para.text())
...
Director
Screenwriter
Starring
Type:
Official website:
Production country/region:
Language:
Release date:
Length:
IMDb link:
Official website:

The resulting text is a Unicode string. To write it to a file, encode it first (for example as UTF-8) or open the file with an explicit encoding.

You can also use some pseudo-classes that jQuery provides but that are not part of standard CSS, such as :first:

>>> doc('.pl:first')
[<span.pl>]
>>> print(doc('.pl:first').text())
Director
Attributes

Get the attributes of an HTML element:

>>> p = pq('<p id="hello" class="hello"></p>')('p')
>>> p.attr('id')
'hello'
>>> p.attr.id
'hello'
>>> p.attr['id']
'hello'

Set attributes:

>>> p.attr.id = 'plop'
>>> p.attr.id
'plop'
>>> p.attr['id'] = 'ola'
>>> p.attr.id
'ola'
>>> p.attr(id='hello', class_='hello2')
[<p#hello.hello2>]
Traversing

Filter

>>> d = pq('<p id="hello" class="hello"><a/>hello</p><p id="test"><a/>world</p>')
>>> d('p').filter('.hello')
[<p#hello.hello>]
>>> d('p').filter('#test')
[<p#test>]
>>> d('p').filter(lambda i: i == 1)
[<p#test>]
>>> d('p').filter(lambda i: i == 0)
[<p#hello.hello>]
>>> d('p').filter(lambda i: pq(this).text() == 'hello')
[<p#hello.hello>]

Select by index

>>> d('p').eq(0)
[<p#hello.hello>]
>>> d('p').eq(1)
[<p#test>]

Select nested elements

>>> d('p').eq(1).find('a')
[<a>]

Return to the previous selection with end()

>>> d = pq('<p><span><em>Whoah!</em></span></p><p><em> there</em></p>')
>>> d('p').eq(1).find('em')
[<em>]
>>> d('p').eq(1).find('em').end()
[<p>]
>>> d('p').eq(1).find('em').end().end()
[<p>, <p>]
>>> d('p').eq(1).find('em').end().text()
'there'

  

Download: http://pypi.python.org/pypi/pyquery

Documentation: http://packages.python.org/pyquery/

Selector Summary: http://www.cnblogs.com/onlys/articles/jQuery.html
