PyQuery is a Python library modeled on jQuery; it can be described as a Python implementation of the jQuery API. It lets you query HTML documents with jQuery-style selectors, and it is easy to use and fast at parsing.
For example, take this HTML fragment from a Douban movie page, http://movie.douban.com/subject/3530403/
Director: Tom Tykwer / Lana Wachowski / Andy Wachowski
Screenwriter: Tom Tykwer / Andy Wachowski / Lana Wachowski
Starring: Tom Hanks / Halle Berry / Jim Broadbent / Hugo Weaving / Jim Sturgess / Bae Doona / Ben Whishaw / James D'Arcy / Zhou Xun / Keith David / David Gyasi / Susan Sarandon / Hugh Grant
Official website: cloudatlas.warnerbros.com
Production country/region: Germany / US / Hong Kong / Singapore
Language: English
Release date: (Mainland China) / 2012-10-26 (United States)
Length: 134 minutes (Mainland China) / 172 minutes (United States)
IMDb link: tt1371111
Official website: Movie cloud
from pyquery import PyQuery as pq
doc = pq(url='http://movie.douban.com/subject/3530403/')
data = doc('.pl')
for i in data:
    print(pq(i).text())
Output
Director
Screenwriter
Starring
Type:
Official website:
Production country/region:
Language:
Release date:
Length:
IMDb link:
Official website:
As you can see, the code reads much like jQuery.
Usage
You can use the PyQuery class to load XML documents from strings, lxml objects, files, or URLs:
>>> from pyquery import PyQuery as pq
>>> from lxml import etree
>>> doc=pq("
You can then select elements just as you would in jQuery.
>>> doc('.pl')
[<span.pl>, <span.pl>, <span.pl>, <span.pl>, <span.pl>, <span.pl>, <span.pl>, <span.pl>, <span.pl>, <span.pl>, <span.pl>, <span#rateword.pl>, <span.pl>, <span.pl>, <span.pl>, <span.pl>, <span.pl>, <span.pl>, <span.pl>, <p.pl>]
This selects every element whose class is 'pl'.
However, iterating over the result yields plain lxml elements, so each one must be wrapped in pq() again before its text can be read:
for para in doc('.pl'):
    para = pq(para)
    print(para.text())
Director
Screenwriter
Starring
Type:
Official website:
Production country/region:
Language:
Release date:
Length:
IMDb link:
Official website:
The resulting text is a unicode string. To write it to a file, you must first encode it as a byte string (for example, as UTF-8).
You can also use some pseudo-classes that jQuery provides but that are not part of standard CSS, such as :first:
>>> doc('.pl:first')
[<span.pl>]
>>> print(doc('.pl:first').text())
Director
Attributes
Get attributes of HTML elements
>>> p = pq('<p id="hello" class="hello"></p>')('p')
>>> p.attr('id')
'hello'
>>> p.attr.id
'hello'
>>> p.attr['id']
'hello'
Assignment
>>> p.attr.id = 'plop'
>>> p.attr.id
'plop'
>>> p.attr['id'] = 'ola'
>>> p.attr.id
'ola'
>>> p.attr(id='hello', class_='hello2')
[<p#hello.hello2>]
Traversing
Filter
>>> d = pq('<p id="hello" class="hello"><a/>hello</p><p id="test"><a/>world</p>')
>>> d('p').filter('.hello')
[<p#hello.hello>]
>>> d('p').filter('#test')
[<p#test>]
>>> d('p').filter(lambda i: i == 1)
[<p#test>]
>>> d('p').filter(lambda i: i == 0)
[<p#hello.hello>]
>>> d('p').filter(lambda i: pq(this).text() == 'hello')
[<p#hello.hello>]
Select by index
>>> d('p').eq(0)
[<p#hello.hello>]
>>> d('p').eq(1)
[<p#test>]
Select nested elements
>>> d('p').eq(1).find('a')
[<a>]
Go back to the previous selection with end()
>>> d = pq('<p><span><em>Whoah!</em></span></p><p><em> there</em></p>')
>>> d('p').eq(1).find('em')
[<em>]
>>> d('p').eq(1).find('em').end()
[<p>]
>>> d('p').eq(1).find('em').end().text()
'there'
>>> d('p').eq(1).find('em').end().end()
[<p>, <p>]
Download: http://pypi.python.org/pypi/pyquery
Documentation: http://packages.python.org/pyquery/
Selector Summary: http://www.cnblogs.com/onlys/articles/jQuery.html