Python parsing HTML Development Library Pyquery use method _python

Source: Internet
Author: User

For example

Copy Code code as follows:

<div id= "Info" >
<span><span class= ' PL ' > Director </span&gt: <a href= "/celebrity/1047989/" rel= "V:directedby" > 汤姆·提克威 </a>/<a href= "/celebrity/1161012/" rel= "V:directedby" > Lana Vodoski </a>/<a href= "/celebrity/" 1013899/"rel=" V:directedby "> Andy Vodoski </a></span><br/>
<span><span class= ' pl ' > screenwriter </span&gt: <a href= "/celebrity/1047989/" > Tom Tikwi </a>/<a href= "/celebrity/1013899/" > Andy Vodoski </a>/<a href= "/celebrity/1161012/" > Lana Vodoski </a></span ><br/>
<span><span class= ' pl ' > starring </span&gt: <a href= "/celebrity/1054450/" rel= "v:starring" > Tom Hanks </a>/<a href= "/celebrity/1054415/" rel= "v:starring" > Halle Berry </a>/<a href= "/celebrity/1019049/" Rel= "v:starring" > Jim Braudbent </a>/<a href= "/celebrity/1040994/" rel= "v:starring" > Hugo Uighur </a>/< A href= "/celebrity/1053559/" rel= "v:starring" > Jim Stergis </a>/<a href= "/celebrity/1057004/" rel= "V: Starring "> Pei Douna </a>/<a href="/celebrity/1025149/"rel=" v:starring "> Ben Wishow </a>/<a href="/ celebrity/1049713/"rel=" v:starring "> James Dasi </a>/<a href="/celebrity/1027798/"rel=" v:starring "> Zhou Xun </a>/<a href= "/celebrity/1019012/" rel= "v:starring" > Kess David </a>/<a href= "/celebrity/1201851/" Rel= "v:starring" > David Gijaci </a>/<a href= "/celebrity/1054392/" rel= "v:starring" > Susan Sarandon Langdon </a>/<a href= "/celebrity/1003493/" rel= "v:starring" > Doldrums </a></span><br/>
<span class= "PL" > Type:</span> <span property= "V:genre" > Plot </span>/<span property= "V:genre" > Science fiction </span>/<span property= "v:genre" > Suspense </span><br/>
<span class= "PL" > official website:</span> <a href= "http://cloudatlas.warnerbros.com" rel= nofollow "target=" _ Blank ">cloudatlas.warnerbros.com</a><br/>
<span class= "PL" > Producer country/Region:</span> Germany/US/Hong Kong/Singapore <br/>
<span class= "PL" > Language:</span> English <br/>
<span class= "PL" > Release date:</span> <span property= "v:initialreleasedate" content= "2013-01-31 (Mainland China)" > 2013-01-31 (China Mainland) </span>/<span property= "v:initialreleasedate" content= "2012-10-26 (USA)" >2012-10-26 (USA ) </span><br/>
<span class= "PL" >:</span> <span property= "V:runtime" content= "134" >134 minutes (Mainland China) </span>/ 172 mins (USA) <br/>
<span class= "PL" >imdb link:</span> <a href= "http://www.imdb.com/title/tt1371111" target= "_blank" Nofollow ">tt1371111</a><br>
<span class= "PL" > Official station:</span>
<a href= "http://site.douban.com/202494/" target= "_blank" > film "Cloud Picture" </a>
</div>

Copy Code code as follows:

From pyquery import Pyquery as PQ
DOC=PQ (url= ' http://movie.douban.com/subject/3530403/')
Data=doc ('. pl ')
For I in data:
Print PQ (i). Text ()

Output

Copy Code code as follows:

Director
Writers
Starring
Type:
Official website:
Production Country/region:
Language:
Release Date:
Film Length:
IMDB Link:
Official station:

Usage

Users can use the Pyquery class to load an XML document from a string, lxml object, file, or URL:

Copy Code code as follows:

>>> from pyquery import Pyquery as PQ
>>> from lxml import etree
>>> DOC=PQ (">>> DOC=PQ (etree.fromstring (">>> DOC=PQ (Filename=path_to_html_file)
>>> doc=pq (url= ' http://movie.douban.com/subject/3530403/')

You can select objects like jquery.

Copy Code code as follows:

>>> doc ('. Pl ')
[<span.pl>, <span.pl>, <span.pl>, <span.pl>, <span.pl>, <span.pl>, <span.pl ", <span.pl>, <span.pl>, <span.pl>, <span.pl>, <span#rateword.pl>, <span.pl> <span.pl>, <span.pl>, <span.pl>, <span.pl>, <span.pl>, <span.pl>, <p.pl>]

In this way, the object class for ' pl ' is all selected.

However, when you use iterations, you need to encapsulate the text:

Copy Code code as follows:

For Para in Doc ('. pl '):
PARA=PQ (para)
Print Para.text ()
Director
Writers
Starring
Type:
Official website:
Production Country/region:
Language:
Release Date:
Film Length:
IMDB Link:
Official station:

The text you get here is a Unicode code, which you need to encode as a string if you want to write to the file.
Users can use some of the pseudo classes provided by jquery (but do not yet support CSS) to operate, such as:

Copy Code code as follows:

>>> doc ('. Pl:first ')
[<span.pl>]
>>> Print doc ('. Pl:first '). Text ()
Director

Attributes
Get Properties of HTML element

Copy Code code as follows:

>>> p=pq (' <p id= "Hello" class= "Hello" ></p> ") (' P ')
>>> p.attr (' id ')
' Hello '
>>> p.attr.id
' Hello '
>>> p.attr[' id ']
' Hello '

assigning values

Copy Code code as follows:

>>> p.attr.id= ' plop '
>>> p.attr.id
' Plop '
>>> p.attr[' id ']= ' ola '
>>> p.attr.id
' Ola '
>>> p.attr (id= ' Hello ', class_= ' Hello2 ')
[<p#hello.hell0>]

Traversing
Filter

Copy Code code as follows:

>>> d=pq (' <p id= "Hello" class= "Hello" ><a/>hello</p><p id= "test" ><a/>world </p> ')
>>> d (' P '). Filter ('. Hello ')
[<p#hello.hello>]
>>> d (' P '). Filter (' #test ')
[<p#test>]
>>> d (' P '). Filter (lambda i:i==1)
[<p#test>]
>>> d (' P '). Filter (lambda i:i==0)
[<p#hello.hello>]
>>> d (' P '). Filter (lambda I:PQ (this). Text () = = ' Hello ')
[<p#hello.hello>]

Select in order

Copy Code code as follows:

>>> d (' P '). EQ (0)
[<p#hello.hello>]
>>> d (' P '). EQ (1)
[<p#test>]

Select inline Element

Copy Code code as follows:

>>> d (' P '). EQ (1). Find (' a ')
[<a>]

Select parent Element

Copy Code code as follows:

>>> D=PQ (' <p><span><em>Whoah!</em></span></p><p><em> There</em></p> ')
>>> d (' P '). EQ (1). Find (' em ')
[<em>]
>>> d (' P '). EQ (1). Find (' em '). End ()
[<p>]
>>> d (' P '). EQ (1). Find (' em '). End (). Text ()
' There '
>>> d (' P '). EQ (1). Find (' em '). End (). End ()
[<p>, <p>]

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.