Python Web Crawler Notes (7): PyQuery in Detail


Official documentation: http://pyquery.readthedocs.io/en/latest/api.html

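1. What is PyQuery?

PyQuery is a powerful and flexible HTML parsing library modeled on jQuery. If you find regular expressions too tedious to write, find BeautifulSoup's syntax hard to remember, and already know jQuery's syntax, PyQuery is an excellent choice.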

2. Installation

pip3 install pyquery

3. Initialization

(1) Initializing from a string

from pyquery import PyQuery as pq

html = '''
<div>
    <ul>
        <li class="item-0">first item</li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
        <li class="item-1 active"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul>
</div>
'''
# Parse the HTML string, then select every <li> element
doc = pq(html)
print(doc('li'))

  

(2) Initializing from a URL

from pyquery import PyQuery as pq

# PyQuery fetches the page itself and parses the response
doc = pq(url="http://www.baidu.com")
print(doc('head'))

  

(3) Initializing from a file

from pyquery import PyQuery as pq

# Read and parse a local HTML file
doc = pq(filename='demo.html')
print(doc('li'))

  

4. Basic CSS selectors

from pyquery import PyQuery as pq

html = '''
<div id="container">
    <ul class="list">
        <li class="item-0">first item</li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
        <li class="item-1 active"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul>
</div>
'''
doc = pq(html)
# "#" selects by id, "." by class, and a bare name by tag;
# a space means "descendant of"
print(doc('#container .list li'))

  

(1) Finding elements

  Child elements

from pyquery import PyQuery as pq

html = '''
<div id="container">
    <ul class="list">
        <li class="item-0">first item</li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
        <li class="item-1 active"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul>
</div>
'''
doc = pq(html)
items = doc('.list')
print(type(items))
print(items)
# find() searches all descendants of the current selection
lis = items.find('li')
print(type(lis))
print(lis)

  

# children() returns only the direct children (continues the example above)
lis = items.children()
print(type(lis))
print(lis)

  

# Pass a selector to keep only the children that match it
lis = items.children('.active')
print(lis)

  

  Parent elements

from pyquery import PyQuery as pq

html = '''
<div id="container">
    <ul class="list">
        <li class="item-0">first item</li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
        <li class="item-1 active"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul>
</div>
'''
doc = pq(html)
items = doc('.list')
# parent() returns the direct parent only
container = items.parent()
print(type(container))
print(container)

  

from pyquery import PyQuery as pq

html = '''
<div class="wrap">
    <div id="container">
        <ul class="list">
            <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>
</div>
'''
doc = pq(html)
items = doc('.list')
# parents() returns every ancestor, not just the direct parent
container = items.parents()
print(type(container))
print(container)

  

# Filter the ancestors with a selector (continues the example above)
parent = items.parents('.wrap')
print(parent)

  

  Sibling elements

from pyquery import PyQuery as pq

html = '''
<div class="wrap">
    <div id="container">
        <ul class="list">
            <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>
</div>
'''
doc = pq(html)
# Select the <li> that has both the item-0 and active classes
li = doc('.list .item-0.active')
# siblings() returns all other elements at the same level
print(li.siblings())

  

from pyquery import PyQuery as pq

html = '''
<div class="wrap">
    <div id="container">
        <ul class="list">
            <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>
</div>
'''
doc = pq(html)
li = doc('.list .item-0.active')
# Pass a selector to keep only the siblings that match it
print(li.siblings('.active'))

  

5. Traversal

(1) Single element

from pyquery import PyQuery as pq

html = '''
<div class="wrap">
    <div id="container">
        <ul class="list">
            <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>
</div>
'''
doc = pq(html)
# A selection that matches one element can be used directly
li = doc('.item-0.active')
print(li)

  

from pyquery import PyQuery as pq

html = '''
<div class="wrap">
    <div id="container">
        <ul class="list">
            <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>
</div>
'''
doc = pq(html)
# For several elements, items() returns a generator of PyQuery objects
lis = doc('li').items()
print(type(lis))
for li in lis:
    print(li)

  

(2) Extracting information

  Getting attributes

from pyquery import PyQuery as pq

html = '''
<div class="wrap">
    <div id="container">
        <ul class="list">
            <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>
</div>
'''
doc = pq(html)
a = doc('.item-0.active a')
print(a)
# Two equivalent ways to read an attribute
print(a.attr('href'))
print(a.attr.href)

  

  Getting text

from pyquery import PyQuery as pq

html = '''
<div class="wrap">
    <div id="container">
        <ul class="list">
            <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>
</div>
'''
doc = pq(html)
a = doc('.item-0.active a')
print(a)
print(a.attr('href'))
# text() returns the text content with all tags stripped
print(a.text())

  

  Getting HTML

from pyquery import PyQuery as pq

html = '''
<div class="wrap">
    <div id="container">
        <ul class="list">
            <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>
</div>
'''
doc = pq(html)
li = doc('.item-0.active')
print(li)
# html() returns the inner HTML of the element
print(li.html())

  

6. DOM manipulation

(1) addClass and removeClass

from pyquery import PyQuery as pq

html = '''
<div class="wrap">
    <div id="container">
        <ul class="list">
            <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>
</div>
'''
doc = pq(html)
li = doc('.item-0.active')
print(li)
# Remove the class, then add it back; the element is modified in place
li.removeClass('active')
print(li)
li.addClass('active')
print(li)

  

(2) attr and css

from pyquery import PyQuery as pq

html = '''
<div class="wrap">
    <div id="container">
        <ul class="list">
            <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>
</div>
'''
doc = pq(html)
li = doc('.item-0.active')
print(li)
# With two arguments, attr() sets an attribute
li.attr('name', 'link')
print(li)
# css() writes the property into the element's style attribute
li.css('font-size', '14px')
print(li)

  

(3) remove

from pyquery import PyQuery as pq

html = '''
<div class="wrap">
    <div id="container">
        <ul class="list">
            <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>
</div>
'''
doc = pq(html)
wrap = doc('.wrap')
print(wrap.text())
# remove() deletes the matched elements from the document.
# Note: this sample markup contains no <p> element, so nothing is
# removed here and both prints show the same text.
wrap.find('p').remove()
print(wrap.text())

  

For other DOM methods, see the API reference:

http://pyquery.readthedocs.io/en/latest/api.html

 

7. Pseudo-class selectors

from pyquery import PyQuery as pq

html = '''
<div class="wrap">
    <div id="container">
        <ul class="list">
            <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>
</div>
'''
doc = pq(html)
li = doc('li:first-child')       # the first <li> of its parent
print(li)
li = doc('li:last-child')        # the last <li> of its parent
print(li)
li = doc('li:nth-child(2)')      # nth-child counts from 1: the second <li>
print(li)
li = doc('li:gt(2)')             # :gt counts from 0: <li> elements with index greater than 2
print(li)
li = doc('li:nth-child(2n)')     # <li> elements at even positions
print(li)
li = doc('li:contains(second)')  # <li> elements whose text contains "second"
print(li)

  

For more CSS selectors, see http://www.w3school.com.cn/css/index.html

PyQuery documentation: http://pyquery.readthedocs.io

jQuery API reference (in Chinese): http://jquery.cuishifeng.cn/

 
