Parsing HTML with Python: an introduction to the pyquery library and its use

Source: Internet
Author: User

Most of this article is reproduced from https://www.jianshu.com/p/c07f7cd1b548

First, here is my own code for extracting the image URLs from a TechWeb page:

    import requests
    from pyquery import PyQuery as pq

    headers = {'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 '
                             '(KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36'}

    def get_info(url):
        # Download the page and parse the response body with pyquery
        html = requests.get(url, headers=headers, verify=False)
        d = pq(html.content)
        # Narrow the selection down to the blocks that hold the pictures
        doc = d("div").filter(".list_con")
        doc = doc("div").filter(".picture_text")
        for tr in doc.items():
            temp = tr.find("img")
            print(temp.attr("src"))

    if __name__ == "__main__":
        get_info("http://mi.techweb.com.cn/")

Preface

Python has a large number of crawler libraries, each with its own strengths. Anyone who knows front-end development also knows that jQuery can precisely locate and manipulate objects in the DOM tree through selectors, so I thought it would be great to crawl web pages the jQuery way.

So I looked for DOM-related libraries in Python, and found one: pyquery!

Introduction to pyquery

pyquery is roughly a Python implementation of jQuery that can be used to parse HTML pages. Its syntax is almost identical to jQuery's, so it feels familiar to anyone who has used jQuery.

In the pyquery author's own words:

"The API is as much as possible the similar to jquery."

Installation

You can install pyquery with pip or easy_install.
Note: pyquery depends on lxml, so the installation may fail if lxml is not installed first.

    1. Install lxml: https://pypi.python.org/pypi/lxml/2.3/ (downloading the installation package directly is recommended; it is convenient and quick);
    2. Install pyquery: easy_install pyquery or pip install pyquery;
    3. Verify: run import pyquery; if no error is reported, the installation succeeded.

Initialization

There are four ways to initialize pyquery: you can pass in a string, an lxml element, a file, or a URL.

    from pyquery import PyQuery as pq
    from lxml import etree

    d = pq("<html></html>")                    # pass in a string
    d = pq(etree.fromstring("<html></html>"))  # pass in an lxml element
    d = pq(url='http://google.com/')           # pass in a URL
    d = pq(filename=path_to_html_file)         # pass in a file

Now d is like the $ in jQuery.

Example

A simple example is a quick way to get familiar with pyquery. Pass in the file example.html, which has the following content:

    <div>
        <tr class="item-0">
            <td>first section</td>
            <td>1111</td>
            <td>17-01-28 22:51</td>
        </tr>
        <tr class="item-1">
            <td>second section</td>
            <td>2222</td>
            <td>17-01-28 22:53</td>
        </tr>
    </div>

Python program:

    # -*- coding: utf-8 -*-
    from pyquery import PyQuery as pq   # import PyQuery

    doc = pq(filename='example.html')   # pass in the file example.html
    print(doc.html())                   # html() returns the HTML of the current selection
    print(doc('.item-1'))               # class selector: the HTML block with class item-1
    data = doc('tr')                    # select the <tr> elements

    for tr in data.items():             # iterate over the <tr> elements in data
        temp = tr('td').eq(2).text()    # text of the 3rd <td> element
        print(temp)

Output:

    # print(doc.html())
    <tr class="item-0"><td>first section</td><td>1111</td><td>17-01-28 22:51</td></tr><tr class="item-1"><td>second section</td><td>2222</td><td>17-01-28 22:53</td></tr>
    # print(doc('.item-1'))
    <tr class="item-1"><td>second section</td><td>2222</td><td>17-01-28 22:53</td></tr>
    # print(tr('td').eq(2).text())
    17-01-28 22:51
    # print(tr('td').eq(2).text())
    17-01-28 22:53

Operations

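1. .html() and .text(): get the corresponding HTML block or text content.

    p = pq("<head><title>Hello World!</title></head>")
    print(p('head').html())   # get the corresponding HTML block
    print(p('head').text())   # get the corresponding text content
    '''Output:
    <title>Hello World!</title>
    Hello World!
    '''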

2. d('selector'): get the target content with a selector.

    d = pq("<div><p id='item-0'>test 1</p><p class='item-1'>test 2</p></div>")
    print(d('div').html())       # HTML block inside the <div> element
    print(d('#item-0').text())   # text of the element whose id is item-0
    print(d('.item-1').text())   # text of the element whose class is item-1
    '''Output:
    <p id="item-0">test 1</p><p class="item-1">test 2</p>
    test 1
    test 2
    '''

3. .eq(index): get the element at the given index (indexes start at 0).

    d = pq("<div><p id='item-0'>test 1</p><p class='item-1'>test 2</p></div>")
    print(d('p').eq(1).text())   # text of the second p element
    '''Output:
    test 2
    '''

4. .find(): find nested elements.

    d = pq("<div><p id='item-0'>test 1</p><p class='item-1'>test 2</p></div>")
    print(d('div').find('p'))         # the p elements inside <div>
    print(d('div').find('p').eq(0))   # the first p element inside <div>
    '''Output:
    <p id="item-0">test 1</p><p class="item-1">test 2</p>
    <p id="item-0">test 1</p>
    '''

5. .filter(): filter the selected elements by class or id.

    d = pq("<div><p id='item-0'>test 1</p><p class='item-1'>test 2</p></div>")
    print(d('p').filter('.item-1'))   # the p element whose class is item-1
    print(d('p').filter('#item-0'))   # the p element whose id is item-0
    '''Output:
    <p class="item-1">test 2</p>
    <p id="item-0">test 1</p>
    '''

6. .attr(): get or modify an attribute value.

    d = pq("<div><p id='item-0'>test 1</p><a class='item-1'>test 2</a></div>")
    print(d('p').attr('id'))             # get the id attribute of the <p> tag
    print(d('a').attr('class', 'new'))   # change the class attribute of the <a> tag to new

7. Other operations:
.addClass(value): adds a class;
.hasClass(name): checks whether the specified class is present and returns True or False;
.children(): gets the child elements;
.parents(): gets the ancestor elements;
.next(): gets the next element;
.nextAll(): gets all the sibling elements that follow;
.not_('selector'): gets all the elements that do not match the selector;
for i in d.items('li'): print(i.text()): iterates over the li elements in d.

Conclusion

The operations above are basically enough for everyday crawling of small amounts of data. pyquery can of course do much more than is covered here; if you need to know more, check the official documentation.

The official documentation is in English, but it is fairly easy to read and understand. I also found a Chinese-language tutorial site, which is linked here as well.

Official documentation: https://pythonhosted.org/pyquery/index.html#
Chinese tutorial: http://www.geoinformatics.cn/lab/pyquery/
