Parsing HTML in Python: Introduction to and Use of the pyquery Library

Last Update:2018-04-12 Source: Internet

Author: User

Most of this article is reproduced from https://www.jianshu.com/p/c07f7cd1b548

First, here is my own code that parses the TechWeb site and prints the image URLs:

import requests
from pyquery import PyQuery as pq

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 '
                         '(KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36'}

def get_info(url):
    # Download the page; verify=False skips TLS certificate checks
    html = requests.get(url, headers=headers, verify=False)
    d = pq(html.content)
    # Narrow the selection to the list container, then to each picture block
    doc = d("div").filter(".list_con")
    doc = doc("div").filter(".picture_text")
    for tr in doc.items():
        temp = tr.find("img")
        print(temp.attr("src"))

if __name__ == "__main__":
    get_info("http://mi.techweb.com.cn/")

Preface

Python has many web-crawling libraries, each with its own strengths. Anyone who knows front-end development knows that jQuery can precisely locate and manipulate objects in the DOM tree through selectors, so I thought it would be great to crawl web pages with jQuery-style selectors.

So I went looking for a DOM-related library in Python, and I actually found one: pyquery!

Introduction to pyquery

pyquery is roughly a Python implementation of jQuery that can be used to parse HTML pages. Its syntax is almost identical to jQuery's, so it will feel familiar to anyone who has used jQuery.

To quote the author's exact words:

"The API is as much as possible the similar to jquery."

Installation

You can install pyquery with pip or easy_install.
Note: pyquery depends on lxml, so the installation may fail unless lxml is installed first.

Install lxml: https://pypi.python.org/pypi/lxml/2.3/ (downloading the installation package directly is recommended as convenient and quick);
Install pyquery: easy_install pyquery or pip install pyquery;
Verify: run import pyquery in a Python interpreter; if no error is raised, the installation succeeded.

Initialization

There are four ways to initialize pyquery: pass in a string, an lxml object, a URL, or a file.

from pyquery import PyQuery as pq
from lxml import etree

d = pq("<html></html>")  # pass in a string
d = pq(etree.fromstring("<html></html>"))  # pass in an lxml object
d = pq(url='http://google.com/')  # pass in a URL
d = pq(filename=path_to_html_file)  # pass in a file

Now d works like the $ in jQuery.

Example

A simple example is a quick way to get familiar with pyquery. The file example.html, which is passed in below, contains the following:

<div>
    <tr class="item-0">
        <td>first section</td>
        <td>1111</td>
        <td>17-01-28 22:51</td>
    </tr>
    <tr class="item-1">
        <td>second section</td>
        <td>2222</td>
        <td>17-01-28 22:53</td>
    </tr>
</div>

Python program:

# -*- coding: utf-8 -*-
from pyquery import PyQuery as pq  # import pyquery

doc = pq(filename='example.html')  # pass in the file example.html
print(doc.html())  # html() returns the HTML of the current selection
print(doc('.item-1'))  # class selector: select the HTML block with class item-1
data = doc('tr')  # select the <tr> elements
for tr in data.items():  # iterate over the <tr> elements in data
    temp = tr('td').eq(2).text()  # text of the third <td> element
    print(temp)

Output:

# print(doc.html())
<tr class="item-0"><td>first section</td><td>1111</td><td>17-01-28 22:51</td></tr>
<tr class="item-1"><td>second section</td><td>2222</td><td>17-01-28 22:53</td></tr>

# print(doc('.item-1'))
<tr class="item-1"><td>second section</td><td>2222</td><td>17-01-28 22:53</td></tr>

# print(tr('td').eq(2).text())
17-01-28 22:51
# print(tr('td').eq(2).text())
17-01-28 22:53

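Operations

1. .html() and .text(): get the corresponding HTML block or text content.

from pyquery import PyQuery as pq
p = pq("<head><title>Hello World!</title></head>")
print(p('head').html())  # get the corresponding HTML block
print(p('head').text())  # get the corresponding text content
'''Output:
<title>Hello World!</title>
Hello World!
'''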

2. .('selector'): get the target content by selector.

from pyquery import PyQuery as pq
d = pq("<div><p id='item-0'>test 1</p><p class='item-1'>test 2</p></div>")
print(d('div').html())  # HTML block inside the <div> element
print(d('#item-0').text())  # text of the element with id item-0
print(d('.item-1').text())  # text of the element with class item-1
'''Output:
<p id="item-0">test 1</p><p class="item-1">test 2</p>
test 1
test 2
'''

3. .eq(index): get the element at the given index (indexes start from 0).

from pyquery import PyQuery as pq
d = pq("<div><p id='item-0'>test 1</p><p class='item-1'>test 2</p></div>")
print(d('p').eq(1).text())  # text of the second <p> element
'''Output:
test 2
'''

4. .find(): find nested elements.

from pyquery import PyQuery as pq
d = pq("<div><p id='item-0'>test 1</p><p class='item-1'>test 2</p></div>")
print(d('div').find('p'))  # find the <p> elements inside the <div>
print(d('div').find('p').eq(0))  # find the <p> elements inside the <div> and print the first one
'''Output:
<p id="item-0">test 1</p><p class="item-1">test 2</p>
<p id="item-0">test 1</p>
'''

5. .filter(): filter the selected elements by class or id.

from pyquery import PyQuery as pq
d = pq("<div><p id='item-0'>test 1</p><p class='item-1'>test 2</p></div>")
print(d('p').filter('.item-1'))  # keep the <p> element with class item-1
print(d('p').filter('#item-0'))  # keep the <p> element with id item-0
'''Output:
<p class="item-1">test 2</p>
<p id="item-0">test 1</p>
'''

6. .attr(): get or modify an attribute value.

from pyquery import PyQuery as pq
d = pq("<div><p id='item-0'>test 1</p><a class='item-1'>test 2</a></div>")
print(d('p').attr('id'))  # get the id attribute of the <p> tag
print(d('a').attr('class', 'new'))  # change the class attribute of the <a> tag to new

7. Other operations:
.addClass(value): adds a class;
.hasClass(name): checks whether the element has the given class, returning True or False;
.children(): gets the child elements;
.parents(): gets the ancestor elements;
.next(): gets the next sibling element;
.nextAll(): gets all the sibling elements that follow;
.not_('selector'): gets all the elements that do not match the selector;
for i in d.items('li'): print(i.text()): iterates over the li elements in d.

Conclusion

The operations above are basically enough for everyday crawling of small amounts of data. pyquery offers many more operations than are covered here, so if you need to know more, check the official documentation.

The official documentation is in English, but it is fairly easy to read and understand. I also found a Chinese-language tutorial site, which is listed here as well.

Official documentation: https://pythonhosted.org/pyquery/index.html#
Chinese tutorial: http://www.geoinformatics.cn/lab/pyquery/

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
