Python web crawler: PyQuery basic usage tutorial

Last Update:2018-02-07 Source: Internet

Author: User

Tags python web crawler


Preface

The pyquery library is a Python implementation of the jQuery API. It lets you parse HTML documents with jQuery syntax and, like BeautifulSoup, it is used for parsing. It is easy to pick up and fast. The pyquery documentation is thinner than BeautifulSoup's thorough documentation, but it is usable, and in some places pyquery is the more convenient and concise of the two.

Install

For installation of PyQuery, refer to this article: http://www.bkjia.com/article/82955.htm

PyQuery library official documentation

Initialize to a PyQuery object
Common CSS selectors
Pseudo-class selector
Search for tags
Obtain Tag Information

1. Initialize a PyQuery object

Html = "
This step corresponds to the first thing you do with the BeautifulSoup library: converting the HTML into a BeautifulSoup object.
bsObj = BeautifulSoup(html, 'html.parser')
PyQuery needs its own initialization step as well.
1.1 Initialize from a string
from pyquery import PyQuery as pq
# initialize a PyQuery object from the string
doc = pq(html)
print(type(doc))
print(doc)
Return
<class 'pyquery.pyquery.PyQuery'>
1.2 Initialize from an HTML file
# The filename parameter is the path of the html file
test_html = pq(filename='test.html')
print(type(test_html))
print(test_html)
Return
<class 'pyquery.pyquery.PyQuery'>
1.3 Initialize from a URL response
response = pq(url='https://www.baidu.com')
print(type(response))
print(response)
Return
<class 'pyquery.pyquery.PyQuery'>
2. Common CSS selectors
Print the tag whose id is container
print(doc('#container'))
print(type(doc('#container')))
Return
<ul id="container"> <li class="object-1"/> <li class="object-2"/> <li class="object-3"/></ul>
<class 'pyquery.pyquery.PyQuery'>
Print the tag whose class is object-1
print(doc('.object-1'))
Return
<li class="object-1"/>
Print the tag named body
print(doc('body'))
Return
<body> <ul id="container"> <li class="object-1"/> <li class="object-2"/> <li class="object-3"/> </ul></body>
Combining multiple CSS selectors
print(doc('html #container'))
Return
<ul id="container"> <li class="object-1"/> <li class="object-2"/> <li class="object-3"/></ul>
3. Pseudo-class selectors
nth-child, first-child, last-child
# print the second li tag
print(pseudo_doc('li:nth-child(2)'))
# print the first li tag
print(pseudo_doc('li:first-child'))
# print the last li tag
print(pseudo_doc('li:last-child'))
Return
<li class="object-2">syntax</li>
<li class="object-1">Python</li>
<li class="object-6">fun</li>
contains
# Find the li tags whose text contains Python
print(pseudo_doc("li:contains('Python')"))
# Find the li tags whose text contains hao
print(pseudo_doc("li:contains('hao')"))
Return
<li class="object-1">Python</li>
<li class="object-3">good</li>
<li class="object-4">good</li>
<li class="object-6">Fun</li>
4. Search for tags
Search the PyQuery object for tags that match a condition, similar to the find method in BeautifulSoup.
Print the tag with id=container
print(doc.find('#container'))
Return
<ul id="container"> <li class="object-1"/> <li class="object-2"/> <li class="object-3"/></ul>
print(doc.find('li'))
Return
<li class="object-1"/><li class="object-2"/><li class="object-3"/>
4.2 Child tags: the children method
# the tag with id=container
container = doc.find('#container')
print(container.children())
Return
<li class="object-1"/><li class="object-2"/><li class="object-3"/>
4.3 Parent tag: the parent method
object_2 = doc.find('.object-2')
print(object_2.parent())
Return
<ul id="container"> <li class="object-1"/> <li class="object-2"/> <li class="object-3"/></ul>
4.4 Sibling tags: the siblings method
object_2 = doc.find('.object-2')
print(object_2.siblings())
Return
<li class="object-1"/><li class="object-3"/>
5. Obtain tag information
After locating the target tag, the next step is to extract the text or the attribute values it contains.
5.1 Extracting a tag attribute value
.attr() takes the name of an attribute and returns its value.
object_2 = doc.find('.object-2')
print(object_2.attr('class'))
Return
object-2
5.2 text in the tag
.text()
Html_text = "
Return
Simple and Easy to use PyQuery Hello World! Good Python syntax
object_1 = docs.find('.object-1')
print(object_1.text())
container = docs.find('#container')
print(container.text())
Return
Python
Hello World! Good Python syntax
Tip: if I only want "Hello World" and none of the other text, I can first remove the li tags with the remove method and then call the text method.
container = docs.find('#container')
container.remove('li')
print(container.text())
Return
Hello World！
PyQuery custom usage
Access URL
Unlike BeautifulSoup, PyQuery can send the request to the website itself. For example:
from pyquery import PyQuery
PyQuery(url='https://www.baidu.com')
Opener parameter
The call above makes PyQuery request Baidu's website and turn the response into a PyQuery object. By default pyquery fetches the page with urllib (newer releases use requests when it is installed). If you want to fetch pages with selenium or another request library, you can customize PyQuery's opener parameter.
The opener parameter specifies the function pyquery uses to send the request to the website. It can be built on any common request library, such as urllib, requests, or selenium. Here we define a selenium opener.
from pyquery import PyQuery
from selenium.webdriver import PhantomJS

# Use selenium to access the url
def selenium_opener(url):
    # PhantomJS is not on my PATH, so its path has to be passed in every time
    driver = PhantomJS(executable_path='phantomjs path')
    driver.get(url)
    html = driver.page_source
    driver.quit()
    return html

# Note: the opener argument is the function name, without parentheses!
PyQuery(url='https://www.baidu.com/', opener=selenium_opener)
We can now operate on the PyQuery object to extract useful information, as described in the earlier sections. The pyquery documentation is not very detailed, but its functions largely match jQuery's, so to use pyquery well it helps to consult the jQuery documentation.
Cookies, headers
When using requests, we usually disguise the crawler as a browser so that the request looks more genuine to the website. This normally means passing in headers and, if necessary, cookies. pyquery supports the same parameters, so it can pretend to be a browser too.
from pyquery import PyQuery
cookies = {'cookies': 'Your cookies'}
headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}
PyQuery(url='https://www.baidu.com/', headers=headers, cookies=cookies)
Give selenium a pyquery feature
Convert the page that the driver has loaded directly into a PyQuery object, which makes extracting data easier.
from pyquery import PyQuery
from selenium.webdriver import PhantomJS

class Browser(PhantomJS):
    # @property is a decorator: it lets the method that follows be read
    # like an attribute of the class. Here browser.dom is the dom attribute of browser.
    @property
    def dom(self):
        return PyQuery(self.page_source)

browser = Browser(executable_path='phantomjs path')
browser.get(url='https://www.baidu.com/')
print(type(browser.dom))
Return
<class 'pyquery.pyquery.PyQuery'>
Summary
That is all for this article. I hope it is a useful reference for your study or work. If you have any questions, please leave a message. Thank you for your support.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
