Python Web Crawler: PyQuery Basic Usage Tutorial


Preface

The pyquery library is a Python implementation of jQuery: it lets you parse HTML documents with jQuery syntax, and it is easy and fast to use. Like BeautifulSoup, it is a parsing library. Compared with BeautifulSoup's thorough and informative documentation, PyQuery's documentation is weaker, but the library is still perfectly usable, and in some places it is more convenient and concise.

Install

For installation of PyQuery, refer to this article: http://www.bkjia.com/article/82955.htm
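Typically it can also be installed directly from PyPI:

pip install pyquery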

PyQuery library official documentation

  • Initialize to a PyQuery object
  • Common CSS Selectors
  • Pseudo-class selector
  • Search for tags
  • Obtain Tag Information

1. Initialize to a PyQuery Object

# the HTML used in the examples below; reconstructed from the outputs shown later in this article
html = '''<body>
 <ul id="container">
  <li class="object-1"></li>
  <li class="object-2"></li>
  <li class="object-3"></li>
 </ul>
</body>'''

This is equivalent to the first step when using the BeautifulSoup library, which converts the HTML into a BeautifulSoup object.

bsObj = BeautifulSoup(html, 'html.parser')

The PyQuery library must also have its own initialization.

1.1 Initialize from a string

from pyquery import PyQuery as pq

# initialize to a PyQuery object
doc = pq(html)
print(type(doc))
print(doc)

Return

<class 'pyquery.pyquery.PyQuery'>

1.2 Initialize from an HTML file

# the filename parameter is the path of the html file
test_html = pq(filename='test.html')
print(type(test_html))
print(test_html)

Return

<class 'pyquery.pyquery.PyQuery'>

1.3 Initialize from a URL response

response = pq(url='https://www.baidu.com')
print(type(response))
print(response)

Return

<class 'pyquery.pyquery.PyQuery'>

2. Common CSS Selectors

Print the tag whose id is container:

print(doc('#container'))
print(type(doc('#container')))

Return

<ul id="container"> <li class="object-1"/> <li class="object-2"/> <li class="object-3"/></ul><class 'pyquery.pyquery.PyQuery'>

Print the tag whose class is object-1:

print(doc('.object-1'))

Return

<li class="object-1"/>

Print the tag named body

print(doc('body'))

Return

<body> <ul id="container"> <li class="object-1"/> <li class="object-2"/> <li class="object-3"/> </ul></body>

Multiple CSS selectors:

print(doc('html #container'))

Return

<ul id="container"> <li class="object-1"/> <li class="object-2"/> <li class="object-3"/></ul>

3. Pseudo-class Selectors

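The examples in this section are run against a second document, pseudo_doc. The original snippet is not shown in the article, so the sketch below is only a rough reconstruction from the outputs that follow (the original list items appear to have contained Chinese text) and is meant just to make the examples runnable:

from pyquery import PyQuery as pq

# hypothetical reconstruction of the document used in this section
pseudo_html = '''
<ul id="container">
 <li class="object-1">Python</li>
 <li class="object-2">syntax</li>
 <li class="object-3">good</li>
 <li class="object-4">good</li>
 <li class="object-5">hao</li>
 <li class="object-6">fun</li>
</ul>
'''
pseudo_doc = pq(pseudo_html)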
nth-child, first-child, last-child

# print the second li tag
print(pseudo_doc('li:nth-child(2)'))
# print the first li tag
print(pseudo_doc('li:first-child'))
# print the last li tag
print(pseudo_doc('li:last-child'))

Return

<Li class = "object-2"> syntax </li> <li class = "object-1"> Python </li> <li class = "object-6"> fun </li>

Contains

# find the li tags whose text contains 'Python'
print(pseudo_doc("li:contains('Python')"))
# find the li tags whose text contains 'hao'
print(pseudo_doc("li:contains('hao')"))

Return

<Li class = "object-1"> Python </li> <li class = "object-3"> good </li> <li class = "object-4"> good </li> <li class = "object-6"> Fun </li>

4. Search for tags

Search the PyQuery object for tags that satisfy a condition, similar to the find method in BeautifulSoup.

4.1 find Method

Print the tag whose id is container:

print(doc.find('#container'))

Return

<ul id="container"> <li class="object-1"/> <li class="object-2"/> <li class="object-3"/></ul>print(doc.find('li'))

Return

<li class="object-1"/><li class="object-2"/><li class="object-3"/>

4.2 Child Tags: children Method

# find the tag whose id is container
container = doc.find('#container')
print(container.children())

Return

<li class="object-1"/><li class="object-2"/><li class="object-3"/>

4.3 Parent Tag: parent Method

object_2 = doc.find('.object-2')
print(object_2.parent())

Return

<ul id="container"> <li class="object-1"/> <li class="object-2"/> <li class="object-3"/></ul>

4.4 Sibling Tags: siblings Method

object_2 = doc.find('.object-2')
print(object_2.siblings())

Return

<li class="object-1"/><li class="object-3"/>

5. Obtain Tag Information

After locating the target tag, we usually want the text or the attribute values inside it, so the next step is to extract them.

5.1 Extracting Tag Attribute Values

.attr() takes an attribute name and returns that attribute's value.

object_2 = doc.find('.object-2')
print(object_2.attr('class'))

Return

object-2

5.2 Extracting the Text in a Tag

.text() returns the text inside the tag.

# the HTML used in this example; reconstructed from the outputs shown below
html_text = '''
<html>Simple and Easy to use PyQuery
 <ul id="container">
  Hello World!
  <li class="object-2">Good</li>
  <li class="object-1">Python</li>
  <li class="object-3">syntax</li>
 </ul>
</html>
'''
docs = pq(html_text)
print(docs.text())

Return

Simple and Easy to use PyQuery Hello World! Good Python syntax
object_1 = docs.find('.object-1')
print(object_1.text())
container = docs.find('#container')
print(container.text())

Return

Python
Hello World! Good Python syntax

Tips: if you only want "Hello World!" and none of the other text, first strip the li tags with the remove method and then call the text method.

container = docs.find('#container')
container.remove('li')
print(container.text())

Return

Hello World!

PyQuery Custom Usage

Access URL

Unlike BeautifulSoup, PyQuery can initiate a request to a website itself. For example:

from pyquery import PyQuery
PyQuery(url='https://www.baidu.com')

The opener Parameter

The call above makes PyQuery request Baidu's website and turn the returned response into a PyQuery object. By default, the pyquery library uses urllib to issue the request; if you want to use selenium or the requests library instead, you can customize PyQuery's opener parameter.

The opener parameter specifies which request mechanism pyquery uses to fetch the website. Common request libraries include urllib, requests, and selenium. Here we define a selenium-based opener.

from pyquery import PyQuery
from selenium.webdriver import PhantomJS

# use selenium to fetch the url
def selenium_opener(url):
    # PhantomJS is not in the environment variables here, so its path has to be passed in every time
    driver = PhantomJS(executable_path='phantomjs path')
    driver.get(url)
    html = driver.page_source
    driver.quit()
    return html

# note: the opener argument is the function name itself, without parentheses!
PyQuery(url='https://www.baidu.com/', opener=selenium_opener)

We can now operate on this PyQuery object to extract useful information, exactly as in the earlier sections. For more functionality, note that the pyquery documentation is not very detailed, but it largely mirrors jQuery, so consulting the jQuery documentation helps if you want to use pyquery well.
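For instance, a minimal sketch that reuses the selenium_opener defined above (the title selector here is just an illustration, not part of the original example):

doc = PyQuery(url='https://www.baidu.com/', opener=selenium_opener)
# query the returned PyQuery object with an ordinary CSS selector
print(doc('title').text())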

Cookies, headers

When using requests, we generally disguise the crawler as a browser so the visit looks more genuine to the website: headers are usually passed in, and cookies as well when necessary. The pyquery library supports this too and can likewise pretend to be a browser.

from pyquery import PyQuery

cookies = {'cookies': 'your cookies'}
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}
PyQuery(url='https://www.baidu.com/', headers=headers, cookies=cookies)

Adding the PyQuery Function to Your selenium Driver

Here the web page fetched by the driver is converted directly into a PyQuery object, which makes data extraction easier.

from pyquery import PyQuery
from selenium.webdriver import PhantomJS

class Browser(PhantomJS):
    # @property is a decorator: it turns the method that follows it into a
    # read-only attribute, so browser.dom behaves like the dom attribute of browser
    @property
    def dom(self):
        return PyQuery(self.page_source)

browser = Browser(executable_path='phantomjs path')
browser.get(url='https://www.baidu.com/')
print(type(browser.dom))

Return

<class 'pyquery.pyquery.PyQuery'>

Summary

That is all the content of this article. I hope it provides some reference and learning value for your study or work. If you have any questions, please leave a message. Thank you for your support.
