Learning pyquery for Python Crawlers

Source: Internet
Author: User

Contents:
    • Introduction to pyquery
    • Using pyquery
      • Installing the module
      • Importing the module
      • Initializing the parse object
      • CSS selectors
      • Selecting elements relative to the current selection
      • Getting an element's text, attributes, etc.
    • DOM operations and CSS operations with pyquery
      • DOM operations
      • CSS operations
    • Example: crawling Douban's new books with pyquery

Starting Time: 2018-03-09 21:26

Introduction to pyquery
    • pyquery lets you run jQuery-style queries on XML and HTML documents.
    • pyquery uses lxml for fast XML and HTML manipulation.
    • pyquery is essentially jQuery for Python.

Using pyquery

1. Install the module:

    pip3 install pyquery

2. Import the module:

    from pyquery import PyQuery as pq
3. Initialize the parse object:

"To create a parse object, call PyQuery, which is a class, and pass the thing to be parsed as an argument."

    • String initialization, when the input is a string: a plain argument is treated as markup by default, but a string that starts with http:// or https:// is treated as a URL.
      textparse = pq(html)
    • URL initialization, when the input is a web page: the keyword argument url= is recommended.
      # urlparse = pq('http://www.baidu.com')  #1
      urlparse = pq(url='http://www.baidu.com')  #2
    • File initialization, when the input is a file: the keyword argument filename= is recommended.
      fileparse = pq(filename=r"L:\demo.html")
    • Once parsed, the object can be filtered with CSS selectors and with the related methods.

4. CSS selectors:
  • By tag name:
    result = textparse('h2').text()
  • By class:
    result3 = textparse(".p1").text()
  • By ID:
    result4 = textparse("#user").attr("type")
  • Group selection:
    result5 = textparse("p,div").text()
  • Descendant selector:
    result6 = textparse("div a").attr.href
  • Attribute selector:
    result7 = textparse("[class='p1']").text()
  • Pseudo-class selector (:last is a jQuery-style pseudo-class rather than standard CSS3):
    result8 = textparse("p:last").text()

(For more selectors, refer to the CSS documentation.)

5. Select elements relative to the current selection:
    • find(): finds matching descendants of the selected elements. Its argument can be any jQuery selector.
    • filter(): narrows the current selection to the elements that match. Its argument can be any jQuery selector.
    • children(): gets the direct child elements. It optionally takes any jQuery selector as an argument.
    • parent(): gets the parent element. It optionally takes any jQuery selector as an argument.
    • parents(): gets the ancestor elements. It optionally takes any jQuery selector as an argument.
    • siblings(): gets the sibling elements. It optionally takes any jQuery selector as an argument.
    from pyquery import PyQuery as pq

    html = """"""  # HTML to parse goes here

    # initialization
    textparse = pq(html)
    # urlparse = pq('http://www.baidu.com')  #1
    # urlparse = pq(url='http://www.baidu.com')  #2
    # fileparse = pq(filename=r"L:\demo.html")

    # selection
    result = textparse('h2').text()
    print(result)
    result2 = textparse('div').html()
    print(result2)
    result3 = textparse(".p1").text()
    print(result3)
    result4 = textparse("#user").attr("type")
    print(result4)
    result5 = textparse("p,div").text()
    print(result5)
    result6 = textparse("div a").attr.href
    print(result6)
    result7 = textparse("[class='p1']").text()
    print(result7)
    result8 = textparse("p:last").text()
    print(result8)
    result9 = textparse("div").find("a").text()
    print(result9)
    result12 = textparse("p").filter(".p1").text()
    print(result12)
    result10 = textparse("div").children()
    print(result10)
    result11 = textparse("a").parent()
    print(result11)

6. Getting an element's text, attributes, etc.:

attr(name): gets the attribute with the given name

    result2 = textparse("a").attr("href")

attr.xxxx: gets the attribute xxxx

    result21 = textparse("a").attr.href
    result22 = textparse("a").attr.class_
    result23 = textparse("a").attr.id
    result24 = textparse("a").attr.value

text(): gets the text; for an element with children, only the text is returned, without the tags

    result1 = textparse("a").text()

html(): gets the inner HTML; similar to text(), but the HTML tags are kept

    result3 = textparse("div").html()

Supplement 1:

Iterating over elements: if the result contains multiple elements, use items() to iterate over them one by one.

Supplement 2: pyquery is jQuery for Python and the syntax is largely the same, so for more details you can refer to the jQuery documentation.

DOM operations and CSS operations with pyquery

DOM operations:

add_class(): adds a class

remove_class(): removes a class

remove(): removes the specified elements

    from pyquery import PyQuery as pq

    html = """"""  # HTML to parse goes here
    textparse = pq(html)
    textparse('a').add_class("c1")
    print(textparse('a').attr("class"))
    textparse('a').remove_class("c1")
    print(textparse('a').attr("class"))
    print(textparse('div').html())
    textparse('div').remove("a")
    print(textparse('div').html())

CSS operations:
    • attr(): sets an attribute
      • Format: attr("attribute name", "attribute value")
    • css(): sets CSS styles
      • Format 1: css("style name", "style value")
      • Format 2: css({"style 1": "value 1", "style 2": "value 2"})

    from pyquery import PyQuery as pq

    html = """"""  # HTML to parse goes here
    textparse = pq(html)
    textparse('a').attr("name", "hehe")
    print(textparse('a').attr("name"))
    textparse('a').css("color", "white")
    textparse('a').css({"background-color": "black", "position": "fixed"})
    print(textparse('a').attr("style"))

When to use these operations:

"Sometimes you want to change how the data is styled before storing it for later use. For example, if I am not satisfied with the styling of the data I fetched, I can rewrite it into my own format."

"Sometimes you need to clean up the markup before extracting the result you want. For example, given <div>123<a></a></div>, if you only want 123, you can remove the <a> first and then get the text."

Example: crawling Douban's new books with pyquery

First, use the browser's Inspect Element tool to locate the target element.

Confirm the information to crawl.

Note that the new books on Douban are split across several carousel pages, so the target should actually be the ul one level above the li elements.

Use pyquery to filter out the results:

    from pyquery import PyQuery as pq

    urlparse = pq(url="https://book.douban.com/")
    info = urlparse("div.carousel ul li div.info")
    file = open("demo.txt", "w", encoding="utf8")
    for i in info.items():
        title = i.find("div.title")
        author = i.find("span.author")
        abstract = i.find(".abstract")
        file.write("Title:" + title.text() + "\n")
        file.write("Author:" + author.text() + "\n")
        file.write("Overview:" + abstract.text() + "\n")
        file.write("-----------------\n")
        print("\n")
    file.close()
