Contents:
- Introduction to PyQuery
- Using PyQuery
  - Installing the module
  - Importing the module
  - Initializing the parse object
  - CSS selectors
  - Selecting elements relative to the current selection
  - Getting an element's text, attributes, etc.
- DOM and CSS operations with PyQuery
  - DOM operations
  - CSS operations
- An example: crawling Douban's new books with PyQuery
Starting Time: 2018-03-09 21:26
Introduction to PyQuery
- PyQuery allows jQuery-style queries on XML and HTML documents.
- PyQuery uses lxml for fast XML and HTML manipulation.
- PyQuery is, in effect, jQuery for Python.
Using PyQuery
1. Install the module:
pip3 install pyquery
2. Import the module:
from pyquery import PyQuery as pq
3. Initialize the parse object:
Create a parse object by calling PyQuery. PyQuery is a class, and the content to be parsed is passed to it directly as an argument. That content can be an HTML string, a URL (url=...), or a local file (filename=...).
4. CSS selectors:
- Select by tag:
result = textParse('h2').text()
- Select by class:
result3 = textParse(".p1").text()
- Select by id:
result4 = textParse("#user").attr("type")
- Group selector:
result5 = textParse("p,div").text()
- Descendant selector:
result6 = textParse("div a").attr.href
- Attribute selector:
result7 = textParse("[class='p1']").text()
- CSS3 pseudo-class selector:
result8 = textParse("p:last").text()
(For more selectors, refer to the CSS documentation.)
5. Selecting elements relative to the current selection:
- find(): finds descendant elements that match the selector; the argument can be any jQuery selector.
- filter(): narrows the current selection to the elements that match the selector; the argument can be any jQuery selector.
- children(): gets all direct child elements; an optional jQuery selector restricts the result.
- parent(): gets the parent element; an optional jQuery selector restricts the result.
- parents(): gets the ancestor elements; an optional jQuery selector restricts the result.
- siblings(): gets the sibling elements; an optional jQuery selector restricts the result.
from pyquery import PyQuery as pq

html = """"""  # HTML document to parse

# Initialization
textParse = pq(html)
# urlParse = pq('http://www.baidu.com')       # 1
# urlParse = pq(url='http://www.baidu.com')   # 2
# fileParse = pq(filename=r"L:\demo.html")

# Retrieval
result = textParse('h2').text()
print(result)
result2 = textParse('div').html()
print(result2)
result3 = textParse(".p1").text()
print(result3)
result4 = textParse("#user").attr("type")
print(result4)
result5 = textParse("p,div").text()
print(result5)
result6 = textParse("div a").attr.href
print(result6)
result7 = textParse("[class='p1']").text()
print(result7)
result8 = textParse("p:last").text()
print(result8)
result9 = textParse("div").find("a").text()
print(result9)
result12 = textParse("p").filter(".p1").text()
print(result12)
result10 = textParse("div").children()
print(result10)
result11 = textParse("a").parent()
print(result11)
6. Getting an element's text, attributes, etc.:
- attr(attribute): gets an attribute
result2 = textParse("a").attr("href")
- attr.xxxx: gets the attribute xxxx
result21 = textParse("a").attr.href
result22 = textParse("a").attr.class_
result23 = textParse("a").attr.id
result24 = textParse("a").attr.value
- text(): gets the text content; text inside child elements is included, but their tags are not
result1 = textParse("a").text()
- html(): gets the inner HTML; works like text(), but the result keeps the HTML tags
result3 = textParse("div").html()
Supplement 1:
Iterating over elements: if a query returns several elements and you want to process each one, use items(), which yields each element as its own PyQuery object.
Supplement 2: PyQuery is a Python port of jQuery and the syntax is largely the same. To learn more, refer to the jQuery documentation.
DOM and CSS operations with PyQuery
DOM operations:
- add_class(): adds a class
- remove_class(): removes a class
- remove(): deletes the specified elements
from pyquery import PyQuery as pq

html = """"""  # HTML document to parse
textParse = pq(html)
textParse('a').add_class("c1")
print(textParse('a').attr("class"))
textParse('a').remove_class("c1")
print(textParse('a').attr("class"))
print(textParse('div').html())
textParse('div').remove("a")
print(textParse('div').html())
CSS operations:
- attr(): sets an attribute
- Format: attr("attribute name", "attribute value")
- css(): sets CSS styles
- Format 1: css("style name", "style value")
- Format 2: css({"style 1": "value 1", "style 2": "value 2"})
from pyquery import PyQuery as pq

html = """"""  # HTML document to parse
textParse = pq(html)
textParse('a').attr("name", "hehe")
print(textParse('a').attr("name"))
textParse('a').css("color", "white")
textParse('a').css({"background-color": "black", "position": "fixed"})
print(textParse('a').attr("style"))
When these operations are useful:
Sometimes the data should be restyled before it is stored or used. For example, if I am not satisfied with the styling of the data I fetched, I can change it to my own format.
Sometimes the document needs cleaning before the desired result can be selected. For example, given <div>123<a></a></div>, if you only want 123, you can remove the <a> first and then read the text.
An example: crawling Douban's new books with PyQuery
First use the browser's Inspect Element tool to locate the target element.
Confirm the information to crawl.
Note that Douban's new books are spread over several carousel pages, so the real target is the ul one level above the li elements:
Use PyQuery to filter out the results:
from pyquery import PyQuery as pq

urlParse = pq(url="https://book.douban.com/")
info = urlParse("div.carousel ul li div.info")
file = open("demo.txt", "w", encoding="utf8")
for i in info.items():
    title = i.find("div.title")
    author = i.find("span.author")
    abstract = i.find(".abstract")
    file.write("Title:" + title.text() + "\n")
    file.write("Author:" + author.text() + "\n")
    file.write("Overview:" + abstract.text() + "\n")
    file.write("-----------------\n")
    print("\n")
file.close()