Python extracts web page data using XSLT
lxml is a Python library for processing XML quickly and flexibly. It supports XML Path Language (XPath) and Extensible Stylesheet Language Transformations (XSLT), and implements the common ElementTree API.
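As a quick orientation, here is a minimal sketch of the lxml calls used in this post; the HTML fragment and variable names are made up purely for illustration:

from lxml import etree

# A tiny made-up HTML fragment, just to illustrate the API.
html = "<div id='forum'><table><tbody><tr><td class='topic'><a>Hello</a></td></tr></tbody></table></div>"
doc = etree.HTML(html)                               # build an ElementTree-style document from HTML
titles = doc.xpath("//*[@class='topic']/a/text()")   # XPath queries return lists of matches
print(titles)                                        # ['Hello']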
Over the past two days I tested extracting web page content in Python with XSLT. The notes are recorded below.
1. Goal: extract the post titles and reply counts from the old GooSeeker forum
2. Run the following code (tested under Windows 10, Python 3.2):
from urllib import request
from lxml import etree

# Fetch and parse the target forum page.
url = "http://www.gooseeker.com/cn/forum/7"
conn = request.urlopen(url)
doc = etree.HTML(conn.read())

# All extraction rules live in this XSLT stylesheet.
xslt_root = etree.XML("""\
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template match="/">
<list>
<xsl:apply-templates select="//*[@id='forum' and count(./table/tbody/tr[position()>=1 and count(.//*[@class='topic']/a/text())>0])>0]" mode="list"/>
</list>
</xsl:template>
<xsl:template match="table/tbody/tr[position()>=1]" mode="list">
<item>
<title>
<xsl:value-of select="*//*[@class='topic']/a/text()"/>
<xsl:value-of select="*[@class='topic']/a/text()"/>
<xsl:if test="@class='topic'"><xsl:value-of select="a/text()"/></xsl:if>
</title>
<replies>
<xsl:value-of select="*//*[@class='replies']/text()"/>
<xsl:value-of select="*[@class='replies']/text()"/>
<xsl:if test="@class='replies'"><xsl:value-of select="text()"/></xsl:if>
</replies>
</item>
</xsl:template>
<xsl:template match="//*[@id='forum' and count(./table/tbody/tr[position()>=1 and count(.//*[@class='topic']/a/text())>0])>0]" mode="list">
<item>
<list>
<xsl:apply-templates select="table/tbody/tr[position()>=1]" mode="list"/>
</list>
</item>
</xsl:template>
</xsl:stylesheet>""")

# Apply the stylesheet to the parsed page and print the result.
transform = etree.XSLT(xslt_root)
result_tree = transform(doc)
print(result_tree)
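For comparison, and continuing from the code above, the same data could also be pulled with plain XPath calls instead of XSLT; the selectors (id 'forum', classes 'topic' and 'replies') come from the stylesheet above, while the loop itself is only an illustrative sketch, not the author's code:

# Sketch of an XPath-only version of the same extraction (illustration only).
for row in doc.xpath("//*[@id='forum']//table/tbody/tr"):
    title = row.xpath(".//*[@class='topic']/a/text()")
    replies = row.xpath(".//*[@class='replies']/text()")
    if title:
        print(title[0], replies[0] if replies else "")

The XSLT version keeps all of these rules in one declarative string, which is what makes swapping in generated rules straightforward later on.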
3. The extraction results
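The transform output is itself an lxml tree, so it can be queried again instead of only printed. A minimal sketch, continuing from the code above, with element names (item, title, replies) taken from the stylesheet:

# Iterate the transformed result rather than printing it wholesale (sketch only).
for item in result_tree.xpath("//item[title]"):
    print(item.findtext("title"), item.findtext("replies"))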
4. Summary
With GooSeeker's visual extraction rule generator (MS), producing the extractor becomes very convenient, and the generated rules can be inserted in a standardized way, which makes a generic crawler possible.
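As a sketch of that idea, the Python side can stay fixed while the extraction rules are supplied as a string; the helper name and signature below are my own assumption, not part of the original post:

from urllib import request
from lxml import etree

# Sketch of a generic extractor: only the URL and the XSLT rules vary.
def extract(url, xslt_text):
    doc = etree.HTML(request.urlopen(url).read())
    transform = etree.XSLT(etree.XML(xslt_text))
    return transform(doc)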
Update. Two features were added: 1. paging through the forum; 2. writing the crawl results to files.
The updated code is as follows:
from urllib import request
from lxml import etree
import time

# The same extraction rules as in step 2.
xslt_root = etree.XML("""\
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template match="/">
<list>
<xsl:apply-templates select="//*[@id='forum' and count(./table/tbody/tr[position()>=1 and count(.//*[@class='topic']/a/text())>0])>0]" mode="list"/>
</list>
</xsl:template>
<xsl:template match="table/tbody/tr[position()>=1]" mode="list">
<item>
<title>
<xsl:value-of select="*//*[@class='topic']/a/text()"/>
<xsl:value-of select="*[@class='topic']/a/text()"/>
<xsl:if test="@class='topic'"><xsl:value-of select="a/text()"/></xsl:if>
</title>
<replies>
<xsl:value-of select="*//*[@class='replies']/text()"/>
<xsl:value-of select="*[@class='replies']/text()"/>
<xsl:if test="@class='replies'"><xsl:value-of select="text()"/></xsl:if>
</replies>
</item>
</xsl:template>
<xsl:template match="//*[@id='forum' and count(./table/tbody/tr[position()>=1 and count(.//*[@class='topic']/a/text())>0])>0]" mode="list">
<item>
<list>
<xsl:apply-templates select="table/tbody/tr[position()>=1]" mode="list"/>
</list>
</item>
</xsl:template>
</xsl:stylesheet>""")

baseurl = "http://www.gooseeker.com/cn/forum/7"
basefilebegin = "jsk_bbs_"
basefileend = ".xml"
count = 1
while count < 12:
    # Fetch each forum page, transform it, and save the result to a file.
    url = baseurl + "?page=" + str(count)
    conn = request.urlopen(url)
    doc = etree.HTML(conn.read())
    transform = etree.XSLT(xslt_root)
    result_tree = transform(doc)
    print(str(result_tree))
    file_obj = open(basefilebegin + str(count) + basefileend, 'w', encoding='UTF-8')
    file_obj.write(str(result_tree))
    file_obj.close()
    count += 1
    time.sleep(2)
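Two optional refinements, not in the original code: the stylesheet can be compiled once outside the loop, and a with-block closes each output file automatically. A sketch of the reworked loop, reusing the variable names above:

# Sketch only: compile the XSLT once and let a with-block manage the file.
transform = etree.XSLT(xslt_root)
for count in range(1, 12):
    url = baseurl + "?page=" + str(count)
    doc = etree.HTML(request.urlopen(url).read())
    result_tree = transform(doc)
    with open(basefilebegin + str(count) + basefileend, 'w', encoding='UTF-8') as f:
        f.write(str(result_tree))
    time.sleep(2)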
This article is from the "Fullerhua" blog. Reprinting is not permitted.