Python Crawler in Practice

Source: Internet
Author: User
Tags: xpath

Task: get the title and content of the Python section of Liao Xuefeng's official website, then get the content of the entire Python tutorial, not just this one page: http://www.liaoxuefeng.com/wiki/0014316089557264a6b348958f449949df42a6d3a2e542c000

  1. Analyze the page source:

    The Python tutorial sits in the div with class="x-content", inside which is the div with class="x-wiki-content".

  2. Get the title and content of the article:

    To get the source code of the web page, use the requests module:

    import requests

    htmlsource = requests.get('http://www.liaoxuefeng.com/wiki/0014316089557264a6b348958f449949df42a6d3a2e542c000')
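A slightly more defensive version of this fetch could look like the sketch below. The helper name fetch_html, the 10-second timeout, and the status check are assumptions added for illustration; they are not part of the original steps.

```python
import requests


def fetch_html(url, timeout=10):
    """Download a page and return its HTML as text (hypothetical helper)."""
    # Without a timeout, requests can wait indefinitely on a stalled server.
    response = requests.get(url, timeout=timeout)
    # Raise an exception on 4xx/5xx responses instead of parsing an error page.
    response.raise_for_status()
    return response.text
```

Calling fetch_html() with the tutorial URL from the task would return the same text as htmlsource.text in the steps that follow.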

  3. Get the title:
      a) Match with a regular expression:

    import re

    title = re.findall('

      b) Use XPath:

    Before matching, build a tree structure from the page source:

    from lxml import etree

    selector = etree.HTML(htmlsource.text)

    selector.xpath('//div[@class="x-content"]/h4/text()')[0]

  4. Get the content:
      a) Match with a regular expression:

    content = re.findall('class="x-wiki-content">(.*?)</div>', htmlsource.text, re.S)[0]

    This returns everything inside that div, tags included. You can then use the re module's sub method to replace every tag with an empty string, for example:

    re.sub('<(.+?)>', '', content, count=0)
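To see the two regular-expression calls working together, here is a minimal sketch that runs offline. The sample_html string is a made-up stand-in for htmlsource.text, not the real page.

```python
import re

# Hypothetical sample standing in for htmlsource.text.
sample_html = (
    '<div class="x-content"><h4>Python Tutorial</h4>'
    '<div class="x-wiki-content"><p>Hello, <b>crawler</b>!</p>\n'
    '<p>Second paragraph.</p></div></div>'
)

# re.S lets "." match newlines, so the match can span several lines.
content = re.findall(r'class="x-wiki-content">(.*?)</div>', sample_html, re.S)[0]

# Replace every tag with an empty string, leaving only the text.
plain_text = re.sub(r'<(.+?)>', '', content)
print(plain_text)
```

Because (.*?) is non-greedy, the match stops at the first </div> it meets, so a content div that contains nested divs would be cut short with this pattern.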

      b) Get the content with XPath:

    contentdiv = selector.xpath('//div[@class="x-wiki-content"]')

    print(contentdiv[0].xpath('string(.)'))
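The XPath route for both the title and the content can be tried offline with the sketch below. It assumes lxml is installed, and the sample_html string is a made-up stand-in for htmlsource.text.

```python
from lxml import etree

# Hypothetical sample standing in for htmlsource.text.
sample_html = (
    '<div class="x-content"><h4>Python Tutorial</h4>'
    '<div class="x-wiki-content"><p>Hello, <b>crawler</b>!</p>\n'
    '<p>Second paragraph.</p></div></div>'
)

# Build the element tree that XPath queries run against.
selector = etree.HTML(sample_html)

# text() returns a list of text nodes; the title is the first one.
title = selector.xpath('//div[@class="x-content"]/h4/text()')[0]

# string(.) concatenates all text inside the element, dropping the tags.
contentdiv = selector.xpath('//div[@class="x-wiki-content"]')
content = contentdiv[0].xpath('string(.)')

print(title)
print(content)
```

Compared with the regular-expression version, string(.) needs no separate tag-stripping step and is not confused by nested divs.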

  5. The steps above extract the content of a single web page. Next, we extract the links on the left side of the page to get the URLs for all the content of the Python tutorial.
  6. Analyze the HTML source of the left side of the website and build urllist:

    The structure is ul --> li --> a, so selector.xpath('//ul[@class="uk-nav uk-nav-side"]')[1] is the ul we want.

    Note that each href holds a relative path: compared with the real address, it is missing the http:// scheme and the domain name, so after getting the href we need to process it further.

    Using regular expressions for this is cumbersome, so it is simpler to get the URLs with XPath:

    urllist = []

    linklist = selector.xpath('//ul[@class="uk-nav uk-nav-side"]')[1]

    # Iterating over the ul element yields its li children.
    for i in linklist:
        urllist.append('http://www.liaoxuefeng.com' + i.xpath('a/@href')[0])

    This gives you the urllist you want.
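As an alternative to string concatenation, the standard library's urljoin can turn the relative hrefs into full addresses. This is a swapped-in technique rather than the method used above, shown for Python 3; the relative_hrefs values are hypothetical placeholders for what i.xpath('a/@href')[0] would return.

```python
from urllib.parse import urljoin

BASE_URL = 'http://www.liaoxuefeng.com'

# Hypothetical relative hrefs standing in for the values read from the sidebar.
relative_hrefs = ['/wiki/page-1', '/wiki/page-1/page-2']

# urljoin resolves each relative path against the base address.
urllist = [urljoin(BASE_URL, href) for href in relative_hrefs]
print(urllist)
```

One practical difference: if an href is already an absolute URL, urljoin returns it unchanged, whereas plain concatenation would produce a broken address.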

  7. Get all the titles and content based on these URLs:

    The first method is to traverse urllist:

    For each URL, get the title and content, then write them to a file. This is very slow and may take a minute or more.

    The second method is to use the map() function:

    Pass in a crawler function (whose parameter is a URL) and the list of URLs.
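Both methods can be sketched as follows. The crawl function here is a hypothetical stand-in that returns placeholder values so the example runs offline; a real version would download each page and extract the title and content as in steps 2 to 4. The text does not say which map is meant, so the thread-pool variant is an assumption: the built-in map() still visits the URLs one at a time, while a pool's map() can fetch several pages concurrently.

```python
from concurrent.futures import ThreadPoolExecutor


def crawl(url):
    """Hypothetical crawler function: takes one URL, returns (title, content)."""
    # Placeholder values; a real crawler would fetch and parse the page here.
    return ('title of ' + url, 'content of ' + url)


# Hypothetical URLs standing in for the urllist built in step 6.
urllist = [
    'http://www.liaoxuefeng.com/wiki/page-1',
    'http://www.liaoxuefeng.com/wiki/page-2',
]

# Method 1: traverse urllist with a plain loop.
results_loop = []
for url in urllist:
    results_loop.append(crawl(url))

# Method 2: pass the crawler function and the URL list to map().
results_map = list(map(crawl, urllist))

# Assumed variant: a thread pool's map() runs the calls concurrently
# and still returns the results in the order of urllist.
with ThreadPoolExecutor(max_workers=4) as pool:
    results_pool = list(pool.map(crawl, urllist))
```

All three produce the same list of (title, content) pairs; only the pool version overlaps the waiting time of the network requests.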
