Task: get the title and content of the Python section of Liao Xuefeng's official website, then get the content of the entire Python tutorial, not just this one page: http://www.liaoxuefeng.com/wiki/0014316089557264a6b348958f449949df42a6d3a2e542c000
- Analyze the page source:
The Python tutorial sits in the div with class="x-content"; inside it, the article body is in the div with class="x-wiki-content"
- Get the title and content of the article:
To get the source code of the web page, use the requests module:
import requests
htmlsource = requests.get('http://www.liaoxuefeng.com/wiki/0014316089557264a6b348958f449949df42a6d3a2e542c000')
- Get title:
- Match with a regular expression:
title = re.findall('
- Match with XPath:
An element tree has to be built before matching:
selector = etree.HTML(htmlsource.text)
title = selector.xpath('//div[@class="x-content"]/h4/text()')[0]
- Get content:
- Match with a regular expression:
content = re.findall('class="x-wiki-content">(.*?)</div>', htmlsource.text, re.S)[0]
This gets everything inside that div, tags included. You can then use the re module's sub method to replace all tags with empty strings, for example:
re.sub('<(.+?)>', '', content, count=0)
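The regular-expression route above can be tried offline. The following is a minimal Python 3 sketch under stated assumptions: `sample_html` is a made-up stand-in for `htmlsource.text`, and `extract_content` is a hypothetical helper name, not something from the original post.

```python
import re

# Made-up stand-in for htmlsource.text, so the sketch runs without a network call.
sample_html = '''
<div class="x-content">
  <h4>Python tutorial</h4>
  <div class="x-wiki-content"><p>This is a <strong>Python</strong> tutorial.</p>
<p>Second paragraph.</p></div>
</div>
'''

def extract_content(html):
    """Hypothetical helper: return the text of the x-wiki-content div, tags removed."""
    # re.S lets "." match newlines; the non-greedy (.*?) stops at the first </div>.
    block = re.findall(r'class="x-wiki-content">(.*?)</div>', html, re.S)[0]
    # Replace every tag with an empty string, leaving only the text.
    return re.sub(r'<.+?>', '', block).strip()

content_text = extract_content(sample_html)
print(content_text)
```

Because the pattern stops at the first `</div>`, a nested div inside the content would cut the match short. That is one reason the XPath approach tends to be more robust.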
- Get the content with XPath:
contentdiv = selector.xpath('//div[@class="x-wiki-content"]')
print(contentdiv[0].xpath('string(.)'))
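The XPath steps for the title and the content can be sketched together as follows. This assumes the third-party lxml package is installed and runs on a made-up `sample_html` string in place of `htmlsource.text`; the real page markup may differ.

```python
from lxml import etree

# Made-up stand-in for htmlsource.text, so the sketch runs without a network call.
sample_html = '''
<html><body>
<div class="x-content">
  <h4>Python tutorial</h4>
  <div class="x-wiki-content"><p>This is a <strong>Python</strong> tutorial.</p></div>
</div>
</body></html>
'''

# Build the element tree once, then query it with XPath.
selector = etree.HTML(sample_html)

# text() returns the text nodes directly under the h4 element.
title = selector.xpath('//div[@class="x-content"]/h4/text()')[0]

# string(.) concatenates all text inside the div, ignoring the tags.
contentdiv = selector.xpath('//div[@class="x-wiki-content"]')
content = contentdiv[0].xpath('string(.)')

print(title)
print(content)
```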
- The previous step extracts the content of one page. Next, extract the links on the left side of the page to get the URLs for all the content of the Python tutorial
- Analyze the HTML source of the left side of the page and get urllist:
The structure is ul --> li --> a, so selector.xpath('//ul[@class="uk-nav uk-nav-side"]')[1] is the ul we want.
Also, each href is a relative path: compared with the real address it is missing the http scheme and the domain name, so after getting the href we need to process it further.
Regular expressions are cumbersome here, so use the simpler XPath to get the URLs:
urllist = []
linklist = selector.xpath('//ul[@class="uk-nav uk-nav-side"]')[1]
# iterating over the ul element yields its li children
for i in linklist:
    urllist.append('http://www.liaoxuefeng.com' + i.xpath('a/@href')[0])
This gives the urllist we want.
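As an alternative to string concatenation, the relative-to-absolute step can be sketched with the standard library's `urljoin`. This is a swapped-in technique, not the original post's method. The `hrefs` list is made up for illustration; in the real crawler it would come from the `a/@href` XPath results.

```python
from urllib.parse import urljoin

BASE = 'http://www.liaoxuefeng.com'

# Made-up hrefs for illustration: two relative paths and one already-absolute URL.
hrefs = [
    '/wiki/0014316089557264a6b348958f449949df42a6d3a2e542c000',
    '/wiki/example-page',             # hypothetical path
    'http://www.example.com/other',   # absolute URLs pass through unchanged
]

# urljoin adds the scheme and domain to relative paths only.
urllist = [urljoin(BASE, href) for href in hrefs]

for url in urllist:
    print(url)
```

Unlike plain concatenation, `urljoin` does not produce a broken address when an href happens to be absolute already.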
- Get all the titles and content based on these URLs
The first method is to loop over urllist:
For each URL, get the title and the content, then write them to a file. This is very slow and may take one to several minutes.
The second method is to use the map() function:
Pass in a crawler function (whose parameter is a URL) and a list of URLs.
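The map() approach can be sketched as below. Several things here are assumptions and not from the original post: the thread pool from `concurrent.futures` (the post only says "map()"), the names `crawl` and `crawl_all`, and `fake_fetch`, which stands in for `requests.get(url).text` so the sketch runs offline.

```python
from concurrent.futures import ThreadPoolExecutor

def crawl(url, fetch):
    """Hypothetical crawler function: fetch one page and return (url, page length)."""
    html = fetch(url)
    return url, len(html)

def crawl_all(urls, fetch, workers=4):
    """Apply crawl to every URL using a thread pool's map()."""
    # Executor.map returns results in the same order as the input URLs.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda u: crawl(u, fetch), urls))

def fake_fetch(url):
    # Stand-in for requests.get(url).text; no network access needed.
    return '<html>' + url + '</html>'

# Hypothetical URLs for illustration.
urls = [
    'http://www.liaoxuefeng.com/wiki/a',
    'http://www.liaoxuefeng.com/wiki/b',
]
results = crawl_all(urls, fake_fetch)
for url, size in results:
    print(url, size)
```

The built-in `map(crawl_function, urllist)` has the same shape but runs sequentially. A pool's map lets the downloads overlap while they wait on the network, which is where a crawler spends most of its time.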