Today we'll use Requests plus XPath to do some simple scraping: fetching the title, publish time, and read count of each post on the CSDN blog home page.
Download PyCharm
I'm using PyCharm here:
http://www.jetbrains.com/pycharm/download/download-thanks.html?platform=windows&code=PCC
Using PyCharm is much like using Android Studio, so I won't say more about it here.

Regular expression basics:
. matches any single character except a newline
* matches the preceding character 0 or more times
? matches the preceding character 0 or 1 times
() only the data captured by the parentheses is returned

Common methods:
findall: matches every occurrence of the pattern and returns a list of the results
search: matches and extracts the first occurrence, returning a match object
sub: replaces the matched content and returns the resulting string

Some simple regex demos follow.
Using .
import re

a = 'xzadkf'
b = re.findall('x...', a)
print(b)  # ['xzad']
Using *
a = 'xzaxkf'
b = re.findall('x*', a)
print(b)  # ['x', '', '', 'x', '', '', '']
Using ?
a = 'xzaxkf'
b = re.findall('x?', a)
print(b)  # ['x', '', '', 'x', '', '', '']
rexstr = 'adsfxxhelloxxiowengopfdwxxworldxxadjgoos'
b = re.findall('xx.*xx', rexstr)  # greedy: match as much as possible
print(b)  # ['xxhelloxxiowengopfdwxxworldxx']
c = re.findall('xx.*?xx', rexstr)  # non-greedy: match as little as possible
print(c)  # ['xxhelloxx', 'xxworldxx']
d = re.findall('xx(.*?)xx', rexstr)  # parentheses return only the captured group
print(d)  # ['hello', 'world']
The re.S flag lets . match newlines as well:
hello = '''adsfxxhello
xxiowengopfdwxxworldxxadjgoos'''
e = re.findall('xx(.*?)xx', hello, re.S)
print(e)  # ['hello\n', 'world']
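The search and sub methods listed above deserve a demo of their own. This is a minimal sketch; the page=... sample string is made up for illustration:

```python
import re

s = 'page=1&size=20'

# re.search: finds the first match only and returns a match object (or None)
m = re.search(r'page=(\d+)', s)
print(m.group(1))  # 1

# re.sub: replaces every match and returns the new string
t = re.sub(r'page=\d+', 'page=2', s)
print(t)  # page=2&size=20
```

Note that search returns None when nothing matches, so check the result before calling group() on it.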
Get title
Here we grab each title on http://jp.tingroom.com/yuedu/yd300p/:

import re
import requests

html = requests.get('http://jp.tingroom.com/yuedu/yd300p/')
html.encoding = 'utf-8'
print(html.text)
title = re.findall('color:#039;">(.*?)</a>', html.text, re.S)
for each in title:
    print(each)
The output looks like this:
Installing lxml for XPath
The XPath support used below comes from the lxml library, which can be installed with pip:
pip install lxml
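A quick way to sanity-check the install is to parse a small snippet with etree.HTML and run an XPath query against it. The HTML here is made up, shaped like the CSDN list markup used below:

```python
from lxml import etree

# Made-up snippet mimicking the blog-list markup scraped later
html = '''
<dl class="blog_list clearfix">
  <dd><h3 class="tracking-ad"><a href="#">First post</a></h3></dd>
  <dd><h3 class="tracking-ad"><a href="#">Second post</a></h3></dd>
</dl>
'''
selector = etree.HTML(html)
# text() extracts the text content of the matched <a> nodes
titles = selector.xpath('//h3[@class="tracking-ad"]/a/text()')
print(titles)  # ['First post', 'Second post']
```

etree.HTML is forgiving about broken markup, which is exactly what you want when scraping real pages.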
Crawl the content of the blog home page
# -*- coding: utf-8 -*-
import re

import requests
from lxml import etree


class Spider(object):
    # Fetch the page source for a URL
    def getsource(self, url):
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:53.0) '
                                 'Gecko/20100101 Firefox/53.0'}
        sourcehtml = requests.get(url, headers=headers)
        return sourcehtml.text

    # Build the links for the following pages by rewriting the page parameter
    def changepage(self, originalstr):
        currentpage = int(re.search(r'page=(\d+)', originalstr).group(1))
        pagegroup = []
        for i in range(currentpage, currentpage + 3):
            link = re.sub(r'page=\d+', 'page=%s' % i, originalstr)
            pagegroup.append(link)
        return pagegroup

    # Parse the data we need out of the HTML
    def getneedinfo(self, sourcehtml):
        currentallinfo = []
        selector = etree.HTML(sourcehtml)
        titles = selector.xpath('//dl[@class="blog_list clearfix"]//dd')
        for vs in titles:
            # Build a fresh dict per entry so earlier results are not overwritten
            singlepageinfo = {}
            info = vs.xpath('h3[@class="tracking-ad"]/a/text()')
            print("title: " + info[0])
            singlepageinfo['title'] = info[0]
            time = vs.xpath('div[@class="blog_list_b clearfix"]'
                            '/div[@class="blog_list_b_r fr"]/label/text()')
            print("time: " + time[0])
            singlepageinfo['time'] = time[0]
            readcount = vs.xpath('div[@class="blog_list_b clearfix"]'
                                 '/div[@class="blog_list_b_r fr"]/span/em/text()')
            print("read times: " + readcount[0])
            singlepageinfo['readcount'] = readcount[0]
            currentallinfo.append(singlepageinfo)
        print(currentallinfo)


if __name__ == '__main__':
    spider = Spider()
    url = 'http://blog.csdn.net/?&page=1'
    allpage = spider.changepage(url)
    for link in allpage:
        print('processing: ' + link)
        sourcehtml = spider.getsource(link)
        spider.getneedinfo(sourcehtml)
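The pagination step can be exercised on its own without any network access. next_pages here is a hypothetical standalone version of the page-rewriting logic used in the spider:

```python
import re

def next_pages(url, n=3):
    # Hypothetical helper mirroring the spider's pagination logic:
    # read the current page number, then emit n consecutive page links.
    current = int(re.search(r'page=(\d+)', url).group(1))
    return [re.sub(r'page=\d+', 'page=%d' % i, url)
            for i in range(current, current + n)]

links = next_pages('http://blog.csdn.net/?&page=1')
print(links)
# ['http://blog.csdn.net/?&page=1',
#  'http://blog.csdn.net/?&page=2',
#  'http://blog.csdn.net/?&page=3']
```

Testing this kind of URL manipulation separately makes it much easier to debug than running the whole spider against the live site.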