Python is not my main line of work. I first learned Python mainly to write crawlers; being able to pull things off the Internet seemed both magical and very useful, since it lets us collect data (or anything else) we might need.
These past two days I had some free time, so to relax I wrote a crawler for fun. Previously I used BeautifulSoup to crawl the basic statistics of a single CSDN blog post (http://blog.csdn.net/hw140701/article/details/55048364). This time I want to start directly from the homepage address of a CSDN blog, crawl the links to all of its posts, and then extract elements from each post; here I extract the text content of each post.
First, the main idea
By analyzing the source code of CSDN blog pages, we find that when we open a blog's homepage URL, such as http://blog.csdn.net/hw140701,
the homepage lists several articles together with their links, 15 articles per page by default. At the bottom of the homepage there are pagination links, for example
"65 articles in 5 pages", where each page contains links to 15 articles.
So the overall approach is:
1. Enter the blog homepage address and get the links to all articles on the current page;
2. Get the link address of each pagination page;
3. From each pagination link, get the links to all articles on that page;
4. From each article's link, fetch its content, until all of the blog's posts have been crawled (a condensed sketch of this flow follows the list).
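The sketch below is a condensed, illustrative version of those four steps, not the final code; it assumes the requests and beautifulsoup4 packages (both imported by the full code in the last section) and the URL patterns analysed in the following sections.

import re
import requests
from bs4 import BeautifulSoup

HOME = "http://blog.csdn.net/hw140701"

def soup_of(url):
    # fetch a page and parse it with BeautifulSoup
    return BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")

# step 2: the homepage itself plus every pagination page it links to
pages = {HOME}
for a in soup_of(HOME).findAll("a", href=re.compile("^/([A-Za-z0-9]+)(/article)(/list)(/[0-9]+)*$")):
    pages.add("http://blog.csdn.net" + a["href"])

# steps 1 and 3: the links to all articles on the homepage and on every pagination page
articles = set()
for page in pages:
    for a in soup_of(page).findAll("a", href=re.compile("^/([A-Za-z0-9]+)(/article)(/details)(/[0-9]+)*$")):
        articles.add("http://blog.csdn.net" + a["href"])

# step 4: print the text content of every article
for article in articles:
    for span in soup_of(article).findAll("span", style=re.compile("font-size:([0-9]+)px")):
        print(span.get_text())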
Second, code analysis
2.1 Pagination link source code analysis
Open the URL in a browser and use the developer tools to inspect the source code of the blog homepage; we find that the pagination link addresses are hidden in the following tags.
So we match all the pagination links with the following code:
bsObj.findAll("a", href=re.compile("^/([A-Za-z0-9]+)(/article)(/list)(/[0-9]+)*$"))  # regex that matches the pagination links
Here, bsObj is the BeautifulSoup object parsed from the page.
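As a quick check of what that pattern accepts (the sample hrefs below are assumed examples in the format the regex targets, not taken from the page source), it can be run on its own:

import re

# the pagination regex from above
page_pattern = re.compile("^/([A-Za-z0-9]+)(/article)(/list)(/[0-9]+)*$")

print(bool(page_pattern.match("/hw140701/article/list/2")))            # True: a pagination href
print(bool(page_pattern.match("/hw140701/article/details/55048364")))  # False: an article href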
2.2 Source code analysis of the article links on each page
After obtaining each pagination link, we analyze the source code of the article links on each page; the relevant source code is as follows.
Based on this analysis, we can use either of the following two methods to match them:
bsObj.findAll("a", href=re.compile("^/([A-Za-z0-9]+)(/article)(/details)(/[0-9]+)*$"))
Or
bsObj.findAll("span", {"class": "link_title"})
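For example, with the second method each article's relative link can be read from the <a> tag inside the matched span. A minimal sketch, assuming the requests package (imported by the full code below) and the blog homepage used throughout this post:

import requests
from bs4 import BeautifulSoup

# parse one page of the blog (the homepage from this post)
bsObj = BeautifulSoup(requests.get("http://blog.csdn.net/hw140701", timeout=30).text, "html.parser")

articleLinks = set()
for span in bsObj.findAll("span", {"class": "link_title"}):
    # each "link_title" span wraps an <a> tag whose href is the article's relative path
    if span.a is not None and "href" in span.a.attrs:
        articleLinks.add(span.a.attrs["href"])
print(articleLinks)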
2.3 Source code analysis of the text content in each article
By analyzing the source code of each article's page, we find that the text content sits at the following location in the source code.
So we match it with the following code:
bsObj.findAll("span", style=re.compile("font-size:([0-9]+)px"))
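Once those spans are matched, get_text() strips the tags and leaves only the article text. A minimal sketch for a single article, assuming the requests package and using the sample post linked at the top of this article:

import re
import requests
from bs4 import BeautifulSoup

# fetch and parse one article page (the sample post linked at the top)
url = "http://blog.csdn.net/hw140701/article/details/55048364"
bsObj = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")

# print the text of every span whose inline style sets a font size
for textSpan in bsObj.findAll("span", style=re.compile("font-size:([0-9]+)px")):
    print(textSpan.get_text())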
Third, full code and results
The full code is attached below. The comments may contain mistakes; you can modify this code to crawl any element of any CSDN blog.
# __author__ = 'Administrator'
# coding=utf-8
import io
import os
import sys
import urllib
from urllib.request import urlopen
from urllib import request
from bs4 import BeautifulSoup
import datetime
import random
import re
import requests
import socket

socket.setdefaulttimeout(30)  # global socket timeout in seconds (value assumed; the original value was lost)
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='gb18030')

headers1 = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'}
headers2 = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36'}
headers3 = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11'}

# get the links to all articles on one page of a CSDN blog
articles = set()
def getArticleLinks(pageUrl):
    # set a proxy IP
    # proxy IPs can be obtained from http://zhimaruanjian.com/
    proxy_handler = urllib.request.ProxyHandler({'post': '210.136.17.78:8080'})
    proxy_auth_handler = urllib.request.ProxyBasicAuthHandler()
    opener = urllib.request.build_opener(urllib.request.HTTPHandler, proxy_handler)
    urllib.request.install_opener(opener)
    # fetch the page
    req = request.Request(pageUrl, headers=headers1 or headers2 or headers3)
    html = urlopen(req)
    bsObj = BeautifulSoup(html.read(), "html.parser")
    global articles
    # return bsObj.findAll("a", href=re.compile("^/([A-Za-z0-9]+)(/article)(/details)(/[0-9]+)*$"))
    # return bsObj.findAll("a")
    for articleList in bsObj.findAll("span", {"class": "link_title"}):  # match the link of each article
        # print(articleList)
        if 'href' in articleList.a.attrs:
            if articleList.a.attrs["href"] not in articles:
                # found a new article page
                newArticle = articleList.a.attrs["href"]
                # print(newArticle)
                articles.add(newArticle)

# articleLinks = getArticleLinks("http://blog.csdn.net/hw140701")
# for list in articleLinks:
#     print(list.attrs["href"])
#     print(list.a.attrs["href"])

# write to a text file
# def data_out(data):
#     with open("E:/csdn.txt", "a+") as out:
#         out.write('\n')
#         out.write(data)

# get the text content of one CSDN blog article
def getArticleText(articleUrl):
    # set a proxy IP
    # proxy IPs can be obtained from http://zhimaruanjian.com/
    proxy_handler = urllib.request.ProxyHandler({'https': '111.76.129.200:808'})
    proxy_auth_handler = urllib.request.ProxyBasicAuthHandler()
    opener = urllib.request.build_opener(urllib.request.HTTPHandler, proxy_handler)
    urllib.request.install_opener(opener)
    # fetch the page
    req = request.Request(articleUrl, headers=headers1 or headers2 or headers3)
    html = urlopen(req)
    bsObj = BeautifulSoup(html.read(), "html.parser")
    # get the text content of the article
    for textList in bsObj.findAll("span", style=re.compile("font-size:([0-9]+)px")):  # match the tags holding the text
        print(textList.get_text())
        # data_out(textList.get_text())

# get the links to all pagination pages of a blog homepage, then get the link of every
# article from each pagination page and crawl the text of every blog post
pages = set()
def getPageLinks(bokezhuye):
    # set a proxy IP
    # proxy IPs can be obtained from http://zhimaruanjian.com/
    proxy_handler = urllib.request.ProxyHandler({'post': '121.22.252.85:8000'})
    proxy_auth_handler = urllib.request.ProxyBasicAuthHandler()
    opener = urllib.request.build_opener(urllib.request.HTTPHandler, proxy_handler)
    urllib.request.install_opener(opener)
    # fetch the page
    req = request.Request(bokezhuye, headers=headers1 or headers2 or headers3)
    html = urlopen(req)
    bsObj = BeautifulSoup(html.read(), "html.parser")
    # get the links to all articles on the current (first) page
    getArticleLinks(bokezhuye)
    # skip duplicate links
    global pages
    for pageList in bsObj.findAll("a", href=re.compile("^/([A-Za-z0-9]+)(/article)(/list)(/[0-9]+)*$")):  # match the pagination links
        if 'href' in pageList.attrs:
            if pageList.attrs["href"] not in pages:
                # found a new pagination page
                newPage = pageList.attrs["href"]
                # print(newPage)
                pages.add(newPage)
                # get the link of every article on this new page
                newPageLink = "http://blog.csdn.net/" + newPage
                getArticleLinks(newPageLink)
    # crawl the text content of every article
    for articleList in articles:
        newArticleList = "http://blog.csdn.net/" + articleList
        print(newArticleList)
        getArticleText(newArticleList)

# getArticleLinks("http://blog.csdn.net/hw140701")
getPageLinks("http://blog.csdn.net/hw140701")
# getArticleText("http://blog.csdn.net/hw140701/article/details/55104018")
Results
Sometimes the output is garbled; this is caused by spaces in the text (an encoding issue), and I have not found a way to solve it yet.
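One possible mitigation, which I have not verified against these pages, is to let the GB18030 console wrapper replace characters it cannot encode, and to strip non-breaking spaces from the extracted text before printing:

import io
import sys

# replace unencodable characters instead of letting them garble the output (assumed workaround)
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='gb18030', errors='replace')

def clean_text(text):
    # non-breaking spaces (\xa0) are a common source of this kind of garbling
    return text.replace('\xa0', ' ')

print(clean_text('example\xa0text'))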
In addition, sometimes the server does not respond and an error is raised, as follows:
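One way to keep the crawl going when that happens (a sketch under the assumption that retrying a failed request a few times and then skipping it is acceptable) is to wrap the request in a try/except:

import socket
import urllib.error
from urllib.request import Request, urlopen

def fetch(url, headers, retries=3):
    # try the request a few times; give up and return None if the server keeps failing
    for attempt in range(retries):
        try:
            return urlopen(Request(url, headers=headers), timeout=30).read()
        except (urllib.error.URLError, socket.timeout) as err:
            print("request failed (%s), attempt %d of %d" % (err, attempt + 1, retries))
    return None

html = fetch("http://blog.csdn.net/hw140701", {'User-Agent': 'Mozilla/5.0'})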
Python crawler small practice: crawl the text content of any CSDN blog post (the code can be rewritten to save other elements), indirectly increasing the blog's visit count.