Python crawler practice: crawl the text content of any CSDN blog's posts (the script can be adapted to save other elements) and indirectly increase the blog's visit count


Python is not my main line of work; I first picked it up mainly to write crawlers. Being able to pull things off the Internet felt both magical and genuinely useful, because it lets you collect data (or anything else) on whatever topic you need.

Over the past couple of days I had some free time, so to relax I played around with crawlers. Earlier I used BeautifulSoup to crawl the basic statistics of a single CSDN blog post (http://blog.csdn.net/hw140701/article/details/55048364). Today I want to start directly from a CSDN blog's homepage, crawl the links of all of its posts, and then extract a chosen element from each post; here I extract the text content of every post.

First, the main ideas

By analyzing the source code of the CSDN blog site, we find that when we open a blog's homepage URL, such as http://blog.csdn.net/hw140701, the homepage lists a number of articles together with their links, 15 per page by default.

At the bottom of the homepage there are pagination links, which show, for example, a total of 65 articles across 5 pages, with each page containing 15 article links.

So our overall approach is:

1. Enter the blog homepage address and get the links of all articles on the current page;

2. Get the link address of each pagination page;

3. Through each pagination link, get the links of all articles on that page;

4. Using each article's link, get the content of each article, until all of the blog's posts have been crawled (a minimal skeleton of this flow is sketched below).
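
A minimal skeleton of that flow, using the same function names as the full script in the third section, looks like this (the bodies are filled in later; this only sketches how the pieces fit together):

[python]

    articles = set()   # links of every article found so far
    pages = set()      # pagination links found on the homepage

    def getArticleLinks(pageUrl):
        """Collect the article links found on one page and add them to `articles`."""

    def getArticleText(articleUrl):
        """Print the text content of one article."""

    def getPageLinks(homepageUrl):
        """Collect every pagination link, call getArticleLinks() on each page,
        then call getArticleText() on every collected article link."""

    # Entry point: start from the blog homepage
    # getPageLinks("http://blog.csdn.net/hw140701")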

Second, code analysis

2.1 Pagination link source code analysis

Open the blog homepage in a browser and view its source code with the developer tools; the pagination link addresses turn out to be hidden inside anchor tags whose href has the form /username/article/list/N.

So we match all of the pagination links with the following code:

[python]

    bsObj.findAll("a", href=re.compile("^/([A-Za-z0-9]+)(/article)(/list)(/[0-9]+)*$"))  # regex matching the pagination links

where bsObj is the BeautifulSoup object built from the page.
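
For reference, a minimal self-contained sketch of this step, assuming the homepage URL from above and no proxy, would be:

[python]

    import re
    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    # Fetch the blog homepage and build the BeautifulSoup object
    html = urlopen("http://blog.csdn.net/hw140701")
    bsObj = BeautifulSoup(html.read(), "html.parser")

    # Match pagination links of the form /<username>/article/list/<n>
    for page in bsObj.findAll("a", href=re.compile("^/([A-Za-z0-9]+)(/article)(/list)(/[0-9]+)*$")):
        print(page.attrs["href"])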

2.2 Source code analysis of the article links on each page

After getting the link of each pagination page, we analyze the source code of the article links on that page.

Based on this analysis, we use either of the following methods to match them:

[python]

    bsObj.findAll("a", href=re.compile("^/([A-Za-z0-9]+)(/article)(/details)(/[0-9]+)*$"))


Or

[python]

    bsObj.findAll("span", {"class": "link_title"})
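
As a rough, self-contained sketch (the page URL is just an example of one pagination page), either filter can collect the article links; the span-based version reads the href from the <a> tag nested inside each title span:

[python]

    import re
    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    html = urlopen("http://blog.csdn.net/hw140701/article/list/1")  # example pagination page
    bsObj = BeautifulSoup(html.read(), "html.parser")

    articles = set()
    # Option 1: match the article detail URLs directly
    for a in bsObj.findAll("a", href=re.compile("^/([A-Za-z0-9]+)(/article)(/details)(/[0-9]+)*$")):
        articles.add(a.attrs["href"])
    # Option 2: match the title spans and read the href of the nested <a>
    for span in bsObj.findAll("span", {"class": "link_title"}):
        if span.a is not None and 'href' in span.a.attrs:
            articles.add(span.a.attrs["href"])
    print(articles)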


2.3 Source code analysis of the text content of each article

By analyzing the source code of each article page, we find that the body text sits inside <span> tags whose style attribute sets a font size in pixels.

So we match it with the following code:

[python]

    bsObj.findAll("span", style=re.compile("font-size:([0-9]+)px"))
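
Putting this together for a single post, a small sketch that prints the body text of one article (the URL is the one used at the bottom of the full script) might look like:

[python]

    import re
    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    html = urlopen("http://blog.csdn.net/hw140701/article/details/55104018")
    bsObj = BeautifulSoup(html.read(), "html.parser")

    # Each matched <span style="font-size:...px"> holds a run of the article's body text
    for textSpan in bsObj.findAll("span", style=re.compile("font-size:([0-9]+)px")):
        print(textSpan.get_text())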


Third, all code and results

The complete code is given below. Some of the commented-out parts may not be accurate; you can modify this code to crawl any element of a CSDN blog.

[python]

    # __author__ = 'Administrat
    # coding=utf-8
    import io
    import os
    import sys
    import urllib
    from urllib.request import urlopen
    from urllib import request
    from bs4 import BeautifulSoup
    import datetime
    import random
    import re
    import requests
    import socket

    socket.setdefaulttimeout(5000)  # set a global timeout (the value was lost in the original listing; 5000 is assumed)
    sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='GB18030')

    headers1 = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'}
    headers2 = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36'}
    headers3 = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11'}

    # Get the links of all articles on one pagination page of a CSDN blog
    articles = set()
    def getArticleLinks(pageUrl):
        # Set a proxy IP
        # Proxy IPs can be obtained from http://zhimaruanjian.com/
        proxy_handler = urllib.request.ProxyHandler({'post': '210.136.17.78:8080'})
        proxy_auth_handler = urllib.request.ProxyBasicAuthHandler()
        opener = urllib.request.build_opener(urllib.request.HTTPHandler, proxy_handler)
        urllib.request.install_opener(opener)
        # Fetch the page
        req = request.Request(pageUrl, headers=headers1 or headers2 or headers3)
        html = urlopen(req)
        bsObj = BeautifulSoup(html.read(), "html.parser")
        global articles
        # return bsObj.findAll("a", href=re.compile("^/([A-Za-z0-9]+)(/article)(/details)(/[0-9]+)*$"))
        # return bsObj.findAll("a")
        for articleList in bsObj.findAll("span", {"class": "link_title"}):  # match the link of each article
            # print(articleList)
            if 'href' in articleList.a.attrs:
                if articleList.a.attrs["href"] not in articles:
                    # Found a new article link
                    newArticle = articleList.a.attrs["href"]
                    # print(newArticle)
                    articles.add(newArticle)

    # articleLinks = getArticleLinks("http://blog.csdn.net/hw140701")
    # for list in articleLinks:
    #     print(list.attrs["href"])
    #     print(list.a.attrs["href"])

    # Write to a text file
    # def data_out(data):
    #     with open("E:/csdn.txt", "a+") as out:
    #         out.write('\n')
    #         out.write(data)

    # Get the text content of one CSDN blog article
    def getArticleText(articleUrl):
        # Set a proxy IP
        # Proxy IPs can be obtained from http://zhimaruanjian.com/
        proxy_handler = urllib.request.ProxyHandler({'https': '111.76.129.200:808'})
        proxy_auth_handler = urllib.request.ProxyBasicAuthHandler()
        opener = urllib.request.build_opener(urllib.request.HTTPHandler, proxy_handler)
        urllib.request.install_opener(opener)
        # Fetch the page
        req = request.Request(articleUrl, headers=headers1 or headers2 or headers3)
        html = urlopen(req)
        bsObj = BeautifulSoup(html.read(), "html.parser")
        # Extract the text content of the article
        for textList in bsObj.findAll("span", style=re.compile("font-size:([0-9]+)px")):  # match the text-content tags
            print(textList.get_text())
            # data_out(textList.get_text())

    # Get all pagination links on a blog's homepage, get the link of each article
    # from each pagination link, and crawl the text of every post
    pages = set()
    def getPageLinks(bokezhuye):
        # Set a proxy IP
        # Proxy IPs can be obtained from http://zhimaruanjian.com/
        proxy_handler = urllib.request.ProxyHandler({'post': '121.22.252.85:8000'})
        proxy_auth_handler = urllib.request.ProxyBasicAuthHandler()
        opener = urllib.request.build_opener(urllib.request.HTTPHandler, proxy_handler)
        urllib.request.install_opener(opener)
        # Fetch the page
        req = request.Request(bokezhuye, headers=headers1 or headers2 or headers3)
        html = urlopen(req)
        bsObj = BeautifulSoup(html.read(), "html.parser")
        # Get the links of all articles on the current page (the homepage)
        getArticleLinks(bokezhuye)
        # Skip duplicate links
        global pages
        for pageList in bsObj.findAll("a", href=re.compile("^/([A-Za-z0-9]+)(/article)(/list)(/[0-9]+)*$")):  # match the pagination links
            if 'href' in pageList.attrs:
                if pageList.attrs["href"] not in pages:
                    # Found a new page
                    newPage = pageList.attrs["href"]
                    # print(newPage)
                    pages.add(newPage)
                    # Get the links of every article on this page
                    newPageLink = "http://blog.csdn.net/" + newPage
                    getArticleLinks(newPageLink)
        # Crawl the text content of each article
        for articleList in articles:
            newArticleList = "http://blog.csdn.net/" + articleList
            print(newArticleList)
            getArticleText(newArticleList)

    # getArticleLinks("http://blog.csdn.net/hw140701")
    getPageLinks("http://blog.csdn.net/hw140701")
    # getArticleText("http://blog.csdn.net/hw140701/article/details/55104018")

Results

Sometimes garbled characters appear in the output; this is caused by the presence of spaces in the text, and for now I have not found a way to solve it.

In addition, there are times when the server does not respond and an error is raised.
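
The script above does not handle these timeouts; one possible workaround, not part of the original code, is to wrap the request in a small retry helper, for example:

[python]

    import socket
    from urllib import request
    from urllib.request import urlopen
    from urllib.error import URLError

    def fetch_with_retries(url, headers, retries=3):
        # Try the request a few times before giving up when the server does not respond
        for attempt in range(retries):
            try:
                req = request.Request(url, headers=headers)
                return urlopen(req, timeout=10).read()
            except (URLError, socket.timeout) as exc:
                print("attempt %d failed: %s" % (attempt + 1, exc))
        return None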
