[Python learning] A simple web crawler: crawling blog posts, with an introduction to the ideas

Source: Internet
Author: User
Tags: wrapper, python, web crawler

My previous articles emphasized that Python is very effective for writing web crawlers. This article combines knowledge from a Python video course with knowledge from my postgraduate data mining direction, and introduces how to crawl network data with Python. The material is easy, but I share it with everyone as a simple introduction. I am only sharing knowledge here; I hope you will not harm the network or infringe on other people's original articles. It mainly includes:
1. An introduction to the simple idea and process of crawling my own CSDN blog posts
2. Python source code that crawls all 316 articles of Han Han's Sina blog

One. The simple idea of a crawler

Recently, from Liu Bing's "Web Data Mining", I learned that research on the information extraction problem mainly uses three approaches:
1. Manual method: observe the Web page and its source code to find patterns, then write code to extract the target data. However, this method cannot handle a large number of sites.
2. Wrapper induction: a supervised, semi-automatic learning method. It learns a set of extraction rules from manually annotated Web pages or data records, and uses them to extract data from Web pages in a similar format.
3. Automatic extraction: an unsupervised method. Given one page or a number of pages, it automatically finds patterns or grammars to achieve data extraction. Because no manual labeling is needed, it can handle data extraction for a large number of websites and pages.
The Python web crawler used here is a simple data extraction program; I will continue to study Python + data mining and write more articles of this kind. First, I want to download all of my own CSDN blog posts (static .html files). The detailed idea and implementation are as follows:
The first step is to analyze the source code of the CSDN blog
The first thing to do is analyze the blog's source code. Open a CSDN article, press F12 in IE or right-click "Inspect element" in Google Chrome, and analyze the basic structure of the blog. The page http://blog.csdn.net/Eastmount links to all of the author's posts.
The source code format shown is as follows:

Each displayed blog post is wrapped in a <div class="list_item article_item"> ... </div> block. Take the first post as an example:

Its detailed HTML source code is as follows:

So on each page of the blog list we just need to get the link inside <div class="article_title">, e.g. <a href="/eastmount/article/details/39599061">, and prepend http://blog.csdn.net to it. The corresponding code:

import urllib
content = urllib.urlopen("http://blog.csdn.net/eastmount/article/details/39599061").read()
open('test.html', 'w+').write(content)

However, CSDN prohibits this behavior: the server forbids crawling site content for republication elsewhere. Our blog posts are often scraped by other sites without stating the original source; please respect original work. The request above simply returns the error "403 Forbidden".
PS: It is said that simulating a normal browser visit can crawl CSDN content; readers can research this themselves, and I will not introduce it in depth here, beyond the small sketch after the references below. References (verified):
http://www.yihaomen.com/article/python/210.htm
http://www.2cto.com/kf/201405/304829.html
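A minimal sketch of this "simulate a normal visit" idea, in Python 2 as used throughout this article, assuming the server only inspects the User-Agent header (both the header value and the effectiveness of the trick are unverified assumptions, not the exact method of the references above):

import urllib2

# Send a browser-like User-Agent header so the request looks like a normal
# visit (assumption: the server decides based on this header alone).
req = urllib2.Request("http://blog.csdn.net/eastmount/article/details/39599061",
                      headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36'})
content = urllib2.urlopen(req).read()
open('test.html', 'w+').write(content)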
The second step: get all of your own articles
Here we just discuss the idea, assuming the first article has already been fetched successfully. Using Python's find() to search for the next article link starting from the position where the previous one was found, you can get all the articles on the first page. A page shows 20 articles, and the last page shows the remaining ones. A concrete sketch of this is given after the pseudocode below.
So how do you get articles on other pages?


We can find that the hyperlinks shown when jumping to a different page are:

1st page: http://blog.csdn.net/Eastmount/article/list/1
2nd page: http://blog.csdn.net/Eastmount/article/list/2
3rd page: http://blog.csdn.net/Eastmount/article/list/3
4th page: http://blog.csdn.net/Eastmount/article/list/4

This idea is easy, and the process is simple, as follows:

for i in range(4):        # get the articles of every list page
    for j in range(20):   # get one page's articles (note: the last page has fewer)
        GetContent()      # fetch one article, mainly to get its hyperlink
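As a minimal runnable sketch of this two-layer idea in Python 2 (the list-page URLs are those shown above; the <div class="article_title"> markup and the string offsets are assumptions about the page source, and CSDN may still answer 403 as noted earlier):

import urllib

links = []
for i in range(1, 5):      # the four list pages shown above
    con = urllib.urlopen('http://blog.csdn.net/Eastmount/article/list/' + str(i)).read()
    pos = con.find(r'<div class="article_title"')
    while pos != -1:       # keep searching from the previous successful position
        href = con.find(r'href="', pos)
        end = con.find(r'"', href + 6)    # 'href="' is 6 characters long
        links.append('http://blog.csdn.net' + con[href+6:end])
        pos = con.find(r'<div class="article_title"', end)
print links                # the hyperlinks of all articles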

At the same time, learning regular expressions makes extracting web page content particularly convenient. For example, I once used C# and regular expressions to download images: http://blog.csdn.net/eastmount/article/details/12235521
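For instance, a small regex version of the same link extraction in Python 2 (the pattern is an assumption based on the <div class="article_title"> markup shown earlier, not verified against the live page):

import re
import urllib

con = urllib.urlopen('http://blog.csdn.net/Eastmount/article/list/1').read()
# Capture the relative article link inside each article_title div.
pattern = r'<div class="article_title".*?href="(/eastmount/article/details/\d+)"'
for path in re.findall(pattern, con, re.S):
    print 'http://blog.csdn.net' + path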

Two. Crawling Sina Blog

The above is the simple idea of the crawler, but some site servers forbid fetching their content, while Sina blogs can still be crawled. Here, following the "51CTO College Zhipu Education Python video", we get all of Han Han's Sina blog posts.
Address: http://blog.sina.com.cn/s/articlelist_1191258123_0_1.html
In the same way as above, we can see that each <div class="articleCell SG_j_linedot1"> ... </div> block includes the hyperlink of one article, for example:

At this point, the code to fetch one article with Python is as follows:

import urllib
content = urllib.urlopen("http://blog.sina.com.cn/s/blog_4701280b0102eo83.html").read()
open('blog.html', 'w+').write(content)

        
<a title="On the seven elements of a movie -- some of my views on film and some news" target="_blank" href="http://blog.sina.com.cn/s/blog_4701280b0102eo83.html">On the seven elements of a movie</a>

Before introducing regular expressions, let's get this hyperlink manually with Python: starting from the beginning of the page, find the first "<a title", then find "href=" and ".html" from that position to get "http://blog.sina.com.cn/s/blog_4701280b0102eo83.html". Code as follows:

#coding: utf-8
# target: <a title="..." target="_blank" href="http://blog.sina...html"></a>
import urllib

con = urllib.urlopen("http://blog.sina.com.cn/s/articlelist_1191258123_0_1.html").read()
title = con.find(r'<a title=')
href = con.find(r'href=', title)   # search starting from the position of title
html = con.find(r'.html', href)    # search for the nearest .html starting from href
url = con[href+6:html+5]           # 'href="' is 6 characters; '.html' is 5 characters
print 'URL:', url
# output: URL: http://blog.sina.com.cn/s/blog_4701280b0102eohi.html

Following the idea described above, a two-layer loop can fetch all the articles. The detailed code is as follows:

#coding: utf-8
import urllib
import time

page = 1
while page <= 7:
    url = [''] * 50    # each Sina blog list page shows 50 articles
    temp = 'http://blog.sina.com.cn/s/articlelist_1191258123_0_' + str(page) + '.html'
    con = urllib.urlopen(temp).read()
    # initialization
    i = 0
    title = con.find(r'<a title=')
    href = con.find(r'href=', title)
    html = con.find(r'.html', href)
    # loop over the articles on this page
    while title != -1 and href != -1 and html != -1 and i < 50:
        url[i] = con[href+6:html+5]
        print url[i]    # show the article URL
        # continue searching from the end position of the previous article
        title = con.find(r'<a title=', html)
        href = con.find(r'href=', title)
        html = con.find(r'.html', href)
        i = i + 1
    else:
        print 'end page=', page
    # download the fetched articles
    j = 0
    while j < i:    # the first 6 pages have 50 articles each; the last page has i
        content = urllib.urlopen(url[j]).read()
        # the last 26 characters, e.g. 'blog_4701280b0102eo83.html', serve as the
        # file name; 'w+' opens for writing and creates the file if needed
        # (the hanhan/ directory must already exist)
        open(r'hanhan/' + url[j][-26:], 'w+').write(content)
        j = j + 1
        time.sleep(1)    # pause one second between downloads to go easy on the server
    else:
        print 'download page=', page
    page = page + 1
else:
    print 'all find end'

In this way we successfully crawled all 316 of Han Han's Sina blog posts and can display each article; a sample is shown below:

This article mainly explained how to use Python to crawl network data. I will go on to learn some intelligent data mining knowledge and Python applications, to achieve more efficient crawling and to acquire knowledge of users' intentions and interests. I want to build two programs: one for intelligently crawling pictures and one for novels.
This article only provides ideas. I hope you respect the original results of others, and do not arbitrarily crawl other people's articles or repost them without the original author's information! Finally, I hope the article helps you. I am a beginner in Python, so if there are errors or shortcomings, please forgive me!
(By: Eastmount, 11 a.m., 2014-10-4; original on CSDN: http://blog.csdn.net/eastmount/)
References:
1. 51CTO College Zhipu Education Python video: http://edu.51cto.com/course/course_id-581.html
2. Web Data Mining (Liu Bing)

