My previous posts on using Python for web crawling were well received, and this article combines what I learned from a Python video course with my postgraduate work in data mining. It is a simple introduction to crawling network data with Python; the material is very basic, but I am sharing it for anyone just getting started. I am only sharing knowledge here, and I hope readers will not use it to disrupt websites or infringe on other people's original articles. The article mainly covers:
1. A simple idea and workflow for crawling my own CSDN blog posts
2. Python source code to crawl all 316 articles of Han Han's Sina blog
One. The simple idea of a crawler
Recently, while reading Bing Liu's "Web Data Mining", I learned that there are three main approaches to the information extraction problem:
1. Manual method: observe the web page and its source code to find patterns, then write a program to extract the target data. This method cannot scale to a large number of sites.
2. Wrapper induction: a supervised, semi-automatic learning method. It learns a set of extraction rules from manually annotated web pages or data records and uses them to extract data from pages with a similar format.
3. Automatic extraction: an unsupervised method. Given one or several pages, it automatically searches for patterns or grammars and uses them for data extraction. Because no manual labeling is required, it can handle extraction from a very large number of sites and pages.
The Python web crawler used here is just a simple data extraction program; I will continue to study Python and data mining and write more articles of this kind. First I want to download all of my own CSDN blog posts (as static .html files). The idea and implementation are as follows:
The first step: analyze the source code of the CSDN blog pages
The first thing to implement is fetching one CSDN article by analyzing the blog's source code. In IE you can press F12, or in Google Chrome right-click and choose "Inspect element", to examine the basic structure of the blog. The page http://blog.csdn.net/Eastmount links to all of the author's posts.
The source format shown is as follows:
Each <div class="list_item article_item">...</div> block represents one displayed blog post; the first one appears as follows:
Its specific HTML source code is as follows:
So for each page we only need to get the link inside each <div class="article_title"> block, i.e. <a href="/eastmount/article/details/39599061">, and prepend http://blog.csdn.net to it. The fetching code is:
import urllib
content = urllib.urlopen("http://blog.csdn.net/eastmount/article/details/39599061").read()
open('test.html', 'w+').write(content)
However, CSDN forbids this kind of request: the server does not want its content crawled and re-posted on other sites. Our blog posts are often crawled by other sites without stating the original source, so please respect the original authors. The request simply returns the error "403 Forbidden".
PS: It is said that simulating a normal browser request can still fetch CSDN content; readers can research this on their own, and I will not go into it here (a minimal sketch follows the reference links below). References (verified):
http://www.yihaomen.com/article/python/210.htm
http://www.2cto.com/kf/201405/304829.html
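As a hedged illustration of the "simulate a normal browser" idea mentioned above, the following minimal sketch (assuming a Python 2 environment, like the rest of this article) sends a browser-like User-Agent header with urllib2. The header string is only an example value, and whether CSDN accepts the request still depends on its current policy.

import urllib2

# Minimal sketch (assumption: Python 2). Sending a browser-like
# User-Agent header sometimes avoids the 403 Forbidden response;
# the header value below is only an illustrative string.
url = "http://blog.csdn.net/eastmount/article/details/39599061"
request = urllib2.Request(url, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36'})
content = urllib2.urlopen(request).read()
open('test.html', 'w+').write(content)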
The second step: get all of your own articles
Here we only discuss the idea, assuming the first article has already been fetched successfully. We then call Python's find() repeatedly, continuing the search from the position where the previous match ended, to obtain all the article links on the first page. A page shows 20 articles, and the last page shows whatever articles remain.
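A minimal sketch of this find()-from-the-previous-position idea is below. It simply walks through every href="..." value in a downloaded page source; the list-page URL and the crude pattern are assumptions for illustration, and on CSDN the request may also need the User-Agent workaround shown earlier.

import urllib

# Minimal sketch: str.find() with a start offset walks through every
# occurrence of a marker string in the page source.
content = urllib.urlopen("http://blog.csdn.net/Eastmount/article/list/1").read()
pos = content.find('href="')                 # first link
while pos != -1:
    end = content.find('"', pos + 6)         # closing quote of the href value
    print content[pos+6:end]                 # the link itself
    pos = content.find('href="', end)        # continue from where the last match ended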
So how do we get the articles on the other pages?
We can see that the hyperlinks used when jumping to different pages are:
Page 1: http://blog.csdn.net/Eastmount/article/list/1
Page 2: http://blog.csdn.net/Eastmount/article/list/2
Page 3: http://blog.csdn.net/Eastmount/article/list/3
Page 4: http://blog.csdn.net/Eastmount/article/list/4
The idea is very simple; the process is roughly:
for i in range(4):            # loop over all list pages
    for j in range(20):       # up to 20 articles per page; note the last page has fewer
        get_content()         # fetch one article, mainly to obtain its hyperlink
At the same time, regular expressions make it especially convenient to extract content from web pages. For example, in an earlier article I used C# and regular expressions to download pictures: http://blog.csdn.net/eastmount/article/details/12235521
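As a hedged example of how a regular expression can shorten this work, the sketch below collects article links from the four list pages. It assumes Python 2, that the list-page markup really contains <div class="article_title"> blocks as in the screenshots above, and that CSDN accepts the browser-like User-Agent; adjust the pattern if the markup differs.

import re
import urllib2

# Minimal sketch: loop over the 4 list pages and use a regular expression
# to capture the href inside each <div class="article_title"> block.
# The pattern is an assumption about the markup shown above.
links = []
for page in range(1, 5):
    list_url = 'http://blog.csdn.net/Eastmount/article/list/%d' % page
    req = urllib2.Request(list_url, headers={'User-Agent': 'Mozilla/5.0'})
    source = urllib2.urlopen(req).read()
    for href in re.findall(r'<div class="article_title">.*?<a href="(.*?)"', source, re.S):
        links.append('http://blog.csdn.net' + href)   # prepend the site root
print '\n'.join(links)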
Two. Crawling Sina Blog
The section above describes the simple idea of the crawler, but some web servers forbid access to their content; some Sina blogs, however, can be fetched directly. Following the "51CTO College Zhipu Education Python video", we can download all of Han Han's Sina blog posts.
Address: http://blog.sina.com.cn/s/articlelist_1191258123_0_1.html
In the same way as above, we can see that each <div class="articleCell SG_j_linedot1">...</div> block contains a hyperlink to one article, as shown below:
The code for getting an article through Python is as follows:
import urllib
content = urllib.urlopen("http://blog.sina.com.cn/s/blog_4701280b0102eo83.html").read()
open('blog.html', 'w+').write(content)
<a title="On the seven elements of film -- some of my views on film and some news" target="_blank" href="http://blog.sina.com.cn/s/blog_4701280b0102eo83.html">"On the Seven Elements of Film" -- some of my views on fil...</a>
Before introducing regular expressions, let us get the hyperlink manually with Python: starting from the beginning of the page source, find the first "<a title", then find "href=" and ".html" after it to obtain "http://blog.sina.com.cn/s/blog_4701280b0102eo83.html". The code is as follows:
#<a title="..." target="_blank" href="http://blog.sina...html"></a>
#coding:utf-8
import urllib

con = urllib.urlopen("http://blog.sina.com.cn/s/articlelist_1191258123_0_1.html").read()
title = con.find(r'<a title=')
href = con.find(r'href=', title)    # search starting from the title position
html = con.find(r'.html', href)     # search for the nearest .html starting from the href position
url = con[href+6:html+5]            # 'href="' is 6 characters, '.html' is 5 characters
print 'URL:', url
# output -- URL: http://blog.sina.com.cn/s/blog_4701280b0102eohi.html
Following the idea described above, a two-level loop fetches all the articles. The code is as follows:
#coding:utf-8
import urllib
import time

page = 1
while page <= 7:
    url = ['']*50                       # Sina blog shows 50 articles per list page
    temp = 'http://blog.sina.com.cn/s/articlelist_1191258123_0_' + str(page) + '.html'
    con = urllib.urlopen(temp).read()
    # initialization
    i = 0
    title = con.find(r'<a title=')
    href = con.find(r'href=', title)
    html = con.find(r'.html', href)
    # loop over and display the articles on this page
    while title != -1 and href != -1 and html != -1 and i < 50:
        url[i] = con[href+6:html+5]
        print url[i]                    # display the article URL
        # continue searching from the end of the previous article
        title = con.find(r'<a title=', html)
        href = con.find(r'href=', title)
        html = con.find(r'.html', href)
        i = i + 1
    else:
        print 'End page=', page
    # download the articles that were found
    j = 0
    while j < i:                        # the first 6 pages have 50 articles, the last page has i
        content = urllib.urlopen(url[j]).read()
        # open in write mode; '+' creates the file if it does not exist
        # (the hanhan/ directory must already exist)
        open(r'hanhan/'+url[j][-26:], 'w+').write(content)
        j = j + 1
        time.sleep(1)
    else:
        print 'Download'
    page = page + 1
else:
    print 'all find end'
In this way, all 316 of Han Han's Sina blog posts are crawled successfully and each article can be displayed, as shown below:
This article mainly describes how to use Python to crawl network data. Later I will also learn some intelligent data mining techniques and how to use them with Python, to crawl more efficiently and to extract knowledge about customers' intentions and interests. I also want to build two small programs for intelligently crawling pictures and novels.
This article only provides ideas. I hope everyone respects the original work of others: do not crawl other people's articles at will, and do not repost them without the original author's information! Finally, I hope the article helps you. I am a beginner in Python, so if there are errors or shortcomings, please forgive me!
(By: Eastmount 2014-9-28 11:00 a.m. Original, CSDN: http://blog.csdn.net/eastmount/)
References:
1. 51CTO College Zhipu Education Python video: http://edu.51cto.com/course/course_id-581.html
2. Web Data Mining (Bing Liu)