Writing a Crawler in Python: Crawling the School News


Writing a web crawler in Python (1)


About Python:

I learned C, then C++, and finally Java to make a living.

Since then I have been hanging around in the little world of Java.

There is a saying: "Life is short, you need Python!"

Just how powerful and concise is it?

Holding on to that curiosity, I used a few free days to pick up a little of the basics. (In fact, I studied it for less than two days.)


A "HelloWorld" comparison:

// Java
class Main {
    public static void main(String[] args) {
        String str = "HelloWorld!";
        System.out.println(str);
    }
}

# Python
str = 'HelloWorld'
print str


At first glance, Python really is cool: concise and time-saving.

As for efficiency: I believe that no matter which language you use, what matters most for a developer is clear, streamlined thinking.

Others joke that "Python is a joy to write, but refactoring it is a crematorium."

But Python fits so many scenarios. Choose according to your own needs; so many experienced developers can't all be wrong.

Python is indeed a language worth learning.


About web crawlers: I knew nothing about this, so I grabbed a definition to get a rough idea. A web crawler (also known as a web spider or web robot, and in the FOAF community often called a web chaser) is a program or script that automatically crawls information from the World Wide Web according to certain rules. Other, less common names include ant, auto-indexer, emulator, and worm. A truly complex and powerful crawler involves many crawling algorithms and strategies. The example I have written here is extremely simple.


I have not even finished the Python basics yet, but I couldn't wait to try something out.

So I thought of writing a web crawler, just to get a little practice.

My alma mater is Gansu Agricultural University. I often read the school news, so I tried to crawl it, so that I can browse it offline whenever I like.


Because I have learned so little, the functionality of this crawler is still very simple: it downloads the latest news from the school's official website and saves each article as a web page. You can fetch as many pages as you want.

Here I crawl the latest 4 pages of news.


This being my first crawler, and a very simple one, I split it into 3 steps:

Step 1: crawl a single news article, download it, and save it.

Step 2: crawl all the news on one page, download, and save.

Step 3: crawl all the news on the latest N pages, download, and save.


For a web crawler, analyzing the page elements is very important.
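As a starting point, here is a minimal sketch (Python 2, like the rest of this post, and assuming the news list still uses the class="c43092" links shown in the code below) that fetches the index page and prints the HTML around the first news link, just to see what needs to be parsed:

# Quick inspection of the page element we are going to parse.
import urllib

html = urllib.urlopen('http://news.gsau.edu.cn/tzgg1/xxxw33.htm').read()
pos = html.find('class="c43092"')              # the class used by each news link
if pos != -1:
    print html[max(pos - 50, 0):pos + 200]     # print the surrounding HTML for inspection
else:
    print 'marker not found'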


Step 1: crawl the content of a single URL

# coding: utf-8
# crawler: the school news section of the Gansu Agricultural University news network.
# Crawl the first news article on this page:
# http://news.gsau.edu.cn/tzgg1/xxxw33.htm
import urllib

# the first <a> element of the news list, copied from the page source
str = '<a class="c43092" href="../info/1037/30577.htm" target="_blank" title="Double-Action Shuiquan Township village working group carries out targeted poverty alleviation work">Double-Action Shuiquan Township village working group carries out targeted poverty alleviation work</a>'

hrefBeg = str.find('href=')
hrefEnd = str.find('.htm', hrefBeg)
href = str[hrefBeg + 6:hrefEnd + 4]
print href
href = href[3:]            # strip the leading '../'
print href

titleBeg = str.find(r'title=')
titleEnd = str.find(r'>', titleBeg)
title = str[titleBeg + 7:titleEnd - 1]
print title

url = 'http://news.gsau.edu.cn/' + href
print 'url: ' + url
content = urllib.urlopen(url).read()
#print content

filename = title + '.html'
# write the fetched page content into filename in the local directory
open(filename, 'w').write(content)

There is no need to parse many page elements here; just cut the string with find() and slicing.



Step 2: crawl all the news on one page (23 articles per page)

This time a little analysis is needed: there are 23 URLs on the page, so how do we find each of them?




Here you can lock on to an element (its class attribute) to locate each link. Then pay attention to the pattern in the repeated find() calls: the trick is really just the start position passed to each search.

I save each URL in a list, and once the search is complete, download every URL in the list.
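The core trick is simply calling find() again and again, each time passing the position where the previous match ended as the start of the next search. A tiny sketch of that loop in the same Python 2 style, using a made-up HTML snippet purely for illustration:

# Walk through every occurrence of a marker by moving the search start forward.
html = '<a class="x">one</a><a class="x">two</a><a class="x">three</a>'
marker = 'class="x"'

pos = html.find(marker)
while pos != -1:
    print 'found marker at offset', pos
    pos = html.find(marker, pos + len(marker))    # continue searching after this match

The full listing for this step follows.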


# coding: utf-8
# crawler: the school news section of the Gansu Agricultural University news network.
# http://news.gsau.edu.cn/tzgg1/xxxw33.htm
# each news link looks like:
# <a class="c43092" href="../info/1037/30567.htm" target="_blank" title="...">...</a>
import urllib
import time
import stat, os

pageSize = 23
articleList = urllib.urlopen('http://news.gsau.edu.cn/tzgg1/xxxw33.htm').read()
urlList = [''] * pageSize

# lock on to class="c43092" to find the first link
hrefClass = articleList.find('class="c43092"')
hrefBeg = articleList.find(r'href=', hrefClass)
hrefEnd = articleList.find(r'.htm', hrefBeg)
href = articleList[hrefBeg + 6:hrefEnd + 4][3:]    # strip the leading '../'
print href
#url = 'http://news.gsau.edu.cn/' + href
#print 'url: ' + url

i = 0
while href != -1 and i < pageSize:
    urlList[i] = 'http://news.gsau.edu.cn/' + href
    hrefClass = articleList.find('class="c43092"', hrefEnd)   # continue after the previous match
    hrefBeg = articleList.find(r'href=', hrefClass)
    hrefEnd = articleList.find(r'.htm', hrefBeg)
    href = articleList[hrefBeg + 6:hrefEnd + 4][3:]
    print urlList[i]
    i = i + 1
else:
    print r'All URLs on this page have been extracted!!!'

# download every article on this page (the news title is used as the file name)
# example title: <HTML><HEAD><TITLE>Jiuquan Mayor Du Wei visits to discuss cooperation - News Network</TITLE>
j = 0
while j < pageSize:
    content = urllib.urlopen(urlList[j]).read()
    titleBeg = content.find(r'<TITLE>')
    titleEnd = content.find(r'</TITLE>', titleBeg)
    title = content[titleBeg + 7:titleEnd]
    print title
    print urlList[j] + r' is downloading...'
    time.sleep(1)
    open(r'gsaunews' + os.path.sep + title.decode('utf-8').encode('GBK') + '.html', 'w+').write(content)
    j = j + 1
else:
    print r'All news on this page has been downloaded!'


Step 3: crawl all the news on the latest N pages

To crawl N pages, the first thing to realize is that what we want is the latest N pages, not a fixed set of page URLs.

So we need the paging data. Conveniently, the bottom of the first page already shows the paging information, so we use it directly.


Look at the URLs of the first few pages:

#http://news.gsau.edu.cn/tzgg1/xxxw33.htm        first page
#http://news.gsau.edu.cn/tzgg1/xxxw33/221.htm    second page
#http://news.gsau.edu.cn/tzgg1/xxxw33/220.htm    third page
#http://news.gsau.edu.cn/tzgg1/xxxw33/219.htm    fourth page
Combined with the total page count from the paging data, the pattern is easy to spot: the number in the URL is fenyeCount - pageNo + 1.
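As a quick check of that formula, here is a small sketch of my own (it assumes the total page count read from the paging element is 222, as in the snippet in the code below):

# pageNo 1 is the index page itself; pageNo >= 2 maps to (fenyeCount - pageNo + 1).htm
fenyeCount = 222
for pageNo in range(1, 5):
    if pageNo == 1:
        url = 'http://news.gsau.edu.cn/tzgg1/xxxw33.htm'
    else:
        url = 'http://news.gsau.edu.cn/tzgg1/xxxw33/' + str(fenyeCount - pageNo + 1) + '.htm'
    print pageNo, url

This reproduces exactly the 221/220/219 URLs listed above.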

One annoying thing here: for some reason, every page other than the first contains an entry that does not belong to that page but to the previous one. It took me half a day to track this down.

So the code makes quite a few extra checks before extracting the links.

# coding: utf-8
# crawler: the school news section of the Gansu Agricultural University news network.
import urllib
import time
import stat, os

pageCount = 4
pageSize = 23
pageNo = 1
urlList = [''] * pageSize * pageCount

# analyse the paging element at the bottom of the first page, e.g.:
# <td width="1%" align="left" id="fanye43092" nowrap="">Total 5084 items&nbsp;&nbsp;1/222&nbsp;</td>
indexContent = urllib.urlopen('http://news.gsau.edu.cn/tzgg1/xxxw33.htm').read()
fenyeId = indexContent.find('id="fanye43092"')         # lock on to the paging id
fenyeBeg = indexContent.find('1/', fenyeId)
fenyeEnd = indexContent.find('&nbsp;', fenyeBeg)       # end of the "1/222" counter
fenyeCount = int(indexContent[fenyeBeg + 2:fenyeEnd])  # total number of pages

i = 0
while pageNo <= pageCount:
    if pageNo == 1:
        articleUrl = 'http://news.gsau.edu.cn/tzgg1/xxxw33.htm'
    else:
        articleUrl = 'http://news.gsau.edu.cn/tzgg1/xxxw33/' + str(fenyeCount - pageNo + 1) + '.htm'
    print r'-------- crawling ' + str(pageCount) + ' pages in total, now on page ' + str(pageNo) + ', url: ' + articleUrl
    articleList = urllib.urlopen(articleUrl).read()

    while i < pageSize * pageNo:
        # i = 0, 23, 46, ...: start searching from the top of the page,
        # otherwise continue from where the previous URL ended
        if pageNo == 1:
            if i == pageSize * (pageNo - 1):
                hrefId = articleList.find('id="line43092_0"')
            else:
                hrefId = articleList.find('class="c43092"', hrefEnd)
        else:
            if i == pageSize * (pageNo - 1):
                hrefId = articleList.find('id="lineimg43092_16"')
            else:
                hrefId = articleList.find('class="c43092"', hrefEnd)

        hrefBeg = articleList.find(r'href=', hrefId)
        hrefEnd = articleList.find(r'.htm', hrefBeg)
        if pageNo == 1:
            href = articleList[hrefBeg + 6:hrefEnd + 4][3:]   # strip '../'
        else:
            href = articleList[hrefBeg + 6:hrefEnd + 4][6:]   # strip '../../'
        urlList[i] = 'http://news.gsau.edu.cn/' + href
        print urlList[i]
        i = i + 1
    else:
        print r'======== page ' + str(pageNo) + ' URL extraction complete!!!'

    pageNo = pageNo + 1

print r'============ all URL extraction complete!!! ============' + '\n' * 3
print r'========== start downloading to local ==========='

j = 0
while j < pageCount * pageSize:
    content = urllib.urlopen(urlList[j]).read()
    titleBeg = content.find(r'<TITLE>')
    titleEnd = content.find(r'</TITLE>', titleBeg)
    title = content[titleBeg + 7:titleEnd]
    print title
    print urlList[j] + r' is downloading...' + '\n'
    time.sleep(1)
    open(r'gsaunews' + os.path.sep + title.decode('utf-8').encode('GBK') + '.html', 'w+').write(content)
    j = j + 1
else:
    print r'Download complete: ' + str(pageCount) + ' pages, ' + str(pageCount * pageSize) + ' news articles in total'

And that is the whole crawl.


Here is the output of a finished crawl:

==================== RESTART: d:\python\csdncrawler03.py ====================
-------- crawling 4 pages in total, now on page 1, url: http://news.gsau.edu.cn/tzgg1/xxxw33.htm
http://news.gsau.edu.cn/info/1037/30596.htm
http://news.gsau.edu.cn/info/1037/30595.htm
http://news.gsau.edu.cn/info/1037/30593.htm
http://news.gsau.edu.cn/info/1037/30591.htm
http://news.gsau.edu.cn/info/1037/30584.htm
http://news.gsau.edu.cn/info/1037/30583.htm
http://news.gsau.edu.cn/info/1037/30580.htm
http://news.gsau.edu.cn/info/1037/30577.htm
http://news.gsau.edu.cn/info/1037/30574.htm
http://news.gsau.edu.cn/info/1037/30573.htm
http://news.gsau.edu.cn/info/1037/30571.htm
http://news.gsau.edu.cn/info/1037/30569.htm
http://news.gsau.edu.cn/info/1037/30567.htm
http://news.gsau.edu.cn/info/1037/30566.htm
http://news.gsau.edu.cn/info/1037/30565.htm
http://news.gsau.edu.cn/info/1037/30559.htm
http://news.gsau.edu.cn/info/1037/30558.htm
http://news.gsau.edu.cn/info/1037/30557.htm
http://news.gsau.edu.cn/info/1037/30555.htm
http://news.gsau.edu.cn/info/1037/30554.htm
http://news.gsau.edu.cn/info/1037/30546.htm
http://news.gsau.edu.cn/info/1037/30542.htm
http://news.gsau.edu.cn/info/1037/30540.htm
======== page 1 URL extraction complete!!!
-------- crawling 4 pages in total, now on page 2, url: http://news.gsau.edu.cn/tzgg1/xxxw33/221.htm
http://news.gsau.edu.cn/info/1037/30536.htm
http://news.gsau.edu.cn/info/1037/30534.htm
http://news.gsau.edu.cn/info/1037/30528.htm
http://news.gsau.edu.cn/info/1037/30525.htm
http://news.gsau.edu.cn/info/1037/30527.htm
http://news.gsau.edu.cn/info/1037/30524.htm
http://news.gsau.edu.cn/info/1037/30520.htm
http://news.gsau.edu.cn/info/1037/30519.htm
http://news.gsau.edu.cn/info/1037/30515.htm
http://news.gsau.edu.cn/info/1037/30508.htm
http://news.gsau.edu.cn/info/1037/30507.htm
http://news.gsau.edu.cn/info/1037/30506.htm
http://news.gsau.edu.cn/info/1037/30505.htm
http://news.gsau.edu.cn/info/1037/30501.htm
http://news.gsau.edu.cn/info/1037/30498.htm
http://news.gsau.edu.cn/info/1037/30495.htm
http://news.gsau.edu.cn/info/1037/30493.htm
http://news.gsau.edu.cn/info/1037/30482.htm
http://news.gsau.edu.cn/info/1037/30480.htm
http://news.gsau.edu.cn/info/1037/30472.htm
http://news.gsau.edu.cn/info/1037/30471.htm
http://news.gsau.edu.cn/info/1037/30470.htm
http://news.gsau.edu.cn/info/1037/30469.htm
======== page 2 URL extraction complete!!!
-------- crawling 4 pages in total, now on page 3, url: http://news.gsau.edu.cn/tzgg1/xxxw33/220.htm
http://news.gsau.edu.cn/info/1037/30468.htm
http://news.gsau.edu.cn/info/1037/30467.htm
http://news.gsau.edu.cn/info/1037/30466.htm
http://news.gsau.edu.cn/info/1037/30465.htm
http://news.gsau.edu.cn/info/1037/30461.htm
http://news.gsau.edu.cn/info/1037/30457.htm
http://news.gsau.edu.cn/info/1037/30452.htm
http://news.gsau.edu.cn/info/1037/30450.htm
http://news.gsau.edu.cn/info/1037/30449.htm
http://news.gsau.edu.cn/info/1037/30441.htm
http://news.gsau.edu.cn/info/1037/30437.htm
http://news.gsau.edu.cn/info/1037/30429.htm
http://news.gsau.edu.cn/info/1037/30422.htm
http://news.gsau.edu.cn/info/1037/30408.htm
http://news.gsau.edu.cn/info/1037/30397.htm
http://news.gsau.edu.cn/info/1037/30396.htm
http://news.gsau.edu.cn/info/1037/30394.htm
http://news.gsau.edu.cn/info/1037/30392.htm
http://news.gsau.edu.cn/info/1037/30390.htm
http://news.gsau.edu.cn/info/1037/30386.htm
http://news.gsau.edu.cn/info/1037/30385.htm
http://news.gsau.edu.cn/info/1037/30376.htm
http://news.gsau.edu.cn/info/1037/30374.htm
======== page 3 URL extraction complete!!!
-------- crawling 4 pages in total, now on page 4, url: http://news.gsau.edu.cn/tzgg1/xxxw33/219.htm
http://news.gsau.edu.cn/info/1037/30370.htm
http://news.gsau.edu.cn/info/1037/30369.htm
http://news.gsau.edu.cn/info/1037/30355.htm
http://news.gsau.edu.cn/info/1037/30345.htm
http://news.gsau.edu.cn/info/1037/30343.htm
http://news.gsau.edu.cn/info/1037/30342.htm
http://news.gsau.edu.cn/info/1037/30340.htm
http://news.gsau.edu.cn/info/1037/30339.htm
http://news.gsau.edu.cn/info/1037/30335.htm
http://news.gsau.edu.cn/info/1037/30333.htm
http://news.gsau.edu.cn/info/1037/30331.htm
http://news.gsau.edu.cn/info/1037/30324.htm
http://news.gsau.edu.cn/info/1037/30312.htm
http://news.gsau.edu.cn/info/1037/30311.htm
http://news.gsau.edu.cn/info/1037/30302.htm
http://news.gsau.edu.cn/info/1037/30301.htm
http://news.gsau.edu.cn/info/1037/30298.htm
http://news.gsau.edu.cn/info/1037/30294.htm
http://news.gsau.edu.cn/info/1037/30293.htm
http://news.gsau.edu.cn/info/1037/30289.htm
http://news.gsau.edu.cn/info/1037/30287.htm
http://news.gsau.edu.cn/info/1037/30286.htm
http://news.gsau.edu.cn/info/1037/30279.htm
======== page 4 URL extraction complete!!!
============ all URL extraction complete!!! ============


========== start downloading to local ===========
Gansu Refreshing Source Ecological Technology Co., Ltd. comes to discuss school-enterprise cooperation - News Network
http://news.gsau.edu.cn/info/1037/30596.htm is downloading...

The second session of our school's counselor professional ability competition concluded successfully - News Network
http://news.gsau.edu.cn/info/1037/30595.htm is downloading...

Double-Action Week Village working group carries out targeted poverty alleviation work - News Network
http://news.gsau.edu.cn/info/1037/30593.htm is downloading...

Xinjiang and the Corps hold a special campus job fair - News Network
http://news.gsau.edu.cn/info/1037/30591.htm is downloading...

[Picture News] Li-Ye Xi Garden drunk in the spring breeze - News Network
http://news.gsau.edu.cn/info/1037/30584.htm is downloading...

Gannong teachers and students lovingly help leukemia student Zhang Yi - News Network
http://news.gsau.edu.cn/info/1037/30583.htm is downloading...

Vice President Zhao Xing goes to Xinzhuang Village to carry out targeted poverty alleviation work - News Network
http://news.gsau.edu.cn/info/1037/30580.htm is downloading...

Double-Action Hongzhuang Village working group carries out targeted poverty alleviation work - News Network
http://news.gsau.edu.cn/info/1037/30577.htm is downloading...

President Wu Jianmin goes to Guanghe County to carry out targeted poverty alleviation and double-joint work - News Network
http://news.gsau.edu.cn/info/1037/30574.htm is downloading...

The Animal Medicine College goes to Garden Village to carry out targeted poverty alleviation work - News Network
http://news.gsau.edu.cn/info/1037/30573.htm is downloading...



It crawled 90-odd pages in a little over a minute.

Of course the code could be optimized a lot, but my Python foundation is still really weak.

It could also be upgraded to use regular expressions and match only the content you want, rather than saving whole pages like this.
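For example, here is a minimal sketch of that regex idea (my own suggestion, not part of the original code, and assuming the links still look like the class="c43092" anchors shown earlier): it pulls every href and title out of the news list in one pass instead of cutting the string by hand.

# Sketch: extract all (href, title) pairs from the news list with one regex.
import re
import urllib

html = urllib.urlopen('http://news.gsau.edu.cn/tzgg1/xxxw33.htm').read()
pattern = re.compile(r'<a class="c43092" href="\.\./(.*?)"[^>]*title="(.*?)"')
for href, title in pattern.findall(html):
    print 'http://news.gsau.edu.cn/' + href, title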

I can keep improving it as I continue learning, and crawl something more interesting.

This example was really just to see what beginner-level Python can already do for me.







