Writing a Web Crawler in Python: Crawling School News

Source: Internet
Author: User

Writing a web crawler in Python (Part 1)


About Python:

I learned C, then picked up some C++, and finally learned Java, which is what I make a living with.

Since then I have mostly stayed inside the little world of Java.

There is a saying: "Life is short, you need Python!"

Just how powerful and concise is it?

Holding on to that curiosity, I used a few not-so-busy days to pick up the basics (in fact, less than two days of study).


A quick "HelloWorld" comparison:

// Java
class Main {
    public static void main(String[] args) {
        String str = "HelloWorld!";
        System.out.println(str);
    }
}

# Python
str = 'HelloWorld'
print str


At first glance, Python really is cool: concise, and it saves a lot of typing.

As for efficiency: I believe that no matter the language, what matters most for a developer is clear, simple thinking.

Others joke, "Python is a joy to write, and a crematorium to refactor."

Still, Python has a huge range of applications. Choose according to your own needs; that many experienced developers can't all be wrong.

Python is indeed worth learning.


About web crawlers:

I knew nothing about this, so I grabbed a definition to get a rough idea. A web crawler (also known as a web spider or web robot, and in the FOAF community more often called a web chaser) is a program or script that automatically crawls information from the World Wide Web according to certain rules. Other, less common names include ant, automatic indexer, emulator, and worm. Truly complex and powerful crawlers involve a great many crawling algorithms and strategies. The example I wrote here is extremely simple.


I haven't even finished the Python basics yet, but I couldn't wait to make something with it.

So I thought of writing a web crawler as a small practice project.

My alma mater is Gansu Agricultural University. I often read the school news, so I decided to try crawling it so I can browse the articles whenever I like.


Since I have learned very little so far, this crawler's functionality is still very simple: it downloads the latest news from the school's official website and saves each article as a web page. You can choose how many pages to fetch.

Here I grabbed the latest 4 pages of news.


This is my first crawler, and a very simple one, so I split it into 3 steps:

Step 1: crawl a single news article, download it, and save it.

Step 2: crawl all the news on one listing page, download, and save.

Step 3: crawl all the news on the first N listing pages, download, and save. (A rough sketch of how these three steps fit together is shown right below.)
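Before the details, here is an outline of how the three steps connect. This is only a sketch; the helper names crawl_article, crawl_page and crawl_pages are made up for illustration, and the real code for each step appears in the sections below:

#Outline of the three steps (Python 2, like the rest of this post); helper names are hypothetical.
import urllib

def crawl_article(url, filename):
    #Step 1: download one news article and save it locally
    content = urllib.urlopen(url).read()
    open(filename, 'w').write(content)

def crawl_page(listing_url):
    #Step 2: read one listing page, cut out every article URL, then download each one
    article_list = urllib.urlopen(listing_url).read()
    #... find each href in article_list and call crawl_article(...) ...

def crawl_pages(page_count):
    #Step 3: work out the URL of each of the latest page_count listing pages
    for page_no in range(1, page_count + 1):
        #... build the listing URL for page_no and call crawl_page(...) ...
        pass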


For a web crawler, analyzing the elements of the target web page is very important.


Step 1: Crawl the content of a single URL

#crawler: the School News section of the Gansu Agricultural University news site.
#Crawl the first news article on this page:
#http://news.gsau.edu.cn/tzgg1/xxxw33.htm
#coding:utf-8
import urllib

#The anchor element of the first article, copied from the listing page
str = '<a class="c43092" href="../info/1037/30577.htm" target="_blank" title="Double Action: the Shuiquan Township working group carries out targeted poverty-alleviation work">Double Action: the Shuiquan Township working group carries out targeted poverty-alleviation work</a>'

hrefBeg = str.find('href=')
hrefEnd = str.find('.htm', hrefBeg)
href = str[hrefBeg+6:hrefEnd+4]     #"../info/1037/30577.htm"
print href
href = href[3:]                     #drop the leading "../"
print href

titleBeg = str.find(r'title=')
titleEnd = str.find(r'>', titleBeg)
title = str[titleBeg+7:titleEnd-1]
print title

url = 'http://news.gsau.edu.cn/' + href
print 'url: ' + url
content = urllib.urlopen(url).read()
#print content

filename = title + '.html'
#Write the crawled page content to filename in the local folder
open(filename, 'w').write(content)

There is no need to parse many page elements here; just cut the string with find() and slicing.
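As an aside, the same "cut the string" idea can be wrapped in a tiny helper; cut_between below is a hypothetical function for illustration, not something the script above uses:

#Minimal sketch of the "cut the string" idea: find two markers and slice out what lies between them.
#cut_between is a hypothetical helper, not used in the crawler above.
def cut_between(s, beg_marker, end_marker):
    beg = s.find(beg_marker)
    if beg == -1:
        return ''
    beg += len(beg_marker)
    end = s.find(end_marker, beg)
    if end == -1:
        return ''
    return s[beg:end]

link = '<a class="c43092" href="../info/1037/30577.htm" target="_blank">'
print cut_between(link, 'href="', '"')   #../info/1037/30577.htm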



Step 2: Crawl all the news on this page (23 articles per page)

This time a little analysis is needed: there are 23 article URLs on the page, so how do we find each one?




The approach is to lock onto a fixed element (here the class="c43092" attribute) and search for it. Also pay attention to the pattern of each search; what really matters is where each search starts.

Here I store each URL in a list; after the search is complete, I download every URL in the list.
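The detail that makes this work is that str.find() takes an optional start position, so each search can pick up where the previous match ended. A toy illustration of that pattern (the two-link snippet below is made up, not taken from the real page):

#str.find(sub, start): keep searching from where the last match ended,
#so each occurrence of the marker is found exactly once.
page = '<a class="c43092" href="../info/1.htm">x</a><a class="c43092" href="../info/2.htm">y</a>'
pos = 0
while True:
    beg = page.find('href="', pos)
    if beg == -1:
        break
    beg += len('href="')
    end = page.find('.htm', beg)
    print page[beg:end+4]   #prints ../info/1.htm, then ../info/2.htm
    pos = end               #the next search starts after this match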


#crawler: the School News section of the Gansu Agricultural University news site.
#http://news.gsau.edu.cn/tzgg1/xxxw33.htm
#Each article link looks like:
#<a class="c43092" href="../info/1037/30567.htm" target="_blank" title="...">...</a>
#coding:utf-8
import urllib
import time
import stat, os

pageSize = 23
articleList = urllib.urlopen('http://news.gsau.edu.cn/tzgg1/xxxw33.htm').read()
urlList = [''] * pageSize

#Lock onto class="c43092" to locate the first link
hrefClass = articleList.find('class="c43092"')
hrefBeg = articleList.find(r'href=', hrefClass)
hrefEnd = articleList.find(r'.htm', hrefBeg)
href = articleList[hrefBeg+6:hrefEnd+4][3:]
print href
#url = 'http://news.gsau.edu.cn/' + href
#print 'url: ' + url

i = 0
while href != -1 and i < pageSize:
    urlList[i] = 'http://news.gsau.edu.cn/' + href
    #continue searching from where the previous match ended
    hrefClass = articleList.find('class="c43092"', hrefEnd)
    hrefBeg = articleList.find(r'href=', hrefClass)
    hrefEnd = articleList.find(r'.htm', hrefBeg)
    href = articleList[hrefBeg+6:hrefEnd+4][3:]
    print urlList[i]
    i = i + 1
else:
    print r'All URLs on this page have been extracted!!!'

#Download every article on this page to the local folder (file name = news title)
#title example: <HTML><HEAD><TITLE>Jiuquan Mayor Du Wei visits to discuss cooperation - News Network</TITLE>
j = 0
while j < pageSize:
    content = urllib.urlopen(urlList[j]).read()
    titleBeg = content.find(r'<TITLE>')
    titleEnd = content.find(r'</TITLE>', titleBeg)
    title = content[titleBeg+7:titleEnd]
    print title
    print urlList[j] + r' is downloading ...'
    time.sleep(1)
    open(r'gsaunews' + os.path.sep + title.decode('utf-8').encode('GBK') + '.html', 'w+').write(content)
    j = j + 1
else:
    print r'All news on this page has been downloaded!'


Step 3: Crawl all the news from the first N pages

To crawl N pages, the first thing to realize is that we always want the most recent pages, not a fixed set of hard-coded page URLs.

So we need to analyze the paging data; conveniently, the bottom of the page already shows the paging information, so we can use it directly.


Look at the URLs of the most recent few pages:

#http://news.gsau.edu.cn/tzgg1/xxxw33.htm        first page
#http://news.gsau.edu.cn/tzgg1/xxxw33/221.htm    second page
#http://news.gsau.edu.cn/tzgg1/xxxw33/220.htm    third page
#http://news.gsau.edu.cn/tzgg1/xxxw33/219.htm    fourth page
Comparing these with the paging data, the rule is easy to spot: the number in the URL is fenyeCount - pageNo + 1, where fenyeCount is the total page count (222 here) and pageNo is the page you want.
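As a small sketch of just this rule (listing_url is a hypothetical helper name; 222 is the total page count shown in the pager, and the outputs match the URL list above):

#Sketch of the page-URL rule only; listing_url is a hypothetical helper, not part of the script below.
#fenyeCount is the total number of listing pages read from the pager, e.g. 222 in the "1/222" text.
def listing_url(pageNo, fenyeCount):
    if pageNo == 1:
        return 'http://news.gsau.edu.cn/tzgg1/xxxw33.htm'
    return 'http://news.gsau.edu.cn/tzgg1/xxxw33/' + str(fenyeCount - pageNo + 1) + '.htm'

print listing_url(2, 222)   #http://news.gsau.edu.cn/tzgg1/xxxw33/221.htm
print listing_url(4, 222)   #http://news.gsau.edu.cn/tzgg1/xxxw33/219.htm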

One annoying thing: for some unknown reason, every page except the first contains a few entries that actually belong to the previous page. That had me searching for half a day.

I had to try quite a few guesses before the extraction worked correctly.

#crawler: the School News section of the Gansu Agricultural University news site.
#coding:utf-8
import urllib
import time
import stat, os

pageCount = 4
pageSize = 23
pageNo = 1
urlList = [''] * pageSize * pageCount

#Analyze the paging element at the bottom of the page:
#<td width="1%" align="left" id="fanye43092" nowrap="">Total 5084 items&nbsp;&nbsp;1/222&nbsp;</td>
indexContent = urllib.urlopen('http://news.gsau.edu.cn/tzgg1/xxxw33.htm').read()
fenyeId = indexContent.find('id="fanye43092"')    #lock onto the paging element by its id
fenyeBeg = indexContent.find('1/', fenyeId)
fenyeEnd = indexContent.find('&nbsp;', fenyeBeg)  #the total page count is followed by &nbsp;
fenyeCount = int(indexContent[fenyeBeg+2:fenyeEnd])

i = 0
while pageNo <= pageCount:
    if pageNo == 1:
        articleUrl = 'http://news.gsau.edu.cn/tzgg1/xxxw33.htm'
    else:
        articleUrl = 'http://news.gsau.edu.cn/tzgg1/xxxw33/' + str(fenyeCount - pageNo + 1) + '.htm'
    print r'-------- Crawling ' + str(pageCount) + ' pages in total, now on page ' + str(pageNo) + ', url: ' + articleUrl
    articleList = urllib.urlopen(articleUrl).read()
    while i < pageSize * pageNo:
        #i = 0, 23, 46 ...: at the start of a page search from its first row id,
        #otherwise continue from where the previous URL ended
        if pageNo == 1:
            if i == pageSize * (pageNo - 1):
                hrefId = articleList.find('id="line43092_0"')
            else:
                hrefId = articleList.find('class="c43092"', hrefEnd)
        else:
            if i == pageSize * (pageNo - 1):
                hrefId = articleList.find('id="lineimg43092_16"')
            else:
                hrefId = articleList.find('class="c43092"', hrefEnd)
        hrefBeg = articleList.find(r'href=', hrefId)
        hrefEnd = articleList.find(r'.htm', hrefBeg)
        if pageNo == 1:
            href = articleList[hrefBeg+6:hrefEnd+4][3:]   #strip "../"
        else:
            href = articleList[hrefBeg+6:hrefEnd+4][6:]   #strip "../../"
        urlList[i] = 'http://news.gsau.edu.cn/' + href
        print urlList[i]
        i = i + 1
    else:
        print r'======== All URLs on page ' + str(pageNo) + ' extracted!!!'
    pageNo = pageNo + 1
print r'============ All URLs extracted!!! ============' + '\n' * 3
print r'========== Starting download to local folder ==========='

j = 0
while j < pageCount * pageSize:
    content = urllib.urlopen(urlList[j]).read()
    titleBeg = content.find(r'<TITLE>')
    titleEnd = content.find(r'</TITLE>', titleBeg)
    title = content[titleBeg+7:titleEnd]
    print title
    print urlList[j] + r' is downloading ...' + '\n'
    time.sleep(1)
    open(r'gsaunews' + os.path.sep + title.decode('utf-8').encode('GBK') + '.html', 'w+').write(content)
    j = j + 1
else:
    print r'Download complete: ' + str(pageCount) + ' pages, ' + str(pageCount * pageSize) + ' news articles in total'

And with that, the crawl is done.


Here is what the finished crawl looks like:

==================== RESTART: D:\python\csdncrawler03.py ====================
-------- Crawling 4 pages in total, now on page 1, url: http://news.gsau.edu.cn/tzgg1/xxxw33.htm
http://news.gsau.edu.cn/info/1037/30596.htm
http://news.gsau.edu.cn/info/1037/30595.htm
http://news.gsau.edu.cn/info/1037/30593.htm
http://news.gsau.edu.cn/info/1037/30591.htm
http://news.gsau.edu.cn/info/1037/30584.htm
http://news.gsau.edu.cn/info/1037/30583.htm
http://news.gsau.edu.cn/info/1037/30580.htm
http://news.gsau.edu.cn/info/1037/30577.htm
http://news.gsau.edu.cn/info/1037/30574.htm
http://news.gsau.edu.cn/info/1037/30573.htm
http://news.gsau.edu.cn/info/1037/30571.htm
http://news.gsau.edu.cn/info/1037/30569.htm
http://news.gsau.edu.cn/info/1037/30567.htm
http://news.gsau.edu.cn/info/1037/30566.htm
http://news.gsau.edu.cn/info/1037/30565.htm
http://news.gsau.edu.cn/info/1037/30559.htm
http://news.gsau.edu.cn/info/1037/30558.htm
http://news.gsau.edu.cn/info/1037/30557.htm
http://news.gsau.edu.cn/info/1037/30555.htm
http://news.gsau.edu.cn/info/1037/30554.htm
http://news.gsau.edu.cn/info/1037/30546.htm
http://news.gsau.edu.cn/info/1037/30542.htm
http://news.gsau.edu.cn/info/1037/30540.htm
======== All URLs on page 1 extracted!!!
-------- Crawling 4 pages in total, now on page 2, url: http://news.gsau.edu.cn/tzgg1/xxxw33/221.htm
http://news.gsau.edu.cn/info/1037/30536.htm
http://news.gsau.edu.cn/info/1037/30534.htm
http://news.gsau.edu.cn/info/1037/30528.htm
http://news.gsau.edu.cn/info/1037/30525.htm
http://news.gsau.edu.cn/info/1037/30527.htm
http://news.gsau.edu.cn/info/1037/30524.htm
http://news.gsau.edu.cn/info/1037/30520.htm
http://news.gsau.edu.cn/info/1037/30519.htm
http://news.gsau.edu.cn/info/1037/30515.htm
http://news.gsau.edu.cn/info/1037/30508.htm
http://news.gsau.edu.cn/info/1037/30507.htm
http://news.gsau.edu.cn/info/1037/30506.htm
http://news.gsau.edu.cn/info/1037/30505.htm
http://news.gsau.edu.cn/info/1037/30501.htm
http://news.gsau.edu.cn/info/1037/30498.htm
http://news.gsau.edu.cn/info/1037/30495.htm
http://news.gsau.edu.cn/info/1037/30493.htm
http://news.gsau.edu.cn/info/1037/30482.htm
http://news.gsau.edu.cn/info/1037/30480.htm
http://news.gsau.edu.cn/info/1037/30472.htm
http://news.gsau.edu.cn/info/1037/30471.htm
http://news.gsau.edu.cn/info/1037/30470.htm
http://news.gsau.edu.cn/info/1037/30469.htm
======== All URLs on page 2 extracted!!!
-------- Crawling 4 pages in total, now on page 3, url: http://news.gsau.edu.cn/tzgg1/xxxw33/220.htm
http://news.gsau.edu.cn/info/1037/30468.htm
http://news.gsau.edu.cn/info/1037/30467.htm
http://news.gsau.edu.cn/info/1037/30466.htm
http://news.gsau.edu.cn/info/1037/30465.htm
http://news.gsau.edu.cn/info/1037/30461.htm
http://news.gsau.edu.cn/info/1037/30457.htm
http://news.gsau.edu.cn/info/1037/30452.htm
http://news.gsau.edu.cn/info/1037/30450.htm
http://news.gsau.edu.cn/info/1037/30449.htm
http://news.gsau.edu.cn/info/1037/30441.htm
http://news.gsau.edu.cn/info/1037/30437.htm
http://news.gsau.edu.cn/info/1037/30429.htm
http://news.gsau.edu.cn/info/1037/30422.htm
http://news.gsau.edu.cn/info/1037/30408.htm
http://news.gsau.edu.cn/info/1037/30397.htm
http://news.gsau.edu.cn/info/1037/30396.htm
http://news.gsau.edu.cn/info/1037/30394.htm
http://news.gsau.edu.cn/info/1037/30392.htm
http://news.gsau.edu.cn/info/1037/30390.htm
http://news.gsau.edu.cn/info/1037/30386.htm
http://news.gsau.edu.cn/info/1037/30385.htm
http://news.gsau.edu.cn/info/1037/30376.htm
http://news.gsau.edu.cn/info/1037/30374.htm
======== All URLs on page 3 extracted!!!
-------- Crawling 4 pages in total, now on page 4, url: http://news.gsau.edu.cn/tzgg1/xxxw33/219.htm
http://news.gsau.edu.cn/info/1037/30370.htm
http://news.gsau.edu.cn/info/1037/30369.htm
http://news.gsau.edu.cn/info/1037/30355.htm
http://news.gsau.edu.cn/info/1037/30345.htm
http://news.gsau.edu.cn/info/1037/30343.htm
http://news.gsau.edu.cn/info/1037/30342.htm
http://news.gsau.edu.cn/info/1037/30340.htm
http://news.gsau.edu.cn/info/1037/30339.htm
http://news.gsau.edu.cn/info/1037/30335.htm
http://news.gsau.edu.cn/info/1037/30333.htm
http://news.gsau.edu.cn/info/1037/30331.htm
http://news.gsau.edu.cn/info/1037/30324.htm
http://news.gsau.edu.cn/info/1037/30312.htm
http://news.gsau.edu.cn/info/1037/30311.htm
http://news.gsau.edu.cn/info/1037/30302.htm
http://news.gsau.edu.cn/info/1037/30301.htm
http://news.gsau.edu.cn/info/1037/30298.htm
http://news.gsau.edu.cn/info/1037/30294.htm
http://news.gsau.edu.cn/info/1037/30293.htm
http://news.gsau.edu.cn/info/1037/30289.htm
http://news.gsau.edu.cn/info/1037/30287.htm
http://news.gsau.edu.cn/info/1037/30286.htm
http://news.gsau.edu.cn/info/1037/30279.htm
======== All URLs on page 4 extracted!!!
============ All URLs extracted!!! ============
========== Starting download to local folder ===========
Gansu Refreshing Source Ecological Technology Co., Ltd. comes to campus to discuss school-enterprise cooperation - News Network
http://news.gsau.edu.cn/info/1037/30596.htm is downloading ...
The second session of our university's counselor professional ability competition concludes successfully - News Network
http://news.gsau.edu.cn/info/1037/30595.htm is downloading ...
Double Action: the working group stationed in Zhou Village carries out targeted poverty-alleviation work - News Network
http://news.gsau.edu.cn/info/1037/30593.htm is downloading ...
Xinjiang and the Corps hold a special campus job fair - News Network
http://news.gsau.edu.cn/info/1037/30591.htm is downloading ...
[Photo News] Li-Ye Xi Garden drunk in the spring breeze - News Network
http://news.gsau.edu.cn/info/1037/30584.htm is downloading ...
Gansu Agricultural University teachers and students lovingly help leukemia student Zhang Yi - News Network
http://news.gsau.edu.cn/info/1037/30583.htm is downloading ...
Vice President Zhao Xing goes to Xinzhuang Village to carry out targeted poverty-alleviation work - News Network
http://news.gsau.edu.cn/info/1037/30580.htm is downloading ...
Double Action: the Hongzhuang Village working group carries out targeted poverty-alleviation work - News Network
http://news.gsau.edu.cn/info/1037/30577.htm is downloading ...
President Wu Jianmin goes to Guanghe County for targeted poverty alleviation and Double Action work - News Network
http://news.gsau.edu.cn/info/1037/30574.htm is downloading ...
The College of Animal Medicine goes to Garden Village to carry out targeted poverty-alleviation work - News Network
http://news.gsau.edu.cn/info/1037/30573.htm is downloading ...



It crawled 90-odd articles in a little over a minute.

Of course, the code could be optimized a lot, but my Python foundation is still really weak, so I should go shore up the basics first.

It could be upgraded further: use regular expressions to match exactly what you want instead of saving whole pages wholesale; a sketch of that idea follows below.
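For instance, the regular-expression upgrade might look like the code below. This is only a sketch, not part of the original script, and it assumes the listing links keep the class="c43092" ... href="..." ... title="..." shape shown earlier:

#Sketch of the "use regular expressions" upgrade (not in the original script).
#Pulls href and title in one pass from anchors shaped like the ones shown above.
#coding:utf-8
import re
import urllib

articleList = urllib.urlopen('http://news.gsau.edu.cn/tzgg1/xxxw33.htm').read()
pattern = re.compile(r'class="c43092"\s+href="\.\./([^"]+\.htm)"[^>]*title="([^"]*)"')
for href, title in pattern.findall(articleList):
    print 'http://news.gsau.edu.cn/' + href, title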

I can keep improving it as I learn more, and crawl something more interesting.

This example was just to see what beginner-level Python can already do for me.







