Implementing a Simple Crawler Function in Python


When Google's founders wrote their first crude crawler in Python and ran it on a rudimentary server, few could have imagined how thoroughly it would upend the internet, and the wider world, in the decades that followed.

Wherever there is a network, there are crawlers (called "spiders" in English). A crawler is a program that fetches data from websites. For example, a program could periodically crawl data from sites such as Baidu Nuomi and Dianping, store the information in a database, and then add a display page; the result is a group-buying navigation site. Without a doubt, crawlers are the original data source for many websites.
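To make that pipeline concrete, here is a minimal sketch of the fetch-and-store step, written (like the rest of this article) in Python 2. The URL, database file, and table name are placeholders invented for illustration; they are not from the original post.

import sqlite3
import urllib

# Fetch one page (placeholder URL, for illustration only).
body = urllib.urlopen('http://example.com/').read()

# Store the raw page in a local SQLite file so a display page can query it later.
db = sqlite3.connect('crawl.db')
db.execute('CREATE TABLE IF NOT EXISTS pages (url TEXT, body TEXT)')
db.execute('INSERT INTO pages (url, body) VALUES (?, ?)',
           ('http://example.com/', body.decode('utf-8', 'ignore')))
db.commit()
db.close()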


First, implementing a basic crawler function

Goal: view the URL of the first article in the blog's article directory.

The first step is to import the urllib module, locate the link with the string find() method, and then slice out the URL we want.

#!/usr/bin/env python
import urllib

# Fetch the first page of the article directory.
con = urllib.urlopen('http://blog.sina.com.cn/s/articlelist_1191258123_0_1.html').read()
# Locate the first article link: the title anchor, then its href, then the
# .html that ends the URL.
title = con.find(r'<a title=')
href = con.find(r'href=', title)
html = con.find(r'.html', href)
# Slice out the URL between href=" and the end of .html.
url = con[href + 6:html + 5]
print url
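The slice offsets deserve a word: find() returns the index where a substring begins, so href + 6 skips the six characters of href=" and html + 5 keeps the five characters of .html. Here is a tiny self-contained demonstration; the anchor tag is made up, but shaped like Sina's article list:

# A made-up fragment shaped like one entry of the Sina article list.
con = '<a title="demo" href="http://blog.sina.com.cn/s/blog_4701280b0102dwr4.html">'
href = con.find(r'href=')         # index where 'href=' begins
html = con.find(r'.html', href)   # index where '.html' begins
print con[href + 6:html + 5]      # prints the bare article URL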

Second, view the URLs of all articles on the first page of the directory

A:

#!/usr/bin/env python
import urllib

url = [''] * 40    # room for up to 40 article URLs
i = 0
con = urllib.urlopen('http://blog.sina.com.cn/s/articlelist_1191258123_0_1.html').read()
# Seed the scan with the first title anchor.
title = con.find(r'<a title=')
href = con.find(r'href=', title)
html = con.find(r'.html', href)
# Collect and print URLs until any find() returns -1 (no more links).
while title != -1 and href != -1 and html != -1 and i < 40:
    url[i] = con[href + 6:html + 5]
    print url[i]
    title = con.find(r'<a title=', html)
    href = con.find(r'href=', title)
    html = con.find(r'.html', href)
    i = i + 1

or B:

#!/usr/bin/env python
import urllib

i = 0
con = urllib.urlopen('http://blog.sina.com.cn/s/articlelist_1191258123_0_1.html').read()
title = con.find(r'<a title=')
href = con.find(r'href=', title)
html = con.find(r'.html', href)
# Same scan as variant A, but printing each URL instead of storing it in a list.
while title != -1 and href != -1 and html != -1 and i < 50:
    url = con[href + 6:html + 5]
    print url
    title = con.find(r'<a title=', html)
    href = con.find(r'href=', title)
    html = con.find(r'.html', href)
    i = i + 1
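As an aside that is not in the original post, the same first-page extraction can be done in one pass with a regular expression, assuming the links really do take the form <a title=... href="...html">:

#!/usr/bin/env python
import re
import urllib

con = urllib.urlopen('http://blog.sina.com.cn/s/articlelist_1191258123_0_1.html').read()
# One pattern replaces the three chained find() calls: capture each href value
# ending in .html inside a title anchor (tag layout assumed, as noted above).
for url in re.findall(r'<a title=[^>]*?href="([^"]+?\.html)"', con):
    print url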

Third, download all articles on the first page of the directory


A:

#!/usr/bin/env python
import time
import urllib

i = 0
j = 0
url = [''] * 40
con = urllib.urlopen('http://www.zhihu.com/collection/19668036').read()
target = con.find(r'<a target="_blank')
base = con.find(r'href=', target)
end = con.find('>', base)
# Pass 1: collect up to 20 answer URLs from the collection page.
# target + 25 skips the 25 characters of <a target="_blank" href=";
# end - 1 drops the closing quote before the >.
while i < 20:
    url[i] = 'http://www.zhihu.com' + con[target + 25:end - 1]
    print url[i]
    target = con.find(r'<a target="_blank', end)
    base = con.find(r'href=', target)
    end = con.find('>', base)
    i = i + 1
# Pass 2: download each one, pausing between requests; the URL tail
# serves as the file name (a raw URL is not a valid path).
while j < 20:
    content = urllib.urlopen(url[j]).read()
    open(r'zhihu/' + url[j].split('/')[-1], 'w+').write(content)
    print 'downloading', url[j]
    j = j + 1
    time.sleep(15)

or B:

#!/usr/bin/env python
import time
import urllib

i = 0
url = [''] * 30
name = [''] * 30
con = urllib.urlopen('http://www.zhihu.com/collection/19668036').read()
target = con.find(r'<a target="_blank')
base = con.find(r'href=', target)
end = con.find('>', base)
# Walk the collection page link by link, downloading as we go.
while target != -1 and base != -1 and end != -1 and i < 30:
    url[i] = 'http://www.zhihu.com' + con[target + 25:end - 1]
    name[i] = con[base + 16:end - 1]   # skips href="/question/, leaving the id
    target = con.find(r'<a target="_blank', end)
    base = con.find(r'href=', target)
    end = con.find('>', base)
    content = urllib.urlopen(url[i]).read()
    open(r'zhihu/' + name[i] + '.html', 'w+').write(content)
    print 'downloading', name[i]
    time.sleep(5)
    i = i + 1
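One practical caveat for both variants: they write into a zhihu/ directory, and open() raises IOError if that directory does not exist. A small guard like the following (my addition, not the author's) creates it once up front:

import os

# Create the output directory used by the scripts above, if it is missing.
if not os.path.isdir('zhihu'):
    os.makedirs('zhihu')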



Fourth, download all articles


A:

import time
import urllib

page = 1
url = [''] * 350
i = 0
link = 1
# Pass 1: walk all 7 pages of the article directory, collecting URLs.
while page <= 7:
    con = urllib.urlopen('http://blog.sina.com.cn/s/articlelist_1191258123_0_' + str(page) + '.html').read()
    title = con.find(r'<a title=')
    href = con.find(r'href=', title)
    html = con.find(r'.html', href)
    while title != -1 and href != -1 and html != -1 and i < 350:
        url[i] = con[href + 6:html + 5]
        print link, url[i]
        title = con.find(r'<a title=', html)
        href = con.find(r'href=', title)
        html = con.find(r'.html', href)
        link = link + 1
        i = i + 1
    else:
        print 'find end!'
    page = page + 1
else:
    print 'all find end'
# Pass 2: download the first 50 articles, naming each file by the URL tail.
j = 0
while j < 50:
    content = urllib.urlopen(url[j]).read()
    open(r'tmp/' + url[j][-26:], 'w+').write(content)
    j = j + 1
    time.sleep(5)
else:
    print 'download over!'
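Note the else clauses attached to the while loops: in Python, a loop's else body runs once the loop condition becomes false, and is skipped only when the loop exits via break. A tiny illustration:

n = 0
while n < 3:
    n = n + 1
else:
    # Reached because the loop ended normally (no break occurred).
    print 'loop finished, n =', n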

    
B:

#!/usr/bin/env python
import time
import urllib

i = 0
link = 1
page = 1
url = [''] * 350
# Walk all 7 directory pages, downloading each article as soon as it is found.
while page <= 7:
    con = urllib.urlopen('http://blog.sina.com.cn/s/articlelist_1191258123_0_' + str(page) + '.html').read()
    title = con.find(r'<a title=')
    href = con.find(r'href=', title)
    html = con.find(r'.html', href)
    while title != -1 and href != -1 and html != -1 and i < 350:
        url[i] = con[href + 6:html + 5]
        print link, url[i]
        title = con.find(r'<a title=', html)
        href = con.find(r'href=', title)
        html = con.find(r'.html', href)
        content = urllib.urlopen(url[i]).read()
        open(r'/tmp/sina/' + url[i][-26:], 'w+').write(content)
        time.sleep(5)
        link = link + 1
        i = i + 1
    page = page + 1
else:
    print 'download over!'

Run result:

[Screenshot: 1.png — http://s3.51cto.com/wyfs02/M01/70/87/wKiom1W5lmbzdtMBAAJeGvbqJRE714.jpg]
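As a closing refinement that is not in the original scripts, each fetch can be wrapped in try/except so a single dead link does not abort a long run; in Python 2, urllib failures surface as IOError. A minimal sketch (the helper name is mine):

import time
import urllib

def fetch(url, path, delay=5):
    # Download one URL to a local file, skipping failures instead of crashing.
    try:
        content = urllib.urlopen(url).read()
    except IOError:
        print 'failed:', url
        return False
    open(path, 'w+').write(content)
    time.sleep(delay)   # stay polite between requests
    return True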

This article comes from the "World" blog; please keep the source when reposting: http://xiajie.blog.51cto.com/6044823/1679997
