Python web crawler (1): a simple blog crawler

Source: Internet
Author: User
Tags: python, web crawler

Recently I have been collecting and reading in-depth news, interesting essays, and commentary on the Internet for my public account, picking out a few excellent articles to republish. Reading through articles one by one is tedious, though, so I wanted a simple way to collect online content automatically and then filter it in one place. As it happens, I had just started learning about web crawlers, so I followed some online tutorials and wrote a small crawler that downloads Han Han's blog.


First, here is the complete code. To test it, install a Python environment (the script is written for Python 2), copy and paste the code, and press F5 in IDLE to run it. Note that the script writes the downloaded pages into a hanhan folder next to it, so create that folder first.

# Import the urllib library; Python must import urllib before using it
import urllib
# Import the time library
import time

# Define a URL array to store the captured addresses, i.e. the article addresses to be crawled
url = [''] * 50
# Define the link variable, used to count the recorded URLs
link = 1

# Capture all article links on the first page of the blog directory and download them.
# con stores the directory page of Han Han's blog opened with urllib.urlopen; in a
# multi-page version the page number in the address would be spliced in with
# '+ str(page) +' to change the directory address for each page
con = urllib.urlopen('http://blog.sina.com.cn/s/articlelist_1191258123_0_1.html').read()

# title stores the position of the element starting with <a title=
title = con.find(r'<a title=')
# href stores the position of href= found in con after that
href = con.find(r'href=', title)
# html stores the position of .html found in con after that
html = con.find(r'.html', href)

# Store the first article address
url[0] = con[href + 6:html + 5]
content = urllib.urlopen(url[0]).read()
open(r'hanhan/' + url[0][-26:], 'w+').write(content)
print '0 have downloaded', url[0]

# Capture the address of each article in a loop and store it in the url array
while title != -1 and href != -1 and html != -1 and link < 50:
    # con[href + 6:html + 5] extracts the URL string of the article
    url[link] = con[href + 6:html + 5]
    # Open and read each article address, and store the page in content
    content = urllib.urlopen(url[link]).read()
    # Write content into the hanhan folder, naming the file url[link][-26:]
    open(r'hanhan/' + url[link][-26:], 'w+').write(content)
    print link, 'have downloaded', url[link]
    title = con.find(r'<a title=', html)
    href = con.find(r'href=', title)
    html = con.find(r'.html', href)
    # Auto-increment link
    link = link + 1

The crawler's functionality is simple, but I think it is enough to get started. It only saves the HTML files of the articles listed on the first directory page of the blog; it does not extract and save the article content itself.
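If you later want to pull the article text out of the saved pages instead of keeping whole HTML files, a regular expression over the downloaded string is the simplest extension. Below is a minimal sketch in the same Python 2 style; the 'articalContent' class name is only an assumption about Sina's blog markup and would need to be checked against the real pages.

import re

# Minimal sketch (Python 2). Assumes the article body sits in a div whose
# class contains 'articalContent' -- this class name is an assumption and
# must be verified against the actual Sina blog HTML.
def extract_body(html_text):
    match = re.search(r'<div[^>]*articalContent[^>]*>(.*?)</div>', html_text, re.S)
    if match is None:
        return ''
    body = match.group(1)
    # Strip the remaining tags to keep plain text only
    return re.sub(r'<[^>]+>', '', body)

# Usage: read one of the files saved by the crawler and print its text
# print extract_body(open(r'hanhan/xxxx.html').read())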


Also, anyone with a programming background should not find this hard to follow. The basic idea is simple: first crawl the directory page for the article addresses, then fetch each address in turn and save the page. I still think this code is a bit messy and not as concise and clear as it could be, and I hope to write better code as I keep learning; one possible cleanup is sketched below.
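As one possible step toward cleaner code, the same idea (fetch the directory page, pull out every .html article link, download each one) can be split into small functions. The sketch below keeps the Python 2 / urllib style of the original; the regular expression is my own and assumes each article link on the directory page looks like <a title="..." ... href="....html">.

import re
import urllib

DIRECTORY_URL = 'http://blog.sina.com.cn/s/articlelist_1191258123_0_1.html'

def find_article_urls(page_html, limit=50):
    # Assumption: each article link looks like <a title="..." ... href="....html">
    pattern = re.compile(r'<a title=[^>]*?href="(.+?\.html)"')
    return pattern.findall(page_html)[:limit]

def download(article_url):
    # Name the file after the tail of the URL, as the original script does
    content = urllib.urlopen(article_url).read()
    with open(r'hanhan/' + article_url[-26:], 'w+') as f:
        f.write(content)

if __name__ == '__main__':
    directory = urllib.urlopen(DIRECTORY_URL).read()
    for i, article_url in enumerate(find_article_urls(directory)):
        download(article_url)
        print i, 'have downloaded', article_url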

The methods used here can all be found in the Python documentation without much trouble, and I have added a comment to almost every statement.
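The key method is str.find, which returns the index where a substring starts, or -1 when it is not found; that is exactly why the while loop above stops once title, href, or html becomes -1. A quick illustration (the sample string is made up):

# str.find returns the start index of the substring, or -1 if it is missing;
# the optional second argument says where to begin searching.
s = '<a title="demo" href="http://blog.sina.com.cn/s/blog_demo.html">'
print s.find(r'href=')         # index where href= starts
print s.find(r'.html', 20)     # search begins at position 20
print s.find(r'no-such-text')  # -1, which is what ends the crawler's loop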

Run: the script prints a numbered "have downloaded" line for each article as its HTML file is saved into the hanhan folder.