Python web crawler (1): a simple blog crawler

Source: Internet
Author: User
Tags: python, web crawler

Recently I have been collecting and reading in-depth news, interesting essays, and commentary on the Internet for my public account, picking out a few excellent articles to republish. Reading through articles one by one is tedious, though, so I wanted a simple way to collect online content automatically and then filter it in one place. As it happens, I had just started learning about web crawlers, so I followed some online tutorials and wrote a small crawler that downloads Han Han's blog.


First, here is the complete code. To test it, install a Python environment (the script is written for Python 2), copy and paste the code, and press F5 in IDLE to run it. Note that the script writes the downloaded pages into a hanhan folder next to it, so create that folder first.

# Import the urllib library; Python must import urllib before using it
import urllib
# Import the time library
import time

# Define a URL array to store the captured addresses, i.e. the article addresses to be crawled
url = [''] * 50
# Define the link variable, used to count the recorded URLs
link = 1

# Capture all article links on the first page of the blog directory and download them.
# con stores the directory page of Han Han's blog opened with urllib.urlopen; in a
# multi-page version the page number in the address would be spliced in with
# '+ str(page) +' to change the directory address for each page
con = urllib.urlopen('http://blog.sina.com.cn/s/articlelist_1191258123_0_1.html').read()

# title stores the position of the element starting with <a title=
title = con.find(r'<a title=')
# href stores the position of href= found in con after that
href = con.find(r'href=', title)
# html stores the position of .html found in con after that
html = con.find(r'.html', href)

# Store the first article address
url[0] = con[href + 6:html + 5]
content = urllib.urlopen(url[0]).read()
open(r'hanhan/' + url[0][-26:], 'w+').write(content)
print '0 have downloaded', url[0]

# Capture the address of each article in a loop and store it in the url array
while title != -1 and href != -1 and html != -1 and link < 50:
    # con[href + 6:html + 5] extracts the URL string of the article
    url[link] = con[href + 6:html + 5]
    # Open and read each article address, and store the page in content
    content = urllib.urlopen(url[link]).read()
    # Write content into the hanhan folder, naming the file url[link][-26:]
    open(r'hanhan/' + url[link][-26:], 'w+').write(content)
    print link, 'have downloaded', url[link]
    title = con.find(r'<a title=', html)
    href = con.find(r'href=', title)
    html = con.find(r'.html', href)
    # Auto-increment link
    link = link + 1

The crawler's functionality is simple, but I think it is enough to get started. It only saves the HTML files of the articles listed on the first directory page of the blog; it does not extract and save the article content itself.
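If you later want to pull the article text out of the saved pages instead of keeping whole HTML files, a regular expression over the downloaded string is the simplest extension. Below is a minimal sketch in the same Python 2 style; the 'articalContent' class name is only an assumption about Sina's blog markup and would need to be checked against the real pages.

import re

# Minimal sketch (Python 2). Assumes the article body sits in a div whose
# class contains 'articalContent' -- this class name is an assumption and
# must be verified against the actual Sina blog HTML.
def extract_body(html_text):
    match = re.search(r'<div[^>]*articalContent[^>]*>(.*?)</div>', html_text, re.S)
    if match is None:
        return ''
    body = match.group(1)
    # Strip the remaining tags to keep plain text only
    return re.sub(r'<[^>]+>', '', body)

# Usage: read one of the files saved by the crawler and print its text
# print extract_body(open(r'hanhan/xxxx.html').read())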


Also, anyone with a programming background should not find this hard to follow. The basic idea is simple: first crawl the directory page for the article addresses, then fetch each address in turn and save the page. I still think this code is a bit messy and not as concise and clear as it could be, and I hope to write better code as I keep learning; one possible cleanup is sketched below.
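As one possible step toward cleaner code, the same idea (fetch the directory page, pull out every .html article link, download each one) can be split into small functions. The sketch below keeps the Python 2 / urllib style of the original; the regular expression is my own and assumes each article link on the directory page looks like <a title="..." ... href="....html">.

import re
import urllib

DIRECTORY_URL = 'http://blog.sina.com.cn/s/articlelist_1191258123_0_1.html'

def find_article_urls(page_html, limit=50):
    # Assumption: each article link looks like <a title="..." ... href="....html">
    pattern = re.compile(r'<a title=[^>]*?href="(.+?\.html)"')
    return pattern.findall(page_html)[:limit]

def download(article_url):
    # Name the file after the tail of the URL, as the original script does
    content = urllib.urlopen(article_url).read()
    with open(r'hanhan/' + article_url[-26:], 'w+') as f:
        f.write(content)

if __name__ == '__main__':
    directory = urllib.urlopen(DIRECTORY_URL).read()
    for i, article_url in enumerate(find_article_urls(directory)):
        download(article_url)
        print i, 'have downloaded', article_url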

The methods used here can all be found in the Python documentation without much trouble, and I have added a comment to almost every statement.
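The key method is str.find, which returns the index where a substring starts, or -1 when it is not found; that is exactly why the while loop above stops once title, href, or html becomes -1. A quick illustration (the sample string is made up):

# str.find returns the start index of the substring, or -1 if it is missing;
# the optional second argument says where to begin searching.
s = '<a title="demo" href="http://blog.sina.com.cn/s/blog_demo.html">'
print s.find(r'href=')         # index where href= starts
print s.find(r'.html', 20)     # search begins at position 20
print s.find(r'no-such-text')  # -1, which is what ends the crawler's loop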

Run: the script prints a numbered "have downloaded" line for each article as its HTML file is saved into the hanhan folder.