See how I use Python to write a simple web crawler


I like to read freebuf articles when I have nothing else to do, but today while reading, the wireless connection kept dropping, so on a whim I wrote this web crawler to save the pages for easy offline viewing.

First, analyze the site's content. The part outlined in red is the div that holds the article list; as you can see, each page has 15 articles.

Open any one of the divs and you can see that the blue part contains nothing useful apart from the article title. But note the red part I outlined: it is the hyperlink pointing to the article's address, so the crawler only needs to capture that address.

The next problem is turning pages. As you can see, this site is different from most: there is no page-number bar at the bottom, only a "see more" link, which stumped me for a moment.


But when I looked at the page source, I found the hyperlink shown in the screenshot, and testing it took me to the next page. So by changing its final number, you can jump to the corresponding page.
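The screenshot with that link is not reproduced here, so the URL pattern below is only a hypothetical placeholder; but assuming the link ends in a page number, the list-page URLs could be generated roughly like this (Python 2, matching the urllib code later in the post):

# Hypothetical pagination pattern: replace it with the real link from the page
# source; only the trailing page number is meant to change.
page_template = 'http://www.freebuf.com/articles/page/%d'

for page in range(1, 4):
    print page_template % page     # .../page/1, .../page/2, .../page/3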

With the information above, the crawler's steps fall out naturally (a rough skeleton in code follows the list):
1. Locate every article on each page

2. Capture the URL of each article

3. Process the captured URLs
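As a minimal sketch of that plan (Python 2 like the rest of the post; the function names are placeholders of my own, and the string offsets are explained step by step below):

import urllib

def locate_articles(page_source):
    # Step 1: find where each article starts in the page source
    positions = []
    head = page_source.find('<dt><a href=')
    while head != -1:
        positions.append(head)
        head = page_source.find('<dt><a href=', head + 1)
    return positions

def capture_url(page_source, head):
    # Step 2: capture the article URL that begins at a located position
    tail = page_source.find('.html', head)
    return page_source[head + 13:tail + 5]   # skip '<dt><a href="', keep '.html'

def process_article(article_url):
    # Step 3: process a captured URL, here by saving the page to a local file
    urllib.urlretrieve(article_url, article_url[-10:])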

So here's the question: how do I locate each article in the page source?

Take the first article as an example. Search the source code for the string "<dt><a href=". Why this string? Because each article's URL begins right after it, so finding the string locates the start of each article. Having found the start, I also need to locate the end of the article's URL so I can extract what lies in between, as shown below.

Code:

import urllib
import string

# Define the page URL to crawl
url = 'http://www.freebuf.com/articles'
# Read the page to be crawled
globalcontent = urllib.urlopen(url).read()
# Capture the article list: search the source for the string '<dt><a href='
news_inner01_h = globalcontent.find('<dt><a href=')
print news_inner01_h

Run result: you can see that the first article's string starts at character 13,607 of the source. Next, find the tail of that article's URL.

Code:

import urllib
import string

# Define the page URL to crawl
url = 'http://www.freebuf.com/articles'
# Read the page to be crawled
globalcontent = urllib.urlopen(url).read()
# Capture the article list: search the source for the string '<dt><a href='
news_inner01_h = globalcontent.find('<dt><a href=')
print news_inner01_h
# Search the source for the string '.html' to find the URL's tail
news_inner01_l = globalcontent.find('.html')
print news_inner01_l

Run result: you can see that the URL's tail is at character 13,661, so now I can extract the actual article URL I'm after.


Code:

import urllib
import string

# Define the page URL to crawl
url = 'http://www.freebuf.com/articles'
# Read the page to be crawled
globalcontent = urllib.urlopen(url).read()
# Capture the article list: search the source for the string '<dt><a href='
news_inner01_h = globalcontent.find('<dt><a href=')
print news_inner01_h
# Search the source for the string '.html' to find the URL's tail
news_inner01_l = globalcontent.find('.html')
print news_inner01_l
# Slice the document stream from the head of the first article to its tail.
# Note: add 13 to the head and 5 to the tail, because find() returns the index
# where the search string starts; without moving the pointers forward the slice
# would not be the data I want ('<dt><a href="' is 13 characters, '.html' is 5).
news_inner01 = globalcontent[news_inner01_h + 13:news_inner01_l + 5]
print news_inner01

Run result:
As shown, the first article's URL has been successfully extracted. The rest is easy: just loop over the document stream in the same way to get every article's address, and then process each article.

The code below adds exception handling. I found that without it, the URL return value ends up with an extra blank line and the crawled article cannot be processed, so I catch the exception and simply ignore it.

At this point, a web crawler with the most basic functionality is done. You could of course add many more features; I only wrote this for fun, and it is very late and I'm too sleepy to keep writing, so I'll stop here. This is just one way of thinking about the problem: there is a lot you could add, and I haven't used any object-oriented techniques here; with them the crawler could be made much more complete.

import urllib
import string

url = 'http://www.freebuf.com/articles'
globalcontent = urllib.urlopen(url).read()
news_start = globalcontent
count = 1
while count <= 16:
    try:
        # Locate the head and tail of the next article URL
        news_inner_head = news_start.find('<dt><a href=')
        news_inner_tail = news_start.find('.html')
        news_inner_url = news_start[news_inner_head + 13:news_inner_tail + 5]
        print news_inner_url
        # Move past this article so the next find() hits the following one
        news_start = news_start[news_inner_tail + 5:]

        # Save the article page locally, named after the end of its URL
        filename = news_inner_url[-10:]
        urllib.urlretrieve(news_inner_url, filename)
        count += 1
    except:
        # Ignore the exception caused by the trailing blank line (see above)
        print 'download success!'
    finally:
        if count == 16:
            break
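The script above only walks through the first list page. As a rough sketch of how the page-turning observation from earlier could be bolted on (the page-URL template is still my hypothetical placeholder, not something from the original post):

import urllib

# Hypothetical pagination pattern; adjust it to the real link found in the page source
page_template = 'http://www.freebuf.com/articles/page/%d'

for page in range(1, 4):                          # crawl the first three list pages
    news_start = urllib.urlopen(page_template % page).read()
    while True:
        news_inner_head = news_start.find('<dt><a href=')
        if news_inner_head == -1:                 # no more article links on this page
            break
        news_inner_tail = news_start.find('.html', news_inner_head)
        news_inner_url = news_start[news_inner_head + 13:news_inner_tail + 5]
        try:
            urllib.urlretrieve(news_inner_url, news_inner_url[-10:])
        except:
            pass                                  # skip failed downloads, as the original does
        news_start = news_start[news_inner_tail + 5:]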

Well, not much more to say. Here are a couple of screenshots, and I'm off to bed!

