See how I use Python to write a simple web crawler


I like to read freebuf articles when I have nothing else to do, but today while reading, the wireless connection kept dropping, so on a whim I wrote this web crawler to save the pages for easy offline viewing.

First, analyze the site's content. The part outlined in red is the div that holds the article list; as you can see, each page has 15 articles.

Open any one of the divs and you can see that the blue part contains nothing useful apart from the article title. But note the red part I outlined: it is the hyperlink pointing to the article's address, so the crawler only needs to capture that address.

The next problem is turning pages. As you can see, this site is different from most: there is no page-number bar at the bottom, only a "see more" link, which stumped me for a moment.


But when I looked at the page source, I found the hyperlink shown in the screenshot, and testing it took me to the next page. So by changing its final number, you can jump to the corresponding page.
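The screenshot with that link is not reproduced here, so the URL pattern below is only a hypothetical placeholder; but assuming the link ends in a page number, the list-page URLs could be generated roughly like this (Python 2, matching the urllib code later in the post):

# Hypothetical pagination pattern: replace it with the real link from the page
# source; only the trailing page number is meant to change.
page_template = 'http://www.freebuf.com/articles/page/%d'

for page in range(1, 4):
    print page_template % page     # .../page/1, .../page/2, .../page/3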

With the information above, the crawler's steps fall out naturally (a rough skeleton in code follows the list):
1. Locate every article on each page

2. Capture the URL of each article

3. Process the captured URLs
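As a minimal sketch of that plan (Python 2 like the rest of the post; the function names are placeholders of my own, and the string offsets are explained step by step below):

import urllib

def locate_articles(page_source):
    # Step 1: find where each article starts in the page source
    positions = []
    head = page_source.find('<dt><a href=')
    while head != -1:
        positions.append(head)
        head = page_source.find('<dt><a href=', head + 1)
    return positions

def capture_url(page_source, head):
    # Step 2: capture the article URL that begins at a located position
    tail = page_source.find('.html', head)
    return page_source[head + 13:tail + 5]   # skip '<dt><a href="', keep '.html'

def process_article(article_url):
    # Step 3: process a captured URL, here by saving the page to a local file
    urllib.urlretrieve(article_url, article_url[-10:])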

So here's the question: how do I locate each article in the page source?

Take the first article as an example. Search the source code for the string "<dt><a href=". Why this string? Because each article's URL begins right after it, so finding the string locates the start of each article. Having found the start, I also need to locate the end of the article's URL so I can extract what lies in between, as shown below.

Code:

import urllib
import string

# Define the page URL to crawl
url = 'http://www.freebuf.com/articles'
# Read the page to be crawled
globalcontent = urllib.urlopen(url).read()
# Capture the article list: search the source for the string '<dt><a href='
news_inner01_h = globalcontent.find('<dt><a href=')
print news_inner01_h

Run result: you can see that the first article's string starts at character 13,607 of the source. Next, find the tail of that article's URL.

Code:

import urllib
import string

# Define the page URL to crawl
url = 'http://www.freebuf.com/articles'
# Read the page to be crawled
globalcontent = urllib.urlopen(url).read()
# Capture the article list: search the source for the string '<dt><a href='
news_inner01_h = globalcontent.find('<dt><a href=')
print news_inner01_h
# Search the source for the string '.html' to find the URL's tail
news_inner01_l = globalcontent.find('.html')
print news_inner01_l

Run result: you can see that the URL's tail is at character 13,661, so now I can extract the actual article URL I'm after.


Code:

import urllib
import string

# Define the page URL to crawl
url = 'http://www.freebuf.com/articles'
# Read the page to be crawled
globalcontent = urllib.urlopen(url).read()
# Capture the article list: search the source for the string '<dt><a href='
news_inner01_h = globalcontent.find('<dt><a href=')
print news_inner01_h
# Search the source for the string '.html' to find the URL's tail
news_inner01_l = globalcontent.find('.html')
print news_inner01_l
# Slice the document stream from the head of the first article to its tail.
# Note: add 13 to the head and 5 to the tail, because find() returns the index
# where the search string starts; without moving the pointers forward the slice
# would not be the data I want ('<dt><a href="' is 13 characters, '.html' is 5).
news_inner01 = globalcontent[news_inner01_h + 13:news_inner01_l + 5]
print news_inner01

Run result:
As shown, the first article's URL has been successfully extracted. The rest is easy: just loop over the document stream in the same way to get every article's address, and then process each article.

The code below adds exception handling. I found that without it, the URL return value ends up with an extra blank line and the crawled article cannot be processed, so I catch the exception and simply ignore it.

At this point, a web crawler with the most basic functionality is done. You could of course add many more features; I only wrote this for fun, and it is very late and I'm too sleepy to keep writing, so I'll stop here. This is just one way of thinking about the problem: there is a lot you could add, and I haven't used any object-oriented techniques here; with them the crawler could be made much more complete.

import urllib
import string

url = 'http://www.freebuf.com/articles'
globalcontent = urllib.urlopen(url).read()
news_start = globalcontent
count = 1
while count <= 16:
    try:
        # Locate the head and tail of the next article URL
        news_inner_head = news_start.find('<dt><a href=')
        news_inner_tail = news_start.find('.html')
        news_inner_url = news_start[news_inner_head + 13:news_inner_tail + 5]
        print news_inner_url
        # Move past this article so the next find() hits the following one
        news_start = news_start[news_inner_tail + 5:]

        # Save the article page locally, named after the end of its URL
        filename = news_inner_url[-10:]
        urllib.urlretrieve(news_inner_url, filename)
        count += 1
    except:
        # Ignore the exception caused by the trailing blank line (see above)
        print 'download success!'
    finally:
        if count == 16:
            break
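The script above only walks through the first list page. As a rough sketch of how the page-turning observation from earlier could be bolted on (the page-URL template is still my hypothetical placeholder, not something from the original post):

import urllib

# Hypothetical pagination pattern; adjust it to the real link found in the page source
page_template = 'http://www.freebuf.com/articles/page/%d'

for page in range(1, 4):                          # crawl the first three list pages
    news_start = urllib.urlopen(page_template % page).read()
    while True:
        news_inner_head = news_start.find('<dt><a href=')
        if news_inner_head == -1:                 # no more article links on this page
            break
        news_inner_tail = news_start.find('.html', news_inner_head)
        news_inner_url = news_start[news_inner_head + 13:news_inner_tail + 5]
        try:
            urllib.urlretrieve(news_inner_url, news_inner_url[-10:])
        except:
            pass                                  # skip failed downloads, as the original does
        news_start = news_start[news_inner_tail + 5:]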

Well, not much more to say. Here are a couple of screenshots, and I'm off to bed!

