Crawlers are one of the coolest things I've ever seen, because a crawler can take tedious work off your hands, like collecting pictures.
Cough ~ here's the backstory: a certain website packs its picture sets into torrents and publishes them for download, which saves veteran users a lot of time, but torrents eventually go dead. Sometimes there's a gallery you really want to see, but the torrent has no seeders left, which is very frustrating. Fortunately, even when the torrent is gone, the images are still on the official site! So I figured it would be nice to write a crawler to do the picture-saving work for me.
No sooner thought than done: I started planning.
The first step was to collect some of the site's front-end code (that is, right-click in the browser and choose View Page Source), because I wanted to know whether these pages followed some pattern.
I won't paste that code here, because there's a lot of it. (You might ask: why not build a URL graph or something, match every URL on the page, throw them all in, and visit them one by one? I'd answer that a crawler doesn't have to be a search engine; since we're implementing just one feature, we don't need all that machinery.)
Right, we only want to download the pictures themselves. Since the pages follow a pattern, analyzing them and extracting the links directly is much faster than the search-engine approach, and what we need is efficiency!
Looking at the site's structure, it's roughly: home page –> picture-set preview page –> per-image preview page.
The home page lists a lot of galleries, but we don't like all of them, so the program is designed so that you only need to copy the address of one picture set, and it will then download every picture in that set for you.
To do this, I need to write a class that handles the analysis of one picture set.
An instance of this class represents the preview page of one set.
I name this class Spider:
Here's the code:
using System;
using System.Collections.Generic;
using System.Net;
using System.IO;
using System.Text.RegularExpressions;

namespace Spiders
{
    public class Spider
    {
        protected string _uri; // saves the specified start address

        public Spider(string uri)
        {
            _uri = uri; // remember the picture set's preview-page address
        }
    }
}
To avoid unnecessary maintenance (i.e. designing too many constructor overloads), we require that a Spider be given a web page address as a parameter when it is initialized, and that address is stored in _uri. The field is declared protected with inheritance in mind: in the original project I didn't consider inheritance, and later found that not supporting it made the code hard to reuse.
With that, we have the opening of this framework.
Next I'll elaborate on how I implemented the idea.
The site's gallery outline page holds the set's annotations, pictures, thumbnails and so on, but I can't simply grab the images from the outline page, because its thumbnails are 200x134, which is obviously too small to enjoy. When you click a thumbnail, a preview page appears (it's not a jQuery lightbox; parsing a jQuery-driven page would be more complicated), and that preview page contains the full-size image's address. So my plan is: when the Spider class is initialized, have it "click" the first picture, go to the preview page, then cache the image address on each preview page while looking for the link behind the next-page button.
Here I need to implement a method called FirstStep. As the name implies, it performs the first step: finding the preview-page address of the set's first picture and returning it.
So the signature should look like this:
protected string FirstStep(string uri)
{
}
Now let's talk about how to make it happen!
We need to download the page's content and then match it with regular expressions.
So we use the System.Net and System.Text.RegularExpressions namespaces; the two classes we'll rely on are WebClient and Regex.
The WebClient class's main job is downloading files (essentially the same download function as IE, and probably far slower than a dedicated download manager). The obvious choice here is the WebClient.DownloadString method, but I actually use WebClient.OpenRead, because DownloadString has an encoding problem: Chinese text can come out garbled, and since I need the correct Chinese title, OpenRead it is.
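To make the encoding point concrete, here is a minimal sketch of the OpenRead approach. The class and method names (HtmlFetcher, ReadAll, FetchHtml) are mine, not the author's: OpenRead hands back the raw byte stream, so we choose the encoding ourselves instead of letting DownloadString guess it from the response headers.

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;

class HtmlFetcher
{
    // Decode an already-opened response stream with an explicit encoding.
    public static string ReadAll(Stream stream, Encoding encoding)
    {
        using (var reader = new StreamReader(stream, encoding))
        {
            return reader.ReadToEnd();
        }
    }

    // Download a page and decode it ourselves, sidestepping
    // DownloadString's charset guessing (which can garble Chinese).
    public static string FetchHtml(string uri, Encoding encoding)
    {
        using (var client = new WebClient())
        using (var stream = client.OpenRead(uri))
        {
            return ReadAll(stream, encoding);
        }
    }
}
```

For a GB2312/GBK page you would pass the matching Encoding instead of UTF-8; the point is that the caller decides, not the header sniffing.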
The Regex class is C#'s regular-expression engine; we mainly use IsMatch and Match.
As for how to build the actual regular expressions, you can look that up yourself! (One warning: don't trust the URL-matching expressions floating around online; they tend not to work well in C#, and tweaking them yourself costs more time than learning to write your own. Once you get the hang of it, they're actually easy to write.)
At this point you can implement the function that returns the preview-page address of the first image. (Sorry, I can't paste the code for this program: I wrote it to fetch, er, adult galleries, and posting it would end badly. Some day I'll implement the inheritance, crawl a respectable image site, and share that code instead. Thanks for understanding.)
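Since the author's own code isn't shared, here is a purely hypothetical sketch of what FirstStep might look like. The markup pattern (an anchor with class "thumb") and all names are invented for illustration; a real site would need its own pattern discovered from its page source.

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;

class FirstStepDemo
{
    // Hypothetical gallery markup: the first thumbnail anchor
    // links to that picture's preview page.
    static readonly Regex FirstPreview =
        new Regex(@"<a[^>]+class=""thumb""[^>]+href=""(?<href>[^""]+)""");

    // Extract the first preview-page address from already-downloaded HTML.
    public static string FindFirstPreview(string html)
    {
        Match m = FirstPreview.Match(html);
        return m.Success ? m.Groups["href"].Value : null;
    }

    // FirstStep itself: download the set's outline page, then extract.
    protected static string FirstStep(string uri)
    {
        using (var client = new WebClient())
        using (var stream = client.OpenRead(uri))
        using (var reader = new StreamReader(stream, Encoding.UTF8))
        {
            return FindFirstPreview(reader.ReadToEnd());
        }
    }
}
```

Splitting the extraction into its own method keeps the regex testable without hitting the network.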
There is one regex I wrote that you can use: ^(http|https|bbs)(://)(.+)
That said, this regex can't tell you where the address actually ends; you'll have to refine that yourself. I also recommend a regex testing tool, because verifying a regex inside the debugger wastes a lot of time. The one I use is:
Regex Match Tracer
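To make IsMatch and Match concrete, here is a small sketch using the loose URL pattern above. The class and method names are mine; note that ".+" is greedy, which is exactly the "where does the address end" caveat just mentioned.

```csharp
using System;
using System.Text.RegularExpressions;

class UrlCheck
{
    // The loose pattern from the article: scheme, "://", then anything.
    static readonly Regex UrlPattern = new Regex(@"^(http|https|bbs)(://)(.+)$");

    // IsMatch: does the whole string look like a URL at all?
    public static bool LooksLikeUrl(string s)
    {
        return UrlPattern.IsMatch(s);
    }

    // Match: pull out a capture group, here the scheme.
    public static string SchemeOf(string s)
    {
        Match m = UrlPattern.Match(s);
        return m.Success ? m.Groups[1].Value : null;
    }
}
```

IsMatch is enough when you only need a yes/no answer; reach for Match (and Groups) once you need the pieces.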
After that, the remaining work is to keep finding the next page and saving each page's pictures. (You can also download them directly: WebClient's DownloadFile and DownloadFileAsync methods are both available, or you can work with the file stream yourself and implement the download, though I don't particularly recommend that.) A note on the two: DownloadFileAsync downloads asynchronously, and it seems the download only proceeds once you've subscribed to the DownloadFileCompleted event, while DownloadFile is synchronous and will block the thread. Unless the pictures you want have no dead links and download quickly, you really need to think carefully about how to implement the download; in my case, most pages downloaded fully but individual pictures went missing. Also pay attention to the errors you catch! I swallowed everything to save effort, but a project like that doesn't meet engineering standards.
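As a sketch of the synchronous path with the error handling the author says he skipped: one DownloadFile call wrapped so that a dead link is logged and reported instead of silently swallowed. The SaveImage name and the behavior of returning a bool are my choices, not the original design.

```csharp
using System;
using System.Net;

class ImageSaver
{
    // Synchronously download one image. DownloadFile blocks the calling
    // thread, so a dead link will stall here until the request times out.
    public static bool SaveImage(string imageUri, string localPath)
    {
        try
        {
            using (var client = new WebClient())
            {
                client.DownloadFile(imageUri, localPath);
            }
            return true;
        }
        catch (WebException ex)
        {
            // Dead links and HTTP errors land here; log rather than swallow.
            Console.Error.WriteLine("Failed to fetch " + imageUri + ": " + ex.Message);
            return false;
        }
    }
}
```

Returning a bool lets the caller count which pictures went missing, which is exactly the failure mode described above.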
Continually finding the next page is actually simple to implement: just recurse. Since not everyone is comfortable with recursion, here is my recursive skeleton for reference:
protected void Cicle(string uri)
{
    // features that need to be implemented (save this page's pictures, etc.)
    if (/* condition to continue: a next page exists */)
    {
        // if the condition holds, recurse into the next page
        Cicle(next);
    }
    else
    {
        // otherwise exit the recursion
        return;
    }
}
This gives you the recursive skeleton; whether you test for the exit condition or its negation is a matter of personal style. As for the termination condition itself, it depends on the site's code. The site I wanted to crawl has a quirk: every page carries a link to the last page, while other pages may only carry a link back to the previous one. So instead of testing whether only a previous-page link exists, I simply check whether the last-page link equals the current URL. More generally, the next-page button is usually an <a> tag, and unlike desktop UI programming you can't just test whether it is enabled=false; the right approach is to compare the current URI with the URI in that <a> tag.
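A sketch of that termination check, comparing the current page's URI against the href in the next-page anchor. The anchor markup and the "next" link text are illustrative assumptions, not the real site's HTML.

```csharp
using System;
using System.Text.RegularExpressions;

class NextPageFinder
{
    // Pull the href out of an anchor whose text looks like a next-page
    // button ("下一页" or "next"). Hypothetical markup pattern.
    static readonly Regex NextLink =
        new Regex(@"<a[^>]+href=""(?<href>[^""]+)""[^>]*>\s*(下一页|next)",
                  RegexOptions.IgnoreCase);

    // Returns the next page's URI, or null when the link points back at
    // the current page -- the last-page marker the article describes.
    public static string FindNext(string html, string currentUri)
    {
        Match m = NextLink.Match(html);
        if (!m.Success) return null;
        string href = m.Groups["href"].Value;
        return href == currentUri ? null : href;
    }
}
```

A null return is then exactly the exit condition for the Cicle recursion: no next link, or a next link that loops back to where we already are.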
At this point the development of the crawler framework is over.
Because this framework is worth long-term support, I'm going to keep updating it, and I'll put the code on my personal GitHub or Gitee for everyone to use.
To state it once more: since I used this crawler for unwholesome purposes, I won't post the code, lest I end up signing in at the police station.
But I will keep updating it. When, I don't know; maybe I'll release it once I've written a crawler for a wholesome site.
Thank you for your support!
Notes from the development of a crawler framework