Two years ago, for work, wl363535796 and I wrote a micro crawler library (not a crawler in itself, only a wrapper around some crawling operations). After that we left it alone until recently, when we fixed all the bugs we had found, improved some features, and tidied up the code. It is now open source under the name easyspider, meaning a simple and practical crawling library. The goal of open-sourcing it is to save some time for anyone with the same requirements, and we would be honored if anyone finds this project useful. (PS: I haven't touched C# for a long time, so please bear with me if the code is not well written.)
Recently, for reasons I don't know, codeplex has been slow to access, so I hosted the project on http://code.taobao.org. If you encounter any problems or have suggestions for future versions, you are welcome to open issues on the project homepage, leave a message on this blog, or send us an email. wl363535796's email: wl363535796@gmail.com. mangoalex's email: mangoalex@163.com
Project address: Workshop
I. Project Introduction
1. The main functions of this library are as follows:
(1) Encapsulation of basic HTTP request operations, including GET and POST requests, file uploads, and resource downloads.
(2) Encapsulation of some common page operations, including JS filtering and downloading all page resources (the images, JS, CSS, and Flash contained in the page are downloaded to a specified local path, and the reference paths in the page are then rewritten to match).
(3) Some utility wrappers, including converting relative URLs to absolute URLs, computing the relative reference path between two different directories, and detecting the page encoding.
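To make the URL and path utilities in (3) more concrete, here is a minimal sketch of how such conversions can be done with the standard System.Uri class. It is not easyspider's own code: the class name UrlSketch and the methods ToAbsolute and ToRelative are hypothetical names chosen for this illustration.

```csharp
using System;

// Hypothetical helper for illustration only; not part of easyspider.
public static class UrlSketch
{
    // Resolve a link found in a page against the URL of that page.
    public static string ToAbsolute(string pageUrl, string link)
    {
        Uri baseUri = new Uri(pageUrl);
        Uri absolute = new Uri(baseUri, link);
        return absolute.AbsoluteUri;
    }

    // Compute the reference path that leads from one directory to a file
    // stored under another directory. fromDir must end with a slash so
    // that it is treated as a directory rather than a file.
    public static string ToRelative(string fromDir, string toFile)
    {
        Uri from = new Uri(fromDir);
        Uri to = new Uri(toFile);
        return Uri.UnescapeDataString(from.MakeRelativeUri(to).ToString());
    }
}
```

With this sketch, resolving "../img/x.png" against "http://example.com/a/b.html" gives "http://example.com/img/x.png", which is the kind of rewriting needed when saved pages must point at locally downloaded resources.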
2. Main project classes and descriptions:
Spider: the main class. It is the entry point for all HTTP request operations and wraps the methods for both synchronous and asynchronous requests.
WebPage: the main class for page downloads. Use this class when you need to download all the resources on a page to your local computer.
CHttpWebRequest: wraps HttpWebRequest and adds methods such as HTTP header settings.
CHttpWebResponse: wraps HttpWebResponse and adds operations such as obtaining the page encoding and decompressing a compressed server output stream.
3. The main project directories are described as follows:
Common: as the name suggests, this folder contains common project configuration and base classes.
RegEx: regular-expression wrappers, mainly for some commonly used regular expressions.
Utility: the tool folder, which wraps common operations on paths and files.
4. Additional Notes:
(1) For performance reasons, CHttpWebResponse implements the IDisposable interface instead of using a destructor. In other words, you need to release resources explicitly with the using syntax (in some cases you can instead call methods that already release the resources for you).
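The disposal pattern described in note (1) can be sketched as follows. This is an illustration under stated assumptions, not the library's implementation: ResponseSketch, ReadAll, and IsDisposed are hypothetical names, and the body is assumed to be UTF-8 text.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical stand-in for a response wrapper; not easyspider's class.
public sealed class ResponseSketch : IDisposable
{
    private readonly Stream _body;
    private bool _disposed;

    public ResponseSketch(Stream body)
    {
        _body = body;
    }

    public bool IsDisposed
    {
        get { return _disposed; }
    }

    // Read the whole body as text. When disposeAfterRead is true the
    // wrapper releases its stream itself, so the caller needs no using.
    public string ReadAll(bool disposeAfterRead)
    {
        if (_disposed) throw new ObjectDisposedException("ResponseSketch");
        try
        {
            StreamReader reader = new StreamReader(_body, Encoding.UTF8);
            return reader.ReadToEnd();
        }
        finally
        {
            if (disposeAfterRead) Dispose();
        }
    }

    // Deterministic cleanup: no finalizer, so nothing is left for the
    // garbage collector to run later.
    public void Dispose()
    {
        if (_disposed) return;
        _body.Dispose();
        _disposed = true;
    }
}
```

A caller either wraps the object in a using statement and passes false, or passes true and lets the read call do the cleanup. Skipping the finalizer avoids the extra garbage-collection work that finalizable objects incur, at the cost of relying on the caller to dispose.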
(2) For compatibility reasons, the entire project targets .NET Framework 2.0. However, my machine has VS2010 installed, so if your VS version is earlier than 2010, you need to convert the project to a lower version (for the specific steps, see http://www.cnblogs.com/hibernate6/archive/2011/11/28/2521991.html), or create a new project and copy the code into it.
II. Some simple examples
1. Synchronous GET and POST requests
// 1.1 Use a synchronous GET request to obtain the returned HTML
using (CHttpWebResponse cResponse = Spider.Get("http://www.baidu.com"))
{
    // The first parameter is the encoding used to decode the server's output stream; null means it is detected automatically. The second parameter is whether to release resources after the call; because using is used explicitly here, it is set to false.
    string html = cResponse.GetContent(null, false);
}
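Passing null as the encoding asks the library to work out the page encoding by itself. How easyspider does this is not shown here; the sketch below is only one common approach, assuming the charset is looked up first in the Content-Type header and then in the HTML head, with UTF-8 as the fallback. CharsetSketch and FindCharsetName are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical illustration of charset detection; not easyspider's code.
public static class CharsetSketch
{
    // Matches charset=gbk, charset="utf-8", charset = 'gb2312', and so on.
    private static readonly Regex CharsetPattern =
        new Regex(@"charset\s*=\s*[""']?([A-Za-z0-9_\-]+)", RegexOptions.IgnoreCase);

    // Returns the charset name declared by the HTTP header if there is one,
    // otherwise the one declared in the HTML head, otherwise "utf-8".
    public static string FindCharsetName(string contentTypeHeader, string htmlHead)
    {
        foreach (string source in new string[] { contentTypeHeader, htmlHead })
        {
            if (string.IsNullOrEmpty(source)) continue;
            Match match = CharsetPattern.Match(source);
            if (match.Success) return match.Groups[1].Value;
        }
        return "utf-8";
    }
}
```

The returned name can then be handed to Encoding.GetEncoding to decode the response bytes. The header is checked first because it is available before any of the body has been read.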
// 1.2 Use a synchronous POST request to submit strings and files and obtain the returned HTML
Dictionary<string, string> postDic = new Dictionary<string, string>();
List<HttpPostFile> files = new List<HttpPostFile>();
// Send "name=mango&age=20" to the server
postDic.Add("name", "mango");
postDic.Add("age", "20");
// Send two files to the server. The first parameter of FromFilePath is the local path of the file, the second is the name of the file field in the form, and the third is the MIME type of the file; null means it is detected automatically.
files.Add(HttpPostFile.FromFilePath(@"D:\1.txt", "file1", null));
files.Add(HttpPostFile.FromFilePath(@"D:\2.txt", "file2", null));
// The first parameter of Post is the URL, the second is the key-value pairs of the form fields to submit, and the third is the list of files to submit (set it to null if there are no files). The fourth parameter is the encoding of the posted string; null means UTF-8.
using (CHttpWebResponse cResponse = Spider.Post("http://www.baidu.com", postDic, files, null))
{
    string html = cResponse.GetContent(null, false);
}
2. Asynchronous GET and POST requests
The asynchronous methods are similar to the synchronous ones, except that they take one additional delegate parameter. Let's take the asynchronous GET method as an example:
// 2.1 AsyncGet does not block the program because the call is asynchronous. To see the HTML returned by the server, call Console.ReadLine after this method to keep the program from exiting.
Spider.AsyncGet(Spider.CreateRequest("http://www.baidu.com"), new ResponseCallback(delegate(CHttpWebResponse cResponse) {
    // Note: using is not needed here because the cleanup is handled internally. GetContent() actually calls GetContent(null, true).
    string html = cResponse.GetContent();
    Console.WriteLine(html);
}));
Console.ReadLine();
3. Download Page Resources
using (CHttpWebResponse cResponse = Spider.Get(Spider.CreateRequest("http://www.taobao.com")))
{
    WebPage page = cResponse.GetWebPage();
    // The first parameter is the file name to save the page as; the second (false) means JS is not filtered out; the third specifies the folder to save to (you can also specify separate folders for CSS, JS, and so on, as long as they are under the same directory).
    page.SaveHtmlAndResource(@"taobao.html", false, new DirConfig(@"D:\test"));
}
After the above code is executed, taobao.html and its resource directories are generated in D:\test.
Opening taobao.html displays the saved page.