Recently we needed to classify websites in order to study how local users access the Internet, which means classifying domain names. The first site we thought of was www.hao123.com.
This walkthrough shows some simple ways to fetch a web page's source code in C# and to analyze it with regular expressions, along with a few practical tips.
1. Obtain the webpage source code
For convenience, we wrap the download logic in a helper function, shown below:
static string GetSource(string pageUrl)
{
    WebRequest request = WebRequest.Create(pageUrl);       // WebRequest.Create returns an HttpWebRequest for http URLs
    WebResponse response = request.GetResponse();          // GetResponse returns the response to the Internet request
    Stream resStream = response.GetResponseStream();       // GetResponseStream returns the data stream from the Internet resource
    Encoding enc = Encoding.GetEncoding("gb2312");         // if the result is garbled, switch between UTF-8 and gb2312
    StreamReader sr = new StreamReader(resStream, enc);    // System.IO.StreamReader reads characters from a byte stream with a specific encoding
    string source = sr.ReadToEnd();                        // read the whole HTML source
    resStream.Close();
    sr.Close();
    return source;                                         // return the source code string
}
The code above handles most cases, but sometimes the source you get back differs from what the browser shows. When that happens, you need to send a User-Agent (UA) header with the request.
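A common way to do this is to cast the request to HttpWebRequest and set its UserAgent property before fetching. The sketch below shows the idea; the method name GetSourceWithUA and the UA string are only examples, not part of the original code.

// Variant of the helper that sends a browser-like User-Agent (sketch; the UA string is just an example)
// requires using System.Net, System.IO and System.Text
static string GetSourceWithUA(string pageUrl)
{
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(pageUrl);
    request.UserAgent = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 Chrome/27.0 Safari/537.36";
    using (WebResponse response = request.GetResponse())
    using (Stream resStream = response.GetResponseStream())
    using (StreamReader sr = new StreamReader(resStream, Encoding.GetEncoding("gb2312")))
    {
        return sr.ReadToEnd();   // return the HTML source, same as GetSource
    }
}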
2. Create a website class
This one is simple:
class Page
{
    public string Url;
    public string Class;
}
3. Obtain the source code of the home page of hao123.com.
Use the function from step 1 to fetch the source code of www.hao123.com so we can analyze it.
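Calling the helper is a one-liner; the URL here is just the hao123 home page:

// fetch the hao123 home page with the GetSource helper from step 1
string source = GetSource("http://www.hao123.com/");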
By analyzing the page we can see that hao123's main URLs are gathered in the list section on the right.
From this list we can extract the category URLs we want.
Clicking the green text takes you to the detailed list. Looking at the source code behind the green text, we found:
<span class="box-sort_title"><a href="http://v.hao123.com/movie/">Movies</a></span>
This is the entry point to a category list, so we can use this string as the template for our analysis.
4. Regular Expression Analysis:
What we need is the link after a href=" and the title that follows it. We replace both with (.*?), which gives:
<span class="box-sort_title"><a href="(.*?)">(.*?)</a></span>
Here .* matches any string, the ? makes the match non-greedy, and the parentheses around each (.*?) create capture groups that we will use below.
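As a quick illustration of how the two capture groups behave (the HTML string here is made-up example data in the same shape as the hao123 markup, not taken from the site):

// illustrative example of the non-greedy capture groups (example data only)
// requires using System.Text.RegularExpressions
string html = "<span class=\"box-sort_title\"><a href=\"http://example.com/\">Example</a></span>";
Match m = Regex.Match(html, "<span class=\"box-sort_title\"><a href=\"(.*?)\">(.*?)</a></span>");
Console.WriteLine(m.Groups[1].Value);   // prints: http://example.com/
Console.WriteLine(m.Groups[2].Value);   // prints: Example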
After regular expression analysis, we can obtain the URL and corresponding category:
// ******************* obtain the category list *******************
List<Page> classPages = new List<Page>();
string strReg = "box-sort_title\"><a href=\"(.*?)\">(.*?)</a>";
foreach (Match m in Regex.Matches(source, strReg, RegexOptions.IgnoreCase))
{
    Console.WriteLine(string.Format("{0} {1}", m.Groups[2].Value, m.Groups[1].Value));
    Page p = new Page();
    p.Url = m.Groups[1].Value.Replace(" ", "");      // strip stray spaces
    p.Class = m.Groups[2].Value.Replace(" ", "");
    classPages.Add(p);
}
// ******************* end of the category list section *******************
The foreach loop processes each match in turn.
After this, the URLs and their categories are stored in the classPages list.
5. Analyze the webpage of each category
Just as above, we locate the list of site URLs inside each category page. Without going into more detail, here is the code:
// *************** analyze each category page directly, without saving the HTML ***************
foreach (Page p in classPages)
{
    string temp = GetSource(p.Url);
    strReg = "

(The rest of this snippet is cut off in the original post; a sketch of the likely remainder is given below.)
Finally, each website and its category are written to a TXT file on drive C.
Note that because I open the file in append mode, running the program a second time keeps appending to the same file; if you are interested, you can make the program delete the old file automatically first.
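Since the loop above is cut off, here is a minimal sketch of what the remainder probably looks like, based on the description in this step. The inner pattern (siteReg) and the output path (C:\sites.txt) are my assumptions, not the author's original code.

// sketch of the per-category loop (siteReg and the output path are assumptions, not the original code)
// requires using System.IO and System.Text.RegularExpressions
using (StreamWriter sw = new StreamWriter(@"C:\sites.txt", true))    // true = append mode, as noted above
{
    foreach (Page p in classPages)
    {
        string temp = GetSource(p.Url);                              // fetch the category page
        string siteReg = "<a href=\"(.*?)\"";                        // assumed pattern: grab every link on the page
        foreach (Match m in Regex.Matches(temp, siteReg, RegexOptions.IgnoreCase))
        {
            sw.WriteLine("{0}\t{1}", m.Groups[1].Value, p.Class);    // write "url <tab> category"
        }
    }
}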
Full code: http://download.csdn.net/detail/icyfox_bupt/4389810
There are many kinds of crawlers. Knowing the general approach to crawling is one thing, but the small techniques matter just as much; for example, doing a little preprocessing on the source before applying the regular expressions makes the analysis much easier. That is something you have to explore on your own.