Use C# and regular expressions to analyze the URL list of hao123.com

Recently we needed to classify websites in order to study the browsing habits of local Internet users, which in practice means classifying domain names. The first site that came to mind was www.hao123.com.

Through this analysis you can learn some simple ways to fetch a web page's source code in C#, how to call regular expressions on it, and a few tricks for using them.

1. Obtain the webpage source code

For convenience, we wrap this step in a helper function, as shown below:

    static string GetSource(string pageUrl)
    {
        // WebRequest.Create returns the WebRequest subclass HttpWebRequest
        WebRequest request = WebRequest.Create(pageUrl);
        // WebRequest.GetResponse returns the response to the Internet request
        WebResponse response = request.GetResponse();
        // WebResponse.GetResponseStream returns the data stream from the Internet resource
        Stream resStream = response.GetResponseStream();
        // If the result is garbled, switch between gb2312 and UTF-8
        Encoding enc = Encoding.GetEncoding("gb2312");
        // Namespace: System.IO. StreamReader implements a TextReader (a reader of a
        // sequential series of characters) that reads from a byte stream in a specific encoding.
        StreamReader sr = new StreamReader(resStream, enc);
        // Read the whole page (the HTML code); in the original program this was displayed
        // in a multiline TextBox control called contentHtml.
        string source = sr.ReadToEnd();
        resStream.Close();
        sr.Close();
        return source; // return the source code string
    }

The code above handles most cases, but sometimes you will find that the source you download differs from what the browser shows. When that happens, you need to set a User-Agent (UA) on the request.
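For example, here is a minimal sketch of setting the User-Agent, assuming an HttpWebRequest is used; the UA string and the method name GetSourceWithUa are only illustrative:

    static string GetSourceWithUa(string pageUrl)
    {
        // Cast to HttpWebRequest so the UserAgent property is available
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(pageUrl);
        // Pretend to be an ordinary desktop browser; the exact UA string is only an example
        request.UserAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)";
        using (WebResponse response = request.GetResponse())
        using (StreamReader sr = new StreamReader(response.GetResponseStream(), Encoding.GetEncoding("gb2312")))
        {
            return sr.ReadToEnd();
        }
    }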

2. Create a website class

This part is simple:

    class Page
    {
        public string Url;
        public string Class;
    }

3. Obtain the source code of the home page of hao123.com.

Use the function in step 1 to analyze www.hao123.com and obtain the source code.
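For example (a sketch; the variable name source simply matches what the snippets below assume):

    string source = GetSource("http://www.hao123.com/");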

By examining the page, we can see that the main URLs of hao123 are laid out in the list section on the right of the home page.

This list is what we use to extract the category URLs we want.

Clicking the green text takes you to the detailed list for that category. Looking at the source code behind the green text, we find:

    <span class="box-sort_title"><a href="http://v.hao123.com/movie/">Movies</a></span>

This is the entry point to each category's detailed list, so we can use this snippet as the template for our analysis.


4. Regular expression analysis

What we need is the link after a href=" and the title that follows it, so we replace each of them with (.*?), giving:

    <span class="box-sort_title"><a href="(.*?)">(.*?)</a></span>

Here .* matches any string, ? makes the match non-greedy, and the parentheses around each pattern capture the matched text so that we can use it below.

After regular expression analysis, we can obtain the URL and corresponding category:

    // ******************* Obtain the category list section ********************
    List<Page> classPages = new List<Page>();
    string strReg = "box-sort_title\"><a href=\"(.*?)\">(.*?)</a>";
    foreach (Match m in Regex.Matches(source, strReg, RegexOptions.IgnoreCase))
    {
        Console.WriteLine(string.Format("{0} {1}", m.Groups[2].Value, m.Groups[1].Value));
        Page p = new Page();
        p.Url = m.Groups[1].Value.Replace(" ", "");   // strip stray spaces
        p.Class = m.Groups[2].Value.Replace(" ", "");
        classPages.Add(p);
    }
    // ******************* End of the category list section ********************

The foreach loop processes each match in turn.

After this step, the URLs and their categories are stored in the classPages list.


5. Analyze the webpages of each category

As above, we locate the list of URLs inside each category page; rather than go into detail, here is the code directly:

    // *************** Analyze each category page directly, without saving it ***************
    foreach (Page p in classPages)
    {
        string temp = GetSource(p.Url);
        strReg = "
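The regular expression and the rest of the loop body are cut off in the original post. Below is only a minimal sketch of how the loop might be completed, under two assumptions that are not from the original: the inner regex simply grabs every <a href="...">...</a> pair on the category page, and the results are appended to a text file on drive C, as described next. The method name AnalyzeCategories and the path C:\hao123.txt are made up for the sketch.

    // A sketch only, not the original code. The inner regex and the output path are assumptions.
    static void AnalyzeCategories(List<Page> classPages)
    {
        // Assumed pattern: capture the href and the link text of every anchor on the page
        string innerReg = "<a href=\"(.*?)\"[^>]*>(.*?)</a>";
        foreach (Page p in classPages)
        {
            string temp = GetSource(p.Url);   // download the category page with the step 1 helper
            foreach (Match m in Regex.Matches(temp, innerReg, RegexOptions.IgnoreCase))
            {
                // Append one "url category" line per link to a text file on drive C
                File.AppendAllText(@"C:\hao123.txt",
                    string.Format("{0} {1}\r\n", m.Groups[1].Value, p.Class));
            }
        }
    }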

Finally, each website's category is saved to a TXT file on drive C.

Note that because the file is opened in append mode, a second run of the program will keep adding to the same file; if you are interested, you can have the program delete the old file automatically first, for example as in the snippet below.
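One possible way to do that, as a sketch (the path C:\hao123.txt is just the placeholder used in the sketch above): delete last run's output at start-up, before any appending happens.

    // Sketch: remove the previous run's output before appending new results
    if (File.Exists(@"C:\hao123.txt"))
    {
        File.Delete(@"C:\hao123.txt");
    }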


Full code: http://download.csdn.net/detail/icyfox_bupt/4389810


Having written quite a few crawlers, I find that the crawling itself is only part of the job; the small techniques matter just as much. For example, a bit of preprocessing before applying a regular expression can make the analysis much easier. These are things you discover gradually with practice.
