Recently we needed to classify websites in order to study how local users access the Internet, which means classifying domain names. The first site we thought of was www.hao123.com.
This walkthrough shows some simple ways to fetch a web page's source code in C# and to analyze it with regular expressions, along with a few practical tips.
1. Obtain the webpage source code
For convenience, we wrap the download logic in a helper function, shown below:
static string GetSource(string pageUrl)
{
    WebRequest request = WebRequest.Create(pageUrl);       // WebRequest.Create returns an HttpWebRequest for http URLs
    WebResponse response = request.GetResponse();          // GetResponse returns the response to the Internet request
    Stream resStream = response.GetResponseStream();       // GetResponseStream returns the data stream from the Internet resource
    Encoding enc = Encoding.GetEncoding("gb2312");         // if the result is garbled, switch between UTF-8 and gb2312
    StreamReader sr = new StreamReader(resStream, enc);    // System.IO.StreamReader reads characters from a byte stream with a specific encoding
    string source = sr.ReadToEnd();                        // read the whole HTML source
    resStream.Close();
    sr.Close();
    return source;                                         // return the source code string
}
The code above handles most cases, but sometimes the source you get back differs from what the browser shows. When that happens, you need to send a User-Agent (UA) header with the request.
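A common way to do this is to cast the request to HttpWebRequest and set its UserAgent property before fetching. The sketch below shows the idea; the method name GetSourceWithUA and the UA string are only examples, not part of the original code.

// Variant of the helper that sends a browser-like User-Agent (sketch; the UA string is just an example)
// requires using System.Net, System.IO and System.Text
static string GetSourceWithUA(string pageUrl)
{
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(pageUrl);
    request.UserAgent = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 Chrome/27.0 Safari/537.36";
    using (WebResponse response = request.GetResponse())
    using (Stream resStream = response.GetResponseStream())
    using (StreamReader sr = new StreamReader(resStream, Encoding.GetEncoding("gb2312")))
    {
        return sr.ReadToEnd();   // return the HTML source, same as GetSource
    }
}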
2. Create a website class
This one is simple:
class Page
{
    public string Url;
    public string Class;
}
3. Obtain the source code of the home page of hao123.com.
Use the function from step 1 to fetch the source code of www.hao123.com so we can analyze it.
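Calling the helper is a one-liner; the URL here is just the hao123 home page:

// fetch the hao123 home page with the GetSource helper from step 1
string source = GetSource("http://www.hao123.com/");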
By analyzing the page we can see that hao123's main URLs are gathered in the list section on the right.
From this list we can extract the category URLs we want.
Clicking the green text takes you to the detailed list. Looking at the source code behind the green text, we found:
<span class="box-sort_title"><a href="http://v.hao123.com/movie/">Movies</a></span>
This is the entry point to a category list, so we can use this string as the template for our analysis.
4. Regular Expression Analysis:
What we need is the link after a href=" and the title that follows it. We replace both with (.*?), which gives:
<span class="box-sort_title"><a href="(.*?)">(.*?)</a></span>
Here .* matches any string, the ? makes the match non-greedy, and the parentheses around each (.*?) create capture groups that we will use below.
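As a quick illustration of how the two capture groups behave (the HTML string here is made-up example data in the same shape as the hao123 markup, not taken from the site):

// illustrative example of the non-greedy capture groups (example data only)
// requires using System.Text.RegularExpressions
string html = "<span class=\"box-sort_title\"><a href=\"http://example.com/\">Example</a></span>";
Match m = Regex.Match(html, "<span class=\"box-sort_title\"><a href=\"(.*?)\">(.*?)</a></span>");
Console.WriteLine(m.Groups[1].Value);   // prints: http://example.com/
Console.WriteLine(m.Groups[2].Value);   // prints: Example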
After regular expression analysis, we can obtain the URL and corresponding category:
// ******************* obtain the category list *******************
List<Page> classPages = new List<Page>();
string strReg = "box-sort_title\"><a href=\"(.*?)\">(.*?)</a>";
foreach (Match m in Regex.Matches(source, strReg, RegexOptions.IgnoreCase))
{
    Console.WriteLine(string.Format("{0} {1}", m.Groups[2].Value, m.Groups[1].Value));
    Page p = new Page();
    p.Url = m.Groups[1].Value.Replace(" ", "");      // strip stray spaces
    p.Class = m.Groups[2].Value.Replace(" ", "");
    classPages.Add(p);
}
// ******************* end of the category list section *******************
The foreach loop processes each match in turn.
After this, the URLs and their categories are stored in the classPages list.
5. Analyze the webpage of each category
Just as above, we locate the list of site URLs inside each category page. Without going into more detail, here is the code:
// *************** analyze each category page directly, without saving the HTML ***************
foreach (Page p in classPages)
{
    string temp = GetSource(p.Url);
    strReg = "

(The rest of this snippet is cut off in the original post; a sketch of the likely remainder is given below.)
Finally, each website and its category are written to a TXT file on drive C.
Note that because I open the file in append mode, running the program a second time keeps appending to the same file; if you are interested, you can make the program delete the old file automatically first.
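Since the loop above is cut off, here is a minimal sketch of what the remainder probably looks like, based on the description in this step. The inner pattern (siteReg) and the output path (C:\sites.txt) are my assumptions, not the author's original code.

// sketch of the per-category loop (siteReg and the output path are assumptions, not the original code)
// requires using System.IO and System.Text.RegularExpressions
using (StreamWriter sw = new StreamWriter(@"C:\sites.txt", true))    // true = append mode, as noted above
{
    foreach (Page p in classPages)
    {
        string temp = GetSource(p.Url);                              // fetch the category page
        string siteReg = "<a href=\"(.*?)\"";                        // assumed pattern: grab every link on the page
        foreach (Match m in Regex.Matches(temp, siteReg, RegexOptions.IgnoreCase))
        {
            sw.WriteLine("{0}\t{1}", m.Groups[1].Value, p.Class);    // write "url <tab> category"
        }
    }
}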
Full code: http://download.csdn.net/detail/icyfox_bupt/4389810
There are many kinds of crawlers. Knowing the general approach to crawling is one thing, but the small techniques matter just as much; for example, doing a little preprocessing on the source before applying the regular expressions makes the analysis much easier. That is something you have to explore on your own.