C# Analysis of Article Collection

Source: Internet
Author: User

I previously wrote a brief introduction to collecting Baidu video data. After seeing the one reader who left a comment asking me to also summarize news collection, I decided to use the popular blog site cnblogs as today's sample for article collection. I have to say that after several months on the blog garden, I have found that the articles published on its homepage are generally excellent and of great reference value. I am still a newbie, though, so if you spot any problems in this article, please leave a comment directly, because otherwise I may go on believing that what I wrote is correct.
Now to the topic. First, note that the only way to collect data from a web page is to obtain its source code; you must be clear about this. Since we do not know how to connect to the other site's database server, we can only look for what we want inside the page source. That inevitably means processing a large number of strings. So how do we deal with source code in which the content we want is buried among a large number of HTML tags? There may be many ways to solve this, but I think regular expressions handle it well.
The paragraph above already covers the two key points. Let's summarize the process (a quick sketch of both steps follows the list):
1. Obtain the source code of the page to be collected.
2. Use regular expressions to extract the content we want from that source code.
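As a preview, here is a minimal sketch of these two steps in C#. It simply pulls the text of the <title> tag from an arbitrary page; the class name and the regular expression are illustrative only and are not part of the project described in this article.

using System;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;

class QuickPreview
{
    static void Main()
    {
        // Step 1: obtain the page source code (illustrative URL).
        WebClient web = new WebClient();
        byte[] buffer = web.DownloadData("http://www.cnblogs.com/");
        string source = Encoding.UTF8.GetString(buffer);

        // Step 2: use a regular expression to pull out the piece we want,
        // here simply the content of the <title> tag.
        Match m = Regex.Match(source, @"<title>(?<title>.+?)</title>",
                              RegexOptions.IgnoreCase | RegexOptions.Singleline);
        if (m.Success)
        {
            Console.WriteLine(m.Groups["title"].Value);
        }
    }
}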
Let's start with some preparation: an entity class that stores the information of an article, for example the title, author, release time, number of views, and so on.
Article information entity


using System;
using System.Collections.Generic;
using System.Text;

namespace Plug.Article.Entity
{
    /// <summary>
    /// Collected article information. Some properties may be left empty;
    /// assigning an empty title or link address throws an exception.
    /// </summary>
    [Serializable]
    public class Article
    {
        private string category;

        /// <summary>
        /// Article category
        /// </summary>
        public string Category
        {
            get { return category; }
            set { category = value; }
        }

        private string url;

        /// <summary>
        /// Article link address
        /// </summary>
        public string Url
        {
            get { return url; }
            set
            {
                if (string.IsNullOrEmpty(value))
                {
                    throw new ApplicationException("The link address of the article cannot be blank!");
                }
                url = value;
            }
        }

        private string title;

        /// <summary>
        /// Article title
        /// </summary>
        public string Title
        {
            get { return title; }
            set
            {
                if (string.IsNullOrEmpty(value))
                {
                    throw new ApplicationException("The title of the article cannot be blank!");
                }
                title = value;
            }
        }

        private int views;

        /// <summary>
        /// Number of times the article has been read
        /// </summary>
        public int Views
        {
            get { return views; }
            set { views = value; }
        }

        private int replys;

        /// <summary>
        /// Number of comments
        /// </summary>
        public int Replys
        {
            get { return replys; }
            set { replys = value; }
        }

        private string datatime;

        /// <summary>
        /// Article publication date
        /// </summary>
        public string Datatime
        {
            get { return datatime; }
            set { datatime = value; }
        }

        private string author;

        /// <summary>
        /// Author
        /// </summary>
        public string Author
        {
            get { return author; }
            set { author = value; }
        }

        private string site;

        /// <summary>
        /// The author's website, i.e. the site the article was collected from
        /// </summary>
        public string Site
        {
            get { return site; }
            set { site = value; }
        }
    }
}
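As a quick illustration of how the validation behaves (my own snippet, not part of the original project): assigning an empty title or link throws the ApplicationException defined above.

// Assumes: using System; plus the Plug.Article.Entity namespace above.
Plug.Article.Entity.Article a = new Plug.Article.Entity.Article();
a.Title = "2 months in a foreign company";
a.Url = "http://www.cnblogs.com/yesry/archive/2008/06/25/1229587.html";
a.Views = 1909;

try
{
    a.Title = "";   // an empty title is rejected
}
catch (ApplicationException ex)
{
    Console.WriteLine(ex.Message);   // "The title of the article cannot be blank!"
}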

The method for obtaining the source code of a web page is also very simple; I put it in a separate class.
Obtain webpage source code


using System;
using System.Collections.Generic;
using System.Text;
using System.Net;

namespace Plug.Article
{
    /// <summary>
    /// Web page operations
    /// </summary>
    public class Html
    {
        /// <summary>
        /// Get the source code of a web page
        /// </summary>
        /// <param name="url">URL path</param>
        /// <returns>The page source as a string</returns>
        public string GetHtml(string url)
        {
            WebClient web = new WebClient();
            byte[] buffer = web.DownloadData(url);
            return Encoding.Default.GetString(buffer);
        }
    }
}
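One note: the collection code later in this article calls GetHtml with a second argument naming the character set ("UTF-8"), which the class above does not define. The original overload is not shown in the post, so the following is my guess at what it looks like, based on that call:

/// <summary>
/// Get the source code of a web page using the specified character set,
/// e.g. "UTF-8" for the blog garden pages.
/// </summary>
public string GetHtml(string url, string charset)
{
    WebClient web = new WebClient();
    byte[] buffer = web.DownloadData(url);
    return Encoding.GetEncoding(charset).GetString(buffer);
}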

After obtaining the source code, it is time for the key step: writing the regular expression that collects the data. Before collecting, we need to understand the characteristics of the page source; if we do not know what we are looking for, we cannot write the expression. The page we want to collect is http://www.cnblogs.com/TopPosts.aspx, the blog garden's reading rankings. It contains today's reading ranking, yesterday's reading ranking, and so on, but we only need the following information:

· 2 months in a foreign company (Read: 1909) (Comment: 21) Yesry
· Why avoid using triggers whenever possible? (Read: 1490) (Comment: 15) Noodles
· Discuz!NT System Architecture Analysis (Read: 1391) (Comment: 18) Han Long
· Hard drive (Read: 1342) (Comment: 15) Li Zhan

We only need the title, the number of reads, the number of comments, the time, and the author. So let's analyze the source-code features around this key information.


<tr>
    <td style="width: 80%">
        · <a id="ctl00_cphmain_toppostspaged_postsrepeater_ctl01_lnktitle" href="http://www.cnblogs.com/yesry/archive/2008/06/25/1229587.html" target="_blank">2 months in a foreign company</a> <span class="title">(Read: 1909) (Comment: 21) ()</span>
    </td>
    <td height="20">
        <a id="ctl00_cphmain_toppostspaged_postsrepeater_ctl01_lnkauthor" href="http://yesry.cnblogs.com/">Yesry</a>
    </td>
</tr>

This is the source code surrounding the information we want to collect. Before writing the regular expression, note that, as we all know, this content is generated dynamically, so its format is fixed. That is exactly what lets a single regular expression collect every piece of information on the page correctly. I will not explain the meaning of the regular expression in detail in this article, because regular expressions mostly take practice.


Regex regexArticles = new Regex(@"·\s*<a\s+id="".+?""\s+href=""(?<url>.+?)""\s+target=""_blank"">(?<title>.+?)</a>\s*<span\s+class="".+?"">\(Read: (?<views>\d+)\)\s*\(Comment: (?<reply>\d+)\)\s*\((?<time>.*?)\)</span>\s*</td>\s*<td\s+height=""\d+"">\s*<a\s+id="".+?""\s+href=""(?<blog>.+?)"">(?<author>.+?)</a>");

The expression may be hard to read, and people who really know regular expressions will probably laugh at me, because my expression is not written very flexibly. For friends who have never touched regular expressions, a brief introduction (I am only a beginner myself): a regular expression describes the features of a string so that the string can be matched, which is exactly why we analyzed the page source above. Learning how to write the match is not difficult; here are some articles for your reference.
Regular Expression learning notes: http://hedong.3322.org/archives/000244.html
Regular Expression 30 minutes Getting Started: http://unibetter.com/deerchao/zhengzhe-biaodashi-jiaocheng-se.htm

These two articles were enough to get me started writing the programs I wanted with regular expressions. You can find many more articles online.
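To make the named-group syntax used in the expression above concrete, here is a tiny, self-contained illustration (my own example, not taken from this article's project):

using System;
using System.Text.RegularExpressions;

class NamedGroupDemo
{
    static void Main()
    {
        // (?<views>\d+) captures the digits into a group named "views".
        string input = "(Read: 1909) (Comment: 21)";
        Match m = Regex.Match(input, @"\(Read: (?<views>\d+)\) \(Comment: (?<reply>\d+)\)");

        Console.WriteLine(m.Groups["views"].Value);  // 1909
        Console.WriteLine(m.Groups["reply"].Value);  // 21
    }
}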

The key regular expression is given above; now let's look at how the collection itself is put together.

Key collection code


// Requires: using System.Collections.Generic; using System.Text.RegularExpressions;

// Web page helper object, used to obtain the page source code
Html html = new Html();

// Download the daily ranking page of the blog garden
string htmlCode = html.GetHtml("http://www.cnblogs.com/TopPosts.aspx", "UTF-8");

// Regular expression that extracts the article information from the ranking page
Regex regexArticles = new Regex(@"·\s*<a\s+id="".+?""\s+href=""(?<url>.+?)""\s+target=""_blank"">(?<title>.+?)</a>\s*<span\s+class="".+?"">\(Read: (?<views>\d+)\)\s*\(Comment: (?<reply>\d+)\)\s*\((?<time>.*?)\)</span>\s*</td>\s*<td\s+height=""\d+"">\s*<a\s+id="".+?""\s+href=""(?<blog>.+?)"">(?<author>.+?)</a>");

// All content matching the expression
MatchCollection mArticles = regexArticles.Matches(htmlCode);

// Collection that holds the collected articles
List<Entity.Article> list = new List<Entity.Article>();

// Traverse the matched content
foreach (Match m in mArticles)
{
    Entity.Article test = new Entity.Article();
    test.Category = "Blog garden popular articles";    // set the category
    test.Title = m.Groups["title"].Value;              // set the title
    test.Url = m.Groups["url"].Value;                  // set the link
    test.Views = int.Parse(m.Groups["views"].Value);   // set the number of reads
    test.Replys = int.Parse(m.Groups["reply"].Value);  // set the number of comments
    test.Datatime = m.Groups["time"].Value;            // set the publication time
    test.Author = m.Groups["author"].Value;            // set the author
    test.Site = m.Groups["blog"].Value;                // set the article source
    list.Add(test);
}
MatchCollection mArticles = regexArticles.Matches(htmlCode);

This is the line that obtains all of the matching content at once.


foreach (Match m in mArticles)

Inside the loop, each Match object holds one piece of matched content. m.Groups["title"].Value retrieves the text captured by the named group; the (?<title>.+?) syntax in the expression is what places the matched text into the group called "title". The code really is that simple; there is no deep technical content here. Let's summarize the collection process (a small result-check sketch follows the list):
1. Obtain the source code of the specified page.
2. Analyze the features of the content in the source code.
3. Write a regular expression that matches those features.
4. Traverse the matched content and load it into a collection.
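To check the result, one could simply print the collected list (again, my own snippet rather than part of the packaged source):

// Assumes the "list" built in the loop above.
foreach (Entity.Article article in list)
{
    Console.WriteLine("{0} by {1} ({2} reads, {3} comments)",
        article.Title, article.Author, article.Views, article.Replys);
}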

I have packaged the code of the whole example for your reference. If you have any questions, please leave a message.
Source code download:
The collected data may not appear in the same order as it is displayed on the blog garden, because the articles are collected from the whole page and are not grouped by category; the data itself, however, is exactly the same.
