Analysis of the Core Technology of Web-Scraping Software, Part (1) --- How to Use C# to Get All of a Blogger's Post Links and Titles on Cnblogs (the "Blog Park")


Series overview and background

In addition to explaining the important techniques used in web scraping, this series also offers solutions to a number of practical problems and shares interface-development experience. It is well suited to beginner and intermediate .NET developers, and I hope you will find it useful.

Many beginners wonder: "I have read books covering every aspect of C#, so why can't I write a decent application?"

This usually means they have not yet learned to apply that knowledge as a whole, to develop a programming mindset, or to build up an interest in learning. I think this series of articles may help with that, and I hope it does.

Development environment: VS2008

Source location: https://github.com/songboriceboy/NetworkGatherEditPublish

Source code download: install an SVN client (provided at the end of this article), then check out the following address: https://github.com/songboriceboy/NetworkGatherEditPublish

The outline of the article series is as follows:

1. How to use C# to get all of a blogger's post links and titles on Cnblogs;
2. How to use C# to get the content of a blog post;
3. How to convert an HTML page to PDF (html2pdf) using C#;
4. How to use C# to download all the images in a blog post for offline browsing;
5. How to use C# to merge multiple individual PDF files into one PDF and generate a table of contents;
6. How to use C# to get NetEase blog links, and what is special about NetEase blogs;
7. How to use C# to download WeChat official-account articles;
8. How to extract the full text of any article;
9. How to use C# to strip all tags from HTML and obtain plain text (html2txt);
10. How to compile multiple HTML files into a CHM file (html2chm) using C#;
11. How to use C# to publish articles remotely to Sina Blog;
12. How to develop a static site generator using C#;
13. How to build a program framework using C# (a classic WinForm interface: top menu bar, toolbar, left tree list, right multi-tab area);
14. How to implement a web page editor using C# (WinForm) ...

Overview of this first article: how to use C# to get all of a blogger's post links and titles on Cnblogs

The solution for getting all of a blogger's post links and titles is demonstrated in the demo program: executable file download

The basic principle

Collecting all of a blogger's post URLs takes two steps:

1. Request each paging URL and obtain the page's HTML source;

2. Parse the article URLs and titles out of the HTML source obtained.

The first step is to work out the paging URLs. Take my blog as an example:

First page http://www.cnblogs.com/ice-river/default.html?page=1

Second page http://www.cnblogs.com/ice-river/default.html?page=2

We can write a function that builds these paging URL strings and saves them in a queue, as shown in the code below.

The code queues 500 pages by default; 500 pages * 20 posts per page = 10,000 posts, which is generally enough except for an exceptionally prolific blogger.

You may ask: isn't 500 pages too many? Some bloggers have only 2 or 3 pages of posts, so do we really need to request 500 pages to get all the links?

We queue 500 pages by default simply because we do not know in advance how many pages of posts a blogger has.

Later on, we will discuss a way to tell when all the article links have been collected, so in practice we will not actually request 500 pages for every blogger.

protected void GatherInitCnblogsFirstUrls()
{
    // Build the paging URLs for the blogger entered in the text box and queue them.
    string strPagePre = "http://www.cnblogs.com/";
    string strPagePost = "/default.html?page={0}&onlytitle=1";
    string strPage = strPagePre + this.txtBoxCnblogsBlogID.Text + strPagePost;

    // 500 pages by default (see the explanation above).
    for (int i = 500; i > 0; i--)
    {
        string strTemp = string.Format(strPage, i);
        m_wd.AddUrlQueue(strTemp);
    }
}

As for getting a page's HTML source (the same content you see in a browser via right-click, View Page Source):

C# already provides a ready-made HttpWebRequest class for this. I wrapped it in a WebDownloader class; you can refer to the source code for details. The main method is implemented as follows:

public string GetPageByHttpWebRequest(string url, Encoding encoding, string strRefer)
{
    string result = null;
    WebResponse response = null;
    StreamReader reader = null;
    try
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        // Pretend to be a normal browser.
        request.UserAgent = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 2.0.50727)";
        request.Accept = "image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, application/x-shockwave-flash, application/vnd.ms-excel, application/vnd.ms-powerpoint, application/msword, */*";
        if (!string.IsNullOrEmpty(strRefer))
        {
            Uri u = new Uri(strRefer);
            request.Referer = u.Host;
        }
        else
        {
            request.Referer = strRefer;
        }
        request.Method = "GET";
        response = request.GetResponse();
        reader = new StreamReader(response.GetResponseStream(), encoding);
        result = reader.ReadToEnd();
    }
    catch (Exception ex)
    {
        result = "";
    }
    finally
    {
        if (reader != null)
            reader.Close();
        if (response != null)
            response.Close();
    }
    return result;
}

The first parameter is one of the 500 paging URLs we built above, and the return value is the page's HTML source (the article URLs and titles we want are inside it; next we parse them out).
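For example, a call to download the first paging page might look like this (a minimal usage sketch; the blogger ID is just the example used elsewhere in this article, and the parameterless WebDownloader constructor is my assumption):

// Assumes: using System.Text; and the WebDownloader class described above.
WebDownloader wd = new WebDownloader();
string html = wd.GetPageByHttpWebRequest(
    "http://www.cnblogs.com/ice-river/default.html?page=1&onlytitle=1",
    Encoding.UTF8,
    "");   // empty referer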

Step two: parse the article URLs and titles out of the HTML source obtained.

For this we use the well-known HtmlAgilityPack library. HtmlAgilityPack is an HTML parsing tool with which we can easily extract a page's title, body text, categories, dates, and, in principle, any element. There is plenty of documentation online, so I will not go into detail here. We add two extension methods to HtmlAgilityPack: GetReferences, which extracts all hyperlinks from a page's source, and GetReferencesText, which extracts the text corresponding to each link.
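Before these two methods can run, the page HTML has to be loaded into an HtmlAgilityPack document. A minimal sketch follows (the field name m_Doc matches the code below, but how the real project initializes it is my assumption):

// Assumes: using HtmlAgilityPack;
HtmlDocument m_Doc = new HtmlDocument();
m_Doc.LoadHtml(html);   // html is the page source returned by GetPageByHttpWebRequest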

private void GetReferences()
{
    // Collect the href of every <a> element in the page.
    HtmlNodeCollection hrefs = m_Doc.DocumentNode.SelectNodes("//a[@href]");
    if (Equals(hrefs, null))
    {
        References = new string[0];
        return;
    }
    References = hrefs.Select(href => href.Attributes["href"].Value)
                      .Distinct()
                      .ToArray();
}

private void GetReferencesText()
{
    try
    {
        // Map each link to its (decoded) inner text, skipping image links.
        m_dicLink2Text.Clear();
        HtmlNodeCollection hrefs = m_Doc.DocumentNode.SelectNodes("//a[@href]");
        if (Equals(hrefs, null))
        {
            return;
        }
        foreach (HtmlNode node in hrefs)
        {
            if (!m_dicLink2Text.Keys.Contains(node.Attributes["href"].Value.ToString()))
                if (!HttpUtility.HtmlDecode(node.InnerHtml).Contains("img src")
                    && !HttpUtility.HtmlDecode(node.InnerHtml).Contains("img")
                    && !HttpUtility.HtmlDecode(node.InnerHtml).Contains("src"))
                    m_dicLink2Text.Add(node.Attributes["href"].Value.ToString(),
                                       HttpUtility.HtmlDecode(node.InnerHtml));
        }
    }
    catch (System.Exception e)
    {
        System.Console.WriteLine(e.ToString());
    }
}

Note that at this point we have every link address in the page, which is close to, but not exactly, what we want: we still need to filter the real post URLs out of this collection of links.

For this we need regular expressions. Again, C# provides a ready-made class (Regex), but we have to understand regular expression syntax ourselves; I will not cover it here, so please look it up if you are not familiar with it.

First, let's look at the format of a post URL.

Here are a couple of example post URLs:

http://www.cnblogs.com/ice-river/p/3475041.html

http://www.cnblogs.com/Zhijianliutang/p/4042770.html

We can see that the URL contains the blogger's ID, so we keep the blogger ID in a variable (this.txtBoxCnblogsBlogID.Text).

The URL pattern above can be expressed with the following regular expression (written as C# verbatim strings so the backslashes survive):

@"www\.cnblogs\.com/" + this.txtBoxCnblogsBlogID.Text + @"/p/.*?\.html$";

A brief explanation: the backslash escapes the dot, because . has a special meaning in regular expressions; $ anchors the end of the string, so html$ means "ends with html". As for .*?, it is important but often poorly understood.

There are two matching modes: greedy mode (the default) and lazy mode. For example, applied to the string (ABC)DFE(GH), the pattern \(.*\) matches the entire string, because a regex by default matches as much as it can: although (ABC) satisfies the expression, (ABC)DFE(GH) satisfies it too, so the greedy match takes the longer one. If we only want to match (ABC) and (GH), we need \(.*?\) instead: putting ? after a repetition metacharacter such as * or + makes it match as little as possible while still satisfying the expression.
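To make the difference concrete, here is a small self-contained demo (not part of the project's source) that you can run:

using System;
using System.Text.RegularExpressions;

class GreedyVsLazyDemo
{
    static void Main()
    {
        string input = "(ABC)DFE(GH)";

        // Greedy: .* matches as much as possible, so the whole string is captured.
        Console.WriteLine(Regex.Match(input, @"\(.*\)").Value);     // prints (ABC)DFE(GH)

        // Lazy: .*? matches as little as possible, so each group is captured separately.
        foreach (Match m in Regex.Matches(input, @"\(.*?\)"))
            Console.WriteLine(m.Value);                             // prints (ABC) then (GH)
    }
}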

So the regular expression above means: the link contains www.cnblogs.com/, then the blogger ID, then /p/, then any number of characters, ending in .html.
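As a quick sanity check (a sketch, not project code), the pattern accepts the sample post URLs above and rejects a paging URL:

// Assumes: using System; using System.Text.RegularExpressions; the blogger ID is the example one.
string blogId = "ice-river";
string pattern = @"www\.cnblogs\.com/" + blogId + @"/p/.*?\.html$";

Console.WriteLine(Regex.IsMatch("http://www.cnblogs.com/ice-river/p/3475041.html", pattern));        // True
Console.WriteLine(Regex.IsMatch("http://www.cnblogs.com/ice-river/default.html?page=2", pattern));   // False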

We can then filter out all links matching this pattern with C# code; the main code is as follows:

MatchCollection matchs = Regex.Matches(normalizedLink, m_strCnblogsUrlFilterRule, RegexOptions.Singleline);
if (matchs.Count > 0)
{
    // The link is a real post URL; look up the title text collected earlier.
    string strLinkText = "";
    if (Links.m_dicLink2Text.Keys.Contains(normalizedLink))
        strLinkText = Links.m_dicLink2Text[normalizedLink];
    if (strLinkText == "")
    {
        if (Links.m_dicLink2Text.Keys.Contains(link))
            strLinkText = Links.m_dicLink2Text[link].TrimEnd().TrimStart();
    }
    PrintLog(strLinkText + "\n");
    PrintLog(normalizedLink + "\n");
    lstThisTimesUrls.Add(normalizedLink);
}

Determining when all the article links have been collected: we planned for 500 paging URLs, but a blogger may have only a few pages of posts, so how do we know when every article link has been collected?

The method is actually very simple: we keep two sets, one holding the article links found this time (on the current page) and one holding all article links collected so far. If every link found this time is already in the accumulated set, then all the articles have been collected.

In the program, I encapsulate this check in a function:

private bool CheckArticles(List<string> lstThisTimesUrls)
{
    // Return true when every link found this time is already in the accumulated set.
    bool bRet = true;
    foreach (string strTemp in lstThisTimesUrls)
    {
        if (!m_lstUrls.Contains(strTemp))
        {
            bRet = false;
            break;
        }
    }

    // Add any new links to the accumulated set.
    foreach (string strTemp in lstThisTimesUrls)
    {
        if (!m_lstUrls.Contains(strTemp))
            m_lstUrls.Add(strTemp);
    }

    return bRet;
}
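To show how this check might fit into the collection loop, here is a hypothetical driver (the blogId variable, the helper ExtractPostLinks, and the explicit loop are illustrative; the actual project wires this up through its URL queue instead):

// Hypothetical sketch: request paging URLs one by one and stop as soon as a page
// contributes no new article links (CheckArticles returns true).
for (int page = 1; page <= 500; page++)
{
    string pageUrl = string.Format(
        "http://www.cnblogs.com/{0}/default.html?page={1}&onlytitle=1", blogId, page);
    string html = wd.GetPageByHttpWebRequest(pageUrl, Encoding.UTF8, "");

    List<string> lstThisTimesUrls = ExtractPostLinks(html);   // hypothetical helper wrapping the regex filter above
    if (CheckArticles(lstThisTimesUrls))
        break;   // every link on this page has been seen before: we are done
}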

Other important techniques

1. BackgroundWorker worker thread. Collecting pages is a relatively time-consuming job, so it should not run on the interface's main thread; instead we start a background thread. The most convenient way to do this in C# is the BackgroundWorker class (a minimal sketch of how it fits together with the logging below appears after the code at the end of this section).

2. We need to print each parsed article URL and title on the interface, but interface controls cannot be modified from a worker thread, so we use C#'s delegate mechanism and a callback to marshal the output back onto the UI thread:

TaskDelegate deles = new TaskDelegate(new CCTaskDelegate(RefreshTask));

public void RefreshTask(DelegatePara dp)
{
    // If we are not on the UI thread, re-invoke this method on it.
    if (this.InvokeRequired)
    {
        this.Invoke(new CCTaskDelegate(RefreshTask), dp);
        return;
    }
    // Unpack the parameters.
    string strLog = (string)dp.strLog;
    WriteLog(strLog);
}

protected void PrintLog(string strLog)
{
    DelegatePara dp = new DelegatePara();
    dp.strLog = strLog;
    deles.Refresh(dp);
}

public void WriteLog(string strLog)
{
    try
    {
        strLog = System.DateTime.Now.ToLongTimeString() + " : " + strLog;
        this.richTextBoxLog.AppendText(strLog);
        this.richTextBoxLog.SelectionStart = int.MaxValue;
        this.richTextBoxLog.ScrollToCaret();
    }
    catch
    {
    }
}
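As promised under point 1, here is a minimal sketch (illustrative, not the project's actual wiring) of how the BackgroundWorker and the logging callback fit together: the gathering runs in DoWork on a background thread and reports to the UI via PrintLog, which marshals back to the UI thread through the delegate above.

// Assumes: using System.ComponentModel; inside the same WinForm as the code above.
// GatherInitCnblogsFirstUrls and PrintLog are the methods shown earlier; how the real
// project creates and starts its worker is my assumption.
private readonly BackgroundWorker m_worker = new BackgroundWorker();

private void InitWorker()   // hypothetical: called once, e.g. from the form's constructor
{
    m_worker.DoWork += (s, args) =>
    {
        GatherInitCnblogsFirstUrls();          // queue the 500 paging URLs
        PrintLog("Paging URLs queued.\n");     // safe here: PrintLog marshals to the UI thread
        // ... download each page and parse out the post links ...
    };
}

private void btnStart_Click(object sender, EventArgs e)   // hypothetical "start" button handler
{
    if (!m_worker.IsBusy)
        m_worker.RunWorkerAsync();             // DoWork runs on a thread-pool thread
}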
