Analysis of the Core Technology of Network Acquisition Software, Part 2: How to Get the Body and Title of Any Site's Blog Posts in C#


Series overview and background

The first article in this series was warmly received, which is a great encouragement to me. This is the second installment; I hope you will keep supporting it and giving me the motivation to continue writing.

My own tool, the Bean John Blog Backup Expert, has been available for more than three years since its debut and is well liked by bloggers and reading enthusiasts. Over that time, a number of technical enthusiasts have asked me how its various practical features are implemented.

The software is developed with .NET technology. To give back to the community, I am opening a column and writing a series of articles about the core techniques used in the software, for the benefit of technology enthusiasts.

Besides explaining the important techniques used in web acquisition, this series also offers solutions to common problems and shares experience in interface development and programming. It is well suited to beginning and intermediate .NET developers; I hope you will support it.

Many beginners share a common confusion: "I have read books covering every aspect of C#, so why can't I write a decent application?"

The reason is that they have not yet learned to apply that knowledge comprehensively, to develop a programming mindset, or to build up an interest in learning. I believe this series of articles may help with that; I hope so.

Development environment: VS2008

Source location for this section: https://github.com/songboriceboy/GetWebContent

Source code download method: install an SVN client (provided at the end of this article), then check out the following address: https://github.com/songboriceboy/GetWebContent

The outline of the article series is as follows:

1. How to obtain all of a blogger's essay links and titles in C#;
2. How to obtain the body and title of a blog post in C#;
3. How to convert an HTML page to PDF (html2pdf) in C#;
4. How to download all the images in a blog post to local storage for offline browsing in C#;
5. How to merge multiple individual PDF files into one PDF with a generated table of contents in C#;
6. How to obtain NetEase blog links in C#, and the particularities of NetEase blogs;
7. How to download public-account articles in C#;
8. How to obtain the full text of any article;
9. How to strip all the tags from HTML to get plain text (html2txt) in C#;
10. How to compile multiple HTML files into a CHM file (html2chm) in C#;
11. How to publish articles remotely to Sina Blog in C#;
12. How to develop a static site builder in C#;
13. How to build a program framework in C# (classic WinForm interface: top menu bar, toolbars, left tree list, right multi-tab interface);
14. How to implement a web page editor (WinForm) in C#; ...

Main content of the second section (how to get the body and title of a blog post on any site in C#)

A demo of the solution for getting the body and title of a post on any site in C# is available as an executable file download.

Basic principles

To get the body and title of an article on any page, in addition to the HtmlAgilityPack.dll assembly introduced in the previous section, we need another useful assembly: Fizzler.dll (http://fizzlerex.codeplex.com/).

Parsing HTML elements via XPath with HtmlAgilityPack is relatively cumbersome; Fizzler provides a way to select HTML elements with CSS selectors, which matches most web developers' habits much better.

Usually, for an article, we only want to keep its body text (removing ads, sidebars, and the other surrounding page-layout elements). Let's walk through the steps; here we need the help of a powerful browser tool.

1. Use Firefox or Chrome to open the page whose body we want to extract. Firefox needs the Firebug plugin installed; in Chrome, simply press F12. Here we take Firefox as the example:

For example, open the blog post from the previous section (http://www.cnblogs.com/ice-river/p/4110799.html).

After installing the Firebug plugin, a bug icon appears in the top-right corner of the browser; click it and a debugging pane opens at the bottom of the browser. In that pane, click the element-inspector icon (a small blue box with an arrow). You will find that you can now frame each element on the page. Once you select the article body, the corresponding div element is highlighted in the debugging pane below. Right-click that div element (on the cnblogs Blog Park it is div#cnblogs_post_body) to bring up the context menu.

Click the [Copy CSS Path] menu item, and the CSS path of the body is placed on the clipboard: [html body div#home div#main div#maincontent div.forflow div#topics div.post div.postbody div#cnblogs_post_body]

For Fizzler, we only need the last part: div#cnblogs_post_body. Remember, from the long CSS path string we just obtained, read from the back toward the front and take the substring after the last space; here that is div#cnblogs_post_body.
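The "take everything after the last space" rule can be sketched in a few lines of C#. This is just an illustration using the standard library; the CssPathHelper class below is a name made up for this example and is not part of the demo:

```csharp
using System;

public static class CssPathHelper
{
    // Returns the last space-separated segment of a CSS path copied from
    // Firebug/DevTools, e.g. "html body ... div#cnblogs_post_body"
    // becomes "div#cnblogs_post_body".
    public static string LastSegment(string cssPath)
    {
        if (string.IsNullOrEmpty(cssPath))
            return string.Empty;
        string trimmed = cssPath.Trim();
        int lastSpace = trimmed.LastIndexOf(' ');
        return lastSpace < 0 ? trimmed : trimmed.Substring(lastSpace + 1);
    }
}
```

Passing the full path copied above returns "div#cnblogs_post_body", which is exactly what the [body CSS Path] box expects.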

Fill this string into the [body CSS Path] box of our demo.

In fact, for Fizzler, it takes just one line of code:

    IEnumerable<HtmlNode> nodesMainContent = htmlDoc.DocumentNode.QuerySelectorAll(this.textBoxCssPath.Text);

Isn't it simple?

For other technical blogs, practice on your own to check that you have understood the method above. For reference, here are the body CSS paths of a few common technical blog sites:

Site ---> body CSS path
cnblogs ---> div#cnblogs_post_body
CSDN ---> div#article_content.article_content
51CTO ---> div.showcontent
ITeye ---> div#blog_content.blog_content
ITPUB ---> div.blog_wz1
ChinaUnix ---> div.blog_wz1

All right, let's come back and look at the key code in this section's demo.

Get the blog body title:

    private void GetTitle()
    {
        string strContent = m_wd.GetPageByHttpWebRequest(this.textBoxUrl.Text, Encoding.UTF8);

        HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument
        {
            OptionAddDebuggingAttributes = false,
            OptionAutoCloseOnEnd = true,
            OptionFixNestedTags = true,
            OptionReadEncoding = true
        };
        htmlDoc.LoadHtml(strContent);

        string strTitle = "";
        HtmlNodeCollection nodes = htmlDoc.DocumentNode.SelectNodes("//title"); // extract the title element
        if (!Equals(nodes, null))
        {
            strTitle = string.Join(";", nodes.Select(n => n.InnerText).ToArray()).Trim();
        }

        strTitle = strTitle.Replace("Blog Park", "");              // strip the site-name suffix from the title
        strTitle = Regex.Replace(strTitle, @"[|/\;:*?<>&#-]", ""); // remove characters illegal in file names
        strTitle = Regex.Replace(strTitle, "[\"]", "");            // remove double quotes
        this.textBoxTitle.Text = strTitle.TrimEnd();
    }

The main flow is: first use the WebDownloader class introduced in the previous section to fetch the page's source code, then get the page title with the following line of code:

    HtmlNodeCollection nodes = htmlDoc.DocumentNode.SelectNodes("//title");

Here HtmlAgilityPack's SelectNodes function extracts the page's title element. Note that any well-formed web page has a title element, because it makes it easier for search engines to index the article.
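For readers who want to see what SelectNodes("//title") accomplishes without pulling in HtmlAgilityPack, here is a rough, package-free approximation using a regular expression. This is only an illustration (a regex is not a real HTML parser; the article's code uses HtmlAgilityPack, which is the right tool), and TitleExtractor is a name invented for this example:

```csharp
using System.Text.RegularExpressions;

public static class TitleExtractor
{
    // Grabs the inner text of the first <title> element, if any.
    // A crude stand-in for htmlDoc.DocumentNode.SelectNodes("//title").
    public static string GetTitle(string html)
    {
        Match m = Regex.Match(html ?? "", @"<title[^>]*>(.*?)</title>",
                              RegexOptions.IgnoreCase | RegexOptions.Singleline);
        return m.Success ? m.Groups[1].Value.Trim() : "";
    }
}
```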

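The cleanup at the end of GetTitle() (stripping characters that are not allowed in file names, so the title can safely be used as a file name later) can be exercised in isolation. Below is a minimal sketch using the same Regex.Replace calls; TitleCleaner is a made-up name for this example:

```csharp
using System.Text.RegularExpressions;

public static class TitleCleaner
{
    // Mirrors the cleanup in GetTitle(): removes characters that are
    // illegal in file names, then strips double quotes.
    public static string Clean(string title)
    {
        string s = Regex.Replace(title ?? "", @"[|/\;:*?<>&#-]", "");
        s = Regex.Replace(s, "[\"]", "");
        return s.TrimEnd();
    }
}
```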

Get the content of the blog body:

    private void GetMainContent()
    {
        string strContent = m_wd.GetPageByHttpWebRequest(this.textBoxUrl.Text, Encoding.UTF8);

        HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument
        {
            OptionAddDebuggingAttributes = false,
            OptionAutoCloseOnEnd = true,
            OptionFixNestedTags = true,
            OptionReadEncoding = true
        };
        htmlDoc.LoadHtml(strContent);

        // Select the body element using the CSS path entered by the user.
        IEnumerable<HtmlNode> nodesMainContent = htmlDoc.DocumentNode.QuerySelectorAll(this.textBoxCssPath.Text);
        if (nodesMainContent.Count() > 0)
        {
            this.richTextBox1.Text = nodesMainContent.ToArray()[0].OuterHtml;
            this.webBrowser1.DocumentText = this.richTextBox1.Text;
        }
    }

It simply calls the htmlDoc.DocumentNode.QuerySelectorAll function, passing it the CSS path of the body div discussed above. nodesMainContent.ToArray()[0].OuterHtml is then the HTML source of the article body; it is placed into richTextBox1.Text to show the HTML source, and into webBrowser1.DocumentText to render the page content.
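Fizzler's QuerySelectorAll does the real work here. Purely to illustrate what "select the div with a given id and take its OuterHtml" means, below is a deliberately naive, standard-library-only sketch. It is not how the demo works; it assumes a double-quoted id attribute and well-formed nesting, and should not be used on real-world pages:

```csharp
using System;

public static class DivExtractor
{
    // A simplified stand-in for QuerySelectorAll("div#" + id) + OuterHtml:
    // finds the opening <div ... id="..."> tag, then walks forward
    // counting nested <div> tags so the matching </div> is found.
    public static string GetDivById(string html, string id)
    {
        string marker = "id=\"" + id + "\"";
        int idPos = html.IndexOf(marker, StringComparison.OrdinalIgnoreCase);
        if (idPos < 0) return "";
        int start = html.LastIndexOf("<div", idPos, StringComparison.OrdinalIgnoreCase);
        if (start < 0) return "";
        int depth = 0;
        int i = start;
        while (i < html.Length)
        {
            int open = html.IndexOf("<div", i, StringComparison.OrdinalIgnoreCase);
            int close = html.IndexOf("</div>", i, StringComparison.OrdinalIgnoreCase);
            if (close < 0) return ""; // unbalanced markup
            if (open >= 0 && open < close)
            {
                depth++;            // entered a (possibly nested) div
                i = open + 4;
            }
            else
            {
                depth--;            // left a div
                i = close + 6;
                if (depth == 0)
                    return html.Substring(start, i - start);
            }
        }
        return "";
    }
}
```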

Summary and a preview of the next section

Crawling the articles of a blog breaks down into three main steps:

1. Collect all the article links by crawling the pagination links (Section 1).

2. Get each article's title and body from its link (this section).

3. Parse all the image links out of the article body and download the article's images to local storage (next section).
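The three steps above can be sketched end to end as a toy pipeline. Everything here uses only the standard library; the regex-based parsing is a simplified stand-in for the HtmlAgilityPack/Fizzler code used in the real tool, and the class and method names are made up for this example:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CrawlPipeline
{
    // Step 1: collect article links from a listing (pagination) page.
    public static List<string> ExtractLinks(string listingHtml)
    {
        var links = new List<string>();
        foreach (Match m in Regex.Matches(listingHtml,
                 "<a[^>]+href=\"([^\"]+)\"", RegexOptions.IgnoreCase))
            links.Add(m.Groups[1].Value);
        return links;
    }

    // Step 2: pull the <title> text out of an article page.
    public static string ExtractTitle(string articleHtml)
    {
        Match m = Regex.Match(articleHtml, "<title[^>]*>(.*?)</title>",
                              RegexOptions.IgnoreCase | RegexOptions.Singleline);
        return m.Success ? m.Groups[1].Value.Trim() : "";
    }

    // Step 3: list the image URLs referenced by the article body,
    // ready to be downloaded to local storage.
    public static List<string> ExtractImageUrls(string articleHtml)
    {
        var urls = new List<string>();
        foreach (Match m in Regex.Matches(articleHtml,
                 "<img[^>]+src=\"([^\"]+)\"", RegexOptions.IgnoreCase))
            urls.Add(m.Groups[1].Value);
        return urls;
    }
}
```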

Once these three steps are in place, you can process the results however you like: generate PDF or CHM files, build a static site, publish remotely to other sites, and so on (please keep following this series, and be generous with the recommend button; your support is my greatest motivation to write).

Song Bo
Source: http://www.cnblogs.com/ice-river/
The copyright of this article belongs to the author and the cnblogs Blog Park. You are welcome to reprint it, but without the author's consent you must retain this paragraph and provide a clearly visible link to the original on the article page.
To the reader looking at my blog: I can tell you carry yourself well, with a faint air of royalty about you; great things await you! The word "recommend" is right next to this post, so go ahead and click it. I won't take a penny from you, and you're welcome to come back!

