Section One: Overview of the series and background
The opening articles of this series received a warm welcome, which is a great encouragement to a blogger. This is the third article in the series; I hope you will keep supporting it and give me the motivation to keep writing.
In the more than 3 years since its release, my self-developed Bean John Blog Backup Expert software has been well received by many bloggers and reading enthusiasts. Along the way, some technology enthusiasts have asked me how its various practical features are implemented.
The software is developed with .NET. To give back to the community, I am opening a column and writing a series of articles about the core technology used in the software.
Besides explaining the important techniques used in web scraping, this series also offers solutions to a number of problems and some experience in UI development. It is well suited to beginner and intermediate .NET developers, and I hope for your support.
Many beginners are puzzled by the same thing: "I have read the books and covered every aspect of C#, so why can't I write a decent application?"
The reason is that they have not yet learned to apply that knowledge as a whole, trained their programming thinking, or built up an interest in learning. I think this series may help with that; I hope so.
Development environment: VS2008
Source code for this article: https://github.com/songboriceboy/GetWebAllPics
To download the source code: install an SVN client (provided at the end of this article), then check out the following address: https://github.com/songboriceboy/GetWebAllPics.git
The outline of the article series is as follows:
1. How to obtain the links and titles of all of a blogger's posts with C#
2. How to obtain the body and title of a blog post with C#
3. How to convert HTML pages to PDF (html2pdf) with C#
4. How to download all the pictures in a blog post to local disk with C# and browse them offline
5. How to merge multiple individual PDF files into one PDF and generate a table of contents with C#
6. How to obtain NetEase blog links with C#, and what is special about NetEase blogs
7. How to download public account articles with C#
8. How to obtain the full text of any article
9. How to strip all the tags from HTML to get plain text (html2txt) with C#
10. How to compile multiple HTML files into a CHM file (html2chm) with C#
11. How to publish articles remotely to Sina Blog with C#
12. How to develop a static site builder with C#
13. How to build an application framework with C# (classic WinForm interface: top menu bar, toolbars, tree list on the left, multi-tab area on the right)
14. How to implement a web page editor (WinForm) with C# ...
Section Two: Introduction to the main content (how to download all the pictures in a blog post to local disk with C# and browse them offline)
Scraping web pages consists of 3 main steps:
1. Collect the links to all articles by following the pagination links (article 1 of this series)
2. Get the title and body of each article through its link (article 2 of this series)
3. Parse all the picture links out of the article body and download every picture in the article to local disk (this article)
With these 3 steps in place, you can process the content however you like: generate PDF or CHM files, build a static site, publish remotely to other sites, and so on.
A demo of the solution (downloading all the pictures in a blog post to local disk with C# for offline browsing) is available as an executable file download.
When you click the Download Body button, a folder named after the page title is created in the directory where the executable is located. The folder contains an HTML file (the body of the page) and all the pictures in the body. The HTML file is a processed version of the original body HTML in which the image links have been changed to point to the local image files.
Section Three: Basic principles
Downloading all the pictures in a blog post can be broken down into 3 steps:
1. Download the body of the web page and find all the image link addresses;
2. For each image link address, download the image to local disk under a file name we generate, and replace the original image address with that file name;
3. Once step 2 has downloaded all the pictures, save the page body, with all its image links replaced, as a new HTML file (index.html).
Next, let's walk through each step:
1. Download the body of the web page and find all the image link addresses;
For how to download the body of the page, see the second article in this series. Here is how to get all the picture links in the page body:
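    private void GetSrcLinks()
    {
        // Select every node that carries a src attribute.
        HtmlNodeCollection atts = m_Doc.DocumentNode.SelectNodes("//*[@src]");
        // SelectNodes returns null when nothing matches.
        if (atts == null)
        {
            return;
        }

        // Extract the link from each node and drop duplicates.
        Links = atts
            .SelectMany(n => new[] { ParseLink(n, "src") })
            .Distinct()
            .ToArray();
    }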
The HtmlDocument class in HtmlAgilityPack finds every node that has a src attribute, and LINQ then extracts the link address from each of those nodes.
2. For each image link address, download the image to local disk, as shown in the following code:
    DocumentWithLinks links = htmlDoc.GetSrcLinks();
    int i = 1;
    // Scheme and host of the page, used to resolve relative picture links.
    string baseUrl = new Uri(strLink).GetLeftPart(UriPartial.Authority);

    foreach (string strPicLink in links.Links)
    {
        if (string.IsNullOrEmpty(strPicLink))
        {
            continue;
        }

        try
        {
            string strExtension = System.IO.Path.GetExtension(strPicLink);

            // src attributes also appear on scripts and Flash objects; skip those.
            if (strExtension == ".js" || strExtension == ".swf")
                continue;

            // Links without an extension are saved as .jpg.
            if (strExtension == "")
            {
                strExtension = ".jpg";
            }

            string normalizedPicLink = GetNormalizedLink(baseUrl, strPicLink);
            strNewPage = DownloadPicInternal(wc, strNewPage, strPageTitle, strPicLink, normalizedPicLink, strExtension, ref i);
        }
        catch (Exception)
        {
            // A picture that fails to download is skipped so the rest of the page still gets processed.
        } // end try
    }
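The body of GetNormalizedLink is not shown in this article. As a rough sketch of what such a helper might do, the class below resolves a picture link against the base URL with the standard Uri.TryCreate overload. The class and method names (LinkNormalizer, Normalize) are hypothetical and only illustrate the idea; they are not the author's implementation.

```csharp
using System;

public static class LinkNormalizer
{
    // Hypothetical stand-in for GetNormalizedLink: turns a relative picture
    // link into an absolute one, and leaves absolute links unchanged.
    public static string Normalize(string baseUrl, string link)
    {
        Uri baseUri = new Uri(baseUrl, UriKind.Absolute);
        Uri result;

        // Uri.TryCreate(Uri, string, out Uri) resolves "link" against
        // "baseUri"; an already absolute link is returned as it is.
        if (Uri.TryCreate(baseUri, link, out result))
        {
            return result.AbsoluteUri;
        }

        // Fall back to the original text if the link cannot be parsed.
        return link;
    }
}
```

For example, Normalize("http://www.example.com", "/img/a.png") yields an address on www.example.com, while a link that already starts with http:// keeps its own host. Note that a relative link without a leading slash is strictly relative to the page address, not to the site root, so passing the full page URL as the base may be more accurate in that case.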
The implementation of DownloadPicInternal is as follows:
    protected string DownloadPicInternal(WebClient wc, string strNewPage, string strPageTitle,
        string strPicLink, string strTureLink, string strExtension, ref int i)
    {
        // Strip characters that Windows does not allow in folder names.
        strPageTitle = strPageTitle.Replace("\\", "").Replace("/", "").Replace(":", "").Replace("*", "").Replace("?", "")
            .Replace("\"", "").Replace("<", "").Replace(">", "").Replace("|", "");
        strPageTitle = Regex.Replace(strPageTitle, @"[|/\;.':*?<>-]", "").ToString();
        strPageTitle = Regex.Replace(strPageTitle, "[\"]", "").ToString();
        strPageTitle = Regex.Replace(strPageTitle, @"\s", "");

        // Create the output folder if it does not exist yet.
        if (!Directory.Exists(Application.StartupPath + "\\" + strPageTitle))
        {
            Directory.CreateDirectory(Application.StartupPath + "\\" + strPageTitle);
        }

        // Turn the picture link into a local file name.
        int[] nArrayOffset = new int[2];
        nArrayOffset = m_bf.getOffset(strPicLink);
        // Point the link in the page body at the local file.
        strNewPage = strNewPage.Replace(strPicLink, nArrayOffset[0].ToString() + nArrayOffset[1].ToString() + strExtension);
        string strSavedPicPath = Path.Combine(strPageTitle, nArrayOffset[0].ToString() + nArrayOffset[1].ToString() + strExtension);

        PrintLog("Starting download of picture " + i.ToString() + " of article [" + strPageTitle + "]\n");
        strTureLink = HttpUtility.UrlDecode(strTureLink);
        wc.DownloadFile(strTureLink, Application.StartupPath + "\\" + strSavedPicPath);
        PrintLog("Finished download of picture " + i.ToString() + " of article [" + strPageTitle + "]\n");

        // Short pause between downloads. The original delay value was lost
        // in extraction; 100 ms is an assumption.
        System.Threading.Thread.Sleep(100);
        i++;
        return strNewPage;
    }
The m_bf variable used in the code above is an object of type BloomFilter. A Bloom filter is a powerful tool for deduplicating web pages; here it is used to turn each image link into a unique file name.
    strNewPage = strNewPage.Replace(strPicLink, nArrayOffset[0].ToString() + nArrayOffset[1].ToString() + strExtension);
This line of code replaces the image link in the original page with the new picture file name. The other parts of the code were explained in the previous article; please refer to it.
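The BloomFilter class itself is not listed in this article. The sketch below only illustrates the general idea of mapping a link to two hash-based offsets and joining them into a file name. The class LinkFileNamer, the two hash functions (FNV-1a and djb2) and the table size are all assumptions chosen for illustration, not the author's implementation.

```csharp
using System;

public static class LinkFileNamer
{
    // Size of the hypothetical bit table the offsets index into (an assumption).
    private const uint TableSize = 1u << 24;

    // 32-bit FNV-1a hash of the link text.
    private static uint Fnv1a(string s)
    {
        uint hash = 2166136261;
        foreach (char c in s)
        {
            unchecked
            {
                hash ^= c;
                hash *= 16777619;
            }
        }
        return hash;
    }

    // djb2 hash of the link text, used as the second, independent hash.
    private static uint Djb2(string s)
    {
        uint hash = 5381;
        foreach (char c in s)
        {
            hash = unchecked(hash * 33 + c);
        }
        return hash;
    }

    // Two offsets into the table, in the spirit of a Bloom filter lookup.
    public static int[] GetOffsets(string link)
    {
        return new int[]
        {
            (int)(Fnv1a(link) % TableSize),
            (int)(Djb2(link) % TableSize)
        };
    }

    // The same link always yields the same file name.
    public static string ToFileName(string link, string extension)
    {
        int[] offsets = GetOffsets(link);
        return offsets[0].ToString() + offsets[1].ToString() + extension;
    }
}
```

Because the name is derived only from the link text, the same picture referenced twice maps to the same local file. Two different links could in principle produce the same name, either through hash collisions or because the two numbers are joined without a separator, so a separator or a longer hash would be a safer choice where that matters.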
3. Once step 2 has downloaded all the pictures, save the page body, with all its image links replaced, as a new HTML file (index.html). The main code is as follows:
    // Strip characters that Windows does not allow in folder names.
    strPageTitle = strPageTitle.Replace("\\", "").Replace("/", "").Replace(":", "").Replace("*", "").Replace("?", "")
        .Replace("\"", "").Replace("<", "").Replace(">", "").Replace("|", "");
    strPageTitle = Regex.Replace(strPageTitle, @"[|/\;.':*?<>-]", "").ToString();
    strPageTitle = Regex.Replace(strPageTitle, "[\"]", "").ToString();
    strPageTitle = Regex.Replace(strPageTitle, @"\s", "");

    // Write index.html into the folder under Application.StartupPath that holds
    // the downloaded pictures, so the rewritten image links resolve.
    File.WriteAllText(
        Path.Combine(Path.Combine(Application.StartupPath, strPageTitle), "index.html"),
        strNewPage, Encoding.UTF8);
The string of substitutions above is needed because Windows does not allow certain special characters in folder names; here we strip those characters with regular-expression replacement.
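As an alternative to listing the forbidden characters by hand, the framework can supply them. The following is a minimal sketch built on Path.GetInvalidFileNameChars; the class name TitleSanitizer and the "untitled" fallback are hypothetical. Unlike the code above, it does not remove characters such as '.', ';', the apostrophe or '-' from inside the name, and the set of invalid characters depends on the operating system.

```csharp
using System;
using System.IO;
using System.Text;

public static class TitleSanitizer
{
    // Builds a folder name from a page title by dropping whitespace and
    // every character the current platform forbids in file names.
    public static string ToFolderName(string title)
    {
        char[] invalid = Path.GetInvalidFileNameChars();
        StringBuilder sb = new StringBuilder(title.Length);

        foreach (char c in title)
        {
            if (Array.IndexOf(invalid, c) < 0 && !char.IsWhiteSpace(c))
            {
                sb.Append(c);
            }
        }

        // Windows does not accept folder names that end with a dot.
        string name = sb.ToString().TrimEnd('.');

        // Hypothetical fallback for titles made up only of stripped characters.
        return name.Length > 0 ? name : "untitled";
    }
}
```

Calling ToFolderName once and reusing the result for both the folder and index.html would also avoid repeating the same substitutions in two places.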
At this point we can download all the pictures in the body of any page to local disk and rewrite the image links in the body, so that the page can be browsed offline.
The PDF and CHM generation covered in later articles builds on this, which makes this section one of the most important in the series. Interested readers can extend the code I provide; turning it into a whole-site picture collector should also be straightforward.
Section Four: Preview of the next article
How to convert HTML pages into PDF (html2pdf) with C#.
Song Bo
Source: http://www.cnblogs.com/ice-river/
The copyright of this article is shared by the author and Blog Park. Reprinting is welcome, but unless the author agrees otherwise, this notice must be retained and a link to the original must be given in a prominent place on the article page.
Analysis of the core technology of web scraping software, series (3): How to download all the pictures in a blog post to local disk with C# and browse them offline