C# Network Programming with the WebBrowser Control: Getting Web Page URLs and Downloading Images
This article uses the WebBrowser control in C# network programming to obtain the URLs in a web page and to try a simple download of the page's images, mainly as groundwork for future network development. The application combines basic web page knowledge with regular expressions to implement three features: browsing, getting URLs, and downloading images. Each step builds on the previous one.
I. Interface Design
The interface is designed as follows. Add the controls and set the Anchor property of webBrowser1 to Top, Bottom, Left, Right so that the browser resizes together with the dialog box. Set the Dock property of groupBox1 (which defines the container border the control is bound to) to Bottom so that groupBox1 always stays at the bottom when the window is resized. Set the HorizontalScrollbar property of each listBox to True to display a horizontal scroll bar.
II. Source Code
1. Namespaces
// Add the new namespaces
using System.Net;
using System.IO;
using System.Text.RegularExpressions; // regular expressions
2. Browsing
Click the "Browse" button to generate the button1_Click(object sender, EventArgs e) click event handler, and add the following code to display the web page:
private void button1_Click(object sender, EventArgs e)
{
    webBrowser1.Navigate(textBox1.Text.Trim()); // display the web page
}
The code calls the Navigate method of WebBrowser, which loads the document at the specified location into the control. The overload Navigate(string urlString) loads the document at the specified Uniform Resource Locator (URL) into the WebBrowser control, replacing the previous document.
3. Get
Click the "Get" button to generate the button2_Click(object sender, EventArgs e) click event handler, and add the following code. html.OuterHtml holds the HTML content of the current web page. Regular expressions then extract all hyperlinks and image tags from that content, and the results are displayed in the listBox controls.
// num records the number of image URLs in listBox2
public int num = 0;

// Click the "Get" button
private void button2_Click(object sender, EventArgs e)
{
    HtmlElement html = webBrowser1.Document.Body; // the HTML body element
    string str = html.OuterHtml;                  // HTML code of the current element
    MatchCollection matches;                      // set of regular expression matches

    // clear the previous results
    listBox1.Items.Clear();
    listBox2.Items.Clear();

    try
    {
        // get the <a href></a> content with a regular expression
        matches = Regex.Matches(str, "<a href=\"([^\"]*?)\".*?>(.*?)</a>", RegexOptions.IgnoreCase);
        foreach (Match match in matches)
        {
            listBox1.Items.Add(match.Value.ToString());
        }

        // get the image tags; the URL itself is captured in the named group "imgUrl"
        matches = Regex.Matches(str, @"<img\b[^<>]*?\bsrc[\s\t\r\n]*=[\s\t\r\n]*[""']?[\s\t\r\n]*(?<imgUrl>[^\s\t\r\n""'<>]*)[^<>]*?/?[\s\t\r\n]*>", RegexOptions.IgnoreCase);
        foreach (Match match in matches)
        {
            listBox2.Items.Add(match.Value.ToString());
        }

        // record the image count
        num = listBox2.Items.Count;
    }
    catch (Exception msg)
    {
        MessageBox.Show(msg.Message); // exception handling
    }
}
MatchCollection Regex.Matches(string input, string pattern, RegexOptions options) searches the input string for all occurrences of the specified regular expression pattern, using the specified matching options. RegexOptions.IgnoreCase above specifies case-insensitive matching. During the download step, the success messages are also added to listBox2, so num records the total number of images before any messages are appended.
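The behaviour of Regex.Matches described above can be tried outside the form with a small console-style sketch. The class name LinkExtractor, the method GetHrefs, and the simplified pattern are assumptions made for illustration; they are not part of the original program, which stores the whole matched tag rather than the captured group.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LinkExtractor
{
    // Hypothetical helper: returns the href value of every <a href="..."> tag.
    // The pattern is a simplified assumption and only handles double-quoted values.
    public static List<string> GetHrefs(string html)
    {
        var result = new List<string>();
        MatchCollection matches = Regex.Matches(
            html, "<a\\s+href\\s*=\\s*\"([^\"]*)\"", RegexOptions.IgnoreCase);
        foreach (Match m in matches)
        {
            // Groups[0] is the whole match; Groups[1] is the first capture (the URL)
            result.Add(m.Groups[1].Value);
        }
        return result;
    }
}
```

Because of RegexOptions.IgnoreCase, a tag written as <A HREF="http://news.baidu.com"> is matched in the same way as a lowercase one, and the captured group holds only the URL.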
4. Download
In "Get" we obtained all hyperlinks and image tags in the page. Now we want to download the images, but each entry in listBox2 is a whole tag, usually of the form "<img src="http://www.baidu.com/img/bdlogo.gif" width="270" height="129">". Only the content of src is needed to access the image, and the image can then be saved with ordinary file operations. The src value is extracted with a regular expression. The code is as follows:
// Click "Download" to download the images
private void button3_Click(object sender, EventArgs e)
{
    // loop over the image tags recorded by "Get"
    for (int j = 0; j < num; j++)
    {
        string content = listBox2.Items[j].ToString(); // the matched <img> tag
        // capture the src value of the <img> tag in the named group "src"
        Regex reg = new Regex(@"<img.*?src=""(?<src>[^""]*)""[^>]*>", RegexOptions.IgnoreCase);
        MatchCollection mc = reg.Matches(content);
        foreach (Match m in mc)
        {
            try
            {
                WebRequest request = WebRequest.Create(m.Groups["src"].Value); // image src content
                WebResponse response = request.GetResponse();
                Stream reader = response.GetResponseStream();
                string path = "E://" + j.ToString() + ".jpg"; // path the image is saved to
                FileStream writer = new FileStream(path, FileMode.OpenOrCreate, FileAccess.Write);
                byte[] buff = new byte[512];
                int c = 0; // actual number of bytes read
                // the assignment needs its own parentheses, otherwise "> 0" binds first
                while ((c = reader.Read(buff, 0, buff.Length)) > 0)
                {
                    writer.Write(buff, 0, c);
                }
                // release the resources
                writer.Close();
                writer.Dispose();
                reader.Close();
                reader.Dispose();
                response.Close();
                // the download succeeded
                listBox2.Items.Add(path + ": The image is saved successfully!");
            }
            catch (Exception msg)
            {
                MessageBox.Show(msg.Message);
            }
        }
    }
}
This code has several problems:
(1) The downloaded image is not necessarily a jpg. The code only demonstrates the idea; choose the file extension according to the actual image type.
(2) Downloading through a file stream is slow. Other methods such as WebClient.DownloadFile() can be used instead. I used this basic method because I had just been studying file handling and web crawlers.
(3) The two nested loops are somewhat redundant, but MatchCollection mc is a set of matches, so the inner loop is kept. Overall the code still feels a bit messy.
(4) To download images in batches, it is best to use threads, together with well-chosen algorithms (algorithms matter most here) and in-memory processing. This program covers only the basics.
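Points (1) and (2) can be addressed along the following lines. This is only a sketch under stated assumptions: the class ImageDownloader and its methods GetExtension and Download are hypothetical names, the fallback to ".jpg" is an arbitrary choice, and the extension is guessed from the URL path rather than from the actual image data. WebClient.DownloadFile is the method mentioned in point (2); newer .NET versions mark WebClient as obsolete, so treat it as an option for the .NET Framework setting of this article.

```csharp
using System;
using System.IO;
using System.Net;

public static class ImageDownloader
{
    // Hypothetical helper: takes the file extension from the URL path
    // (the query string is ignored) and falls back to ".jpg" when there is none.
    public static string GetExtension(string imageUrl)
    {
        string ext = Path.GetExtension(new Uri(imageUrl).AbsolutePath);
        return string.IsNullOrEmpty(ext) ? ".jpg" : ext;
    }

    // Hypothetical helper: downloads one image into the given folder and
    // returns the path it was saved to. Exceptions are left to the caller.
    public static string Download(string imageUrl, string folder, int index)
    {
        string path = Path.Combine(folder, index + GetExtension(imageUrl));
        using (var client = new WebClient())
        {
            client.DownloadFile(imageUrl, path);
        }
        return path;
    }
}
```

Inside the download loop, a call such as ImageDownloader.Download(m.Groups["src"].Value, folder, j) could stand in for the WebRequest and FileStream code, where folder is whatever directory the images should be saved in.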
III. Running Results
The program runs as follows. Click "Browse" to open the web page, click "Get" to extract the page's URLs and display them in the listBox controls, and finally click "Download" to save the images to the E: drive. The image downloaded while browsing Baidu is its logo. (If an image's src does not contain the full URL path, you need to complete the URL yourself.)
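Completing an image URL whose src lacks the site address can be sketched with System.Uri. The class UrlResolver and the method ToAbsolute are hypothetical names; in the application the page address could presumably be taken from webBrowser1.Url, which is an assumption and not something the original program does.

```csharp
using System;

public static class UrlResolver
{
    // Hypothetical helper: combines the page address with the src value.
    // An src that is already absolute is returned unchanged;
    // null is returned when the two cannot be combined.
    public static string ToAbsolute(string pageUrl, string src)
    {
        Uri result;
        if (Uri.TryCreate(new Uri(pageUrl), src, out result))
        {
            return result.AbsoluteUri;
        }
        return null;
    }
}
```

For example, resolving "img/bdlogo.gif" against the page "http://www.baidu.com/" gives "http://www.baidu.com/img/bdlogo.gif".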
IV. Basic Web Page Knowledge
This section introduces the basics of hyperlinks and image links in HTML to help you better understand this article. (Reference: Zhao Fengnian, "Web Page Creation Tutorial".)
1. Creating a hyperlink
To create a hyperlink in a web page, use the <A> tag; its closing tag is </A>. Its basic attribute is href, which specifies the hyperlink target. Different types of hyperlinks are created by giving href different values, and the object between <A> and </A> (text or an image) serves as the clickable hyperlink source. For example, on the Baidu home page: "<a href="http://news.baidu.com">News</a>". (Anchor links are not covered here.)
2. Inserting an image
Insert an image into a web page with the IMG tag. Its two basic attributes are src and alt, which set the location of the image file and the alternative text respectively.
(1) The src attribute gives the file name of the image to insert, including an absolute or relative path.
(2) The alt attribute gives a short text description of the image. It is displayed in place of the image in browsers that cannot show images or while the image takes too long to load. An example is the Baidu home page logo image, whose tag is quoted in the Download section above.
V. Regular Expressions
A regular expression is a string of characters that defines a pattern for searching and matching strings. Many languages, including Perl, PHP, Python, JavaScript, and JScript, support regular expressions for text processing, and some text editors use them for advanced search-and-replace. I had previously met regular expressions when validating user names and passwords and when working with web pages, so I needed to study this topic as well. This article uses three regular expressions. The following two code snippets are especially useful:
1. Retrieve URLs of all images in HTML (Reference: http://blog.csdn.net/smeller/article/details/7108502)
/// <summary>
/// Get the URLs of all images in the HTML
/// </summary>
/// <param name="sHtmlText">HTML code</param>
/// <returns>Image URL list</returns>
public static string[] GetHtmlImageUrlList(string sHtmlText)
{
    // regular expression that matches img tags
    Regex regImg = new Regex(@"<img\b[^<>]*?\bsrc[\s\t\r\n]*=[\s\t\r\n]*[""']?[\s\t\r\n]*(?<imgUrl>[^\s\t\r\n""'<>]*)[^<>]*?/?[\s\t\r\n]*>", RegexOptions.IgnoreCase);

    // search for the matching strings
    MatchCollection matches = regImg.Matches(sHtmlText);
    int i = 0;
    string[] sUrlList = new string[matches.Count];

    // copy the URL captured by each match into the list
    foreach (Match match in matches)
    {
        sUrlList[i++] = match.Groups["imgUrl"].Value;
    }
    return sUrlList;
}
2. Obtain the src path of the image and save it. (Reference: http://bbs.csdn.net/topics/320001867)
/// <summary>
/// Get the image paths and store them
/// </summary>
/// <param name="M_Content">Content to search</param>
/// <returns>IList</returns>
public static IList<string> GetPicPath(string M_Content)
{
    IList<string> im = new List<string>(); // generic list of strings
    // capture the src value of each <img> tag in the named group "src"
    Regex reg = new Regex(@"<img.*?src=""(?<src>[^""]*)""[^>]*>", RegexOptions.IgnoreCase);
    MatchCollection mc = reg.Matches(M_Content); // match against the given string
    foreach (Match m in mc)
    {
        im.Add(m.Groups["src"].Value);
    }
    return im;
}
VI. Summary
This article explained how to get URLs and download images, the basis of a web crawler, in C# network programming. The flow is: first browse the page to obtain at least its HTML content; then extract the <A href></A> content with a simple regular expression; to download an image, obtain the image tag, extract the src URL from it with a regular expression, and download from that URL. Many download methods can be used. File streams are used here, and multithreaded batch downloading is preferable. (Free download: http://download.csdn.net/detail/eastmount/6355125)
This article mainly introduces some basic network knowledge, along with two basics I am still studying: regular expressions and web pages. Finally, I would like to thank the bloggers and websites referenced in this article. I hope the article is helpful to you. If it contains errors or omissions, please bear with me.