C# Network Programming: Getting Web Page URLs and Downloading Pictures with WebBrowser

Source: Internet
Author: User

This article uses the WebBrowser control in C# network programming to get the URLs in a web page and makes a simple attempt at downloading the pictures in that page, mainly as a foundation for later study of network development. It applies basic web knowledge and regular expressions to implement three features: browsing a page, getting URLs, and downloading images. Each step builds on the previous one.

I. Interface design

Design the interface by adding the controls to the form. Set the Anchor property of webBrowser1 to Top, Bottom, Left, Right so that it resizes with the dialog. Set the Dock property of groupBox1 (which defines the borders of the control that are bound to its container) to Bottom, so that groupBox1 always stays at the bottom when the window is resized. Set the HorizontalScrollbar property of each ListBox to true to display a horizontal scroll bar.

II. Source Code

1. Namespaces

[CSharp]
    // Add the new namespaces
    using System.Net;
    using System.IO;
    using System.Text.RegularExpressions; // regular expressions

2. Browse

Double-click the "Browse" button to generate the button1_Click(object sender, EventArgs e) click event handler, then add the following code to load the web page:

[CSharp]
    private void button1_Click(object sender, EventArgs e)
    {
        // Load and display the page at the URL typed into the text box
        webBrowser1.Navigate(textBox1.Text.Trim());
    }

Calling the Navigate method of the WebBrowser control loads the document at the specified location into the control. The overload Navigate(string urlString) loads the document at the specified Uniform Resource Locator (URL) into the WebBrowser control, replacing the previous document.

3. Get

Double-click the "Get" button to generate the button2_Click(object sender, EventArgs e) click event handler and add the following code. It reads the HTML content of the current page through html.OuterHtml, uses regular expressions to find all hyperlinks and pictures in the page, and displays them in the ListBox controls.

[CSharp]
    // num records the number of picture tags collected in listBox2
    public int num = 0;

    // Click the "Get" button
    private void button2_Click(object sender, EventArgs e)
    {
        HtmlElement html = webBrowser1.Document.Body; // HTML element for the page body
        string str = html.OuterHtml;                  // HTML code of the current element
        MatchCollection matches;                      // collection of regular expression matches

        // Clear the previous results
        listBox1.Items.Clear();
        listBox2.Items.Clear();

        // Get
        try
        {
            // Regular expression to get the <a href></a> hyperlinks
            matches = Regex.Matches(str, "<a href=\"([^\"]*?)\".*?>(.*?)</a>", RegexOptions.IgnoreCase);
            foreach (Match match in matches)
            {
                listBox1.Items.Add(match.Value);
            }

            // Regular expression to get the picture tags
            matches = Regex.Matches(str, @"<img\b[^<>]*?\bsrc[\s\t\r\n]*=[\s\t\r\n]*[""']?[\s\t\r\n]*(?<imgUrl>[^\s\t\r\n""'<>]*)[^<>]*?/?[\s\t\r\n]*>", RegexOptions.IgnoreCase);
            foreach (Match match in matches)
            {
                listBox2.Items.Add(match.Value);
            }

            // Record the total number of pictures
            num = listBox2.Items.Count;
        }
        catch (Exception msg)
        {
            MessageBox.Show(msg.Message); // exception handling
        }
    }

Here, MatchCollection Regex.Matches(string input, string pattern, RegexOptions options) searches the input string for all occurrences of the specified regular expression, using the specified matching options. RegexOptions.IgnoreCase specifies case-insensitive matching. Because the download step also appends its success messages to listBox2, num records the total number of pictures here first.
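As a small illustration of Regex.Matches with RegexOptions.IgnoreCase, the sketch below runs the hyperlink pattern from the "Get" handler against a made-up HTML string in a console program. The class name LinkDemo, the helper GetHrefs and the sample markup are assumptions for illustration only; they are not part of the original program.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LinkDemo
{
    // Hypothetical helper: returns the href value of every <a href="..."> tag found.
    public static List<string> GetHrefs(string html)
    {
        var result = new List<string>();
        MatchCollection matches = Regex.Matches(
            html, "<a href=\"([^\"]*?)\".*?>(.*?)</a>", RegexOptions.IgnoreCase);
        foreach (Match match in matches)
        {
            // match.Value is the whole tag; Groups[1] is the first capture (the URL).
            result.Add(match.Groups[1].Value);
        }
        return result;
    }

    public static void Main()
    {
        // Made-up sample markup; the upper-case tag still matches because of IgnoreCase.
        string html = "<A HREF=\"http://news.baidu.com\">News</A> <a href=\"/about\">About</a>";
        foreach (string url in GetHrefs(html))
        {
            Console.WriteLine(url);
        }
    }
}
```

The handler above adds match.Value (the complete tag) to the list box; reading Groups[1] instead yields only the URL, which may be more convenient if the link is to be followed later.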

4. Download

In "Get" we obtained the hyperlinks and the picture tags of the page. Now we want to download the pictures, but each entry in listBox2 is a complete tag, usually of the form <img src="http://www.baidu.com/img/bdlogo.gif" width="" height="129">. So we only need to extract the content of src to access the image, and then use basic file operations to implement a simple picture download. The value of src is, again, obtained with a regular expression. The code is as follows:

[CSharp]
    // Click "Download" to download the pictures
    private void button3_Click(object sender, EventArgs e)
    {
        // Download the pictures one by one
        for (int j = 0; j < num; j++)
        {
            string content = listBox2.Items[j].ToString(); // picture tag text
            Regex reg = new Regex(@"<img.*?src=""(?<src>[^""]*)""[^>]*>", RegexOptions.IgnoreCase);
            MatchCollection mc = reg.Matches(content);     // search the tag text
            foreach (Match m in mc)
            {
                try
                {
                    WebRequest request = WebRequest.Create(m.Groups["src"].Value); // picture src content
                    WebResponse response = request.GetResponse();
                    // Read the picture through a stream
                    Stream reader = response.GetResponseStream();
                    string path = "e://" + j.ToString() + ".jpg"; // picture file name
                    // FileMode.Create truncates an existing file, so no stale bytes remain
                    FileStream writer = new FileStream(path, FileMode.Create, FileAccess.Write);
                    byte[] buff = new byte[512];
                    int c = 0; // number of bytes actually read
                    while ((c = reader.Read(buff, 0, buff.Length)) > 0)
                    {
                        writer.Write(buff, 0, c);
                    }
                    // Release resources
                    writer.Close();
                    writer.Dispose();
                    reader.Close();
                    reader.Dispose();
                    response.Close();
                    // Download successful
                    listBox2.Items.Add(path + ": Picture saved successfully!");
                }
                catch (Exception msg)
                {
                    MessageBox.Show(msg.Message);
                }
            }
        }
    }

This part of the code has several possible problems: (1) The downloaded picture is not necessarily in JPG format. The code is mainly meant to show the idea, and the extension can be set according to the actual picture. (2) Downloading through a file stream like this is slow. Other methods such as WebClient.DownloadFile() can be used instead; I used this basic method because I happened to be studying file handling and web crawlers. (3) The two nested loops are somewhat redundant, but MatchCollection mc returns a collection of matches, so the overall result feels a little messy. (4) To download pictures in bulk, it is better to use multithreading together with a well-chosen algorithm (the algorithm is the key point) and in-memory access. This program only covers the basics.
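Points (1) and (2) could be handled roughly as in the sketch below, which takes the file extension from the URL instead of always using .jpg and lets WebClient.DownloadFile do the copying. The class ImageSaver, its method names and the ".jpg" fallback are assumptions made for illustration; this is not the article's original code. Note that newer .NET versions mark WebClient as obsolete in favor of HttpClient, although it still works.

```csharp
using System;
using System.IO;
using System.Net;

public static class ImageSaver
{
    // Hypothetical helper: takes the extension from the URL path (ignoring any
    // query string) and falls back to ".jpg" when the path has none.
    public static string GetExtension(string imageUrl)
    {
        string ext = Path.GetExtension(new Uri(imageUrl).AbsolutePath);
        return string.IsNullOrEmpty(ext) ? ".jpg" : ext;
    }

    // Saves one picture as <folder>/<index><ext> and returns the path written.
    public static string Download(string imageUrl, string folder, int index)
    {
        string path = Path.Combine(folder, index + GetExtension(imageUrl));
        using (var client = new WebClient())
        {
            client.DownloadFile(imageUrl, path); // blocks until the file is saved
        }
        return path;
    }

    public static void Main()
    {
        // Only the naming logic is exercised here, so no network access is needed.
        Console.WriteLine(GetExtension("http://www.baidu.com/img/bdlogo.gif"));
    }
}
```

In the download handler, the body of the try block could then shrink to a single call such as ImageSaver.Download(m.Groups["src"].Value, "e://", j), followed by the success message.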

III. Running Results

The results are as follows: clicking the "Browse" button loads the web page, clicking "Get" collects the URLs in the page and displays them in the ListBox controls, and clicking "Download" saves the pictures to the root of drive E. The test run browsed Baidu and downloaded its logo icon. (If the src of a picture does not contain the full source URL path, you need to complete it yourself.)
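Assuming the remark about a missing source URL path refers to relative src values, one possible way to complete them is to resolve each src against the address of the page with the Uri class. The class UrlResolver, the helper ToAbsolute and the sample page address below are hypothetical and only illustrate the idea.

```csharp
using System;

public static class UrlResolver
{
    // Hypothetical helper: turns a possibly relative src into an absolute URL.
    // An src that is already absolute is returned unchanged.
    public static string ToAbsolute(string pageUrl, string src)
    {
        Uri baseUri = new Uri(pageUrl);
        Uri full = new Uri(baseUri, src);
        return full.AbsoluteUri;
    }

    public static void Main()
    {
        // Made-up page address; both calls yield the same absolute image URL.
        Console.WriteLine(ToAbsolute("http://www.baidu.com/index.html", "/img/bdlogo.gif"));
        Console.WriteLine(ToAbsolute("http://www.baidu.com/index.html", "http://www.baidu.com/img/bdlogo.gif"));
    }
}
```

In the form, the page address could be taken from webBrowser1.Url before the src value is passed to the download code.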

IV. Basic knowledge of the web

This section introduces the basics of hyperlinks and pictures in HTML pages, to make this article easier to understand. (Refer to the "Webpage making tutorial" of Zhao Harvest.)

1. Page links

Creating a hyperlink in a web page requires the <a> tag, which is closed by </a>. Its most basic attribute is href, which specifies the target of the hyperlink; giving href different values creates different types of hyperlinks. The content between <a> and </a> (text or a picture) is the clickable source of the hyperlink. An example from the Baidu home page: <a href="http://news.baidu.com">News</a>. (Anchor links are not introduced here.)

2. Insert picture

A picture is inserted into a web page with the HTML <img> tag. Its two essential attributes are src and alt, which set the location of the image file and its alternate text, respectively. The alt attribute is a short text description of the image, shown in place of the image when the browser cannot display it or it takes too long to load. Take the logo image on the Baidu home page: visiting the URL in its src directly displays the image, and this is the main way the program above downloads the pictures in a web page.

V. Regular Expressions

A regular expression is a string of characters that defines a pattern used to search for matching strings. Many languages, including Perl, PHP, Python, JavaScript, and JScript, support text processing with regular expressions, and some text editors use them to implement advanced "search and replace" functions. I have mainly met regular expressions in user name and password validation and in web page processing, so I still need to study this area further. This article uses three regular expressions; the two code snippets below are particularly useful:

1. Get the URL of all pictures in HTML

(Reference: http://blog.csdn.net/smeller/article/details/7108502)

[CSharp]
    /// <summary>
    /// Gets the URLs of all the pictures in the HTML
    /// </summary>
    /// <param name="sHtmlText">HTML code</param>
    /// <returns>List of picture URLs</returns>
    public static string[] GetHtmlImageUrlList(string sHtmlText)
    {
        // Define a regular expression to match an img tag
        Regex regImg = new Regex(@"<img\b[^<>]*?\bsrc[\s\t\r\n]*=[\s\t\r\n]*[""']?[\s\t\r\n]*(?<imgUrl>[^\s\t\r\n""'<>]*)[^<>]*?/?[\s\t\r\n]*>", RegexOptions.IgnoreCase);
        // Search for matching strings
        MatchCollection matches = regImg.Matches(sHtmlText);
        int i = 0;
        string[] sUrlList = new string[matches.Count];
        // Collect the URL captured by the imgUrl group of each match
        foreach (Match match in matches)
        {
            sUrlList[i++] = match.Groups["imgUrl"].Value;
        }
        return sUrlList;
    }

2. Obtain the src path of the picture and save it

(Reference: http://bbs.csdn.net/topics/320001867)

[CSharp]
    // Requires: using System.Collections.Generic;

    /// <summary>
    /// Gets the paths of the pictures and stores them
    /// </summary>
    /// <param name="M_Content">Content to search</param>
    /// <returns>IList</returns>
    public static IList<string> GetPicPath(string M_Content)
    {
        IList<string> im = new List<string>(); // generic list of strings
        Regex reg = new Regex(@"<img.*?src=""(?<src>[^""]*)""[^>]*>", RegexOptions.IgnoreCase);
        MatchCollection mc = reg.Matches(M_Content); // search the content
        foreach (Match m in mc)
        {
            im.Add(m.Groups["src"].Value);
        }
        return im;
    }

VI. Summary

This article gives a basic explanation, within C# network programming, of how a simple web crawler gets URLs and downloads pictures. The steps build on each other: first browse to a URL to obtain the HTML content of the page, then use a simple regular expression to get the content of <a href></a>. To download pictures, first get the picture tags, then extract the URL from src, and finally download the image at that URL. The extraction is again done with regular expressions. Many download methods are available; a file stream is used here, but for bulk downloads it is better to use multithreading and similar techniques. (Free download: http://download.csdn.net/detail/eastmount/6355125) The article mainly introduces some basic network knowledge, which I am still learning myself, along with regular expressions and the basic concepts of web pages. Finally, thanks to the bloggers and others referenced in this article. I hope it is helpful, and if there are errors or shortcomings, please forgive them.
