HTTP GET requests and regular expressions


Foreword: Recently I have been reading a book about ASP whose first chapter shows how to issue a Web request without a browser and read the content the server returns. Searching the Internet for material on HTTP requests, I found that most articles cover requesting server resources with GET and POST. The names already hint at the difference: GET fetches a resource from the server, while POST sends data to the server before the server's response is returned. There are other differences between them, but that is not the topic of this article.

Once I knew how to request a server's data with GET and POST, I could not wait to try it on some real pages, and the result is a WinForm version of Qiushibaike ("Embarrassing Encyclopedia", a Chinese joke site). I will split the process into the following parts, with a download link at the end of the article:

1. Analyze the Qiushibaike page and construct the Web request.
2. Analyze the page's HTML source and extract the required information.
3. Bind the data to the UI.

1. Analyze the Qiushibaike page and construct the Web request

Open the Qiushibaike home page. In this article I only scrape the text-only jokes section, so click "Text" in the menu bar.

1.1 Get the URL of the Qiushibaike content

The URL of the text-only section is http://www.qiushibaike.com/textnew/page/2/?s=4869039. Breaking it down: the host part is http://www.qiushibaike.com; /textnew/page identifies the text-joke section and never changes; the number 2 and the ?s=4869039 suffix are the parts of the URL that vary between pages. Analysis shows that the number 2 is the page index of the text jokes, while ?s=4869039 is not clearly documented, presumably some kind of identifier; it does not seem to matter, so we keep it fixed. In short, changing only the page number in http://www.qiushibaike.com/textnew/page/2/?s=4869039 fetches different pages of text jokes.

1.2 Construct the header information of the HTTP GET request

The previous step gave us the URL of the text-joke page; now we need to imitate the browser and construct a GET request to fetch the page data. Open the browser's developer tools, where you can see the details of the HTTP request the browser builds, such as its header fields. We then reproduce those request headers in code when asking the server for the resource. Note: the marked fields (UserAgent, Method, ContentType in the code below) must be set when the HTTP request object is instantiated, otherwise you will get a wrong response.

1.3 Crawl the page in C#

Based on the analysis above, I use C# with the HttpWebRequest and HttpWebResponse classes from the System.Net namespace to fetch the page content. The source code is as follows:
const string QsbkMainUrl = "http://www.qiushibaike.com";

// Build the URL of a Qiushibaike text-joke page from its page index
private static string GetWbJokeUrl(int pageIndex)
{
    StringBuilder url = new StringBuilder();
    url.Append(QsbkMainUrl);
    url.Append("/textnew/page/");
    url.Append(pageIndex.ToString());
    url.Append("/?s=4869039");
    return url.ToString();
}

// Fetch the HTML source of a page from its URL
private static string GetUrlContent(string url)
{
    try
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        request.UserAgent = "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Maxthon/4.4.8.1000 Chrome/30.0.1599.101 Safari/537.36";
        request.Method = "GET";
        request.ContentType = "text/html;charset=utf-8";
        HttpWebResponse response = (HttpWebResponse)request.GetResponse();
        Stream myResponseStream = response.GetResponseStream();
        // We know the Qiushibaike pages are encoded in UTF-8
        StreamReader myStreamReader = new StreamReader(myResponseStream, Encoding.GetEncoding("utf-8"));
        string retString = myStreamReader.ReadToEnd();
        myStreamReader.Close();
        myResponseStream.Close();
        return retString;
    }
    catch
    {
        return null;
    }
}
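Assuming the URL layout analyzed in 1.1 (only the page number varies, the ?s= token stays fixed), the URL builder can be checked without any network access. The wrapper class below is mine, added only for illustration:

```csharp
using System;
using System.Text;

public static class UrlBuilderDemo
{
    const string QsbkMainUrl = "http://www.qiushibaike.com";

    // Same shape as GetWbJokeUrl above: only the page number varies.
    public static string GetWbJokeUrl(int pageIndex)
    {
        StringBuilder url = new StringBuilder();
        url.Append(QsbkMainUrl);
        url.Append("/textnew/page/");
        url.Append(pageIndex.ToString());
        url.Append("/?s=4869039");
        return url.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(GetWbJokeUrl(2));
        // prints: http://www.qiushibaike.com/textnew/page/2/?s=4869039
    }
}
```

string.Format would build the same URL in one line; the StringBuilder chain is kept here to mirror the article's code.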

2. Analyze the page's HTML source and extract the required information

Step 1 gave us the page content for any page index; the task of this step is to pull the jokes we want out of the returned HTML source. For each joke we extract three parts: the poster's avatar, the poster's nickname, and the joke content.

2.1 Analyze the page and construct the regular expression

First we analyze the HTML source and locate the tags that hold the content we want, and how they are structured. Because every joke on a page is rendered with the same HTML tags, a pattern that can extract one joke can extract all the others as well. Since the structure is essentially fixed, with each part of a joke wrapped in the same tags in the same position, the regular expression can pin down much of the pattern with literal characters, which speeds up matching. The following regular expression matches the content of one joke, capturing its different parts through groups. Of course, there may be cases this expression does not match exactly.
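To illustrate how capture groups separate the fields, here is a minimal, self-contained sketch; the markup snippet and the pattern in it are simplified stand-ins I made up, not the site's real HTML:

```csharp
using System;
using System.Text.RegularExpressions;

public static class GroupDemo
{
    public static void Main()
    {
        // Simplified stand-in for one joke's markup (not the real page source).
        string html = @"<a href=""/avatar/1.jpg""><img></a><a href=""/users/1"">Tom</a><div class=""content"">A short joke.</div>";

        // Three groups: avatar address, nickname, joke content.
        Regex rg = new Regex(@"<a href=""([^""]*)""><img></a><a [^>]*>([^<]*)</a><div class=""content"">([^<]*)</div>");
        Match m = rg.Match(html);
        Console.WriteLine(m.Groups[1].Value); // /avatar/1.jpg
        Console.WriteLine(m.Groups[2].Value); // Tom
        Console.WriteLine(m.Groups[3].Value); // A short joke.
    }
}
```

Groups are numbered by their opening parenthesis, left to right, starting at 1; Groups[0] is the whole match.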
Regex: \s</a>\s<a href="([^"]*)"[^>]*>\s
Here the first group captures the avatar address, the second group captures the nickname, and the third group captures the joke content.

2.2 Write the code that gets all jokes on a page

a. First, build an entity class for a joke:
public class JokeItem
{
    private string nickname;
    /// <summary>
    /// Nickname
    /// </summary>
    public string Nickname
    {
        get { return nickname; }
        set { nickname = value; }
    }

    private Image headImage;
    /// <summary>
    /// Portrait
    /// </summary>
    public Image HeadImage
    {
        get { return headImage; }
        set { headImage = value; }
    }

    private string jokeContent;
    /// <summary>
    /// Joke content
    /// </summary>
    public string JokeContent
    {
        get { return jokeContent; }
        set { jokeContent = value; }
    }

    private string jokeUrl;
    /// <summary>
    /// Joke address
    /// </summary>
    public string JokeUrl
    {
        get { return jokeUrl; }
        set { jokeUrl = value; }
    }
}

b. Use the regex to extract the joke content:

/// <summary>
/// Get the list of jokes on a page
/// </summary>
/// <param name="pageIndex">page index</param>
public static List<JokeItem> GetJokeList(int pageIndex)
{
    string htmlContent = GetUrlContent(GetWbJokeUrl(pageIndex));
    List<JokeItem> jokeList = new List<JokeItem>();
    // Only the beginning of the original pattern survives in this copy of the
    // article; it continues with the groups for nickname and joke content.
    Regex rg = new Regex(@"\s</a>\s<a href=""([^""]*)""[^>]*>\s");
    // Reconstructed from the surrounding description: one JokeItem per match,
    // filled from the three capture groups.
    foreach (Match match in rg.Matches(htmlContent))
    {
        JokeItem joke = new JokeItem();
        joke.HeadImage = GetWebImage(match.Groups[1].Value);
        joke.Nickname = match.Groups[2].Value;
        joke.JokeContent = match.Groups[3].Value;
        jokeList.Add(joke);
    }
    return jokeList;
}
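One detail the regex approach leaves open: the captured joke text still contains HTML artifacts such as <br/> tags and entities like &quot;. A small cleanup helper (my addition, not part of the original source) might look like this:

```csharp
using System;
using System.Text.RegularExpressions;

public static class JokeText
{
    // Turns a raw captured fragment into displayable text:
    // <br/> becomes a newline, a few common entities are decoded.
    public static string Clean(string raw)
    {
        string s = Regex.Replace(raw, @"<br\s*/?>", Environment.NewLine, RegexOptions.IgnoreCase);
        // &amp; must be decoded last, or "&amp;lt;" would wrongly become "<".
        s = s.Replace("&quot;", "\"").Replace("&lt;", "<").Replace("&gt;", ">").Replace("&amp;", "&");
        return s.Trim();
    }
}
```

In a full program System.Web.HttpUtility.HtmlDecode covers all entities; the manual version above just keeps the sketch free of the System.Web reference.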

c. Fetch the avatar image from its URL:

private static Image GetWebImage(string webUrl)
{
    try
    {
        HttpWebRequest req = (HttpWebRequest)WebRequest.Create(new Uri(webUrl));
        HttpWebResponse ress = (HttpWebResponse)req.GetResponse();
        Stream sStreamRes = ress.GetResponseStream();
        return System.Drawing.Image.FromStream(sStreamRes);
    }
    catch { return null; }
}
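The version above never closes the response or its stream. A variant with using blocks (my rewrite, same observable behavior) releases the connection deterministically; note that Image.FromStream requires its stream to stay open for the image's whole lifetime, hence the copy into a MemoryStream:

```csharp
using System;
using System.Drawing;
using System.IO;
using System.Net;

public static class ImageFetcher
{
    public static Image GetWebImage(string webUrl)
    {
        try
        {
            HttpWebRequest req = (HttpWebRequest)WebRequest.Create(new Uri(webUrl));
            // using-blocks close the response and stream even if decoding throws.
            using (HttpWebResponse res = (HttpWebResponse)req.GetResponse())
            using (Stream body = res.GetResponseStream())
            {
                // Copy the bytes into a stream we keep, so the image stays valid
                // after the network stream is closed.
                MemoryStream copy = new MemoryStream();
                byte[] buffer = new byte[4096];
                int read;
                while ((read = body.Read(buffer, 0, buffer.Length)) > 0)
                    copy.Write(buffer, 0, read);
                copy.Position = 0;
                return Image.FromStream(copy);
            }
        }
        catch { return null; }
    }
}
```

As in the article's version, any failure (bad URL, network error, non-image bytes) is swallowed and reported as null.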

3. Data binding

With the data in hand, data binding is the easiest step. Because fetching the data involves Web requests, there can be a few seconds of network latency, so the requests should run on a background thread; here the BackgroundWorker control is used to request the data asynchronously. The UI borrows two third-party controls, a loading wait bar and a data-bound list control. The data-binding code is not shown here; you can download my source code below.

4. Summary

Through this process I gained a better understanding of how HTTP requests work, and the regular expressions I had been studying finally proved useful. Combining familiar techniques with a good idea can produce surprisingly handy little programs; I hope to keep learning and practicing.

Development environment: VS2013, .NET 2.0
Source address: http://download.csdn.net/detail/mingge38/9504931
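The BackgroundWorker wiring can be sketched without any UI: DoWork runs on a thread-pool thread (where the slow GetJokeList call belongs), and RunWorkerCompleted fires when it finishes (in a WinForm, on the UI thread). The demo class and its fake payload are mine; in the real app DoWork would call GetJokeList and the completion handler would bind the list:

```csharp
using System;
using System.ComponentModel;
using System.Threading;

public static class WorkerDemo
{
    public static void Main()
    {
        BackgroundWorker worker = new BackgroundWorker();
        AutoResetEvent done = new AutoResetEvent(false);

        worker.DoWork += (s, e) =>
        {
            // Runs off the UI thread; the real app would do
            // e.Result = GetJokeList((int)e.Argument);
            e.Result = "page " + e.Argument;
        };
        worker.RunWorkerCompleted += (s, e) =>
        {
            // The real app would hide the wait bar and bind e.Result here.
            if (e.Error == null)
                Console.WriteLine(e.Result);
            done.Set();
        };

        worker.RunWorkerAsync(2);  // argument arrives as e.Argument in DoWork
        done.WaitOne();            // console-only stand-in for the message loop
        // prints: page 2
    }
}
```

The AutoResetEvent only keeps this console demo alive until the worker finishes; a WinForm's message loop makes it unnecessary there.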
