Google-Based Site Search in C# with Custom Regular Expression Parsing

Source: Internet
Author: User

Requirement: the website needs a search function, and that search is delegated to an external search engine. Here I use Google search.

 

Custom Search address: http://www.google.com/custom? The regular .com and .cn search result pages contain advertisements, which makes them harder to parse.

 

Enter site: followed by the website address, then the keyword, in the search box. Anyone familiar with SEO knows the site: operator: the search results are then limited to the pages of your website that Google has indexed and that match the keyword.

Search URL:

http://www.google.com/custom?hl=en&newwindow=1&q=site:www.hx-soft.cn+Archives+Digitalization&btnG=Google+Search

These are the default parameters. For details about the other available parameters, see:

http://blog.csdn.net/hean/archive/2008/03/03/2142689.aspx
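As a minimal sketch of how such a URL could be assembled in code: the helper name BuildSearchUrl and the class SearchUrlSketch are hypothetical, and only the parameter names (hl, newwindow, q, btnG) come from the URL above. The query is percent-encoded with Uri.EscapeDataString, so the colon and spaces appear as %3A and %20 rather than in the hand-written form shown above.

```csharp
using System;

public static class SearchUrlSketch
{
    // Hypothetical helper: builds a custom-search URL restricted to one site.
    // Parameter names are copied from the article's URL; nothing else is assumed.
    public static string BuildSearchUrl(string site, string keyword)
    {
        // Raw query is "site:<host> <keyword>"; EscapeDataString percent-encodes it.
        string query = Uri.EscapeDataString("site:" + site + " " + keyword);
        return "http://www.google.com/custom?hl=en&newwindow=1&q=" + query
            + "&btnG=Google+Search";
    }
}
```

Encoding the query this way also keeps keywords that contain characters such as & or # from breaking the URL.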

From the search results page, I only need two parts:

The first part is the total number of search results, which is 42 in this example.

The second part is the main search result list. The general steps are to first download the HTML source of the search page, and then extract the desired parts with regular expressions.

First, create a SearchByGoogle static class and add the method:

    // Requires: using System.Net; using System.Text;

    /// <summary>
    /// Gets the remote HTML source for the given URL.
    /// </summary>
    /// <param name="url">Search URL</param>
    /// <returns>The downloaded page as a string</returns>
    public static string GetSearchHtml(string url)
    {
        WebClient MyWebClient = new WebClient();
        // Credentials used to authenticate requests to Internet resources.
        MyWebClient.Credentials = CredentialCache.DefaultCredentials;
        // Download the raw data from the specified URL.
        byte[] pageData = MyWebClient.DownloadData(url);
        // Decode the page as UTF-8.
        return Encoding.UTF8.GetString(pageData);
    }

This method fetches the HTML for the given URL. With the HTML in hand, parsing can start. Add a method to get the total number of search results:

    // Requires: using System.Text.RegularExpressions;

    /// <summary>
    /// Gets the total number of search results.
    /// </summary>
    /// <param name="pageHtml">Downloaded page HTML</param>
    /// <returns>Result count, or 0 if no results were found</returns>
    public static int IsExistResult(string pageHtml)
    {
        int count = 0; // number of results
        // Matches text such as "about <b>42</b> from" and captures the number.
        Regex reg = new Regex(@"(?:\sabout)? <b>(\d+)</b> from ");
        if (reg.IsMatch(pageHtml))
        {
            Match m = reg.Match(pageHtml);
            if (m.Groups.Count >= 2)
            {
                count = int.Parse(m.Groups[1].Value);
            }
        }
        return count;
    }
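To make the pattern concrete, here is a small sketch that runs a count-extraction regex against a sample header line. The sample HTML is an assumption about what the results page header looked like, and ParseCount and CountRegexSketch are hypothetical names used only for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CountRegexSketch
{
    // Optional " about", then the number in bold, then " from ".
    private static readonly Regex CountPattern =
        new Regex(@"(?:\sabout)? <b>(\d+)</b> from ");

    // Returns the captured total, or 0 when the header is missing.
    public static int ParseCount(string pageHtml)
    {
        if (string.IsNullOrEmpty(pageHtml)) return 0;
        Match m = CountPattern.Match(pageHtml);
        return m.Success ? int.Parse(m.Groups[1].Value) : 0;
    }
}
```

With the assumed header "Results <b>1</b> - <b>10</b> of about <b>42</b> from <b>www.hx-soft.cn</b>", the bold 1 and 10 are skipped because neither is followed by " from ", so only 42 is captured.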

The results are also paged, with 10 results per page. The method above tells you whether any results were found, and the total count it returns can be used to work out the paging parameters.

Analyzing the pagination URL:

http://www.google.com/custom?hl=en&newwindow=1&q=site:www.hx-soft.cn+Archives+Digitalization&start=10&sa=N

This is the second page. It carries the parameter "start=10". The first page is "start=0", and the third page is "start=20".
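The relationship between page number and start value is start = (page - 1) * 10. A minimal sketch, assuming 1-based page numbers and a query string that is already URL-encoded; StartForPage, BuildPageUrl, and PagingSketch are hypothetical names, not part of the original class.

```csharp
using System;

public static class PagingSketch
{
    public const int PageSize = 10; // results per page, as described above

    // Converts a 1-based page number into the "start" parameter value.
    public static int StartForPage(int pageNumber)
    {
        if (pageNumber < 1) throw new ArgumentOutOfRangeException("pageNumber");
        return (pageNumber - 1) * PageSize;
    }

    // Builds the URL of a result page; encodedQuery must already be URL-encoded.
    public static string BuildPageUrl(string encodedQuery, int pageNumber)
    {
        return "http://www.google.com/custom?hl=en&newwindow=1&q=" + encodedQuery
            + "&start=" + StartForPage(pageNumber) + "&sa=N";
    }
}
```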

The method is as follows:

    /// <summary>
    /// Gets the start position of each search result page.
    /// </summary>
    /// <param name="count">Number of results</param>
    /// <returns>Array containing the start position of each page</returns>
    public static int[] GetPageStarts(int count)
    {
        // Calculate the number of pages (10 results per page).
        int pageTotal = count % 10 == 0 ? count / 10 : (count / 10) + 1;
        // Start position of each page: 0 for the first page, 10 for the second, and so on.
        int[] starts = new int[pageTotal];
        for (int i = 0; i < pageTotal; i++)
        {
            starts[i] = i * 10;
        }
        return starts;
    }

This creates an array holding the start parameter of each page, so that you can add one LinkButton per page on your own search results page.
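As a worked example of the arithmetic, 42 results give 5 pages with start values 0, 10, 20, 30, and 40. The sketch below pairs each start value with the page label a link would display. BuildPageLinks and PageLinkSketch are hypothetical names, and creating the actual LinkButton controls is left out.

```csharp
using System;
using System.Collections.Generic;

public static class PageLinkSketch
{
    // Returns (label, start) pairs: label "1" has start 0, label "2" has start 10, ...
    public static List<KeyValuePair<string, int>> BuildPageLinks(int count)
    {
        var links = new List<KeyValuePair<string, int>>();
        int pageTotal = (count + 9) / 10; // ceiling division, 10 results per page
        for (int i = 0; i < pageTotal; i++)
        {
            links.Add(new KeyValuePair<string, int>((i + 1).ToString(), i * 10));
        }
        return links;
    }
}
```

Each pair could then supply the text and the start argument of one page link.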

The final part is to parse the result list in the middle of the page. Fortunately, the search page does not link an external stylesheet, so the styles defined in its head can be reused directly. Only the middle part needs to be parsed.

You also need to remove the "Cached" and "Similar" links from each entry in the list (the part marked with a red box in the screenshot). Add a method that strips them.
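One possible way to strip those links is sketched below. It is not the original method, and it assumes the links are plain anchor elements whose text is "Cached" or "Similar" (optionally "Similar pages"), possibly preceded by a " - " separator. The names RemoveCachedAndSimilar and ResultCleanupSketch are hypothetical, and the pattern would need adjusting to the real markup.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ResultCleanupSketch
{
    // Assumed markup: <a ...>Cached</a> and <a ...>Similar pages</a>,
    // each optionally preceded by a " - " separator.
    private static readonly Regex ExtraLinks = new Regex(
        @"(?:\s*-\s*)?<a\b[^>]*>\s*(?:Cached|Similar(?:\s+pages)?)\s*</a>",
        RegexOptions.IgnoreCase);

    // Removes the matching anchors and leaves every other link untouched.
    public static string RemoveCachedAndSimilar(string listHtml)
    {
        if (string.IsNullOrEmpty(listHtml)) return listHtml;
        return ExtraLinks.Replace(listHtml, "");
    }
}
```

Regular expressions are fragile against markup changes, so a check that the cleaned list still contains the expected number of result links is a sensible safeguard.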
