C# implementation: automatically crawling remote web page information from a program

Source: Internet
Author: User
A program can automatically read the information displayed on other web pages, much as a crawler does. Suppose we need a system that extracts the song search rankings from the Baidu site, so that an analysis system can process the collected data and provide reference data for the business.
To meet this requirement, we need to simulate a browser visiting the web page, analyze the page data, and finally write the result of the analysis, that is, the collated data, to the database. The approach is:
1. Send an HttpWebRequest.
2. Receive the returned HttpWebResponse and get the HTML source of the target page.
3. Take out the part of the source that contains the data.
4. Build an HtmlDocument from that HTML source and loop over it to extract the data.
5. Write the data to the database.
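
Step 3 can be sketched on its own as follows. This is a minimal sketch rather than the article's exact code: the class name TableExtractor, the method name ExtractFirstTableAfter, and the idea of returning null on failure are assumptions made for illustration. It assumes the data sits in the first table that follows a known marker string, and it checks each IndexOf result because IndexOf returns -1 when nothing is found, which would otherwise make Substring throw.

```csharp
using System;

public static class TableExtractor
{
    // Returns the first <table>...</table> block that follows `marker`,
    // or null when the marker or a complete table cannot be found.
    // (Hypothetical helper; not part of the original article.)
    public static string ExtractFirstTableAfter(string html, string marker)
    {
        const string closeTag = "</table>";

        // Locate the marker text that precedes the data table.
        int markerPos = html.IndexOf(marker, StringComparison.Ordinal);
        if (markerPos < 0) return null;

        // Locate the opening tag of the first table after the marker.
        int tableStart = html.IndexOf("<table", markerPos, StringComparison.OrdinalIgnoreCase);
        if (tableStart < 0) return null;

        // Locate the matching closing tag (nested tables are not handled).
        int tableEnd = html.IndexOf(closeTag, tableStart, StringComparison.OrdinalIgnoreCase);
        if (tableEnd < 0) return null;

        // Include the closing tag itself in the result.
        return html.Substring(tableStart, tableEnd - tableStart + closeTag.Length);
    }
}
```

A caller would pass the downloaded page source and the marker text, for example TableExtractor.ExtractFirstTableAfter(pageSource, "Song TOP500"), and skip the parsing step when the result is null.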

The code is as follows:

// Get the HTML source of the page at the given URL
private string GetWebContent(string url)
{
    string strResult = "";
    try
    {
        // Create the HttpWebRequest
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        // Set the connection timeout (in milliseconds)
        request.Timeout = 30000;
        request.Headers.Set("Pragma", "no-cache");
        // The page is encoded as GB2312
        Encoding encoding = Encoding.GetEncoding("GB2312");
        // Dispose the response, stream, and reader once the page has been read
        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        using (Stream streamReceive = response.GetResponseStream())
        using (StreamReader streamReader = new StreamReader(streamReceive, encoding))
        {
            strResult = streamReader.ReadToEnd();
        }
    }
    catch
    {
        MessageBox.Show("error");
    }
    return strResult;
}
To use HttpWebRequest and HttpWebResponse, add the System.Net namespace reference. Stream and StreamReader need System.IO, and Encoding needs System.Text:
using System.Net;
using System.IO;
using System.Text;

The following event handler puts the steps together:
// SplitName, AddLine, InsertData, and the DataTable dt are members of the form that are not shown here.
private void button1_Click(object sender, EventArgs e)
{
    // URL address to crawl
    string url = "http://list.mp3.baidu.com/topso/mp3topsong.html?id=1#top2";

    // Get the source code of the specified URL
    string strWebContent = GetWebContent(url);

    richTextBox1.Text = strWebContent;
    // Take out the source code associated with the data
    int iBodyStart = strWebContent.IndexOf("<body", 0);
    int iStart = strWebContent.IndexOf("Song TOP500", iBodyStart);
    int iTableStart = strWebContent.IndexOf("<table", iStart);
    int iTableEnd = strWebContent.IndexOf("</table>", iTableStart);
    // 8 is the length of "</table>", so the closing tag is included
    string strWeb = strWebContent.Substring(iTableStart, iTableEnd - iTableStart + 8);

    // Generate the HtmlDocument
    WebBrowser webb = new WebBrowser();
    webb.Navigate("about:blank");
    HtmlDocument htmlDoc = webb.Document.OpenNew(true);
    htmlDoc.Write(strWeb);
    HtmlElementCollection htmlTR = htmlDoc.GetElementsByTagName("TR");
    foreach (HtmlElement tr in htmlTR)
    {
        // Each row holds three (rank, song) cell pairs: TD 0/1, 2/3, and 4/5
        string strID = tr.GetElementsByTagName("TD")[0].InnerText;
        string strName = SplitName(tr.GetElementsByTagName("TD")[1].InnerText, "Musicname");
        string strSinger = SplitName(tr.GetElementsByTagName("TD")[1].InnerText, "Singer");
        strID = strID.Replace(".", "");
        // Insert into the DataTable
        AddLine(strID, strName, strSinger, "0");

        string strID1 = tr.GetElementsByTagName("TD")[2].InnerText;
        string strName1 = SplitName(tr.GetElementsByTagName("TD")[3].InnerText, "Musicname");
        string strSinger1 = SplitName(tr.GetElementsByTagName("TD")[3].InnerText, "Singer");
        strID1 = strID1.Replace(".", "");
        AddLine(strID1, strName1, strSinger1, "0");

        string strID2 = tr.GetElementsByTagName("TD")[4].InnerText;
        string strName2 = SplitName(tr.GetElementsByTagName("TD")[5].InnerText, "Musicname");
        string strSinger2 = SplitName(tr.GetElementsByTagName("TD")[5].InnerText, "Singer");
        strID2 = strID2.Replace(".", "");
        AddLine(strID2, strName2, strSinger2, "0");
    }
    // Insert into the database
    InsertData(dt);

    dataGridView1.DataSource = dt.DefaultView;
}
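
The handler calls SplitName and AddLine, but the article does not show them. The following is a hypothetical sketch of what such helpers might look like, not the author's implementation. Everything in it is an assumption made for illustration: the class name SongTable, the column names, the meaning of the fourth argument, and above all the cell format "song title - singer", which should be adjusted to the real page. The sketch passes the DataTable in explicitly instead of using a form field so that it can run on its own. InsertData is left out because it depends on the target database.

```csharp
using System;
using System.Data;

public static class SongTable
{
    // Hypothetical schema: the article does not show the real column names.
    public static DataTable CreateTable()
    {
        DataTable dt = new DataTable("Songs");
        dt.Columns.Add("Id", typeof(string));
        dt.Columns.Add("MusicName", typeof(string));
        dt.Columns.Add("Singer", typeof(string));
        dt.Columns.Add("Flag", typeof(string));
        return dt;
    }

    // Appends one row; mirrors the four-argument AddLine call in the handler.
    public static void AddLine(DataTable dt, string id, string name, string singer, string flag)
    {
        dt.Rows.Add(id, name, singer, flag);
    }

    // Assumed cell format: "<song title> - <singer>".
    // Returns the singer when part is "Singer", otherwise the song title.
    public static string SplitName(string cellText, string part)
    {
        if (string.IsNullOrEmpty(cellText)) return "";

        int sep = cellText.IndexOf(" - ", StringComparison.Ordinal);
        string title = sep < 0 ? cellText : cellText.Substring(0, sep);
        string singer = sep < 0 ? "" : cellText.Substring(sep + 3);

        return part == "Singer" ? singer.Trim() : title.Trim();
    }
}
```

Keeping the parsing in a pure function such as SplitName makes it possible to check the extraction logic against sample cell text without loading a WebBrowser control.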



