C# Web Crawler and Search Engine Research Code: Detailed Introduction

Source: Internet
Author: User

General idea:

Start from a portal link, for example www.sina.com.cn, and crawl outward from there. For every link found, parse the page content and check whether it contains the keyword the user entered; if it does, put the link and the related page content into a cache. The newly discovered links also go into a cache, and the process is executed recursively.

This is a fairly simple implementation, written as a summary for my own reference.

Ten threads are started at the same time, each with its own link-pool cache; links whose pages contain the keyword are collected into one shared cache. A service page refreshes on a timer and displays the current results. (This is only a simulation: a real search engine would first use word segmentation to parse the keyword, then write the matching pages and links to files based on the page content, so that later searches read results from those files while the crawler keeps crawling around the clock.) Let's look at the concrete implementation; a rough sketch of the overall flow follows below.
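Before the real code, here is a minimal, single-threaded sketch of the idea just described. It is only an orientation aid, not part of the project: the keyword value, the result limit, and the regex-based link extraction are placeholders (the actual code below uses HttpWebRequest, HtmlAgilityPack, and ten threads).

using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

// Minimal sketch: fetch a page, keep it if it contains the keyword,
// queue the links it contains, repeat.
class CrawlSketch
{
    static void Main()
    {
        string keyword = "news";                          // hypothetical keyword
        var pending = new Queue<string>();
        var results = new List<string>();
        pending.Enqueue("http://www.sina.com.cn");

        using (var client = new WebClient())
        {
            while (pending.Count > 0 && results.Count < 10)
            {
                string url = pending.Dequeue();
                string html;
                try { html = client.DownloadString(url); }   // fetch the page
                catch (WebException) { continue; }            // skip pages we cannot reach

                if (html.Contains(keyword))                   // keep pages containing the keyword
                    results.Add(url);

                // crude link extraction; the real code uses HtmlAgilityPack
                foreach (Match m in Regex.Matches(html, "href=\"(http[^\"]+)\""))
                    pending.Enqueue(m.Groups[1].Value);
            }
        }

        results.ForEach(Console.WriteLine);
    }
}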

Entity class:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Web;
using System.Threading;

namespace SpiderDemo.Entity
{
    // Crawler thread: a worker thread together with its own pool of links to process
    public class ClamThread
    {
        public Thread _thread { get; set; }
        public List<Link> LnkPool { get; set; }
    }

    // A crawled link
    public class Link
    {
        public string Href { get; set; }
        public string LinkName { get; set; }
        public string Context { get; set; }
        public int TheadId { get; set; }
    }
}

Cache class:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Web;
using SpiderDemo.Entity;
using System.Threading;

namespace SpiderDemo.SearchUtil
{
    public static class CacheHelper
    {
        public static bool EnableSearch;

        /// <summary>
        /// Start URL
        /// </summary>
        public const string StartUrl = "http://www.sina.com.cn";

        /// <summary>
        /// Maximum number of links cached per thread; a performance safeguard.
        /// If resources were released more promptly, more could be crawled.
        /// </summary>
        public const int MaxNum = 300;

        /// <summary>
        /// At most 1000 results are collected
        /// </summary>
        public const int MaxResult = 1000;

        /// <summary>
        /// Number of results crawled so far
        /// </summary>
        public static int SpideNum;

        /// <summary>
        /// Keyword
        /// </summary>
        public static string KeyWord;

        /// <summary>
        /// Running time
        /// </summary>
        public static int RuningTime;

        /// <summary>
        /// Maximum running time
        /// </summary>
        public static int MaxRuningTime;

        /// <summary>
        /// Ten threads crawling at the same time
        /// </summary>
        public static ClamThread[] ThreadList = new ClamThread[10];

        /// <summary>
        /// Links crawled first, the shared connection pool
        /// </summary>
        public static List<Link> LnkPool = new List<Link>();

        /// <summary>
        /// Links that matched the keyword (valid results)
        /// </summary>
        public static List<Link> ValidLnk = new List<Link>();

        /// <summary>
        /// Lock object so that threads do not take the same link
        /// </summary>
        public static readonly object SyncObj = new object();
    }
}
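SyncObj is only declared here; the posted code never actually takes the lock when a thread pulls a link out of the shared pool. A minimal sketch of how it could be used is shown below; the helper class and method name (LinkPoolHelper.TakeNextLink) are illustrative, not part of the project.

using System.Collections.Generic;
using SpiderDemo.Entity;

namespace SpiderDemo.SearchUtil
{
    public static class LinkPoolHelper
    {
        // Illustrative helper: take one link from the shared pool under the lock,
        // so that two threads never grab the same link.
        public static Link TakeNextLink()
        {
            lock (CacheHelper.SyncObj)
            {
                if (CacheHelper.LnkPool.Count == 0)
                    return null;

                Link lnk = CacheHelper.LnkPool[0];
                CacheHelper.LnkPool.RemoveAt(0);
                return lnk;
            }
        }
    }
}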

HTTP request class:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Web;
using System.Text;
using System.Net;
using System.IO;
using System.Threading;

namespace SpiderDemo.SearchUtil
{
    public static class HttpPostUtility
    {
        /// <summary>
        /// Synchronous for now; to be optimized later
        /// </summary>
        /// <param name="url"></param>
        /// <returns></returns>
        public static Stream SendReq(string url)
        {
            try
            {
                if (string.IsNullOrEmpty(url))
                {
                    return null;
                }

                // If a proxy is needed, configure it before sending the request:
                // WebProxy wp = new WebProxy("10.0.1.33:8080");
                // wp.Credentials = new System.Net.NetworkCredential("******", "******", "Feinno");

                HttpWebRequest myRequest = (HttpWebRequest)WebRequest.Create(url);
                // myRequest.Proxy = wp;

                HttpWebResponse myResponse = (HttpWebResponse)myRequest.GetResponse();
                return myResponse.GetResponseStream();
            }
            catch (Exception)
            {
                // Some sites restrict requests; just return null on failure
                return null;
            }
        }
    }
}
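A quick usage note: SendReq returns the raw response stream and never closes it, so the caller should dispose of it once done (the demo code itself does not, which is why the "release resources in a timely manner" remark appears in CacheHelper). A small snippet, assuming the usual System, System.IO, and System.Text usings:

using (Stream s = HttpPostUtility.SendReq(CacheHelper.StartUrl))
{
    if (s != null)
    {
        using (var reader = new StreamReader(s, Encoding.Default))
        {
            string html = reader.ReadToEnd();   // the raw HTML of the start page
            Console.WriteLine(html.Length);
        }
    }
}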

The page-parsing class uses a third-party component, HtmlAgilityPack.dll, which is easy to work with; download link: http://www.php.cn/

using System;
using System.Collections.Generic;
using System.Linq;
using System.Web;
using System.Threading;
using System.Text;
using System.Xml;
using System.Xml.Linq;
using HtmlAgilityPack;
using System.IO;
using SpiderDemo.Entity;

namespace SpiderDemo.SearchUtil
{
    public static class UrlAnalysisProcessor
    {
        public static void GetHrefs(Link url, Stream s, List<Link> lnkPool)
        {
            try
            {
                // No HTML stream, return directly
                if (s == null)
                {
                    return;
                }

                // Parsed links go into the cache, waiting for the front page to take them.
                // Each thread caches at most 300 links; beyond that they are dropped,
                // otherwise the front end is too slow to consume them.
                if (lnkPool.Count >= CacheHelper.MaxNum)
                {
                    return;
                }

                // Load the HTML; look up HtmlAgilityPack and give this component a try
                HtmlAgilityPack.HtmlDocument doc = new HtmlDocument();

                // The encoding is specified, so in theory there should be no garbled Chinese text
                doc.Load(s, Encoding.Default);

                // Get all links ...

The method body breaks off here in the source; a hedged sketch of the remaining steps follows below.
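The rest of GetHrefs is missing from the source. Based on how it is called from the page code-behind (it fills the thread's link pool and adds keyword matches to CacheHelper.ValidLnk), a plausible continuation might look like the following; the XPath expression, the http-only filter, and the keyword check against the anchor text are assumptions, not the original code.

// Hedged reconstruction of the missing part of GetHrefs; it would sit inside
// the method above, right after doc.Load(...). Assumptions are noted inline.
HtmlNodeCollection anchors = doc.DocumentNode.SelectNodes("//a[@href]");
if (anchors == null)
{
    return;
}

foreach (HtmlNode a in anchors)
{
    string href = a.GetAttributeValue("href", string.Empty);
    if (string.IsNullOrEmpty(href) || !href.StartsWith("http"))
        continue;                                   // assumed: keep only absolute http links

    Link lnk = new Link
    {
        Href = href,
        LinkName = "<a href='" + href + "'>" + a.InnerText + "</a>",
        Context = a.InnerText
    };

    // Assumed: a link whose text contains the keyword is recorded as a result
    if (!string.IsNullOrEmpty(CacheHelper.KeyWord) && a.InnerText.Contains(CacheHelper.KeyWord))
    {
        lock (CacheHelper.SyncObj)
        {
            if (CacheHelper.ValidLnk.Count < CacheHelper.MaxResult)
            {
                CacheHelper.SpideNum++;
                CacheHelper.ValidLnk.Add(lnk);
            }
        }
    }

    // Newly found links wait in this thread's pool to be crawled next
    if (lnkPool.Count < CacheHelper.MaxNum)
    {
        lnkPool.Add(lnk);
    }
}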

Search page code-behind:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Web;
using System.Web.UI;
using System.Web.UI.WebControls;
using SpiderDemo.SearchUtil;
using System.Threading;
using System.IO;
using SpiderDemo.Entity;

namespace SpiderDemo
{
    public partial class SearchPage : System.Web.UI.Page
    {
        protected void Page_Load(object sender, EventArgs e)
        {
            if (!IsPostBack)
            {
                InitSetting();
            }
        }

        private void InitSetting()
        {
        }

        private void StartWork()
        {
            CacheHelper.EnableSearch = true;
            CacheHelper.KeyWord = txtKeyWord.Text;

            // The first request goes to Sina; get the returned HTML stream
            Stream htmlStream = HttpPostUtility.SendReq(CacheHelper.StartUrl);

            Link startLnk = new Link()
            {
                Href = CacheHelper.StartUrl,
                LinkName = "<a href='" + CacheHelper.StartUrl + "'>Sina " + CacheHelper.StartUrl + "</a>"
            };

            // Parse out the links
            UrlAnalysisProcessor.GetHrefs(startLnk, htmlStream, CacheHelper.LnkPool);

            for (int i = 0; i < CacheHelper.ThreadList.Length; i++)
            {
                CacheHelper.ThreadList[i] = new ClamThread();
                CacheHelper.ThreadList[i].LnkPool = new List<Link>();
            }

            // Divide the links among the threads
            for (int i = 0; i < CacheHelper.LnkPool.Count; i++)
            {
                int tIndex = i % CacheHelper.ThreadList.Length;
                CacheHelper.ThreadList[tIndex].LnkPool.Add(CacheHelper.LnkPool[i]);
            }

            Action<ClamThread> clamIt = new Action<ClamThread>((clt) =>
            {
                Stream s = HttpPostUtility.SendReq(clt.LnkPool[0].Href);
                DoIt(clt, s, clt.LnkPool[0]);
            });

            for (int i = 0; i < CacheHelper.ThreadList.Length; i++)
            {
                CacheHelper.ThreadList[i]._thread = new Thread(new ThreadStart(() =>
                {
                    clamIt(CacheHelper.ThreadList[i]);
                }));

                // Sleep 100 ms after each thread starts working
                CacheHelper.ThreadList[i]._thread.Start();
                Thread.Sleep(100);
            }
        }

        private void DoIt(ClamThread thread, Stream htmlStream, Link url)
        {
            if (!CacheHelper.EnableSearch)
            {
                return;
            }

            if (CacheHelper.SpideNum > CacheHelper.MaxResult)
            {
                return;
            }

            // Parse the page: URLs matching the condition go into the result cache,
            // and the links found on the page go into this thread's pool
            UrlAnalysisProcessor.GetHrefs(url, htmlStream, thread.LnkPool);

            // If there are links left, take the first one and send a request;
            // otherwise stop - this is resource-hungry work anyway
            if (thread.LnkPool.Count > 0)
            {
                Link firstLnk = thread.LnkPool[0];

                // Remove the link from the cache once it has been taken
                thread.LnkPool.Remove(firstLnk);

                firstLnk.TheadId = Thread.CurrentThread.ManagedThreadId;
                Stream content = HttpPostUtility.SendReq(firstLnk.Href);
                DoIt(thread, content, firstLnk);
            }
            else
            {
                // No links left: stop and let the other threads do the work
                thread._thread.Abort();
            }
        }

        protected void btnSearch_Click(object sender, EventArgs e)
        {
            this.StartWork();
        }

        protected void btnShow_Click(object sender, EventArgs e)
        {
        }

        protected void btnStop_Click(object sender, EventArgs e)
        {
            foreach (var t in CacheHelper.ThreadList)
            {
                t._thread.Abort();
                t._thread.DisableComObjectEagerCleanup();
            }

            CacheHelper.EnableSearch = false;
            CacheHelper.ValidLnk.Clear();
            CacheHelper.LnkPool.Clear();
            CacheHelper.ValidLnk.Clear();
        }
    }
}
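One caveat worth flagging in StartWork: the lambda passed to ThreadStart captures the loop variable i. In a for loop that variable is shared by every iteration, so by the time a thread actually runs, i may already have moved on (or reached ThreadList.Length, causing an out-of-range access). A safer variant, assuming nothing else about the code, copies the element into a local before creating the thread:

for (int i = 0; i < CacheHelper.ThreadList.Length; i++)
{
    // Copy per iteration so the lambda captures a stable reference
    ClamThread current = CacheHelper.ThreadList[i];
    current._thread = new Thread(new ThreadStart(() =>
    {
        clamIt(current);
    }));

    current._thread.Start();
    Thread.Sleep(100);
}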

Search page front-end markup:

<%@ Page Language="C#" AutoEventWireup="true" CodeBehind="SearchPage.aspx.cs" Inherits="SpiderDemo.SearchPage" %>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

(The rest of the markup, with the keyword textbox and the Search / Show / Stop buttons used by the code-behind, is not included in the source.)

State service handler, StateServicePage.cs:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Web;
using System.Text;
using SpiderDemo.SearchUtil;
using SpiderDemo.Entity;

namespace SpiderDemo
{
    /// <summary>
    /// StateServicePage: reports the crawler's current state
    /// </summary>
    public class StateServicePage : IHttpHandler
    {
        public void ProcessRequest(HttpContext context)
        {
            context.Response.ContentType = "text/plain";

            if (context.Request["op"] != null && context.Request["op"] == "info")
            {
                context.Response.Write(ShowState());
            }
        }

        public string ShowState()
        {
            StringBuilder sbRet = new StringBuilder(100);
            string ret = GetValidLnkStr();

            int count = 0;
            for (int i = 0; i < CacheHelper.ThreadList.Length; i++)
            {
                if (CacheHelper.ThreadList[i] != null && CacheHelper.ThreadList[i].LnkPool != null)
                    count += CacheHelper.ThreadList[i].LnkPool.Count;
            }

            sbRet.AppendLine("Service is running: " + CacheHelper.EnableSearch + "<br/>");
            sbRet.AppendLine("Total number of links in the pools: " + count + "<br/>");
            sbRet.AppendLine("Search result: <br/>" + ret);

            return sbRet.ToString();
        }

        private string GetValidLnkStr()
        {
            StringBuilder sb = new StringBuilder(120);
            Link[] cloneLnk = new Link[CacheHelper.ValidLnk.Count];
            CacheHelper.ValidLnk.CopyTo(cloneLnk, 0);

            for (int i = 0; i < cloneLnk.Length; i++)
            {
                sb.AppendLine("<br/>" + cloneLnk[i].LinkName + "<br/>" + cloneLnk[i].Context);
            }

            return sb.ToString();
        }

        public bool IsReusable
        {
            get { return false; }
        }
    }
}
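The timed-refresh page mentioned earlier, which polls this handler and displays the current results, is not included in the source. Assuming the handler is mapped to StateServicePage.ashx in web.config, its output can be checked with a plain HTTP request; the host and port below are placeholders:

// Hypothetical quick check of the state handler; the localhost URL and the
// StateServicePage.ashx mapping are assumptions, not part of the posted code.
using System;
using System.Net;

class StatePoller
{
    static void Main()
    {
        using (var client = new WebClient())
        {
            string state = client.DownloadString(
                "http://localhost:8080/StateServicePage.ashx?op=info");
            Console.WriteLine(state);   // running flag, pool size, and matched links
        }
    }
}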

That is all for the C# web crawler and search engine research code details; for more related content, please follow topic.alibabacloud.com (www.php.cn)!
