Web crawler + HtmlAgilityPack + Windows service: crawl 200,000 blog posts from the cnblogs blog park


1. Preface

For a recent project at my company I needed some article data, so I thought of using a web crawler to scrape it from a technical website. The site I visit most often is the cnblogs blog park, which is how this article came about.

2. Preparatory work

The data I scrape from the blog park needs to go somewhere, and the best place is of course a database. So let's first create a database with a single table to hold the data. It is actually very simple; the table layout is described below.

BlogArticleId: post ID; BlogTitle: post title; BlogUrl: post URL; BlogAuthor: post author; BlogTime: post publish time; BlogMotto: the author's motto; BlogDepth: the depth at which the spider crawled the post; IsDeleted: soft-delete flag.
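
The original screenshot of the table design is not reproduced here, so here is a rough sketch of how the table might be created. The column types and lengths are my own assumptions and should be adjusted to your needs; the connection string is the one used by the helper class below.

    // Rough sketch (assumed schema): create the BlogArticle table described above.
    // The column types and lengths are assumptions; adjust them as needed.
    using System.Data.SqlClient;

    class CreateBlogArticleTable
    {
        static void Main()
        {
            const string conn = "Data Source=.;Initial Catalog=cnblogs;User Id=sa;Password=123";
            const string sql = @"
                CREATE TABLE BlogArticle
                (
                    BlogArticleId INT IDENTITY(1,1) PRIMARY KEY,  -- post ID
                    BlogTitle     NVARCHAR(200) NULL,             -- post title
                    BlogUrl       NVARCHAR(500) NULL,             -- post URL
                    BlogAuthor    NVARCHAR(100) NULL,             -- post author
                    BlogTime      NVARCHAR(50)  NULL,             -- publish time
                    BlogMotto     NVARCHAR(500) NULL,             -- author's motto
                    BlogDepth     NVARCHAR(10)  NULL,             -- crawl depth
                    IsDeleted     BIT NOT NULL DEFAULT(0)         -- soft-delete flag
                );";

            using (SqlConnection connection = new SqlConnection(conn))
            using (SqlCommand command = new SqlCommand(sql, connection))
            {
                connection.Open();
                command.ExecuteNonQuery();
            }
        }
    }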

With the database table created, let's start with a database helper class.

    /// <summary>
    /// Database helper class
    /// </summary>
    public class MssqlHelper
    {
        #region Field properties
        /// <summary>
        /// Database connection string
        /// </summary>
        private static string conn = "Data Source=.;Initial Catalog=cnblogs;User Id=sa;Password=123";
        #endregion

        #region Write data into the DataTable
        public static void GetData(string title, string url, string author, string time, string motto, string depth, DataTable dt)
        {
            // 2.0 Build a new row and append it to the DataTable
            DataRow dr = dt.NewRow();
            dr["BlogTitle"] = title;
            dr["BlogUrl"] = url;
            dr["BlogAuthor"] = author;
            dr["BlogTime"] = time;
            dr["BlogMotto"] = motto;
            dr["BlogDepth"] = depth;
            dt.Rows.Add(dr);
        }
        #endregion

        #region Insert data into the database
        /// <summary>
        /// Insert data into the database
        /// </summary>
        public static void InsertDB(DataTable dt)
        {
            try
            {
                using (System.Data.SqlClient.SqlBulkCopy copy = new System.Data.SqlClient.SqlBulkCopy(conn))
                {
                    // 3.0.1 Specify the target table to insert into
                    copy.DestinationTableName = "BlogArticle";

                    // 3.0.2 Tell the SqlBulkCopy object which columns of the in-memory table map to which columns of the BlogArticle table
                    copy.ColumnMappings.Add("BlogTitle", "BlogTitle");
                    copy.ColumnMappings.Add("BlogUrl", "BlogUrl");
                    copy.ColumnMappings.Add("BlogAuthor", "BlogAuthor");
                    copy.ColumnMappings.Add("BlogTime", "BlogTime");
                    copy.ColumnMappings.Add("BlogMotto", "BlogMotto");
                    copy.ColumnMappings.Add("BlogDepth", "BlogDepth");

                    // 3.0.3 Insert the rows of the in-memory table dt into the BlogArticle table in a single batch
                    copy.WriteToServer(dt);
                    dt.Rows.Clear();
                }
            }
            catch (Exception)
            {
                dt.Rows.Clear();
            }
        }
        #endregion
    }
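
For clarity, this is roughly how the helper is used later on: crawled posts are buffered into a DataTable with GetData and then flushed in one batch with InsertDB. A minimal sketch with made-up values:

    // Minimal usage sketch with made-up values: buffer one crawled post, then bulk-insert the batch.
    using System.Data;

    class MssqlHelperUsage
    {
        static void Main()
        {
            DataTable dt = new DataTable();
            dt.Columns.Add("BlogTitle", typeof(string));
            dt.Columns.Add("BlogUrl", typeof(string));
            dt.Columns.Add("BlogAuthor", typeof(string));
            dt.Columns.Add("BlogTime", typeof(string));
            dt.Columns.Add("BlogMotto", typeof(string));
            dt.Columns.Add("BlogDepth", typeof(string));

            // Append one row to the in-memory buffer.
            MssqlHelper.GetData("Sample title", "http://www.cnblogs.com/sample-post", "Sample author",
                                "2015-08-06", "Sample motto", "1", dt);

            // Flush all buffered rows into the BlogArticle table in a single SqlBulkCopy batch.
            MssqlHelper.InsertDB(dt);
        }
    }
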
3. Log

A log makes it easy to see what the service is doing. The code is as follows.

    /// <summary>
    /// Log helper class
    /// </summary>
    public class LogHelper
    {
        #region Write log
        // Write a line of text to the log file
        public static void WriteLog(string text)
        {
            //StreamWriter sw = new StreamWriter(AppDomain.CurrentDomain.BaseDirectory + "\\log.txt", true);
            StreamWriter sw = new StreamWriter("F:" + "\\log.txt", true);
            sw.WriteLine(text);
            sw.Close(); // Flush and close the writer
        }
        #endregion
    }
4. Crawler

My web spider is built on a third-party crawler class library, made up of the following source files.

AddUrlEventArgs.cs
BloomFilter.cs
CrawlErrorEventArgs.cs
CrawlExtension.cs
CrawlMaster.cs
CrawlSettings.cs
CrawlStatus.cs
DataReceivedEventArgs.cs
SecurityQueue.cs
UrlInfo.cs
UrlQueue.cs

5. Create a Windows service

With all of that preparation done, we finally come to the main point. We all know a console program is not very stable, and crawling articles from the blog park has to keep going for a long time, so it needs to run reliably. That is why I chose a Windows service. Create the Windows service with the following code.

    using System;
    using System.Data;
    using System.ServiceProcess;
    using Feng.SimpleCrawler;
    using Feng.DbHelper;
    using Feng.Log;
    using HtmlAgilityPack;

    namespace Feng.Demo
    {
        /// <summary>
        /// Windows service
        /// </summary>
        partial class FengCnblogsService : ServiceBase
        {
            #region Constructor
            /// <summary>
            /// Constructor
            /// </summary>
            public FengCnblogsService()
            {
                InitializeComponent();
            }
            #endregion

            #region Field properties
            /// <summary>
            /// Spider crawler settings
            /// </summary>
            private static readonly CrawlSettings Settings = new CrawlSettings();

            /// <summary>
            /// Temporary in-memory table used to buffer the crawled data
            /// </summary>
            private static DataTable dt = new DataTable();

            /// <summary>
            /// Bloom filter used to de-duplicate URLs: http://www.cnblogs.com/heaad/archive/2011/01/02/1924195.html
            /// </summary>
            private static BloomFilter<string> filter;
            #endregion

            #region Start service
            /// <summary>
            /// TODO: Add code here to start the service.
            /// </summary>
            /// <param name="args"></param>
            protected override void OnStart(string[] args)
            {
                ProcessStart();
            }
            #endregion

            #region Stop service
            /// <summary>
            /// TODO: Add code here to perform any tear-down necessary to stop the service.
            /// </summary>
            protected override void OnStop()
            {
            }
            #endregion

            #region Program start processing
            /// <summary>
            /// Program start processing
            /// </summary>
            private void ProcessStart()
            {
                dt.Columns.Add("BlogTitle", typeof(string));
                dt.Columns.Add("BlogUrl", typeof(string));
                dt.Columns.Add("BlogAuthor", typeof(string));
                dt.Columns.Add("BlogTime", typeof(string));
                dt.Columns.Add("BlogMotto", typeof(string));
                dt.Columns.Add("BlogDepth", typeof(string));
                filter = new BloomFilter<string>(200000);
                const string CityName = "";

                #region Set the seed addresses
                // Set the seed addresses
                Settings.SeedsAddress.Add(string.Format("http://www.cnblogs.com/{0}", CityName));
                Settings.SeedsAddress.Add("http://www.cnblogs.com/artech");
                Settings.SeedsAddress.Add("http://www.cnblogs.com/wuhuacong/");
                Settings.SeedsAddress.Add("http://www.cnblogs.com/dudu/");
                Settings.SeedsAddress.Add("http://www.cnblogs.com/guomingfeng/");
                Settings.SeedsAddress.Add("http://www.cnblogs.com/daxnet/");
                Settings.SeedsAddress.Add("http://www.cnblogs.com/fenglingyi");
                Settings.SeedsAddress.Add("http://www.cnblogs.com/ahthw/");
                Settings.SeedsAddress.Add("http://www.cnblogs.com/wangweimutou/");
                #endregion

                #region Set the URL keywords
                Settings.HrefKeywords.Add("a/");
                Settings.HrefKeywords.Add("b/");
                Settings.HrefKeywords.Add("c/");
                Settings.HrefKeywords.Add("d/");
                Settings.HrefKeywords.Add("e/");
                Settings.HrefKeywords.Add("f/");
                Settings.HrefKeywords.Add("g/");
                Settings.HrefKeywords.Add("h/");
                Settings.HrefKeywords.Add("i/");
                Settings.HrefKeywords.Add("j/");
                Settings.HrefKeywords.Add("k/");
                Settings.HrefKeywords.Add("l/");
                Settings.HrefKeywords.Add("m/");
                Settings.HrefKeywords.Add("n/");
                Settings.HrefKeywords.Add("o/");
                Settings.HrefKeywords.Add("p/");
                Settings.HrefKeywords.Add("q/");
                Settings.HrefKeywords.Add("r/");
                Settings.HrefKeywords.Add("s/");
                Settings.HrefKeywords.Add("t/");
                Settings.HrefKeywords.Add("u/");
                Settings.HrefKeywords.Add("v/");
                Settings.HrefKeywords.Add("w/");
                Settings.HrefKeywords.Add("x/");
                Settings.HrefKeywords.Add("y/");
                Settings.HrefKeywords.Add("z/");
                #endregion

                // Set the number of crawl threads
                Settings.ThreadCount = 1;

                // Set the crawl depth
                Settings.Depth = 55;

                // Set links to be ignored while crawling; multiple entries can be added, e.g. by suffix
                Settings.EscapeLinks.Add("http://www.oschina.net/");

                // Automatic speed limit: a random interval of 1-5 seconds between requests
                Settings.AutoSpeedLimit = false;

                // Lock the host: after stripping the second-level domain, hosts that compare equal are treated as the same site
                Settings.LockHost = false;

                Settings.RegularFilterExpressions.Add(@"http://([w]{3}.)+[cnblogs]+.com/");

                var master = new CrawlMaster(Settings);
                master.AddUrlEvent += MasterAddUrlEvent;
                master.DataReceivedEvent += MasterDataReceivedEvent;
                master.Crawl();
            }
            #endregion

            #region Print the URL
            /// <summary>
            /// The master add URL event.
            /// </summary>
            /// <param name="args">The args.</param>
            /// <returns>The <see cref="bool"/>.</returns>
            private static bool MasterAddUrlEvent(AddUrlEventArgs args)
            {
                if (!filter.Contains(args.Url))
                {
                    filter.Add(args.Url);
                    Console.WriteLine(args.Url);
                    if (dt.Rows.Count > 100)
                    {
                        MssqlHelper.InsertDB(dt);
                        dt.Rows.Clear();
                    }

                    return true;
                }

                return false; // Returning false means: do not add the URL to the queue
            }
            #endregion

            #region Parse the HTML
            /// <summary>
            /// The master data received event.
            /// </summary>
            /// <param name="args">The args.</param>
            private static void MasterDataReceivedEvent(SimpleCrawler.DataReceivedEventArgs args)
            {
                // Parse the page here. You can use something like HtmlAgilityPack (an HTML parsing component),
                // a regular expression, or parse the string yourself.
                HtmlDocument doc = new HtmlDocument();
                doc.LoadHtml(args.Html);

                HtmlNode node = doc.DocumentNode.SelectSingleNode("//title");
                string title = node.InnerText;

                HtmlNode node2 = doc.DocumentNode.SelectSingleNode("//*[@id='post-date']");
                string time = node2.InnerText;

                HtmlNode node3 = doc.DocumentNode.SelectSingleNode("//*[@id='topics']/div/div[3]/a[1]");
                string author = node3.InnerText;

                HtmlNode node6 = doc.DocumentNode.SelectSingleNode("//*[@id='blogTitle']/h2");
                string motto = node6.InnerText;

                MssqlHelper.GetData(title, args.Url, author, time, motto, args.Depth.ToString(), dt);

                LogHelper.WriteLog(title);
                LogHelper.WriteLog(args.Url);
                LogHelper.WriteLog(author);
                LogHelper.WriteLog(time);
                LogHelper.WriteLog(motto == "" ? "null" : motto);
                LogHelper.WriteLog(args.Depth + "&dt.Rows.Count=" + dt.Rows.Count);

                // Once more than 100 rows have accumulated, write them to the database;
                // adjust this threshold to suit your own situation.
                if (dt.Rows.Count > 100)
                {
                    MssqlHelper.InsertDB(dt);
                    dt.Rows.Clear();
                }
            }
            #endregion
        }
    }

Here the crawler fetches posts from the blog park, and we use the third-party HtmlAgilityPack library to parse out the fields we need: the post title, author, URL, and so on. At the same time we can configure some settings for the service.
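
As a standalone illustration of that parsing step, here is a minimal HtmlAgilityPack sketch. The XPath expressions mirror the service code above and depend on the cnblogs page layout of the time, so treat them as assumptions.

    // Standalone parsing sketch. The XPath expressions mirror the service code above
    // and depend on the cnblogs page structure of the time, so treat them as assumptions.
    using HtmlAgilityPack;

    class ParseDemo
    {
        static void Main()
        {
            // Assume `html` holds the raw page source the crawler received.
            string html = "<html><head><title>Sample post</title></head><body></body></html>";

            HtmlDocument doc = new HtmlDocument();
            doc.LoadHtml(html);

            // SelectSingleNode returns null when the node is missing, so check before reading InnerText.
            HtmlNode titleNode = doc.DocumentNode.SelectSingleNode("//title");
            string title = titleNode == null ? "" : titleNode.InnerText;

            HtmlNode dateNode = doc.DocumentNode.SelectSingleNode("//*[@id='post-date']");
            string time = dateNode == null ? "" : dateNode.InnerText;

            System.Console.WriteLine(title + " | " + time);
        }
    }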

In the crawler we need to set a few parameters: the seed addresses, the URL keywords, the crawl depth, and so on. Once that is done, all that is left is to install our Windows service and we are finished. Hey...

6. Installing the Windows service

Here we install the Windows service using the installer tooling that comes with Visual Studio.
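
The original screenshots of this step are not reproduced here. For that tooling (typically InstallUtil.exe from a Visual Studio developer command prompt) to register the service, the project needs an installer class; a minimal sketch might look like the following. The service name, start mode, and account below are my own assumptions, not taken from the original project.

    // Minimal sketch of an installer class so InstallUtil.exe can register the service.
    // The service name, start mode, and account below are assumptions, not taken from the original project.
    using System.ComponentModel;
    using System.Configuration.Install;
    using System.ServiceProcess;

    [RunInstaller(true)]
    public class ProjectInstaller : Installer
    {
        public ProjectInstaller()
        {
            // Run the service under the LocalSystem account (assumed choice).
            ServiceProcessInstaller processInstaller = new ServiceProcessInstaller();
            processInstaller.Account = ServiceAccount.LocalSystem;

            // Register the service itself; the name is a hypothetical placeholder.
            ServiceInstaller serviceInstaller = new ServiceInstaller();
            serviceInstaller.ServiceName = "FengCnblogsService";
            serviceInstaller.StartType = ServiceStartMode.Automatic;

            Installers.Add(processInstaller);
            Installers.Add(serviceInstaller);
        }
    }

With such an installer compiled into the service executable, the service can typically be installed by passing the path of the executable to InstallUtil.exe, and uninstalled again with its /u switch.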

Once the installation succeeds, open the Windows Services manager and you will see the service we just installed.
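
If you prefer to check from code rather than the Services console, a small sketch using System.ServiceProcess.ServiceController can query and start the service. The service name below matches the installer sketch above and is an assumption.

    // Small sketch: query and start the installed service from code instead of services.msc.
    // The service name matches the installer sketch above and is an assumption.
    using System;
    using System.ServiceProcess;

    class CheckService
    {
        static void Main()
        {
            using (ServiceController controller = new ServiceController("FengCnblogsService"))
            {
                Console.WriteLine("Current status: " + controller.Status);

                if (controller.Status == ServiceControllerStatus.Stopped)
                {
                    controller.Start();
                    controller.WaitForStatus(ServiceControllerStatus.Running, TimeSpan.FromSeconds(30));
                }

                Console.WriteLine("Status after start: " + controller.Status);
            }
        }
    }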

You can also open the log file to see the information that has been crawled from the blog posts.

Now check the database; my service has been running for a day...
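
A quick way to gauge progress is to count the rows that have accumulated in the table. A minimal sketch, reusing the connection string from the helper class above:

    // Minimal sketch: count the rows crawled so far, reusing the connection string from the helper class.
    using System;
    using System.Data.SqlClient;

    class CountRows
    {
        static void Main()
        {
            const string conn = "Data Source=.;Initial Catalog=cnblogs;User Id=sa;Password=123";

            using (SqlConnection connection = new SqlConnection(conn))
            using (SqlCommand command = new SqlCommand("SELECT COUNT(*) FROM BlogArticle", connection))
            {
                connection.Open();
                int count = (int)command.ExecuteScalar();
                Console.WriteLine("Crawled posts so far: " + count);
            }
        }
    }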

Reprint: http://www.cnblogs.com/fenglingyi/p/4708006.html
