C# Crawler: Using Proxies to Boost CSDN Article View Counts

Yesterday I wrote the article "C# Bulk Crawl Free Proxies and Verify Validity". Today I finish what I set out to do there: the ultimate goal is to refresh the view count of a CSDN article. It is actually very simple. Articles on Blog Park (cnblogs) could once be refreshed through proxy IPs too, although that no longer works. Inflating an article's view count is shameful in itself and serves no purpose, but the technique is innocent. I used to write on CSDN; since CSDN's redesign I mainly write on Blog Park.

1. How to maintain the proxy IP library?

To use proxy IPs you need a library of valid proxies that is large enough. At the learning stage, just playing around, the only option is to crawl free proxy sites. Without a decent number of proxies, refreshing an article's view count is very slow, so the first job is to maintain your own proxy IP library.

I have used the Xici proxy site and 66ip before. Xici seems to have anti-crawling measures: I ran into trouble once and could not tell whether it was a site problem or an anti-crawling strategy. Crawling these two sites yields about two or three usable proxies per minute, which already counts as a decent rate. The pages on data5u, Kuaidaili and ip3366 update very rarely and their proxies are seldom valid. Kuaidaili also requires a User-Agent to be set before it will serve the page, and even then the IP and port I scraped did not match what the page displayed, which is rather mischievous. That is how free services are; otherwise nobody would pay. Paid proxies are not perfectly stable either, but they are certainly much better than free ones.

  • Maintaining proxy quality
    Proxies scraped from the web must be verified before they go into storage. The simplest check is to send a request through the proxy and confirm that the status code is 200. Among free sources I still recommend the two mentioned above, Xici and 66ip; compared with other free sources, their proxies are valid more often and there are more of them.
  • How proxies are stored
    I store the valid proxies in Redis. A set is the best data structure because it does not allow the same IP to be stored twice. There is no way to know how long a proxy will stay valid: some last a few tens of seconds, others ten minutes. While using the proxies, record how many times each IP fails, and remove it from the set once it reaches a certain number of failures. Because a proxy's lifetime cannot be determined, proxies should be used promptly; a timer can fetch them from Redis at regular intervals.
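The bookkeeping described above (a set that rejects duplicates, plus a failure counter that evicts a proxy after too many failures) can be sketched without a Redis server. This is only an illustration under stated assumptions: `ProxyPool`, its method names and the in-memory `HashSet` are hypothetical stand-ins for the Redis set, and the failure limit is whatever the caller passes in.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;

// Hypothetical in-memory stand-in for the Redis set of valid proxies.
public class ProxyPool
{
    private readonly HashSet<string> _proxies = new HashSet<string>();
    private readonly ConcurrentDictionary<string, int> _failures =
        new ConcurrentDictionary<string, int>();
    private readonly object _gate = new object();
    private readonly int _maxFailures;

    public ProxyPool(int maxFailures)
    {
        _maxFailures = maxFailures;
    }

    // Returns false when the proxy is already stored (set semantics).
    public bool Add(string proxy)
    {
        lock (_gate) { return _proxies.Add(proxy); }
    }

    public bool Contains(string proxy)
    {
        lock (_gate) { return _proxies.Contains(proxy); }
    }

    // Records one failed request; returns true if this failure evicted the proxy.
    public bool ReportFailure(string proxy)
    {
        int count = _failures.AddOrUpdate(proxy, 1, (key, old) => old + 1);
        if (count < _maxFailures) return false;

        _failures.TryRemove(proxy, out _);
        lock (_gate) { return _proxies.Remove(proxy); }
    }
}
```

With a limit of 3, the third call to `ReportFailure` for the same address removes it from the pool. In the real program the set operations would go to Redis instead of the `HashSet`.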

2. What are some common anti-crawler mechanisms?

Anti-crawler measures all try to determine whether a request comes from a real user. Sites with more valuable data combine several mechanisms so that crawling becomes more expensive or even impossible: header checks, IP restrictions, cookies and so on.

  • IP restrictions
    To keep crawlers out, some sites limit how often each IP may access them. Access frequency is a matter of speed: call Thread.Sleep to pause for a while between requests. The number of IPs can be increased by crawling free proxies.
  • Restrictions in the headers
    User-Agent: identifies the client. This one is simple: collect some common browser User-Agent strings and set one at random on each request.
    Referer: tells the server which link the request for the target page came from. Sites use it for image hotlink protection; for this kind of refreshing it can of course be forged as well.
    Cookies: after login or some other user action, the server returns cookie information, and a request without cookies is easily recognized as forged. Cookies may also be set locally by JavaScript based on information the server returns. This is not so simple, because it usually involves encryption and decryption. It is one of the hard parts of writing a crawler.
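The User-Agent and Referer handling above can be sketched as follows. This is a minimal illustration, not the article's code: `RequestDisguise` and its members are hypothetical names, the two User-Agent strings are just samples, and the request is only built, never sent.

```csharp
using System;
using System.Net.Http;

// Hypothetical helper that builds a GET request with a random User-Agent and a chosen Referer.
public static class RequestDisguise
{
    // Sample values only; a real list would hold many more common browser strings.
    private static readonly string[] UserAgents =
    {
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.1 Safari/537.36"
    };

    private static readonly Random Rng = new Random();
    private static readonly object RngLock = new object();

    // Random is not thread-safe, so access to the shared instance is locked.
    public static string PickUserAgent()
    {
        lock (RngLock) { return UserAgents[Rng.Next(UserAgents.Length)]; }
    }

    public static HttpRequestMessage Build(string url, string referer)
    {
        var request = new HttpRequestMessage(HttpMethod.Get, url);
        // Skip strict header validation, since real-world User-Agent strings are often irregular.
        request.Headers.TryAddWithoutValidation("User-Agent", PickUserAgent());
        request.Headers.Referrer = new Uri(referer);
        return request;
    }
}
```

Pausing between requests, as described under IP restrictions, would sit around the code that actually sends the request.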

3. Using proxy IPs to refresh a CSDN article's view count

CSDN articles are still fairly easy to refresh, provided you have enough proxies; with too few, it is very slow. The previous article, "C# Bulk Crawl Free Proxies and Verify Validity", already covered crawling proxies from several free proxy sites, so I will not repeat that here and will simply build on it.

1. Requests are sent in bulk from multiple threads for efficiency, with the proxies divided evenly so that each thread handles a fixed share of the requests.
2. Proxies are fetched from Redis at regular intervals.
3. A ConcurrentDictionary from the System.Collections.Concurrent namespace counts failures per proxy; once a proxy reaches a set number of failures, it is removed from the library.

The code achieves its main purpose. Its weak point is that there are too few proxies, so it is inefficient.
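The even division of proxies among threads can be isolated into a small pure function. The sketch below is an assumption-laden illustration: `WorkSplitter`, `Split` and the `perWorker` parameter are hypothetical names, and the last chunk simply absorbs whatever remainder is left over.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper that divides a list into one chunk per worker.
public static class WorkSplitter
{
    public static List<List<T>> Split<T>(IReadOnlyList<T> items, int perWorker)
    {
        if (perWorker <= 0) throw new ArgumentOutOfRangeException(nameof(perWorker));

        // At least one worker, even when there are fewer items than perWorker.
        int workers = Math.Max(1, items.Count / perWorker);
        int size = items.Count / workers;

        var result = new List<List<T>>();
        for (int i = 0; i < workers; i++)
        {
            int start = i * size;
            // The last worker also takes the remainder.
            int count = (i == workers - 1) ? items.Count - start : size;
            var chunk = new List<T>(count);
            for (int j = start; j < start + count; j++) chunk.Add(items[j]);
            result.Add(chunk);
        }
        return result;
    }
}
```

For 25 proxies and `perWorker` set to 10, this gives two chunks of 12 and 13 items. Keeping the split separate from the threading code makes it easy to check on its own.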
Results

Last night I read an article with quite a story behind it, warning readers to be wary of junk projects that hang out an open-source signboard and swindle people everywhere, with ibase4j as its example. So I picked a CSDN article by the same blogger, one that accuses a Beijing company of not paying wages, as the target to refresh. The refresh took quite a while, mainly because there were too few proxies.

The main code is as follows:
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Net;
using System.Threading;
using System.Threading.Tasks;

class Program
{
    // 1 while a batch is running, 0 otherwise; keeps timer ticks from overlapping.
    static int batchRunning;
    // Key: proxy "ip:port" that failed; value: failure count.
    static ConcurrentDictionary<string, int> failStatis;
    // Only this fragment of the article address survives here; WebRequest.Create needs the full URL.
    static string refreshLink = "80734388";
    static string requestSuccessKey, requestFailKey;
    static Timer timer;
    const int MaxFailures = 6;
    static readonly Random random = new Random();

    static void Main(string[] args)
    {
        ThreadPool.SetMinThreads(500, 100);
        failStatis = new ConcurrentDictionary<string, int>();
        requestSuccessKey = "List_request_success" + DateTime.Now.ToString("hhmm");
        requestFailKey = "List_request_fail" + DateTime.Now.ToString("hhmm");

        // Every 30 seconds, start a new batch unless the previous one is still running.
        timer = new Timer(async state =>
        {
            if (Interlocked.CompareExchange(ref batchRunning, 1, 0) != 0) return;
            try
            {
                // Get the proxies.
                List<string> proxyIps = RedisHelper.GetProxy();
                // Give each worker about 10 requests, with at least one worker.
                int workerCount = Math.Max(1, proxyIps.Count / 10);
                int requestCount = proxyIps.Count / workerCount;
                var workers = new List<Task>();
                for (int i = 0; i < workerCount; i++)
                {
                    int start = i * requestCount;
                    // The last worker also takes the remainder.
                    int count = (i == workerCount - 1) ? proxyIps.Count - start : requestCount;
                    List<string> tempList = proxyIps.GetRange(start, count);
                    workers.Add(Task.Run(() => Finish(tempList)));
                }
                await Task.WhenAll(workers);
            }
            catch (Exception ex)
            {
                Console.WriteLine("Batch failed: " + ex.Message);
            }
            finally
            {
                Interlocked.Exchange(ref batchRunning, 0);
            }
        }, "Processing Timer event", 0, 1000 * 30);

        Console.ReadLine();
    }

    // Sends one request through each proxy in the list.
    public static async Task Finish(List<string> proxyIps)
    {
        foreach (string ip in proxyIps)
        {
            int index = ip.IndexOf(":");
            string ipAddress = ip.Substring(0, index);
            int ipPort = int.Parse(ip.Substring(index + 1));

            // Random pause of 1 to 3 seconds.
            await Task.Delay(NextRandom(1, 4) * 1000);

            await Get(ipAddress, ipPort, 10000, RandomUserAgent(), refreshLink,
                () =>
                {
                    RedisHelper.AddRequestOk(requestSuccessKey, ip + " " + DateTime.Now.ToShortTimeString(), true);
                    Console.ForegroundColor = ConsoleColor.White;
                    Console.WriteLine(ip + " Success");
                },
                error =>
                {
                    RedisHelper.AddRequestOk(requestFailKey, ip + " " + DateTime.Now.ToShortTimeString(), false);
                    int failures = failStatis.AddOrUpdate(ip, 1, (key, oldValue) => oldValue + 1);
                    Console.ForegroundColor = ConsoleColor.Red;
                    Console.WriteLine(ipAddress + " " + error + " failed " + failures + " times");
                    // Drop the proxy once it has failed more than MaxFailures times.
                    if (failures > MaxFailures)
                    {
                        RedisHelper.RemoveSetValue(ip);
                        failStatis.TryRemove(ip, out _);
                    }
                });
        }
    }

    // Random is not thread-safe, so the shared instance is locked.
    private static int NextRandom(int minInclusive, int maxExclusive)
    {
        lock (random) { return random.Next(minInclusive, maxExclusive); }
    }

    private static string RandomUserAgent()
    {
        string[] usersAgents = new string[]
        {
            "Mozilla/5.0 (Linux; U; Android 2.3.7; en-us; Nexus One Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
            "MQQBrowser/26 Mozilla/5.0 (Linux; U; Android 2.3.7; zh-cn; MB200 Build/GRJ22; CyanogenMod-7) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
            "JUC (Linux; U; 2.3.7; zh-cn; MB200; 320*480) UCWEB7.9.3.103/139/999",
            "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:7.0a1) Gecko/20110623 Firefox/7.0a1 Fennec/7.0a1",
            "Opera/9.80 (Android 2.3.4; Linux; Opera Mobi/build-1107180945; U; en-GB) Presto/2.8.149 Version/11.10",
            "Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/534.13 (KHTML, like Gecko) Version/4.0 Safari/534.13",
            "Mozilla/5.0 (iPhone; U; CPU iPhone OS 3_0 like Mac OS X; en-us) AppleWebKit/420.1 (KHTML, like Gecko) Version/3.0 Mobile/1A542a Safari/419.3",
            "Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_0 like Mac OS X; en-us) AppleWebKit/532.9 (KHTML, like Gecko) Version/4.0.5 Mobile/8A293 Safari/6531.22.7",
            "Mozilla/5.0 (iPad; U; CPU OS 3_2 like Mac OS X; en-us) AppleWebKit/531.21.10 (KHTML, like Gecko) Version/4.0.4 Mobile/7B334b Safari/531.21.10",
            "Mozilla/5.0 (BlackBerry; U; BlackBerry 9800; en) AppleWebKit/534.1+ (KHTML, like Gecko) Version/6.0.0.337 Mobile Safari/534.1+",
            "Mozilla/5.0 (hp-tablet; Linux; hpwOS/3.0.0; U; en-US) AppleWebKit/534.6 (KHTML, like Gecko) wOSBrowser/233.70 Safari/534.6 TouchPad/1.0",
            "Mozilla/5.0 (SymbianOS/9.4; Series60/5.0 NokiaN97-1/20.0.019; Profile/MIDP-2.1 Configuration/CLDC-1.1) AppleWebKit/525 (KHTML, like Gecko) BrowserNG/7.1.18124",
            "Mozilla/5.0 (compatible; MSIE 9.0; Windows Phone OS 7.5; Trident/5.0; IEMobile/9.0; HTC; Titan)",
            "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36",
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.1 Safari/537.36",
            "Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10",
            "Mozilla/5.0 (Windows NT 5.1; U; en; rv:1.8.1) Gecko/20061208 Firefox/2.0.0 Opera 9.50",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari/534.57.2",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 UBrowser/4.0.3214.0 Safari/537.36",
            "Mozilla/5.0 (Linux; U; Android 2.2.1; zh-cn; HTC_Wildfire_A3333 Build/FRG83D) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
            "Mozilla/4.0 (compatible; MSIE 6.0; ) Opera/UCWEB7.0.2.37/28/999",
            "Openwave/UCWEB7.0.2.37/28/999",
            "NOKIA5700/UCWEB7.0.2.37/28/999",
            "UCWEB7.0.2.37/28/999",
            "Mozilla/5.0 (iPad; U; CPU OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5",
        };
        return usersAgents[NextRandom(0, usersAgents.Length)];
    }

    // Requests the URL through the given proxy and reports the outcome via the callbacks.
    public static async Task Get(string proxyIp, int proxyPort, int timeout, string randomUserAgent, string url, Action success, Action<string> fail)
    {
        HttpWebRequest request = null;
        HttpWebResponse response = null;
        try
        {
            request = (HttpWebRequest)WebRequest.Create(url);
            request.Timeout = timeout;
            request.UserAgent = randomUserAgent;
            request.Proxy = new WebProxy(proxyIp, proxyPort);
            response = await request.GetResponseAsync() as HttpWebResponse;
            if (response.StatusCode == HttpStatusCode.OK)
            {
                success();
            }
            else
            {
                fail(response.StatusCode + ":" + response.StatusDescription);
            }
        }
        catch (Exception ex)
        {
            fail(ex.Message);
        }
        finally
        {
            if (request != null) request.Abort();
            if (response != null) response.Close();
        }
    }
}

RedisHelper.cs

using System.Collections.Generic;
using StackExchange.Redis;

public class RedisHelper
{
    private static readonly object Locker = new object();
    private static ConnectionMultiplexer _redis;
    private const string DefaultConnectionString = "127.0.0.1:6379,defaultDatabase=3";
    public const string REDIS_SET_KET_SUCCESS = "Set_success_ip";

    // Lazily creates a single shared connection (double-checked locking).
    private static ConnectionMultiplexer Manager
    {
        get
        {
            if (_redis == null)
            {
                lock (Locker)
                {
                    if (_redis == null)
                    {
                        _redis = GetManager();
                    }
                }
            }
            return _redis;
        }
    }

    private static ConnectionMultiplexer GetManager(string connectionString = null)
    {
        if (string.IsNullOrEmpty(connectionString))
        {
            connectionString = DefaultConnectionString;
        }
        return ConnectionMultiplexer.Connect(connectionString);
    }

    // Logs a request result to the list stored under key.
    // The caller picks the success or failure key, so isSuccess does not change the operation.
    public static void AddRequestOk(string key, string value, bool isSuccess)
    {
        var db = Manager.GetDatabase();
        db.ListLeftPush(key, value);
    }

    // Returns every proxy currently in the set of valid proxies.
    public static List<string> GetProxy()
    {
        List<string> result = new List<string>();
        var db = Manager.GetDatabase();
        var values = db.SetMembers(REDIS_SET_KET_SUCCESS);
        foreach (var value in values)
        {
            result.Add(value.ToString());
        }
        return result;
    }

    public static bool InsertSet(string value)
    {
        var db = Manager.GetDatabase();
        return db.SetAdd(REDIS_SET_KET_SUCCESS, value);
    }

    public static bool RemoveSetValue(string value)
    {
        var db = Manager.GetDatabase();
        return db.SetRemove(REDIS_SET_KET_SUCCESS, value);
    }
}

Original: C# Using Proxies to Refresh CSDN Article Views
Original link: https://www.cnblogs.com/zhangmumu/p/9275190.html
Zhang Lin
Free to reprint 2018-07-06 without the author's permission
