Today on ChinaUnix I came across a snippet that logs crawler visits:
function saveRobot($dir)
{
    $addtime = date('Y-m-d H:i:s', time());
    $GetLocationURL = "http://" . $_SERVER["HTTP_HOST"] . $_SERVER['REQUEST_URI'];
    // Not every request carries a User-Agent header.
    $agent1 = isset($_SERVER["HTTP_USER_AGENT"]) ? $_SERVER["HTTP_USER_AGENT"] : '';
    $agent = strtolower($agent1);
    $Bot = '';
    // strpos() returns false when there is no match, so compare with !== false.
    if (strpos($agent, "googlebot") !== false) { $Bot = "Google"; }
    if (strpos($agent, "mediapartners-google") !== false) { $Bot = "Google"; }
    if (strpos($agent, "baiduspider") !== false) { $Bot = "Baidu"; }
    if (strpos($agent, "sogou spider") !== false) { $Bot = "Sogou"; }
    if (strpos($agent, "sosospider") !== false) { $Bot = "Soso"; }
    if ($Bot != "") {
        $MDateTime = date("Y-m-d");
        // One log file per day; FILE_APPEND creates today's file if it does not exist yet.
        file_put_contents($dir . "/" . $MDateTime . ".html", "$Bot-$GetLocationURL-$addtime<br>", FILE_APPEND);
        // echo $agent . '-' . $Bot . '-' . $GetLocationURL;
    }
}
This shows that when a crawler visits your website, it identifies itself in the User-Agent request header, which PHP exposes as $_SERVER["HTTP_USER_AGENT"]. Each crawler uses a different name.
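As a minimal sketch of that idea, the function below maps a User-Agent string to a crawler label. The name detectBot and the sample User-Agent strings are made up for illustration; the substrings are the ones used in the snippet above.

```php
<?php
// Hypothetical helper (not part of the original snippet): returns a crawler
// label for a User-Agent string, or '' when no known crawler matches.
function detectBot($userAgent)
{
    // Lowercase substring => label, taken from the snippet above.
    $signatures = array(
        'googlebot'            => 'Google',
        'mediapartners-google' => 'Google',
        'baiduspider'          => 'Baidu',
        'sogou spider'         => 'Sogou',
        'sosospider'           => 'Soso',
    );
    $agent = strtolower($userAgent);
    foreach ($signatures as $needle => $label) {
        // Strict comparison: a match at position 0 returns int(0), not false.
        if (strpos($agent, $needle) !== false) {
            return $label;
        }
    }
    return '';
}

// Illustrative User-Agent strings.
echo detectBot('Googlebot/2.1 (+http://www.google.com/bot.html)'), "\n";
echo detectBot('Mozilla/5.0 (Windows NT 10.0)') === '' ? 'not a known crawler' : 'crawler', "\n";
```

The first call matches at position 0 of the string, which is why the check uses !== false rather than a loose comparison. In a real page you would pass $_SERVER["HTTP_USER_AGENT"], after checking with isset() that the header was sent.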
I then searched online for a more complete crawler-logging program and am posting it here for reference:
<?php
/**
 * name: file cls_spider.php
 * -------- Description --------
 * This class monitors what search engine crawlers do on a website.
 * It is written in PHP and only works on PHP websites.
 * The code does not use a database; records are written directly to text files.
 * Create a spider folder in the root directory.
 * The records are for reference only and are not necessarily complete,
 * because requests for files that do not run this code are not recorded.
 * This code is free of charge. You may copy and modify it as needed,
 * but please retain my copyright information.
 * -------- Usage --------
 * Add the following code to the pages you want to track. It is usually
 * easiest to put it in a file that is included globally.
 *   require(ROOT_PATH . 'directory of this file/cls_spider.php');
 *   $spider = new spider();
 * If you cannot get it installed, contact me through one of the following:
 * QQ: 235534
 * EMAIL: dreamisok@qq.com
 * blog: http://blog.toptao123.com
 * Please support my websites http://www.ataobao.net and http://www.toptao123.com; link exchanges are welcome.
 */
class spider
{
    var $searchbot = "";
    var $tlc_thispage = "";
    var $filename = "";
    var $timestr = "";
    // Display name => lowercase substring to look for in the User-Agent.
    var $spider_array = array(
        "Googlebot" => "googlebot",
        "Google Adsense" => "mediapartners-google",
        "YODAO" => "yodaobot",
        "MSNbot" => "msnbot",
        "Yahoobot" => "slurp",
        "BaiduSpider" => "baiduspider",
        "Sohubot" => "sohu-search",
        "IASK" => "iaskspider",
        "SOGOU" => "sogou",
        "Robozilla" => "robozilla",
        "Lycos" => "lycos"
    );

    function __construct()
    {
        $this->tlc_thispage = addslashes($_SERVER["REQUEST_URI"]);
        // One log file per day inside the spider folder.
        $this->filename = 'spider/' . date("ymd") . '.txt';
        $this->timestr = $this->nowtime();
        $this->searchbot = $this->get_naps_bot();
        $this->spider();
    }

    // Appends one line to today's log file when the visitor is a known crawler.
    function spider()
    {
        if (!empty($this->searchbot)) {
            $writestring = "Time: " . $this->timestr . " Robot: " . $this->searchbot . " URL: " . $this->tlc_thispage . "\n";
            $data = fopen($this->filename, "a");
            fwrite($data, $writestring);
            fclose($data);
        }
    }

    // Returns the crawler's display name, or false if the visitor is not a known crawler.
    function get_naps_bot()
    {
        if (isset($_SERVER['HTTP_USER_AGENT'])) {
            $useragent = strtolower($_SERVER['HTTP_USER_AGENT']);
            foreach ($this->spider_array as $key => $value) {
                // Use !== false: a match at position 0 returns int(0),
                // which a loose != false comparison would treat as "not found".
                if (strpos($useragent, $value) !== false) {
                    return $key;
                }
            }
        }
        return false;
    }

    function nowtime()
    {
        $date = date("Y-m-d.G:i:s");
        return $date;
    }
}
?>