Python Crawler Design: Inflating Blog Traffic (boost visits and likes, crawl pictures)


Tools to prepare:

Install Python: https://www.python.org/

Fiddler packet-capture tool: http://blog.csdn.net/qq_21792169/article/details/51628123


The principle behind inflating blog traffic is simple: every time a blog page is opened, its visit count goes up by one. (Sina, Sohu, and other blog platforms behave this way.)

count.py

<span style= "FONT-SIZE:18PX;" >import WebBrowser as Web  import time  import OS  import random  count = Random.randint ($)  j=0 While  J<count:      i=0 while      i<=8:          web.open_new_tab (' Http://blog.sina.com.cn/s/blog_ 552d7c620100aguu.html ')  #网址替换这里        i=i+1          time.sleep (3)  #这个时间根据自己电脑处理速度设置, unit is s    else:          Time.sleep (<span)  style= "font-family:arial, Helvetica, Sans-serif;" > #这个时间根据自己电脑处理速度设置, the Unit is s</span>        os.system (' taskkill/f/im chrome.exe ')  #google浏览器, the other replacement on the line        #print ' time webbrower closed '            j=j+1  </span>


You need Fiddler to capture the request header data, such as Cookie, Host, Referer, User-Agent, and so on.

sina.py

<span style= "FONT-SIZE:18PX;" >import Urllib.requestimport syspoints = 2   #how count? If Len (SYS.ARGV) > 1:    points = int (sys.argv[1]) ARITC Leurl = ' Point_header = {    ' Accept ': ' */* ',    ' Cookie ': ', #填你的cookie信息    ' Host ': ',  #主机    ' Referer ': ' ,    ' user-agent ': ' mozilla/5.0 (Windows NT 5.1) applewebkit/537.36 (khtml, like Gecko) chrome/49.0.2623.110 safari/537 . + ',}for I in range (points):    point_request = Urllib.request.Request (aritcleurl, headers = point_header)    point _response = Urllib.request.urlopen (point_request) </span>


The headers above can be filled in from the captured packet data; the script is only meant to illustrate the idea.
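Before looping the request thousands of times, it is worth sending it once and checking that the forged headers are accepted; a minimal sketch (Python 3, like sina.py above; the URL and the lone User-Agent header are placeholders, fill in the full header set from Fiddler):

# One-off sanity check of the forged request before running the full loop.
import urllib.request

aritcleurl = 'http://blog.sina.com.cn/s/blog_552d7c620100aguu.html'   # placeholder article URL
point_header = {'User-Agent': 'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.110 Safari/537.36'}

req = urllib.request.Request(aritcleurl, headers=point_header)
resp = urllib.request.urlopen(req)
print(resp.getcode())    # 200 means the page was actually served
print(len(resp.read()))  # a non-trivial body length suggests the request was not rejected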


To crawl the pictures on a web page:

getimg.py

# coding=utf-8
import urllib
import urllib2
import re

def getHtml(url):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
    req = urllib2.Request(url, headers=headers)
    page = urllib2.urlopen(req)
    html = page.read()
    return html

def getImg(html):
    reg = r'src="(h.*?g)"'
    # the original snippet is cut off here; a minimal completion to download every matched image:
    imglist = re.findall(reg, html)
    x = 0
    for imgurl in imglist:
        urllib.urlretrieve(imgurl, '%s.jpg' % x)   # save as 0.jpg, 1.jpg, ...
        x = x + 1
    return imglist


1. The three characters .*? match any number of arbitrary characters (non-greedily).

2. \. escapes the '.', so it matches a literal dot in the HTML rather than "any character".

3. () means we only take the part inside the parentheses; everything else is discarded (see the short demo below).
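A short demo of the pattern used in getimg.py (the HTML string and image URLs below are made-up placeholders, just to show what the capture group returns):

# Demonstrates src="(h.*?g)": a non-greedy match from an 'h' to the next 'g' before the
# closing quote, and findall returns only the group in parentheses.
import re

html = '<img src="http://example.com/a.jpg"><img src="http://example.com/b.png">'  # placeholder HTML
print(re.findall(r'src="(h.*?g)"', html))
# -> ['http://example.com/a.jpg', 'http://example.com/b.png']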



Crawling CSDN traffic: csdn.py

#!/usr/bin/python
# -*- coding:utf-8 -*-
import urllib2
import re

# current page number of the blog list
page_num = 1
# whether we have reached the last list page
notLast = 1

fs = open('blogs.txt', 'w')
account = str(raw_input('Input csdn account: '))

while notLast:
    # home page address
    baseUrl = 'http://blog.csdn.net/' + account
    # append the page number to build the URL of the page to crawl
    myUrl = baseUrl + '/article/list/' + str(page_num)
    # pretend to be a browser; direct access is rejected by CSDN
    user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
    headers = {'User-Agent': user_agent}
    # build the request
    req = urllib2.Request(myUrl, headers=headers)
    # fetch the page
    myResponse = urllib2.urlopen(req)
    myPage = myResponse.read()
    # look for the "last page" link to decide whether this is the final page
    notLast = re.findall('<a href=".*?">尾页</a>', myPage, re.S)   # "尾页" is CSDN's "last page" link text

    print '-----------------------------page %d---------------------------------' % (page_num,)
    fs.write('-----------------------------page %d---------------------------------\n' % page_num)

    # use a regular expression to get each post's href
    title_href = re.findall('<span class="link_title"><a href="(.*?)">', myPage, re.S)
    titleListhref = []
    for items in title_href:
        titleListhref.append(str(items).lstrip().rstrip())

    # use a regular expression to get each post's title
    title = re.findall('<span class="link_title"><a href=".*?">(.*?)</a></span>', myPage, re.S)
    titleList = []
    for items in title:
        titleList.append(str(items).lstrip().rstrip())

    # use a regular expression to get each post's view count ("阅读" is CSDN's "read" label)
    view = re.findall('<span class="link_view".*?><a href=".*?" title="阅读次数">阅读</a>\((.*?)\)</span>', myPage, re.S)
    viewList = []
    for items in view:
        viewList.append(str(items).lstrip().rstrip())

    # print the results
    for n in range(len(titleList)):
        print 'Traffic:%s href:%s title:%s' % (viewList[n].zfill(4), titleListhref[n], titleList[n])
        fs.write('Traffic:%s\t\thref:%s\t\ttitle:%s\n' % (viewList[n].zfill(4), titleListhref[n], titleList[n]))

    # move to the next page
    page_num = page_num + 1



This regular expression is not fully complete: if the blog has a pinned article, its captured title will also contain <font color="Red">[top]</font>, so a judgment statement should be added here; readers can try it themselves.
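One way to add that judgment is to strip the marker from each captured title before it is printed; a minimal sketch, assuming the marker text is exactly the snippet quoted above:

# Minimal sketch: remove the pinned-post marker from a captured title before using it.
import re

def clean_title(title):
    title = re.sub(r'<font color="Red">\[top\]</font>', '', title)
    return title.strip()

print(clean_title('<font color="Red">[top]</font> My pinned post'))   # -> 'My pinned post'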


To generate an IP list manually, creat_ip:

#!/usr/bin/python
# -*- coding:utf-8 -*-
import time

time_start = time.time()

def get_ip(number=10, start='1.1.1.1'):
    file = open('ip_list.txt', 'w')
    starts = start.split('.')
    a = int(starts[0])
    b = int(starts[1])
    c = int(starts[2])
    d = int(starts[3])
    for a in range(a, 256):
        for b in range(b, 256):
            for c in range(c, 256):
                for d in range(d, 256):
                    ip = "%d.%d.%d.%d" % (a, b, c, d)
                    if number > 1:
                        file.write(ip + '\n')
                        number -= 1
                    elif number == 1:
                        file.write(ip)   # no newline on the last line, to avoid a trailing blank line
                        number -= 1
                    else:
                        file.close()
                        print ip
                        return
                d = 0   # after the first pass, the later octets start again from 0
            c = 0
        b = 0

get_ip(100000, '101.23.228.102')

time_end = time.time()
time = time_end - time_start
print 'elapsed %s seconds' % time
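A quick way to verify the output (a minimal sketch; it only reads ip_list.txt back and reports how many addresses were written):

# Read the generated file back and show a small sample of it.
with open('ip_list.txt') as f:
    ips = f.read().splitlines()
print('%d addresses generated, e.g. %s' % (len(ips), ips[:3]))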


grab_ip.py crawls a proxy-IP site and reads out the IP addresses and port numbers; exactly how you use these IPs and ports depends on your own situation.

#!/usr/bin/python
# -*- coding:utf-8 -*-
import urllib
import urllib2
import re
import time
import os
import random
import logging

url = 'http://www.xicidaili.com/'
csdn_url = 'http://blog.csdn.net/qq_21792169/article/details/51628142'
header = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}

user_agent_list = [
    'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11',
    'Opera/9.25 (Windows NT 5.1; U; en)',
    'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)',
    'Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.5 (like Gecko) (Kubuntu)',
    'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.12) Gecko/20070731 Ubuntu/dapper-security Firefox/1.5.0.12',
    'Lynx/2.8.5rel.1 libwww-FM/2.14 SSL-MM/1.4.1 GNUTLS/1.2.9',
]

def getProxyHtml(url):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
    req = urllib2.Request(url, headers=headers)
    page = urllib2.urlopen(req)
    html = page.read()
    return html

def ipPortGain(html):
    # extract the IP / port pairs from the page
    ip_re = re.compile(r'(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}).+\n.+>(\d{1,5})<')
    ip_port = re.findall(ip_re, html)
    return ip_port

def proxyIP(ip_port):
    # turn the pairs into ['221.238.28.158:8081', '183.62.62.188:9999'] format
    proxyIP = []
    for i in range(0, len(ip_port)):
        proxyIP.append(':'.join(ip_port[i]))
        logging.info(proxyIP[i])
    # then into [{'http': 'http://221.238.28.158:8081'}, {'http': 'http://183.62.62.188:9999'}] format
    proxy_list = []
    for i in range(0, len(proxyIP)):
        a0 = 'http://%s' % proxyIP[i]
        a1 = {'http': '%s' % a0}
        proxy_list.append(a1)
    return proxy_list

def csdn_Brush(ip):
    print ip

# use ping to check whether an IP is alive
def ping_ip(ip):
    ping_cmd = 'ping -c 2 -w 5 %s' % ip
    ping_result = os.popen(ping_cmd).read()
    print 'ping_cmd : %s, ping_result : %r' % (ping_cmd, ping_result)
    if ping_result.find('100% packet loss') < 0:
        print 'ping %s ok' % ip
        return True
    else:
        print 'ping %s fail' % ip

fh = open('proxy_ip.txt', 'w')
html = getProxyHtml(url)
ip_port = ipPortGain(html)
proxy_list = proxyIP(ip_port)
for proxy_ip in proxy_list:
    host = proxy_ip['http'].replace('http://', '').split(':')[0]   # bare IP for the ping check
    ping_ip(host)
    fh.write('%s\n' % (proxy_ip,))
    res = urllib.urlopen(csdn_url, proxies=proxy_ip).read()
    # a for loop could be added here so that every article on the blog is requested once
    # through this IP; the visit count then scales with the number of IPs (see the sketch below).
    # (Keep a time interval of roughly half an hour between runs; CSDN does time-based
    #  detection, which is why the C wrapper later is used.)
fh.close()
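The comment at the end hints at the missing loop. A minimal sketch of what it might look like (Python 2, reusing the proxy_list built by the script above; article_list is a hypothetical list of your own post URLs, not part of the original script):

# A sketch only: one hit per article through each proxy, with a small pause between hits.
import urllib, time, random

article_list = [
    'http://blog.csdn.net/qq_21792169/article/details/51628142',   # placeholder; add your own posts
]

for proxy_ip in proxy_list:                  # one pass per proxy IP
    for article in article_list:             # one request per post through this IP
        try:
            urllib.urlopen(article, proxies=proxy_ip).read()
        except IOError:
            break                            # this proxy is dead; try the next one
        time.sleep(random.randint(5, 15))    # spread the requests out a little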


With that, a complete traffic-inflating script is in place. But one run of the script is just a single process, and when that process hits a problem the whole program stops running, so here is a small C wrapper.

#include <stdlib.h>

int main(int argc, char **argv)
{
    while (1) {
        char *cmd = "python /home/book/csdn.py";  /* path to the Python traffic script */
        system(cmd);   /* runs the script as a child process; as soon as that process has a
                          problem and exits, a new one is started immediately.  One run takes
                          about half an hour, so CSDN's time-based detection does not trigger;
                          daily visits = IPs * total posts * 24 * 2 */
    }
    return 0;
}
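If a C compiler is not at hand, the same restart loop can be sketched in Python (a minimal sketch; the script path is the same placeholder as in the C version):

# Restart the traffic script whenever its process exits -- same idea as the C wrapper above.
import subprocess

while True:
    subprocess.call(['python', '/home/book/csdn.py'])   # blocks until that run ends, then loops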


Finally, the most dependable approach: take over "broiler" machines (compromised hosts) and have them run the script for us; safe and reliable.


Recommended article: http://blog.csdn.net/qq_21792169/article/details/5162702

