Python Crawler Design: Brushing Blog Visit Counts (inflating visits and likes, crawling pictures)

Source: Internet
Author: User
Tags: set time, python, script

Tools to prepare:

Install the Python software: https://www.python.org/

Fiddler packet-capture software: http://blog.csdn.net/qq_21792169/article/details/51628123


The principle of brushing blog visits is simple: every time a blog page is opened, its visit count goes up. (Sina, Sohu, and other blog platforms behave this way.)

count.py

<span style= "FONT-SIZE:18PX;" >import WebBrowser as Web  import time  import OS  import random  count = Random.randint ($)  j=0 While  J<count:      i=0 while      i<=8:          web.open_new_tab (' Http://blog.sina.com.cn/s/blog_ 552d7c620100aguu.html ')  #网址替换这里        i=i+1          time.sleep (3)  #这个时间根据自己电脑处理速度设置, unit is s    else:          Time.sleep (<span)  style= "font-family:arial, Helvetica, Sans-serif;" > #这个时间根据自己电脑处理速度设置, the Unit is s</span>        os.system (' taskkill/f/im chrome.exe ')  #google浏览器, the other replacement on the line        #print ' time webbrower closed '            j=j+1  </span>


You need to use Fiddler to capture the request header data, such as Cookie, Host, Referer, User-Agent, and so on.

sina.py

<span style= "FONT-SIZE:18PX;" >import Urllib.requestimport syspoints = 2   #how count? If Len (SYS.ARGV) > 1:    points = int (sys.argv[1]) ARITC Leurl = ' Point_header = {    ' Accept ': ' */* ',    ' Cookie ': ', #填你的cookie信息    ' Host ': ',  #主机    ' Referer ': ' ,    ' user-agent ': ' mozilla/5.0 (Windows NT 5.1) applewebkit/537.36 (khtml, like Gecko) chrome/49.0.2623.110 safari/537 . + ',}for I in range (points):    point_request = Urllib.request.Request (aritcleurl, headers = point_header)    point _response = Urllib.request.urlopen (point_request) </span>


The headers above can all be obtained from the captured packets; the code here is just to convey the idea.
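The script above sends the same headers on every request. One simple idea to make repeated hits look a little less uniform is to rotate the User-Agent and sleep a random interval between requests (the later grab_ip.py even defines a user_agent_list but never uses it). A minimal Python 3 sketch, reusing the Sina URL from count.py; the values are placeholders, not the author's exact settings:

import random
import time
import urllib.request

# placeholder values -- replace with data captured in Fiddler
article_url = 'http://blog.sina.com.cn/s/blog_552d7c620100aguu.html'
user_agents = [
    'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.110 Safari/537.36',
    'Mozilla/5.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)',
]

for _ in range(10):
    headers = {
        'Accept': '*/*',
        'User-Agent': random.choice(user_agents),   # pick a different browser identity each time
        # 'Cookie': '...',   # fill in from the captured packet if the site requires it
    }
    req = urllib.request.Request(article_url, headers=headers)
    urllib.request.urlopen(req).read()
    time.sleep(random.uniform(2, 5))   # pause so the requests are not perfectly regular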


To crawl a picture on a webpage:

getimg.py

#coding=utf-8
import urllib
import urllib2
import re

def getHtml(url):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
    req = urllib2.Request(url, headers=headers)
    page = urllib2.urlopen(req)
    html = page.read()
    return html

def getImg(html):
    reg = r'src="(h.*?g)"'
    #reg = r'


1. The three characters .*? match any number of arbitrary characters (non-greedy).

2. \. is the escaped '.', i.e. it matches the literal dot in the HTML rather than any character.

3. () means we only capture the part inside the parentheses; everything else is discarded. (See the sketch below for how the pattern is used to download the pictures.)
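The getimg.py listing above is cut off right after the regular expression. As a rough guide, here is a minimal Python 3 sketch (urllib.request instead of urllib2) of how the downloading half might look, using the same src="(h.*?g)" pattern; it is only an illustration, not the original script:

import re
import urllib.request

def get_img(html):
    # find every src="h...g" value (e.g. http...jpg/png), per the pattern explained above
    reg = r'src="(h.*?g)"'
    img_urls = re.findall(reg, html)
    for x, img_url in enumerate(img_urls):
        # save each picture locally as 0.jpg, 1.jpg, ...
        urllib.request.urlretrieve(img_url, '%s.jpg' % x)
    return img_urls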



Crawl CSDN traffic: csdn.py

#!usr/bin/python
# -*- coding:utf-8 -*-

import urllib2
import re

# current blog list page number
page_num = 1
# whether this is NOT the last list page
notLast = 1

fs = open('blogs.txt', 'w')
account = str(raw_input('Input csdn account: '))

while notLast:
    # home page address
    baseUrl = 'http://blog.csdn.net/' + account
    # append the page number to build the URL of the list page to crawl
    myUrl = baseUrl + '/article/list/' + str(page_num)
    # pretend to be a browser; direct access to CSDN is rejected
    user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
    headers = {'User-Agent': user_agent}
    # build the request
    req = urllib2.Request(myUrl, headers=headers)
    # fetch the page
    myResponse = urllib2.urlopen(req)
    myPage = myResponse.read()
    # look for the "last page" (尾页) link to decide whether this is the last page
    notLast = re.findall('<a href=".*?">尾页</a>', myPage, re.S)
    print '-----------------------------page %d---------------------------------' % (page_num,)
    fs.write('-----------------------------page %d---------------------------------\n' % page_num)
    # use a regular expression to get the href of each blog post
    title_href = re.findall('<span class="link_title"><a href="(.*?)">', myPage, re.S)
    titleListhref = []
    for items in title_href:
        titleListhref.append(str(items).lstrip().rstrip())
    # use a regular expression to get the title of each blog post
    title = re.findall('<span class="link_title"><a href=".*?">(.*?)</a></span>', myPage, re.S)
    titleList = []
    for items in title:
        titleList.append(str(items).lstrip().rstrip())
    # use a regular expression to get the view count of each post (阅读次数 = "read count", 阅读 = "read")
    view = re.findall('<span class="link_view".*?><a href=".*?" title="阅读次数">阅读</a>\((.*?)\)</span>', myPage, re.S)
    viewList = []
    for items in view:
        viewList.append(str(items).lstrip().rstrip())
    # print the results
    for n in range(len(titleList)):
        print 'Traffic: %s href: %s title: %s' % (viewList[n].zfill(4), titleListhref[n], titleList[n])
        fs.write('Traffic: %s\t\thref: %s\t\t title: %s\n' % (viewList[n].zfill(4), titleListhref[n], titleList[n]))
    # increment the page number
    page_num = page_num + 1



This regular expression is not written very completely: if there is a sticky (pinned) post, its title will carry an extra <font color="red">[top]</font> marker in front, so a judgment statement should be added here; readers can try it themselves.
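For instance, a small sketch of such a judgment, assuming the sticky marker really appears in the captured title as a <font ...>[top]</font> fragment (check the real page source first):

import re

def clean_title(title):
    # strip a leading <font ...>[top]</font> marker from sticky (pinned) posts;
    # the exact markup is an assumption -- adjust it to what the page actually contains
    title = re.sub(r'<font[^>]*>\s*\[top\]\s*</font>', '', title, flags=re.IGNORECASE)
    return title.strip()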


To generate an IP list manually, creat_ip.py:

# -*- coding:utf-8 -*-
#!/usr/bin/python
import time

time_start = time.time()

def get_ip(number=10, start='1.1.1.1'):
    file = open('ip_list.txt', 'w')
    starts = start.split('.')
    A = int(starts[0])
    B = int(starts[1])
    C = int(starts[2])
    D = int(starts[3])
    for a in range(A, 256):
        for b in range(B, 256):
            for c in range(C, 256):
                for d in range(D, 256):
                    ip = "%d.%d.%d.%d" % (a, b, c, d)
                    if number > 1:
                        file.write(ip + '\n')
                        number -= 1
                    elif number == 1:   # avoid an extra trailing newline after the last IP
                        file.write(ip)
                        number -= 1
                    else:
                        file.close()
                        print ip
                        return
                D = 0
            C = 0
        B = 0

get_ip(100000, '101.23.228.102')
time_end = time.time()
time = time_end - time_start
print 'elapsed %s seconds' % time
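On Python 3, the same consecutive-IP enumeration can be written much more compactly with the standard ipaddress module. This is just an alternative sketch, not the author's script:

import ipaddress

def get_ip(number=10, start='1.1.1.1'):
    # write `number` consecutive IPv4 addresses, beginning at `start`, one per line
    first = int(ipaddress.IPv4Address(start))
    with open('ip_list.txt', 'w') as f:
        f.write('\n'.join(str(ipaddress.IPv4Address(first + i)) for i in range(number)))

get_ip(100000, '101.23.228.102')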


grab_ip.py crawls a proxy-IP website and reads out the IP addresses and port numbers; exactly how you use these IPs and ports depends on your own situation.

#!/usr/bin/python
# -*- coding:utf-8 -*-
import urllib
import urllib2
import re
import time
import os
import random
import logging

url = 'http://www.xicidaili.com/'
csdn_url = 'http://blog.csdn.net/qq_21792169/article/details/51628142'
header = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}

user_agent_list = [
    'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11',
    'Opera/9.25 (Windows NT 5.1; U; en)',
    'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)',
    'Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.5 (like Gecko) (Kubuntu)',
    'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.12) Gecko/20070731 Ubuntu/dapper-security Firefox/1.5.0.12',
    'Lynx/2.8.5rel.1 libwww-FM/2.14 SSL-MM/1.4.1 GNUTLS/1.2.9'
]

def getProxyHtml(url):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
    req = urllib2.Request(url, headers=headers)
    page = urllib2.urlopen(req)
    html = page.read()
    return html

def ipPortGain(html):
    # the proxy site lists IP and port in adjacent table cells
    ip_re = re.compile(r'(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}).+\n.+>(\d{1,5})<')
    ip_port = re.findall(ip_re, html)
    return ip_port

def proxyIP(ip_port):
    # convert the IPs into the ['221.238.28.158:8081', '183.62.62.188:9999'] format
    proxyIP_list = []
    for i in range(0, len(ip_port)):
        proxyIP_list.append(':'.join(ip_port[i]))
        logging.info(proxyIP_list[i])
    # convert the IPs into the [{'http': 'http://221.238.28.158:8081'}, {'http': 'http://183.62.62.188:9999'}] format
    proxy_list = []
    for i in range(0, len(proxyIP_list)):
        a0 = 'http://%s' % proxyIP_list[i]
        a1 = {'http': '%s' % a0}
        proxy_list.append(a1)
    return proxy_list

def csdn_brush(ip):
    print ip

# use ping to verify whether the IP is alive
def ping_ip(ip):
    ping_cmd = 'ping -c 2 -w 5 %s' % ip
    ping_result = os.popen(ping_cmd).read()
    print 'ping_cmd : %s, ping_result : %r' % (ping_cmd, ping_result)
    if ping_result.find('100% packet loss') < 0:
        print 'ping %s ok' % ip
        return True
    else:
        print 'ping %s fail' % ip

fh = open('proxy_ip.txt', 'w')
html = getProxyHtml(url)
ip_port = ipPortGain(html)
proxy_list = proxyIP(ip_port)
for proxy_ip in proxy_list:
    # note: proxy_ip is a dict here; to ping the bare address you would pass the 'ip:port' string instead
    ping_ip(proxy_ip)
    fh.write('%s\n' % (proxy_ip,))
    res = urllib.urlopen(csdn_url, proxies=proxy_ip).read()
    # a for loop could be added here so that every blog post is requested once through this IP;
    # then each post's visit count increases by (number of IPs) * (number of processes).
    # Leave a time interval of about half an hour, because CSDN does time-based detection --
    # that is why we pair this with the C program below.
fh.close()
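The urllib.urlopen(..., proxies=...) call at the end is the Python 2 way of going through a proxy. On Python 3 the equivalent is a ProxyHandler-based opener; a minimal sketch, assuming one of the {'http': 'http://ip:port'} dictionaries produced by proxyIP():

import urllib.request

def fetch_via_proxy(url, proxy):
    # proxy is e.g. {'http': 'http://221.238.28.158:8081'}, as built by proxyIP()
    opener = urllib.request.build_opener(urllib.request.ProxyHandler(proxy))
    return opener.open(url, timeout=10).read()

# usage: fetch_via_proxy(csdn_url, proxy_list[0])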


With that, a complete visit-brushing script is written. However, one run of the script is only a single process, and once that process hits a problem the whole program stops, so here is a small C program to keep it running.

#include <stdlib.h>

int main(int argc, char **argv)
{
    while (1) {
        char *cmd = "python /home/book/csdn.py";   /* path of the Python script that brushes CSDN traffic */
        system(cmd);   /* run one process; as soon as it dies, a new one is started immediately.
                          One run of the script lasts about half an hour, so CSDN's time-based detection
                          does not take effect; daily visits = number of IPs * total number of posts * 24 * 2 */
    }
    return 0;
}
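If you would rather not leave Python, the same keep-restarting idea can also be sketched with the subprocess module (an alternative to the C wrapper above, not the author's method; the script path is the same placeholder used in the C example):

import subprocess
import time

# relaunch the brushing script whenever it exits, mirroring the C while(1) loop above
while True:
    subprocess.call(['python', '/home/book/csdn.py'])
    time.sleep(5)   # brief pause before restarting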


Finally, a more reliable way: grab some broilers (compromised "zombie" machines) and have them execute our script; safe and reliable.


Recommended article: http://blog.csdn.net/qq_21792169/article/details/5162702
