Three entry-level crawlers written in Python (with notes)
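A note up front: the author is still learning Python and is a beginner, so some of the comments may be inaccurate; please bear with me. These three small crawlers are not difficult and may not be very useful; they are mainly practice for newcomers in using and understanding functions. Experts, and readers who want something of practical value, can skip this post.
Note: I use Python 2.7.13. Readers on Python 3 may find that some of the code fails to run.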
The first: a web page source crawler
# -*- coding: utf-8 -*-
# A very detailed yet simple little crawler
# ---------------------------------
from urllib2 import urlopen  # import the urlopen function from urllib2; it fetches the content of a URL

url = raw_input('>')  # raw_input asks the user for the page to crawl and stores it in url
x = urlopen('http://' + url)  # urlopen fetches the page; http:// is prepended in case the user forgot the scheme
print x.read()  # read() returns the source, and print displays it
Run the script in IDLE, the editor bundled with Python, and enter www.baidu.com; the source of the page is printed.
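For readers on Python 3, where `urllib2` and `raw_input` no longer exist, the same idea could be sketched roughly as below. This is not the author's code: the helper names `normalize_url` and `fetch_source` are made up for illustration, and decoding the page as UTF-8 (with replacement of bad bytes) is an assumption about the page's encoding.

```python
from urllib.request import urlopen  # Python 3 location of urlopen


def normalize_url(user_input):
    """Prepend http:// only when the user left the scheme out."""
    user_input = user_input.strip()
    if user_input.startswith(('http://', 'https://')):
        return user_input
    return 'http://' + user_input


def fetch_source(url, timeout=10):
    """Download a page and return its source as text (assumes UTF-8)."""
    with urlopen(normalize_url(url), timeout=timeout) as response:
        return response.read().decode('utf-8', errors='replace')


# Example (needs network access):
# print(fetch_source('www.baidu.com'))
```

Checking for an existing scheme before adding `http://` avoids producing `http://http://...` when the user does type the full address.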
The second: crawl and download Baidu Tieba pages
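# -*- coding: utf-8 -*-
# Crawl and download the HTML pages of a Baidu Tieba thread
# 秦王IPC
# ------------------------------------------------
import string  # provides zfill for zero-padded file names
from urllib2 import urlopen  # the urlopen function from urllib2
# ---------------------------------
# Define the function
def paqutieba(url, x, y):
    for j in range(x, y + 1):  # loop from the start page to the end page
        Name = string.zfill(j, 5) + '.html'  # e.g. string.zfill(1, 5) gives the file name 00001.html
        print 'Downloading page ' + str(j) + ' and saving it as ' + Name + '......'  # progress message
        l = open(Name, 'wb')  # open the output file for writing (binary, so the bytes are saved unchanged)
        k = urlopen(url + str(j)).read()  # fetch the page and read its content into k
        l.write(k)  # write k to the file
        l.close()  # close the file
# --------------------------------- User interaction ---------------
url = str(raw_input('Enter the thread address without the number after ?pn=\nFormat: https://tieba.baidu.com/p/xxxxxxxxx?pn= \n>>>'))
x = int(raw_input('Enter the first page number:\n'))
y = int(raw_input('Enter the last page number:\n'))
paqutieba(url, x, y)  # call the function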
Run the script in IDLE and enter the address of any Tieba thread (some URLs do not contain ?pn=, in which case add it yourself). Each page is saved as a zero-padded HTML file such as 00001.html.
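The file naming and the handling of a missing `?pn=` can be separated from the download itself. The following is a minimal sketch, not the author's code: `page_filename`, `page_url` and `plan_downloads` are hypothetical helper names, and it uses `str.zfill`, which works in both Python 2 and Python 3 (the `string.zfill` function is gone in Python 3).

```python
def page_filename(page, width=5):
    """Zero-padded file name, e.g. page 1 becomes 00001.html."""
    return str(page).zfill(width) + '.html'


def page_url(base, page):
    """Build the URL of one page, adding the pn= parameter if the base lacks it."""
    if 'pn=' not in base:
        separator = '&' if '?' in base else '?'
        base = base + separator + 'pn='
    return base + str(page)


def plan_downloads(base, first, last):
    """List (url, file name) pairs for pages first..last inclusive."""
    return [(page_url(base, p), page_filename(p))
            for p in range(first, last + 1)]
```

Building the list of URL and file-name pairs first makes it easy to print and check the plan before any request is sent.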
The third: crawl Neihan Duanzi (neihanshequ.com) and write the jokes to a text file (beginners may find this one harder to follow and more tedious)
Tools needed: Google Chrome, Firefox, and the determination to keep going.
Preparatory work
1. Open the Neihan Duanzi website in Chrome, then press F12 to open the developer tools and capture the network traffic.
The panel first appears on the right; the dock-side option can move it to the bottom. Select Network, then click All. If the dot in the upper-left corner is not red, click it (if it is already red, leave it alone). A red dot means the browser is recording the page's requests.
Then scroll to the bottom of the page and click Load More.
Now look at the captured requests, find the one triggered by Load More, copy its Request URL, and open that address in Firefox. Firefox is used because it decodes the response automatically; opened in Chrome, the text may appear garbled.
Once it is open, you can see that the joke text sits at the path data > data > 0 > group > text.
Besides 0 there are also 1, 2, 3, 4, 5 ... 19. I first guessed these were page numbers, but they appear to be the indices of the entries returned by one request. Code can therefore loop over them to collect all the entries.
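The path data > data > 0 > group > text maps directly onto nested dictionary and list lookups. Here is a small sketch of that traversal; the `sample` payload is invented for illustration and only mimics the nesting described above, and `extract_texts` is a hypothetical helper name.

```python
def extract_texts(payload):
    """Collect the text at data > data > i > group > text for every i."""
    entries = payload.get('data', {}).get('data', [])
    texts = []
    for entry in entries:
        text = entry.get('group', {}).get('text')
        if text is not None:  # skip entries that carry no text
            texts.append(text)
    return texts


# Hypothetical payload with the same nesting as the real response.
sample = {
    'data': {
        'max_time': 1517361000,
        'data': [
            {'group': {'text': 'first joke'}},
            {'group': {'text': 'second joke'}},
        ],
    }
}
print(extract_texts(sample))
```

Using `.get` with defaults means an entry without a `group` or `text` key is skipped instead of raising `KeyError`.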
You also need to get past the site's anti-crawling measures, which is actually simple: copy the contents of the Request Headers into the code. Remember to wrap each name and each value in quotes and to separate them with a colon, as the code section does.
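Instead of quoting every header by hand, the text copied from the developer tools can be converted into a dictionary. This is a sketch under the assumption that the copied text has one `Name: value` pair per line; `parse_raw_headers` is a hypothetical helper, and the three headers in `raw` are just a subset of the ones used in the code section.

```python
def parse_raw_headers(raw):
    """Turn 'Name: value' lines copied from the browser into a dict."""
    headers = {}
    for line in raw.splitlines():
        line = line.strip()
        # Skip blank lines, malformed lines, and HTTP/2 pseudo-headers
        # such as ":authority", which start with a colon.
        if not line or ':' not in line or line.startswith(':'):
            continue
        name, _, value = line.partition(':')  # split on the first colon only
        headers[name.strip()] = value.strip()
    return headers


raw = """
Host: neihanshequ.com
Referer: http://neihanshequ.com/
X-Requested-With: XMLHttpRequest
"""
header = parse_raw_headers(raw)
```

Splitting on the first colon only matters for values such as the Referer URL, which contain colons themselves.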
Code section
'''
Title = Crawl Neihan Duanzi and write it to a text file
coder = 秦王IPC
date = 2018-01-30
Because a while loop is used, the script must be stopped manually or it will keep writing.
'''
import requests  # network request module
import time  # time module, used below to set the access interval. Without an interval the server may treat you as a malicious attacker and take protective measures such as blocking your IP.
import sys
reload(sys)
sys.setdefaultencoding('utf-8')  # these three lines handle the encoding
header = {
    'Host': 'neihanshequ.com',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:58.0) Gecko/20100101 Firefox/58.0',
    'Accept': 'application/json, text/javascript, */*; q=0.01',
    'Accept-Language': 'zh-CN,en-US;q=0.7,en;q=0.3',
    'Accept-Encoding': 'gzip, deflate',
    'Referer': 'http://neihanshequ.com/',
    'X-CSRFToken': 'e9b62faa6a962cdf92f1531b498fc771',
    'X-Requested-With': 'XMLHttpRequest',
    'Cookie': 'csrftoken=e9b62faa6a962cdf92f1531b498fc771; tt_webid=6486401443292186126; uuid="w:c07f437659474cc1a7cfd052d9985b37"; Hm_lvt_3280fbe8d51d664ffc36594659b91638=1511848146,1512200937,1512371475,1514373568',
    'Connection': 'keep-alive',
    'Pragma': 'no-cache',
    'Cache-Control': 'no-cache',
}  # the page's Request Headers, sent so the script looks like a browser and gets past the anti-crawling measures
timestamp = 1517361103  # initial timestamp; the while loop below checks it
while type(timestamp) == int or type(timestamp) == float:  # run whichever numeric type the timestamp has
    url = 'http://neihanshequ.com/joke/?is_json=1&app_name=neihanshequ_web&max_time=' + str(timestamp)  # original link, found by capturing with F12 and clicking Load More
    html = requests.get(url, headers=header)  # send a GET request for the URL and store the response in html
    time.sleep(3)  # wait between requests
    for u in range(len(html.json()['data']['data'])):  # len(html.json()['data']['data']) is the number of entries on one page, so the loop runs once per entry
        with open(r'C:\duanzi.txt', 'a') as IPC:  # open the output file in append mode; this sets the write path
            nr = html.json()['data']['data'][u]['group']['text']  # html.json() returns the parsed JSON; the square brackets follow the path to the text
            IPC.write(nr + '\n')  # write the entry
    timestamp = html.json()['data']['max_time']  # after each pass, take the new timestamp for the next request
# -------------------------- optional ---------------------
# print(html.status_code)  # prints the status code
# print(html.json()['data']['max_time'])  # prints the timestamp
Running the script appends the joke text to C:\duanzi.txt, one entry per line.
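Because the loop above never ends on its own, one possible variation is to cap the number of requests. The sketch below is an assumption-laden illustration rather than the author's method: `crawl` and its parameters `fetch_page`, `max_pages` and `delay` are hypothetical names, and `fetch_page` stands for any function that takes a timestamp and returns the decoded JSON of one response.

```python
import time


def crawl(fetch_page, start_time, max_pages=5, delay=3):
    """Follow max_time from page to page, stopping after max_pages requests."""
    texts = []
    timestamp = start_time
    for _ in range(max_pages):
        data = fetch_page(timestamp)['data']
        if not data.get('data'):
            break  # an empty page means there is nothing more to read
        texts.extend(item['group']['text'] for item in data['data'])
        timestamp = data['max_time']  # cursor for the next request
        time.sleep(delay)  # pause between requests
    return texts
```

With the requests library, `fetch_page` could be something like `lambda t: requests.get(base_url + str(t), headers=header).json()`, where `base_url` is the link ending in `max_time=`. Passing the fetch function in as an argument also lets the loop be tried out on canned data without touching the network.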
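This is a beginner's first blog post. If anything is wrong or inaccurate, please point it out in the comments (and keep it civil). Thank you!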