Reference: http://v.qq.com/boke/page/q/g/t/q01713cvdgt.html
Purpose: crawl pictures from a website
The video linked above explains the whole process very clearly; anyone with a little computing background should be able to implement it.
So I won't say much and will simply paste the script I wrote. If you run into problems, watch the video.
##################################################################################
import os, requests, urllib.request
from bs4 import BeautifulSoup

# These two values (User-Agent and Cookie) can be read from the browser's
# "Developer Tools"; as in the video, I do not expose mine here.
header = {'User-Agent': 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx',
          'Cookie': 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'}
folter_path = 'e://temp/xxxxxx/'

def make_file(path):  # Creates the folder and returns the picture storage path
    if not os.path.isdir(folter_path):
        os.mkdir(folter_path)
    t = os.path.join(folter_path, str(path) + '/')
    if not os.path.isdir(t):
        os.mkdir(t)
    return t

def down_pic(start_num, end_num, type):  # Crawls the pictures; parameters: start page, end page, download type
    for num in range(int(start_num), int(end_num)):  # the end page itself is not fetched
        url = 'http://xxxxxx.net/ooxx/page-{}'.format(num)  # See the video for the actual URL, or pick a site yourself; this one is a placeholder
        source_code = requests.get(url, headers=header)
        plain_txt = source_code.text
        soup = BeautifulSoup(plain_txt, 'html.parser')  # name the parser explicitly to avoid the bs4 warning
        download_link = []
        print('Get ' + str(num))
        for pic_tag in soup.find_all('img'):
            pic_link = pic_tag.get(str(type))
            download_link.append(pic_link)
        while None in download_link:  # Of little real use: the type is already distinguished, so no garbage data is produced, but I was too lazy to remove it
            download_link.remove(None)
        for item in download_link:  # Download the pictures
            urllib.request.urlretrieve(item, pic_path + item[-10:])  # pic_path is the global set below

start_num = 1760
end_time = 1767
type = {'jpg': 'src', 'gif': 'org_src'}  # Type dictionary
pic_path = make_file(type['gif'])
down_pic(start_num, end_time, type['gif'])
##################################################################################
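The folder handling in make_file and the item[-10:] file naming could also be written a little more defensively. What follows is only a sketch and not something shown in the video: `ensure_dir` and `name_from_url` are hypothetical helper names, the example URL is made up, and a temporary folder stands in for the real storage path.

```python
import os
import tempfile
from urllib.parse import urlparse


def ensure_dir(base, sub):
    """Create base/sub (and base itself) if missing, then return the path."""
    target = os.path.join(base, str(sub))
    os.makedirs(target, exist_ok=True)  # no error if the folder already exists
    return target


def name_from_url(url):
    """Use the last path component of the URL as the file name."""
    return os.path.basename(urlparse(url).path)


base = tempfile.mkdtemp()  # throwaway folder instead of the real drive path
pic_dir = ensure_dir(base, 'org_src')
print(os.path.isdir(pic_dir))  # True
print(name_from_url('http://example.com/img/abc123.gif?x=1'))  # abc123.gif
```

Taking the name from the URL path avoids cutting a file name in the middle, which a fixed ten-character slice can do when names are longer or shorter than expected.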
The code goes beyond the video tutorial in the following respects:
1. Added a function that creates the image storage path, and made a distinction between download types.
2. Pictures are downloaded by type; if you select them by src, the GIFs are not downloaded in full. Try it and see for yourself.
3. Er, thanks to the video author. Since the author published the video publicly, posting this link should be fine.
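The difference between the two attributes in point 2 can be tried out offline. This is a minimal sketch under stated assumptions: the page fragment is made up, it assumes that src points at a preview while org_src holds the original GIF, and it uses the standard-library `html.parser` as a stand-in for BeautifulSoup so that it runs without extra packages. `ImgAttrCollector` and `collect_links` are hypothetical names.

```python
from html.parser import HTMLParser


class ImgAttrCollector(HTMLParser):
    """Collect the value of one attribute from every <img> tag."""

    def __init__(self, attr):
        super().__init__()
        self.attr = attr
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'img':
            value = dict(attrs).get(self.attr)
            if value is not None:  # skip <img> tags that lack the attribute
                self.links.append(value)


def collect_links(html, attr):
    parser = ImgAttrCollector(attr)
    parser.feed(html)
    return parser.links


# Made-up fragment: the GIF has a preview in src and the full file in org_src.
sample = (
    '<img src="http://example.com/thumb/a.gif" org_src="http://example.com/full/a.gif">'
    '<img src="http://example.com/full/b.jpg">'
)

print(collect_links(sample, 'src'))      # both tags have src
print(collect_links(sample, 'org_src'))  # only the GIF tag has org_src
```

Because tags without the requested attribute are skipped while collecting, no separate pass to remove None entries is needed in this version.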
Python 3.5 study notes: a simple picture crawler