After the recent crackdown on online video content, the Neihan Duanzi app was forced to shut down and its many fans were left homeless. Recently, though, I came across an app called "Segment Friends" that is updating quickly and is inviting those fans to make it their new home. If you are interested, download it and have a look. (PS: this is not an advertisement, and I have not received any advertising fee.)
Around the same time, a former colleague sent me a Baidu Tieba forum where the fans have regrouped. Without further ado, here is the link:
Chiyo Home https://tieba.baidu.com/f?ie= ...
Looking at the forum, there are indeed plenty of fans posting there, so I wanted to crawl their pictures and short videos, which is the topic of this article.
Crawling website data with Python is about as basic as it gets and is not difficult, but I still want to share it so that we can learn from each other.
The main modules used to crawl the data are bs4, requests, and os, all of which are commonly used. (os is part of the standard library; bs4 and requests are third-party packages.)
The idea is to request the page's HTML with the requests module, parse the response with BeautifulSoup from the bs4 module, and then use CSS selectors to find the addresses of the pictures and short videos. The main implementation is as follows:
    import bs4
    import requests


    def download_file(web_url):
        """Find the resource URLs on a page and download them."""
        # Download the page
        print('Downloading page: %s ...' % web_url)
        result = requests.get(web_url)
        soup = bs4.BeautifulSoup(result.text, 'html.parser')

        # Find picture resources
        img_list = soup.select('.vpic_wrap img')
        if not img_list:
            print('No picture resources found!')
        else:
            # Resources found, start writing
            for img_info in img_list:
                file_url = img_info.get('bpic')
                if file_url:  # skip tags that lack the attribute
                    write_file(file_url, 1)

        # Find video resources
        video_list = soup.select('.threadlist_video a')
        if not video_list:
            print('No video resources found!')
        else:
            # Resources found, start writing
            for video_info in video_list:
                file_url = video_info.get('data-video')
                if file_url:  # skip tags that lack the attribute
                    write_file(file_url, 2)

        print('Finished downloading resources:', web_url)

        # Follow the "next page" link, if there is one
        next_link = soup.select('#frs_list_pager .next')
        if not next_link:
            print('All pages downloaded!')
        else:
            url = next_link[0].get('href')
            download_file('https:' + url)
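The last step of download_file builds the next page address by prepending 'https:' to the href, which only works as long as the pager emits protocol-relative links (those starting with //). A minimal sketch of a more tolerant alternative, using urljoin from the standard library; the helper name resolve_next_url and the sample href values are hypothetical and not taken from the site:

```python
from urllib.parse import urljoin


def resolve_next_url(page_url, href):
    """Resolve a pager href against the page it was found on.

    Hypothetical helper: handles protocol-relative ('//host/path'),
    root-relative ('/path') and absolute hrefs alike.
    """
    return urljoin(page_url, href)


# Sample values are made up for illustration.
page = 'https://tieba.baidu.com/f?kw=example&pn=0'
print(resolve_next_url(page, '//tieba.baidu.com/f?kw=example&pn=50'))
print(resolve_next_url(page, '/f?kw=example&pn=50'))
```

Both calls resolve to the same absolute https address, so the crawler would not depend on which form of link the page happens to use.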
Getting the addresses of the pictures and videos is not enough on its own; the resources also have to be written to local disk. Each remote file is read in binary form and written into a local folder chosen by its type. The main code is as follows:
    import os

    import requests


    def write_file(file_url, file_type):
        """Download a resource and write it to a local file."""
        res = requests.get(file_url)
        res.raise_for_status()

        # Choose the target folder by file type
        if file_type == 1:
            file_folder = os.path.join('nhdz', 'jpg')
        elif file_type == 2:
            file_folder = os.path.join('nhdz', 'mp4')
        else:
            file_folder = os.path.join('nhdz', 'other')

        # Create the folder if it does not exist yet
        os.makedirs(file_folder, exist_ok=True)

        # Build the file name, dropping any query string after '?'
        file_name = os.path.basename(file_url)
        str_index = file_name.find('?')
        if str_index > 0:
            file_name = file_name[:str_index]
        file_path = os.path.join(file_folder, file_name)

        # Write the content to disk in binary chunks
        print('Writing resource file:', file_path)
        with open(file_path, 'wb') as image_file:
            for chunk in res.iter_content(100000):
                image_file.write(chunk)
        print('Write complete!')
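The file name logic above takes the last path segment of the URL and cuts everything from the first '?'. As a sketch of the same idea with the standard library's URL parser, the hypothetical helper local_path_for below separates the query string before taking the base name. The folder mapping mirrors write_file, and the sample URLs are invented:

```python
import os
from urllib.parse import urlsplit

# Same type-to-folder mapping as write_file (1 = picture, 2 = video).
FOLDERS = {1: 'jpg', 2: 'mp4'}


def local_path_for(file_url, file_type, root='nhdz'):
    """Return the local path a downloaded resource would be written to."""
    # urlsplit keeps the query string out of .path, so no manual '?' search.
    file_name = os.path.basename(urlsplit(file_url).path)
    folder = os.path.join(root, FOLDERS.get(file_type, 'other'))
    return os.path.join(folder, file_name)


print(local_path_for('https://example.com/pic/abc.jpg?size=big', 1))
print(local_path_for('https://example.com/v/clip.mp4', 2))
```

Keeping the path computation separate from the download also makes it easy to check where files will land before any request is sent.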
Finally, here is the complete code. Otherwise someone will say that I only told half the story and held back on the goodies, which would be poor form. Enough talk, here it is:
    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    """
    Crawl pictures and videos from the Chiyo Home forum on Baidu Tieba
    author: cuizy
    time: 2018-05-19
    """

    import os

    import bs4
    import requests


    def write_file(file_url, file_type):
        """Download a resource and write it to a local file."""
        res = requests.get(file_url)
        res.raise_for_status()

        # Choose the target folder by file type
        if file_type == 1:
            file_folder = os.path.join('nhdz', 'jpg')
        elif file_type == 2:
            file_folder = os.path.join('nhdz', 'mp4')
        else:
            file_folder = os.path.join('nhdz', 'other')

        # Create the folder if it does not exist yet
        os.makedirs(file_folder, exist_ok=True)

        # Build the file name, dropping any query string after '?'
        file_name = os.path.basename(file_url)
        str_index = file_name.find('?')
        if str_index > 0:
            file_name = file_name[:str_index]
        file_path = os.path.join(file_folder, file_name)

        # Write the content to disk in binary chunks
        print('Writing resource file:', file_path)
        with open(file_path, 'wb') as image_file:
            for chunk in res.iter_content(100000):
                image_file.write(chunk)
        print('Write complete!')


    def download_file(web_url):
        """Find the resource URLs on a page and download them."""
        # Download the page
        print('Downloading page: %s ...' % web_url)
        result = requests.get(web_url)
        soup = bs4.BeautifulSoup(result.text, 'html.parser')

        # Find picture resources
        img_list = soup.select('.vpic_wrap img')
        if not img_list:
            print('No picture resources found!')
        else:
            # Resources found, start writing
            for img_info in img_list:
                file_url = img_info.get('bpic')
                if file_url:  # skip tags that lack the attribute
                    write_file(file_url, 1)

        # Find video resources
        video_list = soup.select('.threadlist_video a')
        if not video_list:
            print('No video resources found!')
        else:
            # Resources found, start writing
            for video_info in video_list:
                file_url = video_info.get('data-video')
                if file_url:  # skip tags that lack the attribute
                    write_file(file_url, 2)

        print('Finished downloading resources:', web_url)

        # Follow the "next page" link, if there is one
        next_link = soup.select('#frs_list_pager .next')
        if not next_link:
            print('All pages downloaded!')
        else:
            url = next_link[0].get('href')
            download_file('https:' + url)


    # Main program entry
    if __name__ == '__main__':
        # kw is the name of the forum to crawl
        web_url = 'https://tieba.baidu.com/f?ie=utf-8&kw=Chiyo home'
        download_file(web_url)
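download_file calls itself once per page, so a forum with more pages than Python's recursion limit (1000 frames by default) would end in a RecursionError. A minimal sketch of the same walk written as a loop; crawl_pages and the injected get_next callback are hypothetical names, and the fake three-page site only stands in for real responses:

```python
def crawl_pages(start_url, get_next, max_pages=None):
    """Visit pages one by one, following get_next(url) until it returns None.

    get_next is a hypothetical callback: it should process one page
    (download its resources) and return the next page URL, or None.
    """
    visited = []
    url = start_url
    while url is not None:
        # Optional safety cap so a test run does not walk the whole forum.
        if max_pages is not None and len(visited) >= max_pages:
            break
        visited.append(url)
        url = get_next(url)
    return visited


# Stand-in for the real site: each page maps to its successor.
fake_site = {'page1': 'page2', 'page2': 'page3', 'page3': None}
print(crawl_pages('page1', fake_site.get))
print(crawl_pages('page1', fake_site.get, max_pages=2))
```

In the real script, get_next would do what the body of download_file does for a single page and return the resolved "next page" address instead of recursing.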
With Neihan Duanzi gone: using Python to crawl pictures and short videos from the Chiyo Home Tieba forum (source code included)