Python code example: batch-downloading wallpapers with a crawler
Project address: https://github.com/jrainlau/wallpaper-downloader
Preface
I haven't written an article in a long time, because I've recently been settling into a new job and learning Python in my spare time. This article is a summary of my latest stage of Python learning: a crawler that batch-downloads high-definition wallpapers from a website.
Note: this project is for Python learning only and must not be used for any other purpose!
Initializing the project
The project uses virtualenv to create a virtual environment and avoid polluting the global one. Install it directly with pip3:

pip3 install virtualenv

Create a wallpaper-downloader directory, and inside it use virtualenv to create a virtual environment named venv, then activate it:

virtualenv venv
. venv/bin/activate
Next, create the dependency file (note that pip expects one package per line, so a plain echo with spaces would not work):

printf 'bs4\nlxml\nrequests\n' > requirements.txt

Finally, download and install the dependencies:

pip3 install -r requirements.txt
Analyzing the crawler steps
For simplicity, start from the wallpaper list page for the "aero" category: http://wallpaperswide.com/aer...

We can see that there are 10 wallpapers available for download on this page. However, these are only thumbnails, whose resolution is nowhere near enough for an actual wallpaper, so we need to go into each wallpaper's detail page and find the HD download link there. Clicking the first wallpaper opens a new page:

Because my machine has a Retina screen, I plan to download the largest size directly to guarantee high definition (the size circled in red).

Once the concrete steps are clear, you can use the browser's developer tools to find the corresponding DOM nodes and extract the URLs you need. I won't walk through that process here; feel free to try it yourself. Now, on to the coding part.
Accessing a page
Create a download.py file and import the two libraries:

from bs4 import BeautifulSoup
import requests
Next, write a function dedicated to visiting a URL and returning the page's HTML:
def visit_page(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.108 Safari/537.36'
    }
    r = requests.get(url, headers=headers)
    r.encoding = 'utf-8'
    soup = BeautifulSoup(r.text, 'lxml')
    return soup
To avoid triggering the site's anti-crawling mechanism, we add a User-Agent header so the crawler looks like a normal browser, specify UTF-8 encoding, and finally return the parsed page as a BeautifulSoup object.
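As a quick sanity check, a small sketch (the URL is just an example, assuming the site is reachable):

soup = visit_page('http://wallpaperswide.com')
print(soup.title.get_text())  # prints the page's <title> text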
Extracting links
After obtaining the page's HTML, we need to extract the URL of each wallpaper in the page's list:
def get_paper_link(page):
    links = page.select('#content > div > ul > li > div > div a')
    collect = []
    for link in links:
        collect.append(link.get('href'))
    return collect
This function extracts the detail-page URL of every wallpaper on the list page.
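Composing the two functions gives the skeleton of the crawler; a small sketch (the list-page URL follows the pattern used later in the article):

page = visit_page('http://wallpaperswide.com/aero-desktop-wallpapers/page/1')
links = get_paper_link(page)
print(len(links))  # a full list page shows 10 wallpapers, so expect 10 links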
Downloading a wallpaper
With the detail page's address in hand, we can choose an appropriate size. Analyzing the page's DOM structure shows that every available size corresponds to a link:

So the first step is to extract the links for these sizes:
wallpaper_source = visit_page(link)
wallpaper_size_links = wallpaper_source.select('#wallpaper-resolutions > a')
size_list = []
for link in wallpaper_size_links:
    href = link.get('href')
    size_list.append({
        'size': eval(link.get_text().replace('x', '*')),
        'name': href.replace('/download/', ''),
        'url': href
    })
size_list is the collection of these links. To make it easy to select the highest-definition (largest-area) wallpaper in the next step, the size field is computed with eval: for example, the text "5120x3200" is rewritten as the expression 5120*3200, and its evaluated result is stored as size.
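As an aside, eval() executes whatever expression it is given, so running it on scraped text is risky if the page ever changes. A safer equivalent, as a sketch (assuming the link text has the form "5120x3200"), parses the two numbers explicitly:

width, height = link.get_text().strip().split('x')
size = int(width) * int(height)  # e.g. '5120x3200' -> 16384000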
After collecting the whole set, we can pick the highest-definition one with max():
biggest_one = max(size_list, key=lambda item: item['size'])
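As a tiny standalone illustration of how the key function drives the comparison:

candidates = [{'size': 1920 * 1080}, {'size': 5120 * 3200}]
print(max(candidates, key=lambda item: item['size']))  # {'size': 16384000}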
The url field of this biggest_one is the download link for the chosen size. Next, download the linked resource with the requests library:
result = requests.get(PAGE_DOMAIN + biggest_one['url'])
if result.status_code == 200:
    open('wallpapers/' + biggest_one['name'], 'wb').write(result.content)
Note: you need to create the wallpapers directory first, otherwise an error will be raised at runtime.
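Alternatively, the script can create the directory itself at startup; a minimal sketch using only the standard library:

import os

os.makedirs('wallpapers', exist_ok=True)  # no error if the directory already exists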
The complete download_wallpaper function looks like this:
def download_wallpaper(link):
    wallpaper_source = visit_page(PAGE_DOMAIN + link)
    wallpaper_size_links = wallpaper_source.select('#wallpaper-resolutions > a')
    size_list = []
    for link in wallpaper_size_links:
        href = link.get('href')
        size_list.append({
            'size': eval(link.get_text().replace('x', '*')),
            'name': href.replace('/download/', ''),
            'url': href
        })
    biggest_one = max(size_list, key=lambda item: item['size'])
    result = requests.get(PAGE_DOMAIN + biggest_one['url'])
    if result.status_code == 200:
        open('wallpapers/' + biggest_one['name'], 'wb').write(result.content)
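One side note on the last two lines: open(...).write(...) never explicitly closes the file handle. A slightly more idiomatic variant uses a context manager:

result = requests.get(PAGE_DOMAIN + biggest_one['url'])
if result.status_code == 200:
    # the with-block guarantees the file is closed even if write() fails
    with open('wallpapers/' + biggest_one['name'], 'wb') as f:
        f.write(result.content)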
Running in batches
The steps above can only download the first wallpaper of the first list page. If we want to download all the wallpapers across multiple list pages, we need to call these methods in a loop. First, define a few constants:
import sys

if len(sys.argv) != 4:
    print('3 arguments were required but only ' + str(len(sys.argv) - 1) + ' were found!')
    exit()

category = sys.argv[1]

try:
    page_start = [int(sys.argv[2])]
    page_end = int(sys.argv[3])
except:
    print('The second and third arguments must be numbers!')
    exit()
Here three constants, category, page_start, and page_end, are read from the command-line arguments; they correspond to the wallpaper category, the start page, and the end page respectively. page_start is wrapped in a single-element list so that the start() function defined below can increment it without declaring a global.
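For example:

python3 download.py aero 1 3

sets category to 'aero' and downloads list pages 1 through 3.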
For convenience, also define two URL-related constants:
PAGE_DOMAIN = 'http://wallpaperswide.com'
PAGE_URL = 'http://wallpaperswide.com/' + category + '-desktop-wallpapers/page/'
Now we can batch away happily. Before that, define the start() entry function:
def start():
    if page_start[0] <= page_end:
        print('Preparing to download the ' + str(page_start[0]) + ' page of all the "' + category + '" wallpapers...')
        PAGE_SOURCE = visit_page(PAGE_URL + str(page_start[0]))
        WALLPAPER_LINKS = get_paper_link(PAGE_SOURCE)
        page_start[0] = page_start[0] + 1
        for index, link in enumerate(WALLPAPER_LINKS):
            download_wallpaper(link, index, len(WALLPAPER_LINKS), start)
Then rewrite download_wallpaper to accept the extra parameters:
def download_wallpaper(link, index, total, callback):
    wallpaper_source = visit_page(PAGE_DOMAIN + link)
    wallpaper_size_links = wallpaper_source.select('#wallpaper-resolutions > a')
    size_list = []
    for link in wallpaper_size_links:
        href = link.get('href')
        size_list.append({
            'size': eval(link.get_text().replace('x', '*')),
            'name': href.replace('/download/', ''),
            'url': href
        })
    biggest_one = max(size_list, key=lambda item: item['size'])
    print('Downloading the ' + str(index + 1) + '/' + str(total) + ' wallpaper: ' + biggest_one['name'])
    result = requests.get(PAGE_DOMAIN + biggest_one['url'])
    if result.status_code == 200:
        open('wallpapers/' + biggest_one['name'], 'wb').write(result.content)
    if index + 1 == total:
        print('Download completed!\n\n')
        callback()
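With this, start() and download_wallpaper call each other: after the last wallpaper of a page is saved, the callback kicks off the next page. If you prefer to avoid the mutual recursion (and the single-element-list trick for page_start), an equivalent iterative sketch under the same globals could look like this (start_iterative is a hypothetical name):

def start_iterative():
    for page in range(page_start[0], page_end + 1):
        print('Preparing to download the ' + str(page) + ' page of all the "' + category + '" wallpapers...')
        links = get_paper_link(visit_page(PAGE_URL + str(page)))
        for index, link in enumerate(links):
            # a no-op callback keeps download_wallpaper's signature unchanged
            download_wallpaper(link, index, len(links), lambda: None)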
Finally, specify the entry point:
if __name__ == '__main__':
    start()
Running the project
Enter the following command on the command line to start a test run:
python3 download.py aero 1 2
The following output is displayed:
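Roughly, based on the script's print statements (wallpaper file names elided):

Preparing to download the 1 page of all the "aero" wallpapers...
Downloading the 1/10 wallpaper: ...
Downloading the 2/10 wallpaper: ...
...
Download completed!

Preparing to download the 2 page of all the "aero" wallpapers...
...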
Capture the traffic with Charles and you can see the script running along smoothly:
With that, the download script is complete, and you no longer need to worry about a wallpaper shortage!
That's all for this article. If you have any questions, feel free to discuss them in the comments section below.