Implementation example of a Python multithreaded downloader

Source: Internet
Author: User
Tags: http, request, ranges, save file, split

Today I saw someone asking about writing files from multiple threads in Python, which reminded me of an exercise from the Reboot architect class. Thinking it over, there are quite a few pitfalls and points worth examining, enough to make it a decent interview question. Here are my thoughts in brief. The code and comments are on GitHub, and stars are very welcome.

This article assumes some Python background. I hope you already know a little about the following:

File handling in Python: open and write
The basics of HTTP header information
The os and sys modules
Multithreading with the threading module
Sending requests with the requests module
Since the topic is multithreaded downloading, the first thing to solve is the download itself. To keep testing easy, we won't use something as large as the QQ installer. Instead we use the handsome avatar image of PC as the example, which looks roughly like this (http://51reboot.com/src/blogimg/pc.jpg)

Download

Python's requests module wraps HTTP requests, so we use it to send an HTTP GET request and then write the response to a local file. (For more on requests, HTTP, and file handling in Python, search Baidu or stay tuned; I will write about them later.) With the idea clear, the code follows:

# Simple, crude download
import requests

res = requests.get('http://51reboot.com/src/blogimg/pc.jpg')
# res.content is bytes, so the file is opened in binary mode
with open('pc.jpg', 'wb') as f:
    f.write(res.content)
After running the code above, a pc.jpg appears in the current folder. That is the picture we wanted.

The code above does too little. Note that our requirement is a multithreaded download, and this crude download does not meet it at all. To picture multithreading, imagine there are many bags of Oreo cookies in a warehouse; the boss asks me to bring them to the company and stack them neatly, keeping them in their original order.

With the code above, I am in effect one person going to the warehouse and carrying all the Oreos back in a single trip. The rough process is as follows (see figure).

To meet the multithreading requirement, the task first has to be split into several subtasks. The subtasks can run in parallel, and their results can be combined into the final result.

Splitting the task

To do this, we first need to know how large the data is, and then fetch it in pieces. A basic understanding of the HTTP protocol is all we need:

Request the data with the HEAD method, which returns only the HTTP headers and no body
Read the Content-Length value from the headers to learn the size of the resource, for example 50 bytes
Suppose we want four threads, each fetching about 1/4 of the data
50/4 = 12 (rounded down), so the first three threads fetch 12 bytes each and the last one fetches whatever remains
Each thread fetches its part, seeks to the matching position in the file, and writes it there
file.seek
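
The splitting arithmetic in the steps above can be sketched as a small function. This is only an illustration: the name split_ranges is hypothetical, and it uses inclusive, non-overlapping ends with an explicit last byte, which is one reasonable convention rather than the exact one used in the article's code.

```python
def split_ranges(total, num):
    """Split `total` bytes into `num` inclusive (start, end) byte ranges."""
    if total <= 0 or num <= 0:
        raise ValueError("total and num must be positive")
    num = min(num, total)          # never create empty ranges
    offset = total // num          # e.g. 50 // 4 == 12
    ranges = []
    for i in range(num):
        start = i * offset
        # The last range absorbs the remainder; every other range ends
        # one byte before the next one starts.
        end = total - 1 if i == num - 1 else start + offset - 1
        ranges.append((start, end))
    return ranges


if __name__ == "__main__":
    # The 50-byte, four-thread example from the text.
    print(split_ranges(50, 4))  # [(0, 11), (12, 23), (24, 35), (36, 49)]
```

The first three ranges hold 12 bytes each and the last one holds the remaining 14, so the pieces add up to the full 50 bytes.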
To make this easier to follow, here is the flow with a single thread first (see figure).


With the idea clear, the code is next. First, let's test the Range header.

The Range field is used in the request headers to specify the position of the first byte and the position of the last byte, for example 1-12. If the second number is omitted, the range runs to the end of the data, for example 36-.
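
As a small illustration of the syntax just described, the sketch below builds the header value for a byte range. The helper name range_header is hypothetical. Both positions are inclusive, so bytes=0-11 asks for the first 12 bytes.

```python
def range_header(start, end=None):
    """Build the value of an HTTP Range request header for one byte range."""
    if start < 0 or (end is not None and end < start):
        raise ValueError("invalid byte range")
    # Both positions are inclusive; leaving out `end` means "to the end".
    return "bytes=%d-%s" % (start, "" if end is None else end)


if __name__ == "__main__":
    print(range_header(0, 11))  # bytes=0-11
    print(range_header(36))     # bytes=36-
```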

# Range test code
import requests
# HTTP headers: request bytes 0 through 15000, roughly the first 15,000 bytes
headers = {'Range': 'bytes=0-15000', 'accept-encoding': '*'}
res = requests.get('http://51reboot.com/src/blogimg/pc.jpg', headers=headers)

# res.content is bytes, so the file is opened in binary mode
with open('pc.jpg', 'wb') as f:
    f.write(res.content)
We got roughly the first 15,000 bytes of the avatar, as in the figure below. Judging by eye, Range works.
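
Checking by eye works for an image, but the response itself also says whether the range was honored: a server that supports ranges answers with status 206 and a Content-Range header, while a plain 200 means it ignored the Range field and sent everything. The sketch below is one way to check this; the function names are hypothetical, and with requests the two inputs would come from res.status_code and res.headers.get('Content-Range').

```python
import re

# Matches values such as "bytes 0-15000/21520" or "bytes 0-15000/*".
_CONTENT_RANGE = re.compile(r"^bytes (\d+)-(\d+)/(\d+|\*)$")


def parse_content_range(value):
    """Parse 'bytes START-END/TOTAL' into (start, end, total or None)."""
    match = _CONTENT_RANGE.match(value.strip())
    if match is None:
        raise ValueError("unexpected Content-Range: %r" % value)
    start, end, total = match.groups()
    return int(start), int(end), None if total == "*" else int(total)


def is_expected_chunk(status_code, content_range, start, end):
    """Return True if a response carries exactly the range we asked for."""
    if status_code != 206:   # 200 means the server ignored Range
        return False
    got_start, got_end, _ = parse_content_range(content_range)
    return (got_start, got_end) == (start, end)


if __name__ == "__main__":
    print(parse_content_range("bytes 0-15000/21520"))  # (0, 15000, 21520)
```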

Let's keep fleshing out the code

First get the length of the data with the requests.head method
Once the number of threads is decided, give each thread the range of data to fetch, that is, the value of the Range field
Seek, then write the file
The functionality is getting more complex, so we organize the code in an object-oriented way
Write a single-threaded version first, then optimize step by step
Here is the code:
import requests

# The downloader class
class Downloader:
    # Constructor
    def __init__(self):
        # URL of the data to download
        self.url = 'http://51reboot.com/src/blogimg/pc.jpg'
        # Number of threads to open
        self.num = 8
        # File name to save as: the last segment of the URL
        self.name = self.url.split('/')[-1]
        # Request the URL with the HEAD method
        r = requests.head(self.url)
        # Length of the data, taken from the headers
        self.total = int(r.headers['content-length'])
        print('total is %s' % self.total)

    def get_range(self):
        ranges = []
        # For example, if total is 50 and the number of threads is 4, offset is 12
        offset = self.total // self.num
        for i in range(self.num):
            if i == self.num - 1:
                # Last thread: no end position, read through to the end
                ranges.append((i * offset, ''))
            else:
                # Interval fetched by each thread. Range ends are inclusive, so
                # neighbouring pieces overlap by one byte; the overlapping byte
                # is simply written twice with the same value.
                ranges.append((i * offset, (i + 1) * offset))
        # ranges is roughly [(0, 12), (12, 24), (24, 36), (36, '')]
        return ranges

    def run(self):
        # Binary mode, because r.content is bytes
        f = open(self.name, 'wb')
        for ran in self.get_range():
            # Build the Range header to fetch one piece of the data
            r = requests.get(self.url, headers={'Range': 'bytes=%s-%s' % ran, 'accept-encoding': '*'})
            # Seek to the matching position
            f.seek(ran[0])
            # Write the data
            f.write(r.content)
        f.close()

if __name__ == '__main__':
    down = Downloader()
    down.run()
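
The seek-then-write step in run can be tried without any network access. The sketch below is only an illustration: write_chunks is a hypothetical helper, and the 50 bytes of stand-in data take the place of the HTTP responses. It writes the pieces deliberately out of order and checks that the file still comes out in the original order.

```python
import os
import tempfile


def write_chunks(path, chunks):
    """Write (start, data) pairs into `path`, each at its own offset."""
    with open(path, "wb") as f:
        for start, data in chunks:
            f.seek(start)   # jump to where this chunk belongs
            f.write(data)   # regions not yet written are filled in later


if __name__ == "__main__":
    original = b"0123456789" * 5   # 50 bytes of stand-in data
    # Four pieces, deliberately listed out of order.
    pieces = [(36, original[36:]), (0, original[:12]),
              (24, original[24:36]), (12, original[12:24])]
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        write_chunks(path, pieces)
        with open(path, "rb") as f:
            print(f.read() == original)  # True
    finally:
        os.remove(path)
```

Because every piece lands at its own offset, the order in which the pieces arrive does not matter, which is what makes the multithreaded version possible.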
Multithreading

What multithreading and multiprocessing are is not covered here; explaining them properly would take a whole article. It is enough to know that the threading module exists to handle multithreading. Its rough usage is as follows; for more detail, search Baidu or see follow-up articles.

threading.Thread: create a thread and set its handler function
start: start the thread
setDaemon: mark the thread as a daemon thread
join: wait for the thread to finish
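
Before the full downloader, here is a minimal sketch of how these calls fit together. The helper run_in_threads is hypothetical and does no downloading; it just runs a function once per argument tuple, each in its own thread, and waits for all of them. It sets the daemon attribute directly, which is the current spelling of setDaemon(True).

```python
import threading


def run_in_threads(func, arg_list):
    """Run func(*args) in one thread per entry of arg_list and wait for all."""
    results = [None] * len(arg_list)

    def worker(index, args):
        results[index] = func(*args)   # each thread writes only its own slot

    threads = []
    for index, args in enumerate(arg_list):
        t = threading.Thread(target=worker, args=(index, args))
        t.daemon = True    # same effect as t.setDaemon(True)
        t.start()
        threads.append(t)
    for t in threads:
        t.join()           # block until this thread has finished
    return results


if __name__ == "__main__":
    sizes = run_in_threads(lambda start, end: end - start,
                           [(0, 12), (12, 24), (24, 36)])
    print(sizes)  # [12, 12, 12]
```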
The code is as follows:
import requests
import threading

class Downloader:
    def __init__(self):
        self.url = 'http://51reboot.com/src/blogimg/pc.jpg'
        self.num = 8
        self.name = self.url.split('/')[-1]
        r = requests.head(self.url)
        self.total = int(r.headers['content-length'])
        print('total is %s' % self.total)

    def get_range(self):
        ranges = []
        offset = self.total // self.num
        for i in range(self.num):
            if i == self.num - 1:
                ranges.append((i * offset, ''))
            else:
                ranges.append((i * offset, (i + 1) * offset))
        return ranges

    def download(self, start, end):
        headers = {'Range': 'bytes=%s-%s' % (start, end), 'accept-encoding': '*'}
        res = requests.get(self.url, headers=headers)
        print('%s:%s download success' % (start, end))
        # All threads share self.fd; see the optimization points at the end
        self.fd.seek(start)
        self.fd.write(res.content)

    def run(self):
        # Binary mode, because res.content is bytes
        self.fd = open(self.name, 'wb')
        thread_list = []
        n = 0
        for ran in self.get_range():
            start, end = ran
            print('thread %d start:%s,end:%s' % (n, start, end))
            n += 1
            thread = threading.Thread(target=self.download, args=(start, end))
            thread.start()
            thread_list.append(thread)
        for i in thread_list:
            i.join()
        print('download %s load success' % self.name)
        self.fd.close()

if __name__ == '__main__':
    down = Downloader()
    down.run()
Running the Python downloader produces the following output:

total is 21520
thread 0 start:0,end:2690
thread 1 start:2690,end:5380
thread 2 start:5380,end:8070
thread 3 start:8070,end:10760
thread 4 start:10760,end:13450
thread 5 start:13450,end:16140
thread 6 start:16140,end:18830
thread 7 start:18830,end:
0:2690 is end
2690:5380 is end
13450:16140 is end
10760:13450 is end
5380:8070 is end
8070:10760 is end
18830: is end
16140:18830 is end
download pc.jpg load success
The run function has been changed to use multiple threads, and a download function has been added to fetch each block of data. They are explained in detail below:

def download(self, start, end):
    # Build the Range field; the accept-encoding field accepts all encodings
    headers = {'Range': 'bytes=%s-%s' % (start, end), 'accept-encoding': '*'}
    res = requests.get(self.url, headers=headers)
    print('%s:%s download success' % (start, end))
    # Seek to the start position
    self.fd.seek(start)
    self.fd.write(res.content)

def run(self):
    # Open the output file (binary mode) and keep the file object
    self.fd = open(self.name, 'wb')
    thread_list = []
    # A counter used to label each thread when printing
    n = 0
    for ran in self.get_range():
        start, end = ran
        # Print progress information
        print('thread %d start:%s,end:%s' % (n, start, end))
        n += 1
        # Create the thread and pass its arguments; the handler function is download
        thread = threading.Thread(target=self.download, args=(start, end))
        # Start it
        thread.start()
        thread_list.append(thread)
    for i in thread_list:
        # Wait for each thread to finish
        i.join()
    print('download %s load success' % self.name)
    # Close the file
    self.fd.close()
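
One thing to watch in the download method above: all threads share self.fd, and seek followed by write is two separate calls, so another thread can move the file position in between. A minimal sketch of one way around this, assuming each worker opens its own handle on a file that was created at its final size beforehand, is shown below. The names write_piece and assemble are hypothetical, and in-memory stand-in data replaces the HTTP responses so the sketch runs offline.

```python
import os
import tempfile
import threading


def write_piece(path, start, data):
    """Write one piece through a file handle private to the calling thread."""
    with open(path, "r+b") as f:   # r+b: update in place, do not truncate
        f.seek(start)
        f.write(data)


def assemble(path, pieces, total):
    """Write all (start, data) pieces concurrently into a pre-sized file."""
    with open(path, "wb") as f:
        f.truncate(total)          # create the file at its final size
    threads = [threading.Thread(target=write_piece, args=(path, s, d))
               for s, d in pieces]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


if __name__ == "__main__":
    data = bytes(bytearray(range(256))) * 4      # 1024 bytes of stand-in data
    pieces = [(i, data[i:i + 128]) for i in range(0, len(data), 128)]
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        assemble(path, pieces, len(data))
        with open(path, "rb") as f:
            print(f.read() == data)  # True
    finally:
        os.remove(path)
```

Another option is to keep the single shared file object and guard the seek and write pair with a threading.Lock; the per-thread handle simply avoids the shared file position altogether.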
Points that can be optimized further

Sharing one file descriptor among multiple threads can cause problems
It is recommended to use os.dup to copy the file descriptor and os.fdopen to open a file object for it (note that descriptors copied with os.dup still share a single file offset, so having each thread open the file itself, or putting a lock around seek and write, is safer)
The resource URL and the number of threads should be passed in on the command line
Get the command-line arguments with sys.argv
Support the form python downloader.py url num
Report an error when the number of arguments or their format is wrong
Various kinds of error handling
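
The command-line points in this list could look roughly like the sketch below. It is a sketch under assumptions, not the code from the GitHub repository: parse_args and main are hypothetical names, and the checks (an http or https URL, a thread count of at least 1) are just one reasonable reading of "wrong format".

```python
import sys


def parse_args(argv):
    """Return (url, num) from ['downloader.py', url, num] or raise ValueError."""
    if len(argv) != 3:
        raise ValueError("usage: python downloader.py url num")
    url, num_text = argv[1], argv[2]
    if not url.startswith(("http://", "https://")):
        raise ValueError("url must start with http:// or https://")
    try:
        num = int(num_text)
    except ValueError:
        raise ValueError("num must be an integer, got %r" % num_text)
    if num < 1:
        raise ValueError("num must be at least 1")
    return url, num


def main(argv=None):
    """Parse the arguments; return 0 on success, 1 after reporting an error."""
    argv = sys.argv if argv is None else argv
    try:
        url, num = parse_args(argv)
    except ValueError as err:
        sys.stderr.write("%s\n" % err)
        return 1
    print("would download %s with %d threads" % (url, num))
    return 0


if __name__ == "__main__":
    # Demonstration with an explicit argument list instead of the real sys.argv.
    main(["downloader.py", "http://51reboot.com/src/blogimg/pc.jpg", "8"])
```

In a real script the last line would be sys.exit(main()), so that a wrong argument count or format ends the program with a non-zero exit status.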
As the saying goes: Dior for women, Oreo for men. This article is one you deserve to have.
That's about it. I am still learning Python myself, and this article reflects only my personal views, so mistakes are inevitable. Corrections are welcome so we can learn together. The complete code for this article is on GitHub, and stars are much appreciated.
