I came across an article about multi-threaded downloading, so here I'll write up my own understanding of how a multi-threaded download works.
When we visit http://192.168.10.7/a.jpg, the browser issues a GET request, and the response headers contain Content-Length: 37694.
This is the size of the a.jpg file.
Capturing the traffic shows that the server sends multiple packets (PDUs) plus the file metadata, which are then assembled into the a.jpg image.
If we instead use requests.head("http://192.168.10.7/a.jpg"), the server returns only the file metadata, not the file data:
response = requests.head(self.url)
print(response.headers)
# {'Keep-Alive': 'timeout=5, max=100', 'Accept-Ranges': 'bytes', 'Date': 'Sat, 02:56:08 GMT', 'ETag': '"933e-548c4b0beff53"', 'Content-Type': 'image/jpeg', 'Content-Length': '37694', 'Last-Modified': 'Sat, 02:21:39 GMT', 'Connection': 'Keep-alive', 'Server': 'Apache/2.4.18 (Ubuntu)'}
Note the 'Accept-Ranges': 'bytes' entry, which tells us the server supports byte-range requests. The a.jpg file is 37694 bytes.
Saving a.jpg and checking the file on disk shows the same size.
OK, we know the size of the file; how do we download it with multiple threads?
If we use 3 threads to download a.jpg, thread 1 downloads the first 1260x10 = 12600 bytes (bytes 0-12599), thread 2 downloads bytes 12600-25199, and so on, with whatever doesn't divide evenly also handled by one of the threads. A sketch of the range arithmetic follows below.
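As a quick illustration (my own sketch, not from the original article), here is how the byte ranges could be computed for a given file size and thread count; the last range absorbs any remainder:

def split_ranges(filesize, threadnum):
    # each thread gets roughly filesize // threadnum bytes
    step = filesize // threadnum
    ranges = []
    start = 0
    for i in range(threadnum):
        # the last thread's range runs to the final byte of the file
        end = filesize - 1 if i == threadnum - 1 else start + step - 1
        ranges.append((start, end))
        start = end + 1
    return ranges

print(split_ranges(37694, 3))
# [(0, 12563), (12564, 25127), (25128, 37693)]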
But doesn't a plain GET request download the whole a.jpg file? How do we fetch only a part of it?
We can add "Range: bytes=0-12599" to the headers of the GET request. Let's test that first:
headers = {"Range": "bytes=0-12599"}
res = requests.get(self.url, headers=headers)
# res.text is the byte data returned by GET automatically decoded into a str;
# res.content is the raw bytes, so here we write res.content directly
with open(self.filename, 'wb') as f:
    f.write(res.content)
Opening the file now shows a partial image.
Next, let's fetch the following chunk of data:
headers = {"Range":"bytes=12600-25199"} = Requests.get (self.url,headers=headers) # Res.text is an automatic encoding of the byte type data obtained by GET, is the str type, Res.content is the original byte type data # So here is the direct write (res.content)with open ( Self.filename,'ab+') as F: print(F.tell ()) f.write (res.content)
You can see the result in the file.
Recall the file-open modes:
r or rt: the default mode, text-mode read
rb: binary read
w or wt: text-mode write; the existing contents are emptied on open
wb: binary write; the contents are likewise emptied
a: append mode; writes only at the end of the file
a+: read-write; writes can only go to the end of the file
w+: read-write; the difference from a+ is that it empties the file's contents
r+: read-write; the difference from a+ is that it can write anywhere in the file
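A small demonstration of the difference that matters here (my own sketch): in append mode, writes always go to the end of the file regardless of seek(), while r+ honors the seek position:

# prepare a 10-byte file
with open('demo.bin', 'wb') as f:
    f.write(b'0123456789')

with open('demo.bin', 'ab+') as f:
    f.seek(2)
    f.write(b'XX')    # append mode ignores the seek for writes: data goes to the end

with open('demo.bin', 'rb+') as f:
    f.seek(2)
    f.write(b'YY')    # r+ really writes at offset 2

with open('demo.bin', 'rb') as f:
    print(f.read())   # b'01YY456789XX'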
For multi-threaded downloading we open the file with open('file', 'rb+'). Continuing the download above (which has reached byte 25199), this time I start downloading from byte 26000 and call f.seek(26000) before writing, to check whether the file can still be saved and whether a blank gap appears in the middle:
headers = {"Range":"bytes=26000-37694"} = Requests.get (self.url,headers=headers) # Res.text is an automatic encoding of the byte type data obtained by GET, is the str type, Res.content is the original byte type data # So here is the direct write (res.content)with open ( Self.filename,'rb+') as F: F.seek (26000) F.write ( Res.content)
The file after this download:
The picture may differ from what we imagined, but rb+ clearly lets us read and write at any position.
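We can check what ended up in the gap (a quick verification of my own): bytes 25200-25999 were never downloaded, and the OS fills such holes with null bytes:

with open(self.filename, 'rb') as f:
    f.seek(25200)
    gap = f.read(800)                  # the region we skipped over
    print(all(b == 0 for b in gap))    # True: the hole is zero-filled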
This also brings up a point worth knowing, which may come in handy in your own testing:
f.truncate(n): truncates the file to n bytes, counted from the beginning of the file; without n, it truncates at the current position. Everything after the cut-off point is deleted.
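For example (my own quick sketch):

with open('t.txt', 'w') as f:
    f.write('abcdefgh')

with open('t.txt', 'r+') as f:
    f.truncate(3)       # keep only the first 3 bytes

with open('t.txt') as f:
    print(f.read())     # 'abc'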
OK, now let's download the file using multiple threads.
The design ideas are:
1. Each thread downloads a portion of the data.
2. Each thread opens the file in rb+ mode.
3. After downloading its data, each thread uses f.seek() to move to the corresponding position, then writes the data.
Write errors occur if we simply call f = open() once and then have multiple threads call f.write() on the same file object.
Instead, we can use os.dup() to duplicate the file descriptor, and os.fdopen(fd, mode, buffering) to open a separate file object for each thread.
My personal understanding of the benefit is that os.dup() copies the file handle and os.fdopen() buffers writes first; I haven't verified this against the official documentation yet.
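As a minimal sketch of the mechanism (my own illustration, not from the original article): os.dup() clones the descriptor and os.fdopen() wraps each clone in its own buffered file object. One caveat worth noting: duplicated descriptors still share the underlying file offset, so in this sketch each object flushes immediately after its seek-and-write:

import os

with open('out.bin', 'wb'):
    pass    # just create/empty the file

with open('out.bin', 'rb+') as f:
    fileno = f.fileno()
    # one buffered file object per "writer", each on a duplicated descriptor
    w1 = os.fdopen(os.dup(fileno), 'rb+', -1)
    w2 = os.fdopen(os.dup(fileno), 'rb+', -1)

    w1.seek(0)
    w1.write(b'AAAA')   # writer 1 fills bytes 0-3
    w1.flush()          # flush before anything else moves the shared offset

    w2.seek(4)
    w2.write(b'BBBB')   # writer 2 fills bytes 4-7
    w2.flush()

    w1.close()
    w2.close()

with open('out.bin', 'rb') as f:
    print(f.read())     # b'AAAABBBB'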
Code:
Python 3
pip install requests
The following code can be run directly:
# -*- coding:utf8 -*-
import threading, sys
import requests
import time
import os


class MultiThreadDownload(threading.Thread):
    def __init__(self, url, startpos, endpos, fd):
        super(MultiThreadDownload, self).__init__()
        self.url = url
        self.startpos = startpos
        self.endpos = endpos
        self.fd = fd

    def download(self):
        print("start thread:%s at %s" % (self.getName(), time.time()))
        headers = {"Range": "bytes=%s-%s" % (self.startpos, self.endpos)}
        res = requests.get(self.url, headers=headers)
        # res.text is the data automatically decoded to str;
        # res.content is the raw bytes, so we write res.content directly
        self.fd.seek(self.startpos)
        self.fd.write(res.content)
        print("stop thread:%s at %s" % (self.getName(), time.time()))
        # f.close()

    def run(self):
        self.download()


if __name__ == "__main__":
    url = sys.argv[1]
    # get the file size and filename
    filename = url.split('/')[-1]
    filesize = int(requests.head(url).headers['Content-Length'])
    print("%s filesize:%s" % (filename, filesize))

    # number of threads
    threadnum = 3
    # semaphore, allowing only 3 threads to run at once
    threading.BoundedSemaphore(threadnum)
    # 3 threads by default; the thread count could also be passed as a parameter
    step = filesize // threadnum
    mtd_list = []
    start = 0
    end = -1

    # create the (empty) file first
    tempf = open(filename, 'w')
    tempf.close()
    # rb+: binary open, can read and write at any position
    with open(filename, 'rb+') as f:
        fileno = f.fileno()
        # If the file size is 11 bytes, we want bytes 0-10,
        # so once end == 10 all the data has been requested.
        while end < filesize - 1:
            start = end + 1
            end = start + step - 1
            if end > filesize:
                end = filesize
            # print("start:%s, end:%s" % (start, end))
            # duplicate the file descriptor
            dup = os.dup(fileno)
            # print(dup)
            # open a separate file object on the duplicate
            fd = os.fdopen(dup, 'rb+', -1)
            # print(fd)
            t = MultiThreadDownload(url, start, end, fd)
            t.start()
            mtd_list.append(t)

        for i in mtd_list:
            i.join()
Execution Result:
python multiprocess_download.py http://192.168.10.7/of.tar.gz
of.tar.gz filesize:36578022
start thread:Thread-1 at 1487405833.7353075
start thread:Thread-2 at 1487405833.736311
start thread:Thread-3 at 1487405833.7378094
stop thread:Thread-1 at 1487405836.9561603
stop thread:Thread-3 at 1487405837.0016065
stop thread:Thread-2 at 1487405837.0116146
After several tests, the downloaded files open normally.
If multiple sites hosted the of.tar.gz file, the same approach could fetch different ranges from different sources, making the benefit of multi-threaded downloading even more apparent.
Following the theory above, we should be able to build something similar for downloads: say 10 machines each run an agent, and each agent reports its local directory's file information to a server. When an agent wants to download a file, it asks the server which agents have that file, then works out which ranges to download from which agents.
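To make the idea concrete, here is an entirely hypothetical sketch of the server-side bookkeeping; the agent addresses and function names are invented for illustration:

# Hypothetical in-memory "server": agents report their files, downloaders query.
registry = {}                           # filename -> set of agent addresses

def report(agent, filenames):
    """An agent reports the files in its directory."""
    for name in filenames:
        registry.setdefault(name, set()).add(agent)

def who_has(filename):
    """A downloader asks which agents hold the file."""
    return sorted(registry.get(filename, ()))

report('agent1:8000', ['of.tar.gz', 'a.jpg'])
report('agent2:8000', ['of.tar.gz'])
print(who_has('of.tar.gz'))   # ['agent1:8000', 'agent2:8000']
# The downloader would then split the byte ranges among these agents,
# exactly as the threads split ranges above.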