Principles and an implementation example of resumable downloads (breakpoint continuation) in Python, with multithreading support

Source: Internet
Author: User

1. Basic principles

Any discussion of resumable downloads has to start with the HTTP header fields involved.

① Content-Length

Content-Length indicates the size of the entity body in an HTTP message. Unless chunked encoding is used, the Content-Length header is required for any message that carries an entity body. It is used to detect message truncation caused by server crashes and to delimit multiple messages that share a persistent connection.

Detecting truncation
Earlier versions of HTTP closed the connection to delimit the end of a message. Without Content-Length, however, the client cannot tell whether the connection was closed normally at the end of the message or closed mid-transfer because the server crashed. The client needs Content-Length to detect truncated messages.
Message truncation matters especially for caching proxy servers. If a cache receives a truncated message without recognizing the truncation, it may store the incomplete content and serve it many times. Caching proxies typically do not cache HTTP bodies that lack an explicit Content-Length header, which reduces the risk of caching a truncated message.
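The truncation check described above can be sketched in a few lines. This is only an illustration: `is_truncated` is a hypothetical helper name, and it assumes the headers are available as a plain dict keyed exactly as "Content-Length".

```python
def is_truncated(headers, body):
    """Return True if fewer bytes arrived than Content-Length promised.

    Returns None when the header is absent, because truncation cannot be
    detected from the length alone in that case.
    """
    declared = headers.get("Content-Length")
    if declared is None:
        return None
    return len(body) < int(declared)


print(is_truncated({"Content-Length": "10"}, b"01234"))       # True
print(is_truncated({"Content-Length": "10"}, b"0123456789"))  # False
print(is_truncated({}, b"01234"))                             # None
```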
Content-Length and persistent connections
The Content-Length header is essential for persistent connections. When a response arrives over a persistent connection, another HTTP response may follow it. The Content-Length header tells the client where one message ends and the next begins. Because the connection is persistent, the client cannot rely on the connection closing to find the end of the message.
There is one case where a persistent connection can be used without Content-Length: chunked encoding. With chunked encoding, the data is sent as a series of chunks, and each chunk carries its own size. Even if the server does not know the size of the whole entity when it generates the headers (usually because the entity is generated dynamically), it can still use chunked encoding to transfer a number of chunks of known size.
② Transfer-Encoding

HTTP defines only one transfer encoding, namely chunked. It is useful when, for example, the server generates the body dynamically and the client does not want to wait until the whole body has been generated, because the delay in between can be very large. Each chunk consists of a hexadecimal size followed by that many bytes of data, and a chunk of size 0 ends the body. The chunked format is as follows:


HTTP/1.1 200 OK
Transfer-Encoding: chunked

2
AB
A
0123456789
0
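To make the chunked format concrete, here is a minimal decoder sketch for a body like the one above. `decode_chunked` is a hypothetical helper written for this illustration; it assumes the whole body is already in memory as bytes with CRLF line endings, and it ignores trailers.

```python
def decode_chunked(raw):
    """Decode a chunked-encoded body (bytes) into the original payload."""
    body = b""
    pos = 0
    while True:
        line_end = raw.index(b"\r\n", pos)
        # The size line is hexadecimal; drop any chunk extension after ';'
        size_field = raw[pos:line_end].split(b";")[0]
        size = int(size_field.decode("ascii"), 16)
        if size == 0:
            break  # the zero-length chunk marks the end of the body
        start = line_end + 2
        body += raw[start:start + size]
        pos = start + size + 2  # skip the chunk data and its trailing CRLF
    return body


encoded = b"2\r\nAB\r\nA\r\n0123456789\r\n0\r\n\r\n"
print(decode_chunked(encoded))  # b'AB0123456789'
```

The second size line is "A", which is hexadecimal for 10, so the second chunk carries the ten digits.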
③ Content-Encoding

Three values are common: gzip, deflate, and compress. The header indicates which algorithm was used to encode the entity. Content-Encoding is often used together with Transfer-Encoding.

④ Content-Range

Used in response headers. It specifies where a partial body fits within the entire entity, and it also gives the length of the entire entity. When a server returns a partial response to a client, it must describe both the range the response covers and the length of the entire entity. General format:

Content-Range: bytes start-end/total
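As an illustration of the format above, the following sketch parses a Content-Range value into numbers. `parse_content_range` is a hypothetical helper; it handles only the `bytes start-end/total` form (with `*` allowed for an unknown total) and raises ValueError for anything else.

```python
import re


def parse_content_range(value):
    """Parse 'bytes start-end/total' into (start, end, total).

    total is None when the server sends '*' for an unknown length.
    """
    match = re.fullmatch(r"bytes (\d+)-(\d+)/(\d+|\*)", value.strip())
    if match is None:
        raise ValueError("unsupported Content-Range: %r" % value)
    start, end, total = match.groups()
    return int(start), int(end), None if total == "*" else int(total)


print(parse_content_range("bytes 0-499/1234"))  # (0, 499, 1234)
print(parse_content_range("bytes 500-999/*"))   # (500, 999, None)
```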

⑤ Range

Used in request headers. It specifies the positions of the first and last bytes requested. General format:

Range: bytes=start-end
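A small sketch of building this header for a resumed download follows. `build_range_header` is a hypothetical helper name; leaving out the end position produces the open-ended form `bytes=start-`, which asks for everything from `start` to the end of the entity.

```python
def build_range_header(start, end=None):
    """Return a headers dict requesting bytes start..end (inclusive).

    When end is None the range is open-ended: from start to the last byte.
    """
    if start < 0 or (end is not None and end < start):
        raise ValueError("invalid byte range: %r-%r" % (start, end))
    if end is None:
        value = "bytes={0}-".format(start)
    else:
        value = "bytes={0}-{1}".format(start, end)
    return {"Range": value}


print(build_range_header(1024, 2047))  # {'Range': 'bytes=1024-2047'}
print(build_range_header(1024))        # {'Range': 'bytes=1024-'}
```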

2. Single-threaded implementation

① Check whether resuming is supported

Send a HEAD request for part of the entity (that is, with a Range header) and check whether the response headers include Content-Range
Send a HEAD request for part of the entity and check whether the returned status code is 206
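The two checks above could be combined roughly as follows. This is a sketch under assumptions, not the article's code: `supports_resume` and `probe` are hypothetical names, the probe asks for only the first byte, and `probe` needs the third-party requests package and network access, so only the pure function is exercised here.

```python
def supports_resume(status_code, headers):
    """Decide from the response to a ranged request whether Range is honored."""
    names = {name.lower() for name in headers}
    return status_code == 206 and "content-range" in names


def probe(url):
    """Send a ranged HEAD request and apply the check (requires requests)."""
    import requests  # imported lazily so the pure check works without it

    response = requests.head(url, headers={"Range": "bytes=0-0"},
                             allow_redirects=True)
    return supports_resume(response.status_code, response.headers)


print(supports_resume(206, {"Content-Range": "bytes 0-0/100"}))  # True
print(supports_resume(200, {"Content-Length": "100"}))           # False
```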
② Implementation steps

Use the HEAD method to get the file size
Get the local file size
Set the Range request header
Use requests.Response.iter_content with stream mode enabled
Write to the file each time a certain amount of data has been downloaded

The code is as follows:

#!/usr/bin/env python
# coding: utf-8

"""
Copyright (c) 2015-2016 Cain
Author: cain <singforcain@gmail.com>
"""

import os
import sys
import time
import logging
import argparse

import requests


class FileDownload(object):
    def __init__(self, url, file_name):
        """
        :param url: download address of the file
        :param file_name: name to save the file under
        :return:
        """
        self.url = url
        self.file_name = file_name
        self.start_time = time.time()
        self.file_size = self.getsize()
        self.offset = self.getoffset()
        self.downloaded_size = self.offset
        self.headers = self.setheaders()
        self.tmpfile = b""  # in-memory buffer, flushed to disk every 50 KB
        self.info()

    def info(self):
        logging.info("downloaded [%s] bytes" % self.offset)

    def setheaders(self):
        """
        Build the Range header from the size of the part already downloaded
        :return: headers dict
        """
        start = self.offset
        end = self.file_size - 1
        byte_range = "bytes={0}-{1}".format(start, end)
        return {"Range": byte_range}

    def getoffset(self):
        """
        :return: number of bytes already on disk; exits if the file is complete
        """
        if os.path.exists(self.file_name):
            local_size = os.path.getsize(self.file_name)
            if local_size == self.file_size:
                sys.exit(0)
            return local_size
        return 0

    def getsize(self):
        """
        :return: the size of the file, obtained with the HEAD method
        """
        response = requests.head(self.url)
        return int(response.headers["content-length"])

    def download(self):
        """
        The core part of the resumable download
        :return:
        """
        # Append mode keeps the bytes that are already on disk
        with open(self.file_name, "ab") as f:
            try:
                r = requests.get(self.url, stream=True, headers=self.headers)
                for chunk in r.iter_content(chunk_size=1024):
                    if not chunk:
                        break
                    self.tmpfile += chunk
                    # ">=" because chunks are not guaranteed to be exactly 1024 bytes
                    if len(self.tmpfile) >= 1024 * 50:
                        f.write(self.tmpfile)
                        self.downloaded_size += len(self.tmpfile)
                        logging.info("Downloaded---[%.2f%%] [%s/%s] bytes" % (
                            float(self.downloaded_size) / self.file_size * 100,
                            self.downloaded_size, self.file_size))
                        self.tmpfile = b""
            except KeyboardInterrupt:
                logging.warning("Interrupted by user")
                logging.info("Ending the thread, please do not exit")
            finally:
                # Flush whatever is still buffered so the next run resumes from here
                f.write(self.tmpfile)
                self.downloaded_size += len(self.tmpfile)
                logging.info("Downloaded---[%.2f%%] %s/%s bytes" % (
                    float(self.downloaded_size) / self.file_size * 100,
                    self.downloaded_size, self.file_size))
                consume = time.time() - self.start_time
                logging.info("It consumes %d seconds" % consume)
                logging.info("End at %s" % time.strftime(
                    "%Y-%m-%d %H:%M:%S", time.localtime(time.time())))


def init():
    """
    Configure logging
    :return:
    """
    logging.basicConfig(format="[%(asctime)s]\t[%(levelname)s]\t%(message)s",
                        level=logging.DEBUG,
                        datefmt="%Y/%m/%d %I:%M:%S %p")


def run(url, name):
    if not name:
        name = url.split("/")[-1]
    file = FileDownload(url, name)
    file.download()


if __name__ == "__main__":
    init()
    parser = argparse.ArgumentParser()
    parser.add_argument("url", help="the file's url")
    parser.add_argument("--name", help="the file's name you want to rename")
    args = parser.parse_args()
    run(args.url, args.name)
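The listing above appends whatever the server sends. If a server ignores the Range header and answers 200 with the full body, appending would corrupt the local file. As a possible extension (not part of the original code), the file mode could be chosen from the status code of the ranged GET. `open_mode_for` is a hypothetical helper used only for this sketch.

```python
def open_mode_for(status_code):
    """Pick how to open the local file for the response to a ranged GET."""
    if status_code == 206:
        return "ab"  # server sent only the missing tail: append to the file
    if status_code == 200:
        return "wb"  # server ignored Range and sent everything: start over
    raise ValueError("unexpected status code: %d" % status_code)


print(open_mode_for(206))  # ab
print(open_mode_for(200))  # wb
```

In this variant the request would be sent first and the file opened afterwards, using the returned mode.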

3. Multithreaded implementation (without resume support)

The code is as follows:
#!/usr/bin/env python
# coding: utf-8

"""
Copyright (c) 2015-2016 Cain
Author: cain <singforcain@gmail.com>
"""

import sys
import time
import math
import queue
import logging
import argparse
import threading

import requests

mutex = threading.Lock()  # serializes writes to the shared output file


class FileDownload(object):
    def __init__(self, url, filename, threadnum, bulk_size, chunk_size):
        self.url = url
        self.filename = filename
        self.threadnum = threadnum
        self.bulk_size = bulk_size
        self.chunk_size = chunk_size
        self.file_size = self.getsize()
        self.buildemptyfile()
        # Unbounded queue: a bounded one would block in setqueue()
        # before any worker thread has started consuming tasks
        self.queue = queue.Queue()
        self.setqueue()

    def getsize(self):
        """
        :return: the size of the file, obtained with the HEAD method
        """
        response = requests.head(self.url)
        return int(response.headers["content-length"])

    def buildemptyfile(self):
        """
        Create an empty file of the final size
        :return:
        """
        try:
            logging.info("Building empty file ...")
            with open(self.filename, "wb") as f:
                # One byte at the last offset makes the file exactly file_size bytes
                f.seek(self.file_size - 1)
                f.write(b"\x00")
        except Exception as err:
            logging.error("Building empty file error ...")
            logging.error(err)
            sys.exit(1)

    def setqueue(self):
        """
        Fill the queue with byte ranges based on the file size and the size of each task
        :return:
        """
        logging.info("Setting the queue ...")
        tasknums = int(math.ceil(float(self.file_size) / self.bulk_size))  # round up
        for i in range(tasknums):
            start = self.bulk_size * i
            # Clamp the last range to the final byte of the file
            end = min(self.bulk_size * (i + 1), self.file_size) - 1
            self.queue.put((start, end))

    def download(self):
        while True:
            logging.info("Downloading data in %s" % threading.current_thread().name)
            try:
                # get_nowait() avoids blocking forever if another thread
                # takes the last task between a check and the get
                start, end = self.queue.get_nowait()
            except queue.Empty:
                logging.info("%s is over ..." % threading.current_thread().name)
                break
            tmpfile = b""
            ranges = "bytes={0}-{1}".format(start, end)
            headers = {"Range": ranges}
            logging.info(headers)
            r = requests.get(self.url, stream=True, headers=headers)
            for chunk in r.iter_content(chunk_size=self.chunk_size):
                if not chunk:
                    break
                tmpfile += chunk
            # Only one thread at a time seeks and writes in the output file
            with mutex:
                with open(self.filename, "r+b") as f:
                    f.seek(start)
                    f.write(tmpfile)
                logging.info("Writing [%d] bytes data into the file ..." % len(tmpfile))

    def run(self):
        threads = list()
        for i in range(self.threadnum):
            threads.append(threading.Thread(target=self.download))
        for thread in threads:
            thread.start()
        for thread in threads:
            thread.join()


def loginit():
    """
    Configure logging
    :return:
    """
    logging.basicConfig(format="[%(asctime)s]\t[%(levelname)s]\t%(message)s",
                        level=logging.DEBUG,
                        datefmt="%Y/%m/%d %I:%M:%S %p")


def start(url, filename, threadnum):
    """
    The core of the download
    :param url:
    :param filename:
    :param threadnum:
    :return:
    """
    # Fall back to the last path segment of the URL when no name is given
    filename = filename or url.split("/")[-1]
    # Default to 5 threads (the original also capped the value,
    # but the limit is missing from the source listing)
    threadnum = threadnum if threadnum else 5
    bulk_size = 2 * 1024 * 1024  # size of each task: 2 MB
    chunk_size = 50 * 1024       # size of each read: 50 KB
    print(url, filename, threadnum, bulk_size, chunk_size)
    download = FileDownload(url, filename, threadnum, bulk_size, chunk_size)
    download.run()


if __name__ == "__main__":
    loginit()
    logging.info("APP is starting ...")
    start_time = time.time()
    parser = argparse.ArgumentParser()
    parser.add_argument("url", help="the file's url")
    parser.add_argument("--filename", help="the file's name you want to rename")
    parser.add_argument("--threadnum", help="the threads you want to choose", type=int)
    args = parser.parse_args()
    start(args.url, args.filename, args.threadnum)
    logging.info("APP is ending ...")
    logging.info("It consumes [%d] seconds" % (time.time() - start_time))


4. Multithreaded resumable download

This combines the two approaches above. Usually a configuration file is used to save the state of the download, and when the download is started again, the state is restored from that file.
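One way such a state file might look is sketched below. Everything here is an assumption made for illustration: the JSON layout, the file name, and the helpers `save_state` and `load_pending` are hypothetical, and the example URL is a placeholder. The idea is to store the byte ranges that are still pending and to rebuild the full list when no matching state exists.

```python
import json
import os
import tempfile


def save_state(path, url, file_size, pending):
    """Write the ranges that still need downloading, replacing the file atomically."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"url": url, "file_size": file_size, "pending": pending}, f)
    os.replace(tmp, path)


def load_pending(path, url, file_size, bulk_size):
    """Return the pending ranges from the state file, or all ranges if none matches."""
    if os.path.exists(path):
        with open(path) as f:
            state = json.load(f)
        # Only trust the saved state if it describes the same download
        if state.get("url") == url and state.get("file_size") == file_size:
            return [tuple(r) for r in state["pending"]]
    return [(s, min(s + bulk_size, file_size) - 1)
            for s in range(0, file_size, bulk_size)]


state_path = os.path.join(tempfile.mkdtemp(), "download.state.json")
url = "http://example.com/f.bin"
ranges = load_pending(state_path, url, 10, 4)
print(ranges)  # [(0, 3), (4, 7), (8, 9)]
save_state(state_path, url, 10, ranges[1:])   # first range finished
print(load_pending(state_path, url, 10, 4))   # [(4, 7), (8, 9)]
```

A worker would call `save_state` after each range has been written to disk, so that a restart only re-queues the ranges that are still listed.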
