Base64 is a method that uses 64 characters to represent arbitrary binary data.
When we open an EXE, JPG, or PDF file in Notepad, we see a lot of garbled characters, because binary files contain many characters that cannot be displayed or printed. For text-processing software such as Notepad to handle binary data, a method of converting binary data to strings is required. Base64 is one of the most common binary encoding methods.
The principle behind Base64 is simple. First, prepare an array of 64 characters:
['A', 'B', 'C', ... 'a', 'b', 'c', ... '0', '1', ... '+', '/']
Then process the binary data in groups of 3 bytes. Each group holds 3x8=24 bits, which are divided into 4 smaller groups of exactly 6 bits each.
This gives 4 numbers to use as indexes. Looking each one up in the table yields the corresponding 4 characters, which form the encoded string.
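The grouping and table lookup described above can be sketched in a few lines of Python 3. This is only an illustration of the principle, not how the base64 module is implemented; the names ALPHABET and encode_group are made up for this example.

```python
import string

# The standard Base64 table: A-Z, a-z, 0-9, '+', '/'
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + '+/'

def encode_group(group):
    """Encode exactly 3 bytes into 4 Base64 characters (illustrative helper)."""
    assert len(group) == 3
    # Pack the 3 bytes into a single 24-bit integer
    n = int.from_bytes(group, 'big')
    # Cut the 24 bits into four 6-bit indexes, most significant first
    indexes = [(n >> shift) & 0x3F for shift in (18, 12, 6, 0)]
    # Look each index up in the table
    return ''.join(ALPHABET[i] for i in indexes)

print(encode_group(b'Man'))  # TWFu
```

Here the bytes of 'Man' split into the indexes 19, 22, 5 and 46, which map to 'T', 'W', 'F' and 'u'.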
Base64 therefore encodes every 3 bytes of binary data as 4 bytes of text, so the encoded data is about 33% larger than the original. The advantage is that the encoded text can be displayed directly in message bodies, Web pages, and so on.
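The 3-to-4 ratio is easy to check with the standard base64 module. The sketch below assumes inputs whose length is a multiple of 3; encoded_size is a hypothetical helper written for this example.

```python
import base64

def encoded_size(n_bytes):
    """Base64 text length for n_bytes of input (n_bytes a multiple of 3)."""
    return n_bytes // 3 * 4

for n in (3, 300, 3000):
    # bytes(n) is simply n zero bytes; the content does not affect the length
    actual = len(base64.b64encode(bytes(n)))
    print(n, actual, encoded_size(n))
```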
What if the length of the binary data is not a multiple of 3, so that 1 or 2 bytes are left over at the end? Base64 pads the final group with \x00 bytes and then appends 1 or 2 = characters to the encoded output to indicate how many bytes were padded. The padding is removed automatically during decoding.
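A minimal sketch of a full encoder with this padding rule, assuming the standard alphabet, might look as follows. The name b64encode_sketch is hypothetical; in real code, use base64.b64encode.

```python
import string

ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + '+/'

def b64encode_sketch(data):
    """Encode bytes to a Base64 string, padding the last group if needed."""
    out = []
    for i in range(0, len(data), 3):
        group = data[i:i + 3]
        missing = 3 - len(group)  # 0, 1 or 2 bytes short of a full group
        # Fill the short group with \x00 bytes so it is 24 bits long
        n = int.from_bytes(group + b'\x00' * missing, 'big')
        chars = [ALPHABET[(n >> shift) & 0x3F] for shift in (18, 12, 6, 0)]
        if missing:
            # Characters that come only from padding are replaced with '='
            chars[-missing:] = '=' * missing
        out.append(''.join(chars))
    return ''.join(out)

print(b64encode_sketch(b'binary\x00string'))  # YmluYXJ5AHN0cmluZw==
```

The input is 13 bytes long, so the last group has only 1 byte, 2 bytes are padded, and the output ends in ==.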
Python's built-in base64 module encodes and decodes Base64 directly (in Python 3 the input and output are bytes):
>>> import base64
>>> base64.b64encode(b'binary\x00string')
b'YmluYXJ5AHN0cmluZw=='
>>> base64.b64decode(b'YmluYXJ5AHN0cmluZw==')
b'binary\x00string'
Because standard Base64 output may contain the characters + and /, which cannot be used directly as URL parameters, there is also a "URL-safe" Base64 encoding, which simply replaces + and / with - and _ respectively:
>>> base64.b64encode(b'i\xb7\x1d\xfb\xef\xff')
b'abcd++//'
>>> base64.urlsafe_b64encode(b'i\xb7\x1d\xfb\xef\xff')
b'abcd--__'
>>> base64.urlsafe_b64decode('abcd--__')
b'i\xb7\x1d\xfb\xef\xff'
You can also define the order of the 64 characters yourself to create a custom Base64 encoding, but this is rarely necessary.
Base64 is an encoding based on a table lookup and cannot be used for encryption, even with a custom encoding table.
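For illustration only, a custom table can be layered on top of the standard module by translating characters after encoding and before decoding. The ordering in CUSTOM and the function names below are hypothetical choices made for this sketch.

```python
import base64
import string

STANDARD = (string.ascii_uppercase + string.ascii_lowercase
            + string.digits + '+/').encode('ascii')
# Hypothetical custom table: digits, lowercase, uppercase, then '-' and '_'
CUSTOM = (string.digits + string.ascii_lowercase
          + string.ascii_uppercase + '-_').encode('ascii')

# Translation tables mapping one alphabet to the other; '=' is left untouched
TO_CUSTOM = bytes.maketrans(STANDARD, CUSTOM)
TO_STANDARD = bytes.maketrans(CUSTOM, STANDARD)

def custom_b64encode(data):
    """Encode with the standard table, then swap in the custom characters."""
    return base64.b64encode(data).translate(TO_CUSTOM)

def custom_b64decode(text):
    """Map custom characters back to the standard table, then decode."""
    return base64.b64decode(text.translate(TO_STANDARD))

encoded = custom_b64encode(b'binary\x00string')
print(encoded, custom_b64decode(encoded))
```

Anyone who has the table can reverse the mapping just as easily, which is why a custom table adds no real secrecy.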
Base64 is suitable for encoding small pieces of content, such as digital certificate signatures and cookie contents.
Because the = character can also appear in Base64 output, and = is ambiguous when used in URLs and cookies, many Base64 implementations strip the = after encoding:
# Standard Base64:
'abcd' -> 'YWJjZA=='
# With = removed automatically:
'abcd' -> 'YWJjZA'
How do you decode after the = has been removed? Because Base64 turns every 3 bytes into 4 characters, the length of a Base64 string is always a multiple of 4. So you only need to append = until the length of the string is a multiple of 4, and then it can be decoded normally.
Write a Base64 decoding function that can handle input from which the = has been removed:
>>> base64.b64decode('YWJjZA==')
b'abcd'
>>> base64.b64decode('YWJjZA')
Traceback (most recent call last):
...
binascii.Error: Incorrect padding
>>> safe_b64decode('YWJjZA')
b'abcd'
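One possible way to write safe_b64decode is sketched below. The function body is an assumption, since the text only shows how the function is called; it simply appends = until the length is a multiple of 4.

```python
import base64

def safe_b64decode(s):
    """Decode Base64 text whose trailing '=' padding may have been stripped."""
    if isinstance(s, str):
        s = s.encode('ascii')
    # -len(s) % 4 is the number of characters missing to reach a multiple of 4
    s += b'=' * (-len(s) % 4)
    return base64.b64decode(s)

print(safe_b64decode('YWJjZA'))    # b'abcd'
print(safe_b64decode('YWJjZA=='))  # b'abcd'
```

Input that already has its padding passes through unchanged, because no extra = is appended when the length is already a multiple of 4.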
Summary
Base64 is a method of encoding arbitrary binary data as a text string. It is commonly used to transmit small amounts of binary data in URLs, cookies, and Web pages.