How to Use Python to Detect Duplicate Images with a Hash Algorithm
Iconfinder is an icon search engine that provides high-quality icons for designers, developers, and other creative professionals. It hosts more than 340,000 icons and is the world's largest paid icon library. You can also upload and sell original work in Iconfinder's marketplace. Every month, thousands of icons are uploaded to Iconfinder, and a large number of pirated images arrive along with them. In this article, Iconfinder engineer Silviu Tantos presents a clever image deduplication technique to prevent piracy.
In the next few weeks we will launch a feature that checks whether an uploaded icon is a duplicate. For example, if a user downloads an icon and then uploads it to try to make a profit (this has happened), we can use our method to check whether the icon already exists and flag the account as fraudulent. A common way to check whether a file exists in a large collection is to compute the hash value of every file in the dataset and store the hashes in a database. To look up a specific file, first compute its hash value, then search for that hash in the database.
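The lookup described above can be sketched in a few lines. This is only an illustration: an in-memory dict stands in for the database, the helper names `file_digest`, `register`, and `find_duplicate` are hypothetical, and SHA-256 is an arbitrary choice of hash.

```python
import hashlib

def file_digest(data):
    """Return the hex SHA-256 digest of a bytes object."""
    return hashlib.sha256(data).hexdigest()

# Hypothetical stand-in for the hash database: digest -> file name.
known_hashes = {}

def register(name, data):
    """Store the digest of a file's content under its name."""
    known_hashes[file_digest(data)] = name

def find_duplicate(data):
    """Return the name of a stored file with identical content, or None."""
    return known_hashes.get(file_digest(data))

register('icon_a.png', b'fake icon bytes A')
register('icon_b.png', b'fake icon bytes B')

print(find_duplicate(b'fake icon bytes A'))  # icon_a.png
print(find_duplicate(b'fake icon bytes C'))  # None
```

Only byte-for-byte identical content is found this way, which is the limitation the rest of the article addresses.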
Select a hash algorithm
Cryptographic hash algorithms are the most commonly used family. Implementations of MD5, SHA-1, and SHA-256 are available in the standard library of almost any language, and they are very effective for simple use cases.
For example, in Python, first import the hashlib module, and then call a function to generate the hash value of a string or file.
>>> import hashlib
# Calculate the hash value of a string (passed as bytes, so it works on Python 2 and 3).
>>> hashlib.md5(b'The quick brown fox jumps over the lazy dog').hexdigest()
'9e107d9d372bb6826bd81d3542a419d6'
# Load an image file into memory and calculate its hash value.
>>> image_file = open('data/cat_grumpy_orig.png', 'rb').read()
>>> hashlib.md5(image_file).hexdigest()
'3e1f6e9f2689d59b9ed28bcdab73455f'
This approach works well for uploaded files that have not been tampered with. If the input data changes even slightly, however, a cryptographic hash algorithm exhibits the avalanche effect: the hash of the new file is completely different from the hash of the original.
The following example adds an extra period at the end of the sentence.
# Original text.
>>> hashlib.md5(b'The quick brown fox jumps over the lazy dog').hexdigest()
'9e107d9d372bb6826bd81d3542a419d6'
# Slight modification of the text.
>>> hashlib.md5(b'The quick brown fox jumps over the lazy dog.').hexdigest()
'e4d909c290d0fb1ca068ffaddf22cbd0'
As the example above shows, the hash values 9e107d9d372bb6826bd81d3542a419d6 and e4d909c290d0fb1ca068ffaddf22cbd0 have almost nothing in common. Likewise, if the background color of an image is changed, or the image is cropped, rotated, or has even a single pixel modified, it will no longer match anything in the image hash database. A traditional hash algorithm is therefore not practical for this purpose.
For example, after changing the color of the cat's nose in the image, the hash value of the image changes.
# Load the original image into memory and calculate its hash value.
>>> image_file = open('data/cat_grumpy_orig.png', 'rb').read()
>>> hashlib.md5(image_file).hexdigest()
'3e1f6e9f2689d59b9ed28bcdab73455f'
# Load the modified image into memory and calculate its hash value.
>>> image_file_modified = open('data/cat_grumpy_modif.png', 'rb').read()
>>> hashlib.md5(image_file_modified).hexdigest()
'12d1b9409c3e8e0361c24beaee9c0ab1'
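The avalanche effect can be made concrete by counting how many of the 128 MD5 output bits change between two inputs. The sketch below is an illustration only; `differing_bits` is a hypothetical helper name.

```python
import hashlib

def differing_bits(a, b):
    """Count the bits that differ between the MD5 digests of two byte strings."""
    x = int(hashlib.md5(a).hexdigest(), 16)
    y = int(hashlib.md5(b).hexdigest(), 16)
    # XOR leaves a 1 wherever the two digests disagree.
    return bin(x ^ y).count('1')

original = b'The quick brown fox jumps over the lazy dog'
modified = original + b'.'

print(differing_bits(original, original))  # 0
print(differing_bits(original, modified))  # many of the 128 bits differ
```

For a well-behaved cryptographic hash, roughly half of the output bits are expected to flip for any change to the input, so the number of differing bits says nothing about how similar the inputs were.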
For use cases like this, a perceptual hash algorithm is far more effective. A perceptual hash algorithm derives a fingerprint of a multimedia file from features of its content, so that files with only small differences produce identical or similar fingerprints. There are many perceptual hash algorithms; this article presents one called dHash (difference hash), which computes the difference in brightness between adjacent pixels to determine the relative gradient.
dHash
Before studying the dHash algorithm in depth, let us cover some basics. Each pixel of a color image is made up of the three RGB primary colors and can be regarded as a set of red, green, and blue intensity values. For example, use the Python Imaging Library (PIL) to load an image and print its pixel values.
Test image
>>> from PIL import Image
>>> test_image = Image.open('data/test_image.jpg')
# The image is an RGB image with a size of 4x4 pixels.
>>> print('Image Mode: %s' % test_image.mode)
Image Mode: RGB
>>> print('Width: %s px, Height: %s px' % (test_image.size[0], test_image.size[1]))
Width: 4 px, Height: 4 px
# Get the pixel values from the image and print them in rows based on
# the image's width.
>>> width, height = test_image.size
>>> pixels = list(test_image.getdata())
>>> for row in range(height):
...     print(pixels[row * width:(row + 1) * width])
...
[(255, 0, 0), (0, 255, 0), (0, 0, 255), (255, 255, 255)]
[(0, 0, 0), (212, 45, 45), (51, 92, 154), (130, 183, 47)]
[(206, 210, 198), (131, 78, 8), (131, 156, 180), (117, 155, 201)]
[(104, 133, 170), (215, 130, 20), (153, 155, 155), (104, 142, 191)]
The first three pixels are pure red, green, and blue: one color intensity value is 255 and the other two are 0. A pure black pixel has all three values at 0, and a pure white pixel has all three at 255. Every other color is a mix of the three primary colors at different intensities. Now we return to the dHash algorithm, which has four steps. This article walks through each step and verifies its effect on the original image and the modified image.
1. Convert the image to grayscale
Converting the image to grayscale reduces each pixel to a single brightness value. For example, the white pixel (255, 255, 255) becomes 255 and the black pixel (0, 0, 0) becomes 0.
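As a rough illustration of this step, the sketch below converts one RGB pixel to a single intensity. It assumes the ITU-R 601-2 luma weights that Pillow documents for convert('L'); the helper name `to_gray` is hypothetical, and the integer rounding here may differ by one from what a real imaging library produces.

```python
def to_gray(pixel):
    """Reduce an (R, G, B) tuple to one intensity using ITU-R 601-2 weights.

    Assumption: weights 0.299, 0.587, 0.114, computed in integer arithmetic.
    """
    r, g, b = pixel
    return (r * 299 + g * 587 + b * 114) // 1000

print(to_gray((255, 255, 255)))  # 255
print(to_gray((0, 0, 0)))        # 0
print(to_gray((255, 0, 0)))      # 76
```

Green carries the largest weight because the eye is most sensitive to it, which is why pure red maps to a fairly dark gray.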
2. Reduce the image to a common size
Reduce the image to a common base size, such as 9x8 pixels, where the width is one pixel larger than the height (the third step shows why). This removes the high-frequency and detail content of the image and leaves a sample of 72 intensity values. Because resizing or stretching an image should not change its hash value, all images are normalized to this size.
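To show why shrinking discards fine detail, here is a simplified sketch that downscales a grid of intensities by averaging equal-sized blocks. This is not the resampling filter an imaging library uses; it assumes the source dimensions are exact multiples of the target size, and `shrink` is a hypothetical helper name.

```python
def shrink(gray, new_w, new_h):
    """Downscale a 2D list of intensities by averaging equal-sized blocks.

    Assumption: the source width and height are exact multiples of
    new_w and new_h.
    """
    h, w = len(gray), len(gray[0])
    bw, bh = w // new_w, h // new_h
    out = []
    for by in range(new_h):
        row = []
        for bx in range(new_w):
            block = [gray[y][x]
                     for y in range(by * bh, (by + 1) * bh)
                     for x in range(bx * bw, (bx + 1) * bw)]
            row.append(sum(block) // len(block))
        out.append(row)
    return out

# An 18x16 checkerboard of black (0) and white (255) pixels.
image = [[255 if (x + y) % 2 else 0 for x in range(18)] for y in range(16)]
small = shrink(image, 9, 8)

print(len(small[0]), len(small))  # 9 8
print(small[0])                   # every value is 127
```

The pixel-level checkerboard pattern, which is pure high-frequency detail, collapses to a uniform gray in the 9x8 result.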
3. Compare adjacent pixels
The first two steps yield a list of intensity values. Comparing adjacent pixels within each row then produces an array of binary values.
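>>> from PIL import Image
>>> img = Image.open('data/cat_grumpy_orig_after_step_2.png')
>>> width, height = img.size
>>> pixels = list(img.getdata())
# Note: these slices start one pixel apart, so the windows printed below overlap.
# pixels[row * width:(row + 1) * width] for each row would print the true image rows.
>>> for col in range(width):
...     print(pixels[col:col + width])
...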
[254, 254, 255, 253, 248, 254, 255, 254, 255]
[254, 255, 253, 248, 254, 255, 254, 255, 255]
[253, 248, 254, 255, 254, 255, 255, 255, 222]
[248, 254, 255, 254, 255, 255, 255, 222, 184]
[254, 255, 254, 255, 255, 255, 222, 184, 177]
[255, 254, 255, 255, 255, 222, 184, 177, 184]
[254, 255, 255, 255, 222, 184, 177, 184, 225]
[255, 255, 255, 222, 184, 177, 184, 225, 255]
The first value 254 is compared with the second 254, the second value is compared with the third value, and so on, so that each row gets 8 boolean values.
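>>> difference = []
>>> for row in range(height):
...     for col in range(width):
...         # Note: col never equals width, so this test is always true.
...         if col != width:
...             # Note: col + row indexes overlapping windows; the row-major
...             # index of pixel (row, col) would be row * width + col.
...             difference.append(pixels[col + row] > pixels[(col + row) + 1])
...
>>> for col in range(width - 1):
...     print(difference[col:col + (width - 1)])
...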
[False, False, True, True, False, False, True, False]
[False, True, True, False, False, True, False, False]
[True, True, False, False, True, False, False, False]
[True, False, False, True, False, False, False, True]
[False, False, True, False, False, False, True, True]
[False, True, False, False, False, True, True, False]
[True, False, False, False, True, True, False, False]
[False, False, False, True, True, False, False, True]
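In a flat, row-major pixel list, the pixel at (row, col) sits at index row * width + col. The sketch below applies the per-row comparison under that assumption to a small grid of made-up values; `row_differences` is a hypothetical helper name.

```python
def row_differences(pixels, width, height):
    """Compare each pixel with its right-hand neighbour, row by row.

    pixels is a flat, row-major list; each row yields width - 1 booleans.
    """
    rows = []
    for row in range(height):
        start = row * width
        rows.append([pixels[start + col] > pixels[start + col + 1]
                     for col in range(width - 1)])
    return rows

# A made-up 4x3 grayscale image, flattened row by row.
pixels = [10, 20, 15, 15,
           5,  4,  3,  9,
           7,  7,  8,  6]

for bits in row_differences(pixels, 4, 3):
    print(bits)
# [False, True, False]
# [True, True, False]
# [False, False, True]
```

Because each row of width w yields w - 1 comparisons, a 9x8 image produces exactly 8 bits per row and 64 bits in total, which is why the width is one pixel larger than the height.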
4. Convert to binary
To make the hash easy to store and use, each group of 8 Boolean values is converted into a two-digit hexadecimal string. True becomes 1 and False becomes 0.
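The conversion can be sketched on its own as follows. The bit order (first Boolean is the least significant bit) follows the 2 ** (index % 8) expression in the article's implementation; `bits_to_hex` is a hypothetical helper name.

```python
def bits_to_hex(bits):
    """Pack a list of booleans into hex, 8 bits per two-digit byte.

    The first boolean of each group is treated as the least significant bit.
    """
    chars = []
    for i in range(0, len(bits), 8):
        value = sum(2 ** j for j, bit in enumerate(bits[i:i + 8]) if bit)
        chars.append('%02x' % value)
    return ''.join(chars)

row = [False, False, True, True, False, False, True, False]
print(bits_to_hex(row))  # 4c
```

Here the True values sit at positions 2, 3, and 6, so the byte is 4 + 8 + 64 = 76, which is 4c in hexadecimal.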
Python implementation
The following is the complete algorithm implemented in Python:
from PIL import Image

def dhash(image, hash_size=8):
    # Grayscale and shrink the image in one step.
    # Image.LANCZOS is the current name of the filter formerly called Image.ANTIALIAS.
    image = image.convert('L').resize(
        (hash_size + 1, hash_size),
        Image.LANCZOS,
    )
    # Compare adjacent pixels.
    difference = []
    for row in range(hash_size):
        for col in range(hash_size):
            pixel_left = image.getpixel((col, row))
            pixel_right = image.getpixel((col + 1, row))
            difference.append(pixel_left > pixel_right)
    # Convert the binary array to a hexadecimal string.
    decimal_value = 0
    hex_string = []
    for index, value in enumerate(difference):
        if value:
            decimal_value += 2 ** (index % 8)
        if (index % 8) == 7:
            hex_string.append(hex(decimal_value)[2:].rjust(2, '0'))
            decimal_value = 0
    return ''.join(hex_string)
In the most common case, where the images differ only slightly, the hash values are likely to be identical, so we can compare them directly.
>>> from PIL import Image
>>> from utility import dhash, hamming_distance
>>> orig = Image.open('data/cat_grumpy_orig.png')
>>> modif = Image.open('data/cat_grumpy_modif.png')
>>> dhash(orig)
'4c8e3366c275650f'
>>> dhash(modif)
'4c8e3366c275650f'
>>> dhash(orig) == dhash(modif)
True
If you have a SQL database that stores the hash values, you can simply check whether the hash "4c8e3366c275650f" exists:
SELECT pk, hash, file_path FROM image_hashes
WHERE hash = '4c8e3366c275650f';
For images that differ more substantially, the hash values may differ, so you need to count the number of positions (here, bits) at which the two hashes differ, that is, their Hamming distance.
Wikipedia has some sample Python code for calculating the Hamming distance between two strings. The distance can also be computed directly in a query on a MySQL database.
SELECT pk, hash, BIT_COUNT(
    CONV(hash, 16, 10) ^ CONV('4c8e3366c275650f', 16, 10)
) AS hamming_distance
FROM image_hashes
HAVING hamming_distance < 4
ORDER BY hamming_distance ASC;
The query XORs the hash being looked up with each hash in the database and counts the bits that differ. Because BIT_COUNT only operates on integers, every hexadecimal hash must first be converted to decimal.
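The same XOR-and-count idea can be written in Python. The article imports a hamming_distance function from a utility module but does not show its code, so the version below is only one possible implementation, assuming both hashes are hexadecimal strings.

```python
def hamming_distance(hash_a, hash_b):
    """Count the differing bits between two hexadecimal hash strings."""
    # XOR leaves a 1 bit wherever the two hashes disagree.
    return bin(int(hash_a, 16) ^ int(hash_b, 16)).count('1')

print(hamming_distance('4c8e3366c275650f', '4c8e3366c275650f'))  # 0
print(hamming_distance('4c8e3366c275650f', '4c8e3366c275650e'))  # 1
```

A small threshold on this distance, such as the value 4 used in the query above, then decides whether two images count as near-duplicates.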
Conclusion
This article implements the algorithm in Python, but readers can of course implement it in any programming language.
As mentioned in the introduction, the algorithm described here will be used at Iconfinder to prevent duplicate icon submissions. Perceptual hashing algorithms are likely to have many other practical applications. Because images with similar features also have similar hash values, the technique could help an image recommendation system find similar images.