How to Use Python to detect duplicate images through a hash algorithm
Iconfinder is an icon search engine that provides exquisite icons for designers, developers, and other creative workers. It hosts more than 0.34 million icons and is the world's largest paid icon library. You can also upload and sell original works in the Iconfinder transaction section. Every month, thousands of icons are uploaded to Iconfinder, along with a large number of pirated images. In this article, Iconfinder engineer Silviu Tantos proposes a novel and clever image re-checking technology to prevent piracy.
We will launch a function to check whether the upload icon is repeated in the next few weeks. For example, if a user downloads an icon and tries to make a profit by uploading it (a similar case has occurred), we can use our method to check whether the icon already exists, and mark this account as fraudulent. A common method to check whether a file exists in a large number of files is to compute the hash value of each file in the dataset and store the hash value in the array library. When you want to find a specific file, first calculate the hash value of the file, and then find the hash value in the database.
Select a hash algorithm
The encryption hashing algorithm is a common hash algorithm. Standard libraries such as MD5, SHA1, and SHA256 can be found in any language. They are very effective for simple use cases.
For example, in Python, first import the hashlib module, and then call a function to generate the hash value of a string or file.
1 2 3 4 5 6 7 8 9 10 |
>>> Import hashlib # Calculating the hash value of a string. >>> Hashlib. md5 ('the quick brown fox jumps over The lazy dog'). hexdigest () '9e0000d9d372bb6826bd81d3542a419d6' # Loading an image file into memory and calculating it's hash value. >>> Image_file = open ('data/cat_grumpy_orig.png '). read () >>> Hashlib. md5 (image_file). hexdigest () '3e1f6e9f2689d59b9ed28bcdab73455f' |
This algorithm is very effective for uploaded files that have not been tampered with. If the input data changes slightly, the encryption hashing algorithm will lead to an avalanche effect, as a result, the hash value of the new file is completely different from that of the original file.
For example, in the following example, it adds an end to the end of a sentence.
1 2 3 4 5 6 7 |
# Original text. >>> Hashlib. md5 ('the quick brown fox jumps over The lazy dog'). hexdigest () '9e0000d9d372bb6826bd81d3542a419d6' # Slight modification of the text. >>> Hashlib. md5 ('the quick brown fox jumps over The lazy dog. '). hexdigest () 'E4d909c290d0fb1ca068ffaddf22cbd0' |
If the image background color is changed, the image is cropped, rotated, or a certain pixel is modified, the image cannot be matched in the image hash library. It can be seen that the traditional hash algorithm is not practical. As you can see in the above example, the hash value 9 e0000d9d372bb6826bd81d3542a419d6 and e4d909c290d0fb1ca068ffaddf22cbd0 are almost different (except for several characters ).
For example, after you modify the color of the cat's nose in the image, the hash value of the image changes.
1 2 3 4 5 6 7 8 9 |
# Load the original image into memory and calculate it's hash value. >>> Image_file = open ('data/cat_grumpy_orig.png '). read () >>> Hashlib. md5 (image_file). hexdigest () '3e1f6e9f2689d59b9ed28bcdab73455f' # Load the modified image into memory and calculate it's hash value. >>> Image_file_modified = open ('data/cat_grumpy_modif.png '). read () >>> Hashlib. md5 (image_file_modified). hexdigest () '12d1b9409c3e8e0361c24beaee9c0ab1' |
There are already many perceptual hashing algorithms. This paper will propose a new dhash (differential hashing) algorithm, which calculates the Brightness Difference between adjacent pixels and determines the relative gradient. For the preceding use cases, it is very effective to perceive the hash algorithm. The sensor hashing algorithm obtains a multimedia file fingerprint that can flexibly distinguish tiny differences between different files from various features of the file content.
DHash
Before going deep into the dHash algorithm, I would like to introduce some basic knowledge. A color image is composed of RGB primary colors and can be regarded as a color set of red, green, and blue primary colors. For example, you can use the Python Image Library (PIL) to load an image and print the pixel value.
Test image
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 |
>>> From PIL import Image >>> Test_image = Image. open ('data/test_image.jpg ') # The image is an RGB image with a size of 8x8 pixels. >>> Print 'image Mode: % s' % test_image.mode Image Mode: RGB >>> Print 'width: % s px, Height: % s px '% (test_image.size [0], test_image.size [1]) Width: 4 px, Height: 4 px # Get the pixel values from the image and print them into rows based on # The image's width. >>> Width, height = test_image.size >>> Pixels = list (test_image.getdata ()) >>> For col in xrange (width ): ... Print pixels [col: col + width] ... [(255, 0, 0), (0,255, 0), (0, 0,255), (255,255,255)] [(0, 0, 0), (212, 45, 45), (51, 92,154), (130,183, 47)] [(206,210,198), (131, 78, 8), (131,156,180), (117,155,201)] [(104,133,170), (215,130, 20), (153,155,155), (104,142,191)] |
Now let's go back to the dHash algorithm, which has four steps. This article describes each step in detail and verifies its effect on the original image and the modified image. The intensity values of the first three pixels are 255, the other two are 0, the three primary colors of the pure black pixels are 0, and the three primary colors of the pure white pixels are 255. Other color pixels are composed of three primary color values of different intensity.
1. Grayscale Images
Reduce the pixel value to a luminous intensity value through a grayscale image. For example, a white pixel (255, 255, 255) is 255, and a black pixel (, 0) is 0.
2. scale down an image to a common size
Scale down an image to a common base size, for example, the size of 9*8 pixels in a pixel value in width or height (you can see the reason for this size in step 3 ). This method removes the high frequency and detail parts of the image to obtain a sample with 72 intensity values. Since adjusting or stretching an image does not change its hash value, all images are normalized to this size.
3. Compare neighboring pixels
After the previous two steps are completed, a list of intensity values is obtained to compare the adjacent pixels of each row of the binary value array.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 |
>>> From PIL import Image >>> Img = Image. open ('data/cat_grumpy_orig_after_step_2.png ') >>> Width, height = img. size >>> Pixels = list (img. getdata ()) >>> For col in xrange (width ): ... Print pixels [col: col + width] ... [254,254,255,253,248,254,255,254,255] [254,255,253,248,254,255,254,255,255] [253,248,254,255,254,255,255,255,222] [248,254,255,254,255,255,255,222,184] [254,255,254,255,255,255,222,184,177] [255,254,255,255,255,222,184,177,184] [254,255,255,255,222,184,177,184,225] [255,255,255,222,184,177,184,225,255] |
Compare the first value 254 with the second value 254, compare the second value to the third value, and so on, so that each row gets eight boolean values.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 |
>>> Difference = [] >>> For row in xrange (height ): ... For col in xrange (width ): ... If col! = Width: ... Difference. append (pixels [col + row]> pixels [(col + row) + 1]) ... >>> For col in xrange (width-1 ): ... Print difference [col: col + (width-1)] ... [False, False, True, True, False, False, True, False] [False, True, True, False, False, True, False, False] [True, True, False, False, True, False] [True, False, False, True, False, True] [False, False, True, False, True, True] [False, True, False, True, True, False] [True, False, True, True, False, False] [False, True, True, False, False, True] |
4. Convert to binary
To facilitate the storage and use of hash values, convert eight boolean values into hexadecimal strings. True to 1, and False to 0.
Python implementation
The complete Python implementation algorithm is as follows:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 |
Def dhash (image, hash_size = 8 ): # Grayscale and shrink the image in one step. Image = image. convert ('l'). resize ( (Hash_size + 1, hash_size ), Image. ANTIALIAS, ) Pixels = list (image. getdata ()) # Compare adjacent pixels. Difference = [] For row in xrange (hash_size ): For col in xrange (hash_size ): Pixel_left = image. getpixel (col, row )) Pixel_right = image. getpixel (col + 1, row )) Difference. append (pixel_left> pixel_right) # Convert the binary array to a hexadecimal string. Decimal_value = 0 Hex_string = [] For index, value in enumerate (difference ): If value: Decimal_value + = 2 ** (index % 8) If (index % 8) = 7: Hex_string.append (hex (decimal_value) [2:]. Must ust (2, '0 ')) Decimal_value = 0 Return ''. join (hex_string) |
In the most common case, the image is slightly different, And the hash value may be the same, so we can directly compare it.
1 2 3 4 5 6 7 8 9 10 11 12 |
>>> From PIL import Image >>> From utility import dhash, hamming_distance >>> Orig = Image. open ('data/cat_grumpy_orig.png ') >>> Modif = Image. open ('data/cat_grumpy_modif.png ') >>> Dhash (orig) '4c8e3366c275650f' >>> Dhash (modif) '4c8e3366c275650f' >>> Dhash (orig) = dhash (modif) True If you have |
The SQL database that stores the hash value. You can easily determine whether the hash value "4 c8e3366c275650f" exists:
1 2 |
SELECT pk, hash, file_path FROM image_hashes WHERE hash = '4c8e3366c275650f '; |
Now, for some images with large differences, their hash values may be different. Therefore, you need to calculate the minimum number of characters to replace from one string to another, that is, the Hamming distance.
Wikipedia has some Python sample code that calculates the Hamming distance between two strings. However, it can also be implemented directly based on the computation and query on the MySQL database.
1 2 3 4 5 6 |
SELECT pk, hash, BIT_COUNT ( CONV (hash, 16, 10) ^ CONV ('4c8e3366c275650f', 16, 10) ) As hamming_distance FROM image_hashes HAVING hamming_distance <4 Order by hamming_distance ASC; |
Compare the queried value with the hash value in the database, and count different digits. BIT_COUNT can only operate on integers. Therefore, convert all hexadecimal hash values to decimal.
Conclusion
This article uses Python to implement the introduced algorithm. Of course, you can use any programming language to implement the algorithm.
As mentioned in the introduction, the algorithm in this article will be applied to Iconfinder to prevent repeated submission of icons. As you can expect, there are more practical applications to perceive the hash algorithm. Because the hash values of images with similar features are similar, it can help the image recommendation system look for similar images.