A tutorial for using Python to detect duplicate images with a hash algorithm

Iconfinder is an icon search engine that provides designers, developers, and other creative professionals with high-quality icons. It currently hosts more than 340,000 icons and is the world's largest paid icon library. Users can also upload and sell original works in Iconfinder's marketplace. Thousands of icons are uploaded to Iconfinder every month, and a large number of them are pirated. In this article, Iconfinder engineer Silviu Tantos presents an ingenious image-checking technique for weeding out the pirated copies.

In the next few weeks we will launch a feature that detects whether an uploaded icon is a duplicate. For example, if a user downloads an icon and then tries to profit by uploading it again (this has happened), our method can detect that the icon already exists and flag the account for fraud. A common way to check whether a file already exists in a large collection is to calculate the hash value of every file in the dataset and store those hash values in a database. To look up a particular file, first calculate its hash value, then search the database for that value.
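
The lookup described above can be sketched in a few lines of Python. This is only an illustration: the `HashIndex` class, the file path, and the byte strings are hypothetical, and a dictionary stands in for the database.

```python
import hashlib


def file_digest(data: bytes) -> str:
    """Return the MD5 hex digest of a file's raw bytes."""
    return hashlib.md5(data).hexdigest()


class HashIndex:
    """Hypothetical in-memory stand-in for the hash database."""

    def __init__(self):
        self._paths_by_hash = {}

    def add(self, path: str, data: bytes) -> None:
        """Store a file's path under the digest of its content."""
        self._paths_by_hash.setdefault(file_digest(data), []).append(path)

    def find(self, data: bytes) -> list:
        """Return the paths of stored files with byte-identical content."""
        return self._paths_by_hash.get(file_digest(data), [])


index = HashIndex()
index.add('icons/cat.png', b'fake image bytes')
print(index.find(b'fake image bytes'))   # ['icons/cat.png']
print(index.find(b'fake image bytes!'))  # []
```

Only byte-identical content is found this way; changing a single byte produces a different digest and the lookup comes back empty.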

Select a hashing algorithm

Cryptographic hashing algorithms are the most commonly used kind. Algorithms such as MD5, SHA-1, and SHA-256 are available in the standard library of almost any language, and they are very effective for simple use cases.

For example, in Python you can import the hashlib module and call a function to generate the hash value of a string or a file.

>>> import hashlib
>>> # Calculating the hash value of a string (hashlib requires bytes in Python 3).
>>> hashlib.md5(b'The quick brown fox jumps over the lazy dog').hexdigest()
'9e107d9d372bb6826bd81d3542a419d6'
>>> # Loading an image file into memory and calculating its hash value.
>>> image_file = open('data/cat_grumpy_orig.png', 'rb').read()
>>> hashlib.md5(image_file).hexdigest()
'3e1f6e9f2689d59b9ed28bcdab73455f'

Cryptographic hashes work very well for files that have not been altered in any way. If the input changes even slightly, however, the avalanche effect of a cryptographic hash makes the hash of the new file completely different from the hash of the original.

For example, add a full stop to the end of the sentence:

>>> # Original text.
>>> hashlib.md5(b'The quick brown fox jumps over the lazy dog').hexdigest()
'9e107d9d372bb6826bd81d3542a419d6'
>>> # Slight modification of the text.
>>> hashlib.md5(b'The quick brown fox jumps over the lazy dog.').hexdigest()
'e4d909c290d0fb1ca068ffaddf22cbd0'
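
To put a number on the avalanche effect, one option is to count how many of the 128 bits differ between the two MD5 digests. The helper name `differing_bits` is made up for this sketch. For a well-mixed hash, roughly half of the bits would be expected to flip.

```python
import hashlib


def differing_bits(data_a: bytes, data_b: bytes) -> int:
    """Count the bits that differ between the MD5 digests of two inputs."""
    digest_a = int(hashlib.md5(data_a).hexdigest(), 16)
    digest_b = int(hashlib.md5(data_b).hexdigest(), 16)
    # XOR leaves a 1 wherever the two digests disagree.
    return bin(digest_a ^ digest_b).count('1')


original = b'The quick brown fox jumps over the lazy dog'
modified = original + b'.'
print(differing_bits(original, modified), 'of 128 bits differ')
```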

As the example above shows, the hash values 9e107d9d372bb6826bd81d3542a419d6 and e4d909c290d0fb1ca068ffaddf22cbd0 are almost entirely different (only a few characters coincide). In the same way, if an image's background color is changed, or the image is cropped, rotated, or altered by even a single pixel, it will no longer match any entry in the image hash database. Traditional hashing algorithms are therefore not practical for this task.

For example, when you modify the color of a cat's nose in an image, the hash value of the image changes.


>>> # Load the original image into memory and calculate its hash value.
>>> image_file = open('data/cat_grumpy_orig.png', 'rb').read()
>>> hashlib.md5(image_file).hexdigest()
'3e1f6e9f2689d59b9ed28bcdab73455f'
>>> # Load the modified image into memory and calculate its hash value.
>>> image_file_modified = open('data/cat_grumpy_modif.png', 'rb').read()
>>> hashlib.md5(image_file_modified).hexdigest()
'12d1b9409c3e8e0361c24beaee9c0ab1'

For use cases like the ones above, a perceptual hashing algorithm is far more effective. A perceptual hash derives a fingerprint of a multimedia file from features of its content, so that files differing only slightly produce similar fingerprints. There are many perceptual hashing algorithms; this article presents dHash (difference hash), which calculates the difference in brightness between neighboring pixels to determine the relative gradient.


dHash

Before delving into the dHash algorithm, let us cover some basics. A color image is made up of pixels, and each pixel can be regarded as a set of three primary-color values: red, green, and blue (RGB). For example, use the Python Imaging Library (PIL) to load an image and print its pixel values.

Test image

>>> from PIL import Image
>>> test_image = Image.open('data/test_image.jpg')
>>> # The image is an RGB image with a size of 4x4 pixels.
>>> print('Image Mode: %s' % test_image.mode)
Image Mode: RGB
>>> print('Width: %s px, Height: %s px' % (test_image.size[0], test_image.size[1]))
Width: 4 px, Height: 4 px
>>> # Get the pixel values from the image and print them row by row,
>>> # based on the image's width.
>>> width, height = test_image.size
>>> pixels = list(test_image.getdata())
>>> for row in range(height):
...     print(pixels[row * width:(row + 1) * width])
...
[(255, 0, 0), (0, 255, 0), (0, 0, 255), (255, 255, 255)]
[(0, 0, 0), (212, 45, 45), (51, 92, 154), (130, 183, 47)]
[(206, 210, 198), (131, 78, 8), (131, 156, 180), (117, 155, 201)]
[(104, 133, 170), (215, 130, 20), (153, 155, 155), (104, 142, 191)]

In the first row, the first three pixels are pure red, green, and blue: one channel has the full intensity of 255 and the other two are 0. A pure black pixel has 0 for all three primary colors, and a pure white pixel has 255 for all three. Every other color is a mix of the three primary colors at different intensities.

Now back to the dHash algorithm. It has four steps; this article explains each one in detail and checks its effect on both the original image and the modified image.


1. Convert the image to grayscale

Converting the image to grayscale reduces each pixel to a single luminous intensity value. For example, a white pixel (255, 255, 255) becomes 255 and a black pixel (0, 0, 0) becomes 0.
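
As a rough illustration of this step, the sketch below applies the ITU-R 601-2 luma weights that Pillow documents for converting RGB to mode 'L'. The function name `to_grayscale` is hypothetical, and because this version simply truncates, its result may differ by one from what Pillow returns for some colors.

```python
def to_grayscale(pixel):
    """Convert an (R, G, B) tuple to one intensity using ITU-R 601-2 luma weights."""
    red, green, blue = pixel
    # L = R * 299/1000 + G * 587/1000 + B * 114/1000, in integer arithmetic.
    return (red * 299 + green * 587 + blue * 114) // 1000


print(to_grayscale((255, 255, 255)))  # 255
print(to_grayscale((0, 0, 0)))        # 0
```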

2. Reduce the image to a common size

Shrink the image to a common base size, such as 9x8 pixels, where the width is one pixel larger than the height (step 3 shows why). This removes the high-frequency content and fine detail of the image and leaves a sample of 72 intensity values. Because every image is normalized to this size, resizing or stretching an image does not change its hash value.
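
To make the idea concrete, here is a minimal sketch that shrinks a grayscale image to 9x8 by averaging blocks of pixels. It is an assumption-laden stand-in, not the filter used by an imaging library: the `shrink` function is hypothetical, and it expects the image as a flat, row-major list of intensities.

```python
def shrink(pixels, width, height, new_width=9, new_height=8):
    """Shrink a flat, row-major list of grayscale values by box averaging."""
    result = []
    for new_row in range(new_height):
        # Source rows covered by this output row (at least one).
        top = new_row * height // new_height
        bottom = max((new_row + 1) * height // new_height, top + 1)
        for new_col in range(new_width):
            # Source columns covered by this output column (at least one).
            left = new_col * width // new_width
            right = max((new_col + 1) * width // new_width, left + 1)
            block = [
                pixels[row * width + col]
                for row in range(top, bottom)
                for col in range(left, right)
            ]
            result.append(sum(block) // len(block))
    return result


# A uniform 100x50 image stays uniform after shrinking.
small = shrink([200] * (100 * 50), width=100, height=50)
print(len(small))  # 72
```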

3. Compare neighboring pixels

The first two steps leave a list of intensity values. Now compare adjacent pixels within each row of that array.

>>> from PIL import Image
>>> img = Image.open('data/cat_grumpy_orig_after_step_2.png')
>>> width, height = img.size
>>> pixels = list(img.getdata())
>>> # Note: each slice starts at offset col, so successive lines are windows
>>> # of the flat pixel list shifted by one value, not separate image rows
>>> # (those would be pixels[row * width:(row + 1) * width]).
>>> for col in range(width):
...     print(pixels[col:col + width])
...
[254, 254, 255, 253, 248, 254, 255, 254, 255]
[254, 255, 253, 248, 254, 255, 254, 255, 255]
[253, 248, 254, 255, 254, 255, 255, 255, 222]
[248, 254, 255, 254, 255, 255, 255, 222, 184]
[254, 255, 254, 255, 255, 255, 222, 184, 177]
[255, 254, 255, 255, 255, 222, 184, 177, 184]
[254, 255, 255, 255, 222, 184, 177, 184, 225]
[255, 255, 255, 222, 184, 177, 184, 225, 255]

The first value (254) is compared with the second (254), the second with the third, and so on, so that each row yields 8 Boolean values.

>>> difference = []
>>> # Note: pixels[col + row] walks the flat list with the same shifted
>>> # windows as the listing above; a row-major lookup would use
>>> # pixels[row * width + col] and stop at col = width - 2.
>>> for row in range(height):
...     for col in range(width):
...         difference.append(pixels[col + row] > pixels[(col + row) + 1])
...
>>> for col in range(width - 1):
...     print(difference[col:col + (width - 1)])
...
[False, False, True, True, False, False, True, False]
[False, True, True, False, False, True, False, False]
[True, True, False, False, True, False, False, False]
[True, False, False, True, False, False, False, True]
[False, False, True, False, False, False, True, True]
[False, True, False, False, False, True, True, False]
[True, False, False, False, True, True, False, False]
[False, False, False, True, True, False, False, True]
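
A row-major version of this comparison, where the pixel at (row, col) sits at index row * width + col in the flat list, might look like the sketch below. The function name `row_differences` is hypothetical; the sample data is the first line of intensities shown above, treated as a 9x1 image.

```python
def row_differences(pixels, width, height):
    """Compare each pixel with its right-hand neighbour, row by row.

    `pixels` is a flat, row-major list of width * height intensities.
    Returns `height` rows of (width - 1) booleans.
    """
    rows = []
    for row in range(height):
        start = row * width
        rows.append([
            pixels[start + col] > pixels[start + col + 1]
            for col in range(width - 1)
        ])
    return rows


first_row = [254, 254, 255, 253, 248, 254, 255, 254, 255]
print(row_differences(first_row, width=9, height=1))
# [[False, False, True, True, False, False, True, False]]
```

With a 9-pixel-wide row there are exactly 8 neighbor pairs, which is why the image is shrunk to 9x8 rather than 8x8.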

4. Convert to binary

To make the hash easy to store and use, convert each group of 8 Boolean values into a two-character hexadecimal string. True becomes 1 and False becomes 0.
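
A small sketch of this conversion follows. The helper name `bits_to_hex` is made up, and the bit order (first Boolean as the least significant bit of each byte) is an assumption chosen to match the 2 ** (index % 8) weighting used in this article's implementation.

```python
def bits_to_hex(bits):
    """Pack booleans into bytes (first value = least significant bit) and hex-encode."""
    if len(bits) % 8 != 0:
        raise ValueError('expected a multiple of 8 bits')
    hex_bytes = []
    for start in range(0, len(bits), 8):
        group = bits[start:start + 8]
        value = sum(1 << offset for offset, bit in enumerate(group) if bit)
        hex_bytes.append(format(value, '02x'))
    return ''.join(hex_bytes)


row = [False, False, True, True, False, False, True, False]
print(bits_to_hex(row))  # 4c
```

Here the True values sit at positions 2, 3, and 6, so the byte is 4 + 8 + 64 = 76, or 4c in hexadecimal.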

Python implementation

Here is the complete Python implementation of the algorithm:

from PIL import Image


def dhash(image, hash_size=8):
    # Grayscale and shrink the image in one step.
    # Image.LANCZOS is the filter formerly named Image.ANTIALIAS,
    # which was removed in Pillow 10.
    image = image.convert('L').resize(
        (hash_size + 1, hash_size),
        Image.LANCZOS,
    )

    # Compare adjacent pixels.
    difference = []
    for row in range(hash_size):
        for col in range(hash_size):
            pixel_left = image.getpixel((col, row))
            pixel_right = image.getpixel((col + 1, row))
            difference.append(pixel_left > pixel_right)

    # Convert the binary array to a hexadecimal string.
    decimal_value = 0
    hex_string = []
    for index, value in enumerate(difference):
        if value:
            decimal_value += 2 ** (index % 8)
        if (index % 8) == 7:
            hex_string.append(hex(decimal_value)[2:].rjust(2, '0'))
            decimal_value = 0

    return ''.join(hex_string)

In the most common case, where the pictures differ only slightly, the hash values are likely to be identical, so they can be compared directly.

>>> from PIL import Image
>>> from utility import dhash, hamming_distance
>>> orig = Image.open('data/cat_grumpy_orig.png')
>>> modif = Image.open('data/cat_grumpy_modif.png')
>>> dhash(orig)
'4c8e3366c275650f'
>>> dhash(modif)
'4c8e3366c275650f'
>>> dhash(orig) == dhash(modif)
True

If the hash values are stored in an SQL database, you can check whether the hash value '4c8e3366c275650f' exists with a simple query:

SELECT pk, hash, file_path
FROM image_hashes
WHERE hash = '4c8e3366c275650f';

Images that differ more substantially may produce different hashes. In that case you need to calculate the Hamming distance between the hashes: the number of positions at which they differ, which is the minimum number of substitutions needed to turn one string into the other.
Wikipedia has sample Python code that calculates the Hamming distance between two strings. The distance can also be computed directly in a query on a MySQL database:
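
For reference, a bitwise Hamming distance between two hexadecimal hash strings could be sketched as follows. This is not the author's `utility.hamming_distance`, whose code is not shown in the article; it only reuses the name. It counts differing bits rather than differing characters, and the second hash in the usage example is made up.

```python
def hamming_distance(hash_a: str, hash_b: str) -> int:
    """Count differing bits between two equal-length hexadecimal hash strings."""
    if len(hash_a) != len(hash_b):
        raise ValueError('hashes must have the same length')
    # XOR leaves a 1 bit wherever the two hashes disagree.
    return bin(int(hash_a, 16) ^ int(hash_b, 16)).count('1')


print(hamming_distance('4c8e3366c275650f', '4c8e3366c275650f'))  # 0
print(hamming_distance('4c8e3366c275650f', '4c8e3366c275650e'))  # 1
```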

SELECT pk, hash, BIT_COUNT(
  CONV(hash, 16, 10) ^ CONV('4c8e3366c275650f', 16, 10)
) AS hamming_distance
FROM image_hashes
HAVING hamming_distance < 4
ORDER BY hamming_distance ASC;

The query XORs the hash being looked up with each hash in the database and counts the bits that differ. Because BIT_COUNT only operates on integers, the hexadecimal hashes are first converted to decimal.
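
The same search can be mirrored in plain Python when the hashes fit in memory. The sketch below is illustrative only: `find_similar` is a hypothetical helper, and apart from '4c8e3366c275650f' the stored hashes are invented for the example.

```python
def find_similar(target, stored_hashes, max_distance=4):
    """Return (distance, hash) pairs closer than max_distance, nearest first."""
    target_value = int(target, 16)
    matches = []
    for candidate in stored_hashes:
        # XOR, then count the 1 bits, as BIT_COUNT does in the SQL query.
        distance = bin(target_value ^ int(candidate, 16)).count('1')
        if distance < max_distance:
            matches.append((distance, candidate))
    return sorted(matches)


stored = ['4c8e3366c275650f', '4c8e3366c275650c', '0000000000000000']
print(find_similar('4c8e3366c275650f', stored))
# [(0, '4c8e3366c275650f'), (2, '4c8e3366c275650c')]
```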


Conclusion

This article implements the algorithm in Python, but readers can of course implement it in any programming language.

As mentioned in the introduction, the algorithm will be applied at Iconfinder to prevent duplicate icon submissions, and perceptual hashing can be expected to have many more practical applications. Because images with similar features have similar hashes, it can also help an image recommender system find similar images.
