Using Python to detect duplicate images with a hashing algorithm


Iconfinder is an icon search engine that provides designers, developers, and other creative professionals with high-quality icons. It currently hosts more than 340,000 icons and is the world's largest paid icon library. Users can also upload their own original work to the Iconfinder marketplace. Thousands of icons are uploaded to Iconfinder each month, and a large number of pirated images come with them. In this article, Iconfinder engineer Silviu Tantos presents a novel and clever image retrieval technique to combat piracy.

In the next few weeks we will launch a feature that detects whether an uploaded icon is a duplicate. For example, if a user downloads an icon and then tries to profit by uploading it again (this has happened), our method can detect that the icon already exists and flag the account for fraud. A common way to check whether a file already exists in a large collection is to compute the hash value of every file in the dataset and store those values in a database. To look up a particular file, you first compute its hash value and then search for that value in the database.
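A minimal sketch of this lookup, assuming an in-memory dictionary stands in for the database. The `HashIndex` class, its method names, and the sample bytes are hypothetical and are not part of Iconfinder's system.

```python
import hashlib


class HashIndex:
    """Toy stand-in for a hash database: maps a file's digest to its name."""

    def __init__(self):
        self._by_digest = {}

    @staticmethod
    def digest(data: bytes) -> str:
        # MD5 is used here only as a content fingerprint, not for security.
        return hashlib.md5(data).hexdigest()

    def add(self, name: str, data: bytes) -> None:
        self._by_digest[self.digest(data)] = name

    def find(self, data: bytes):
        """Return the stored name with identical content, or None."""
        return self._by_digest.get(self.digest(data))


index = HashIndex()
index.add("cat.png", b"\x89PNG fake image bytes")

print(index.find(b"\x89PNG fake image bytes"))   # identical content is found
print(index.find(b"\x89PNG fake image bytes!"))  # any change to the bytes misses
```

The lookup only succeeds for byte-for-byte identical content, which is the limitation the rest of the article addresses.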

Select a hashing algorithm

Cryptographic hash algorithms are the most commonly used hashing algorithms. Algorithms such as MD5, SHA-1, and SHA-256 are available in the standard library of almost any language, and they are very effective for simple use cases.

For example, in Python you import the hashlib module and call a function to generate the hash value of a string or a file.

>>> import hashlib

# Calculate the hash value of a string (hashlib expects bytes).
>>> hashlib.md5(b'The quick brown fox jumps over the lazy dog').hexdigest()
'9e107d9d372bb6826bd81d3542a419d6'

# Load an image file into memory in binary mode and calculate its hash value.
>>> image_file = open('data/cat_grumpy_orig.png', 'rb').read()
>>> hashlib.md5(image_file).hexdigest()
'3e1f6e9f2689d59b9ed28bcdab73455f'

This approach works very well for uploaded files that have not been modified. If the input data changes even slightly, however, a cryptographic hashing algorithm produces an avalanche effect, so the hash of the new file is completely different from the hash of the original file.

For example, add a period to the end of the sentence:

# Original text.
>>> hashlib.md5(b'The quick brown fox jumps over the lazy dog').hexdigest()
'9e107d9d372bb6826bd81d3542a419d6'

# Slight modification of the text.
>>> hashlib.md5(b'The quick brown fox jumps over the lazy dog.').hexdigest()
'e4d909c290d0fb1ca068ffaddf22cbd0'

As the example above shows, the hash values 9e107d9d372bb6826bd81d3542a419d6 and e4d909c290d0fb1ca068ffaddf22cbd0 are almost completely different (apart from a few characters). If the background color of an image is changed, or the image is cropped, rotated, or has a single pixel modified, it can no longer be matched in the image hash database. Traditional hashing algorithms are therefore not practical for this purpose.
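To put a number on the avalanche effect, the sketch below counts how many of the 128 bits of the two MD5 digests differ. The `differing_bits` helper is a hypothetical name introduced for illustration.

```python
import hashlib


def differing_bits(a: bytes, b: bytes) -> int:
    """Count the bits that differ between the MD5 digests of a and b."""
    digest_a = int(hashlib.md5(a).hexdigest(), 16)
    digest_b = int(hashlib.md5(b).hexdigest(), 16)
    # XOR leaves a 1 in every position where the two digests disagree.
    return bin(digest_a ^ digest_b).count("1")


original = b"The quick brown fox jumps over the lazy dog"
modified = original + b"."

# An MD5 digest has 128 bits; a one-character change is expected to flip
# roughly half of them.
print(differing_bits(original, modified), "of 128 bits differ")
```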

For example, changing the color of the cat's nose in an image changes the image's hash value.


# Load the original image into memory and calculate its hash value.
>>> image_file = open('data/cat_grumpy_orig.png', 'rb').read()
>>> hashlib.md5(image_file).hexdigest()
'3e1f6e9f2689d59b9ed28bcdab73455f'

# Load the modified image into memory and calculate its hash value.
>>> image_file_modified = open('data/cat_grumpy_modif.png', 'rb').read()
>>> hashlib.md5(image_file_modified).hexdigest()
'12d1b9409c3e8e0361c24beaee9c0ab1'

Many perceptual hashing algorithms exist. This article presents a new one, dHash (difference hash), which computes the brightness difference between adjacent pixels to determine the relative gradient. A perceptual hashing algorithm derives a fingerprint of a multimedia file from various features of its content, so that the fingerprint tolerates small differences between files. For the use case above, perceptual hashing is very effective.


dHash

Before going deeper into the dHash algorithm, here is some background. A color image is made up of the three primary colors red, green, and blue (RGB), so each pixel can be seen as a set of three intensity values. For example, use the Python Imaging Library (PIL) to load an image and print its pixel values.

Test image

>>> from PIL import Image
>>> test_image = Image.open('data/test_image.jpg')

# The image is an RGB image with a size of 4x4 pixels.
>>> print('Image Mode: %s' % test_image.mode)
Image Mode: RGB
>>> print('Width: %s px, Height: %s px' % (test_image.size[0], test_image.size[1]))
Width: 4 px, Height: 4 px

# Get the pixel values from the image and print them in rows based on
# the image's width.
>>> width, height = test_image.size
>>> pixels = list(test_image.getdata())
>>> for row in range(height):
...   print(pixels[row * width:(row + 1) * width])
...
[(255, 0, 0), (0, 255, 0), (0, 0, 255), (255, 255, 255)]
[(0, 0, 0), (212, 45, 45), (51, 92, 154), (130, 183, 47)]
[(206, 210, 198), (131, 78, 8), (131, 156, 180), (117, 155, 201)]
[(170, 215, 130), (153, 191), ((), ()] 

The first three pixels are pure red, green, and blue: the corresponding channel has an intensity of 255 and the other two channels are 0. A pure black pixel has all three channels at 0, and a pure white pixel has all three at 255. Every other color is a combination of the three primary colors at different intensities.

Now back to the dHash algorithm. It has four steps; this article describes each step in detail and checks its effect on both the original image and the modified image.


1. Convert the image to grayscale

Converting the image to grayscale reduces each pixel to a single luminous intensity value. For example, a white pixel (255, 255, 255) becomes 255 and a black pixel (0, 0, 0) becomes 0.
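As a rough illustration of this step, Pillow documents its `convert('L')` mode as the ITU-R 601-2 luma transform, L = R * 299/1000 + G * 587/1000 + B * 114/1000. The sketch below applies that formula to single pixels in plain Python. The `to_gray` helper is hypothetical, and its integer truncation is an assumption, so results may differ by one from what Pillow returns.

```python
def to_gray(rgb):
    """Approximate luma of an (R, G, B) pixel using the ITU-R 601-2 weights."""
    r, g, b = rgb
    # Integer arithmetic; truncation here is an assumption of this sketch.
    return (r * 299 + g * 587 + b * 114) // 1000


print(to_gray((255, 255, 255)))  # 255
print(to_gray((0, 0, 0)))        # 0
print(to_gray((255, 0, 0)))      # 76
```

Green carries the largest weight, so a pure green pixel comes out brighter than a pure red or pure blue one.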

2. Reduce the image to a common size

Reduce the image to a common base size, such as 9x8 pixels, where the width is one pixel larger than the height (step 3 shows why). This removes the high-frequency content and fine detail of the image, leaving a sample of 72 intensity values. Because every image is normalized to this size, resizing or stretching an image does not affect its hash value.
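To show how shrinking discards detail, here is a simplified sketch that downscales a grayscale image by averaging blocks of pixels. This is a stand-in technique, not the antialiasing filter that the article's implementation requests from PIL. The `shrink` helper, the synthetic 18x16 image, and the requirement that the dimensions divide evenly are all assumptions made for illustration.

```python
def shrink(pixels, width, height, new_width, new_height):
    """Downscale a flat, row-major grayscale list by averaging blocks.

    Assumes width and height are exact multiples of the target size.
    """
    block_w = width // new_width
    block_h = height // new_height
    result = []
    for row in range(new_height):
        for col in range(new_width):
            total = 0
            for y in range(row * block_h, (row + 1) * block_h):
                for x in range(col * block_w, (col + 1) * block_w):
                    total += pixels[y * width + x]
            result.append(total // (block_w * block_h))
    return result


# An 18x16 synthetic image: brightness ramps up from left to right.
width, height = 18, 16
image = [x * 10 for _ in range(height) for x in range(width)]

small = shrink(image, width, height, 9, 8)
print(len(small))  # 72
print(small[:9])   # [5, 25, 45, 65, 85, 105, 125, 145, 165]
```

Each output value summarizes a 2x2 block, so pixel-level detail is lost while the overall left-to-right gradient survives.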

3. Compare neighboring pixels

The first two steps produce a list of intensity values. Comparing the adjacent pixels in each row then yields a binary array.

>>> from PIL import Image
>>> img = Image.open('data/cat_grumpy_orig_after_step_2.png')
>>> width, height = img.size
>>> pixels = list(img.getdata())
>>> for col in range(width):
...   print(pixels[col:col+width])
...
[254, 254, 255, 253, 248, 254, 255, 254, 255]
[254, 255, 253, 248, 254, 255, 254, 255, 255]
[253, 248, 254, 255, 254, 255, 255, 255, 222]
[248, 254, 255, 254, 255, 255, 255, 222, 184]
[254, 255, 254, 255, 255, 255, 222, 184, 177]
[255, 254, 255, 255, 255, 222, 184, 177, 184]
[254, 255, 255, 255, 222, 184, 177, 184, 225]
[255, 255, 255, 222, 184, 177, 184, 225, 255]

The first value (254) is compared with the second (254), the second with the third, and so on, which gives 8 Boolean values per row.

>>> difference = []
>>> for row in range(height):
...   for col in range(width):
...     if col != width:
...       difference.append(pixels[col+row] > pixels[(col+row)+1])
...
>>> for col in range(width-1):
...   print(difference[col:col+(width-1)])
...
[False, False, True, True, False, False, True, False]
[False, True, True, False, False, True, False, False]
[True, True, False, False, True, False, False, False]
[True, False, False, True, False, False, False, True]
[False, False, True, False, False, False, True, True]
[False, True, False, False, False, True, True, False]
[True, False, False, False, True, True, False, False]
[False, False, False, True, True, False, False, True]
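The listings in this step index the flat pixel list with `col + row`, so each printed row appears to be the previous one shifted by a single pixel rather than a separate image row. A row-wise comparison can be sketched as follows, assuming a flat, row-major list of grayscale values. The `row_differences` helper and the sample values are made up for illustration.

```python
def row_differences(pixels, width, height):
    """Return one list of booleans per row.

    Each boolean says whether a pixel is brighter than its right neighbor,
    so a row of `width` pixels yields `width - 1` values.
    """
    rows = []
    for row in range(height):
        start = row * width  # index of the first pixel in this row
        rows.append([
            pixels[start + col] > pixels[start + col + 1]
            for col in range(width - 1)
        ])
    return rows


# Made-up 3x2 grayscale values, stored row by row.
pixels = [10, 20, 15,
          30, 5, 5]
for bits in row_differences(pixels, width=3, height=2):
    print(bits)
# [False, True]
# [True, False]
```

With a 9x8 image this produces 8 rows of 8 values, which is why the image is made one pixel wider than it is tall.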

4. Convert to a binary value

To make the hash easy to store and use, convert each group of 8 Boolean values into a hexadecimal string. True becomes 1 and False becomes 0.
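A short sketch of this conversion, assuming the same bit order as the complete implementation in this article, where the first value of each group of 8 is the least significant bit. The `bits_to_hex` helper name is hypothetical; the sample row is the first row of Boolean values printed in step 3.

```python
def bits_to_hex(bits):
    """Pack booleans into bytes (first bit = least significant) and hex-encode."""
    hex_parts = []
    for start in range(0, len(bits), 8):
        value = 0
        for offset, bit in enumerate(bits[start:start + 8]):
            if bit:
                value += 2 ** offset
        # Two hex digits per byte, zero-padded.
        hex_parts.append(format(value, "02x"))
    return "".join(hex_parts)


row = [False, False, True, True, False, False, True, False]
print(bits_to_hex(row))  # 4c
```

The set bits are at positions 2, 3, and 6, so the value is 4 + 8 + 64 = 76, which is 4c in hexadecimal.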

Python implementation

Here is the complete Python implementation of the algorithm:

from PIL import Image

def dhash(image, hash_size=8):
  # Grayscale and shrink the image in one step.
  # Image.LANCZOS is the current name of the filter formerly exposed as
  # Image.ANTIALIAS, which newer Pillow releases no longer provide.
  image = image.convert('L').resize(
    (hash_size + 1, hash_size),
    Image.LANCZOS,
  )

  pixels = list(image.getdata())

  # Compare adjacent pixels.
  difference = []
  for row in range(hash_size):
    for col in range(hash_size):
      pixel_left = image.getpixel((col, row))
      pixel_right = image.getpixel((col + 1, row))
      difference.append(pixel_left > pixel_right)

  # Convert the binary array to a hexadecimal string.
  decimal_value = 0
  hex_string = []
  for index, value in enumerate(difference):
    if value:
      decimal_value += 2**(index % 8)
    if (index % 8) == 7:
      hex_string.append(hex(decimal_value)[2:].rjust(2, '0'))
      decimal_value = 0

  return ''.join(hex_string)

In the most common case the images differ only slightly and their hash values are likely to be identical, so we can compare them directly.

>>> from PIL import Image
>>> from utility import dhash, hamming_distance
>>> orig = Image.open('data/cat_grumpy_orig.png')
>>> modif = Image.open('data/cat_grumpy_modif.png')
>>> dhash(orig)
'4c8e3366c275650f'
>>> dhash(modif)
'4c8e3366c275650f'
>>> dhash(orig) == dhash(modif)
True

If the hash values are stored in an SQL database, you can check whether the hash value "4c8e3366c275650f" exists with a simple query:

SELECT pk, hash, file_path FROM image_hashes
  WHERE hash = '4c8e3366c275650f';
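The same exact-match lookup can be tried out from Python with the standard `sqlite3` module. This is only a sketch: the table and column names follow the query above, while the column types, the index, the in-memory database, and the `find_by_hash` helper are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE image_hashes ("
    "pk INTEGER PRIMARY KEY, hash TEXT NOT NULL, file_path TEXT NOT NULL)"
)
# An index keeps exact-match lookups fast as the table grows.
conn.execute("CREATE INDEX idx_image_hashes_hash ON image_hashes (hash)")
conn.execute(
    "INSERT INTO image_hashes (hash, file_path) VALUES (?, ?)",
    ("4c8e3366c275650f", "data/cat_grumpy_orig.png"),
)


def find_by_hash(connection, image_hash):
    """Return (pk, hash, file_path) rows whose hash matches exactly."""
    cursor = connection.execute(
        "SELECT pk, hash, file_path FROM image_hashes WHERE hash = ?",
        (image_hash,),
    )
    return cursor.fetchall()


print(find_by_hash(conn, "4c8e3366c275650f"))
# [(1, '4c8e3366c275650f', 'data/cat_grumpy_orig.png')]
print(find_by_hash(conn, "0000000000000000"))
# []
```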

For images with larger differences, the hash values may differ. In that case you need to compute the Hamming distance: the minimum number of characters that must be substituted to change one string into the other.
Wikipedia has sample Python code that computes the Hamming distance between two strings, but the calculation and the query can also be done directly in a MySQL database.
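The article imports a `hamming_distance` function from a `utility` module without showing it. A minimal character-level version, written here as an assumption about what such a helper might look like, is sketched below. The second hash in the example is made up.

```python
def hamming_distance(s1, s2):
    """Number of positions at which two equal-length strings differ."""
    if len(s1) != len(s2):
        raise ValueError("Hamming distance requires equal-length strings")
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))


print(hamming_distance("4c8e3366c275650f", "4c8e3366c275650f"))  # 0
print(hamming_distance("4c8e3366c275650f", "4c8e3366c275651e"))  # 2
```

Note that this counts differing hexadecimal characters, whereas the MySQL query in this article counts differing bits, so the two measures can give different numbers for the same pair of hashes.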

SELECT pk, hash, BIT_COUNT(
  CONV(hash, 16, 10) ^ CONV('4c8e3366c275650f', 16, 10)
) AS hamming_distance
  FROM image_hashes
  HAVING hamming_distance < 4
  ORDER BY hamming_distance ASC;

The query XORs the queried value with each hash value in the database and counts the differing bits. Because BIT_COUNT operates only on integers, the hexadecimal hash values are first converted to decimal.
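The same XOR-and-count logic can be sketched in plain Python, assuming the stored hashes fit in memory as a list. The `bit_distance` and `find_similar` helpers are hypothetical names, and every stored hash except the first is made up for illustration.

```python
def bit_distance(hash_a, hash_b):
    """Count differing bits between two hexadecimal hash strings."""
    return bin(int(hash_a, 16) ^ int(hash_b, 16)).count("1")


def find_similar(stored_hashes, query, max_distance=4):
    """Return (distance, hash) pairs closer than max_distance, nearest first."""
    matches = [(bit_distance(h, query), h) for h in stored_hashes]
    return sorted(m for m in matches if m[0] < max_distance)


stored = [
    "4c8e3366c275650f",  # hash from the article
    "4c8e3366c275650e",  # made-up: differs in 1 bit
    "ffffffffffffffff",  # made-up: differs in many bits
]
print(find_similar(stored, "4c8e3366c275650f"))
# [(0, '4c8e3366c275650f'), (1, '4c8e3366c275650e')]
```

Like the SQL query, this scans every stored hash, so it is practical only for modest collections.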


Concluding remarks

This article implements the algorithm in Python, but readers can of course implement it in any programming language.

As mentioned in the introduction, this algorithm will be applied at Iconfinder to prevent duplicate icon submissions. Perceptual hashing can be expected to have further practical applications: because images with similar features have similar hash values, it can help an image recommendation system find similar images.
