Image fingerprinting for near-duplicate detection
When I searched Google for image recognition, I found a good article, which I have translated here. All copyright belongs to the original author.
Note: The author of the original article noticed, while building a website, that many users chose the same profile picture, which made profiles harder to tell apart. To prevent users from uploading a picture that was already in use, and to block improper image files, the author studied this image fingerprinting problem. (After finishing my translation, I found that someone had already published one at http://python.jobbole.com/81277/, which I think is not as good as mine.)
Think about it. Every person has his/her own fingerprint, which can be used to identify a person.
The algorithm is implemented in three phases:
1. Compute the fingerprints of a set of unsuitable images and store these fingerprints in the database.
2. When a user uploads a new profile picture, compare its fingerprint with the fingerprints in the database. If any fingerprint in the database matches the uploaded image, the user is prevented from using the image as a profile picture.
3. In the same way, identify pornographic images, collect their fingerprints in the database, and use those fingerprints to prevent users from uploading pornographic images.
Our program is not perfect, but it is effective. It is slow and does not completely solve the problem, but it achieves the goal: it reduced the number of improper files uploaded by users by more than 80%.
Then the biggest question is how to create an image fingerprint?
Please continue reading and find out the answer...
What do we need to do next?
We will use image fingerprints for near-duplicate detection. This technique is generally referred to as "perceptual image hashing" or simply "image hashing".
What is an image fingerprint or an image hash?
The process of Image Hashing is to detect the content of the image and then construct a value that uniquely identifies the image based on the content of the image.
Here we will use the "difference hash", or dHash, algorithm to compute our image fingerprints. In short, dHash looks at the differences between adjacent pixels and builds the hash value from them.
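To make the idea concrete, here is a minimal sketch of the difference-hash computation in plain Python 3. It assumes the image has already been converted to grayscale and shrunk to a grid 9 pixels wide and 8 pixels tall, so that each row yields 8 comparisons and the whole grid yields 64 bits. The function name dhash_from_grid and the sample grids are made up for illustration, and the ImageHash library used later may differ in details such as bit order.

```python
def dhash_from_grid(rows):
    """Return a hex fingerprint from a grid of grayscale values.

    `rows` is a list of rows; each row has one more pixel than the
    number of bits it contributes (9 pixels -> 8 bits).
    """
    bits = 0
    count = 0
    for row in rows:
        for left, right in zip(row, row[1:]):
            # One bit per adjacent pair: 1 if brightness increases to the right.
            bits = (bits << 1) | (1 if right > left else 0)
            count += 1
    # Pad the hex string so every fingerprint has the same length.
    width = (count + 3) // 4
    return format(bits, "0{}x".format(width))


# A hypothetical 8x9 gradient that gets brighter from left to right...
gradient = [[10 * x for x in range(9)] for _ in range(8)]
# ...and the same picture with every pixel made uniformly brighter.
brighter = [[value + 5 for value in row] for row in gradient]

print(dhash_from_grid(gradient))  # ffffffffffffffff
print(dhash_from_grid(brighter))  # ffffffffffffffff
```

Both grids produce the same fingerprint because only the relative brightness of neighbouring pixels matters, not their absolute values.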
Why not simply use an algorithm such as MD5 or SHA-1?
Unfortunately, we cannot use cryptographic hash algorithms here. By design, a very small change in the input file produces a completely different hash value. For image fingerprints we want the opposite: similar input images should produce similar hash values.
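A quick way to see this behaviour is to hash two byte strings that differ in a single bit. The sketch below uses Python 3's standard hashlib module; the byte string is an arbitrary stand-in for the contents of an image file, not real image data.

```python
import hashlib

original = b"pretend these bytes are an image file"
# Flip a single bit in the first byte to simulate a tiny change.
modified = bytes([original[0] ^ 0x01]) + original[1:]

digest_a = hashlib.md5(original).hexdigest()
digest_b = hashlib.md5(modified).hexdigest()

# Count how many of the 128 output bits differ between the two digests.
differing_bits = bin(int(digest_a, 16) ^ int(digest_b, 16)).count("1")

print(digest_a)
print(digest_b)
print("bits that differ:", differing_bits)
```

The two digests look unrelated even though the inputs are almost identical, which is exactly what makes MD5 unsuitable for finding similar images.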
Where can image fingerprints be used?
As in the example above, you can use image fingerprints to maintain a database of inappropriate images and warn users when they try to upload such images.
You can build a reverse image search engine similar to TinEye that tracks images and finds the webpages on which similar images appear.
You can even use image fingerprints to manage your personal photo collection. Imagine a hard disk full of personal photos with partial backups scattered across it: you need a way to prune the duplicates and keep a single copy of each image, and image fingerprints can help.
Put simply, you can use image fingerprints or image hashing wherever you need to detect duplicate copies of an image.
What libraries do we need?
To build an image fingerprinting solution, we will use the following three Python libraries:
- PIL/Pillow, to read and load images
- ImageHash, which includes an implementation of the dHash algorithm
- NumPy/SciPy, which ImageHash relies on to compute image hashes
You can install the required libraries with the following command (Python 2.7):
$ pip install pillow imagehash
Step 1: create an image fingerprint Dataset
We will not use the kind of pornographic images that we often encountered on the dating website. Instead, I have found a dataset that we can use.
For computer vision researchers, the CALTECH-101 dataset is legendary. It contains more than 7,500 images in 101 categories, including people, motorcycles, and airplanes.
I randomly selected 17 images from these 7,500.
Then, from those 17 images, I created N new images by randomly resizing them slightly. Our goal is to find these near-duplicate images, a bit like finding needles in a haystack.
Apart from their width and height, these images are identical. Because their sizes differ, we cannot simply compare MD5 checksums: as explained earlier, images with similar content can have completely different cryptographic hash values. Image hashes, by contrast, work here because similar images have similar fingerprints.
Now let's write the code that indexes the dataset, in a file named index.py:
    # coding=utf-8
    # import the necessary packages
    import argparse
    import glob
    import shelve

    import imagehash
    from PIL import Image

    # construct the argument parser and parse the arguments
    ap = argparse.ArgumentParser()
    ap.add_argument("-d", "--dataset", required=True,
        help="path to the input image dataset")
    ap.add_argument("-s", "--shelve", required=True,
        help="path to the output shelve database")
    args = vars(ap.parse_args())

    # open the shelve database
    db = shelve.open(args["shelve"], writeback=True)
First, we import the required packages. We use the Image class from PIL (or Pillow) to load images from disk, and the imagehash library to compute the perceptual hash.
In the code above, argparse parses the command-line arguments, shelve creates a simple dictionary-like Python database stored on disk, and glob makes it easy to collect the paths of the images.
Next, we parse the command-line arguments. The first, --dataset, is the path to the directory of input images. The second, --shelve, is the output path of the shelve database.
Finally, we open the shelve database for writing. db will store our image hash values. The code continues as follows:
    # loop over the image dataset
    for imagePath in glob.glob(args["dataset"] + "/*.jpg"):
        # load the image and compute its difference hash
        image = Image.open(imagePath)
        h = str(imagehash.dhash(image))

        # extract the file name from the path and update the database,
        # using the hash as the dictionary key and appending the file
        # name to the list of values
        filename = imagePath[imagePath.rfind("/") + 1:]
        db[h] = db.get(h, []) + [filename]

    # close the shelve database
    db.close()
This is the core of the work: loop over the image dataset, load each image from disk, and compute its fingerprint.
Now look at the two most important lines of code in this tutorial:
    filename = imagePath[imagePath.rfind("/") + 1:]
    db[h] = db.get(h, []) + [filename]
As I mentioned earlier, images with the same fingerprint are considered to be identical.
Therefore, to find near-duplicate images, we need to keep a list of the images that share each fingerprint value.
The two lines above do exactly that.
The first line extracts the image file name, and the second appends it to the list of images that have the same hash value.
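The grouping behaviour is easy to try out without any images. In this sketch a plain dict stands in for the shelve database, the paths and fingerprint strings are invented, and os.path.basename is swapped in for the rfind("/") slice as one possible alternative for extracting the file name.

```python
import os

# Hypothetical (path, fingerprint) pairs; in index.py the fingerprint
# comes from imagehash.dhash and the paths from glob.
hashed_images = [
    ("images/a.jpg", "0f0f3c3c0f0f3c3c"),
    ("images/a_small.jpg", "0f0f3c3c0f0f3c3c"),
    ("images/b.jpg", "ffee00112233aabb"),
]

db = {}
for image_path, h in hashed_images:
    filename = os.path.basename(image_path)
    # Start from the existing list (or an empty one) and append the file.
    db[h] = db.get(h, []) + [filename]

print(db)
```

The two files that share a fingerprint end up in the same list, while the third gets a list of its own.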
To extract fingerprints from the image dataset and build the hash database, run the following command:
$ python index.py --dataset images --shelve db.shelve
The script runs for a few seconds. Once it completes, you will have a file containing key-value pairs that map each image fingerprint to its file names.
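If you want to check what ended up in such a file, a shelve database can be reopened and iterated like a dictionary. The sketch below is written for Python 3 and works on a throwaway database in a temporary directory; the fingerprint and file names are invented. Depending on the platform, shelve may create one or more files with an extra suffix next to the name you pass in.

```python
import os
import shelve
import tempfile

# Write to a throwaway directory so the example leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    db_path = os.path.join(tmp_dir, "db.shelve")

    # Store one hypothetical fingerprint with two file names.
    with shelve.open(db_path) as db:
        db["0f0f3c3c0f0f3c3c"] = ["a.jpg", "a_small.jpg"]

    # Reopen read-only and list every fingerprint with its files.
    with shelve.open(db_path, flag="r") as db:
        entries = {key: list(value) for key, value in db.items()}

print(entries)
```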
This is the same algorithm I wrote a few years ago when I built a dating website. We collected unsuitable images, calculated their hash values, and stored them in the database. When a user submitted an image, I only needed to calculate its fingerprint and compare it with the fingerprints in the database to decide whether the upload was unsuitable content.
Next, I will show you how to perform a search to determine whether the image has a hash value similar to that in the database.
Step 2: search the dataset
Now that we have built a database of image fingerprints, it is time to search it.
Open a new file named search.py and write the following code:
    # coding=utf-8
    # import the necessary packages
    from PIL import Image
    import imagehash
    import argparse
    import shelve

    # construct the argument parser and parse the arguments
    ap = argparse.ArgumentParser()
    ap.add_argument("-d", "--dataset", required=True,
        help="path to the input image dataset")
    ap.add_argument("-s", "--shelve", required=True,
        help="path to the shelve database")
    ap.add_argument("-q", "--query", required=True,
        help="path to the query image")
    args = vars(ap.parse_args())
We import the required packages and parse the command-line arguments, as before. This time we need three switches: --dataset, the path to the original image dataset; --shelve, the database that stores the key-value pairs; and --query, the path to the query (uploaded) image. Our goal is to find the images in the dataset whose hash matches that of the query image.
Next, write the code for executing the search:
    # open the shelve database
    db = shelve.open(args["shelve"])

    # load the query image, compute its difference hash, and
    # fetch the images with the same hash from the database
    query = Image.open(args["query"])
    h = str(imagehash.dhash(query))
    filenames = db[h]
    print("Found %d images" % (len(filenames)))

    # loop over the matching images and display each one
    for filename in filenames:
        image = Image.open(args["dataset"] + "/" + filename)
        image.show()

    # close the shelve database
    db.close()
First, we open the database. Then we load the query image from disk, compute its fingerprint, and look up all images with the same fingerprint value.
If any images with the same hash value exist, we loop over them and display each one in turn.
With this code, we will determine whether the uploaded image already exists in the database.
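One detail to keep in mind: indexing the database directly with db[h] raises a KeyError when the fingerprint has never been stored. A possible variation, sketched here with a plain dict in place of the shelve database and a made-up helper name find_duplicates, is to use get with an empty default so that "no match" simply yields an empty list.

```python
def find_duplicates(db, fingerprint):
    """Return the file names stored under `fingerprint`, or [] if none."""
    # dict.get (also available on shelve objects) avoids a KeyError
    # when the fingerprint has never been seen before.
    return list(db.get(fingerprint, []))


# A plain dict stands in for the shelve database here.
db = {"0f0f3c3c0f0f3c3c": ["a.jpg", "a_small.jpg"]}

print(find_duplicates(db, "0f0f3c3c0f0f3c3c"))  # ['a.jpg', 'a_small.jpg']
print(find_duplicates(db, "0000000000000000"))  # []
```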
Result
As I mentioned earlier in the article, I randomly selected 17 images from the CALTECH-101 dataset and created N new images from them by making small changes to their size.
The sizes of these images differ by only a few pixels, but that is enough to rule out the MD5 hash algorithm (this is discussed further in the section on improving the algorithm). We need image hashes to find the near-duplicate images.
Open the terminal and run the following command:
$ python search.py --dataset images --shelve db.shelve --query images/84eba74d-38ae-4bf6-b8bd-79ffa1dad23a.jpg
If no error is reported, the following results are displayed:
On the left side of the image above is the input image. We will use this image to search the dataset to find all the images with the same fingerprint.
It is worth noting that two images in our dataset have the same fingerprint, as shown in the two images on the right. Although it is not obvious from the above, they are indeed images of different sizes with the same content.
Let's try to input another image:
$ python search.py --dataset images --shelve db.shelve --query images/9d355a22-3d59-487e-ad14-138a4e3880bc.jpg
The result is as follows:
Perfect!
Improving the algorithm
There are many ways to improve our algorithm, but the most important is to take into account hashes that are similar but not exactly the same.
For example, the images in this article were resized by only a few percentage points (up or down). If an image is resized by a larger factor, or if its aspect ratio changes, the hash values will no longer be identical.
However, the images will still be similar.
To find images that are similar but not identical, we need to use the Hamming distance. The Hamming distance counts the number of bits that differ between two hash values. Two images whose hashes differ by 1 bit are therefore much more similar than two whose hashes differ by 10 bits.
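For hashes stored as hexadecimal strings, as in our database, the Hamming distance can be computed by XOR-ing the two values and counting the 1 bits. This is a minimal sketch; the function name hamming_distance and the sample hashes are invented for illustration. As far as I know, the ImageHash library also lets you subtract two hash objects to get the same count.

```python
def hamming_distance(hash_a, hash_b):
    """Count the bits that differ between two equal-length hex hashes."""
    if len(hash_a) != len(hash_b):
        raise ValueError("hashes must have the same length")
    # XOR leaves a 1 bit wherever the two hashes disagree.
    return bin(int(hash_a, 16) ^ int(hash_b, 16)).count("1")


print(hamming_distance("ffffffffffffffff", "ffffffffffffffff"))  # 0
print(hamming_distance("ffffffffffffffff", "fffffffffffffffe"))  # 1
print(hamming_distance("ffffffffffffffff", "fffffffffffffc00"))  # 10
```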
But we encountered the second problem: algorithm scalability.
Imagine that there is an input image and you need to find all the similar images in the database. Then we can calculate the Hamming distance between the input image and each database image.
As the dataset grows, comparing against every hash takes more and more time. Eventually the hash database reaches a size at which a linear comparison is no longer practical.
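The linear comparison described above can be sketched in a few lines. A plain dict with invented hashes and file names stands in for the shelve database, and max_distance is a hypothetical threshold that you would need to tune for your own images.

```python
def hamming_distance(hash_a, hash_b):
    """Count the bits that differ between two hex hashes."""
    return bin(int(hash_a, 16) ^ int(hash_b, 16)).count("1")


def linear_search(db, query_hash, max_distance):
    """Return (distance, filename) pairs within `max_distance` bits."""
    matches = []
    # Every stored hash is compared with the query: O(n) per search.
    for stored_hash, filenames in db.items():
        distance = hamming_distance(query_hash, stored_hash)
        if distance <= max_distance:
            matches.extend((distance, name) for name in filenames)
    return sorted(matches)


db = {
    "ffffffffffffffff": ["a.jpg", "a_small.jpg"],
    "fffffffffffffffe": ["a_cropped.jpg"],
    "0000000000000000": ["b.jpg"],
}
print(linear_search(db, "ffffffffffffffff", max_distance=4))
# [(0, 'a.jpg'), (0, 'a_small.jpg'), (1, 'a_cropped.jpg')]
```

Each query touches every entry in the database, which is why this approach stops scaling once the database becomes large.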
One solution is to use a data structure such as a k-d tree or a VP-tree, which reduces the complexity of the search from linear to sublinear.
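The original article does not include an implementation, so what follows is only a rough sketch of how a VP-tree (vantage-point tree) over the Hamming distance could look. Hashes are stored as integers, all class and function names are my own, and the data is random. The final lines compare the tree search with a brute-force scan to check that both return the same matches.

```python
import random


def hamming(a, b):
    """Hamming distance between two hashes stored as integers."""
    return bin(a ^ b).count("1")


class Node:
    def __init__(self, point, radius, inside, outside):
        self.point = point      # the vantage point
        self.radius = radius    # roughly the median distance to the rest
        self.inside = inside    # subtree with distance <= radius
        self.outside = outside  # subtree with distance > radius


def build(points, rng):
    """Recursively build a VP-tree from a list of integer hashes."""
    if not points:
        return None
    points = list(points)
    vantage = points.pop(rng.randrange(len(points)))
    if not points:
        return Node(vantage, 0, None, None)
    distances = [hamming(vantage, p) for p in points]
    radius = sorted(distances)[len(distances) // 2]
    inside = [p for p, d in zip(points, distances) if d <= radius]
    outside = [p for p, d in zip(points, distances) if d > radius]
    return Node(vantage, radius, build(inside, rng), build(outside, rng))


def search(node, query, max_distance, found):
    """Collect every stored hash within `max_distance` bits of `query`."""
    if node is None:
        return found
    d = hamming(node.point, query)
    if d <= max_distance:
        found.append(node.point)
    # The triangle inequality tells us which subtrees can hold matches.
    if d - max_distance <= node.radius:
        search(node.inside, query, max_distance, found)
    if d + max_distance > node.radius:
        search(node.outside, query, max_distance, found)
    return found


rng = random.Random(0)
# Hypothetical database: 200 random hashes plus two near-copies of one.
hashes = [rng.getrandbits(64) for _ in range(200)]
query = hashes[0]
hashes += [query ^ 0b1, query ^ 0b110]  # 1 and 2 bits away from the query

tree = build(hashes, rng)
tree_matches = sorted(search(tree, query, 3, []))
brute_matches = sorted(h for h in hashes if hamming(h, query) <= 3)
print(tree_matches == brute_matches, len(tree_matches))
```

How much work the tree actually saves depends on the data and on the search radius, so treat this as a starting point for experiments and not as a tuned solution.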
Summary
In this article, we learned how to construct and use image hashes to detect near-duplicate images. These image hashes are built from the visual content of an image.
Just as a fingerprint can recognize a person, the hash value of an image can also uniquely identify the image.
Using our knowledge of image fingerprints, we can build a system that searches for similar images using nothing more than an image hash algorithm.
Translator's remarks
OK, this is what I translated last night and this morning. Translating this blog post made me realize the limits of my vocabulary, for example around unfamiliar terms such as database loading and argument parsing. Still, in the middle of the lively May Day holiday, I was able to sit quietly, translate the article, analyze the algorithm, and feel the enthusiasm and ideas of a friend abroad between the lines. I would like to thank the author again. On another note, the campus network has been burning through my data allowance today, and I don't know why; it was fine when I used it yesterday. Does getting past the firewall to reach Google consume that much traffic? Or is the browser plug-in eating it up? The least likely explanation is that my computer has been hacked. Oh, my god!