About five years ago, I was doing development work for a dating site. It was an early-stage startup, but it was beginning to build a stable user base. Unlike other dating sites, the company marketed itself on a squeaky-clean image. It was not a site for fooling around; it was a place to find a loyal partner.
Backed by millions in venture capital (this was before the US recession), they ran irresistible online ads about true love and finding soul mates. Forbes interviewed them. They were also featured on national television. These early successes contributed to the exponential growth every young startup covets: the number of users was doubling every month. Everything seemed to be going their way.
But they had a serious problem: pornography.
Some users of the dating site would upload pornographic images and set them as their profile pictures. This ruined the experience for many other users and caused many of them to cancel their memberships.
For some dating sites, a few pornographic images here and there might not be a problem. They might even be accepted as a byproduct of online dating and then ignored.
For this company, however, such behavior could be neither accepted nor overlooked.
Remember, this startup positioned itself as an upscale dating haven, free from the filth and rubbish that plague other dating sites. In short, they had a very real reputation, backed by venture capital, and that was exactly the reputation they needed to maintain.
The dating site was desperate to stop the outbreak of pornographic images quickly. They hired a team of image moderators who did nothing but stare at a moderation page for more than 8 hours a day and remove any pornographic images uploaded to the social network.
It is no exaggeration to say that they spent tens of thousands of dollars (not to mention countless man-hours) on the problem, but this only alleviated the situation. It treated the symptoms rather than the source.
The outbreak reached a critical level in July 2009. For the first time in 8 months, the number of users failed to double (and even started to decrease). Worse, investors said they would pull out if the company failed to solve the problem. The polluted tide had begun to batter the ivory tower, and it was only a matter of time before the tower toppled into the sea.
Just as the dating giant was about to go under, I proposed a more robust long-term solution: what if we used image fingerprints to fight the outbreak of pornographic images?
You see, every picture has a fingerprint. Just as people's fingerprints can identify people, images can be identified by fingerprints.
This idea led to a three-stage algorithm:
1. Create a fingerprint for an indecent image, and then store the image fingerprint in a database.
2. When a user uploads a new avatar, we compare it to the image fingerprint in the database. If the fingerprint of the uploaded image matches any of the image fingerprints in the database, we prevent the user from setting the image as a profile picture.
3. When the image moderators flag new pornographic images, these images are also fingerprinted and stored in our database, creating an ever-evolving database that can be used to block inappropriate uploads.
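The three stages above can be sketched in a few lines. This is only an illustration, not the code used at the dating site: the class name FingerprintBlocklist and the toy_fingerprint function are hypothetical, and the toy fingerprint is not a perceptual hash. It is only there so the sketch runs without an image library.

```python
class FingerprintBlocklist:
    """Hypothetical helper that stores fingerprints of flagged images."""

    def __init__(self, fingerprint):
        # `fingerprint` is any function mapping image data to a hashable
        # value; in this article it would be a perceptual hash such as dHash.
        self.fingerprint = fingerprint
        self.blocked = set()

    def flag(self, image):
        # Stages 1 and 3: fingerprint a flagged image and store the result.
        self.blocked.add(self.fingerprint(image))

    def is_allowed(self, image):
        # Stage 2: reject an upload whose fingerprint is already stored.
        return self.fingerprint(image) not in self.blocked


def toy_fingerprint(pixels):
    # Stand-in only (an assumption for this sketch), NOT a perceptual hash.
    return tuple(pixels)


blocklist = FingerprintBlocklist(toy_fingerprint)
blocklist.flag([10, 200, 30])                # a moderator flags an image
print(blocklist.is_allowed([10, 200, 30]))   # False: same fingerprint
print(blocklist.is_allowed([5, 5, 5]))       # True: never flagged
```

Passing the fingerprint function in as a parameter keeps the blocking logic independent of whichever hashing algorithm is chosen later.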
Our approach, though imperfect, was effective. Slowly, the flood of pornographic images subsided. It never disappeared entirely, but the algorithm allowed us to reduce the number of inappropriate uploads by more than 80%.
This also won back the investors. They continued to fund us, until the recession hit and we all lost our jobs.
Looking back, I can't help but laugh. My job didn't last. The company didn't last. Even a few of the investors are gone.
But one thing did survive: the algorithm for extracting image fingerprints. A few years later, I want to share the basics of the algorithm in the hope that you can apply it to your own projects.
But the big question is, how can we build a picture fingerprint?
Read on to find out.
What we are going to do
We are going to use image fingerprints to detect near-identical images. This technique is often called "perceptual image hashing" or simply "image hashing".
What is an image fingerprint/image hash?
Image hashing is the process of examining the contents of an image and then constructing a value that identifies the image based on those contents.
For example, look at the picture at the top of this article. Given an image as input, a hash function is applied and an image hash is computed based on the visual appearance of the image. Similar images should have similar hash values. Applying an image hash algorithm makes detecting near-identical images quite simple.
In particular, we will use the "difference hash", or simply dHash, algorithm to compute our image fingerprints. In simple terms, the dHash algorithm looks at the difference between adjacent pixels. Then, based on these differences, a hash value is built.
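To make the idea concrete, here is a minimal sketch of the adjacent-pixel comparison. It assumes the image has already been converted to grayscale and shrunk to a small grid of brightness values (a grid with 9 columns and 8 rows would yield 64 bits). The function names dhash_bits and bits_to_hex are hypothetical, and the output is not guaranteed to match the imagehash library bit for bit.

```python
def dhash_bits(gray_rows):
    """Return one bit per pair of horizontally adjacent pixels.

    A bit is 1 when the right pixel is brighter than its left neighbor.
    """
    bits = []
    for row in gray_rows:
        for left, right in zip(row, row[1:]):
            bits.append(1 if right > left else 0)
    return bits


def bits_to_hex(bits):
    """Pack the bits into a hexadecimal string, most significant bit first."""
    value = 0
    for bit in bits:
        value = (value << 1) | bit
    width = (len(bits) + 3) // 4
    return format(value, "0{}x".format(width))


# A tiny grid with 2 rows and 5 columns gives 2 * 4 = 8 bits.
grid = [
    [10, 20, 20, 5, 50],
    [90, 80, 85, 85, 10],
]
bits = dhash_bits(grid)
print(bits)               # [1, 0, 0, 1, 0, 1, 0, 0]
print(bits_to_hex(bits))  # 94
```

Because only the direction of each brightness change is recorded, doubling every value in the grid produces the same bits, which hints at why such a hash tolerates small changes to an image.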
Why not use algorithms like MD5 or SHA-1?
Unfortunately, we cannot use cryptographic hash algorithms in our implementation. By design, a cryptographic hash algorithm turns even a very small difference in the input file into a substantially different hash value. For image fingerprints, we want the opposite: similar inputs should produce similar hash values.
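A quick way to see this behavior is to hash two byte strings that differ in a single byte. The byte strings below are made-up stand-ins for image files, chosen only for illustration.

```python
import hashlib

# Two inputs that differ only in their final byte.
original = b"the same image bytes, byte for byte"
tweaked = b"the same image bytes, byte for bytf"

digest_a = hashlib.md5(original).hexdigest()
digest_b = hashlib.md5(tweaked).hexdigest()

# Count the hex digits that differ between the two digests.
differing = sum(1 for x, y in zip(digest_a, digest_b) if x != y)

print(digest_a)
print(digest_b)
print("hex digits that differ:", differing, "of", len(digest_a))
```

The two digests share no useful resemblance, so an MD5 comparison can only tell whether two files are identical byte for byte, not whether they look alike.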
Where can I use a picture fingerprint?
As shown above, you can use image fingerprints to maintain a database of indecent images and raise a warning when users try to upload similar images.
You can build a reverse image search engine, such as TinEye, which keeps track of images and the web pages they appear on.
You can also use image fingerprints to help manage your personal photo collection. Say you have a hard drive with several partial backups of your photo library, and you need a way to prune those backups so that only a single copy of each image remains. Image fingerprints can help you do that.
Simply put, you can use image fingerprints/hashes in almost any setting that requires detecting near-duplicate copies of an image.
What are the required libraries?
In order to create a picture fingerprinting scheme, we intend to use three main Python packages:
- PIL/Pillow, for reading and loading images
- imagehash, which includes an implementation of dHash
- NumPy/SciPy, which imagehash depends on
You can install the required libraries with a single command:
$ pip install pillow imagehash
Step One: Create fingerprints for a picture set
The first step is to create a fingerprint of our image collection.
As you might expect, we will not be using the pornographic images I worked with at that dating site. Instead, I created an artificial data set that we can use.
The CALTECH-101 data set is legendary among computer vision researchers. It contains at least 7,500 images from 101 different categories, including people, motorcycles, and airplanes.
From these 7,500 pictures, I randomly selected 17.
Then, from each of these 17 randomly selected images, I created N new images by randomly resizing them up or down by a few percentage points. Our goal here is to find these near-duplicate images, a bit like finding a needle in a haystack.
Do you want to create a similar data set to work with? Download the CALTECH-101 data set, pull out about 17 images, and run the gather.py script in the repo.
Back to the point: these images are identical in every respect except width and height. Because they do not have the same dimensions, we cannot rely on simple MD5 checksums. More importantly, images with similar content may have completely different MD5 hashes. With image hashing, however, images with similar content also have similar hash fingerprints.
So let's start writing code to fingerprint the data set. Create a new file, name it index.py, and get to work:
# import the necessary packages
from PIL import Image
import imagehash
import argparse
import shelve
import glob

# construct the argument parser and parse the arguments
ap = argparse.ArgumentParser()
ap.add_argument("-d", "--dataset", required=True,
    help="path to input dataset of images")
ap.add_argument("-s", "--shelve", required=True,
    help="output shelve database")
args = vars(ap.parse_args())

# open the shelve database
db = shelve.open(args["shelve"], writeback=True)
The first thing to do is import the packages we need. We use the Image class from PIL (or Pillow) to load images from disk. The imagehash library is used to compute the image hashes.
The argparse library parses command-line arguments, and the shelve library serves as a simple key-value database (a Python dictionary) stored on disk. The glob library makes it easy to collect the image paths.
Then we parse the command-line arguments. The first, --dataset, is the path to the input directory of images. The second, --shelve, is the output path of the shelve database.
Next, we open the shelve database for writing. This db stores the image hashes. The rest of the script follows:
# loop over the image dataset
for imagePath in glob.glob(args["dataset"] + "/*.jpg"):
    # load the image and compute the difference hash
    image = Image.open(imagePath)
    h = str(imagehash.dhash(image))

    # extract the filename from the path and update the database
    # using the hash as the key and the filename appended to the
    # list of values
    filename = imagePath[imagePath.rfind("/") + 1:]
    db[h] = db.get(h, []) + [filename]

# close the shelf database
db.close()
This is where most of the work happens. We loop over the images on disk, compute each image's fingerprint, and store it in the database.
Now, take a look at the two most important lines of code in the entire example:
filename = imagePath[imagePath.rfind("/") + 1:]
db[h] = db.get(h, []) + [filename]
As mentioned earlier in this article, images with the same fingerprint are considered identical.
So, if our goal is to find near-duplicate images, we need to maintain a list of the images that share each fingerprint value.
And that's exactly what these lines of code do.
The first line extracts the file name of the image. The second line maintains the list of images that share the same fingerprint value.
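The grouping pattern in those two lines can be tried on its own, without any images. In the sketch below, the fingerprints and paths are made-up values standing in for real dHash output, the database is written to a temporary directory, and os.path.basename is used as an alternative way to extract the file name.

```python
import os
import shelve
import tempfile

# Hypothetical (fingerprint, path) pairs standing in for real dHash output.
pairs = [
    ("0f1e3c", "images/a.jpg"),
    ("0f1e3c", "images/a_resized.jpg"),
    ("ffee00", "images/b.jpg"),
]

db_path = os.path.join(tempfile.mkdtemp(), "demo.shelve")

db = shelve.open(db_path)
for h, path in pairs:
    filename = os.path.basename(path)
    # Assigning to the key writes the extended list back to disk.
    db[h] = db.get(h, []) + [filename]
db.close()

# Reopen the database and copy its contents into a regular dictionary.
db = shelve.open(db_path)
grouped = dict(db)
db.close()

print(grouped["0f1e3c"])  # ['a.jpg', 'a_resized.jpg']
print(grouped["ffee00"])  # ['b.jpg']
```

Building a new list and assigning it to the key matters here: a shelf opened without writeback only persists a value when it is assigned, so appending to the list returned by db.get() in place would not be saved.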
To extract image fingerprints from our data set and build the hash database, run the following command:
$ python index.py --dataset images --shelve db.shelve
The script runs for a few seconds. When it finishes, you will have a file named db.shelve containing the key-value pairs of image fingerprints and file names.
This basic algorithm is the one I used while working for that dating startup a few years ago. We took a set of indecent images, computed a fingerprint for each one, and stored the fingerprints in a database. When a new image came in, I simply computed its hash and checked the database to see whether the upload had already been flagged as inappropriate.
In the next step, I'll show you how to actually run a query to determine whether the database contains an image with the same hash value as a given image.
Step Two: Query the data set
Now that we've created a database of image fingerprints, it's time to search our datasets.
Open a new file, name it search.py, and start writing code:
# import the necessary packages
from PIL import Image
import imagehash
import argparse
import shelve

# construct the argument parser and parse the arguments
ap = argparse.ArgumentParser()
ap.add_argument("-d", "--dataset", required=True,
    help="path to dataset of images")
ap.add_argument("-s", "--shelve", required=True,
    help="output shelve database")
ap.add_argument("-q", "--query", required=True,
    help="path to the query image")
args = vars(ap.parse_args())
We import the relevant packages again and then parse the command-line arguments. Three options are required: --dataset, the path to the original image set; --shelve, the path to the database holding the key-value pairs; and --query, the path to the query/uploaded image. Our goal is to determine, for each query image, whether it already exists in the database.
Now, write the code to perform the actual query:
# open the shelve database
db = shelve.open(args["shelve"])

# load the query image, compute the difference image hash, and
# grab the images from the database that have the same hash
# value
query = Image.open(args["query"])
h = str(imagehash.dhash(query))
filenames = db[h]
print("Found %d images" % len(filenames))

# loop over the images
for filename in filenames:
    image = Image.open(args["dataset"] + "/" + filename)
    image.show()

# close the shelve database
db.close()
First we open the database, then load the query image from disk, compute its fingerprint, and find all images with the same fingerprint.
If any images have the same hash value, we loop over them and display them on screen.
This code lets us determine whether an image exists in the database using only its fingerprint value.
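One detail worth noting: indexing the database with a fingerprint that was never stored raises a KeyError. A small sketch of a more forgiving lookup follows. The lookup function is a hypothetical helper, and a plain dictionary stands in for the shelve database, which offers the same get() method.

```python
def lookup(db, fingerprint):
    """Return the file names stored under `fingerprint`, or [] if none."""
    return db.get(fingerprint, [])


# A plain dict stands in for the shelve database in this sketch.
db = {"0f1e3c": ["a.jpg", "a_resized.jpg"]}

print(lookup(db, "0f1e3c"))  # ['a.jpg', 'a_resized.jpg']
print(lookup(db, "abcdef"))  # []
```

With this variant, a query image that has no match simply reports zero images instead of stopping the script with an exception.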
Results
As I mentioned earlier in this article, I randomly selected 17 images from the more than 7,500 in the CALTECH-101 data set and then generated N new images by randomly resizing each one by a few percentage points.
These images differ in size by only a few pixels, but because of this we cannot rely on the MD5 hash of the file (this is discussed further in the "Optimization algorithm" section). However, we can use image hashes to find near-duplicate images.
Open your terminal and execute the following command:
$ python search.py --dataset images --shelve db.shelve --query images/84eba74d-38ae-4bf6-b8bd-79ffa1dad23a.jpg
If all goes well you can see the following results:
On the left is the input picture. Load this image, calculate its image fingerprint, search the database for fingerprints to see if there is a picture with the same fingerprint.
Sure enough, as shown on the right, our data set contains two other images with the same fingerprint. Although it is not obvious from the figure, these images have exactly the same visual content but are not exactly identical: the widths and heights of the three images differ.
Try another input image:
$ python search.py --dataset images --shelve db.shelve --query images/9d355a22-3d59-465e-ad14-138a4e3880bc.jpg
Here's the result:
On the left is again our input image. As shown on the right, our image fingerprinting algorithm finds three images with the same fingerprint.
One last example:
$ python search.py --dataset images --shelve db.shelve --query images/5134e0c2-34d3-40
This time the input image on the left is a motorcycle. We take the motorcycle image, compute its fingerprint, and look it up in the fingerprint database. As we see on the right, there are again three images in the database with the same fingerprint.
Optimization algorithm
There are many ways to improve this algorithm, but the most important one is to take into account hashes that are similar but not identical.
For example, the images in this article were only resized (scaled up or down) by a few percentage points. If an image is resized by a larger factor, or its aspect ratio is changed, the resulting hash will be different.
However, these pictures should still be similar.
To find images that are similar but not identical, we need to compute the Hamming distance. The Hamming distance counts the number of bits that differ between two hashes. As a result, two images whose hashes differ by only 1 bit are naturally more similar than two images whose hashes differ by 10 bits.
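As a rough sketch, the Hamming distance between two hashes stored as hexadecimal strings can be computed with an XOR followed by a bit count. The function name hamming_distance and the two-digit example hashes are illustrative only; real dHash values are longer.

```python
def hamming_distance(hex_a, hex_b):
    """Count the bits that differ between two equal-length hex hashes."""
    if len(hex_a) != len(hex_b):
        raise ValueError("hashes must have the same length")
    # XOR leaves a 1 wherever the two hashes disagree.
    return bin(int(hex_a, 16) ^ int(hex_b, 16)).count("1")


print(hamming_distance("94", "94"))  # 0: identical hashes
print(hamming_distance("94", "95"))  # 1: a single bit differs
print(hamming_distance("94", "6b"))  # 8: every bit differs
```

To my knowledge, the imagehash library offers the same count when one hash object is subtracted from another, which avoids the string conversion.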
However, we have a second problem: the scalability of the algorithm.
Consider this: we have an input image and are asked to find all the similar images in the database. Then we have to calculate the Hamming distance between the input image and each picture in the database.
As the database grows, so does the time needed to compare against it. Eventually, the hash database will reach a size where a linear comparison is no longer practical.
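The linear comparison described here might look like the following sketch. It assumes the database maps hexadecimal hash strings to lists of file names, as in index.py. The find_similar function, the sample entries, and the threshold of 2 bits are all made up for illustration.

```python
def find_similar(db, query_hex, max_distance):
    """Scan every stored hash and keep those within `max_distance` bits."""
    query = int(query_hex, 16)
    matches = []
    for stored_hex, filenames in db.items():
        # One Hamming distance computation per database entry.
        distance = bin(query ^ int(stored_hex, 16)).count("1")
        if distance <= max_distance:
            matches.append((distance, stored_hex, filenames))
    # Closest matches first, ties broken by the hash string.
    matches.sort(key=lambda match: (match[0], match[1]))
    return matches


# A plain dict with toy two-digit hashes stands in for the shelve database.
db = {"94": ["a.jpg"], "95": ["a_small.jpg"], "6b": ["b.jpg"]}

for distance, stored_hex, filenames in find_similar(db, "94", 2):
    print(distance, stored_hex, filenames)
# 0 94 ['a.jpg']
# 1 95 ['a_small.jpg']
```

Every query touches every entry, so the cost grows in direct proportion to the number of stored fingerprints.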
The workaround, though beyond the scope of this article, is to use k-d trees and VP trees to reduce the complexity of the search problem from linear to sub-linear.
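For the curious, here is a very small sketch of a VP tree (vantage-point tree) over integer hashes. It is a simplified illustration under my own assumptions, not a tuned implementation: the names VPNode, build, and search are hypothetical, the hashes are random 64-bit integers rather than real fingerprints, and the pruning only pays off when the search radius is small relative to the hash length.

```python
import random


def hamming(a, b):
    """Number of differing bits between two integer hashes."""
    return bin(a ^ b).count("1")


class VPNode:
    def __init__(self, point, radius, inside, outside):
        self.point = point      # the vantage point stored at this node
        self.radius = radius    # median distance from the vantage point
        self.inside = inside    # subtree of points with distance <= radius
        self.outside = outside  # subtree of points with distance > radius


def build(points):
    """Recursively build a VP tree from a list of integer hashes."""
    if not points:
        return None
    points = list(points)
    vantage = points.pop(random.randrange(len(points)))
    if not points:
        return VPNode(vantage, 0, None, None)
    distances = [hamming(vantage, p) for p in points]
    median = sorted(distances)[len(distances) // 2]
    inside = [p for p, d in zip(points, distances) if d <= median]
    outside = [p for p, d in zip(points, distances) if d > median]
    return VPNode(vantage, median, build(inside), build(outside))


def search(node, query, max_distance, found=None):
    """Collect every stored hash within `max_distance` bits of `query`."""
    if found is None:
        found = []
    if node is None:
        return found
    d = hamming(node.point, query)
    if d <= max_distance:
        found.append(node.point)
    # The triangle inequality lets us skip a subtree that cannot match.
    if d - max_distance <= node.radius:
        search(node.inside, query, max_distance, found)
    if d + max_distance > node.radius:
        search(node.outside, query, max_distance, found)
    return found


random.seed(0)
hashes = [random.getrandbits(64) for _ in range(500)]
tree = build(hashes)

query = hashes[0] ^ 0b101  # flip two bits of a stored hash
result = sorted(search(tree, query, 4))
brute = sorted(h for h in hashes if hamming(h, query) <= 4)
print(result == brute)  # True: same matches as the linear scan
```

The tree returns the same matches as a linear scan; the difference is that whole subtrees are skipped whenever the triangle inequality rules them out.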
Summary
In this article we have learned how to construct and use image hashes to detect similar images. These picture hashes are built using the visual content of the picture.
Just as a fingerprint can identify a person, a picture hash can also uniquely identify a picture.
Using the knowledge of image fingerprinting, we created a system that uses only image hashes to find and identify images with similar content.
We then showed how the image hash was applied to quickly find images with similar content.
Download the code from the repo directory.