A tutorial on using Python to implement a simple similar-image search


About five years ago, I was doing development work for a dating site. It was an early-stage startup, but it was beginning to build a stable user base. Unlike other dating sites, the company marketed itself on a wholesome, abstinence-oriented image. It wasn't a site for fooling around; it was a place to find a loyal partner.

With millions in venture capital invested (before the US recession), their online ads about true love and finding soul mates were everywhere. Forbes interviewed them. They were also featured on national television. Early success had brought the exponential growth every startup covets: the number of users was doubling each month. Everything seemed to be going their way.

But they had a serious problem: pornography.

Some users of the dating site would upload pornographic images and set them as their avatars. This behavior ruined the experience for many other users and caused many of them to cancel their memberships.

For some of today's dating sites, a few pornographic images might not be a problem. They might even be expected, a by-product of online dating that is accepted and then ignored.

For this company, however, such behavior could neither be accepted nor ignored.

Don't forget, this startup was positioning itself as a wholesome dating paradise, free of the filth and rubbish that plagued other dating sites. In short, they had a venture-backed reputation, and they needed to keep it.

The dating site was desperate to stop the flood of pornographic images quickly. They hired a team of photo moderators who did nothing but stare at a moderation page for more than 8 hours a day and remove any pornographic images uploaded to the social network.

It is no exaggeration to say that they invested tens of thousands of dollars (not to mention countless man-hours) in the problem, but that only mitigated and contained the situation rather than stopping it at the source.

The outbreak of pornographic images reached a critical level in July of 2009. For the first time in 8 months, the number of users failed to double (and even started to decline). To make matters worse, investors said they would withdraw their capital if the company failed to solve the problem. The filthy tide was already battering the ivory tower, and it was only a matter of time before it toppled into the sea.

With the dating giant on the verge of going under, I proposed a more robust long-term solution: what if we used picture fingerprints to fight the outbreak of pornographic images?

You see, every picture has a fingerprint. Just as a person's fingerprints can identify a person, a picture's fingerprint can recognize a picture.

This led to a three-phase algorithm:

1. Create a fingerprint for each indecent picture, and then store the picture fingerprint in a database.

2. When a user uploads a new avatar, compare it to the picture fingerprints in the database. If the fingerprint of the uploaded image matches any fingerprint in the database, prevent the user from setting the image as their avatar.

3. When moderators flag new pornographic images, fingerprint those images as well and store them in the database, creating an ever-evolving database that can be used to block prohibited uploads.
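
The three phases can be sketched in a few lines of Python. This is only an illustration of the workflow, not the code we ran at the startup: the `FingerprintBlocklist` class and the `toy_fingerprint` function are hypothetical names, the "database" is an in-memory set, and the toy fingerprint (brighter or darker than the average pixel) merely stands in for a real perceptual hash.

```python
class FingerprintBlocklist:
    """In-memory stand-in for the fingerprint database (hypothetical)."""

    def __init__(self, fingerprint_fn):
        self.fingerprint_fn = fingerprint_fn
        self.blocked = set()

    def flag(self, pixels):
        """Phases 1 and 3: fingerprint a flagged picture and store it."""
        self.blocked.add(self.fingerprint_fn(pixels))

    def allows(self, pixels):
        """Phase 2: reject an upload whose fingerprint is already stored."""
        return self.fingerprint_fn(pixels) not in self.blocked


def toy_fingerprint(pixels):
    """Toy fingerprint for illustration: is each pixel above the average?"""
    avg = sum(pixels) / len(pixels)
    return tuple(p > avg for p in pixels)


blocklist = FingerprintBlocklist(toy_fingerprint)

flagged = [10, 200, 30, 180]          # a made-up 4-pixel "picture"
blocklist.flag(flagged)

brighter_copy = [p + 5 for p in flagged]
print(blocklist.allows(brighter_copy))        # False: same fingerprint
print(blocklist.allows([200, 10, 180, 30]))   # True: different fingerprint
```

The point of the design is that only fingerprints are stored and compared, so a slightly altered copy of a flagged picture is still caught as long as its fingerprint stays the same.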

Our approach, though not perfect, was effective. Slowly, the outbreak of pornographic images subsided. It never went away entirely, but this algorithm allowed us to reduce the number of prohibited uploads by more than 80%.

This also won back the investors. They continued to fund us until the recession came, and then we were all out of work.

Looking back, I can't help laughing. My job didn't last very long. The company didn't last long either. Even a few of the investors were left out in the cold.

But one thing did survive: the algorithm for extracting a picture's fingerprint. A few years later, I want to share the basics of this algorithm in the hope that you can apply it to your own projects.

But the big question is, how do we build a picture fingerprint?

Read on.
What we are going to do

We are going to use picture fingerprints to detect similar images. This technique is often referred to as "perceptual image hashing" or simply "image hashing".
What is a picture fingerprint/picture hash?

Picture hashing is the process of examining the contents of a picture and then constructing a value that identifies the picture based on those contents.

For example, look at the picture at the top of this article. Given a picture as input, we apply a hash function and compute a hash of the picture based on its visual appearance. Similar images should also have similar hash values. Applying an image hash algorithm makes the detection of similar images quite simple.

In particular, we will use the "difference hash", or simply dHash, algorithm to compute the picture fingerprint. In simple terms, the dHash algorithm looks at the difference between adjacent pixels. Then, based on those differences, a hash value is built.
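
To make the idea concrete, here is a minimal pure-Python sketch of the difference-hash step. It assumes the picture has already been converted to grayscale and shrunk to a small grid (a real implementation typically resizes to 9x8 pixels to get 64 bits); the function names `dhash_bits` and `bits_to_hex` are mine, and the bit order is an assumption, so the output is not guaranteed to match what the imagehash library produces.

```python
def dhash_bits(gray_rows):
    """Return difference-hash bits for a grid of grayscale values.

    Each row contributes one bit per pair of horizontally adjacent
    pixels: 1 if the right pixel is brighter than the left, else 0.
    """
    bits = []
    for row in gray_rows:
        for left, right in zip(row, row[1:]):
            bits.append(1 if right > left else 0)
    return bits


def bits_to_hex(bits):
    """Pack the bits into a hexadecimal string, first bit most significant."""
    value = 0
    for bit in bits:
        value = (value << 1) | bit
    width = (len(bits) + 3) // 4
    return format(value, "0{}x".format(width))


# A tiny 2x5 "picture": each row yields 4 bits.
tiny = [
    [10, 20, 15, 15, 40],   # bits 1 0 0 1
    [90, 80, 85, 70, 75],   # bits 0 1 0 1
]
print(bits_to_hex(dhash_bits(tiny)))       # 95

# Brightening every pixel by the same amount leaves the hash unchanged,
# because only the differences between neighbours matter.
brighter = [[p + 30 for p in row] for row in tiny]
print(bits_to_hex(dhash_bits(brighter)))   # 95
```
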
Why not use algorithms such as MD5 or SHA-1?

Unfortunately, we cannot use cryptographic hash algorithms in our implementation. By design, a very small difference in the input file produces a completely different cryptographic hash value. For picture fingerprints, we actually want similar inputs to have similar hash outputs.
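
A quick way to see this for yourself is to hash two byte strings that differ in a single byte. The byte strings below are made up and merely stand in for the contents of two nearly identical image files.

```python
import hashlib

original = b"pretend these bytes are an image file"
tweaked = original[:-1] + b"F"   # change only the final byte

md5_a = hashlib.md5(original).hexdigest()
md5_b = hashlib.md5(tweaked).hexdigest()

# Count how many of the 32 hex digits differ between the two digests.
differing = sum(a != b for a, b in zip(md5_a, md5_b))

print(md5_a)
print(md5_b)
print("hex digits that differ:", differing, "of", len(md5_a))
```

The two digests share no useful structure, so there is no way to tell from them that the inputs were almost the same.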
Where can I use a picture fingerprint?

As mentioned above, you can use fingerprints to maintain a database of indecent images and raise a warning when a user tries to upload a similar image.

You can build a reverse image search engine, such as TinEye, that records pictures and the pages they appear on.

You can also use picture fingerprints to help manage your personal photo collection. Let's say you have a hard drive with several partial backups of your photo library, and you need a way to prune them so that only a single copy of each image is kept. Picture fingerprints can help you do that.

In short, you can use a picture fingerprint/hash in almost any situation that requires detecting near-duplicate copies of an image.
What are the libraries you need?

To create a picture fingerprint scheme, we will use three main Python packages:

    1. PIL/Pillow, for reading and loading pictures
    2. ImageHash, which includes an implementation of dHash
    3. NumPy/SciPy, which ImageHash depends on

You can install the required libraries with a single command:

$ pip install pillow imagehash

Step one: Create fingerprints for a picture set

The first step is to create fingerprints for our picture set.

You may be wondering, but no, we won't be using the pornographic images from my time at that dating site. Instead, I created an artificial dataset to work with.

For computer vision researchers, the CALTECH-101 dataset is legendary. It contains at least 7,500 images from 101 different categories, including people, motorcycles, and airplanes.

From these 7,500 pictures, I randomly selected 17.

Then, from each of the 17 randomly selected pictures, I created N new pictures by randomly scaling it up or down by a few percentage points. Our goal here is to find these near-duplicate copies, a bit like finding a needle in a haystack.

Do you want to create a similar dataset to work with? Download the CALTECH-101 dataset, extract about 17 pictures, and then run the gather.py script from the repo.

Back to the point: these pictures are identical except for their width and height. Because they don't have the same dimensions, we can't rely on simple MD5 checksums. More importantly, pictures with similar content can have completely different MD5 hashes. With a picture hash, however, pictures with similar content also have similar hash fingerprints.

So let's start writing code to create fingerprints for the dataset. Create a new file, name it index.py, and start working:


# import the necessary packages
from PIL import Image
import imagehash
import argparse
import shelve
import glob

# construct the argument parser and parse the arguments
ap = argparse.ArgumentParser()
ap.add_argument("-d", "--dataset", required=True,
    help="path to input dataset of images")
ap.add_argument("-s", "--shelve", required=True,
    help="output shelve database")
args = vars(ap.parse_args())

# open the shelve database
db = shelve.open(args["shelve"], writeback=True)

The first thing to do is import the packages we need. We will use the Image class from PIL/Pillow to load pictures from disk. The imagehash library is used to compute the perceptual hashes.

The argparse library is used to parse command-line arguments, and the shelve library acts as a simple key-value database (a Python dictionary) stored on disk. The glob library makes it easy to collect the picture paths.

The command-line arguments are then parsed. First, --dataset is the path to the input picture library. Second, --shelve is the output path of the shelve database.

Next, we open the shelve database for writing. This db database stores the picture hashes. The next part of the script fills it:

# loop over the image dataset
for imagePath in glob.glob(args["dataset"] + "/*.jpg"):
    # load the image and compute the difference hash
    image = Image.open(imagePath)
    h = str(imagehash.dhash(image))

    # extract the filename from the path and update the database
    # using the hash as the key and the filename appended to the
    # list of values
    filename = imagePath[imagePath.rfind("/") + 1:]
    db[h] = db.get(h, []) + [filename]

# close the shelve database
db.close()

   

The code above does most of the work. It loops over the pictures on disk, computes a fingerprint for each one, and stores it in the database.

Now, take a look at the two most important lines of code in the entire example:

filename = imagePath[imagePath.rfind("/") + 1:]
db[h] = db.get(h, []) + [filename]

As mentioned earlier in this article, images with the same fingerprint are considered to be the same.

So, if our goal is to find an approximate picture, we need to maintain a list of pictures with the same fingerprint value.

And that's exactly what these lines of code do.

The first line extracts the file name of the picture. The second line maintains a list of pictures with the same fingerprint value.
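
Here is a small sketch of that grouping pattern on its own, with a plain dictionary standing in for the shelve database. The hash strings and file paths are made up, and I use os.path.basename instead of rfind("/") because it handles the platform's path separator; that substitution is my choice, not part of the original script.

```python
import os

db = {}  # a plain dict stands in for the shelve database

# (hash, path) pairs invented for illustration
entries = [
    ("d8b2c4e6f0a1c3e5", "images/a.jpg"),
    ("0f1e2d3c4b5a6978", "images/b.jpg"),
    ("d8b2c4e6f0a1c3e5", "images/a_resized.jpg"),
]

for h, image_path in entries:
    filename = os.path.basename(image_path)
    # start from the existing list (or an empty one) and add this file
    db[h] = db.get(h, []) + [filename]

print(db["d8b2c4e6f0a1c3e5"])   # ['a.jpg', 'a_resized.jpg']
print(db["0f1e2d3c4b5a6978"])   # ['b.jpg']
```

Building a new list and assigning it back to db[h] matters with shelve: the assignment is what tells the database that the value changed.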

To extract picture fingerprints from our dataset and build the hash database, run the following command:

$ python index.py --dataset images --shelve db.shelve

The script will run for a few seconds. When it finishes, a file named db.shelve will appear containing the key-value pairs of picture fingerprints and filenames.
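
If you want to peek inside the database, the sketch below writes two made-up entries to a temporary shelve file and then reopens it read-only to list them. The temporary directory and the sample entries are assumptions for the sake of a self-contained example; with your own database you would only need the second `shelve.open` call. Depending on the dbm backend Python picks on your system, the file on disk may carry an extra extension.

```python
import os
import shelve
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "db.shelve")

    # write two made-up entries, in the same shape index.py produces
    with shelve.open(path) as db:
        db["d8b2c4e6f0a1c3e5"] = ["a.jpg", "a_resized.jpg"]
        db["0f1e2d3c4b5a6978"] = ["b.jpg"]

    # reopen read-only and copy out every fingerprint with its filenames
    with shelve.open(path, flag="r") as db:
        contents = {h: list(names) for h, names in db.items()}

for h in sorted(contents):
    print(h, contents[h])
```
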

This basic algorithm is the one I used when I worked for the dating startup a few years ago. We took an image set, built a picture fingerprint for each picture, and stored it in the database. When a new picture came in, I simply computed its hash and checked the database to see whether the uploaded image had already been flagged as prohibited content.

In the next step, I'll show you how to actually execute the query to determine if there is a picture in the database that has the same hash value as the given picture.
Step two: Query the dataset

Now that we've built a database of picture fingerprints, it's time to search our dataset.

Open a new file, name it search.py, and start writing code:

# import the necessary packages
from PIL import Image
import imagehash
import argparse
import shelve

# construct the argument parser and parse the arguments
ap = argparse.ArgumentParser()
ap.add_argument("-d", "--dataset", required=True,
    help="path to dataset of images")
ap.add_argument("-s", "--shelve", required=True,
    help="output shelve database")
ap.add_argument("-q", "--query", required=True,
    help="path to the query image")
args = vars(ap.parse_args())

We need to import the relevant packages again. The command-line arguments are then parsed. Three options are required: --dataset, the path to the original picture set; --shelve, the path to the database of key-value pairs; and --query, the path to the query/uploaded picture. Our goal is to determine, for each query picture, whether it already exists in the database.

Now write the code to execute the actual query:

# open the shelve database
db = shelve.open(args["shelve"])

# load the query image, compute the difference image hash, and
# grab the images from the database that have the same hash
# value
query = Image.open(args["query"])
h = str(imagehash.dhash(query))
filenames = db[h]
print("Found %d images" % (len(filenames)))

# loop over the images
for filename in filenames:
    image = Image.open(args["dataset"] + "/" + filename)
    image.show()

# close the shelve database
db.close()

First open the database, then load the picture on the hard disk, calculate the fingerprint of the picture, and find all the pictures with the same fingerprint.

If any pictures have the same hash value, they are looped over and displayed on the screen.

This code allows us to determine whether a picture is already in the database, using only the fingerprint value.
Results

As mentioned earlier in this article, I randomly selected 17 pictures from the more than 7,500 in the CALTECH-101 dataset, and then created N new pictures from each one by randomly rescaling it by a few percentage points.

These pictures differ in size by only a few pixels, but because of that we cannot rely on the MD5 hash of the file (this is discussed further in the "Optimization Algorithm" section). However, we can use a picture hash to find the near-duplicate pictures.

Open your terminal and execute the following command:

$ python search.py --dataset images --shelve db.shelve --query images/84eba74d-38ae-4bf6-b8bd-79ffa1dad23a.jpg

If all goes well you can see the following results:

On the left is the input picture. Load this picture, calculate its picture fingerprint, search the database for fingerprints to see if there are any pictures with the same fingerprint.

Sure enough, as shown on the right, our dataset has two other images with the same fingerprint. Although it is not obvious from the screenshot, these pictures have exactly the same visual content but are not identical files! The heights and widths of the three pictures differ.

Try a different input picture:

$ python search.py --dataset images --shelve db.shelve --query images/9d355a22-3d59-465e-ad14-138a4e3880bc.jpg

Here is the result:

The left side is still our input picture. As shown on the right, our picture fingerprint algorithm is able to find three pictures with the same fingerprint.

A final example:

$ python search.py --dataset images --shelve db.shelve --query images/5134e0c2-34d3-40

This time the input picture on the left is a motorcycle. We take the motorcycle picture, compute its fingerprint, and look that fingerprint up in the fingerprint database. As we can see on the right, there are three images in the database with the same fingerprint.
Optimization Algorithm

There are many ways to optimize the algorithm--but the most critical thing is to take into account similar but not identical hashes.

For example, the pictures in this article were resized by only a few percentage points (scaled up or down proportionally). If a picture is resized by a larger factor, or its aspect ratio is changed, the resulting hash will be different.

However, these pictures should still be similar.

To find similar but not identical images, we need to compute the Hamming distance. The Hamming distance counts the number of bits that differ between two hashes. Thus, two pictures whose hashes differ by only 1 bit are naturally more similar than two pictures whose hashes differ by 10 bits.
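
As a rough sketch, the Hamming distance between two hashes stored as hexadecimal strings (the form our database uses as keys) can be computed by XOR-ing them and counting the 1 bits. The `hamming_distance` helper and the short example hashes are mine, chosen so the result is easy to check by hand.

```python
def hamming_distance(hex_a, hex_b):
    """Count the bits that differ between two equal-length hex hashes."""
    if len(hex_a) != len(hex_b):
        raise ValueError("hashes must have the same length")
    # XOR leaves a 1 bit wherever the two hashes disagree
    return bin(int(hex_a, 16) ^ int(hex_b, 16)).count("1")


print(hamming_distance("ff00", "ff00"))   # 0: identical hashes
print(hamming_distance("ff00", "ff01"))   # 1: only the last bit differs
print(hamming_distance("ff00", "0f0f"))   # 8
```

As far as I know, the hash objects returned by the imagehash library can also be subtracted from one another to get this distance directly, but the version above works on the plain strings we stored.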

However, we then run into a second problem: the scalability of the algorithm.

Consider this: we have an input picture and are asked to find all the similar images in the database. Then we have to calculate the Hamming distance between the input picture and each picture in the database.

As the size of the database grows, so does the time it takes to compare against it. Eventually, the hash database will reach a size at which a linear comparison is no longer practical.
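
The linear comparison described above might look like the following sketch. The `linear_search` function, the `max_distance` threshold, and the three short hashes are all assumptions for illustration; a plain dictionary again stands in for the shelve database.

```python
def hamming_distance(hex_a, hex_b):
    """Count the bits that differ between two equal-length hex hashes."""
    return bin(int(hex_a, 16) ^ int(hex_b, 16)).count("1")


def linear_search(db, query_hash, max_distance):
    """Compare the query against every stored hash: O(n) per query."""
    matches = []
    for stored_hash, filenames in db.items():
        distance = hamming_distance(query_hash, stored_hash)
        if distance <= max_distance:
            matches.append((distance, stored_hash, filenames))
    return sorted(matches)   # closest matches first


db = {
    "ff00": ["a.jpg"],
    "ff01": ["a_resized.jpg"],
    "0f0f": ["b.jpg"],
}

results = linear_search(db, "ff00", max_distance=2)
print(results)
# [(0, 'ff00', ['a.jpg']), (1, 'ff01', ['a_resized.jpg'])]
```

Every query touches every entry, which is exactly why this approach stops scaling once the database gets large.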

The solution, though beyond the scope of this article, is to use k-d trees and VP trees to reduce the complexity of the search problem from linear to sub-linear.
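
For the curious, here is a minimal sketch of a vantage-point (VP) tree over integer hashes. It is a teaching example under several assumptions: the `VPTree` class is my own simplified version, hashes are plain Python integers rather than hex strings, the input must be non-empty, and no attempt is made to balance or tune the tree. The check at the end confirms that it returns the same matches as a brute-force scan.

```python
import random


def hamming(a, b):
    """Hamming distance between two integer hashes."""
    return bin(a ^ b).count("1")


class VPTree:
    """A minimal vantage-point tree over integer hashes (illustrative)."""

    def __init__(self, points, rng=None):
        if rng is None:
            rng = random.Random(0)
        points = list(points)
        # pick one point as the vantage point for this node
        self.vantage = points.pop(rng.randrange(len(points)))
        self.radius = 0
        self.inside = self.outside = None
        if not points:
            return
        distances = [hamming(self.vantage, p) for p in points]
        # split the remaining points at the median distance
        self.radius = sorted(distances)[len(distances) // 2]
        near = [p for p, d in zip(points, distances) if d <= self.radius]
        far = [p for p, d in zip(points, distances) if d > self.radius]
        if near:
            self.inside = VPTree(near, rng)
        if far:
            self.outside = VPTree(far, rng)

    def search(self, query, max_distance, found=None):
        """Collect every stored hash within max_distance of query."""
        if found is None:
            found = []
        d = hamming(query, self.vantage)
        if d <= max_distance:
            found.append((d, self.vantage))
        # Triangle inequality: skip a side that cannot contain a match.
        if self.inside is not None and d - max_distance <= self.radius:
            self.inside.search(query, max_distance, found)
        if self.outside is not None and d + max_distance > self.radius:
            self.outside.search(query, max_distance, found)
        return found


rng = random.Random(42)
hashes = [rng.getrandbits(64) for _ in range(500)]   # made-up 64-bit hashes
query = hashes[0] ^ 0b101                             # flip two bits

tree = VPTree(hashes)
tree_hits = sorted(tree.search(query, max_distance=4))
brute_hits = sorted(
    (hamming(query, h), h) for h in hashes if hamming(query, h) <= 4
)
print(tree_hits == brute_hits)   # True
```

How much work the tree actually saves depends on the data and on the search radius, so treat this as a starting point for experiments rather than a tuned index.
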
Summary

In this article, we learned how to build and use picture hashes to perform similar-image detection. These picture hashes are built from the visual content of the picture.

Just as a fingerprint can identify a person, a picture hash can also uniquely identify a picture.

Using the knowledge of picture fingerprinting, we have established a system that can find and identify images with similar content using only picture hashes.

Then we demonstrated how the picture hash was applied to quickly find pictures with similar content.

Download the code from the repo directory.
