Five Most Popular Similarity Measures (EXT)


Five Most Popular Similarity Measures: Implementation in Python

The buzz term "similarity distance measure" has a wide variety of definitions among math and data mining practitioners. As a result, those terms, their concepts, and their usage can go right over the head of a beginner encountering them for the very first time. So today I am writing this post to give simplified and intuitive definitions of similarity, and then I will walk through the five most popular similarity measures and their implementations.

Before explaining the different similarity distance measures, let me explain the key term in data mining: similarity. Similarity is the basic building block for activities such as recommendation engines, clustering, classification, and anomaly detection.

Similarity:

Similarity is the measure of how alike two data objects are. Similarity in a data mining context is usually described as a distance whose dimensions represent features of the objects. If this distance is small, there is a high degree of similarity, whereas a large distance indicates a low degree of similarity. Similarity is subjective and highly dependent on the domain and application: for example, two fruits may be similar because of color, size, or taste. Care should be taken when calculating distance across dimensions/features that are unrelated. The relative values of each feature must be normalized, or one feature could end up dominating the distance calculation. Similarity is measured in the range 0 to 1 [0, 1].
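As an aside, one common way to do that normalization is min-max scaling, which maps every feature value into [0, 1]. A minimal sketch (the function name normalize is illustrative, not from the original post):

#!/usr/bin/env python

def normalize(values):
    # Min-max scaling: map each feature value into the range [0, 1]
    # (assumes the values are not all identical, or the range is zero)
    low, high = min(values), max(values)
    return [(v - low) / float(high - low) for v in values]

print(normalize([2, 10, 4, 8]))  # [0.0, 1.0, 0.25, 0.75]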

The main considerations about similarity:

    • Similarity = 1 if X = Y (where X and Y are two objects)
    • Similarity = 0 if X ≠ Y

That's all on similarity; now let's move on to the five most popular similarity distance measures.

Euclidean Distance:

Euclidean distance is the most commonly used distance measure. In most cases, when people talk about distance, they are referring to Euclidean distance. It is also known simply as "distance". When data is dense or continuous, this is the best proximity measure. The Euclidean distance between two points is the length of the straight path connecting them, and it is given by the Pythagorean theorem. In a plane with p1 at (x1, y1) and p2 at (x2, y2):

Euclidean distance = sqrt((x1 − x2)^2 + (y1 − y2)^2)

Euclidean distance implementation in Python:

#!/usr/bin/env python
from math import sqrt

def euclidean_distance(x, y):
    # Square root of the sum of squared coordinate differences
    return sqrt(sum(pow(a - b, 2) for a, b in zip(x, y)))

print(euclidean_distance([0, 3, 4, 5], [7, 6, 3, -1]))

Script Output:

9.746794344808963 [finished in 0.0s]
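As a side note, Python 3.8 and later ship this metric in the standard library as math.dist; a quick check against the hand-rolled version (assuming a sufficiently recent Python):

from math import dist

# math.dist (Python 3.8+) computes the Euclidean distance directly
print(dist([0, 3, 4, 5], [7, 6, 3, -1]))  # 9.746794344808963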

Manhattan Distance:

Manhattan distance is a metric in which the distance between two points is the sum of the absolute differences of their Cartesian coordinates. Put simply, it is the absolute sum of the differences between the x-coordinates and y-coordinates. Suppose we have two points A and B; to find the Manhattan distance between them, we just sum the absolute variation along the x-axis and the y-axis, i.e., we find how the two points A and B vary along each axis. In more mathematical terms, the Manhattan distance between two points is measured along axes at right angles.

In a plane with P1 at (x1, y1) and P2 at (x2, y2).

Manhattan distance = |x1 − x2| + |y1 − y2|

This Manhattan distance metric is also known as Manhattan length, rectilinear distance, L1 distance or L1 norm, city block distance, Minkowski's L1 distance, or the taxicab metric.

Manhattan distance implementation in Python:

#!/usr/bin/env python

def manhattan_distance(x, y):
    # Sum of the absolute coordinate differences
    return sum(abs(a - b) for a, b in zip(x, y))

print(manhattan_distance([10, 20, 10], [10, 20, 20]))

Script Output:

10 [finished in 0.0s]
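If SciPy is installed (an assumption; this post otherwise uses only the standard library), the same metric is available under the name cityblock:

from scipy.spatial.distance import cityblock

# SciPy exposes the Manhattan/L1 metric as "cityblock"
print(cityblock([10, 20, 10], [10, 20, 20]))  # 10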

Minkowski Distance:

The Minkowski distance is a generalized metric form of Euclidean distance and Manhattan distance.

The Minkowski distance between two data records i and j described by n variables is:

Minkowski distance = (Σ_k |x_ik − x_jk|^λ)^(1/λ), with the sum running over the variables k = 1, …, n

Here λ is the order of the Minkowski metric. Although it is defined for any λ > 0, it is rarely used for values other than 1, 2, and ∞.

Distances measured by the Minkowski metric of different orders between two objects with three variables can be visualized in a coordinate system with x-, y-, and z-axes.

Synonyms of Minkowski:
Different names for the Minkowski distance or Minkowski metric arise from the order:

    • λ = 1 is the Manhattan distance. Synonyms are L1-norm, taxicab distance, or city-block distance. For vectors of ranked ordinal variables, the Manhattan distance is sometimes called foot-ruler distance.
    • λ = 2 is the Euclidean distance. Synonyms are L2-norm or ruler distance. For vectors of ranked ordinal variables, the Euclidean distance is sometimes called Spearman distance.
    • λ = ∞ is the Chebyshev distance. Synonyms are Lmax-norm or chessboard distance; a minimal implementation sketch follows below.
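The λ = ∞ case is not implemented later in this post, so here is a minimal sketch of the Chebyshev distance in the same style as the other functions (the name chebyshev_distance is illustrative):

#!/usr/bin/env python

def chebyshev_distance(x, y):
    # Limiting case of the Minkowski metric as the order grows:
    # the largest absolute coordinate difference dominates
    return max(abs(a - b) for a, b in zip(x, y))

print(chebyshev_distance([0, 3, 4, 5], [7, 6, 3, -1]))  # 7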

Minkowski distance implementation in Python:

#!/usr/bin/env python
from decimal import Decimal

def nth_root(value, n_root):
    # Take the n-th root using Decimal, rounded to 3 decimal places
    root_value = 1 / float(n_root)
    return round(Decimal(value) ** Decimal(root_value), 3)

def minkowski_distance(x, y, p_value):
    # Sum |a - b|^p over all coordinates, then take the p-th root
    return nth_root(sum(pow(abs(a - b), p_value) for a, b in zip(x, y)), p_value)

print(minkowski_distance([0, 3, 4, 5], [7, 6, 3, -1], 3))

Script Output:

8.373 [finished in 0.0s]
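For comparison, SciPy (again assuming it is installed) provides the same metric with the order p as its third argument:

from scipy.spatial.distance import minkowski

# SciPy's minkowski takes the order p directly (here p = 3)
print(minkowski([0, 3, 4, 5], [7, 6, 3, -1], 3))  # ≈ 8.373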

Cosine Similarity:

The cosine similarity metric finds the normalized dot product of the two attributes. By determining the cosine similarity, we are effectively trying to find the cosine of the angle between the two objects. The cosine of 0° is 1, and it is less than 1 for any other angle. It is thus a judgement of orientation rather than magnitude: two vectors with the same orientation have a cosine similarity of 1, two vectors at 90° have a similarity of 0, and two vectors diametrically opposed have a similarity of -1, independent of their magnitude. In formula form:

cosine similarity = (x · y) / (|x| × |y|)

Cosine similarity is particularly used in positive space, where the outcome is neatly bounded in [0, 1]. One of the reasons for the popularity of cosine similarity is that it is very efficient to evaluate, especially for sparse vectors.

Cosine similarity implementation in Python:

#!/usr/bin/env python
from math import sqrt

def square_rooted(x):
    # Euclidean norm (length) of the vector, rounded to 3 places
    return round(sqrt(sum(a * a for a in x)), 3)

def cosine_similarity(x, y):
    # Dot product divided by the product of the two vector norms
    numerator = sum(a * b for a, b in zip(x, y))
    denominator = square_rooted(x) * square_rooted(y)
    return round(numerator / float(denominator), 3)

print(cosine_similarity([3, 45, 7, 2], [2, 54, 13, 15]))

Script Output:

0.972 [finished in 0.1s]
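A caveat if you reach for SciPy here (assuming it is installed): scipy.spatial.distance.cosine returns the cosine distance, which is 1 minus the similarity:

from scipy.spatial.distance import cosine

# SciPy's cosine() is a distance, so subtract from 1 to get similarity
print(1 - cosine([3, 45, 7, 2], [2, 54, 13, 15]))  # ≈ 0.972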

Jaccard Similarity:

So far we have discussed some metrics that find the similarity between objects, where the objects are points or vectors. With Jaccard similarity, the objects are sets. So first, let's cover some very basic facts about sets.

Sets:

A set is an (unordered) collection of objects, e.g. {a, b, c}. We use the notation of elements separated by commas inside curly brackets { }. Sets are unordered, so {a, b} = {b, a}.

Cardinality:

The cardinality of A, denoted |A|, counts how many elements are in A.

Intersection:

The intersection of sets A and B, denoted A ∩ B, reveals all items which are in both sets A and B.

Union:

The union of sets A and B, denoted A ∪ B, reveals all items which are in either set.
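These operations map directly onto Python's built-in set type; a quick illustration with two arbitrary example sets:

A = {0, 1, 2, 5, 6}
B = {0, 2, 3, 5, 7, 9}

print(len(A))  # cardinality |A| -> 5
print(A & B)   # intersection A ∩ B -> {0, 2, 5}
print(A | B)   # union A ∪ B -> {0, 1, 2, 3, 5, 6, 7, 9}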

Now, going back to Jaccard similarity: the Jaccard similarity measures similarity between finite sample sets, and is defined as the cardinality of the intersection of the sets divided by the cardinality of their union. If we want to find the Jaccard similarity between two sets A and B, it is the ratio of the cardinality of A ∩ B to that of A ∪ B:

Jaccard similarity = |A ∩ B| / |A ∪ B|

Jaccard similarity implementation in Python:

#!/usr/bin/env python

def jaccard_similarity(x, y):
    # Cardinality of the intersection divided by cardinality of the union
    intersection_cardinality = len(set(x) & set(y))
    union_cardinality = len(set(x) | set(y))
    return intersection_cardinality / float(union_cardinality)

print(jaccard_similarity([0, 1, 2, 5, 6], [0, 2, 3, 5, 7, 9]))

Script Output:

0.375 [finished in 0.0s]
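One caveat: if both inputs are empty, the union has cardinality zero and the function above divides by zero. A defensive variant might look like the sketch below (returning 1.0 for two empty sets is a convention assumed here, not part of the original post):

def jaccard_similarity_safe(x, y):
    # Same metric, but treat two empty sets as perfectly similar
    # to avoid dividing by zero (this convention is an assumption)
    union_cardinality = len(set(x) | set(y))
    if union_cardinality == 0:
        return 1.0
    return len(set(x) & set(y)) / float(union_cardinality)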

(Source: http://dataaspirant.com/2015/04/11/five-most-popular-similarity-measures-implementation-in-python/)
