Many people assume that machine learning is out of reach: a mysterious technology that only a handful of professional scholars understand.
After all, you are asking a machine that lives in a binary world to form its own understanding of the real one. You are teaching it how to think. Yet it is hardly as obscure, complex, and formula-laden as you might think. Like all the basic ideas that help us understand the world (for example, Newton's laws of motion, jobs to be done, supply and demand), the best methods and concepts in machine learning should be concise and clear. Unfortunately, the vast majority of machine learning literature is filled with difficult notation, obscure mathematical formulas, and unnecessary jargon. This is what has built a thick wall around the simple ideas at the foundation of machine learning.
Now let's look at a practical example. We want to add a "you may like" recommendation feature at the end of an article. How can we do this?
A simple solution would be:
- 1. Take the title of the current article and split it into individual words (note: the original text is in English, so splitting on spaces is enough; Chinese text requires a word segmenter)
- 2. Get all articles except the current one
- 3. Sort these articles by how much their content overlaps with the title of the current article
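    def similar_posts(post)
      title_keywords = post.title.split(' ')

      # Step 2: every post except the current one.
      others = Post.all.to_a.reject { |other| other == post }

      # Step 3: sort by the number of title keywords found in each body.
      others.sort do |post1, post2|
        post1_title_intersection = post1.body.split(' ') & title_keywords
        post2_title_intersection = post2.body.split(' ') & title_keywords

        # Descending order: the post sharing more words comes first.
        post2_title_intersection.length <=> post1_title_intersection.length
      end[0..9]
    end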
Using this method to find articles similar to the blog article "how to improve product quality by the Support Team", we get the following 10 most relevant articles:
- How to implement a verified Solution
- Learn how your customers make decisions
- Design the first running interface to please your users
- How to recruit designers
- Icon design
- Interview with singer Ryan
- Actively support customers through internal communication
- It doesn't matter why you become the first
- Interview with Joshua Porter
- Customer Retention, group analysis and Visualization
As you can see, the reference article is about how to provide team support efficiently, which has little to do with customer cohort analysis or discussions of design. We can clearly do better.
Now let's try to solve this problem with a real machine learning method. There are two steps:
- Represent the articles mathematically;
- Use the K-means clustering algorithm to analyze the resulting data points.
1. Represent the articles mathematically
If we can represent articles mathematically, we can plot them according to their similarity and identify distinct clusters.
Mapping each article to a point in a coordinate system is not difficult. It takes two steps:
- Find all the words that appear across the articles;
- Create an array for each article. Each element is 0 or 1, indicating whether a given word appears in that article. The order of the elements is the same for every article; only the values differ.
The Ruby code is as follows:
    @posts = Post.all

    @words = @posts.map do |p|
      p.body.split(' ')
    end.flatten.uniq

    @vectors = @posts.map do |p|
      @words.map do |w|
        p.body.include?(w) ? 1 : 0
      end
    end
Assume that the value of @words is:
["Hello", "internal", "internal communication", "Reader", "blog", "publish"]
If the content of an article is "Hello blog publisher", the corresponding array has one entry per word ("publish" counts as present because include? finds it inside "publisher"):
[1, 0, 0, 0, 1, 1]
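To check the example by hand, the same mapping can be run on plain strings instead of Post records. This is only a small sketch: the `vectorize` helper and the local variable names are hypothetical and do not come from the original listing.

```ruby
# Hypothetical stand-ins for @words and a single post body.
words = ["Hello", "internal", "internal communication", "Reader", "blog", "publish"]
body  = "Hello blog publisher"

# 1 if the word occurs anywhere in the body (case-sensitive substring match), else 0.
def vectorize(body, words)
  words.map { |w| body.include?(w) ? 1 : 0 }
end

vector = vectorize(body, words)
p vector  # one entry per vocabulary word, in vocabulary order
```

Because `include?` is a substring test, short words can match inside longer ones. Comparing against `body.split(' ')` instead would count whole words only.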
Of course, we can't draw a six-dimensional point as easily as a two-dimensional one, but the basic concepts involved, such as the distance between two points, carry over: they extend from two dimensions to any higher number of dimensions (which is why it is reasonable to illustrate with two-dimensional examples).
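To make the point about distance concrete, here is a minimal sketch that assumes Euclidean distance and plain numeric arrays. The article does not say which distance measure it uses, so this `distance` method is an illustrative assumption, not the author's implementation.

```ruby
# Euclidean distance between two equal-length numeric vectors.
# The same formula works for 2 dimensions or 6 (or any other number).
def distance(a, b)
  raise ArgumentError, "dimension mismatch" unless a.length == b.length
  Math.sqrt(a.zip(b).sum { |x, y| (x - y)**2 })
end

p distance([0, 0], [3, 4])                          # two dimensions => 5.0
p distance([1, 0, 0, 0, 1, 1], [1, 0, 0, 1, 1, 0])  # six dimensions
```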
2. Use the K-means clustering algorithm to cluster the data points
Now that we have coordinates for every article, we can try to find clusters of similar articles. Here we use a simple clustering algorithm, K-means, which has five steps:
- Choose K, the number of clusters;
- Randomly select K objects from the data as the initial cluster centers;
- Go through all objects and assign each one to the cluster whose center is nearest;
- Update each cluster center: compute the mean of the objects in the cluster and use it as the new center;
- Repeat steps 3 and 4 until the cluster centers no longer change.
Let's walk through these steps visually. First, we randomly select two points (K = 2) from the article coordinates:
We assign each article to the cluster whose center is closest to it:
We compute the mean coordinate of all objects in each cluster and use it as that cluster's new center.
This completes the first iteration. Now we reassign each article to a cluster based on the new centers.
At this point we have found the cluster for each article! Further iterations would not move the cluster centers, so the assignments no longer change.
The Ruby code for the above process is as follows:
    # Start from two random points (K = 2).
    @cluster_centers = [rand_point(), rand_point()]

    15.times do
      @clusters = [[], []]

      # Assign each post to its nearest cluster center.
      @posts.each do |post|
        min_distance, min_point = nil, nil

        @cluster_centers.each.with_index do |center, i|
          d = distance(center, post)
          if min_distance.nil? || d < min_distance
            min_distance = d
            min_point = i
          end
        end

        @clusters[min_point] << post
      end

      # Move each center to the mean of the posts assigned to it.
      @cluster_centers = @clusters.map do |posts|
        average(posts)
      end
    end
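The listing above calls `rand_point`, `distance`, and `average` without showing them. The following is a self-contained sketch of the same five steps on plain arrays. It assumes Euclidean distance, picks the initial centers from the data points themselves, and keeps the old center when a cluster ends up empty. The `kmeans` method, its parameters, and the sample points are illustrative assumptions, not the author's code.

```ruby
# Euclidean distance between two equal-length vectors (assumed metric).
def distance(a, b)
  Math.sqrt(a.zip(b).sum { |x, y| (x - y)**2 })
end

# Component-wise mean of a non-empty list of vectors.
def average(vectors)
  vectors.transpose.map { |column| column.sum.to_f / column.length }
end

# Returns [clusters, centers]; clusters[i] holds the vectors assigned to centers[i].
def kmeans(vectors, k, max_iterations: 15, rng: Random.new)
  # Steps 1-2: K is given; pick K distinct data points as the initial centers.
  centers = vectors.sample(k, random: rng)
  clusters = []

  max_iterations.times do
    # Step 3: assign every vector to the cluster with the nearest center.
    clusters = Array.new(k) { [] }
    vectors.each do |v|
      nearest = (0...k).min_by { |i| distance(centers[i], v) }
      clusters[nearest] << v
    end

    # Step 4: move each center to the mean of its cluster
    # (an empty cluster keeps its previous center).
    new_centers = clusters.each_with_index.map do |members, i|
      members.empty? ? centers[i] : average(members)
    end

    # Step 5: stop once the centers no longer change.
    break if new_centers == centers
    centers = new_centers
  end

  [clusters, centers]
end

# Hypothetical 2-D data: two well-separated groups of three points each.
points = [[1, 1], [1, 2], [2, 1], [8, 8], [9, 8], [8, 9]]
clusters, centers = kmeans(points, 2, rng: Random.new(42))
clusters.each_with_index { |members, i| puts "cluster #{i}: #{members.inspect}" }
```

For article data, the 0/1 word vectors would take the place of `points`. Which group ends up as cluster 0 depends on the random initial centers, so only the grouping itself is meaningful.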
The following are the 10 articles this method ranks as most similar to the blog post "how to improve product quality by Support Teams":
- You know this better or you are smarter.
- Three guidelines for customer feedback
- Obtain the information you want from the customer
- Product Delivery is just the beginning
- What do you think function extensions look like?
- Understand your user base
- Convert customers with correct information and time
- Communicate with your customers
- Does your application have a message push schedule?
- Have you tried to communicate with the customer?
The results are self-explanatory.
We used fewer than 40 lines of code and a simple algorithm to implement this idea. Yet if you read an academic paper, you would never guess how simple it is. The following is the abstract of a K-means paper (I don't know who invented the K-means algorithm, but this is the first paper to use the term "K-means").
If you like to express ideas in mathematical notation, academic papers are no doubt very useful. However, there are more practical and approachable high-quality resources that can stand in for these complicated formulas:
- Wikipedia (for example, latent semantic indexing and cluster analysis)
- Source code of open-source machine learning libraries (for example, SciPy's K-Means, scikit-learn's DBSCAN)
- Books written by programmers (for example, Programming Collective Intelligence and Machine Learning for Hackers)
- Khan Academy
Try it
How would you recommend tags for your project management app? How would you design your customer support tool? How do users group together in a social network? All of these can be tackled with simple code and simple algorithms, which makes them a good opportunity to practice! So if you think a problem in your project could be solved with machine learning, why hesitate?
Machine learning is actually easier than you think!
Original article: Intercom Translation: bole online-zhibinzeng
Http://blog.jobbole.com/53546/.