From Content/User Profiles to How to Do Algorithm Development

Source: Internet
Author: User
Tags: svm

Original link: http://www.jianshu.com/p/d59c3e037cb7?spm=5176.100239.blogcont60117.8.Bd8tGq

At noon I had lunch with a former colleague and found that we still have a lot of points where our thinking collides. We talked about many of the things we are each working on, and he offered a number of ideas that deserve serious thought.

First I told him about the progress we are making now: we are actually building content profiles. People usually talk about user profiles, but in fact content needs profiling too.

I have said before that content and users are the two core assets of an Internet company, and that user behavior is what connects content and users together.

A lot of people roll up their sleeves and start building user profiles right away, only to discover later that without analyzing the content, the user profiles turn out badly. Because user behavior is mediated by content, only once the content profile is done can you further improve the quality of the user profile. To build a content profile, there are actually two things to do:

Describe the content along multiple dimensions and form a corresponding tag taxonomy
Work out how to attach these tags to the content
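One plausible way to frame the second task is as a text classification problem, which the Spark MLlib constraint discussed below covers directly. Here is a minimal sketch, runnable in spark-shell (where `spark` is predefined); the training rows, the tag ids, and the column names are all invented for illustration:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.NaiveBayes
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import spark.implicits._

// Hypothetical training set: article text plus a manually assigned tag id.
val training = Seq(
  ("stocks rally as markets open", 0.0),     // tag 0 = finance (made up)
  ("team wins the championship final", 1.0)  // tag 1 = sports (made up)
).toDF("text", "label")

// Tokenize, hash to term-frequency features, then fit a naive Bayes classifier.
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val tf = new HashingTF().setInputCol("words").setOutputCol("features")
val model = new Pipeline()
  .setStages(Array(tokenizer, tf, new NaiveBayes()))
  .fit(training)

// Attach predicted tags to untagged articles.
val untagged = Seq("quarterly earnings beat expectations").toDF("text")
model.transform(untagged).select("text", "prediction").show()
```

In practice one tag per article is too coarse; a real system would run one such classifier per tag, or a multi-label setup, but the shape of the pipeline stays the same.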
He also shared his own view on how to do this: use Spark's MLlib as the carrier and try to get everyone onto a shared algorithm platform. I was surprised, because the idea coincides exactly with mine. He said the benefit is that information is shared more quickly and that a single platform is easier to maintain. I added that if everyone were at the level of a Google engineer, there would really be no need to standardize on one platform; but in reality, if everyone insists on doing things the way they personally prefer, the hidden cost is very high.

For example, an algorithm engineer writes a huge algorithm prototype, and then he has to hand it to a systems engineer and explain the algorithm. Depending on the engineer's level, there are real questions: can the algorithm be implemented at all, how much time will it take, and does the engineer really have the time and energy to implement it? Getting it implemented is a big problem in itself. After a few rounds of back and forth, both people are exhausted. Of course, as I said before, if everyone were a Google engineer things would go faster. If everyone uses the Spark platform, this communication cost shrinks dramatically: the engineer only needs to tune and optimize the Spark code the algorithm engineer has already written, and it can probably go straight online to see the effect. So I go even further and require that algorithm engineers use only algorithms that already exist in Spark MLlib, or ones they are able to implement on Spark themselves; they cannot just grab some library, run it, and push it online.

He also asked me: what counts as really understanding an algorithm? The question genuinely stumped me. I would previously have said that knowing which algorithm to use in which scenario is enough. But thinking about it calmly now, that is really not the case.

Let's first talk about how to know which algorithm to use for which scenario. First, we need to know what kind of problem a specific scenario corresponds to: is it a clustering problem, a classification problem, or a regression problem? Once the category is defined, the corresponding algorithms follow: clustering can use k-means, LDA, and so on; classification can use naive Bayes, SVM, k-nearest neighbors, and so on. However, you will find that this view is still far too simplistic.
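To ground the clustering case, here is a minimal k-means sketch on Spark MLlib, runnable in spark-shell; the two-dimensional vectors are made-up stand-ins for real content features, and k = 2 is an arbitrary choice you would normally tune:

```scala
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.linalg.Vectors
import spark.implicits._

// Toy feature vectors: two obvious groups in 2-D.
val data = Seq(
  Vectors.dense(0.10, 0.20), Vectors.dense(0.15, 0.25),
  Vectors.dense(9.00, 9.10), Vectors.dense(9.20, 8.90)
).map(Tuple1.apply).toDF("features")

// Fit k-means with k = 2 and a fixed seed for reproducibility.
val model = new KMeans().setK(2).setSeed(42L).fit(data)

// Inspect the learned cluster centers and the per-row assignments.
model.clusterCenters.foreach(println)
model.transform(data).show()
```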

The problem a scenario needs to solve is often not so clean-cut. As mentioned above, building a content profile breaks down into two sub-problems; each sub-problem needs to be divided into several steps, and each step may correspond to one or more algorithmic problems.

But even so, it is still far from enough, because even when we know exactly which algorithm to use, the moment we apply it we find the results are nothing like what we expected. At this point we need to know at least two things:

What is the core idea of the algorithm, and what are its implicit requirements? For example, does it make assumptions about the distribution of the data?
What are the characteristics of your features and of your data set?
Many algorithms make fairly crude assumptions, and these assumptions lead to inherent problems in the algorithm. If you do not understand the internal assumptions, you may mistake what is actually a weakness for a feature. Take Gini importance: if you do not understand how it works internally and how it interacts with your data, you can be badly misled into thinking that the features the trees happen to select first are very important and the remaining features are unimportant, when in fact those remaining features have nearly the same relationship to the response variable.
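To make the Gini importance caveat concrete, here is a hedged sketch using Spark MLlib's random forest (the data is synthetic, and the noise scale of 0.01 is an arbitrary assumption): two nearly identical features carry the same signal about the label, yet the Gini-based featureImportances typically splits the credit between them, so a low score for one twin does not mean it lacks a relationship to the response:

```scala
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.linalg.Vectors
import scala.util.Random
import spark.implicits._

val rng = new Random(7)

// Feature 1 drives the label; feature 2 is a near-copy of feature 1.
val rows = (1 to 500).map { _ =>
  val x    = rng.nextDouble()
  val twin = x + rng.nextGaussian() * 0.01  // highly correlated twin
  (if (x > 0.5) 1.0 else 0.0, Vectors.dense(x, twin))
}
val df = rows.toDF("label", "features")

val model = new RandomForestClassifier()
  .setNumTrees(50)
  .setSeed(7L)
  .fit(df)

// The importance mass is split between the twins; neither score alone
// reflects how strongly each feature relates to the label.
println(model.featureImportances)
```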

In the end, being able to derive the formulas is not what matters. We often assume that someone who can derive every formula in an algorithm is very impressive; being able to do that naturally deserves encouragement and admiration, but I think understanding an algorithm and being able to derive its formulas are two different things. I could take every formula in an algorithm to someone from a math department, and they could probably work through the derivations without much trouble. But would we say that person understands the algorithm? They may not even know what the algorithm does. So people coming over from engineering should not feel there is any barrier here; we can safely skip the formula derivation process itself.

I sometimes feel the word that best captures an algorithm engineer's work is "tricky". I do not know a better translation for it; much of the time what is needed is intuition, a feel for the nature of things. Understanding an algorithm absolutely cannot be reduced to a handful of formulas.

Collaborative filtering is a widely used algorithm in our applications. But I think collaboration should not be seen as an algorithm; it is a pattern. Many of our most common models are, in the end, collaborative patterns. For example, to decide whether to recommend article B1 to user A1, we might do this:

Represent the user as a vector, and represent the article as a vector too
Observe whether each of a large number of users A2, A3, ..., An clicked on B1 or not
Train a model on those examples using a classification algorithm such as logistic regression or SVM
Feed (A1, B1) into the model and get a recommendation decision
But in fact this approach already makes use of collaboration. Why? Because in essence, the choices made by similar users are being recommended to one another.
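Here is a minimal sketch of that recipe in Spark MLlib, runnable in spark-shell; the four-dimensional vectors (a two-dimensional user vector concatenated with a two-dimensional article vector) and the click labels are invented purely for illustration:

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.Vectors
import spark.implicits._

// Each row concatenates a user vector with an article vector;
// label = 1.0 if that user clicked that article, 0.0 otherwise.
val training = Seq(
  (1.0, Vectors.dense(0.9, 0.1, 0.8, 0.2)),
  (0.0, Vectors.dense(0.1, 0.9, 0.8, 0.2)),
  (1.0, Vectors.dense(0.8, 0.2, 0.7, 0.3)),
  (0.0, Vectors.dense(0.2, 0.8, 0.6, 0.4))
).toDF("label", "features")

val model = new LogisticRegression().fit(training)

// Score the candidate pair (A1, B1): concatenate their vectors and predict.
val candidate = Seq(Tuple1(Vectors.dense(0.85, 0.15, 0.8, 0.2))).toDF("features")
model.transform(candidate).select("probability", "prediction").show()
```

The fitted model ends up encoding "users whose vectors look like A1 clicked articles whose vectors look like B1", which is exactly the collaborative pattern described above, even though no explicit collaborative filtering algorithm appears anywhere in the code.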
