From Content/User Profiles to How to Do Algorithm Development

Source: Internet
Author: User
Tags: svm

Original link: http://www.jianshu.com/p/d59c3e037cb7?spm=5176.100239.blogcont60117.8.Bd8tGq

At noon I had lunch with a former colleague and found that we still have a lot of points where our thinking collides. We talked about many of the things we are each working on, and he offered a number of ideas that deserve serious thought.

First I told him about the progress we are making now: we are actually building content profiles. People usually talk about user profiles, but in fact content needs profiling too.

I have said before that content and users are the two core assets of an Internet company, and that user behavior is what connects content and users together.

A lot of people roll up their sleeves and start building user profiles right away, only to discover later that without analyzing the content, the user profiles turn out badly. Because user behavior is mediated by content, only once the content profile is done can you further improve the quality of the user profile. To build a content profile, there are actually two things to do:

Describe the content along multiple dimensions and form a corresponding tag taxonomy
Work out how to attach these tags to the content
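One plausible way to frame the second task is as a text classification problem, which the Spark MLlib constraint discussed below covers directly. Here is a minimal sketch, runnable in spark-shell (where `spark` is predefined); the training rows, the tag ids, and the column names are all invented for illustration:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.NaiveBayes
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import spark.implicits._

// Hypothetical training set: article text plus a manually assigned tag id.
val training = Seq(
  ("stocks rally as markets open", 0.0),     // tag 0 = finance (made up)
  ("team wins the championship final", 1.0)  // tag 1 = sports (made up)
).toDF("text", "label")

// Tokenize, hash to term-frequency features, then fit a naive Bayes classifier.
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val tf = new HashingTF().setInputCol("words").setOutputCol("features")
val model = new Pipeline()
  .setStages(Array(tokenizer, tf, new NaiveBayes()))
  .fit(training)

// Attach predicted tags to untagged articles.
val untagged = Seq("quarterly earnings beat expectations").toDF("text")
model.transform(untagged).select("text", "prediction").show()
```

In practice one tag per article is too coarse; a real system would run one such classifier per tag, or a multi-label setup, but the shape of the pipeline stays the same.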
He also shared his own view on how to do this: use Spark's MLlib as the carrier and try to get everyone onto a shared algorithm platform. I was surprised, because the idea coincides exactly with mine. He said the benefit is that information is shared more quickly and that a single platform is easier to maintain. I added that if everyone were at the level of a Google engineer, there would really be no need to standardize on one platform; but in reality, if everyone insists on doing things the way they personally prefer, the hidden cost is very high.

For example, an algorithm engineer writes a huge algorithm prototype, and then he has to hand it to a systems engineer and explain the algorithm. Depending on the engineer's level, there are real questions: can the algorithm be implemented at all, how much time will it take, and does the engineer really have the time and energy to implement it? Getting it implemented is a big problem in itself. After a few rounds of back and forth, both people are exhausted. Of course, as I said before, if everyone were a Google engineer things would go faster. If everyone uses the Spark platform, this communication cost shrinks dramatically: the engineer only needs to tune and optimize the Spark code the algorithm engineer has already written, and it can probably go straight online to see the effect. So I go even further and require that algorithm engineers use only algorithms that already exist in Spark MLlib, or ones they are able to implement on Spark themselves; they cannot just grab some library, run it, and push it online.

He also asked me: what counts as really understanding an algorithm? The question genuinely stumped me. I would previously have said that knowing which algorithm to use in which scenario is enough. But thinking about it calmly now, that is really not the case.

Let's first talk about how to know which algorithm to use for which scenario. First, we need to know what kind of problem a specific scenario corresponds to: is it a clustering problem, a classification problem, or a regression problem? Once the category is defined, the corresponding algorithms follow: clustering can use k-means, LDA, and so on; classification can use naive Bayes, SVM, k-nearest neighbors, and so on. However, you will find that this view is still far too simplistic.
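To ground the clustering case, here is a minimal k-means sketch on Spark MLlib, runnable in spark-shell; the two-dimensional vectors are made-up stand-ins for real content features, and k = 2 is an arbitrary choice you would normally tune:

```scala
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.linalg.Vectors
import spark.implicits._

// Toy feature vectors: two obvious groups in 2-D.
val data = Seq(
  Vectors.dense(0.10, 0.20), Vectors.dense(0.15, 0.25),
  Vectors.dense(9.00, 9.10), Vectors.dense(9.20, 8.90)
).map(Tuple1.apply).toDF("features")

// Fit k-means with k = 2 and a fixed seed for reproducibility.
val model = new KMeans().setK(2).setSeed(42L).fit(data)

// Inspect the learned cluster centers and the per-row assignments.
model.clusterCenters.foreach(println)
model.transform(data).show()
```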

The problem a scenario needs to solve is often not so clean-cut. As mentioned above, building a content profile breaks down into two sub-problems; each sub-problem needs to be divided into several steps, and each step may correspond to one or more algorithmic problems.

But even so, it is still far from enough, because even when we know exactly which algorithm to use, the moment we apply it we find the results are nothing like what we expected. At this point we need to know at least two things:

What is the core idea of the algorithm, and what are its implicit requirements? For example, does it make assumptions about the distribution of the data?
What are the characteristics of your features and of your data set?
Many algorithms make fairly crude assumptions, and these assumptions lead to inherent problems in the algorithm. If you do not understand the internal assumptions, you may mistake what is actually a weakness for a feature. Take Gini importance: if you do not understand how it works internally and how it interacts with your data, you can be badly misled into thinking that the features the trees happen to select first are very important and the remaining features are unimportant, when in fact those remaining features have nearly the same relationship to the response variable.
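To make the Gini importance caveat concrete, here is a hedged sketch using Spark MLlib's random forest (the data is synthetic, and the noise scale of 0.01 is an arbitrary assumption): two nearly identical features carry the same signal about the label, yet the Gini-based featureImportances typically splits the credit between them, so a low score for one twin does not mean it lacks a relationship to the response:

```scala
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.linalg.Vectors
import scala.util.Random
import spark.implicits._

val rng = new Random(7)

// Feature 1 drives the label; feature 2 is a near-copy of feature 1.
val rows = (1 to 500).map { _ =>
  val x    = rng.nextDouble()
  val twin = x + rng.nextGaussian() * 0.01  // highly correlated twin
  (if (x > 0.5) 1.0 else 0.0, Vectors.dense(x, twin))
}
val df = rows.toDF("label", "features")

val model = new RandomForestClassifier()
  .setNumTrees(50)
  .setSeed(7L)
  .fit(df)

// The importance mass is split between the twins; neither score alone
// reflects how strongly each feature relates to the label.
println(model.featureImportances)
```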

In the end, being able to derive the formulas is not what matters. We often assume that someone who can derive every formula in an algorithm is very impressive; being able to do that naturally deserves encouragement and admiration, but I think understanding an algorithm and being able to derive its formulas are two different things. I could take every formula in an algorithm to someone from a math department, and they could probably work through the derivations without much trouble. But would we say that person understands the algorithm? They may not even know what the algorithm does. So people coming over from engineering should not feel there is any barrier here; we can safely skip the formula derivation process itself.

I sometimes feel the word that best captures an algorithm engineer's work is "tricky". I do not know a better translation for it; much of the time what is needed is intuition, a feel for the nature of things. Understanding an algorithm absolutely cannot be reduced to a handful of formulas.

Collaborative filtering is a widely used algorithm in our applications. But I think collaboration should not be seen as an algorithm; it is a pattern. Many of our most common models are, in the end, collaborative patterns. For example, to decide whether to recommend article B1 to user A1, we might do this:

Represent the user as a vector, and represent the article as a vector too
Observe whether each of a large number of users A2, A3, ..., An clicked on B1 or not
Train a model on those examples using a classification algorithm such as logistic regression or SVM
Feed (A1, B1) into the model and get a recommendation decision
But in fact this approach already makes use of collaboration. Why? Because in essence, the choices made by similar users are being recommended to one another.
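Here is a minimal sketch of that recipe in Spark MLlib, runnable in spark-shell; the four-dimensional vectors (a two-dimensional user vector concatenated with a two-dimensional article vector) and the click labels are invented purely for illustration:

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.Vectors
import spark.implicits._

// Each row concatenates a user vector with an article vector;
// label = 1.0 if that user clicked that article, 0.0 otherwise.
val training = Seq(
  (1.0, Vectors.dense(0.9, 0.1, 0.8, 0.2)),
  (0.0, Vectors.dense(0.1, 0.9, 0.8, 0.2)),
  (1.0, Vectors.dense(0.8, 0.2, 0.7, 0.3)),
  (0.0, Vectors.dense(0.2, 0.8, 0.6, 0.4))
).toDF("label", "features")

val model = new LogisticRegression().fit(training)

// Score the candidate pair (A1, B1): concatenate their vectors and predict.
val candidate = Seq(Tuple1(Vectors.dense(0.85, 0.15, 0.8, 0.2))).toDF("features")
model.transform(candidate).select("probability", "prediction").show()
```

The fitted model ends up encoding "users whose vectors look like A1 clicked articles whose vectors look like B1", which is exactly the collaborative pattern described above, even though no explicit collaborative filtering algorithm appears anywhere in the code.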
