Original: Http://www.infoq.com/cn/news/2014/03/baidu-salon48-summary
On March 15, 2014, at the 48th Baidu Technology Salon, sponsored by @Baidu and organized by @InfoQ, Xia Fen, head of big data machine learning technology at Baidu Alliance, and Wang Xiaobo, technical manager in Sogou's Precision Advertising R&D Department, each shared their experience in machine learning. Their topics were "Large-scale machine learning on advertising data" and "Topic retrieval applications in big data scenarios". This article briefly reviews each lecturer's talk and provides downloads of the related materials.
Topic One: Large-scale machine learning on advertising data (download notes)
A good ad-matching system needs to solve these challenges while using as few resources as possible to extract as much value from the data as possible, improving the efficiency of ad matching. To that end, Xia Fen used the ad click-through rate (CTR) estimation problem as an example to explain how large-scale machine learning can be used to build a CTR estimation system with trillion-scale feature data, minute-level model updates, automatic and efficient deep learning, and efficient training.
Computational advertising and CTR estimation
The main challenge of computational advertising is to find the "best match" between a specific user and the corresponding ads in a given context. The context can be the query a user types into a search engine, the page the user is reading, the movie the user is watching, and so on. The information available about a user may be extensive or very limited, and the number of candidate ads can reach into the billions. Depending on how "best match" is defined, the challenge therefore tends to become a large-scale optimization and search problem under complex constraints.
"We want to apply machine learning to advertising data and do it well, which requires combing through the entire process. Once we have combed through the entire process, we can see what we can do to influence click estimation," Jing Jinpeng said.
Large-scale machine learning
Large data and feature scale: the training samples come from daily traffic, and the feature types are complex, covering ads, users, traffic, seasons, holidays and so on. The data is large, the features are numerous, the classes are imbalanced, and the noise is high;
Highly nonlinear relationships between features: for example, users of different genders and ages tend to click different ads, and the same ad is clicked differently at different times;
Frequent model training: strategies are updated regularly and policies are researched continually, so the model training program is invoked frequently.
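To make the point about nonlinear relationships between features concrete, here is a minimal sketch of one common way to expose such interactions to a linear CTR model: crossing categorical features and mapping them into a fixed index space with the hashing trick. This is a generic technique, not necessarily what was presented in the talk; the field names, the sample values and `NUM_BUCKETS` are all hypothetical.

```python
import zlib
from itertools import combinations

NUM_BUCKETS = 2 ** 20  # hypothetical size of the hashed feature space


def feature_names(sample):
    """Return single-feature names plus all pairwise crosses."""
    singles = [f"{key}={sample[key]}" for key in sorted(sample)]
    # A cross such as "age=25-34&gender=female" gets its own weight, so a
    # linear model can treat each combination differently.
    crosses = [f"{a}&{b}" for a, b in combinations(singles, 2)]
    return singles + crosses


def hashed_indices(sample, num_buckets=NUM_BUCKETS):
    """Map every feature name to a bucket index (the hashing trick)."""
    return sorted(
        {zlib.crc32(name.encode("utf-8")) % num_buckets
         for name in feature_names(sample)}
    )


# Made-up sample: one impression described by three categorical fields.
sample = {"gender": "female", "age": "25-34", "hour": "21"}
names = feature_names(sample)
indices = hashed_indices(sample)
```

Crossing every pair of fields grows the feature count quadratically, which is one reason feature spaces in this setting become so large and why a fixed-size hashed index space is a convenient assumption for a sketch.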
Model timeliness
Sparsity: the model should store as little information as possible;
Timeliness: the model should be trained on as little data as possible;
Stability: the model needs as much information from the data as possible;
Google's approach: retain the gradients and models of the previous N models; the information loss is large and the model is not stable.
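The tension between sparsity, timeliness and stability can be illustrated with a small online learner. The sketch below is a plain online logistic regression with L1 soft-thresholding, chosen only for illustration: it is neither the method presented in the talk nor the Google approach mentioned above. The learning rate `lr`, the penalty `l1` and the feature names are hypothetical.

```python
import math


def predict(weights, features):
    """Predicted click probability for a sample with binary sparse features."""
    z = sum(weights.get(f, 0.0) for f in features)
    z = max(-35.0, min(35.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))


def soft_threshold(w, t):
    """Shrink w toward zero by t and snap small values to exactly zero."""
    if w > t:
        return w - t
    if w < -t:
        return w + t
    return 0.0


def update(weights, features, label, lr=0.1, l1=0.01):
    """One online step: uses only the newest sample, then applies L1 shrinkage."""
    p = predict(weights, features)
    grad = p - label  # log-loss gradient for each active binary feature
    for f in features:
        w = soft_threshold(weights.get(f, 0.0) - lr * grad, lr * l1)
        if w == 0.0:
            weights.pop(f, None)  # sparsity: zero weights are not stored
        else:
            weights[f] = w
    return p


# Made-up stream: ad=1 is always clicked, ad=2 never is.
weights = {}
for _ in range(200):
    update(weights, ["ad=1", "user=a"], 1)
    update(weights, ["ad=2", "user=a"], 0)
```

Updating from one sample at a time keeps the model fresh, and the L1 step keeps it small, but both discard information that a batch model trained on all the data would retain. That is the stability cost the list above describes.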
Topic Two: Topic retrieval applications in big data scenarios (download notes)
Most data sets one encounters contain tens of thousands to hundreds of thousands of documents. But in a real enterprise scenario, how do you handle collections on the order of billions of documents? And how do you process them with limited computing cluster resources?
Focusing on this problem of very large corpora, Wang Xiaobo introduced the LDA topic model training system, along with the problems that arise during online prediction and their solutions.
- Theoretical basis of topic retrieval model
- Challenges in Big Data scenarios
- Build an efficient training system
- Applying the model to commercial ad retrieval
Development history: VSM
The vector space model (VSM) was a groundbreaking concept:
Advantages: a document can be represented as a real-valued vector;
documents of different lengths can be represented as fixed-length vectors;
vector-based methods of calculation can be applied.
Problems: documents are mapped into word space, so the vector dimension is too high;
semantic understanding is weak, and support for semantic analysis is limited.
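The advantages and problems of the vector space model listed above can be seen in a few lines of code. This is a minimal bag-of-words sketch with raw term counts and cosine similarity; the example documents are made up, and real systems typically use weighting schemes such as TF-IDF instead of raw counts.

```python
import math
from collections import Counter


def build_vocabulary(documents):
    """One vector dimension per distinct word across the corpus."""
    return sorted({word for doc in documents for word in doc.lower().split()})


def to_vector(document, vocabulary):
    """Represent a document of any length as a fixed-length count vector."""
    counts = Counter(document.lower().split())
    return [float(counts[word]) for word in vocabulary]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up documents of different lengths.
docs = [
    "buy a cheap car",
    "cheap car for sale in beijing this week",
    "purchase an inexpensive automobile",
]
vocab = build_vocabulary(docs)
vectors = [to_vector(doc, vocab) for doc in docs]
sim_shared_words = cosine_similarity(vectors[0], vectors[1])
sim_synonyms = cosine_similarity(vectors[0], vectors[2])
```

The first and third documents mean nearly the same thing but share no words, so their similarity is zero. The vector length also equals the vocabulary size, which grows with the corpus. Both effects correspond to the problems listed above.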
Introduction to LDA Model
Open Space (open discussion session)
To make it easier for participants to talk directly with the guests and lecturers and to dig deeper into questions raised during the talks, this event again included an Open Space (open discussion) session. In the Open Space wrap-up, the leaders of the topic groups summarized their discussions.
Xia Fen: Deep learning will only become more popular in the current big data era. My talk was meant as a starting point for discussion. During the interaction, people asked many practical questions, and I hope my explanations were of some help.
Wang Xiaobo: Enthusiasm for machine learning is very high, and Xia Fen shared a lot of solid, practical material. As long as key commercial data is not involved, such as the specific number of clicks on Baidu ads, publishing these models for everyone to study is a very good thing. I hope the organizers will prepare the relevant topics in advance next time, so that lecturers can prepare for Open Space and give the audience more targeted answers.
During the event, some participants also shared their impressions on Sina Weibo:
Aixinjueluo small teeth: Compared with Baidu, we are still in the Stone Age.
Hao _change: @Xia Fen_Baidu Could you tell us what the billions of features computed for the ads specifically include? How did you obtain them?
Fan Bin _: #百度技术沙龙# Nearly an hour before the start, the room was almost full. Everyone had an iPhone, iPad or Kindle out, reading, studying and discussing. That is what programmers are like.
For more information about the Baidu Technology Salon, follow @Baidu Technology Salon on Sina Weibo, or follow InfoQ's official account: infoqchina. InfoQ has also collected the videos and materials from all previous Baidu Technology Salon talks, which interested readers can browse directly.
Special note: The next Baidu Technology Salon will be held on Saturday, April 19, at Garage Coffee in Beijing, on the theme "Large-scale distributed storage and real-time analysis". Follow @InfoQ and @Baidu Technology Salon for information on upcoming events.
Baidu Technology Salon 48th review: Large-scale machine learning (including data download)