Essential knowledge for information retrieval and data mining: a summary of basic theory


A summary of the models and techniques commonly used in papers in the information retrieval and web data fields (WWW, SIGIR, CIKM, WSDM, ACL, EMNLP, etc.)

Introduction: For doctoral students in this field, being able to read papers is the basis for understanding what others are researching, and we usually turn to books to build that ability. Reading a book is good, but it has one big drawback: a book is a self-contained system, so it covers a great deal, and much of what you read is never actually used. That is not exactly a waste, but it does not put limited effort where it counts most.

For the field of web data processing (international conferences such as WWW, SIGIR, CIKM, WSDM, ACL, and EMNLP), I have made a list of the models and techniques that I think we encounter most often, in the hope that it saves you some time:
1. Elementary probability theory
The concepts mainly used are: the three axioms in the elementary definition of probability, the law of total probability, Bayes' theorem, the chain rule, and common probability distributions (Dirichlet, Gaussian, multinomial, Poisson).
Probability theory covers a lot, but in practice these few concepts are the ones mainly used. Advanced probability theory built on measure theory rarely appears in the papers at these conferences (WWW, SIGIR, etc.).
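As a small illustration of how the law of total probability and Bayes' theorem fit together, here is a minimal sketch. The function name `posterior` and all the numbers are hypothetical and chosen only for the example.

```python
def posterior(prior, likelihood):
    """Bayes' theorem over a finite set of hypotheses.

    prior[h]      = P(h)
    likelihood[h] = P(evidence | h)
    Returns P(h | evidence) for every hypothesis h.
    """
    # Law of total probability: P(e) = sum over h of P(e | h) * P(h)
    evidence = sum(likelihood[h] * prior[h] for h in prior)
    # Bayes' theorem: P(h | e) = P(e | h) * P(h) / P(e)
    return {h: likelihood[h] * prior[h] / evidence for h in prior}


# Hypothetical numbers: 1% of documents are relevant; a query term occurs
# in 80% of relevant documents and in 10% of non-relevant ones.
prior = {"relevant": 0.01, "non_relevant": 0.99}
likelihood = {"relevant": 0.8, "non_relevant": 0.1}
post = posterior(prior, likelihood)
print(post["relevant"])  # roughly 0.075
```

Even a strongly indicative term leaves the posterior low here because the prior is so small, which is the usual reason the prior cannot be ignored.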
2. Foundations of information theory
The main concepts are: entropy, conditional entropy, KL divergence and the relationships among the three, the maximum entropy principle, and information gain.
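The quantities listed above are short to compute for discrete distributions. The following is a sketch using base-2 logarithms; the function names and the joint-table layout `joint[x][y] = P(X=x, Y=y)` are assumptions made for this example.

```python
import math


def entropy(p):
    """H(p) = -sum p_i * log2(p_i), with 0 * log 0 taken as 0."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def kl_divergence(p, q):
    """KL(p || q) = sum p_i * log2(p_i / q_i); assumes q_i > 0 wherever p_i > 0."""
    return sum(x * math.log2(x / y) for x, y in zip(p, q) if x > 0)


def conditional_entropy(joint):
    """H(Y | X) = sum over x of P(x) * H(Y | X = x)."""
    h = 0.0
    for row in joint:
        px = sum(row)
        if px > 0:
            h += px * entropy([v / px for v in row])
    return h


def information_gain(joint):
    """Information gain of X about Y: H(Y) - H(Y | X)."""
    py = [sum(col) for col in zip(*joint)]
    return entropy(py) - conditional_entropy(joint)


# X fully determines Y, so knowing X removes the whole bit of uncertainty.
print(information_gain([[0.5, 0.0], [0.0, 0.5]]))
# X and Y are independent, so the gain is zero.
print(information_gain([[0.25, 0.25], [0.25, 0.25]]))
```

One relationship worth checking by hand: the information gain equals the KL divergence between the joint distribution and the product of its marginals.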
3. Classification
Know the basic principles and the pros and cons of naive Bayes, KNN, support vector machines, the maximum entropy model, and decision trees, and know the commonly used software packages.
4. Clustering
Know the K-means algorithm for non-hierarchical clustering, the types of hierarchical clustering and how they differ, how the distance between clusters is computed (for example, the difference between single and complete linkage), and the commonly used software packages.
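The difference between single and complete linkage comes down to which pairwise distance represents two clusters. A minimal sketch, assuming Euclidean distance and hypothetical points (`math.dist` needs Python 3.8 or later):

```python
import math


def linkage_distance(cluster_a, cluster_b, mode):
    """Distance between two clusters, each given as a list of points."""
    pairwise = [math.dist(p, q) for p in cluster_a for q in cluster_b]
    if mode == "single":
        return min(pairwise)   # closest pair of points
    if mode == "complete":
        return max(pairwise)   # farthest pair of points
    raise ValueError(f"unknown linkage mode: {mode}")


a = [(0, 0), (1, 0)]
b = [(3, 0), (5, 0)]
print(linkage_distance(a, b, "single"))    # 2.0
print(linkage_distance(a, b, "complete"))  # 5.0
```

Single linkage tends to produce long, chained clusters, while complete linkage favors compact ones, which is why the choice matters in hierarchical clustering.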
5. EM algorithm
Understand why inference with incomplete data is difficult, and understand the principle of EM and its inference procedure.
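To make the incomplete-data setting concrete, here is a sketch of EM on a toy problem: each trial is `n` flips of one of two coins with unknown biases, and which coin was used is hidden. The data, the starting values, and the assumption that both coins are equally likely to be picked are all hypothetical.

```python
import math


def em_two_coins(heads, n, theta_a, theta_b, iters=50):
    """EM for two coins with hidden coin identity.

    heads[i] is the number of heads in trial i (n flips each).
    Returns the estimated biases and the log-likelihood before each update.
    """
    history = []
    for _ in range(iters):
        heads_a = tails_a = heads_b = tails_b = 0.0
        log_lik = 0.0
        for h in heads:
            # Likelihood of this trial under each coin (binomial coefficient
            # omitted, since it is the same for both coins).
            like_a = theta_a ** h * (1 - theta_a) ** (n - h)
            like_b = theta_b ** h * (1 - theta_b) ** (n - h)
            log_lik += math.log(0.5 * like_a + 0.5 * like_b)
            # E-step: posterior probability that coin A produced the trial.
            w_a = like_a / (like_a + like_b)
            w_b = 1.0 - w_a
            heads_a += w_a * h
            tails_a += w_a * (n - h)
            heads_b += w_b * h
            tails_b += w_b * (n - h)
        history.append(log_lik)
        # M-step: re-estimate each bias from the expected counts.
        theta_a = heads_a / (heads_a + tails_a)
        theta_b = heads_b / (heads_b + tails_b)
    return theta_a, theta_b, history


est_a, est_b, history = em_two_coins([5, 9, 8, 4, 7], n=10, theta_a=0.6, theta_b=0.5)
print(est_a, est_b)
```

The recorded log-likelihood never decreases from one iteration to the next, which is the core guarantee of EM; it does not guarantee a global optimum, so the starting values matter.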
6. Monte Carlo algorithms (especially Gibbs sampling)
Know the basic principles of Monte Carlo algorithms, especially the sampling procedure of the Gibbs algorithm; understand Markov stochastic processes and Markov chains.
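The Gibbs sampling procedure is easiest to see on a target whose conditionals are known in closed form. The sketch below samples a standard bivariate normal with correlation `rho`, where x given y is N(rho * y, 1 - rho^2) and symmetrically for y. The target, the burn-in length, and the seed are assumptions for illustration.

```python
import random


def gibbs_bivariate_normal(rho, n_samples, burn_in=500, seed=0):
    """Gibbs sampler for a standard bivariate normal with correlation rho."""
    rng = random.Random(seed)
    cond_sd = (1 - rho ** 2) ** 0.5
    x = y = 0.0
    samples = []
    for t in range(burn_in + n_samples):
        # Resample each coordinate from its conditional given the other.
        x = rng.gauss(rho * y, cond_sd)
        y = rng.gauss(rho * x, cond_sd)
        if t >= burn_in:
            samples.append((x, y))
    return samples


def correlation(samples):
    """Sample Pearson correlation of a list of (x, y) pairs."""
    n = len(samples)
    mx = sum(x for x, _ in samples) / n
    my = sum(y for _, y in samples) / n
    sxy = sum((x - mx) * (y - my) for x, y in samples)
    sxx = sum((x - mx) ** 2 for x, _ in samples)
    syy = sum((y - my) ** 2 for _, y in samples)
    return sxy / (sxx * syy) ** 0.5


samples = gibbs_bivariate_normal(rho=0.8, n_samples=20000)
print(correlation(samples))  # should be close to 0.8
```

Each state depends only on the previous one, so the sequence of samples is a Markov chain whose stationary distribution is the target; that is the link to Markov chains mentioned above.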
7. Graphical models
Graphical models have been very popular in recent years. They are also important because they subsume much earlier work and are intuitive: CRFs, HMMs, and topic models are all applications and special cases of graphical models.
A. Understand the general representation of graphical models (directed and undirected graphical models), general learning algorithms, and inference algorithms such as the sum-product algorithm and propagation algorithms;
B. Be familiar with the HMM, including its assumptions and the forward and backward algorithms;
C. Be familiar with the LDA model, including its graphical model representation and its Gibbs sampling inference algorithm; mastering the variational inference algorithm is not required;
D. Understand the CRF model, mainly its graphical model representation; if you have the time and interest, you can also study its inference algorithm;
E. Understand the connections and differences between HMM, LDA, and CRF and the general representation, learning algorithms, and inference algorithms of graphical models;
F. Get a preliminary understanding of Markov Logic Networks (MLN), a language built on graphical models and first-order logic that can describe many practical problems; this also helps in understanding graphical models;
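As one concrete instance of inference in a graphical model, the HMM forward algorithm from item B can be sketched as follows. The weather-style states, symbols, and probabilities are hypothetical.

```python
def forward(obs, states, start_p, trans_p, emit_p):
    """Return P(obs) under an HMM using the forward algorithm."""
    # alpha[s] = P(observations so far, current state = s)
    alpha = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
    for o in obs[1:]:
        alpha = {
            s: emit_p[s][o] * sum(alpha[r] * trans_p[r][s] for r in states)
            for s in states
        }
    return sum(alpha.values())


states = ("Rainy", "Sunny")
symbols = ("walk", "shop", "clean")
start_p = {"Rainy": 0.6, "Sunny": 0.4}
trans_p = {
    "Rainy": {"Rainy": 0.7, "Sunny": 0.3},
    "Sunny": {"Rainy": 0.4, "Sunny": 0.6},
}
emit_p = {
    "Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
    "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1},
}

print(forward(["walk", "shop", "clean"], states, start_p, trans_p, emit_p))
```

The recursion sums over the previous state at every step, so the cost is linear in the sequence length rather than exponential in it. This is the sum-product idea specialized to a chain.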
8. Topic models
The ideas behind topic models are very widely applied. If you are short of time there is no need to read everything; the recommendations are as follows:
A. Understand pLSA and LDA in depth, along with the connections and differences between them. Once these two models are understood, most topic model papers become readable, especially those applying topic models to NLP. You can also design the non-hierarchical topic models you need yourself.
B. To go deeper, continue with the hLDA model, and in particular the mathematics behind it, the Dirichlet process, so that you can design your own hierarchical topic models;
C. For supervised topic models, be sure to understand sLDA and LLDA. The two embody completely different design ideas; once you grasp them, you can design the topic model you need;
D. For understanding all of these models, the Gibbs sampling algorithm is a hurdle that cannot be avoided;
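Since Gibbs sampling is the common hurdle for these models, a compact sketch of a collapsed Gibbs sampler for LDA may help. It assumes symmetric Dirichlet priors `alpha` and `beta`, documents given as lists of integer word ids, and hypothetical toy data; it is an illustration of the sampling loop, not a tuned implementation.

```python
import random


def lda_gibbs(docs, n_topics, vocab_size, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA with symmetric priors."""
    rng = random.Random(seed)
    n_dk = [[0] * n_topics for _ in docs]                # document-topic counts
    n_kw = [[0] * vocab_size for _ in range(n_topics)]   # topic-word counts
    n_k = [0] * n_topics                                 # tokens per topic

    def update(d, k, w, delta):
        n_dk[d][k] += delta
        n_kw[k][w] += delta
        n_k[k] += delta

    # Random initial topic assignment for every token.
    z = []
    for d, doc in enumerate(docs):
        z.append([rng.randrange(n_topics) for _ in doc])
        for w, k in zip(doc, z[d]):
            update(d, k, w, +1)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token, then resample its topic from the
                # conditional given all other assignments.
                update(d, z[d][i], w, -1)
                weights = [
                    (n_dk[d][t] + alpha)
                    * (n_kw[t][w] + beta)
                    / (n_k[t] + vocab_size * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = k
                update(d, k, w, +1)
    return z, n_dk, n_kw


# Toy corpus: word ids 0-2 and 3-5 never co-occur in a document.
docs = [[0, 1, 0, 1, 2], [3, 4, 3, 4, 5], [0, 1, 2, 1, 0], [3, 5, 4, 4, 3]]
z, n_dk, n_kw = lda_gibbs(docs, n_topics=2, vocab_size=6)
print(n_dk)
```

The sampler only ever touches count tables, which is why the same skeleton carries over to many LDA variants: changing the model mostly changes the expression for `weights`.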
9. Optimization and stochastic processes
A. Understand optimization problems with equality constraints and their solution by the method of Lagrange multipliers;
B. Understand convex optimization problems with inequality constraints, and understand the simplex method;
C. Understand gradient descent and simulated annealing;
D. Understand the ideas behind optimization methods such as hill climbing;
E. For stochastic processes, know random walks, queuing theory, and other basic stochastic processes (they appear in papers occasionally, but not often), and understand Markov stochastic processes (very important; they are used throughout sampling theory);
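Of the methods in this section, gradient descent is the simplest to write down. A minimal sketch with a fixed step size, on a hypothetical quadratic objective whose minimum is known to be at (3, -1):

```python
def gradient_descent(grad, x0, lr=0.1, steps=200):
    """Plain gradient descent with a fixed learning rate."""
    x = list(x0)
    for _ in range(steps):
        g = grad(x)
        # Step against the gradient.
        x = [xi - lr * gi for xi, gi in zip(x, g)]
    return x


def grad_f(v):
    """Gradient of f(x, y) = (x - 3)^2 + 2 * (y + 1)^2."""
    x, y = v
    return [2 * (x - 3), 4 * (y + 1)]


x_min = gradient_descent(grad_f, [0.0, 0.0])
print(x_min)  # close to [3.0, -1.0]
```

Hill climbing follows the same pattern without gradients (try a neighbor, keep it if it is better), and simulated annealing additionally accepts some worse moves so that it can escape local optima.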
10. Bayesian learning
More and more methods and models now process data using ideas from the Bayesian school, so it is necessary to understand the relevant material.
A. Understand the differences and connections between the Bayesian school and the classical (frequentist) statistical school in thinking and in principle;
B. Understand loss functions and their role in Bayesian learning; remember the commonly used loss functions;
C. Understand the concept of a Bayesian prior and the four commonly used methods of choosing one;
D. Understand the concepts of parameters and hyperparameters, and the difference between them;
E. Use the choice of prior in LDA (or another model) to understand the Bayesian way of processing data;
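The role of the prior in LDA can be seen in its simplest form with a Dirichlet prior on a multinomial: because the prior is conjugate, the posterior is again a Dirichlet whose parameters are the prior parameters plus the observed counts. A sketch with hypothetical counts:

```python
def dirichlet_posterior_mean(alpha, counts):
    """Posterior mean of multinomial parameters under a Dirichlet(alpha) prior."""
    # Conjugacy: the posterior is Dirichlet(alpha + counts).
    post = [a + c for a, c in zip(alpha, counts)]
    total = sum(post)
    return [p / total for p in post]


counts = [7, 3, 0]  # hypothetical word counts for three words

# The third word was never seen, but the prior keeps its probability above zero.
smooth = dirichlet_posterior_mean([1.0, 1.0, 1.0], counts)
sparse = dirichlet_posterior_mean([0.01, 0.01, 0.01], counts)
print(smooth)
print(sparse)
```

Here the probabilities being estimated are the parameters, while the `alpha` values that shape the prior are the hyperparameters. A smaller `alpha` lets the data dominate and yields sparser estimates.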
11. Information retrieval models and tools
A. Understand the commonly used retrieval models;
B. Know the commonly used open source tools (Lemur, Lucene, etc.);
12. Model selection and feature selection
A. Understand the commonly used feature selection methods, so that you can select effective features for training a model;
B. Study several examples of model selection to learn how to choose a suitable model (this can only be learned from examples);
13. Tricks in paper writing
There are many tricks; here is just one.
Whenever you come across a paper together with its peer reviews, think them over carefully. This helps a great deal in improving your writing.

You may study the models and algorithms above and later forget them. In my opinion that does not matter: a second look brings them back quickly.

Xianling Mao, Search Engine & Web Mining Group


Reposted from: http://blog.csdn.net/xianlingmao/article/details/7667042
