How to get started with natural language processing

Source: Internet
Author: User

P.S. The author will keep updating this post ~

Overview of the Field and Its Branches

As the saying goes:

The most important thing when studying a subject or learning a skill is to know your own work thoroughly: you should be able to explain in three minutes what you are doing and what your contribution is, and convince others that the work is meaningful.

So let me sort out the branches of natural language processing and its related fields ~

Natural language processing includes many branches, mainly:

Machine translation, automatic summarization, information retrieval, document classification, question answering, information filtering, information extraction, text mining, speech recognition, and so on.

Many of these branches overlap, and you can specialize in one according to your own interests. My own direction runs from AI to machine learning to text mining within natural language processing (NLP).

So what are the applications of machine learning in text mining?

(1) Topic recognition

Topic recognition is a form of text classification. A common experimental example is classifying news text into categories such as finance, education, sports, and entertainment. The main methods I use today are word2vec and bag of words. With word2vec, or "word vectors", we decide whether a new text belongs to a class by computing features such as the position, part of speech, and frequency of its morphemes. For example, to tell whether a text is a comment or a news article, one approach is to note that comment sentences contain more modal words and exclamation marks, and that their positions are not fixed. Bag of words is used more often in topic models. It computes the probability of each word appearing in each category, finds the words that carry the most category information using TF-IDF, information gain, or probability, and classifies a text by how strongly those words co-occur in it.
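The bag-of-words approach can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the training data is made up, each category is treated as one "document" when computing TF-IDF, and the names TRAIN, tfidf_weights, and classify are hypothetical, not taken from any particular library.

```python
import math
from collections import Counter

# Toy, made-up training data: (category, tokenized text).
TRAIN = [
    ("finance", "stock market bank shares news".split()),
    ("finance", "bank interest rate market".split()),
    ("sports", "team match goal news".split()),
    ("sports", "player team score match".split()),
]

def category_word_counts(train):
    """Build one bag of words per category: category -> Counter of word counts."""
    bags = {}
    for label, tokens in train:
        bags.setdefault(label, Counter()).update(tokens)
    return bags

def tfidf_weights(bags):
    """Weight every word in every category by a smoothed TF-IDF score.

    Assumption: each category is treated as a single document.
    """
    n_categories = len(bags)
    # Document frequency: in how many categories does each word occur?
    df = Counter(word for bag in bags.values() for word in bag)
    weights = {}
    for label, bag in bags.items():
        total = sum(bag.values())
        weights[label] = {
            word: (count / total)
            * (math.log((1 + n_categories) / (1 + df[word])) + 1.0)
            for word, count in bag.items()
        }
    return weights

def classify(tokens, weights):
    """Score a text against each category by summing the weights of its words."""
    scores = {
        label: sum(word_weights.get(tok, 0.0) for tok in tokens)
        for label, word_weights in weights.items()
    }
    return max(scores, key=scores.get), scores

WEIGHTS = tfidf_weights(category_word_counts(TRAIN))
label, scores = classify("the bank raised its interest rate".split(), WEIGHTS)
print(label, scores)
```

Words that occur in every category (here "news") receive a lower weight than words specific to one category, which is the effect the TF-IDF step is meant to achieve.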

(2) Sentiment classification

Sentiment analysis is the analysis of users' attitudes. Most sentiment analysis systems today perform binary classification, that is, they only determine whether a text is positive or negative; some systems perform three-way classification by adding a neutral class. For example, suppose I want to analyze users' attitudes toward the 2014 MH370 incident. I only need to find the texts on that topic, judge the polarity of the sentiment words with a tool such as the National Taiwan University Sentiment Dictionary (NTUSD), and then combine the frequency and intensity of the sentiment words according to a set of rules to judge the sentiment of each text.

This method, however, cannot identify which facet of the subject is being evaluated. For example, given 1 million Xiaomi phone reviews, the method above tells me what proportion of users are dissatisfied with the phone, but not which facets they are dissatisfied with (such as appearance or performance) or in what proportions. The common approach is to build a seed dictionary of words related to the Xiaomi phone, use it to find the facets in a user's comment, build a syntax tree to find the predicate and the modifying adverbs attached to each facet, quantify sentiment polarity with the sentiment dictionary, and finally feed the quantified facets, modifiers, and degree adverbs into an SVM for text classification. Naive Bayes is not appropriate here, because in multi-facet, multi-class classification it is prone to overfitting.
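The dictionary-and-rules idea can be sketched roughly as follows. Everything here is an assumption made for illustration: the tiny POLARITY, DEGREE, NEGATION, and FACETS lexicons are invented (a real system would load a sentiment dictionary and a seed dictionary), and a simple two-token look-back plus clause-level matching stands in for the syntax tree and the SVM mentioned in the text.

```python
# Hypothetical mini-lexicons, invented for this sketch.
POLARITY = {"good": 1.0, "great": 1.0, "bad": -1.0, "slow": -1.0, "terrible": -1.0}
DEGREE = {"very": 2.0, "extremely": 3.0, "slightly": 0.5}
NEGATION = {"not", "never"}
# Hypothetical seed dictionary: facet word -> facet name.
FACETS = {
    "screen": "appearance",
    "design": "appearance",
    "battery": "performance",
    "speed": "performance",
}

def sentiment_score(tokens):
    """Sum the polarity of sentiment words, adjusted by nearby modifiers."""
    score = 0.0
    for i, tok in enumerate(tokens):
        if tok not in POLARITY:
            continue
        value = POLARITY[tok]
        # Rule: look back at up to two tokens for degree adverbs and negations.
        for prev in tokens[max(0, i - 2):i]:
            if prev in DEGREE:
                value *= DEGREE[prev]
            elif prev in NEGATION:
                value = -value
        score += value
    return score

def classify_polarity(tokens, neutral_band=0.5):
    """Three-way classification: positive, negative, or neutral."""
    score = sentiment_score(tokens)
    if score > neutral_band:
        return "positive"
    if score < -neutral_band:
        return "negative"
    return "neutral"

def facet_scores(clauses):
    """Attribute each clause's score to the facets mentioned in that clause."""
    totals = {}
    for clause in clauses:
        score = sentiment_score(clause)
        for facet in {FACETS[tok] for tok in clause if tok in FACETS}:
            totals[facet] = totals.get(facet, 0.0) + score
    return totals

review = ["the design is great".split(), "but the battery is very bad".split()]
print(classify_polarity("the battery is not good".split()))
print(facet_scores(review))
```

The per-facet totals are the kind of quantified features that could then be passed to a classifier such as an SVM.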

(3) Named entity recognition

Named entity recognition means having the computer automatically recognize words it does not already know. For example, in "Hu Ge is very nice!", how can a computer know that the name "Hu Ge" is one word, and that "Ge" (literally "song") should not be split off as a word of its own? A name like "Hu Ge" is unlikely to appear in most lexicons, so how can the machine recognize it and choose the segmentation most likely to be correct? I think that of all the available methods, CRF works best, even better than HMM. A CRF, or conditional random field, records the state of each feature in the training data together with the states of the surrounding features, and when several features occur together it finds the most likely state of each feature within the combination. In other words, the CRF rests on the idea that "birds of a feather flock together": the contexts in which most words appear are regular, not random. When selecting features, character-level units work significantly better than word-level units, because we understand a named entity's name character by character. For example, we read the name "Chen Xiaochun" as "Chen/Xiao/chun", not as "Chen/Xiaochun" or "Chen Xiao/chun".
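To make the character-level features concrete, here is a minimal sketch of how the inputs to a CRF might be prepared. It is an illustration only: each list element stands for one Chinese character (romanized here), the feature names and the BIO labeling scheme are assumptions chosen for the example, and the CRF training step itself is left out.

```python
def char_features(chars, i):
    """Features for the character at position i, including its neighbors."""
    prev_char = chars[i - 1] if i > 0 else "<BOS>"
    next_char = chars[i + 1] if i < len(chars) - 1 else "<EOS>"
    return {
        "bias": 1.0,
        "char": chars[i],
        # Surrounding context: a CRF relies on the state of neighboring units.
        "prev_char": prev_char,
        "next_char": next_char,
        "prev_bigram": prev_char + "|" + chars[i],
        "next_bigram": chars[i] + "|" + next_char,
    }

def sentence_features(chars):
    """One feature dict per character in the sentence."""
    return [char_features(chars, i) for i in range(len(chars))]

def bio_labels(chars, entities):
    """Convert (start, end_exclusive, type) spans into per-character BIO tags."""
    labels = ["O"] * len(chars)
    for start, end, entity_type in entities:
        labels[start] = "B-" + entity_type
        for j in range(start + 1, end):
            labels[j] = "I-" + entity_type
    return labels

# Each element is one character of a hypothetical sentence, romanized.
chars = ["Chen", "Xiao", "chun", "sings"]
X = sentence_features(chars)
y = bio_labels(chars, [(0, 3, "PER")])
print(y)
```

Per-position feature dictionaries paired with per-position labels are the kind of input a CRF toolkit typically trains on; the trained model would then predict the tag sequence for unseen sentences.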

(4) Recommender systems

The value of text mining in a recommender system lies in computing the weights of feature words. Suppose we want to recommend a new book to users. We can model the problem as follows. First, find all the feature words in users' comments about books and build a feature dictionary. Next, use text analysis and time-series analysis to compute the weight of each feature word, combining the content of a user's comments with how closely they are clustered in time; the weight indicates how much that user cares about the feature. Then build an inverted index over the user-feature weight table, so that for each feature word we can look up every user who commented on it and their weights. Finally, use the features of the book to be recommended to retrieve a list of candidate users, pick the users with the highest weights, and recommend the book to them.
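The inverted-index step might look like the sketch below. The user names, feature words, and weights are invented for illustration, and the weights are assumed to have already been computed by the text and time-series analysis described above; build_inverted_index and recommend_users are hypothetical helper names.

```python
from collections import defaultdict

# Hypothetical user -> {feature word: weight} table, assumed precomputed.
USER_FEATURE_WEIGHTS = {
    "alice": {"plot": 0.9, "characters": 0.4},
    "bob": {"plot": 0.2, "illustrations": 0.8},
    "carol": {"characters": 0.7, "illustrations": 0.3},
}

def build_inverted_index(user_weights):
    """Invert the table: feature word -> list of (user, weight)."""
    index = defaultdict(list)
    for user, features in user_weights.items():
        for feature, weight in features.items():
            index[feature].append((user, weight))
    return index

def recommend_users(book_features, index, top_n=2):
    """Rank users by the total weight they give to the book's features."""
    scores = defaultdict(float)
    for feature in book_features:
        for user, weight in index.get(feature, []):
            scores[user] += weight
    # Highest total weight first; ties broken alphabetically by user name.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]

INDEX = build_inverted_index(USER_FEATURE_WEIGHTS)
top_users = recommend_users(["plot", "characters"], INDEX)
print([user for user, _ in top_users])
```

Looking up only the features of the book to be recommended avoids scanning every user's full profile, which is the point of building the index in the first place.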

At present, the main research areas of text mining are text structure analysis, text summarization, text classification, text clustering, text correlation analysis, distribution analysis, and trend prediction.

Required Background Knowledge

Next, let's talk about the background knowledge needed to study AI, machine learning, natural language processing (NLP), and text mining:

This applies not only to text mining but to the NLP field as a whole. Each subfield has subtle technical differences, but for an overall introduction there is no need to go that deep.

    • Statistical learning methods

      This is the most important. To study this field in depth you must know statistical learning methods. I recommend Professor Li Hang's "Statistical Learning Methods". It may feel dry at first, so it is best read alongside practical applications and code, which makes it both more effective and easier to understand.

    • The overall NLP pipeline

      You must know every step of this process by heart and know which method to use to solve which problem. This is the core of NLP and covers a great deal; I will explain it in detail in future blog posts.

    • A programming language

      In the end, working with computers means implementing things in code; ideas alone are not enough, so you must be proficient in one language. As most people know, the hottest language in machine learning right now is Python ~
