A 10-Minute Overview of Natural Language Processing
Bai Ningsu
September 23, 2016 00:24:12
Abstract: Recently, the natural language processing industry has been developing vigorously and finding wide application in the market. Since I began studying it, I have written many articles at varying levels of depth. Today, out of a certain need, I have gone through all of those articles and organized them into a single piece, which can also be called an overview. Each of these topics is covered in detail in its own blog post; this article only summarizes and organizes each part at a high level. (Original work; when reprinting, please cite the source: 10 minutes to learn natural language processing Overview)
1 What is text mining?
Text mining is a branch of information mining used for knowledge discovery from textual information. The preparation for text mining consists of three steps: text collection, text analysis, and feature pruning. At present, the most studied and most widely applied text mining techniques are document clustering, document classification, and abstract extraction.
2 What is natural language processing?
Natural language processing is an important direction in the fields of computer science and artificial intelligence. It studies the theories and methods that enable effective communication between humans and computers in natural language. It is a science that unifies linguistics, computer science, and mathematics.
The principle of natural language processing: formal description → mathematical modeling → algorithm design → program implementation → practical application.
Typical applications include automatic speech synthesis and recognition, machine translation, natural language understanding, dialogue systems, information retrieval, text classification, automatic summarization, and so on.
3 Commonly used Chinese word segmentation tools?
There are no spaces between words in Chinese text, so many operations on Chinese text involve word segmentation. Some Chinese word segmentation tools are listed here.
Stanford word segmenter (uses the CRF method directly; the feature window is 5)
Chinese word segmentation tool (personal recommendation)
HIT language cloud (Harbin Institute of Technology LTP cloud)
Discovering word segmenter
Pangu word segmenter
ICTCLAS lexical analysis system (Chinese Academy of Sciences)
IKAnalyzer (Java-based, under the Lucene project)
FudanNLP (Fudan University)
4 Part-of-speech tagging methods? Syntactic analysis methods?
Principle: annotate the sentences in an article, i.e., sentence labeling, using the BIO labeling scheme. The observation sequence X is the corpus (assume an article here; x denotes each sentence in the article, and X is the collection of all x), and the label sequence Y consists of the BIO tags, i.e., the labels corresponding to the sequence X, so that the correct sentence annotation can be inferred from the conditional probability P(labels | sentence).
Clearly this is a sequence-labeling setting: a CRF (conditional random field) is a probabilistic structured model for labeling or segmenting sequence data, and a CRF can be viewed as an undirected graphical model, i.e., a Markov random field. As used here, the CRF is a sequence labeling model, meaning it assigns a tag to each word in a word sequence. Typically, a small window is opened to the left and right of each word, and feature templates are extracted from the words inside the window together with the word to be tagged. Finally, the combination of features determines which tag should be assigned.
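As a minimal sketch of this feature-window idea (the 5-token window follows the Stanford segmenter note above; the template names, padding token, and example sentence are illustrative assumptions, not a specific tool's template set):

```java
import java.util.ArrayList;
import java.util.List;

/** A minimal sketch of CRF-style feature extraction: open a small window
 *  around position i and emit template features for the word to be tagged. */
public class CrfFeatures {
    static List<String> extract(String[] words, int i) {
        List<String> feats = new ArrayList<>();
        for (int d = -2; d <= 2; d++) {                       // 5-token window
            int j = i + d;
            String w = (j >= 0 && j < words.length) ? words[j] : "<PAD>";
            feats.add("W[" + d + "]=" + w);                   // unigram template
        }
        String prev = (i > 0) ? words[i - 1] : "<PAD>";
        feats.add("W[-1,0]=" + prev + "|" + words[i]);        // bigram template
        return feats;
    }

    public static void main(String[] args) {
        String[] sentence = {"I", "love", "China"};
        System.out.println(extract(sentence, 1));             // features for "love"
    }
}
```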
5 Named entity recognition? Three mainstream approaches: CRF, the dictionary method, and hybrid methods
1 CRF: in CRF-based Chinese NER, most of the extracted features are boolean ones such as whether the character is a Chinese surname or whether the character commonly appears in Chinese given names. A reliable surname table is therefore very important. In many experiments by domestic scholars, the best-performing category, person names, can reach an F1 measure of 90%, while the worst, organization names, reaches 85%.
2 Dictionary method: in NER, each character is treated as the start of a word and looked up in a trie; whatever is found is an NE. The children of a Chinese trie node need to be hashed, because there are far too many Chinese characters, unlike the 26 letters of English. (See the trie sketch after this list.)
3 The six different types of named entities are handled differently; for example, for person names, character-level conditional probabilities are computed. Tools: HIT (language cloud) and Shanghai Jiao Tong University's toolkit; for English, Stanford NER, etc.
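A minimal trie sketch along the lines of the dictionary method above (children are kept in a HashMap because the Chinese character set is far larger than 26 letters; the two sample entities are arbitrary examples):

```java
import java.util.HashMap;
import java.util.Map;

/** A minimal trie for the dictionary method: insert known NEs, then
 *  look up the longest entity starting at each position in the text. */
public class NerTrie {
    Map<Character, NerTrie> children = new HashMap<>(); // hashed children
    boolean isEntity;                                   // marks the end of a known NE

    void insert(String word) {
        NerTrie node = this;
        for (char c : word.toCharArray())
            node = node.children.computeIfAbsent(c, k -> new NerTrie());
        node.isEntity = true;
    }

    /** Starting at position start, return the longest known NE, or null. */
    String longestMatch(String text, int start) {
        NerTrie node = this;
        String best = null;
        for (int i = start; i < text.length(); i++) {
            node = node.children.get(text.charAt(i));
            if (node == null) break;
            if (node.isEntity) best = text.substring(start, i + 1);
        }
        return best;
    }

    public static void main(String[] args) {
        NerTrie trie = new NerTrie();
        trie.insert("北京");
        trie.insert("北京大学");
        System.out.println(trie.longestMatch("北京大学真大", 0)); // prints 北京大学
    }
}
```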
7 A study on syntactic recognition of traditional Chinese medicine (TCM) documents based on active learning
7.1 corpus knowledge?
A corpus is a collection of language material of a certain scale that is specially collected for one or more application goals, has a certain structure, is representative, and can be retrieved by computer programs.
Corpus classification: ① by time ② by processing depth: annotated vs. unannotated corpora ③ by structure ④ by language ⑤ by degree of dynamic updating: reference corpora vs. monitor corpora
Corpus construction principles: ① representativeness ② structure ③ balance ④ scale ⑤ metadata
Advantages and disadvantages of corpus annotation:
① Advantages: easy to study, reusable, multifunctional, and explicit in its analysis.
② Disadvantages: the corpus becomes less objective (manual annotation has high accuracy but poor consistency, while automatic or semi-automatic annotation has high consistency but poor accuracy), annotations can be inconsistent, and accuracy is low.
7.2 Solving the labeling problem with conditional random fields?
Conditional random fields are used for sequence labeling, Chinese word segmentation, Chinese person-name recognition, and ambiguity resolution, and they perform well in natural language processing. The principle: build a conditional probability model over the given observation sequence and label sequence. Conditional random fields can be applied to different prediction problems, and their learning method is usually maximum likelihood estimation.
Use "I love China" as a sequence labeling case to explain conditional random fields. (The rule-based model vs. statistical model question.)
The conditional random field model also needs to solve three basic problems: feature selection (e.g., a feature indicating that the i-th observation is "love" and the corresponding labels y_i, y_{i-1} are B, I), parameter training, and decoding.
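In standard notation (a generic linear-chain form, not tied to any particular implementation), the conditional probability such a model defines is:

```latex
P(y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_{i} \sum_{k} \lambda_k \, f_k(y_{i-1}, y_i, x, i) \Big),
\qquad
Z(x) = \sum_{y'} \exp\Big( \sum_{i} \sum_{k} \lambda_k \, f_k(y'_{i-1}, y'_i, x, i) \Big)
```

Feature selection chooses the functions f_k, parameter training estimates the weights λ_k (by maximum likelihood, as noted above), and decoding finds the label sequence y that maximizes P(y | x).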
7.3 Hidden Markov model
Applications: part-of-speech tagging, speech recognition, shallow parsing, chunking, named entity recognition, information extraction, and so on. HMMs are also used in natural science, engineering, biotechnology, public utilities, channel coding, and many other fields.
Markov chain: in a stochastic process, the occurrence probabilities of the language symbols are not independent of one another; the current state of each random trial depends on the previous state. Such a chain is a Markov chain.
Higher-order Markov chains: consider the influence of the preceding language symbol on the occurrence probability of the following one; the resulting chain of language components is called a first-order Markov chain, which is also the bigram model. A second-order Markov chain is likewise the trigram model, and a third-order Markov chain the 4-gram model.
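As a sketch, the first-order (bigram) case factors the probability of a symbol sequence as:

```latex
P(w_1 w_2 \cdots w_n) \approx \prod_{i=1}^{n} P(w_i \mid w_{i-1})
```

Conditioning each symbol on the previous two gives the trigram model, and so on.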
Three fundamental problems in hidden Markov models
Problem 1 (likelihood): given an HMM λ = (A, B) and an observation sequence O, determine the likelihood P(O | λ). (Solved by the forward algorithm)
Problem 2 (decoding): given an observation sequence O and an HMM λ = (A, B), find the best hidden state sequence Q. (Solved by the Viterbi algorithm)
Problem 3 (learning): given an observation sequence O and the set of states in the HMM, automatically learn the HMM parameters A and B. (Solved by the forward-backward algorithm)
7.4 Viterbi algorithm decoding
Idea:
1 Compute the Viterbi probabilities at time step 1
2 Compute the Viterbi probabilities at time step 2, based on (1)
3 Compute the Viterbi probabilities at time step 3, based on (2)
4 Backtrack the Viterbi path in reverse
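A minimal sketch of these steps (the two-state tag set, transition/emission values, and observation ids in main are toy assumptions):

```java
import java.util.Arrays;

/** A minimal Viterbi decoder for an HMM with initial distribution pi,
 *  transition matrix A, and emission matrix B. */
public class Viterbi {
    /** Returns the most likely hidden state sequence for the observations. */
    static int[] decode(double[] pi, double[][] a, double[][] b, int[] obs) {
        int n = a.length, t = obs.length;
        double[][] v = new double[t][n];        // Viterbi probabilities
        int[][] back = new int[t][n];           // backpointers
        for (int s = 0; s < n; s++)             // step 1: initialize at time 0
            v[0][s] = pi[s] * b[s][obs[0]];
        for (int i = 1; i < t; i++)             // steps 2..t: recurse, taking the MAX
            for (int s = 0; s < n; s++)         // (the forward algorithm would SUM here)
                for (int p = 0; p < n; p++) {
                    double cand = v[i - 1][p] * a[p][s] * b[s][obs[i]];
                    if (cand > v[i][s]) { v[i][s] = cand; back[i][s] = p; }
                }
        int best = 0;                           // final step: backtrack the path
        for (int s = 1; s < n; s++) if (v[t - 1][s] > v[t - 1][best]) best = s;
        int[] path = new int[t];
        path[t - 1] = best;
        for (int i = t - 1; i > 0; i--) path[i - 1] = back[i][path[i]];
        return path;
    }

    public static void main(String[] args) {
        // Toy 2-state POS example: state 0 = noun, 1 = verb; observations are word ids.
        double[] pi = {0.6, 0.4};
        double[][] a = {{0.7, 0.3}, {0.4, 0.6}};
        double[][] b = {{0.5, 0.4, 0.1}, {0.1, 0.3, 0.6}};
        System.out.println(Arrays.toString(decode(pi, a, b, new int[]{0, 1, 2})));
    }
}
```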
The difference between the Viterbi algorithm and the forward algorithm:
(1) The Viterbi algorithm takes the maximum over the probabilities of the previous paths, whereas the forward algorithm computes their sum; in all other respects, the Viterbi algorithm is the same as the forward algorithm.
(2) The Viterbi algorithm keeps backpointers in order to recover the hidden state path, whereas the forward algorithm has no backpointers.
HMMs together with the Viterbi algorithm solve the statistical part-of-speech tagging problem; the Viterbi algorithm can likewise be used for Chinese syntactic annotation.
7.5 Sequence labeling methods: refer to the POS tagging discussion above
7.6 Model Evaluation Method
Model: method = model + strategy + algorithm
Model issues involve training error, test error, overfitting, and so on. The ability of a learning method to predict unknown data is often called its generalization ability.
Model evaluation parameters:
Precision P = number identified correctly / total number identified
Error rate = number identified incorrectly / total number identified
Accuracy = number of positives identified correctly / number identified correctly
Recall R = number identified correctly / total number of correct items (identified + unidentified)
F-measure = 2PR / (P + R)
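In the standard true/false-positive notation, an equivalent restatement of precision, recall, and the F-measure is:

```latex
P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad F = \frac{2PR}{P + R}
```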
Balanced positive and negative data suit the precision measure; imbalanced data suit recall, accuracy, and the F-measure.
Several model evaluation methods:
K-fold cross-validation, random subsampling, and ROC curves for evaluating and comparing two models
8 A vocabulary construction system for the postgraduate English degree examination (GET) based on text-processing technology
Complete the extraction of core words from the 17 sets of real GET exam papers from 2002 to 2010. This involves data cleaning, stop-word handling, word segmentation, frequency statistics, sorting, and other common methods. The real exam papers are structured data with certain regularities and are relatively easy to process. (This process is in fact a data-cleaning process.) Finally, all words are aggregated, and stop words such as a/an/of/on/first are removed (Chinese text processing likewise needs stop-word removal, e.g., 的, 地, 是, etc.). The processed words are deduplicated and their frequencies counted, and finally an online tool is used for English-Chinese translation. The words are then sorted by frequency.
8.1 Apache Tika?
The strength of the Apache Tika content extraction tool is that it can handle all kinds of files, saving you time for more important work.
Tika is a content analysis tool that comes with a comprehensive set of parser classes and can parse basically all common file formats.
Tika features: • document type detection • content extraction • metadata extraction • language detection
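A minimal sketch of the first two features using Tika's facade class (the input file name "get2002.doc" is a hypothetical example):

```java
import java.io.File;
import org.apache.tika.Tika;

/** A minimal sketch: detect a document's type and extract its plain text. */
public class TikaExtract {
    public static void main(String[] args) throws Exception {
        Tika tika = new Tika();
        File f = new File("get2002.doc");
        System.out.println(tika.detect(f));         // document type detection
        System.out.println(tika.parseToString(f));  // content extraction
    }
}
```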
8.2 Text word frequency statistics? Word frequency sorting method?
Algorithm idea:
1 Collect the real exam papers from each year (2002-2010); the documents come in different formats and were gathered online.
2 Convert the documents of all formats into TXT, normalize them (removing non-English content such as Chinese characters, punctuation, spaces, etc.), and remove stop words (891 stop words are removed).
3 After cleaning, deduplicate the words and count their frequencies, using a map for the counts and an entity for storage: word → word frequency. (Arrays could also be used, except that with particularly large data, arrays run into out-of-bounds problems.) Sorting: by word frequency or alphabetically.
4 Extract the core vocabulary: words occurring more than 5 and fewer than 25 times (the thresholds can be adjusted). While traversing the List<Entity>, select the vocabulary by reading each entity's word-frequency attribute. (See the sketch after this list.)
5 The last step: Chinese-English translation.
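A minimal sketch of steps 3-4 (the token list stands in for the cleaned, stop-word-free words; the lower threshold is reduced here so the tiny example produces output, where the text uses the 5..25 range):

```java
import java.util.*;

/** A minimal word-frequency counter with core-word filtering and sorting. */
public class WordFreq {
    public static void main(String[] args) {
        List<String> words = Arrays.asList("exam", "word", "exam", "test", "word", "exam");
        Map<String, Integer> freq = new HashMap<>();            // word -> frequency
        for (String w : words) freq.merge(w, 1, Integer::sum);  // count
        freq.entrySet().stream()
            .filter(e -> e.getValue() > 1 && e.getValue() < 25) // threshold (text uses 5..25)
            .sorted((x, y) -> y.getValue() - x.getValue())      // sort by frequency, descending
            .forEach(e -> System.out.println(e.getKey() + "\t" + e.getValue()));
    }
}
```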
9 Design and implementation of a text classifier based on the naive Bayes model
9.1 Naive Bayes formula
Categories: 0: joy 1: anger 2: disgust 3: depression
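In symbols, with each term defined as in the walkthrough under 9.2 below, the formula is the standard Bayes rule applied to documents:

```latex
P(c \mid d) = \frac{P(c)\, P(d \mid c)}{P(d)}, \qquad
c(d) = \arg\max_{c_i} \; P(c_i)\, P(d \mid c_i)
```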
9.2 Naive Bayes principle
--Preprocess the training text and construct the classifier. (That is, solve for the parameter values of the Bayes formula for text classification; it is fine not to understand this yet, details below.)
--Construct the prediction/classification function
--Preprocess the test data
--Classify with the classifier
For a new document d, which of the above four categories does it belong to? We can instantiate the Bayes formula just given for this specific case.
> P(category | document): the probability that the test document belongs to a given category
> P(category): the probability that a document d drawn at random from the document space belongs to category c (number of documents in the category / total number of documents)
> P(document | category): the probability of document d given category c (number of occurrences of the document's words in the category / total number of words in the category)
> P(document): the probability of drawing document d at random from the document space (it is the same for every category, so it can be ignored; what remains is then the maximum likelihood probability)
> c(d) = argmax { P(c_i) * P(d | c_i) }: compute the approximate Bayes probability for each category, compare them, and assign the document to the category with the maximum probability; classification succeeds.
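A minimal multinomial naive Bayes sketch along these lines (the toy documents, English tokens, and add-one smoothing are assumptions; log probabilities are used to avoid underflow):

```java
import java.util.*;

/** A minimal multinomial naive Bayes classifier: train counts per class,
 *  then pick c(d) = argmax_c P(c) * prod P(w|c), with add-1 smoothing. */
public class NaiveBayes {
    Map<Integer, Integer> docCount = new HashMap<>();               // class -> #docs
    Map<Integer, Map<String, Integer>> wordCount = new HashMap<>(); // class -> word -> count
    Map<Integer, Integer> totalWords = new HashMap<>();             // class -> total words
    Set<String> vocab = new HashSet<>();
    int totalDocs = 0;

    void train(int c, String[] words) {
        totalDocs++;
        docCount.merge(c, 1, Integer::sum);
        Map<String, Integer> wc = wordCount.computeIfAbsent(c, k -> new HashMap<>());
        for (String w : words) {
            wc.merge(w, 1, Integer::sum);
            totalWords.merge(c, 1, Integer::sum);
            vocab.add(w);
        }
    }

    int classify(String[] words) {
        int best = -1;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int c : docCount.keySet()) {
            double score = Math.log(docCount.get(c) / (double) totalDocs); // log P(c)
            for (String w : words) {
                int cnt = wordCount.get(c).getOrDefault(w, 0);
                score += Math.log((cnt + 1.0) / (totalWords.get(c) + vocab.size())); // log P(w|c)
            }
            if (score > bestScore) { bestScore = score; best = c; }
        }
        return best;
    }

    public static void main(String[] args) {
        NaiveBayes nb = new NaiveBayes();
        nb.train(0, new String[]{"happy", "great", "wonderful"});        // 0: joy
        nb.train(1, new String[]{"angry", "furious", "hate"});           // 1: anger
        System.out.println(nb.classify(new String[]{"great", "happy"})); // expect 0
    }
}
```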
Review
1. Collect and process the data set in advance (involves web crawling, Chinese word segmentation, and feature selection)
2. Preprocessing: (remove stop words; remove words whose frequency is too low, depending on the situation)
3. Experimental process:
Split the data set into two parts (3:7): 30% as the test set and 70% as the training set
To increase confidence: 10-fold cross-validation (divide the whole data set into 10 equal parts, merge 9 of them into the training set, and use the remaining 1 as the test set; run 10 times and take the average as the classification result), then compare and analyze the strengths and weaknesses. (A split sketch follows.)
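A minimal 10-fold split sketch (the integer list stands in for the preprocessed documents; the fixed random seed is an assumption for reproducibility):

```java
import java.util.*;

/** A minimal 10-fold cross-validation split: each fold serves once as the
 *  test set while the other 9 folds form the training set. */
public class CrossValidation {
    public static void main(String[] args) {
        List<Integer> data = new ArrayList<>();
        for (int i = 0; i < 100; i++) data.add(i);
        Collections.shuffle(data, new Random(42));   // shuffle once for random folds
        int k = 10, foldSize = data.size() / k;
        for (int fold = 0; fold < k; fold++) {
            List<Integer> test = data.subList(fold * foldSize, (fold + 1) * foldSize);
            List<Integer> train = new ArrayList<>(data);
            train.removeAll(test);                   // remaining 9 folds = training set
            // ... train the classifier on `train`, evaluate on `test`, average the results
        }
    }
}
```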
4. Evaluation criteria:
Macro-evaluation & micro-evaluation
Smoothing factor
9.3 The difference between generative models and discriminative models
1) Generative models: model the joint distribution directly, e.g., hidden Markov models, Markov random fields, etc.
2) Discriminative models: model the conditional distribution, e.g., conditional random fields, support vector machines, logistic regression.
Advantages of generative models: 1) they model the joint distribution; 2) their convergence rate is faster; 3) they can handle hidden variables. Disadvantage: accurate estimation requires large sample sizes and heavy computation, so they are not recommended when the number of samples is large.
Advantages of discriminative models: 1) they require less computation and fewer samples; 2) they achieve higher accuracy. Disadvantages: slower convergence, and they cannot handle hidden variables.
9.4 ROC curve
The ROC curve, also called the receiver operating characteristic curve, is a visualization tool for comparing how good learned models are: the horizontal axis is the false positive rate, and the vertical axis is the true positive rate. The closer the curve lies to the diagonal (the random-guessing line), the worse the model.
For a good model, true positives dominate at first, so the curve rises steeply from 0; then, as fewer and fewer true-positive tuples and more and more false-positive tuples are encountered, the curve gradually levels off. A perfectly correct model has an area under the curve of 1.
10 Statistical knowledge
Information Visualization (pie chart, line chart, etc.)
Measures of central tendency (mean, median, mode, etc.)
Probability
Permutation combinations
Distributions (geometric, binomial, Poisson, normal, chi-square)
Statistical sampling
Sample estimation
Hypothesis Testing
Regression
11 Stanford NLP
Sentence understanding, automatic question answering systems, machine translation, syntactic analysis, tagging, sentiment analysis, text and visual scene modeling, and natural language processing applications and computation in the digital humanities and social sciences.
12 Apache OpenNLP
Apache's OpenNLP library is a machine-learning toolkit for processing natural-language text. It supports the most common NLP tasks, such as tokenization, sentence segmentation, part-of-speech tagging, named entity extraction, chunking, parsing, and coreference resolution.
Sentence detector: the sentence detector is used to detect sentence boundaries.
Tokenizer: the OpenNLP tokenizer segments an input character sequence into tokens. Tokens are usually words delimited by spaces, but there are exceptions.
Name finder: the name finder detects named entities and numbers in text.
POS tagger: the OpenNLP POS tagger uses a probability model to predict the correct POS tag from the tag set.
Chunker: divides the text into syntactically related groups of words, such as noun groups and verb groups, but specifies neither their internal structure nor their role in the main sentence.
Parser: the simplest way to try out the parser is the command-line tool, which is intended for demonstration and testing purposes only.
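A minimal sketch using OpenNLP's model-free SimpleTokenizer (the sample sentence is an arbitrary example; the other components above additionally require trained model files):

```java
import opennlp.tools.tokenize.SimpleTokenizer;

/** A minimal OpenNLP tokenization sketch: split a sentence into tokens. */
public class OpenNlpDemo {
    public static void main(String[] args) {
        String[] tokens = SimpleTokenizer.INSTANCE.tokenize("Hello, OpenNLP world!");
        for (String t : tokens) System.out.println(t);  // one token per line
    }
}
```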
13 Lucene
Lucene is a Java-based full-text information retrieval toolkit. It is not a complete search application; rather, it provides indexing and search capabilities for your application. Lucene is currently an open-source project in the Apache Jakarta family and the most popular Java-based open-source full-text search toolkit.
Many applications are already based on Lucene, for example the search function of the Eclipse help system. Lucene can index textual data, so as long as you convert the data you want to index into text format, Lucene can index and search your documents.
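A minimal index-then-search sketch (API as of Lucene 8.x; the in-memory directory, the field name "content", and the sample text are assumptions):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

/** A minimal Lucene sketch: index one document in memory, then search it. */
public class LuceneDemo {
    public static void main(String[] args) throws Exception {
        Directory dir = new ByteBuffersDirectory();              // in-memory index
        StandardAnalyzer analyzer = new StandardAnalyzer();
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
            Document doc = new Document();                       // indexing
            doc.add(new TextField("content", "Lucene provides indexing and search",
                    Field.Store.YES));
            writer.addDocument(doc);
        }
        try (DirectoryReader reader = DirectoryReader.open(dir)) { // searching
            IndexSearcher searcher = new IndexSearcher(reader);
            ScoreDoc[] hits = searcher.search(
                new QueryParser("content", analyzer).parse("indexing"), 10).scoreDocs;
            for (ScoreDoc hit : hits)
                System.out.println(searcher.doc(hit.doc).get("content"));
        }
    }
}
```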
14 Apache Solr
Solr is an open-source search server based on Lucene and Java. Solr offers faceted search (i.e., statistics), hit highlighting, and support for multiple output formats. It is easy to install and configure and comes with an HTTP-based administration interface. You can use Solr's excellent basic search functionality, or extend it to meet the needs of your business.
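A minimal SolrJ query sketch against such a server (the URL, the core name "articles", the field names, and the query string are hypothetical examples):

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

/** A minimal SolrJ sketch: run a full-text query with hit highlighting. */
public class SolrDemo {
    public static void main(String[] args) throws Exception {
        HttpSolrClient client = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/articles").build();
        SolrQuery query = new SolrQuery("content:lucene");  // full-text query
        query.setHighlight(true);                           // hit highlighting
        QueryResponse resp = client.query(query);
        for (SolrDocument doc : resp.getResults())
            System.out.println(doc.getFieldValue("id"));
        client.close();
    }
}
```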
The features of Solr include:
• Advanced full-text search functionality
• Optimized for high-throughput network traffic
• Based on open interface standards (XML and HTTP)
• Integrated HTML administration interface
• Scalability: the ability to replicate efficiently to other Solr search servers
• Flexible, adaptable XML configuration
• An extensible plug-in architecture
Solr Chinese word segmentation
15 Machine Learning dimensionality reduction
Mainly feature selection, random forests, principal component analysis, and linear dimensionality reduction.
16 Domain Ontology Building method
1 Determine the specialized domain and scope of the domain ontology
2 Consider reusing existing ontologies
3 List the important terms in the ontology's domain
4 Define the taxonomic concepts and the concept classification hierarchy
5 Define the relationships between concepts
17 Knowledge Engineering approach to building domain ontology:
Main features: ontologies place more emphasis on sharing and reuse, and they can provide a unified language for different systems, so the engineering character of ontology construction is more pronounced.
Methods: so far, the well-known methods in ontology engineering include the TOVE method, the Methontology method, the skeleton method, the IDEF-5 method, and the seven-step method. (Domain ontologies are mostly built by hand.)
Status quo: because ontology engineering is still at a relatively immature stage, domain ontology construction is still in an exploratory period, so many problems remain in the construction process.
Method maturity: ranked from most to least mature, the above methods are: the seven-step method, the Methontology method, the IDEF-5 method, the TOVE method, and the skeleton method.
"NLP" 10 minutes to learn natural language processing