Preface:
Natural Language Processing (NLP) is widely used in speech recognition, machine translation, and automatic question answering. Early NLP technology was based on part-of-speech and syntax analysis; by the end of the 1970s it had been replaced by methods based on mathematical statistics. For more on the history of NLP, see the book The Beauty of Mathematics.
This series follows Stanford professors Dan Jurafsky and Christopher Manning through the fundamentals of NLP, including word and sentence tokenization, text classification, sentiment analysis, the basics of probability theory, statistics, and machine learning, and basic algorithms such as n-gram language models, Naive Bayes and MaxEnt classifiers, and Hidden Markov Models.
Stanford NLP course website https://www.coursera.org/course/nlp
(My English is limited; if you find any mistakes, please point them out.)
Chapter 1: Basic Text Processing
This chapter describes basic text processing in the following four aspects:
- Regular Expression
- Word tokenization
- Word normalization and stemming
- Sentence Segmentation
1.1 Regular Expressions
Intuitively, what problems do we face if we want to find a word in a large body of text? One is word-form variation: woodchuck, for example, may also appear as woodchucks, Woodchuck, or Woodchucks. Handling this requires some kind of "rules", and regular expressions are exactly such rules.
- [ ]: matches any single character inside the square brackets. For example, [wW] matches w or W.
- [A-Z]: a range; matches any character from A to Z.
- ^: the beginning of a line; $: the end of a line. For example, ^[A-Z] matches lines that start with a capital letter.
- . : the period matches any character. \. : a backslash before the period matches a literal period.
- [^ ]: negation; matches any character not in the brackets. For example, [^A-Z] matches any non-capital letter.
- ? : the preceding character is optional (appears zero or one time).
- * : the preceding character may repeat zero or more times.
- + : the preceding character appears one or more times.
- | : disjunction ("or"). For example, a|b|c is equivalent to [abc].
(You can try these out at regexpal.com.)
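As a quick illustration, here is a minimal sketch of these rules using Python's built-in re module (the patterns and test sentence are my own, not from the lecture):

    import re

    text = "The woodchuck saw two Woodchucks near the 3.5 mile marker."

    # [wW] plus an optional plural s: matches woodchuck/Woodchuck/woodchucks/Woodchucks
    print(re.findall(r"[wW]oodchucks?", text))   # ['woodchuck', 'Woodchucks']

    # ^[A-Z] anchors a capital letter to the start of the string
    print(bool(re.search(r"^[A-Z]", text)))      # True

    # \. matches a literal period, so this finds the decimal number
    print(re.findall(r"[0-9]+\.[0-9]+", text))   # ['3.5']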
1.2 Word Tokenization
For every NLP task, the first thing to do is text normalization, which includes the following three steps:
1. Segmenting/tokenizing words in running text
2. Normalizing word formats
3. Segmenting sentences in running text
This section covers the first step. Professor Jurafsky does not give definitive answers to some of the word-counting questions below; the point is to understand the various situations correctly. A detailed explanation follows.
1.2.1 How many words?
How do we count the number of words in a given sentence?
Spoken language is rarely perfectly fluent; utterances often contain fragments, filled pauses, and repetitions, as in the following sentence.
I do uh main-mainly business data processing.
Here "uh" is a filled pause and "main-mainly" is a fragment with a repetition; both make counting words troublesome.
Several definitions are introduced here: lemma, wordform, type, token, N, V, and |V|.
Example 1: Seuss's cat in the hat is different from other cats!
Lemma: the shared base form; for example, cat and cats have the same lemma.
Wordform: the full inflected surface form; cat and cats are two different wordforms.
Type: a distinct word; counting types ignores repetitions.
Token: an instance of a word in running text; counting tokens includes repetitions (roughly, the number of spaces between words plus 1).
Example 2: They lay back on the San Francisco grass and looked at the stars and their
In the sentence above, if San Francisco is split into two words there are 15 tokens; if it is treated as one, there are 14.
"the" and "and" each appear twice, so there are 13 types (or 12 if San Francisco counts as one). If they and their are treated as the same, there are 11 types. It all depends on the specific rules chosen.
N: Number of tokens
V: the vocabulary, i.e., the set of types; |V| is the size of the vocabulary.
(The original showed a table of N and |V| for several corpora here, including the complete works of Shakespeare.)
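To make N and |V| concrete, here is a tiny Python sketch that counts tokens and types for Example 2 under the simplest rule, splitting on whitespace (so San Francisco is two tokens):

    sentence = "They lay back on the San Francisco grass and looked at the stars and their"
    tokens = sentence.split()   # whitespace split: the simplest tokenizer
    types = set(tokens)         # distinct wordforms

    print(len(tokens))          # N = 15 tokens
    print(len(types))           # |V| = 13 types ("the" and "and" repeat)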
1.2.2 Issues in Tokenization
Many written forms complicate tokenization, such as apostrophes (clitics), hyphens, periods, and multiword names. Consider the following cases (a question mark means there is no single right answer):
What're, I'm, isn't -----------------> these are handled well, expanding to what are, I am, is not
Hewlett-Packard -----------------> should it be split into Hewlett and Packard?
lowercase -----------------> lower-case, lowercase, or lower case?
San Francisco -----------------> one token or two?
m.p.h., PhD. -----------------> ??
A small sketch of handling the clitic cases follows.
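As one illustration, here is a minimal Python sketch that expands a few common English clitics before whitespace tokenization (the expansion table is my own small example, not a complete list):

    # A tiny, hand-picked clitic expansion table (illustrative, not exhaustive).
    CLITICS = {
        "what're": "what are",
        "i'm": "i am",
        "isn't": "is not",
    }

    def tokenize(text):
        expanded = []
        for w in text.lower().split():
            w = w.strip(".,!?")                      # strip trailing punctuation
            expanded.extend(CLITICS.get(w, w).split())
        return expanded

    print(tokenize("What're you doing? I'm busy, isn't it obvious?"))
    # ['what', 'are', 'you', 'doing', 'i', 'am', 'busy', 'is', 'not', 'it', 'obvious']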
These problems are not limited to English. In French, is l'ensemble one token or two? German noun compounds have no separators, e.g., Lebensversicherungsgesellschaftsangestellter ("life insurance company employee"). In Chinese and Japanese, words are written with no spaces between them at all.
Next we introduce word segmentation in Chinese.
Word tokenization in Chinese is also called word segmentation. Each Chinese word is composed of several characters, about 2.4 characters per word on average.
A baseline algorithm for Chinese word segmentation is "maximum matching" (a greedy algorithm). Given a Chinese dictionary and a sentence, the process is as follows:
Step 1. Place a pointer at the beginning of the sentence.
Step 2. Find the longest dictionary word that matches the text starting at the pointer, and place a word boundary after it.
Step 3. Move the pointer past that boundary and repeat Step 2 until the sentence is consumed.
For example, the sentence 莎拉波娃现在居住在美国东南部的佛罗里达 ("Sharapova now lives in southeastern Florida, in the United States") is segmented as:
莎拉波娃 / 现在 / 居住 / 在 / 美国 / 东南部 / 的 / 佛罗里达
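Here is a minimal Python sketch of maximum matching, assuming the dictionary is simply given as a set (the toy dictionary below exists only to reproduce the example):

    def max_match(sentence, dictionary, max_len=5):
        """Greedy left-to-right maximum matching segmentation."""
        words = []
        i = 0
        while i < len(sentence):
            # Try the longest candidate first, shrinking until something matches.
            for j in range(min(len(sentence), i + max_len), i, -1):
                if sentence[i:j] in dictionary or j == i + 1:
                    # Fall back to a single character if nothing matches.
                    words.append(sentence[i:j])
                    i = j
                    break
        return words

    # Toy dictionary, just enough to segment the example sentence.
    dictionary = {"莎拉波娃", "现在", "居住", "在", "美国", "东南部", "的", "佛罗里达"}
    print(max_match("莎拉波娃现在居住在美国东南部的佛罗里达", dictionary))
    # ['莎拉波娃', '现在', '居住', '在', '美国', '东南部', '的', '佛罗里达']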
Of course, there are also more advanced probabilistic segmentation algorithms, which give better results; these will be discussed later.
1.3 Word Normalization and Stemming
1.3.1 Normalization
We need to normalize word formats. For example, in IR (Information Retrieval), the indexed text and the query terms must have the same format, so U.S.A. and USA should match.
We often implicitly define equivalence classes of terms; a typical method is deleting the periods, as with U.S.A. and USA above.
Alternatively, there is asymmetric expansion, where the entered term and the searched terms are not equivalent: entering window searches for window and windows, while entering Windows searches only for Windows. This can be more powerful but makes search more complicated, so in practice the simpler symmetric, equivalence-class approach is usually preferred. A sketch of the equivalence-class idea follows.
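A minimal sketch of the equivalence-class method: map every term to a canonical form by deleting periods and lowercasing (a deliberately simplified rule set):

    def normalize(term):
        # Collapse variants like "U.S.A.", "USA", "u.s.a" into one class.
        return term.replace(".", "").lower()

    print(normalize("U.S.A."))                       # usa
    print(normalize("USA"))                          # usa
    print(normalize("u.s.a") == normalize("USA"))    # True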
1.3.2 Case Folding
In some applications, such as IR, we usually convert all words to lowercase, since most users type lowercase queries. There are exceptions: when we encounter capitalized names in a sentence, such as General Motors, Federal Reserve Bank, or Stanford Artificial Intelligence Lab, we may want to keep the original capitalization.
In sentiment analysis, MT (Machine Translation), and Information Extraction, case is critical. For example, US and us are completely different.
1.3.3 Lemmatization
We often also use lemmatization to reduce words to their base forms, for example:
am, are, is -> be
car, cars, car's, cars' -> car
the boy's cars are different colors -> the boy car be different color
The task of lemmatization is thus to find the correct dictionary headword form for each word.
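For a concrete tool, NLTK's WordNet lemmatizer can do this; a minimal sketch, assuming nltk is installed and the WordNet data has been downloaded:

    import nltk
    nltk.download("wordnet")                  # one-time download of the WordNet data
    from nltk.stem import WordNetLemmatizer

    wnl = WordNetLemmatizer()
    print(wnl.lemmatize("cars"))              # car  (noun is the default POS)
    print(wnl.lemmatize("are", pos="v"))      # be   (needs the verb POS tag)
    print(wnl.lemmatize("colors"))            # color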
1.3.4 Morphology
Morphology studies morphemes, the smallest meaningful units that make up words:
Stems: the core meaning-bearing units.
Affixes: pieces that attach to stems, often carrying grammatical functions.
For example, in the word "stems", "stem" is the stem and the final "s" is an affix.
1.3.5 Stemming
Stemming crudely chops the suffixes off wordforms, reducing them to their stems, for example:
automate(s), automatic, automation -> automat
Running a whole sentence through a stemmer shows the effect: "for example compressed and compression are both accepted as equivalent to compress" becomes "for exampl compress and compress ar both accept as equival to compress".
1.3.6 Porter's Algorithm (the most common English stemming algorithm)
The algorithm applies a sequence of rewrite rules, such as sses -> ss, ies -> i, and (*v*)ing -> (empty). The (*v*) condition means the rule fires only when the remaining stem contains a vowel, so walking becomes walk but sing keeps its ing.
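NLTK ships an implementation of the Porter stemmer; a minimal sketch reproducing the examples above, assuming nltk is installed:

    from nltk.stem import PorterStemmer

    ps = PorterStemmer()
    for word in ["compressed", "compression", "walking", "sing"]:
        print(word, "->", ps.stem(word))
    # compressed -> compress
    # compression -> compress
    # walking -> walk
    # sing -> sing   (no vowel left in the stem, so -ing is kept)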
1.3.7 Practice
On a Linux system, run the following command to find the most frequent words ending in -ing; shakes.txt is the text file being processed (any English document will do):
tr -sc 'A-Za-z' '\n' < shakes.txt | tr 'A-Z' 'a-z' | grep 'ing$' | sort | uniq -c | sort -n -r | less
Reading the pipeline stage by stage: turn every run of non-letters into a newline, one word per line | convert uppercase letters to lowercase | keep only words ending in ing | sort | merge identical lines and count them | sort by count, descending | page through the result.
(The original post showed a screenshot of the resulting counts here.)
1.4 Sentence Segmentation
We all know some common rules for deciding where sentences end.
Exclamation points (!) and question marks (?) are relatively unambiguous sentence boundaries. Periods (.) are ambiguous: besides ending sentences, they appear in abbreviations and in numbers such as .02% and 4.3. We therefore build a binary classifier that decides, for each period, whether it marks the end of a sentence (E-O-S) or not. A typical method is a decision tree.
Such a decision tree can be very simple. To make the judgment more accurate, we introduce more features, which means more if-then-else tests and a correspondingly more complex tree, as in the sketch below.
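As an illustration, here is a toy hand-written decision tree for classifying a period, expressed as if-then-else rules in Python (the feature checks and the abbreviation list are my own simplifications, not the lecture's):

    # Hypothetical abbreviation list; a real system would use a much larger one.
    ABBREVIATIONS = {"dr", "mr", "mrs", "inc", "etc"}

    def is_end_of_sentence(prev_word, next_word):
        """Decide whether a period after prev_word ends the sentence."""
        if prev_word.lower().strip(".") in ABBREVIATIONS:
            return False                        # "Dr." etc. rarely ends a sentence
        if any(ch.isdigit() for ch in prev_word):
            return False                        # part of a number like 4.3 or .02%
        if next_word and next_word[0].isupper():
            return True                         # capitalized next word: likely E-O-S
        return False

    print(is_end_of_sentence("Dr", "Smith"))     # False
    print(is_end_of_sentence("4", "3"))          # False (digit before the period)
    print(is_end_of_sentence("colors", "The"))   # True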
Given such features, classifiers other than decision trees can also be used, such as logistic regression, SVMs, and neural networks; these will be covered later.