Text contains a great deal of information. How do we extract the useful parts?
For example:
JSON is a good boy
The information we want to extract is "JSON" and "a good boy".
First, we need to split the text into sentences, split each sentence into words, and tag each word with its part of speech.
You can use the following code:
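import nltk

def ie_preprocess(document):
    # Requires the NLTK tokenizer and tagger data (see nltk.download).
    sentences = nltk.sent_tokenize(document)                      # split into sentences
    sentences = [nltk.word_tokenize(sent) for sent in sentences]  # split into words
    sentences = [nltk.pos_tag(sent) for sent in sentences]        # tag parts of speech
    return sentences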
Next, we specify what kind of information to extract by defining a chunk grammar:
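grammar = "NP: {<DT>?<JJ>*<NN>}"
DT is a determiner, JJ is an adjective, and NN is a singular noun. The rule says that a noun phrase (NP) is an optional determiner, followed by any number of adjectives, followed by a noun.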
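To make the pattern concrete, here is a minimal standard-library sketch of what a rule like this matches. It is only an illustration, not NLTK's implementation: the function name chunk_noun_phrases, the regular expression, and the hand-assigned tags are assumptions made for this example.

```python
import re

def chunk_noun_phrases(tagged, pattern=r"(?:<DT>)?(?:<JJ>)*<NN>"):
    """Return the word groups whose tags form: optional DT, any number of JJ, then NN."""
    # Encode the tag sequence as one string, e.g. "<NNP><VBZ><DT><JJ><NN>".
    tag_string = "".join(f"<{tag}>" for _, tag in tagged)

    # Map each character offset where a tag starts to its token index.
    starts = {}
    offset = 0
    for i, (_, tag) in enumerate(tagged):
        starts[offset] = i
        offset += len(tag) + 2  # +2 for the angle brackets
    starts[offset] = len(tagged)

    chunks = []
    for m in re.finditer(pattern, tag_string):
        first, last = starts[m.start()], starts[m.end()]
        chunks.append([word for word, _ in tagged[first:last]])
    return chunks

# Tags are assigned by hand here; a real tagger may choose differently.
tagged = [("JSON", "NNP"), ("is", "VBZ"), ("a", "DT"), ("good", "JJ"), ("boy", "NN")]
print(chunk_noun_phrases(tagged))  # [['a', 'good', 'boy']]
```

With these tags only "a good boy" is matched, because the NNP tag on "JSON" is not the same as NN.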
cp = nltk.RegexpParser(grammar)
Then parse a POS-tagged sentence with it:
result = cp.parse(sentence)
This returns an nltk.tree.Tree structure.
Then we walk the tree:
for n in result:
    if isinstance(n, nltk.tree.Tree):
        # NLTK 3 uses n.label(); older versions used the n.node attribute
        if n.label() == 'NP':
            a = n
Each subtree labelled NP is a chunk we want to extract.
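The steps so far can be put together roughly as follows. This is a sketch that assumes NLTK 3 is installed; the tags are written by hand so that no tokenizer or tagger data is needed, and the helper name extract_noun_phrases is made up for this example.

```python
import nltk

grammar = "NP: {<DT>?<JJ>*<NN>}"
cp = nltk.RegexpParser(grammar)

def extract_noun_phrases(tagged_sentence):
    """Return the text of every top-level NP chunk in a POS-tagged sentence."""
    tree = cp.parse(tagged_sentence)
    phrases = []
    for node in tree:
        # Chunks are subtrees; tokens outside a chunk stay as (word, tag) tuples.
        if isinstance(node, nltk.tree.Tree) and node.label() == "NP":
            phrases.append(" ".join(word for word, tag in node.leaves()))
    return phrases

# Hand-assigned tags (an assumption; nltk.pos_tag may tag differently).
tagged = [("JSON", "NNP"), ("is", "VBZ"), ("a", "DT"), ("good", "JJ"), ("boy", "NN")]
print(extract_noun_phrases(tagged))  # ['a good boy']
```

Note that "JSON" is not returned when it is tagged NNP, since <NN> matches only the NN tag. A pattern such as <NN.*> would be one way to include proper nouns as well.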
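In practice, this method does not filter out non-English words.
We can add a PyEnchant dictionary:
import enchant
d = enchant.Dict("en_US")
Then d.check(word) tells us whether a word is in the English dictionary, so we can decide which words to keep and which to discard.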
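The filtering step might look like the sketch below. To keep it runnable without PyEnchant, it uses a tiny hypothetical word set in place of a real dictionary; the names keep_known_words and tiny_vocab are assumptions for this example.

```python
def keep_known_words(words, is_known):
    """Keep only the words for which is_known(word) returns True."""
    return [w for w in words if is_known(w)]

# Hypothetical stand-in for a real dictionary. With PyEnchant installed,
# enchant.Dict("en_US").check could be passed as is_known instead.
tiny_vocab = {"a", "good", "boy", "is"}

words = ["a", "good", "boy", "好"]
print(keep_known_words(words, lambda w: w.lower() in tiny_vocab))  # ['a', 'good', 'boy']
```

Passing the check as a function keeps the filter independent of any particular dictionary library.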
I hope this helps.