Python: a simple text string processing method.
This example shows how to do simple text string processing in Python: splitting a sentence into words, stripping punctuation, dropping empty strings, and normalizing case.
For a text string, you can use Python's string.split() method to split it into words. Let's look at the actual result.
mySent = 'This book is the best book on python!'
print(mySent.split())
Output:
['This', 'book', 'is', 'the', 'best', 'book', 'on', 'python!']
As you can see, the split works well, but punctuation marks are kept as part of the words ('python!'). This can be handled with a regular expression that treats any run of characters other than letters, digits, and underscores as the separator.
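import re
# Use \W+ (one or more non-word characters). With \W*, Python 3.7+ also
# splits on empty matches and breaks the text into single characters.
reg = re.compile(r'\W+')
mySent = 'This book is the best book on python!'
listof = reg.split(mySent)
print(listof)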
Output:
['This', 'book', 'is', 'the', 'best', 'book', 'on', 'python', '']
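The empty string at the end of the list comes from the final '!': the separator matches it, and nothing follows it. A minimal sketch of that behavior, assuming the pattern \W+ (one or more non-word characters) as the separator; the sample strings are made up for illustration:

```python
import re

# Assumption: \W+ (one or more non-word characters) is the separator.
reg = re.compile(r'\W+')

# A separator at the very end leaves an empty string after it.
trailing = reg.split('python!')
# A separator at the very start leaves an empty string before it.
leading = reg.split('!python')
# No separator at either end: no empty strings.
clean = reg.split('best book')

print(trailing)  # ['python', '']
print(leading)   # ['', 'python']
print(clean)     # ['best', 'book']
```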
We now have a list of words, but the empty string in it needs to be removed.
To do that, check the length of each string and keep only those longer than 0.
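import re
# \W+ rather than \W*: Python 3.7+ also splits on empty matches.
reg = re.compile(r'\W+')
mySent = 'This book is the best book on python!'
listof = reg.split(mySent)
# Keep only the non-empty tokens.
new_list = [tok for tok in listof if len(tok) > 0]
print(new_list)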
Output:
['This', 'book', 'is', 'the', 'best', 'book', 'on', 'python']
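As an alternative sketch (not the method used in this article), re.findall with the pattern \w+ collects the words directly instead of splitting on separators, so no empty strings appear and no filtering step is needed. The variable name words is only illustrative.

```python
import re

mySent = 'This book is the best book on python!'

# \w+ matches each run of letters, digits, or underscores,
# so punctuation and whitespace are simply skipped.
words = re.findall(r'\w+', mySent)
print(words)
```

Splitting and filtering makes each step visible, while findall is shorter when only the words themselves are needed.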
Finally, notice that the first word of the sentence is capitalized. To treat all words uniformly, convert them to the same case. Python's built-in string methods convert a string to all lowercase (.lower()) or all uppercase (.upper()).
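import re
# \W+ rather than \W*: Python 3.7+ also splits on empty matches.
reg = re.compile(r'\W+')
mySent = 'This book is the best book on python!'
listof = reg.split(mySent)
# Drop empty tokens and lowercase the rest.
new_list = [tok.lower() for tok in listof if len(tok) > 0]
print(new_list)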
Output:
['this', 'book', 'is', 'the', 'best', 'book', 'on', 'python']
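The three steps so far (split on non-word characters, drop empty strings, lowercase) can be collected into one helper. This is a minimal sketch; the names tokenize and _SEPARATOR are hypothetical and do not come from the original example.

```python
import re

# Compile once; \W+ matches one or more non-word characters.
_SEPARATOR = re.compile(r'\W+')


def tokenize(text):
    """Split text on non-word characters, drop empty strings, lowercase."""
    return [tok.lower() for tok in _SEPARATOR.split(text) if tok]


tokens = tokenize('This book is the best book on python!')
print(tokens)
```

The condition `if tok` is equivalent to `len(tok) > 0` for strings, since an empty string is falsy.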
Here is the same processing applied to a complete email. The contents of email.txt:
Hi Peter,
With Jose out of town, do you want to
meet once in a while to keep things
going and do some interesting stuff?
Let me know
Eugene
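import re
# \W+ rather than \W*: Python 3.7+ also splits on empty matches.
reg = re.compile(r'\W+')
with open('email.txt') as f:
    email = f.read()
# Named tokens rather than list, which would shadow the built-in type.
tokens = reg.split(email)
new_txt = [tok.lower() for tok in tokens if len(tok) > 0]
print(new_txt)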
Output:
['hi', 'peter', 'with', 'jose', 'out', 'of', 'town', 'do', 'you', 'want', 'to', 'meet', 'once', 'in', 'a', 'while', 'to', 'keep', 'things', 'going', 'and', 'do', 'some', 'interesting', 'stuff', 'let', 'me', 'know', 'eugene']
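To try the email example without creating email.txt, the text can be embedded as a string. This is a sketch under the assumption that the file holds exactly the six lines of the email shown earlier; the real file may have extra blank lines, which would not change the tokens.

```python
import re

# Assumed contents of email.txt, embedded so the sketch is self-contained.
email = (
    "Hi Peter,\n"
    "With Jose out of town, do you want to\n"
    "meet once in a while to keep things\n"
    "going and do some interesting stuff?\n"
    "Let me know\n"
    "Eugene\n"
)

# Split on non-word characters, drop empty strings, lowercase.
tokens = [tok.lower() for tok in re.split(r'\W+', email) if tok]
print(len(tokens))   # 29
print(tokens[:4])    # ['hi', 'peter', 'with', 'jose']
```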