1. Description of the problem
When processing text in Python, the input sometimes mixes Chinese, English, Japanese and other languages. These often cannot all be handled the same way, so you first need to determine which language each piece of text is written in. Python's langid package provides this functionality. It currently supports detection of 97 languages, which makes it very useful.
2. Code for the program
The following Python program calls the langid package to detect the language of each line of a text file:
import langid  # import the langid module

def translate(inputfile, outputfile):
    # Python 3: open() decodes UTF-8 into str on read and encodes it on write
    fin = open(inputfile, 'r', encoding='utf-8', errors='ignore')  # open the input file for reading
    fout = open(outputfile, 'w', encoding='utf-8')  # open the output file for writing
    for eachline in fin:  # read each line in turn
        line = eachline.strip()  # strip leading and trailing whitespace
        linetuple = langid.classify(line)  # run language detection on the line
        if linetuple[0] == "zh":  # the line is mostly Chinese: skip it
            continue
        outstr = line  # the line is not Chinese: prepare it for output
        fout.write(outstr.strip() + '\n')  # write the non-Chinese line
    fin.close()
    fout.close()

if __name__ == '__main__':  # script entry point, like a main function
    translate("MyInputFile.txt", "MyOutputFile.txt")
The code above processes a text file and writes every line that is not Chinese to a new file.
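The same filtering can also be sketched with the language check separated from the file handling, which makes it easier to try out on a few lines. This is only a minimal sketch under some assumptions: the helper name keep_non_chinese and its classify parameter are made up for illustration, blank lines are skipped rather than classified, and the files are assumed to be UTF-8.

```python
def keep_non_chinese(lines, classify=None):
    """Yield stripped lines whose detected language is not Chinese ("zh").

    `classify` is any callable returning a (language_code, score) tuple.
    When it is omitted, langid.classify is imported and used.
    """
    if classify is None:
        import langid  # imported lazily so the helper can be tried with a stub
        classify = langid.classify
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # assumption: blank lines are dropped, not classified
        lang, _score = classify(line)
        if lang != "zh":
            yield line


def translate(inputfile, outputfile):
    """Copy the non-Chinese lines of inputfile to outputfile (both UTF-8)."""
    # The with statement closes both files even if an error occurs.
    with open(inputfile, encoding="utf-8", errors="ignore") as fin, \
         open(outputfile, "w", encoding="utf-8") as fout:
        for line in keep_non_chinese(fin):
            fout.write(line + "\n")
```

With langid installed, translate("MyInputFile.txt", "MyOutputFile.txt") should behave like the program above, apart from the blank-line handling.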
3. Note
In the call langid.classify(line), the return value is a two-element tuple. The first item is the code of the language the text was identified as, for example zh for Chinese, en for English, and so on. The second item is a confidence score for that prediction. By default it is an unnormalized log-probability rather than a proportion of the text, so it is mainly useful for comparing candidates.
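If a score between 0 and 1 is easier to work with, langid can be asked for normalized probabilities, and the candidate languages can be restricted to the ones you expect. The following is a rough sketch: LanguageIdentifier, model, norm_probs and set_languages come from langid itself, while the helper names make_identifier and is_confident and the 0.9 threshold are hypothetical choices made only for this example.

```python
def make_identifier(languages=None):
    """Build a langid identifier whose scores are probabilities in [0, 1]."""
    # Imported here so is_confident below can be used without langid installed.
    from langid.langid import LanguageIdentifier, model
    identifier = LanguageIdentifier.from_modelstring(model, norm_probs=True)
    if languages is not None:
        # Restrict the candidates, e.g. ["zh", "en", "ja"].
        identifier.set_languages(languages)
    return identifier


def is_confident(result, language, threshold=0.9):
    """Check a (language_code, probability) tuple against a language.

    Returns True only if the tuple names `language` and its probability
    reaches `threshold` (0.9 is an arbitrary value chosen for this sketch).
    """
    code, probability = result
    return code == language and probability >= threshold
```

For example, with langid installed, is_confident(make_identifier(["zh", "en", "ja"]).classify(line), "zh") could replace the plain comparison against "zh" when short or mixed lines should not be treated as Chinese too eagerly.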
I hope this helps.