This article shows how to write a simple Python program that identifies the language of a text. The code is short and relies mainly on the langid toolkit. For details, see the sections below.
1. Problem description
When processing text in Python, the input sometimes mixes Chinese, English, and Japanese, and these languages often cannot be handled by the same processing step. In that case you first need to determine the language of each piece of text. The langid toolkit for Python provides this function; it currently supports 97 languages.
2. Program code
The following Python code calls the langid toolkit to detect and identify the language of the text:
    # Written for Python 3: files are opened as UTF-8 text, so no manual decode/encode is needed
    import langid  # import the langid module

    def translate(inputFile, outputFile):
        # Open both files as UTF-8 text; undecodable bytes in the input are ignored
        with open(inputFile, 'r', encoding='utf-8', errors='ignore') as fin, \
             open(outputFile, 'w', encoding='utf-8') as fout:
            for eachLine in fin:  # read each line in turn
                line = eachLine.strip()  # remove leading and trailing whitespace
                lineTuple = langid.classify(line)  # detect the language of this line
                if lineTuple[0] == "zh":  # the line is classified as Chinese: skip it
                    continue
                fout.write(line + '\n')  # write the non-Chinese line to the output file

    if __name__ == '__main__':  # script entry point
        translate("myInputFile.txt", "myOutputFile.txt")

The code above reads a text file line by line and writes the lines that are not Chinese to a new file, preserving their order.
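The filtering step can also be separated from the file handling, which makes it easier to try out on a few lines. The following is only a sketch under stated assumptions: keep_non_chinese and fake_classify are hypothetical names that do not appear in the original program, and fake_classify is a crude stand-in used so the example runs without the langid model. In real use you would pass langid.classify as the classify argument.

```python
def keep_non_chinese(lines, classify, skip_lang="zh"):
    """Yield stripped lines whose detected language is not skip_lang.

    classify is any callable that returns a (language, score) pair,
    for example langid.classify.
    """
    for raw in lines:
        line = raw.strip()
        lang, _score = classify(line)
        if lang != skip_lang:
            yield line


def fake_classify(text):
    # Hypothetical stand-in for langid.classify, for this demo only:
    # a line containing any CJK Unified Ideograph is labeled "zh".
    is_chinese = any('\u4e00' <= ch <= '\u9fff' for ch in text)
    return ("zh", 1.0) if is_chinese else ("en", 1.0)


sample = ["hello world\n", "你好，世界\n", "good morning\n"]
result = list(keep_non_chinese(sample, fake_classify))
print(result)  # ['hello world', 'good morning']
```

Because the function only needs an iterable of lines, an open file object can be passed in place of the sample list.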
3. Note
In the code above, the result of langid.classify(line), stored in lineTuple, is a two-element tuple. The first element is the language code of the text, for example zh for Chinese and en for English. The second element is a confidence score for that prediction; by default it is an unnormalized log-probability, not a proportion of the text.
I hope this helps.
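As a small illustration of reading the returned pair, the sketch below unpacks it and formats a short summary. The names describe and LANGUAGE_NAMES are hypothetical helpers, and the sample value is made up for the demonstration rather than taken from a real langid.classify call.

```python
# Hypothetical lookup table covering only the three languages mentioned in this article
LANGUAGE_NAMES = {"zh": "Chinese", "en": "English", "ja": "Japanese"}


def describe(result):
    """Format a (language_code, score) pair such as the one langid.classify returns."""
    lang, score = result
    name = LANGUAGE_NAMES.get(lang, "unknown")
    return "%s (%s), score %.2f" % (name, lang, score)


sample = ("en", -54.41)  # made-up value standing in for a real result
summary = describe(sample)
print(summary)  # English (en), score -54.41
```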