Write a simple Python program to determine the language of a text

Source: Internet
Author: User
This article shows how to write a simple Python program that determines the language of a text. The code is short and relies mainly on the langid toolkit.

1. Problem description

When processing text in Python, the input sometimes mixes Chinese, English, and Japanese, and the same processing cannot always be applied to all of them. In that case, you first need to determine the language of each piece of text. The langid toolkit for Python provides this function; it currently supports detection of 97 languages.
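As a minimal sketch of what a single detection looks like, the helper below wraps langid.classify and returns only the language code. It assumes langid has been installed (for example with pip install langid). The name detect_language and the classify parameter are illustrative and not part of langid; the parameter exists only so another classifier can be swapped in.

```python
def detect_language(text, classify=None):
    """Return the language code assigned to `text`.

    `classify` is a hypothetical hook (not part of langid) that lets a
    caller supply another function returning a (language, score) tuple.
    By default langid.classify is used.
    """
    if classify is None:
        import langid  # third-party package, imported lazily
        classify = langid.classify
    lang, _score = classify(text)  # classify returns (language, score)
    return lang
```

Calling detect_language on an English, a Chinese, and a Japanese sentence would be expected to return codes such as en, zh, and ja, although very short inputs can be misclassified.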


2. Program code

The following Python code calls the langid toolkit to detect and identify the language of the text:

    import langid  # import the langid module (third-party; written for Python 3)


    def translate(inputFile, outputFile):
        # Open both files as UTF-8; undecodable bytes in the input are ignored.
        with open(inputFile, 'r', encoding='utf-8', errors='ignore') as fin, \
                open(outputFile, 'w', encoding='utf-8') as fout:
            for eachLine in fin:  # read each line in sequence
                line = eachLine.strip()  # remove leading and trailing whitespace
                lineTuple = langid.classify(line)  # detect the language of this line
                if lineTuple[0] == "zh":  # the line is classified as Chinese: skip it
                    continue
                fout.write(line + '\n')  # write the non-Chinese line to the output file


    if __name__ == '__main__':  # run only when executed as a script
        translate("myInputFile.txt", "myOutputFile.txt")

The code above reads a text file line by line and writes every line that is not classified as Chinese to a new file, preserving the original order.
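The same filtering idea can be sketched without files, which makes it easier to reuse. The generator below is an illustrative variant, not the article's original code: the names drop_languages, unwanted, and classify are assumptions, and unlike the program above it also skips empty lines rather than classifying them.

```python
def drop_languages(lines, unwanted=("zh",), classify=None):
    """Yield stripped lines whose detected language is not in `unwanted`.

    `classify` is a hypothetical hook returning (language, score);
    langid.classify is used when it is not supplied.
    """
    if classify is None:
        import langid  # third-party package, imported lazily
        classify = langid.classify
    unwanted = set(unwanted)
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # an empty line has nothing to classify
        lang, _score = classify(line)
        if lang not in unwanted:
            yield line
```

Because the function accepts any iterable of strings, an open file object can be passed directly, and passing unwanted=("zh", "ja") would drop both Chinese and Japanese lines.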


3. Note

The call langid.classify(line) returns a two-element tuple. The first item is the code of the detected language, for example zh for Chinese and en for English. The second item is a confidence score for that prediction, not the proportion of the text written in that language. By default the score is an unnormalized log-probability, so it is mainly useful for comparing candidate languages for the same text.
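If a value between 0 and 1 is easier to work with, langid can be asked to normalize the second item of the tuple. The sketch below follows the approach described in the langid README, which builds a LanguageIdentifier with norm_probs=True. The helper names make_normalized_identifier and is_confident, and the 0.9 threshold, are assumptions chosen for illustration.

```python
def make_normalized_identifier():
    """Build a langid identifier whose scores are normalized to [0, 1]."""
    # Imported lazily so the helper below can be used without langid.
    from langid.langid import LanguageIdentifier, model
    return LanguageIdentifier.from_modelstring(model, norm_probs=True)


def is_confident(result, threshold=0.9):
    """Return True when a (language, probability) result meets the threshold.

    The default threshold is a hypothetical value, not a langid setting.
    """
    _lang, prob = result
    return prob >= threshold
```

A typical use would be to create the identifier once with make_normalized_identifier(), call identifier.classify(line) for each line, and keep only the results for which is_confident returns True.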

I hope this helps.
