Okay, I underestimated this. I first tried processing the data with Java, line by line, but the JVM failed to start with a memory error:
Error occurred during initialization of VM: incompatible minimum and maximum heap sizes specified
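This particular error means the initial heap size (-Xms) was set larger than the maximum heap size (-Xmx), so the JVM refuses to start. A minimal sketch of the fix (the flag values and MainClass are placeholders, not from the original setup):

```shell
# Broken: initial heap (2g) exceeds maximum heap (512m)
#   java -Xms2g -Xmx512m MainClass
# Fixed: ensure -Xms <= -Xmx when launching
java -Xms512m -Xmx2g MainClass
```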
So I switched to Python. I had looked up reading data line by line in Python before but never actually used that approach; this time I did. Python takes a bit longer and progress is slow, but it works, and it handles all the data preprocessing. A few things I ran into along the way:
1. Python regex matching: re.compile() and the finditer() function.
2. Character-encoding issues: the codecs module, u'Chinese' unicode literals, and r'raw' strings for escapes; I still don't understand this area well enough.
3. Opening files with Python's open() function.
4. Reading line by line: the readlines() function, saving the lines to a list, and writing output with write().
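As a minimal sketch of points 1, 2, and 4 above (my own toy example in Python 3, with a made-up sample line rather than the Weibo data):

```python
import re

# Matches every character OUTSIDE the basic CJK Unified Ideographs
# range \u4e00-\u9fa5, i.e. everything that is not a Chinese character.
pattern = re.compile(u'[^\u4e00-\u9fa5]')

line = u'Weibo数据2012年预处理!'  # hypothetical sample line

# finditer() yields one match object per non-Chinese character.
non_chinese = [m.group() for m in pattern.finditer(line)]

# A single sub() call strips them all at once, which avoids calling
# str.replace() repeatedly inside a loop.
cleaned = pattern.sub(u'', line)
```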
# coding: utf-8
import codecs
import re
# ----------------------
n = 0
p = re.compile(u'[^\u4e00-\u9fa5]')  # regex matching non-Chinese characters
# ----------------------
# read the file line by line
with codecs.open(u"d:/shifengworld/nlp/nlp_project/new words found/data/untreated_data/2012_7.csv") as f:
    text = f.readlines()
# ----------------------
file_object = open(u"d:/shifengworld/nlp/nlp_project/new words found/data/data_preproces/abc2.txt", 'w')
# ----------------------
for line in text:
    line = line.decode('utf-8')  # decode each opened line to unicode because of encoding problems; this part is messy and I don't understand encodings well enough yet
    for m in p.finditer(line):  # regex-match every non-Chinese character
        line = line.replace(m.group(), '')  # replace each non-Chinese character with the empty string
    line = line.strip(' ')
    file_object.write(line + '\n')  # write each cleaned line out, adding a newline
    # print line
    # if n > 6:
    #     break
    # n = n + 1
file_object.close()  # remember to close the output file
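For reference, the same preprocessing reads more simply in Python 3, where open() takes an encoding argument and strings are unicode by default, so codecs.open() and the per-line decode() disappear. A sketch with placeholder paths, not the original file names:

```python
import re

# Same pattern as the script above: anything that is not a basic
# CJK ideograph in the range \u4e00-\u9fa5.
pattern = re.compile(u'[^\u4e00-\u9fa5]')

def preprocess(src_path, dst_path):
    # Python 3's open() decodes/encodes text for us, so no codecs.open()
    # and no per-line .decode('utf-8') are needed.
    with open(src_path, encoding='utf-8') as src, \
         open(dst_path, 'w', encoding='utf-8') as dst:
        for line in src:
            # One sub() call replaces the finditer()/replace() loop.
            dst.write(pattern.sub('', line).strip() + '\n')
```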
Natural Language Processing---New Word Discovery---Weibo Data Preprocessing 2