1. Read in the text content
import os
import codecs
import pandas

# Walk the corpus directory and read every file into one DataFrame
corpos = pandas.DataFrame(columns=['filePath', 'content'])
for root, dirs, files in os.walk(r'H:\19113117-Copy'):
    for name in files:
        filePath = root + '\\' + name
        f = codecs.open(filePath, 'r', 'utf-8')
        content = f.read()
        f.close()
        corpos.loc[len(corpos) + 1] = [filePath, content.strip()]
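As a side note (not from the original post), the same loading step can be written with a context manager, so the file handle is closed even if read() raises, and by building the frame from a list of rows, which is much faster than growing it one .loc assignment at a time:

import os
import codecs
import pandas

rows = []
for root, dirs, files in os.walk(r'H:\19113117-Copy'):
    for name in files:
        filePath = os.path.join(root, name)
        with codecs.open(filePath, 'r', 'utf-8') as f:   # closes the file automatically
            rows.append([filePath, f.read().strip()])
corpos = pandas.DataFrame(rows, columns=['filePath', 'content'])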
2. Word frequency statistics (the texts were segmented manually, with '/' between words)
import numpy

# Each document is already segmented, with '/' between words;
# split it and count how often every word occurs in every file
filePaths = []
segments = []
for filePath, content in corpos.itertuples(index=False):
    for item in content.split('/'):
        segments.append(item)
        filePaths.append(filePath)

segmentDF = pandas.DataFrame({'filePath': filePaths, 'segments': segments})
# (the original agg({"count": numpy.size}) spelling was removed in later pandas)
segStat = segmentDF.groupby(
    by=["filePath", "segments"]
)["segments"].size().reset_index(name="count")
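A toy run (made-up data, not the author's corpus) makes the shape of segStat concrete: one row per (file, word) pair with its count:

import pandas

toy = pandas.DataFrame({'filePath': ['a.txt', 'a.txt', 'a.txt', 'b.txt', 'b.txt'],
                        'segments': ['cat', 'dog', 'cat', 'dog', 'fish']})
print(toy.groupby(['filePath', 'segments'])['segments'].size().reset_index(name='count'))
#   filePath segments  count
# 0    a.txt      cat      2
# 1    a.txt      dog      1
# 2    b.txt      dog      1
# 3    b.txt     fish      1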
3. Calculate the TF value
# Pivot into a word-by-document matrix of raw counts
textVector = segStat.pivot_table(
    index='segments',
    values='count',
    columns='filePath',
    fill_value=0)

# Log-normalised term frequency; mask zero counts, since applying
# 1 + log to the zero-filled matrix directly would produce -inf
counts = textVector.values          # .as_matrix() was removed in later pandas
TF = numpy.zeros_like(counts, dtype=float)
mask = counts > 0
TF[mask] = 1 + numpy.log(counts[mask])
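In formula terms, the weight computed here is the standard log normalisation tf(t, d) = 1 + ln f(t, d) for f(t, d) > 0 (and 0 otherwise), where f(t, d) is the raw count of word t in document d. A word occurring 5 times therefore gets 1 + ln 5 ≈ 2.61, so repeated occurrences add progressively less weight.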
4. Calculate the IDF value
# For each word, count the documents containing it and apply
# idf = 1 + log(N / (document frequency + 1))
def handle(x):
    idf = 1 + numpy.log(len(corpos) / (numpy.sum(x > 0) + 1))
    return idf

zhuan = textVector.T                  # rows: documents, columns: words
IDF = zhuan.apply(handle).values      # one idf value per word
IDF = IDF.reshape(-1, 1)              # column vector (8889 words in the author's corpus)
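Read as a formula, handle computes idf(t) = 1 + ln(N / (df(t) + 1)), where N = len(corpos) is the number of documents and df(t) is the number of documents containing word t; the +1 in the denominator smooths the weight for very common words. For example, with N = 100 and a word present in 9 documents, idf = 1 + ln(100 / 10) ≈ 3.30.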
5. Calculate the TF-IDF
# Broadcasting multiplies every document column by the per-word idf vector
tfidf = TF * IDF
tfidf_df = pandas.DataFrame(tfidf)
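As a worked example of the product: a word counted 3 times in one document (tf = 1 + ln 3 ≈ 2.10), in a 10-document corpus where it appears in 2 documents (idf = 1 + ln(10 / 3) ≈ 2.20), scores roughly 2.10 × 2.20 ≈ 4.63.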
6. For each text, output the top 100 words by TF-IDF together with their values
# Collect the file names (dropping the 4-character extension) in walk order;
# note that tfidf_df's columns follow textVector.columns (file paths as sorted
# by pivot_table), so the two orderings must agree
file = []
for root, dirs, files in os.walk(r'H:\19113117-Copy'):
    for name in files:
        file.append(name[0:-4])

# For every document, print the 100 words with the highest tf-idf score
for i in range(len(corpos)):
    sort = pandas.DataFrame(
        tfidf_df.loc[:, i].sort_values(ascending=False)[:100]   # .order() was removed in later pandas
    ).reset_index()
    names = sort.columns.tolist()
    names[names.index(i)] = 'value'
    sort.columns = names
    tagis = textVector.index[sort['index']]   # map row positions back to words
    print(file[i])
    for t in range(len(tagis)):
        print(tagis[t], sort.loc[t, 'value'])
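For comparison, the whole pipeline can be reproduced with scikit-learn's TfidfVectorizer. This is a minimal sketch under the assumption that scikit-learn is available; it smooths the idf differently and l2-normalises each document row, so the scores will not match the manual values exactly, though the keyword ranking is usually similar:

import numpy
from sklearn.feature_extraction.text import TfidfVectorizer

# The texts are already segmented with '/', so split on it instead of tokenising
vectorizer = TfidfVectorizer(tokenizer=lambda text: text.split('/'), lowercase=False)
matrix = vectorizer.fit_transform(corpos['content'])   # documents x words
words = vectorizer.get_feature_names_out()

# Print the 100 highest-scoring words for every document
for i, path in enumerate(corpos['filePath']):
    row = matrix[i].toarray().ravel()
    for j in numpy.argsort(row)[::-1][:100]:
        print(path, words[j], row[j])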
Computing TF-IDF for keyword extraction in Python