ANSJ usage and related downloads — reference: http://iamyida.iteye.com/blog/2220833. For the Solr and Tomcat configuration, refer to http://www.cnblogs.com/luxh/p/5016894.html.
1. Download the required ANSJ files from http://iamyida.iteye.com/blog/2220833; the downloaded ANSJ files are also available at http://pan.baidu.com/s/1kTLGp7L
2. Copy the ANSJ-related files into the Solr project: 1) put Ansj_seg-2.0.8.jar, Nlp-lang-0.2.jar and Solr-analyzer-ansj-5.1.0.jar into the Solr project directory
returned directly after some state is reset; in that case the operation of creating a new streams object is not actually performed. Here I only focus on the creation process:
streams = new SavedStreams();
setPreviousTokenStream(streams);
streams.tokenStream = new SentenceTokenizer(reader);
streams.filteredTokenStream = new WordTokenFilter(streams.tokenStream);
streams.filteredTokenStream = new PorterStemFilter(streams.filteredTokenStream);
if (!stopWords.isEmpty()) {
    st
The coreseek-3.2.13 release is compatible with the sphinx-0.9.9 configuration and can be used directly without modification.
However, to search Chinese text properly, you need to use the extra configuration parameters added by coreseek to set up Chinese word segmentation.
The following are the core configuration items for Chinese word segmentation. Please read them carefully and apply them to your configuration:
source <data source name>
{
    # ......
}
index <index name>
{
    # Use the Sphinx configuration directly in the follo
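The block above is cut off, and it mostly just carries over the stock Sphinx settings. The Chinese-segmentation part usually comes down to a couple of coreseek-specific parameters inside the index definition. The sketch below is based on coreseek's commonly documented options, not taken from this article; treat the dictionary path as an assumption and adjust it to your own mmseg install:

index <index name>
{
    # ... existing Sphinx index settings stay unchanged ...
    charset_type     = zh_cn.utf-8               # coreseek-specific: enable Chinese (mmseg) tokenization
    charset_dictpath = /usr/local/mmseg3/etc/    # coreseek-specific: directory holding the mmseg dictionary (uni.lib)
}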
Borrowed from the three articles on sentiment analysis on the great Su Jianlin's blog, with new words added on that basis. Stopword list download link: stopwords
Code Environment:
Python 2.7
tensorflow-gpu 1.0
jieba
After testing, the accuracy is as high as 98%; the results are as follows:
The code is as follows:
# -*- coding: utf-8 -*-
'''
On a GTX 1070, one round takes about 11 s; after 30 rounds of iteration the training-set accuracy is 98.41%.
Dropout cannot be used too heavily, otherwise
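The script itself is cut off above, so here is a hedged illustration of the same toolchain (jieba segmentation, a stopword list, TensorFlow 1.x, dropout). It is a minimal sketch, not Su Jianlin's original LSTM network: the tiny placeholder corpus, the stopwords.txt path, the bag-of-words features and the single hidden layer are all assumptions made only to show the moving parts.

# -*- coding: utf-8 -*-
# Minimal sketch only (NOT the original LSTM model): bag-of-words features plus
# one hidden layer with dropout, written against the TensorFlow 1.x API.
import io
import numpy as np
import jieba
import tensorflow as tf

stopwords = set(line.strip() for line in io.open('stopwords.txt', encoding='utf-8'))  # assumed path

def tokenize(text):
    # jieba segmentation, dropping single characters and stopwords
    return [w for w in jieba.lcut(text) if len(w) > 1 and w not in stopwords]

# placeholder corpus: (sentence, label) with 1 = positive, 0 = negative
texts = [u'这个产品质量很好，非常满意', u'太差了，完全不值这个价钱']
labels = [1, 0]

vocab = {w: i for i, w in enumerate(sorted({w for t in texts for w in tokenize(t)}))}

def vectorize(text):
    v = np.zeros(len(vocab), dtype=np.float32)
    for w in tokenize(text):
        if w in vocab:
            v[vocab[w]] += 1.0
    return v

X = np.array([vectorize(t) for t in texts])
y = np.array(labels, dtype=np.int32)

x_ph = tf.placeholder(tf.float32, [None, len(vocab)])
y_ph = tf.placeholder(tf.int32, [None])
keep_prob = tf.placeholder(tf.float32)          # dropout keep probability

W1 = tf.Variable(tf.truncated_normal([len(vocab), 64], stddev=0.1))
b1 = tf.Variable(tf.zeros([64]))
hidden = tf.nn.dropout(tf.nn.relu(tf.matmul(x_ph, W1) + b1), keep_prob)
W2 = tf.Variable(tf.truncated_normal([64, 2], stddev=0.1))
b2 = tf.Variable(tf.zeros([2]))
logits = tf.matmul(hidden, W2) + b2

loss = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y_ph, logits=logits))
train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for epoch in range(30):                     # 30 rounds, as in the text above
        _, l = sess.run([train_op, loss],
                        feed_dict={x_ph: X, y_ph: y, keep_prob: 0.8})
    print('final training loss: %.4f' % l)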
Some tips and notes: to make changes to the configuration files take effect, the Solr search engine needs to be rebooted. To delete all the indexed data from the yourcorename core of Solr, run 'bin/post -c yourcorename -d'. Create another fieldType to add the flexibility of using a different stopwords list: add the following content to schema.xml, and create an empty text file 'Alterstopwords.txt' in /server/solr/conf. You can add any content to this file
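Hedged side note (not from the original post, but this is the documented usage of the bin/post tool that ships with Solr 5.x): the delete command above is normally completed with a delete-by-query payload, e.g. bin/post -c yourcorename -d "<delete><query>*:*</query></delete>", which removes every document from the yourcorename core.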
comma-separated string, it is simply tailor-made for us. Then our SQL becomes:
SELECT * FROM content WHERE FIND_IN_SET('2', tags) AND id ...
In the process of flipping through these functions, you should have become deeply aware of how convinced the MySQL designers were about storing comma-separated fields, because so many functions are designed to deal with this problem. It looks so much better, everything seems perfect, doesn't it? In fact, if you have more tags, you need to create multiple SQL statements
I've been busy with exams recently and have had no time to write code; in a whole month there are only a few days when I even look at code. Recently I came across word-cloud visualizations. There are plenty of such tools on the Internet, but none of them is perfect: some do not support Chinese, some produce inexplicable Chinese word-frequency statistics, some do not support custom shapes, and none lets you customize the colors. So after searching online I decided to draw the word cloud with Python, mainly using the wordcloud library; installing it only needs pip install wordcloud
Vector Space Model
The basic idea is to represent each document as a vector of certain weighted word frequencies. In order to do so, the following parsing and extraction steps are needed.
Ignoring case, extract all unique words from the entire set of documents.
Eliminate non-content-bearing "stopwords" such as "a", "and", "the", etc. For sample lists of stopwords, see [#!Frakes:Ba
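As a small, hedged illustration of the weighted word-frequency vectors described above (not from the original text; it assumes scikit-learn and its built-in English stopword list):

# Vector space model sketch: lowercase the documents, drop English stopwords,
# and represent each document as a TF-IDF weighted word-frequency vector.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "The cat sat on the mat.",
    "A dog and a cat played in the garden.",
]
vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
X = vectorizer.fit_transform(docs)     # one sparse row vector per document
print(sorted(vectorizer.vocabulary_))  # the unique content-bearing words
print(X.toarray())                     # the weighted document vectors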
Words and their frequencies, then plot the wordcloud.
Example 1: tweets via Twitter
Step 1: Load all the required packages
library(twitteR)
library(tm)
library(wordcloud)
library(RColorBrewer)
Step 2: Get some tweets in English containing the words "machine learning"
mach_tweets = searchTwitter("machine learning", n=500, lang="en")
Step 3: Extract the text from the tweets into a vector
mach_text = sapply(mach_tweets, function(x) x$getText())
Step 4: Construct the lexical corpus and the term document matrix
we use
In the process of going through these functions, you should have realized how convinced the MySQL designers are about the comma-separated storage approach, because so many functions are designed to deal with exactly this situation.
It looks much better, and everything seems perfect, right? Actually, no. If you have more tags you need to build more SQL statements, and some records have more associated tags while others have fewer, so how do we sort the results by relevance?
At this time, you can
# imread is a function for reading and processing images
from scipy.misc import imread
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import matplotlib.pylab as plt

# read the mask picture
back_color = imread("./veer-141001498.png")
# set the font path
font = "C:\Windows\Fonts\STXINGKA.TTF"
wc = WordCloud(background_color="white",  # background color
               max_words=500,             # maximum number of words
               mask=back_color,           # mask: the region where the word cloud is drawn, the value o
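The snippet is cut off in the middle of the WordCloud(...) call. Once that call is completed, a typical continuation (a sketch based on the wordcloud library's documented API rather than on the original post; word_list stands for your own segmented words and is assumed) looks like this:

# typical continuation: build a space-separated string of words, generate the cloud,
# recolor it from the mask image, display it and save it
text = " ".join(word_list)                       # word_list: your segmented words (assumed)
wc.generate(text)
image_colors = ImageColorGenerator(back_color)   # take colors from the mask picture
plt.imshow(wc.recolor(color_func=image_colors))
plt.axis("off")
plt.show()
wc.to_file("wordcloud_out.png")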
The default delimiters for full-text retrieval are punctuation marks and stopwords, which is exactly what we need: full-text search splits the strings in MATCH and AGAINST at the commas and then matches the resulting terms.
It should be noted that the SQL above is just an example; if you run it directly you will not get any results, for the reasons below.
You need to create a FULLTEXT index on the tags field (if you are only testing you can skip this; building the index just improves performance, n
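A hedged sketch of that setup follows. The content table and tags column follow the article's example, the connection parameters are placeholders, and note that MySQL's minimum indexed word length means very short tags such as '2' will not match a full-text search:

# Sketch: add a FULLTEXT index on tags, then query it with MATCH ... AGAINST
# (FULLTEXT needs MyISAM, or InnoDB on MySQL 5.6+).
import pymysql

conn = pymysql.connect(host="localhost", user="root", password="***", db="test", charset="utf8")
cur = conn.cursor()

# one-time: build the full-text index on the comma-separated tags column
cur.execute("ALTER TABLE content ADD FULLTEXT INDEX ft_tags (tags)")

# the commas act as delimiters, so each tag becomes a separate indexed term
cur.execute("SELECT * FROM content WHERE MATCH(tags) AGAINST(%s IN BOOLEAN MODE)", ("php",))
for row in cur.fetchall():
    print(row)

cur.close()
conn.close()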
('... the text is being preprocessed ...')
'''
jieba ("stutter") segmentation tool for Chinese word processing
read_folder_path:  root path of the raw corpus to be processed
write_folder_path: output path for the segmented, cleaned Chinese corpus
'''
def chSegment(read_folder_path, write_folder_path):
    stopwords = {}.fromkeys([line.strip() for line in open('../database/stopwords/ch_stopwords.txt', 'r', encoding='utf-8')])  #
In Python 3, filter() returns an iterator rather than a list, so there are 2 solutions:
① switch back to Python 2
② wrap the filter call in an extra list(), as in the short example below
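A minimal example of option ② (the sample words and stopword set are made up for illustration):

# Python 3: filter() is lazy, so wrap it in list() before reusing the result
segs = ["机器", "学", "学习", "我们"]
stopwords = {"我们"}
segs = list(filter(lambda x: len(x) > 1, segs))          # drop single characters
segs = list(filter(lambda x: x not in stopwords, segs))  # drop stopwords
print(segs)  # ['机器', '学习']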
# df is a pandas DataFrame with a 'content' column; jieba and the stopwords list
# are assumed to have been loaded earlier. Python 2 syntax is kept here; in Python 3
# wrap the filter() calls in list() as noted above.
df = df.dropna()
lines = df.content.values.tolist()
sentences = []
for line in lines:
    try:
        segs = jieba.lcut(line)                            # jieba segmentation
        segs = filter(lambda x: len(x) > 1, segs)          # drop single characters
        segs = filter(lambda x: x not in stopwords, segs)  # drop stopwords
        sentences.append(segs)
    except Exception, e:
        print line
        continue
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=0.25)
Load the stopwords
def get_custom_stopwords(stop_words_file):
    with open(stop_words_file, encoding="utf-8") as f:
        custom_stopwords_list = [i.strip() for i in f.readlines()]
    return custom_stopwords_list
stop_words_file = "stopwords.txt"
stopwords = get_custom_stopwords(stop_words_file)  # load the stopword list
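A typical next step in this kind of tutorial (a sketch, not necessarily the original author's exact parameters; it assumes X_train and X_test hold whitespace-joined, jieba-segmented strings) is to vectorize with the custom stopword list and fit a simple classifier:

# bag-of-words features filtered by the custom stopword list, then Naive Bayes
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

vect = CountVectorizer(max_df=0.8,                 # drop terms appearing in >80% of documents
                       min_df=3,                   # drop very rare terms
                       stop_words=frozenset(stopwords))
nb = MultinomialNB()
nb.fit(vect.fit_transform(X_train), y_train)
print(nb.score(vect.transform(X_test), y_test))    # accuracy on the held-out 25%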
untangle – easily converts XML documents into Python objects.
Clean
bleach – cleans up HTML (requires html5lib).
sanitize – brings sanity to the messy world of real-world data.
Text Processing — libraries for parsing and manipulating plain text.
General
difflib – (Python standard library) helps compute differences between sequences.
Levenshtein – quickly computes Levenshtein distance and string similarity.
fuzzywuzzy – fuzzy string matching.
esmr
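A quick, hedged taste of the string-similarity entries above (difflib ships with the standard library, fuzzywuzzy needs pip install fuzzywuzzy; the sample strings are arbitrary):

# compare two near-identical strings with difflib and fuzzywuzzy
import difflib
from fuzzywuzzy import fuzz

a, b = "natural language processing", "natural langauge processing"
print(difflib.SequenceMatcher(None, a, b).ratio())  # similarity in [0, 1]
print(fuzz.ratio(a, b))                             # similarity in [0, 100]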
MySQL
5.3.2 Basic commands
5.3.3 Integrating with Python
5.3.4 Database techniques and best practices
5.3.5 The "six degrees of separation game" in MySQL
5.4 Email
Chapter 6 Reading documents
6.1 Document encoding
6.2 Plain text
6.3 CSV
6.4 PDF
6.5 Microsoft Word and .docx
Part II Advanced data acquisition
Chapter 7 Data cleansing
7.1 Cleaning data by writing code
7.2 Storing data and then cleaning it
Chapter 8 Natural language processing
8.1 Summarizing data
8.2 Markov models
8.3 N