We have achieved the simple question and answer, but now a problem arises: I cannot be sure the question will be phrased exactly as "your name"; it might be "who are you", "what's your name", and so on. This leads to another technology in artificial intelligence:

Natural Language Processing (NLP): roughly speaking, letting the computer understand what a sentence means. NLP is the computer "thinking about" what you said, so that it knows "who are you", "your name" and "what's your name" all express the same thing.

What this requires is: semantic similarity.
Next we will use Python to implement a simple piece of natural language processing, with the help of two powerful third-party Python libraries.
The first is a library called jieba, which segments a Chinese string into words:

pip install jieba

We usually call this library the "stutter" segmenter, because jieba literally means "to stutter" in Chinese; the library is made in China. Its basic usage:
```python
import jieba

key_word = "你叫什么名字"        # "What's your name", the sentence to segment
cut_word = jieba.cut(key_word)   # use jieba's cut method to segment the sentence
print(cut_word)                  # <generator object Tokenizer.cut at 0x03676390>
                                 # jieba.cut returns a generator; if you are not
                                 # familiar with generators, just turn it into a list
cut_word_list = list(cut_word)
print(cut_word_list)             # ['你', '叫', '什么', '名字'] ("you / call / what / name")
```
The test code makes it obvious: the Chinese string is split into words, which are stored in a list.
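If you do not have Chinese text at hand, the generator-then-list pattern above can be reproduced with a tiny stand-in for jieba.cut. The fake_cut function below is purely hypothetical and just splits on spaces; it only exists to show why the list() conversion is needed:

```python
def fake_cut(sentence):
    """A hypothetical stand-in for jieba.cut: yields tokens lazily,
    just as jieba returns a generator rather than a list."""
    for token in sentence.split():
        yield token

cut = fake_cut("what is your name")
print(cut)          # a generator object, like the result of jieba.cut
print(list(cut))    # ['what', 'is', 'your', 'name']
```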
The second is a language-modelling library called gensim:

pip install gensim

This library is very powerful: it encapsulates many machine-learning algorithms and is currently a mainstream library for artificial-intelligence applications. It is not trivial to understand and requires some Python data-processing skills.
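To build some intuition for what gensim is about to do for us, here is a minimal pure-Python sketch of the same pipeline: give every word an integer id, turn each document into sparse (word_id, count) pairs, then compare those vectors. The helper names (build_token2id, doc2bow, cosine_sim) are my own inventions that only mimic the behaviour of gensim's corpora.Dictionary and doc2bow; gensim itself assigns the ids differently.

```python
from collections import Counter
from math import sqrt

def build_token2id(docs):
    """Assign each unique word an integer id (roughly what corpora.Dictionary does)."""
    token2id = {}
    for doc in docs:
        for word in doc:
            token2id.setdefault(word, len(token2id))
    return token2id

def doc2bow(token2id, doc):
    """Turn a token list into sorted (word_id, count) pairs, like dictionary.doc2bow."""
    counts = Counter(doc)
    return sorted((token2id[w], c) for w, c in counts.items() if w in token2id)

def cosine_sim(vec_a, vec_b):
    """Cosine similarity between two sparse bag-of-words vectors."""
    a, b = dict(vec_a), dict(vec_b)
    dot = sum(a[i] * b[i] for i in a if i in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# a toy corpus, already tokenized (as jieba.cut would produce for Chinese)
docs = [["what", "is", "your", "name"],
        ["how", "old", "are", "you"],
        ["what", "name", "is", "that"]]
query = ["what", "is", "your", "name"]

token2id = build_token2id(docs)
corpus = [doc2bow(token2id, d) for d in docs]
query_vec = doc2bow(token2id, query)

sims = [cosine_sim(query_vec, v) for v in corpus]
best = max(range(len(sims)), key=lambda i: sims[i])
print(sims)        # similarity of the query to each document
print(docs[best])  # the closest document
```

The query is matched to the document that shares the most words with it, which is exactly the job the gensim script that follows does at scale.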
```python
import jieba
from gensim import corpora, models, similarities

l1 = ["你叫什么名字",      # "What's your name?"
      "你今年几岁了",      # "How old are you this year?"
      "你多高，你胸多大",  # "How tall are you, how big is your chest?"
      "你胸多大"]          # "How big is your chest?"
a = "你今年几岁了"         # the sentence we want to match against l1

# segment every sentence in l1
all_doc_list = []
for doc in l1:
    doc_list = [word for word in jieba.cut(doc)]
    all_doc_list.append(doc_list)
print(all_doc_list)

# segment the test sentence as well
doc_test_list = [word for word in jieba.cut(a)]

# Build the corpus
dictionary = corpora.Dictionary(all_doc_list)  # make a "bag of words"
# Understanding the bag of words:
# a bag of words is simply a dictionary that maps each word (key)
# to an integer flag (value), for example:
# {'什么': 0, '你': 1, '名字': 2, '叫': 3, ..., '了': 5, '今年': 6, '几岁': 7, ...}
# What it is used for becomes clear with the next step.
print("token2id", dictionary.token2id)
print("dictionary", dictionary, type(dictionary))

# The corpus: match the words of every list in all_doc_list against
# the keys of the dictionary.  For example ['你', '今年', '几岁', '了']
# yields [(1, 1), (5, 1), (6, 1), (7, 1)]: the first number is the
# word id (1 stands for 你) and the second is how often it occurs,
# and likewise 5 = 了, 6 = 今年, 7 = 几岁.
corpus = [dictionary.doc2bow(doc) for doc in all_doc_list]
print("corpus", corpus, type(corpus))

# turn the sentence whose similarity we want to find into a vector too
doc_test_vec = dictionary.doc2bow(doc_test_list)
print("doc_test_vec", doc_test_vec, type(doc_test_vec))

# Train an LSI model on the corpus (the initial corpus); for now it is
# enough to know that LSI is one of gensim's models, no details here
lsi = models.LsiModel(corpus)
print("lsi", lsi, type(lsi))
# the training result over the corpus
print("lsi[corpus]", lsi[corpus])
# the vector representation of doc_test_vec under the trained model
print("lsi[doc_test_vec]", lsi[doc_test_vec])

# Text similarity:
# sparse-matrix similarity, initialised with the training result of the main corpus
index = similarities.SparseMatrixSimilarity(lsi[corpus],
                                            num_features=len(dictionary.keys()))
print("index", index, type(index))

# compute the similarity of doc_test_vec against every sentence in the corpus
sim = index[lsi[doc_test_vec]]
print("sim", sim, type(sim))

# sort the (index, similarity) pairs so the highest similarity comes first
# cc = sorted(enumerate(sim), key=lambda item: item[1], reverse=True)
cc = sorted(enumerate(sim), key=lambda item: -item[1])
print(cc)

text = l1[cc[0][0]]
print(a, text)
```
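The script above calls models.LsiModel without explaining it. Conceptually, LSI (Latent Semantic Indexing) amounts to a truncated SVD of the term-document count matrix: documents are projected onto a small number of latent "topic" dimensions and compared there, so two documents can look similar even without sharing exact words. Here is a rough numpy sketch of that idea; the toy matrix and k = 2 are arbitrary choices for illustration, not what gensim computes internally for our corpus:

```python
import numpy as np

# Toy term-document count matrix: rows are words, columns are documents.
# (This is what corpus, built with doc2bow, amounts to in dense form.)
A = np.array([[1, 0, 1],
              [1, 0, 0],
              [0, 1, 0],
              [0, 1, 1],
              [1, 0, 1]], dtype=float)

# LSI is essentially a truncated singular value decomposition
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                                   # number of latent topics to keep
doc_vectors = np.diag(s[:k]) @ Vt[:k]   # each column: one document in topic space

def cosine(u, v):
    """Cosine similarity of two dense vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# compare documents in the latent space instead of the raw word space
print(cosine(doc_vectors[:, 0], doc_vectors[:, 2]))
print(cosine(doc_vectors[:, 0], doc_vectors[:, 1]))
```

In this toy matrix, documents 0 and 2 share several words while document 1 shares none with document 0, and the latent-space cosines reflect that.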
High energy ahead. Coming next, Python AI Road, Part 4: jieba and gensim had better not be separated, the simplest similarity implementation.