Python AI Road, Part 4: jieba and gensim are better together (the simplest similarity implementation)

Source: Internet
Author: User

Simple question answering is now working, but a new problem has come up: I cannot be sure the question will always be phrased as "What is your name?" It might be "Who are you?", "What are you called?", and so on. This brings us to another technology in artificial intelligence:

Natural language processing (NLP): roughly speaking, it lets the computer understand what a sentence is meant to express. You can think of NLP as the computer thinking about what you said, so that it knows "Who are you?", "What are you called?", and "What is your name?" all mean the same thing.

What we need for this is semantic similarity.
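
As a rough illustration of what a similarity score is, here is a minimal sketch that compares two already-tokenized sentences by cosine similarity over raw word counts. The function name `cosine_similarity` and the sample sentences are assumptions made for illustration, not part of the original article, and word overlap is only a crude stand-in for real semantic similarity.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between two token lists, using raw word counts."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    # Dot product over the words that both sentences share.
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[word] * counts_b[word] for word in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty sentence is similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical, already-tokenized sentences (not from the original article).
question = ["what", "is", "your", "name"]
same_words = ["your", "name", "is", "what"]
unrelated = ["how", "old", "are", "you"]

print(cosine_similarity(question, same_words))  # same words, different order
print(cosine_similarity(question, unrelated))   # no words in common
```

A score near 1 means the two sentences use nearly the same words, and a score of 0 means they share none. Note that this sketch would score "Who are you?" and "What is your name?" as unrelated, which is why the article goes on to use a trained model rather than plain word counts.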

Next, we use Python to implement a simple piece of natural language processing.

Once again we turn to Python's powerful third-party libraries.

The first is jieba, a library that segments Chinese strings into words.

pip install jieba

This library is usually called "Jieba word segmentation" (jieba literally means "stutter"), and it was made in China. Basic usage looks like this:

    import jieba

    key_word = "你叫什么名字"  # "What is your name": the sentence to segment

    cut_word = jieba.cut(key_word)  # segment the sentence with jieba's cut method
    print(cut_word)  # <generator object Tokenizer.cut at 0x03676390>; ignore this if generators are unfamiliar

    cut_word_list = list(cut_word)  # cut returns a generator, so remember to turn it into a list
    print(cut_word_list)  # ['你', '叫', '什么', '名字'] ("you", "called", "what", "name")

The test code makes the result clear: the Chinese string is split into words, which are then stored in a list.
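
The reason for the `list(...)` call is that a generator can only be consumed once. The sketch below shows this with `fake_cut`, a hypothetical stand-in that splits on spaces. It is not jieba's API and is used here only so that the example runs without any third-party library.

```python
def fake_cut(sentence):
    """Hypothetical stand-in for a tokenizer that yields tokens lazily."""
    for token in sentence.split():
        yield token


tokens = fake_cut("what is your name")
first_pass = list(tokens)   # consumes the generator
second_pass = list(tokens)  # nothing is left to yield

print(first_pass)   # ['what', 'is', 'your', 'name']
print(second_pass)  # []
```

If the tokens are needed more than once, keep the list rather than the generator. jieba also provides `lcut`, which returns a list directly.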

The second is a language model training library called gensim.

pip install gensim

This training library is very powerful. It wraps many machine learning algorithms and is one of the mainstream libraries in AI applications today. It is not easy to understand, and using it requires some Python data-processing skills.

    import jieba
    import gensim
    from gensim import corpora
    from gensim import models
    from gensim import similarities

    # Candidate questions: "What is your name", "How old are you this year",
    # "How tall are you, how big is your chest", "How big is your chest"
    l1 = ["你的名字是什么", "你今年几岁了", "你有多高你胸多大", "你胸多大"]
    a = "你今年多大了"  # the query: "How old are you this year", phrased differently

    all_doc_list = []
    for doc in l1:
        doc_list = [word for word in jieba.cut(doc)]
        all_doc_list.append(doc_list)

    print(all_doc_list)
    doc_test_list = [word for word in jieba.cut(a)]

    # Build the corpus
    dictionary = corpora.Dictionary(all_doc_list)  # build the bag of words
    # About the bag of words:
    # it is a dictionary that maps each of many words (key) to an id (value)
    # For example: {'什么': 0, '你': 1, '名字': 2, '是': 3, '的': 4, '了': 5, '今年': 6, '几岁': 7, '多': 8, '有': 9, '胸多大': 10, '高': 11}
    # As for what it is used for, keep that question in mind and read on
    print("token2id", dictionary.token2id)
    print("dictionary", dictionary, type(dictionary))

    corpus = [dictionary.doc2bow(doc) for doc in all_doc_list]
    # Corpus:
    # the words of each list in all_doc_list are matched against the keys of the dictionary
    # For example, ['你', '今年', '几岁', '了'] gives [(1, 1), (5, 1), (6, 1), (7, 1)]
    # In (1, 1), the first 1 is the id of '你' and the second 1 means it appears once;
    # likewise 5 = '了', 6 = '今年', 7 = '几岁', each appearing once
    print("corpus", corpus, type(corpus))

    # Convert the token list we want to find matches for into a bag-of-words vector, doc_test_vec
    doc_test_vec = dictionary.doc2bow(doc_test_list)
    print("doc_test_vec", doc_test_vec, type(doc_test_vec))

    # Train an LSI model on the corpus
    lsi = models.LsiModel(corpus)
    # Understanding this step requires studying the LSI model itself, which is not covered here
    print("lsi", lsi, type(lsi))
    # The training result for the corpus
    print("lsi[corpus]", lsi[corpus])
    # The vector representation of doc_test_vec in the trained LSI space
    print("lsi[doc_test_vec]", lsi[doc_test_vec])

    # Text similarity
    # Build a sparse-matrix similarity index from the training result of the main corpus
    index = similarities.SparseMatrixSimilarity(lsi[corpus], num_features=len(dictionary.keys()))
    print("index", index, type(index))

    # Compare the LSI vector of doc_test_vec against the LSI vectors of the corpus
    sim = index[lsi[doc_test_vec]]
    print("sim", sim, type(sim))

    # Sort the (index, similarity) pairs and take the one with the highest similarity
    # cc = sorted(enumerate(sim), key=lambda item: item[1], reverse=True)
    cc = sorted(enumerate(sim), key=lambda item: -item[1])
    print(cc)

    text = l1[cc[0][0]]
    print(a, text)
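
To make the bag-of-words step more concrete, here is a minimal pure-Python sketch of the idea behind the dictionary and `doc2bow`. The helper names `build_token2id` and `doc2bow` and the English sample tokens are assumptions for illustration. This is a simplified stand-in, not gensim's implementation, and the ids it assigns (first-seen order) may differ from the ones gensim produces.

```python
from collections import Counter


def build_token2id(docs):
    """Assign an integer id to each distinct token, in first-seen order."""
    token2id = {}
    for doc in docs:
        for token in doc:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id


def doc2bow(token2id, doc):
    """Return sorted (token_id, count) pairs, skipping tokens without an id."""
    counts = Counter(token2id[token] for token in doc if token in token2id)
    return sorted(counts.items())


# Hypothetical pre-tokenized documents (English stand-ins for jieba output).
docs = [["your", "name", "is", "what"], ["you", "this-year", "how-old"]]
token2id = build_token2id(docs)

print(token2id)
print(doc2bow(token2id, ["your", "name", "name", "unknown"]))
```

Each pair reads as (word id, number of occurrences), the same shape as the `[(1, 1), (5, 1), (6, 1), (7, 1)]` example in the comments above. A word the dictionary has never seen, such as "unknown" here, simply does not appear in the result.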
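
The final ranking step can be looked at on its own. The sketch below uses made-up similarity scores and placeholder candidate strings, both hypothetical, to show how sorting the (index, score) pairs leads back to the best-matching sentence.

```python
# Hypothetical similarity scores, one per candidate (made up for illustration).
candidates = ["question A", "question B", "question C"]
sim = [0.12, 0.87, 0.45]

# Pair each score with its index, then sort by score, highest first.
ranked = sorted(enumerate(sim), key=lambda item: item[1], reverse=True)
best_index, best_score = ranked[0]

print(ranked)                  # [(1, 0.87), (2, 0.45), (0, 0.12)]
print(candidates[best_index])  # question B
```

Sorting with `key=lambda item: -item[1]`, as the article's code does, gives the same ordering for numeric scores.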
