The path of machine learning: Python practice of Word2vec word vector technology
Git: https://github.com/linyi0604/machinelearning
Word vector technology: Word2vec. Each successive lexical fragment places a constraint on the words that can follow it; this surrounding material is called the context. Word2vec uses these contexts to find semantic-level connections between words and sentences.
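To make the notion of "context" concrete, here is a minimal illustrative sketch (not part of the original script): it slides a fixed-size window over a tokenized sentence and prints, for each target word, the surrounding words that Word2vec would treat as its context.

# minimal sketch: enumerate (target, context) pairs for a window of size 2
sentence = "the quick brown fox jumps over the lazy dog".split()
window = 2
for i, target in enumerate(sentence):
    context = sentence[max(0, i - window):i] + sentence[i + 1:i + 1 + window]
    print(target, "<-", context)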
from sklearn.datasets import fetch_20newsgroups
from bs4 import BeautifulSoup
import nltk, re
from gensim.models import word2vec

# nltk.download('punkt')

'''
Word vector technology: Word2vec
Each successive lexical fragment places a constraint on what follows it, called its context.
The goal is to find semantic-level connections between sentences.
'''

# download the news data over the network
news = fetch_20newsgroups(subset="all")
X, y = news.data, news.target


# define a function that splits each piece of news into sentences and returns a list of sentences
def news_to_sentences(news):
    news_text = BeautifulSoup(news, "html.parser").get_text()
    tokenizer = nltk.data.load("tokenizers/punkt/english.pickle")
    raw_sentences = tokenizer.tokenize(news_text)
    sentences = []
    for sent in raw_sentences:
        # keep only letters, lower-case, and split into tokens
        temp = re.sub("[^a-zA-Z]", " ", sent.lower().strip()).split()
        sentences.append(temp)
    return sentences


# strip the sentences out of the long news articles for training
sentences = []
for i in X:
    sentence_list = news_to_sentences(i)
    sentences += sentence_list

# dimensionality of the word vectors
num_features = 300
# minimum frequency a word must have to be considered
min_word_count = 20
# number of CPU cores used for parallel computation
num_workers = 2
# context window size for training the word vectors
context = 5
# downsampling rate for frequent words
downsampling = 1e-3

# train the word vector model
model = word2vec.Word2Vec(sentences,
                          workers=num_workers,
                          size=num_features,
                          min_count=min_word_count,
                          window=context,
                          sample=downsampling)
# this marks the current word vectors as final, and can also speed up use of the model
model.init_sims(replace=True)

# use the trained model to find the 10 words most related to "college" in the text
print(model.most_similar("college"))

'''
[('wisconsin', 0.7664438486099243),
 ('osteopathic', 0.7474539279937744),
 ('madison', 0.7433826923370361),
 ('univ', 0.7296794652938843),
 ('melbourne', 0.7212647199630737),
 ('walla', 0.7068545818328857),
 ('maryland', 0.7038443088531494),
 ('carnegie', 0.7038302421569824),
 ('institute', 0.7003713846206665),
 ('informatics', 0.6968873143196106)]
'''
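Note that the script above targets an older gensim release: it passes size= and calls model.init_sims() and model.most_similar() directly, which were removed in gensim 4.0. A hedged equivalent of the training and query steps, assuming gensim >= 4.0 and that sentences and the parameters above are already defined, would look roughly like this:

from gensim.models import word2vec

# same training call, with the gensim 4.x parameter name (size -> vector_size)
model = word2vec.Word2Vec(sentences,
                          workers=num_workers,
                          vector_size=num_features,
                          min_count=min_word_count,
                          window=context,
                          sample=downsampling)

# init_sims() is no longer needed; similarity queries go through model.wv
print(model.wv.most_similar("college"))
print(model.wv.similarity("college", "university"))  # cosine similarity between two words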