Python Beginner Exercise (1): Chinese Word Segmentation Based on Full Segmentation and a Unigram Model

Last updated: 2018-12-07 Source: Internet

Uploaded by: User


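1. This is the example from the book Beautiful Data. Since no Chinese corpus was available, an English string is used instead; the idea is the same (for example, splitting 'finallylast' into ['finally', 'last']).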
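
To make "full segmentation" concrete, here is a minimal sketch that lists every way to cut a short string into consecutive pieces. The function name all_segmentations is hypothetical and is not part of the module below; it only illustrates the search space that the unigram model has to score.

```python
def all_segmentations(text):
    """Return every way to cut text into consecutive non-empty pieces."""
    if not text:
        return [[]]  # exactly one way to segment the empty string
    results = []
    for i in range(1, len(text) + 1):
        first, rest = text[:i], text[i:]
        # Combine this first piece with every segmentation of the remainder.
        for tail in all_segmentations(rest):
            results.append([first] + tail)
    return results

print(all_segmentations('abc'))
# [['a', 'b', 'c'], ['a', 'bc'], ['ab', 'c'], ['abc']]
```

A string of n characters has 2**(n-1) segmentations, so the number of candidates grows quickly with the length of the input.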

2. The segmentation module

Code:

import operator
from functools import reduce  # builtin on Python 2; must be imported on Python 3
def segment(text):
    "Return a list of words that is the best segmentation of text."
    if not text: return []
    candidates = ([first]+segment(rem) for first,rem in splits(text))
    return max(candidates, key=Pwords)

def splits(text, L=20):
    "Return a list of all possible (first, rem) pairs, len(first)<=L."
    return [(text[:i+1], text[i+1:])
            for i in range(min(len(text), L))]

def Pwords(words):
    "The Naive Bayes probability of a sequence of words."
    return product(Pw(w) for w in words)

def product(nums):
    "Return the product of a sequence of numbers."
    return reduce(operator.mul, nums, 1)

class Pdist(dict):
    "A probability distribution estimated from counts in datafile."
    def __init__(self, data=[], N=None, missingfn=None):
        for key,count in data:
            self[key] = self.get(key, 0) + int(count)
        self.N = float(N or sum(self.values()))  # values() works on Python 2 and 3
        self.missingfn = missingfn or (lambda k, N: 1./N)
    def __call__(self, key):
        if key in self: return self[key]/self.N
        else: return self.missingfn(key, self.N)

def datafile(name, sep='\t'):
    "Read key,value pairs from file."
    with open(name) as f:  # file() exists only on Python 2; open() works on both
        for line in f:
            yield line.split(sep)

def avoid_long_words(key, N):
    "Estimate the probability of an unknown word."
    return 10./(N * 10**len(key))

N = 1024908267229 ## Number of tokens

Pw = Pdist(datafile(r'C:\Python26\Myngrams\count_1w.txt'), N, avoid_long_words)
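
The recursive segment above re-solves the same suffixes many times. A common remedy is to cache the result for each suffix and to add log probabilities instead of multiplying raw ones, which avoids underflow on long inputs. The sketch below shows one way to do this under stated assumptions: it targets Python 3 (functools.lru_cache), the names TOY_COUNTS, log_pw and segment_memo are hypothetical, and the counts are made up for illustration rather than taken from count_1w.txt.

```python
import math
from functools import lru_cache

# Hypothetical toy counts, for illustration only (not real corpus data).
TOY_COUNTS = {'finally': 50, 'last': 80, 'final': 60, 'fin': 5, 'ally': 10, 'a': 200}
TOY_N = float(sum(TOY_COUNTS.values()))

def log_pw(word):
    """Log probability of one word; unknown words get a length penalty."""
    if word in TOY_COUNTS:
        return math.log(TOY_COUNTS[word] / TOY_N)
    # Same shape as avoid_long_words: each extra letter costs a factor of 10.
    return math.log(10.0 / (TOY_N * 10 ** len(word)))

@lru_cache(maxsize=None)
def segment_memo(text, max_len=20):
    """Return (log probability, words) for the best segmentation of text."""
    if not text:
        return 0.0, ()
    candidates = []
    for i in range(1, min(len(text), max_len) + 1):
        first, rest = text[:i], text[i:]
        rest_score, rest_words = segment_memo(rest, max_len)  # cached per suffix
        candidates.append((log_pw(first) + rest_score, (first,) + rest_words))
    return max(candidates)  # highest log probability wins

score, words = segment_memo('finallylast')
print(list(words))
# ['finally', 'last']
```

With the cache in place each distinct suffix is solved once, so the work grows roughly with len(text) * max_len rather than exponentially.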

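Note: add an empty __init__.py to the Myngrams directory so that Python treats it as a package. (The import in the next step assumes the code above is saved as Mysegment.py in that directory.)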

3. Verification

>>> from Myngrams import Mysegment

>>> Mysegment.segment('finallylast')
['finally', 'last']

>>> Mysegment.segment('unregardedsitdown')
['un', 'regarded', 'sitdown']

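The second result is segmented incorrectly, for two reasons: the training corpus does not contain the word 'unregarded', and the probability of 'sitdown' as a single word is much greater than P(sit)P(down). A bigram model is worth considering as the next step.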
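
To see why an unseen word such as 'unregarded' is so unlikely to survive as one piece, it helps to evaluate the unknown-word estimate directly. The sketch below repeats the avoid_long_words formula and the value of N from the listing above; the variable name p_unknown is only illustrative.

```python
N = 1024908267229  # number of tokens, as in the listing above

def avoid_long_words(key, N):
    """Estimate the probability of an unknown word (same formula as above)."""
    return 10.0 / (N * 10 ** len(key))

# 'unregarded' has 10 letters, so it is scored as 10 / (N * 10**10).
p_unknown = avoid_long_words('unregarded', N)
print(p_unknown)  # on the order of 1e-21
```

Each additional letter divides the estimate by 10, so any split of the string into known words whose probabilities multiply to more than this value will be preferred over keeping the unseen word whole.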
