Jieba Word Segmentation Source Code Reading


Jieba is an open-source Chinese word segmentation library. I recently read through its source code, and this post records some notes.

After downloading Jieba, running tree on the package gives the following main directory structure:

├── jieba
│   ├── analyse
│   │   ├── analyzer.py
│   │   ├── idf.txt
│   │   ├── __init__.py
│   │   ├── textrank.py
│   │   └── tfidf.py
│   ├── _compat.py
│   ├── dict.txt
│   ├── finalseg
│   │   ├── __init__.py
│   │   ├── prob_emit.p
│   │   ├── prob_emit.py
│   │   ├── prob_start.p
│   │   ├── prob_start.py
│   │   ├── prob_trans.p
│   │   └── prob_trans.py
│   ├── __init__.py
│   ├── __main__.py
│   └── posseg
│       ├── char_state_tab.p
│       ├── char_state_tab.py
│       ├── __init__.py
│       ├── prob_emit.p
│       ├── prob_emit.py
│       ├── prob_start.p
│       ├── prob_start.py
│       ├── prob_trans.p
│       ├── prob_trans.py
│       └── viterbi.py
├── LICENSE
├── MANIFEST.in
├── README.md
├── setup.py
└── test

The analyse directory contains the algorithms for extracting keywords from text (TF-IDF and TextRank).

dict.txt is the main dictionary; each line records a word, its frequency, and its part of speech.
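
A few illustrative lines in that format (the frequencies below are made up for illustration, not the values actually shipped with Jieba):

北京 34488 ns
大学 12345 n
北京大学 2053 nt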

__init__.py is the main entry point of Jieba.

finalseg holds the HMM implementation; when HMM is enabled, the code here is called after the initial dictionary-based segmentation to handle fragments not covered by the dictionary.

Next, let us look at several functions in the main interface, __init__.py:

def gen_pfdict(self, f):
    lfreq = {}
    ltotal = 0
    f_name = resolve_filename(f)
    for lineno, line in enumerate(f, 1):
        try:
            line = line.strip().decode('utf-8')
            word, freq = line.split(' ')[:2]
            freq = int(freq)
            lfreq[word] = freq
            ltotal += freq
            for ch in xrange(len(word)):
                wfrag = word[:ch + 1]
                if wfrag not in lfreq:
                    lfreq[wfrag] = 0
        except ValueError:
            raise ValueError(
                'invalid dictionary entry in %s at Line %s: %s' % (f_name, lineno, line))
    f.close()
    return lfreq, ltotal

gen_pfdict loads dict.txt and builds the prefix dictionary. lfreq records how often each word in dict.txt occurs, and also contains every prefix of every word, with the frequency set to 0 for prefixes that are not themselves words; ltotal is the sum of all word frequencies. The resulting prefix dictionary is stored in self.FREQ. For example, for the word 不拘一格 ("eclectic"), the entries added to lfreq are {"不": 0, "不拘": 0, "不拘一": 0, "不拘一格": freq}, where freq is the frequency recorded for it in dict.txt.
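
To see concretely what gen_pfdict produces, here is a standalone sketch (Python 3) that builds the same kind of prefix dictionary from a toy in-memory word list; the words and frequencies are made up for the example:

# Standalone sketch of the prefix-dictionary construction done by gen_pfdict,
# fed from a toy in-memory list instead of dict.txt (frequencies are made up).
toy_dict = [("不拘一格", 10), ("不是", 5000)]

lfreq, ltotal = {}, 0
for word, freq in toy_dict:
    lfreq[word] = freq
    ltotal += freq
    # every prefix of the word gets an entry; 0 means "prefix only, not a word"
    for ch in range(len(word)):
        wfrag = word[:ch + 1]
        if wfrag not in lfreq:
            lfreq[wfrag] = 0

print(lfreq)   # {'不拘一格': 10, '不': 0, '不拘': 0, '不拘一': 0, '不是': 5000}
print(ltotal)  # 5010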


def get_DAG(self, sentence):
    self.check_initialized()
    DAG = {}
    N = len(sentence)
    for k in xrange(N):
        tmplist = []
        i = k
        frag = sentence[k]
        while i < N and frag in self.FREQ:
            if self.FREQ[frag]:
                tmplist.append(i)
            i += 1
            frag = sentence[k:i + 1]
        if not tmplist:
            tmplist.append(k)
        DAG[k] = tmplist
    return DAG

FREQ is the prefix dictionary generated from dict.txt. Based on FREQ, get_DAG builds a directed acyclic graph for each sentence. The graph is stored in the dict DAG, where DAG[pos] is a list [a, b, c, ...], with pos ranging from 0 to len(sentence) - 1, meaning that sentence[pos:a + 1], sentence[pos:b + 1], ... all appear in the dictionary.

For example, with the sentence "但也并不是那么出乎意料或难以置信" ("but it is not that unexpected or unbelievable") as input, the generated DAG is as follows; each line shows a start position, the list of possible end positions, and the character at that position.

0 [0] 但
1 [1] 也
2 [2] 并
3 [3, 4] 不
4 [4] 是
5 [5, 6] 那
6 [6] 么
7 [7, 8, 10] 出
8 [8] 乎
9 [9, 10] 意
10 [10] 料
11 [11] 或
12 [12, 13, 15] 难
13 [13] 以
14 [14, 15] 置
15 [15] 信
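
The same structure can be reproduced with a standalone version of get_DAG (Python 3) over a toy prefix dictionary; only the fragment 难以置信 is used here, and the frequencies are made up:

# Standalone sketch of get_DAG for the fragment "难以置信", using a toy
# prefix dictionary (frequencies are made up; 0 marks prefix-only entries).
FREQ = {
    "难": 10, "难以": 20, "难以置": 0, "难以置信": 30,
    "以": 40, "置": 5, "置信": 15, "信": 25,
}

def get_dag(sentence, freq):
    dag = {}
    n = len(sentence)
    for k in range(n):
        tmplist = []
        i = k
        frag = sentence[k]
        while i < n and frag in freq:
            if freq[frag]:          # non-zero frequency: frag is a real word
                tmplist.append(i)
            i += 1
            frag = sentence[k:i + 1]
        if not tmplist:             # every character can at least stand alone
            tmplist.append(k)
        dag[k] = tmplist
    return dag

print(get_dag("难以置信", FREQ))
# {0: [0, 1, 3], 1: [1], 2: [2, 3], 3: [3]}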


The next step is segmenting the sentence, i.e. jieba.cut. The overall process can be summed up as follows (a short usage sketch follows the list):
1. Given the sentence to be segmented, use a regular expression (re_han) to split it into blocks of matching Chinese (and alphanumeric) characters;
2. Call get_DAG(sentence) to obtain the DAG for each block. This first checks (check_initialized) whether the dictionary has been loaded; if not, the initialize function is called, which looks for a cached prefix dictionary (cache_file) and, if present, loads it directly with marshal.load, otherwise it calls gen_pfdict on the specified dictionary dict.txt to build the prefix dictionary. Once initialization is done, get_DAG is called to obtain the DAG of the sentence;
3. Depending on the mode, cut_block is bound to one of __cut_all, __cut_DAG or __cut_DAG_NO_HMM and is applied to each block using its DAG. For example, with cut_block = __cut_DAG, the maximum-probability path is found by dictionary lookup plus dynamic programming; consecutive characters not found in the dictionary are combined into a new fragment, which is then segmented with the HMM model, i.e. this is how new words outside the dictionary are recognized;
4. Python's yield syntax is used to return a generator that produces the words one by one.
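
A minimal usage sketch of the public API, showing how the cut_all and HMM arguments select the three cut_block methods described above:

# -*- coding: utf-8 -*-
# Minimal usage sketch of the three segmentation modes.
import jieba

s = "但也并不是那么出乎意料或难以置信"

print("/".join(jieba.cut(s, cut_all=True)))   # full mode     -> __cut_all
print("/".join(jieba.cut(s)))                 # precise mode  -> __cut_DAG (HMM on by default)
print("/".join(jieba.cut(s, HMM=False)))      # precise mode  -> __cut_DAG_NO_HMM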


def __cut_all(self, sentence):
    dag = self.get_DAG(sentence)
    old_j = -1
    for k, L in iteritems(dag):
        if len(L) == 1 and k > old_j:
            yield sentence[k:L[0] + 1]
            old_j = L[0]
        else:
            for j in L:
                if j > k:
                    yield sentence[k:j + 1]
                    old_j = j

__cut_all is the full-mode segmentation: it simply emits every word combination present in the DAG. For the sentence above, the result is: 但/也/并/不是/那么/出乎/出乎意料/意料/或/难以/难以置信/置信.


def calc(self, sentence, DAG, route):
    N = len(sentence)
    route[N] = (0, 0)
    logtotal = log(self.total)
    for idx in xrange(N - 1, -1, -1):
        route[idx] = max((log(self.FREQ.get(sentence[idx:x + 1]) or 1) -
                          logtotal + route[x + 1][0], x) for x in DAG[idx])

The calc function computes the most probable segmentation from word frequencies: the probability of a word a is its frequency divided by the total frequency of all words (handled in log space). The maximum-probability segmentation is found by dynamic programming from right to left: route[i][0] is the log probability of the best segmentation of sentence[i:len], and route[i][1] is the end index of the first word in that best segmentation.
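
To make the recurrence concrete, here is a standalone sketch (Python 3) of the same right-to-left DP, reusing the toy FREQ and DAG from the get_DAG sketch above (all values are made up):

from math import log

# Toy data from the get_DAG sketch above (frequencies are made up).
sentence = "难以置信"
FREQ = {"难": 10, "难以": 20, "难以置": 0, "难以置信": 30,
        "以": 40, "置": 5, "置信": 15, "信": 25}
DAG = {0: [0, 1, 3], 1: [1], 2: [2, 3], 3: [3]}
total = sum(FREQ.values())

# route[i] = (best log-probability of segmenting sentence[i:],
#             end index of the first word of that segmentation)
route = {}
N = len(sentence)
route[N] = (0, 0)
for idx in range(N - 1, -1, -1):
    route[idx] = max((log(FREQ.get(sentence[idx:x + 1]) or 1) - log(total) +
                      route[x + 1][0], x) for x in DAG[idx])

# Walk the route from the left; here the single word "难以置信" wins.
x = 0
while x < N:
    y = route[x][1] + 1
    print(sentence[x:y])
    x = y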


def __cut_DAG_NO_HMM(self, sentence):
    DAG = self.get_DAG(sentence)
    route = {}
    self.calc(sentence, DAG, route)
    x = 0
    N = len(sentence)
    buf = ''
    while x < N:
        y = route[x][1] + 1
        l_word = sentence[x:y]
        if re_eng.match(l_word) and len(l_word) == 1:
            buf += l_word
            x = y
        else:
            if buf:
                yield buf
                buf = ''
            yield l_word
            x = y
    if buf:
        yield buf
        buf = ''

__cut_DAG_NO_HMM is the precise mode without HMM. It first calls calc; the resulting route[x][1] holds the end position of the word that starts at x, and the function then walks through the sentence yielding each word in turn. Consecutive single alphanumeric characters (matched by re_eng) are collected in buf and emitted as one token.

def __cut_DAG(self, sentence):
    DAG = self.get_DAG(sentence)
    route = {}
    self.calc(sentence, DAG, route)
    x = 0
    buf = ''
    N = len(sentence)
    while x < N:
        y = route[x][1] + 1
        l_word = sentence[x:y]
        if y - x == 1:
            buf += l_word
        else:
            if buf:
                if len(buf) == 1:
                    yield buf
                    buf = ''
                else:
                    if not self.FREQ.get(buf):
                        recognized = finalseg.cut(buf)
                        for t in recognized:
                            yield t
                    else:
                        for elem in buf:
                            yield elem
                    buf = ''
            yield l_word
        x = y

    if buf:
        if len(buf) == 1:
            yield buf
        elif not self.FREQ.get(buf):
            recognized = finalseg.cut(buf)
            for t in recognized:
                yield t
        else:
            for elem in buf:
                yield elem

__cut_DAG uses both the maximum-probability path and the HMM. On top of the maximum-probability segmentation computed by the dynamic programming, runs of consecutive single characters (likely fragments of unknown words) are collected into buf; if buf is not itself a dictionary word, finalseg.cut is called to segment it with the HMM. This is how new words outside the dictionary are recognized.
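
The effect of this is easiest to see on a sentence containing a word that is not in dict.txt; the outputs below are indicative and may vary slightly with the dictionary version:

# -*- coding: utf-8 -*-
# Effect of HMM-based new-word recognition on an out-of-vocabulary word:
# "杭研" is not in dict.txt, so without HMM it stays as single characters,
# while with HMM the buffered characters are re-segmented by finalseg.cut.
import jieba

s = "他来到了网易杭研大厦"
print("/".join(jieba.cut(s, HMM=False)))  # e.g. 他/来到/了/网易/杭/研/大厦
print("/".join(jieba.cut(s)))             # e.g. 他/来到/了/网易/杭研/大厦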


Next, finalseg's __init__.py.

def viterbi(obs, states, start_p, trans_p, emit_p):
    V = [{}]  # tabular
    path = {}
    for y in states:  # init
        V[0][y] = start_p[y] + emit_p[y].get(obs[0], MIN_FLOAT)
        path[y] = [y]
    for t in xrange(1, len(obs)):
        V.append({})
        newpath = {}
        for y in states:
            em_p = emit_p[y].get(obs[t], MIN_FLOAT)
            (prob, state) = max(
                [(V[t - 1][y0] + trans_p[y0].get(y, MIN_FLOAT) + em_p, y0) for y0 in PrevStatus[y]])
            V[t][y] = prob
            newpath[y] = path[state] + [y]
        path = newpath

    (prob, state) = max((V[len(obs) - 1][y], y) for y in 'ES')

    return (prob, path[state])
This is the implementation of the Viterbi algorithm for the HMM. Given the model parameters and an observation sequence, the Viterbi algorithm finds the most likely sequence of hidden states. Here the observation sequence is the sentence itself, and the hidden sequence is a sequence over {B, M, E, S}: B marks the beginning of a word, M the middle of a word, E the end of a word, and S a single-character word.
In the function's input parameters, obs is the observation sequence, i.e. the sentence; states is the set of hidden states, i.e. {B, M, E, S}; start_p gives the probability of the first character being in each of these hidden states; trans_p is the state transition matrix, recording the transition probabilities between hidden states, e.g. trans_p['B']['E'] is the probability of moving from state B to state E; emit_p is the emission probability matrix, giving the probability of an observation given a hidden state, e.g. emit_p['B']['\u4e00'] is the probability that the character observed in state B is '\u4e00' (the Chinese character '一'). All probabilities are stored as log values.
V is a list: V[i][j] is the maximum (log) probability, over the sub-observation sequence obs[0..i], of being in hidden state j at position i; this is again a simple DP. PrevStatus restricts which states may precede a given state (for example, B can only be preceded by E or S).
path records the state sequence that achieves each of those maxima.
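
A toy call to the viterbi function above might look like the following; the numbers are made-up log probabilities, and the real parameters are loaded from prob_start.py, prob_trans.py and prob_emit.py. It assumes the module-level constants used by the function, MIN_FLOAT = -3.14e100 and PrevStatus = {'B': 'ES', 'M': 'MB', 'S': 'SE', 'E': 'BM'}, are in scope:

# Toy call to viterbi with made-up log probabilities; only the shapes of the
# arguments matter here. Missing transitions/emissions fall back to MIN_FLOAT.
obs = "你好"
states = 'BMES'
start_p = {'B': -0.3, 'M': -3.14e100, 'E': -3.14e100, 'S': -1.5}
trans_p = {'B': {'E': -0.5, 'M': -1.0},
           'M': {'E': -0.7, 'M': -1.2},
           'E': {'B': -0.6, 'S': -0.9},
           'S': {'B': -0.8, 'S': -1.1}}
emit_p = {'B': {'你': -1.0}, 'M': {}, 'E': {'好': -1.2}, 'S': {'你': -2.0, '好': -2.5}}

prob, path = viterbi(obs, states, start_p, trans_p, emit_p)
print(prob, path)  # best path here is ['B', 'E'], i.e. "你好" as one word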

def __cut(sentence):
    global emit_P
    prob, pos_list = viterbi(sentence, 'BMES', start_P, trans_P, emit_P)
    begin, nexti = 0, 0
    # print pos_list, sentence
    for i, char in enumerate(sentence):
        pos = pos_list[i]
        if pos == 'B':
            begin = i
        elif pos == 'E':
            yield sentence[begin:i + 1]
            nexti = i + 1
        elif pos == 'S':
            yield char
            nexti = i + 1
    if nexti < len(sentence):
        yield sentence[nexti:]
  
After the Viterbi algorithm returns the probability and the best state path, the sentence is cut according to that path: a word is emitted for each B...E span and for each S.
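
finalseg can also be called directly, which segments a sentence using only the HMM, without the dictionary step:

# -*- coding: utf-8 -*-
# Calling the HMM segmenter directly, without the dictionary-based step.
from jieba import finalseg

print("/".join(finalseg.cut("但也并不是那么出乎意料或难以置信")))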


References:

http://www.cnblogs.com/lrysjtu/p/4529325.html

http://blog.csdn.net/daniel_ustc/article/details/48195287
