I have recently been doing a lot of work on an index engine and on data statistics, which means dealing with Python dictionaries constantly, so here are some notes on the relevant API usage and the pitfalls I ran into.
The index engine works with inverted indexes, which map the words contained in a document back to that document. There are not many tricks to the algorithm itself. To improve efficiency the index data can be moved into memory, and here you can take a page from Wang Xianzhi's calligraphy practice: once you have filled up the memory of 18 machines, you have basically succeeded. As a simple example of the basic idea, suppose we have the following documents (already tokenized) and the keywords they contain:
    doc_a: [word_w, word_x, word_y]
    doc_b: [word_x, word_z]
    doc_c: [word_y]
The inverted index transforms this into
    word_w: [doc_a]
    word_x: [doc_a, doc_b]
    word_y: [doc_a, doc_c]
    word_z: [doc_b]
Written in Python, that is
    doc_a = {'id': 'A', 'words': ['word_w', 'word_x', 'word_y']}
    doc_b = {'id': 'B', 'words': ['word_x', 'word_z']}
    doc_c = {'id': 'C', 'words': ['word_y']}
    docs = [doc_a, doc_b, doc_c]

    # Map each word to the ids of the documents that contain it.
    indices = dict()
    for doc in docs:
        for word in doc['words']:
            if word not in indices:
                indices[word] = []
            indices[word].append(doc['id'])
    print(indices)
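To show what such an index is good for, here is a minimal sketch of a lookup against it. The helper name `docs_containing_all` is made up for illustration and is not from the original code; it simply intersects the id lists of the requested words.

```python
# Hypothetical helper, not part of the original post.
def docs_containing_all(indices, words):
    """Return the sorted ids of documents that contain every word in `words`."""
    posting_sets = [set(indices.get(word, [])) for word in words]
    if not posting_sets:
        return []
    return sorted(set.intersection(*posting_sets))

# The inverted index built from doc_a, doc_b and doc_c.
indices = {
    'word_w': ['A'],
    'word_x': ['A', 'B'],
    'word_y': ['A', 'C'],
    'word_z': ['B'],
}

both = docs_containing_all(indices, ['word_x', 'word_y'])
missing = docs_containing_all(indices, ['word_x', 'no_such_word'])
print(both)     # ['A']
print(missing)  # []
```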
There is a small trick available in the indexing loop, though. The branch that checks whether the current word is already in the index dictionary,

    if word not in indices:
        indices[word] = []

can be replaced by dict's setdefault(key, default=None) method. If key is in the dictionary, the method returns the corresponding value; otherwise it inserts key with default as its value and returns that default. From a design standpoint, though, I do not understand why the default is None, which does not seem very useful. If you actually use this method you will almost always pass a default, as follows
    for doc in docs:
        for word in doc['words']:
            indices.setdefault(word, []).append(doc['id'])
This eliminates the branch, and the code is noticeably shorter.
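The behaviour of setdefault is easy to check in isolation. The sketch below uses a throwaway dictionary and made-up keys to cover the three cases: a missing key, an existing key, and the default omitted.

```python
indices = {}

# Key is missing: the default is inserted and that same list is returned.
first = indices.setdefault('word_x', [])
first.append('A')

# Key is present: the stored list is returned and the new default is ignored.
second = indices.setdefault('word_x', ['unused'])
second.append('B')

# Omitting the default stores None, which is rarely what an index wants.
nothing = indices.setdefault('word_q')

print(indices)  # {'word_x': ['A', 'B'], 'word_q': None}
```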
In some cases, however, setdefault is awkward to use: when the default value is expensive to construct, when producing the default value has side effects, and in one more case that comes up later. The first two boil down to one thing: setdefault does not suit scenarios where the default needs lazy evaluation, because the default argument is evaluated on every call whether or not the key is missing. To accommodate that need, setdefault could have been designed as
    def setdefault(self, key, default_factory):
        # Only build the default when the key is actually missing.
        if key not in self:
            self[key] = default_factory()
        return self[key]
If that were the case, the code above would become
    for doc in docs:
        for word in doc['words']:
            indices.setdefault(word, list).append(doc['id'])
But there is actually another alternative, which is covered at the end.
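The difference between eager and lazy defaults can be made visible with a small experiment. In this sketch, `make_default` is a made-up stand-in for an expensive or side-effecting constructor, and `setdefault_lazy` is a free-function version of the hypothetical design above (the real dict.setdefault does not accept a factory).

```python
calls = []

def make_default():
    """Stand-in for an expensive or side-effecting constructor (hypothetical)."""
    calls.append(1)
    return []

def setdefault_lazy(mapping, key, default_factory):
    """Free-function version of the hypothetical factory-based design."""
    if key not in mapping:
        mapping[key] = default_factory()
    return mapping[key]

eager = {}
for doc_id in ['A', 'B', 'C']:
    # make_default() runs on every iteration, even when the key exists.
    eager.setdefault('word_x', make_default()).append(doc_id)
eager_calls = len(calls)

del calls[:]
lazy = {}
for doc_id in ['A', 'B', 'C']:
    # make_default is only called when the key is missing.
    setdefault_lazy(lazy, 'word_x', make_default).append(doc_id)
lazy_calls = len(calls)

print(eager_calls, lazy_calls)  # 3 1
```

Both loops build the same dictionary; only the number of constructor calls differs.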
If the lack of lazy evaluation is merely an API flaw you can foresee but may never actually run into, the next one is more of a slap in the face.
Now consider counting word frequencies, that is, how many times each word appears in an article. Written directly with dict, it looks roughly like this
    def word_count(words):
        count = dict()
        for word in words:
            count.setdefault(word, 0) += 1
        return count

    print(word_count(['Hiiragi', 'Kagami', 'Hiiragi', 'Tukasa', 'Yosimizu', 'Kagami']))
When you eagerly run the code above, Python throws a SyntaxError in your face before you can blink, because count.setdefault(word, 0), which appears on the left of the += operator, is a function call and not a valid assignment target (an lvalue) in Python. So, starting to miss the C++ type system now?
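The rejection happens at compile time, which can be confirmed without crashing the script by passing the offending statement to the built-in compile(). The second half of this sketch is one possible plain-dict workaround using dict.get; it is an illustration, not the approach the post ends up recommending.

```python
source = "count = {}\ncount.setdefault('Hiiragi', 0) += 1\n"

try:
    compile(source, '<example>', 'exec')
    rejected = False
except SyntaxError:
    # A call expression is not a valid target for augmented assignment.
    rejected = True

def word_count(words):
    """Plain-dict variant that avoids assigning to a call."""
    count = {}
    for word in words:
        count[word] = count.get(word, 0) + 1
    return count

result = word_count(['Hiiragi', 'Kagami', 'Hiiragi', 'Tukasa', 'Yosimizu', 'Kagami'])
print(rejected)  # True
print(result)    # {'Hiiragi': 2, 'Kagami': 2, 'Tukasa': 1, 'Yosimizu': 1}
```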
Just because Python makes the literal {} equivalent to dict() does not mean that dict is a silver bullet. Python has plenty of data structures, and for counting problems the ideal solution is the collections.defaultdict class. The following code should be clear at a glance
    from collections import defaultdict

    doc_a = {'id': 'A', 'words': ['word_w', 'word_x', 'word_y']}
    doc_b = {'id': 'B', 'words': ['word_x', 'word_z']}
    doc_c = {'id': 'C', 'words': ['word_y']}
    docs = [doc_a, doc_b, doc_c]

    # Missing keys are created on first access by calling list().
    indices = defaultdict(list)
    for doc in docs:
        for word in doc['words']:
            indices[word].append(doc['id'])
    print(indices)

    def word_count(words):
        # Missing keys start at int(), which is 0.
        count = defaultdict(int)
        for word in words:
            count[word] += 1
        return count

    print(word_count(['Hiiragi', 'Kagami', 'Hiiragi', 'Tukasa', 'Yosimizu', 'Kagami']))
This solves all of the earlier problems.
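That defaultdict covers the lazy-evaluation concern as well can be checked with a factory that records its own calls. The sketch below uses a made-up `make_list` factory and also shows one caveat worth knowing: indexing a missing key inserts it, while `in` and get() do not.

```python
from collections import defaultdict

calls = []

def make_list():
    """Hypothetical factory that records each time it is invoked."""
    calls.append(1)
    return []

indices = defaultdict(make_list)
for doc_id in ['A', 'B', 'C']:
    indices['word_x'].append(doc_id)  # factory runs on the first access only

factory_calls = len(calls)

# Caveat: indexing a missing key inserts it; `in` and .get() do not.
present_before = 'word_q' in indices
got = indices.get('word_q')
_ = indices['word_q']
present_after = 'word_q' in indices

print(factory_calls, present_before, got, present_after)  # 1 False None True
```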
There is also a Counter class in collections, which can be roughly thought of as an extension of defaultdict(int).
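For comparison, here is a minimal sketch of the same word count with Counter. The word 'Konata' is a made-up key that is not in the list, used only to show how missing keys behave.

```python
from collections import Counter

words = ['Hiiragi', 'Kagami', 'Hiiragi', 'Tukasa', 'Yosimizu', 'Kagami']
count = Counter(words)

hiiragi = count['Hiiragi']
absent = count['Konata']  # missing keys read as 0 and are not inserted
top_two = count.most_common(2)

print(hiiragi, absent, 'Konata' in count)  # 2 0 False
print(top_two)  # [('Hiiragi', 2), ('Kagami', 2)]
```

Unlike defaultdict(int), reading a missing key from a Counter does not add it to the mapping.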