This article shows how to use Python dictionaries for indexing and counting. Dictionaries are basic Python knowledge, and this article is a small exercise built around them; readers interested in index engines and data statistics may find it useful. I work with Python dictionaries all the time, so I have written up these notes on the API for future reference.
The basic working principle of an indexing engine is the inverted index, which maps each word contained in a document back to the documents that contain it. There is nothing fancy about the algorithm. To improve efficiency, as much of the index data as possible is kept in memory. It is a bit like the story of Wang Xianzhi practicing calligraphy: once the memory of all 18 machines has been filled, you have basically succeeded. The basic idea is as follows:
    doc_a: [word_w, word_x, word_y]
    doc_b: [word_x, word_z]
    doc_c: [word_y]

This is converted into:

    word_w -> [doc_a]
    word_x -> [doc_a, doc_b]
    word_y -> [doc_a, doc_c]
    word_z -> [doc_b]

Written in Python, it looks like this:
    doc_a = {'id': 'a', 'words': ['word_w', 'word_x', 'word_y']}
    doc_b = {'id': 'b', 'words': ['word_x', 'word_z']}
    doc_c = {'id': 'c', 'words': ['word_y']}

    docs = [doc_a, doc_b, doc_c]

    # Map each word to the ids of the documents that contain it.
    indices = dict()
    for doc in docs:
        for word in doc['words']:
            if word not in indices:
                indices[word] = []
            indices[word].append(doc['id'])

    print(indices)
There is a small trick here, though. The branch that checks whether the current word is already in the index dictionary,

    if word not in indices:
        indices[word] = []

can be replaced with dict's setdefault(key, default=None) method. If the key is in the dictionary, setdefault returns the corresponding value; otherwise it inserts the key with the value default and returns that. From a design point of view, though, I don't see why default itself defaults to None, which does not seem very useful. In practice you always end up passing your own default, as shown below:

    for doc in docs:
        for word in doc['words']:
            indices.setdefault(word, []).append(doc['id'])

This removes the branch, and the code looks much shorter.
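The behavior of setdefault in each case can be checked with a small sketch. The dictionary name index and the keys used here are made up for illustration only.

```python
index = {}

# Key is missing: the default is stored, and that same list object is returned.
first = index.setdefault('word_x', [])
first.append('a')

# Key is present: the stored value is returned and the new default is ignored.
second = index.setdefault('word_x', ['ignored'])
second.append('b')

# With no explicit default, a missing key is created with the value None.
nothing = index.setdefault('word_q')

print(first is second)
print(index)
```

Because the returned object is the one stored in the dictionary, appending to it updates the index in place, which is what makes the one-line form work.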
setdefault is awkward in some cases, however: when constructing the default value is expensive, when constructing it has side effects, and in one more case described later. In the first two cases, the problem is that setdefault cannot evaluate the default lazily: the argument is built on every call, whether or not the key already exists. To meet that need, setdefault could have been designed like this:

    def setdefault(self, key, default_factory):
        if key not in self:
            self[key] = default_factory()
        return self[key]

If it had been, the code above would become:

    for doc in docs:
        for word in doc['words']:
            indices.setdefault(word, list).append(doc['id'])
But there are actually other alternatives, which will be mentioned at the end.
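The eager evaluation of the default argument can be observed directly with the real dict.setdefault. In this sketch, expensive_default is a hypothetical stand-in for a costly or side-effecting constructor, and calls is only there to record how often it runs.

```python
calls = []

def expensive_default():
    # Hypothetical stand-in for a costly or side-effecting constructor.
    calls.append('built')
    return []

indices = {}
for word in ['word_x', 'word_x', 'word_x']:
    # The argument expression is evaluated before setdefault is called,
    # so the constructor runs on every iteration, not only for new keys.
    indices.setdefault(word, expensive_default()).append('doc')

print(len(calls))
print(indices)
```

Only one key is ever created, yet the constructor runs once per iteration; the extra lists are simply thrown away.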
If the above is only an API flaw that you can foresee but may never actually run into, the next one is more of a slap in the face.
Consider word frequency statistics, that is, counting how many times each word appears in an article. Written directly with dict, it is roughly:
    def word_count(words):
        count = dict()
        for word in words:
            count.setdefault(word, 0) += 1
        return count

    print(word_count(['hiiragi', 'kagami', 'hiiragi', 'tukasa', 'yosimizu', 'kagami']))
When you run the code above with great enthusiasm, the interpreter immediately throws a SyntaxError in your face, because count.setdefault(word, 0), which appears on the left of the += operator, is not an lvalue in Python. Time to reminisce about the type systems of the C family.
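To see that the failure happens at compile time rather than at run time, the offending statement can be passed to the built-in compile. The sketch below also shows one plain-dict workaround based on dict.get; it is only one possible fix, not necessarily the one the author had in mind.

```python
# Compiling the statement alone is enough to trigger the error;
# nothing is executed.
try:
    compile("count.setdefault(word, 0) += 1", "<example>", "exec")
    error = None
except SyntaxError as exc:
    error = exc

print(type(error).__name__)

def word_count(words):
    count = {}
    for word in words:
        # The assignment target is a subscript, which is a valid target;
        # dict.get only supplies the current value (0 if absent).
        count[word] = count.get(word, 0) + 1
    return count

print(word_count(['hiiragi', 'kagami', 'hiiragi', 'tukasa', 'yosimizu', 'kagami']))
```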
Python makes the literal {} equivalent to dict(), which can give the impression that dict is a silver bullet. It is not: Python has many other data structures, and for the counting problem the ideal solution is the collections.defaultdict class. The following code should be clear at a glance:
    from collections import defaultdict

    doc_a = {'id': 'a', 'words': ['word_w', 'word_x', 'word_y']}
    doc_b = {'id': 'b', 'words': ['word_x', 'word_z']}
    doc_c = {'id': 'c', 'words': ['word_y']}

    docs = [doc_a, doc_b, doc_c]

    # Missing words start out as an empty list.
    indices = defaultdict(list)
    for doc in docs:
        for word in doc['words']:
            indices[word].append(doc['id'])

    print(indices)

    def word_count(words):
        # Missing words start out as int(), which is 0.
        count = defaultdict(int)
        for word in words:
            count[word] += 1
        return count

    print(word_count(['hiiragi', 'kagami', 'hiiragi', 'tukasa', 'yosimizu', 'kagami']))
This solves all of the problems described earlier.
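The reason defaultdict also covers the expensive-default and side-effect cases is that it takes a factory and calls it only when a key is missing. A minimal sketch, where make_list is a hypothetical factory that records each call:

```python
from collections import defaultdict

calls = []

def make_list():
    # Hypothetical factory that records every invocation.
    calls.append('built')
    return []

indices = defaultdict(make_list)
for doc_id in ['a', 'b', 'c']:
    indices['word_x'].append(doc_id)

# The factory ran only for the first lookup of 'word_x'.
calls_after_loop = len(calls)
print(calls_after_loop)

# Caveat: indexing a missing key inserts it, even when only reading.
_ = indices['word_missing']
print('word_missing' in indices)

# get() and membership tests do not insert anything.
print(indices.get('word_other'))
print('word_other' in indices)
```

One thing to keep in mind with this design is the caveat shown above: a plain read with square brackets creates the key, so use get() or in when a lookup should not modify the index.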
In addition, collections has a Counter class, which can be roughly thought of as an extension of defaultdict(int).
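As a brief sketch of how Counter handles the same word list, assuming nothing beyond the standard library:

```python
from collections import Counter

words = ['hiiragi', 'kagami', 'hiiragi', 'tukasa', 'yosimizu', 'kagami']

# Counter counts the items of any iterable in one step.
count = Counter(words)

print(count['hiiragi'])

# A missing key reads as 0 and, unlike defaultdict, is not inserted.
print(count['nobody'])
print('nobody' in count)

# most_common(n) returns the n highest counts as (item, count) pairs.
print(count.most_common(2))
```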