How to use dictionaries in Python for indexing and statistics

Source: Internet
Author: User
This article introduces how to use dictionaries in Python for index building and statistics. Dictionaries are part of the basics of learning Python, and this article is a small related exercise for readers interested in index engines and data statistics. I work with Python dictionaries frequently, so I have written up these notes on API usage for future reference.

The basic working principle of an index engine is the inverted index, which reverses the relationship between a document and the text it contains: each word is mapped back to the documents that contain it. There is nothing fancy about the algorithm. To improve efficiency, keep as much of the index data in memory as possible. The approach is a bit like the story of Wang Xianzhi learning calligraphy: once the memory of 18 machines has been filled, success is more or less assured. The basic idea is as follows:

  doc_a: [word_w, word_x, word_y]
  doc_b: [word_x, word_z]
  doc_c: [word_y]

This is converted into:

  word_w -> [doc_a]
  word_x -> [doc_a, doc_b]
  word_y -> [doc_a, doc_c]
  word_z -> [doc_b]

Written as Python code:

  doc_a = {'id': 'a', 'words': ['word_w', 'word_x', 'word_y']}
  doc_b = {'id': 'b', 'words': ['word_x', 'word_z']}
  doc_c = {'id': 'c', 'words': ['word_y']}

  docs = [doc_a, doc_b, doc_c]
  indices = dict()

  # Invert the mapping: for each word, collect the ids of the documents containing it.
  for doc in docs:
      for word in doc['words']:
          if word not in indices:
              indices[word] = []
          indices[word].append(doc['id'])

  print(indices)

There is one small detail here, though: the branch that checks whether the current word is already in the index dictionary.

  if word not in indices:
      indices[word] = []

This can be replaced by dict's setdefault(key, default=None) method. If the key is in the dictionary, the method returns the corresponding value; otherwise it inserts the key with the value default and returns that value. From a design point of view, though, I do not understand why default defaults to None, which does not seem very useful. In practice, you will almost always pass your own default value when using this method, as shown below:

  for doc in docs:
      for word in doc['words']:
          indices.setdefault(word, []).append(doc['id'])

This eliminates the branch, and the code is noticeably shorter.
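To make the behaviour of setdefault concrete, here is a minimal sketch. The dictionary name seen and its keys are made up for illustration and do not come from the index example.

```python
# Illustrative data only; `seen` is a hypothetical dictionary.
seen = {'word_x': ['a']}

# Key already present: the stored value is returned and the default is ignored.
existing = seen.setdefault('word_x', [])
existing.append('b')

# Key missing: the default is stored under the key and then returned,
# so appending to the result also updates the dictionary.
created = seen.setdefault('word_z', [])
created.append('b')

# Omitting the default stores None, which is rarely what you want.
nothing = seen.setdefault('word_q')

print(seen)
```

Because the returned object is the one stored in the dictionary, chaining .append() onto the call works for both the existing and the newly created list.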
However, setdefault is awkward in some cases: when the default value is expensive to construct, when constructing it has side effects, and in one more case that is mentioned later. The first two cases share the same problem: setdefault cannot evaluate the default lazily, because the argument is always constructed, even when the key already exists. To meet this requirement, setdefault could have been designed like this:

  def setdefault(self, key, default_factory):
      if key not in self:
          self[key] = default_factory()
      return self[key]

If it were, the code above would become:

  for doc in docs:
      for word in doc['words']:
          indices.setdefault(word, list).append(doc['id'])

But there are actually other alternatives, which will be mentioned at the end.
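The eager evaluation can be observed directly. The sketch below is only an illustration: expensive_default and setdefault_lazy are hypothetical names, and setdefault_lazy is a free function written for this example, not a method that dict provides.

```python
calls = []

def expensive_default():
    # Stand-in for a costly constructor; records each call as a side effect.
    calls.append(1)
    return []

def setdefault_lazy(mapping, key, default_factory):
    # Hypothetical helper, not part of dict: build the default only when needed.
    if key not in mapping:
        mapping[key] = default_factory()
    return mapping[key]

words = ['word_x', 'word_x', 'word_x']

eager = {}
for word in words:
    # The argument is evaluated before setdefault runs, so
    # expensive_default() is called on every iteration.
    eager.setdefault(word, expensive_default()).append('a')
eager_calls = len(calls)

del calls[:]  # reset the call log
lazy = {}
for word in words:
    # The factory itself is passed and is called only when the key is missing.
    setdefault_lazy(lazy, word, expensive_default).append('a')
lazy_calls = len(calls)

print(eager_calls, lazy_calls)
```

Both loops build the same dictionary; the difference is only in how many times the default is constructed.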

If the above is merely an API shortcoming that one can foresee but may never actually run into, the next one is a slap in the face.
Consider word frequency statistics: counting how many times each word appears in an article. Written directly with dict, it is roughly:

  def word_count(words):
      count = dict()
      for word in words:
          count.setdefault(word, 0) += 1  # SyntaxError: not a valid assignment target
      return count

  print(word_count(['hiiragi', 'kagami', 'hiiragi', 'tukasa', 'yosimizu', 'kagami']))

When you eagerly run the code above, the interpreter throws an error right in your face, because count.setdefault(word, 0), which appears on the left of the += operator, is not an lvalue in Python: a function call cannot be the target of an assignment. At this point one starts to appreciate the C++ type system.
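The rejection can be checked without breaking the rest of a script by compiling the offending line from a string. The sketch below also shows one plain-dict workaround based on dict.get; this is an illustration added here, not the solution the article goes on to recommend.

```python
# The failing statement is kept as a string so that this script still compiles.
bad_line = "count.setdefault(word, 0) += 1"

try:
    compile(bad_line, "<example>", "exec")
    rejected = False
except SyntaxError:
    # A function call is not a valid target for augmented assignment.
    rejected = True

def word_count(words):
    count = dict()
    for word in words:
        # Read with a fallback of 0, then assign back through a subscript,
        # which is a valid assignment target.
        count[word] = count.get(word, 0) + 1
    return count

print(rejected)
print(word_count(['hiiragi', 'kagami', 'hiiragi', 'tukasa', 'yosimizu', 'kagami']))
```

Note that the error is raised when the code is compiled, so a file containing the original statement fails before any of it runs.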

Do not assume that dict is a silver bullet just because Python makes the literal {} equivalent to dict(). Python offers many other data structures, and for the statistics problems above the ideal solution is the collections.defaultdict class. The following code should be clear at a glance:

  from collections import defaultdict

  doc_a = {'id': 'a', 'words': ['word_w', 'word_x', 'word_y']}
  doc_b = {'id': 'b', 'words': ['word_x', 'word_z']}
  doc_c = {'id': 'c', 'words': ['word_y']}

  docs = [doc_a, doc_b, doc_c]
  # Missing keys are created on first access by calling list().
  indices = defaultdict(list)

  for doc in docs:
      for word in doc['words']:
          indices[word].append(doc['id'])

  print(indices)

  def word_count(words):
      # Missing keys start at int(), which is 0.
      count = defaultdict(int)
      for word in words:
          count[word] += 1
      return count

  print(word_count(['hiiragi', 'kagami', 'hiiragi', 'tukasa', 'yosimizu', 'kagami']))

This solves all of the problems described above.
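defaultdict also addresses the lazy-default concern raised earlier, since it calls its factory only when a key is missing. The sketch below illustrates this with a hypothetical factory named make_list that records its calls; it also shows a caveat worth knowing, namely that reading a missing key with [] inserts it.

```python
from collections import defaultdict

calls = []

def make_list():
    # Hypothetical factory that records each call as a side effect.
    calls.append(1)
    return []

indices = defaultdict(make_list)
for doc_id in ['a', 'b', 'c']:
    indices['word_x'].append(doc_id)

# The factory ran only for the first lookup, when the key was missing.
factory_calls = len(calls)

# Caveat: merely reading a missing key with [] inserts it.
_ = indices['word_missing']
has_missing = 'word_missing' in indices

# Membership tests and .get() do not trigger the factory.
absent = indices.get('word_other')
has_other = 'word_other' in indices

print(factory_calls, has_missing, absent, has_other)
```

When only a lookup is intended, using in or .get() avoids filling the dictionary with empty entries.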

In addition, collections also provides Counter, which can be roughly regarded as an extension of defaultdict(int).
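As a minimal sketch of what Counter adds, the word list from the earlier example can be counted in one call. The key 'konata' is made up here to show how a missing key behaves.

```python
from collections import Counter

words = ['hiiragi', 'kagami', 'hiiragi', 'tukasa', 'yosimizu', 'kagami']

# Counting happens in the constructor; no explicit loop is needed.
count = Counter(words)

# A missing key reads as 0 and, unlike with defaultdict, is not inserted.
missing = count['konata']

# most_common(n) returns the n highest counts as (element, count) pairs.
top_two = count.most_common(2)

print(count['hiiragi'], missing, top_two)
```

For plain word-frequency counting this is usually shorter than defaultdict(int), and most_common covers the typical follow-up question of which words occur most often.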
