How to use the dictionary in Python to process index statistics

Last Update:2018-04-26 Source: Internet

Author: User

This article shows how to use Python dictionaries to build an index and to compute statistics. The dictionary is basic Python knowledge, and this article is a small exercise in applying it; readers interested in index engines and data statistics may find it useful. I have been working with Python dictionaries a lot recently, so I have written up these notes on the API for the record.

The basic working principle of an indexing engine is the inverted index: the mapping from each document to the words it contains is turned around into a mapping from each word to the documents that contain it. There is not much to the algorithm itself. To improve efficiency, as much of the index data as possible is kept in memory. Borrowing from the story of Wang Xianzhi learning calligraphy: once the memory of 18 machines has been filled, success is basically assured. The basic idea is as follows:

  doc_a: [word_w, word_x, word_y]
  doc_b: [word_x, word_z]
  doc_c: [word_y]

This is converted into:

  word_w -> [doc_a]
  word_x -> [doc_a, doc_b]
  word_y -> [doc_a, doc_c]
  word_z -> [doc_b]

Written as Python code:

doc_a = {'id': 'a', 'words': ['word_w', 'word_x', 'word_y']}
doc_b = {'id': 'b', 'words': ['word_x', 'word_z']}
doc_c = {'id': 'c', 'words': ['word_y']}

docs = [doc_a, doc_b, doc_c]
indices = dict()

for doc in docs:
    for word in doc['words']:
        # first time this word is seen: start an empty list of document ids
        if word not in indices:
            indices[word] = []
        indices[word].append(doc['id'])

print(indices)

There is one small wrinkle here, though: the branch that checks whether the current word is already in the index dictionary.

if word not in indices:
    indices[word] = []

It can be replaced by dict's setdefault(key, default=None) method. If the key is in the dictionary, the method returns the corresponding value; otherwise it inserts the key with default as its value and returns that value. From a design point of view, though, I do not understand why default itself defaults to None, which does not seem very useful. Anyone who actually uses this method will almost always pass their own default, as shown below:

for doc in docs:
    for word in doc['words']:
        indices.setdefault(word, []).append(doc['id'])

This eliminates the branch, and the code is noticeably shorter.
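
To make the behaviour of setdefault concrete, here is a small sketch using a throwaway dictionary (the names d, value, same and none_value are illustrative only). It suggests that the method returns the stored object itself, so the returned list can be mutated in place, and that leaving out default stores None under the key.

```python
# Illustrative only: a throwaway dict to show what setdefault returns.
d = {}

# Key missing: the default is inserted and returned.
value = d.setdefault('word_x', [])
value.append('a')   # mutates the list stored in d

# Key present: the existing list is returned, the new default is ignored.
same = d.setdefault('word_x', ['ignored'])
same.append('b')

# Omitting default stores None under the key.
none_value = d.setdefault('word_z')

print(d)
```
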
However, in some situations setdefault is awkward to use: when constructing the default value is expensive, when constructing it has side effects, and in a third case that comes up later. In the first two cases the problem is that the default needs to be evaluated lazily, but the argument passed to setdefault is evaluated on every call, whether or not the key already exists. To meet this requirement, setdefault could have been designed like this:

def setdefault(self, key, default_factory):
    if key not in self:
        self[key] = default_factory()
    return self[key]

In that case, the code above would become:

for doc in docs:
    for word in doc['words']:
        indices.setdefault(word, list).append(doc['id'])
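
The eager evaluation that makes the real setdefault awkward is easy to observe. The sketch below uses a hypothetical make_default function that records each call, plus a hypothetical helper setdefault_lazy that mirrors the factory-based design discussed above as a plain function. Neither is part of dict; both exist only for this illustration.

```python
calls = []

def make_default():
    """Hypothetical 'expensive' default; records every invocation."""
    calls.append('called')
    return []

def setdefault_lazy(d, key, default_factory):
    """Hypothetical helper: build the default only if the key is missing."""
    if key not in d:
        d[key] = default_factory()
    return d[key]

eager = {}
for word in ['word_x', 'word_x', 'word_x']:
    # The argument is evaluated before setdefault runs, on every iteration.
    eager.setdefault(word, make_default()).append(1)
eager_calls = len(calls)

del calls[:]  # reset the call log
lazy = {}
for word in ['word_x', 'word_x', 'word_x']:
    # The factory is passed uncalled and runs only for the first occurrence.
    setdefault_lazy(lazy, word, make_default).append(1)
lazy_calls = len(calls)

print('eager: %d, lazy: %d' % (eager_calls, lazy_calls))
```

Both dictionaries end up with the same contents; only the number of times the default is constructed differs.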

There are other alternatives, though, which are covered at the end.

If the above is only an API flaw that can be foreseen but may never actually be encountered, the next one hits you right in the face.
Consider word-frequency statistics, that is, counting how many times each word appears in an article. Written directly with dict, it looks roughly like this:

def word_count(words):
    count = dict()
    for word in words:
        count.setdefault(word, 0) += 1
    return count

print(word_count(['hiiragi', 'kagami', 'hiiragi', 'tukasa', 'yosimizu', 'kagami']))

When you run this code full of enthusiasm, the interpreter immediately throws an error in your face, because count.setdefault(word, 0), which appears on the left of the += operator, is not an lvalue in Python. At this point, let us all praise the C++ type system.
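
One way to confirm that the line is rejected before anything runs is to hand it to the built-in compile function. For anyone who wants to stay with a plain dict, dict.get gives a working version. The following is a minimal sketch; the function name word_count_plain is made up for this example.

```python
# The augmented assignment to a call is rejected when the code is compiled.
try:
    compile("count.setdefault(word, 0) += 1", "<example>", "exec")
    rejected = False
except SyntaxError:
    rejected = True

def word_count_plain(words):
    """Count words with a plain dict, using get() instead of setdefault()."""
    count = {}
    for word in words:
        # Read the current count (0 if absent) and assign it back by key.
        count[word] = count.get(word, 0) + 1
    return count

result = word_count_plain(
    ['hiiragi', 'kagami', 'hiiragi', 'tukasa', 'yosimizu', 'kagami'])
print(rejected)
print(result)
```
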

Because Python makes the default literal {} equivalent to dict(), it is easy to get the impression that dict is a silver bullet. Python has many other data structures, however, and for this statistics problem the ideal solution is the collections.defaultdict class. The following code should be clear at a glance:

from collections import defaultdict

doc_a = {'id': 'a', 'words': ['word_w', 'word_x', 'word_y']}
doc_b = {'id': 'b', 'words': ['word_x', 'word_z']}
doc_c = {'id': 'c', 'words': ['word_y']}

docs = [doc_a, doc_b, doc_c]
indices = defaultdict(list)  # missing words start with an empty list

for doc in docs:
    for word in doc['words']:
        indices[word].append(doc['id'])

print(indices)

def word_count(words):
    count = defaultdict(int)  # missing words start at 0
    for word in words:
        count[word] += 1
    return count

print(word_count(['hiiragi', 'kagami', 'hiiragi', 'tukasa', 'yosimizu', 'kagami']))

This solves all of the problems described above.
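
One detail worth knowing about defaultdict: the factory is called only when a missing key is looked up with the [] syntax. That gives the lazy behaviour wished for earlier, but it also means that merely reading a missing key inserts it. A small sketch, with illustrative key and variable names:

```python
from collections import defaultdict

indices = defaultdict(list)
indices['word_x'].append('a')   # factory runs once, creating the list
indices['word_x'].append('b')   # key exists, factory is not called

# Reading a missing key with [] inserts it as a side effect...
before = len(indices)
_ = indices['word_q']
after = len(indices)

# ...whereas `in` and get() leave the dictionary untouched.
has_r = 'word_r' in indices
got_r = indices.get('word_r')
final_size = len(indices)

print(dict(indices))
```
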

In addition, collections also provides a Counter class, which can roughly be thought of as an extension of defaultdict(int).
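
As a rough sketch of what Counter adds on top of defaultdict(int), the same word list can be counted in a single call, and most_common returns the entries ordered by frequency. The variable names and the extra word 'konata' are illustrative only.

```python
from collections import Counter

words = ['hiiragi', 'kagami', 'hiiragi', 'tukasa', 'yosimizu', 'kagami']

count = Counter(words)        # counts every element of the iterable
count.update(['kagami'])      # further occurrences can be added later

top = count.most_common(1)    # list of (word, count), most frequent first
missing = count['konata']     # a missing key reads as 0 and is not inserted

print(top)
```

Unlike defaultdict(int), reading a missing key from a Counter does not add it to the mapping.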

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
