How to handle index statistics with a dictionary in Python

Last Update:2016-06-06 Source: Internet

Author: User


I have recently been doing a lot of work on an index engine and on data statistics, which means dealing with Python dictionaries all the time, so here are some notes on the relevant API usage and the pitfalls I ran into.

An indexing engine works on an inverted index, that is, a mapping from the text contained in the documents back to the documents that contain it. There are not many tricks to the algorithm itself. To improve efficiency, the index data can be moved into memory; in the spirit of the story of Wang Xianzhi practicing calligraphy, once the memory of 18 machines has been filled, the job is basically done. The basic idea is easiest to see with a simple example. Suppose we have the following documents (word segmentation already done) and the keywords each one contains:

  doc_a: [word_w, word_x, word_y]
  doc_b: [word_x, word_z]
  doc_c: [word_y]

Transform it to

  word_w: [doc_a]
  word_x: [doc_a, doc_b]
  word_y: [doc_a, doc_c]
  word_z: [doc_b]

Written in Python, this is

    doc_a = {'id': 'a', 'words': ['word_w', 'word_x', 'word_y']}
    doc_b = {'id': 'b', 'words': ['word_x', 'word_z']}
    doc_c = {'id': 'c', 'words': ['word_y']}

    docs = [doc_a, doc_b, doc_c]
    indices = dict()
    for doc in docs:
        for word in doc['words']:
            # create the list of document ids the first time a word is seen
            if word not in indices:
                indices[word] = []
            indices[word].append(doc['id'])

    print(indices)

There is a small trick here, though. The branch that checks whether the current word is already in the index dictionary,

    if word not in indices:
        indices[word] = []

can be replaced with the dict method setdefault(key, default=None). If key is in the dictionary, the method returns the corresponding value; otherwise it creates the key, sets it to the default value, and returns that value. From a design standpoint, though, I do not understand why the default is None, which does not seem very useful. Anyone who actually uses this interface will generally pass a default value, as follows

    for doc in docs:
        for word in doc['words']:
            indices.setdefault(word, []).append(doc['id'])

This removes the branch and makes the code noticeably shorter.
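
The behaviour of setdefault described above can be checked with a small sketch. The key names here are only illustrative; the point is what the method returns for a missing key, for an existing key, and when no default is passed.

```python
# Sketch: how dict.setdefault behaves for missing and existing keys.
index = {}

# Missing key: the default is stored in the dict and returned.
postings = index.setdefault('word_x', [])
postings.append('a')

# Existing key: the stored list is returned; the new [] is discarded.
same_postings = index.setdefault('word_x', [])
same_postings.append('b')

# Without an explicit default, None is stored for a missing key.
nothing = index.setdefault('word_q')

print(index)
```

Because the returned list is the very object stored in the dictionary, appending to it updates the index in place, which is what makes the one-line form work.
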
In some cases, however, setdefault is awkward to use: when the default value is expensive to construct, when producing the default value has side effects, and in one further case described below. The first two cases come down to the same thing: setdefault is not suited to scenarios where the default value needs lazy evaluation, because the default argument is evaluated on every call, whether or not the key exists. To cover that need, setdefault could have been designed as

    def setdefault(self, key, default_factory):
        if key not in self:
            self[key] = default_factory()
        return self[key]

If that were the case, the code above would become

    for doc in docs:
        for word in doc['words']:
            indices.setdefault(word, list).append(doc['id'])

But there are actually other alternatives, which will be mentioned at the end.
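
The lazy-evaluation concern can be made concrete with a small sketch. The names make_default and setdefault_lazy are hypothetical, introduced only for this illustration; setdefault_lazy is a free-function take on the factory-based design discussed here, not something dict provides.

```python
# Sketch: the default passed to setdefault is built on every call,
# even when the key already exists. make_default and setdefault_lazy
# are hypothetical names used only for this illustration.
calls = []

def make_default():
    """Stand-in for an expensive or side-effecting constructor."""
    calls.append(1)
    return []

def setdefault_lazy(mapping, key, default_factory):
    """Call default_factory only when key is missing."""
    if key not in mapping:
        mapping[key] = default_factory()
    return mapping[key]

eager = {}
for doc_id in ['a', 'b', 'c']:
    eager.setdefault('word_x', make_default()).append(doc_id)
eager_calls = len(calls)   # the factory ran once per loop iteration

del calls[:]
lazy = {}
for doc_id in ['a', 'b', 'c']:
    setdefault_lazy(lazy, 'word_x', make_default).append(doc_id)
lazy_calls = len(calls)    # the factory ran only for the first, missing key

print(eager_calls, lazy_calls)
```

Both loops build the same index; the difference is only in how often the default is constructed, which matters once construction is costly or has side effects.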

If the above is merely an API flaw that you can foresee but may never actually run into, the next one is more of a slap in the face.
Consider word frequency statistics, that is, counting how many times each term appears in an article. Written directly with dict, it would look roughly like

    def word_count(words):
        count = dict()
        for word in words:
            count.setdefault(word, 0) += 1
        return count

    print(word_count(['Hiiragi', 'Kagami', 'Hiiragi', 'Tukasa', 'Yosimizu', 'Kagami']))

When you eagerly run the code above, Python throws an error right in your face: the interpreter rejects it with a SyntaxError, because count.setdefault(word, 0), which appears on the left of the += operator, is a function call, and a function call is not a valid assignment target (an lvalue) in Python. Starting to miss the C++ type system yet?
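
A minimal sketch of the failure and of one possible workaround that stays with a plain dict. The workaround with dict.get is an assumption on my part rather than something the original code uses; it works because a subscription such as count[word] is a valid assignment target.

```python
# Sketch: the augmented assignment is rejected at compile time,
# before any statement of the program runs.
try:
    compile("count.setdefault(word, 0) += 1", "<example>", "exec")
    rejected = False
except SyntaxError:
    rejected = True

def word_count(words):
    """Count occurrences with a plain dict, using get() for missing keys."""
    count = {}
    for word in words:
        # A subscription is a valid assignment target; a function call is not.
        count[word] = count.get(word, 0) + 1
    return count

counts = word_count(['Hiiragi', 'Kagami', 'Hiiragi', 'Tukasa', 'Yosimizu', 'Kagami'])
print(rejected, counts)
```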

Just because Python makes the literal {} equivalent to dict(), it does not follow that dict is a silver bullet. Python has plenty of data structures, and for statistical problems the ideal solution is the collections.defaultdict class. The following code should be clear at a glance

    from collections import defaultdict

    doc_a = {'id': 'a', 'words': ['word_w', 'word_x', 'word_y']}
    doc_b = {'id': 'b', 'words': ['word_x', 'word_z']}
    doc_c = {'id': 'c', 'words': ['word_y']}

    docs = [doc_a, doc_b, doc_c]
    indices = defaultdict(list)  # a missing key starts as an empty list
    for doc in docs:
        for word in doc['words']:
            indices[word].append(doc['id'])

    print(indices)

    def word_count(words):
        count = defaultdict(int)  # a missing key starts at 0
        for word in words:
            count[word] += 1
        return count

    print(word_count(['Hiiragi', 'Kagami', 'Hiiragi', 'Tukasa', 'Yosimizu', 'Kagami']))

This solves all of the problems above.

There is also a Counter in collections, which can be roughly thought of as an extension of defaultdict (int).
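
As a rough sketch of how Counter might replace the word_count function for the same word list (the key 'missing_word' is a hypothetical placeholder chosen only to show the lookup of an absent key):

```python
from collections import Counter

words = ['Hiiragi', 'Kagami', 'Hiiragi', 'Tukasa', 'Yosimizu', 'Kagami']

# Counter counts the items of an iterable in one call.
count = Counter(words)

# Like defaultdict(int), a missing key reads as 0; unlike defaultdict,
# reading it does not insert the key.
missing = count['missing_word']
inserted = 'missing_word' in count

# most_common(n) returns the n most frequent (item, count) pairs.
top_two = count.most_common(2)

print(count['Hiiragi'], missing, inserted, top_two)
```

Beyond counting, Counter adds helpers such as most_common, which is why it can be seen as more than defaultdict(int).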



This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only.
