Handling indexes and statistics with dictionaries in Python

Source: Internet
Author: User
Recently I have been doing a lot of work on an indexing engine and on data statistics, which means dealing with Python dictionaries constantly. This is a record of the relevant API usage and the pitfalls I ran into.

An indexing engine works on an inverted index, which maps each piece of text contained in a document back to the documents that contain it. There are not many tricks to the algorithm itself. To improve efficiency, the index data can be moved into memory. Here one can follow the example of Wang Xianzhi practicing calligraphy (who, as the story goes, used up 18 vats of water): fill up the memory of 18 machines and you will basically succeed. The basic idea is easiest to see in a simple example. Suppose we have the following documents (word segmentation already done) and the keywords each one contains:

  doc_a: [word_w, word_x, word_y]
  doc_b: [word_x, word_z]
  doc_c: [word_y]

Transform this into

  word_w: [doc_a]
  word_x: [doc_a, doc_b]
  word_y: [doc_a, doc_c]
  word_z: [doc_b]

Written as Python code, this is

    doc_a = {'id': 'a', 'words': ['word_w', 'word_x', 'word_y']}
    doc_b = {'id': 'b', 'words': ['word_x', 'word_z']}
    doc_c = {'id': 'c', 'words': ['word_y']}

    docs = [doc_a, doc_b, doc_c]
    indices = dict()
    for doc in docs:
        for word in doc['words']:
            # Create an empty posting list the first time a word is seen.
            if word not in indices:
                indices[word] = []
            indices[word].append(doc['id'])

    print(indices)

There is a small trick here, though. The branch that checks whether the current word is already in the index dictionary

    if word not in indices:
        indices[word] = []

can be replaced with dict's setdefault(key, default=None) method. If key is already in the dictionary, the method simply returns the corresponding value; otherwise it creates the key, sets it to the default value, and returns that value. From a design standpoint, though, I do not understand why the default value is None, which does not seem very useful. If you really use this method, you will generally pass a default yourself, as follows

    for doc in docs:
        for word in doc['words']:
            indices.setdefault(word, []).append(doc['id'])

This removes the branch and makes the code noticeably shorter.
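To make the setdefault semantics concrete, here is a minimal sketch on a small, made-up dictionary. The keys and values are illustrative only; it shows the three cases: key present, key absent with an explicit default, and key absent with the implicit None default.

```python
# Sketch of dict.setdefault semantics on a small, made-up dictionary.
d = {'word_x': ['a']}

# Key present: the stored value is returned and the default is ignored.
existing = d.setdefault('word_x', [])
existing.append('b')

# Key absent: the default is stored under the key and then returned.
created = d.setdefault('word_z', [])
created.append('b')

# No explicit default: a missing key is stored with the value None.
nothing = d.setdefault('word_q')

print(d['word_x'])
print(d['word_z'])
print(nothing)
```

Because the returned object is the one stored in the dictionary, appending to it updates the dictionary in place, which is what the one-line indexing loop relies on.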
In some cases, however, setdefault is awkward to use: when the default value is expensive to construct, when constructing it has side effects, and in one further case described later. The first two come down to the same thing: setdefault does not fit scenarios where the default value needs lazy evaluation. To accommodate that need, setdefault could have been designed as

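    def setdefault(self, key, default_factory):
        # Hypothetical design: the factory is only called when the key is missing.
        if key not in self:
            self[key] = default_factory()
        return self[key]

If that were the case, the code above would become

    for doc in docs:
        for word in doc['words']:
            # Only valid with the hypothetical lazy setdefault above,
            # not with the real dict.setdefault.
            indices.setdefault(word, list).append(doc['id'])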
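The eager-evaluation problem can be observed directly. The sketch below is illustrative rather than a recommended implementation: make_default is a stand-in for an expensive or side-effecting constructor, and LazyDict with its setdefault_lazy method is a hypothetical subclass written for this example, not something the standard library provides.

```python
# Sketch: dict.setdefault evaluates its default argument on every call.
calls = []

def make_default():
    # Stand-in for an expensive or side-effecting constructor (hypothetical).
    calls.append(1)
    return []

plain = {}
for word in ['word_x', 'word_x', 'word_x']:
    # make_default() runs before setdefault is even entered.
    plain.setdefault(word, make_default()).append('a')
eager_calls = len(calls)


class LazyDict(dict):
    """Hypothetical dict subclass; not part of the standard library."""

    def setdefault_lazy(self, key, default_factory):
        # Call the factory only when the key is missing.
        if key not in self:
            self[key] = default_factory()
        return self[key]


del calls[:]
lazy = LazyDict()
for word in ['word_x', 'word_x', 'word_x']:
    lazy.setdefault_lazy(word, make_default).append('a')
lazy_calls = len(calls)

print('eager: %d, lazy: %d' % (eager_calls, lazy_calls))
```

With the real setdefault the constructor runs once per loop iteration, while the lazy variant runs it only for the first occurrence of each key.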

But there are actually other alternatives, which will be mentioned at the end.

If that was merely an API flaw you can foresee but may never actually run into, the next one hits you right in the face.
Consider word frequency statistics, that is, counting how many times each term appears in an article. Written directly with dict, it would look roughly like this

    def word_count(words):
        count = dict()
        for word in words:
            count.setdefault(word, 0) += 1  # rejected: a call is not an assignment target
        return count

    print(word_count(['Hiiragi', 'Kagami', 'Hiiragi', 'Tukasa', 'Yosimizu', 'Kagami']))

Run the code above with gusto and Python throws an error right at the tip of your nose, swift as a thunderclap: the count.setdefault(word, 0) on the left-hand side of the += operator is not an lvalue in Python, so the code is rejected with a SyntaxError. Well, starting to miss the C++ type system now?
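As a rough illustration, the failure can be reproduced without crashing the script by compiling the offending statement from a string. The second half shows one plain-dict workaround based on dict.get; this workaround is an alternative added here for comparison, not the approach the article goes on to recommend.

```python
# Sketch: the augmented assignment is rejected at compile time,
# before any of the code runs.
source = "count = dict()\ncount.setdefault('Hiiragi', 0) += 1\n"
try:
    compile(source, '<example>', 'exec')
    rejected = False
except SyntaxError:
    rejected = True


def word_count(words):
    # Plain-dict workaround: read with a default via get(),
    # then store the incremented value back under the key.
    count = dict()
    for word in words:
        count[word] = count.get(word, 0) + 1
    return count


result = word_count(['Hiiragi', 'Kagami', 'Hiiragi', 'Tukasa', 'Yosimizu', 'Kagami'])
print(rejected)
print(result)
```

The workaround runs, but it spells out the key twice on every update, which is the kind of repetition the next approach avoids.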

Because Python makes the literal {} equivalent to dict(), it is tempting to treat dict as a silver bullet, but that is not advisable. Python has plenty of data structures, and for statistics problems the ideal choice is the collections.defaultdict class. The following code should be understandable at a glance

    from collections import defaultdict

    doc_a = {'id': 'a', 'words': ['word_w', 'word_x', 'word_y']}
    doc_b = {'id': 'b', 'words': ['word_x', 'word_z']}
    doc_c = {'id': 'c', 'words': ['word_y']}

    docs = [doc_a, doc_b, doc_c]
    indices = defaultdict(list)  # missing keys start as an empty list
    for doc in docs:
        for word in doc['words']:
            indices[word].append(doc['id'])

    print(indices)

    def word_count(words):
        count = defaultdict(int)  # missing keys start at 0
        for word in words:
            count[word] += 1
        return count

    print(word_count(['Hiiragi', 'Kagami', 'Hiiragi', 'Tukasa', 'Yosimizu', 'Kagami']))

This solves all of the problems above.
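A small sketch of why defaultdict covers the lazy-evaluation case: the factory passed to the constructor is only called when a missing key is looked up with square brackets. The make_list factory and the key names below are made up for illustration. The sketch also shows a caveat worth knowing: a bare lookup of a missing key inserts it.

```python
from collections import defaultdict

calls = []

def make_list():
    # Hypothetical factory that records each invocation.
    calls.append(1)
    return []

indices = defaultdict(make_list)
indices['word_x'].append('a')  # missing key: the factory is called
indices['word_x'].append('b')  # existing key: the factory is not called
factory_calls = len(calls)

# Caveat: looking up a missing key with [] also inserts it.
_ = indices['word_q']
has_q = 'word_q' in indices

# Membership tests and get() neither call the factory nor insert the key.
has_r = 'word_r' in indices
got = indices.get('word_r')

print(factory_calls)
print(has_q)
print(has_r)
print(got)
```

When a lookup should not modify the index, for example while answering queries, using in or get() keeps the dictionary from filling up with empty entries.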

There is also a Counter in collections, which can be roughly thought of as an extension of defaultdict (int).
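For comparison, a minimal sketch of the same word count using Counter. The word list is the one used earlier in the article; the key 'missing_word' is a made-up name for a term that never appears.

```python
from collections import Counter

words = ['Hiiragi', 'Kagami', 'Hiiragi', 'Tukasa', 'Yosimizu', 'Kagami']

# Counting an iterable takes a single constructor call.
count = Counter(words)

# Increments work the same way as with defaultdict(int).
count['Hiiragi'] += 1

# A missing key reads as 0, and reading it does not insert the key.
missing = count['missing_word']

# most_common(n) returns the n highest counts as (item, count) pairs.
top = count.most_common(1)

print(count['Hiiragi'])
print(missing)
print(top)
```

One difference from defaultdict(int) visible here is that reading a missing key leaves the Counter unchanged.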
