How to handle index statistics with a dictionary in Python

Source: Internet
Author: User

This article introduces how to use Python dictionaries for indexing and statistics. Dictionaries are basic Python knowledge, and this article is a small related exercise; readers who need it can use it as a reference.

I have recently been doing a lot of work on an index engine and on data statistics, and so have been dealing with Python dictionaries constantly. I am writing down the relevant API usage and pitfalls here for the record.

The basic principle behind an index engine is the inverted index, which maps each word contained in a document back to that document. There are not many algorithmic tricks here. To improve efficiency, the index data can be moved into memory; in the spirit of Wang Xianzhi's calligraphy practice, once the memory of 18 machines has been stuffed full, you have basically succeeded. As a simple example of the basic idea, suppose we have the following documents (already tokenized) and the keywords they contain:


    doc_a: [word_w, word_x, word_y]
    doc_b: [word_x, word_z]
    doc_c: [word_y]

The inverted index transforms this into:


    word_w -> [doc_a]
    word_x -> [doc_a, doc_b]
    word_y -> [doc_a, doc_c]
    word_z -> [doc_b]

Written in Python, this is:


    doc_a = {'id': 'A', 'words': ['word_w', 'word_x', 'word_y']}
    doc_b = {'id': 'B', 'words': ['word_x', 'word_z']}
    doc_c = {'id': 'C', 'words': ['word_y']}

    docs = [doc_a, doc_b, doc_c]

    # Build the inverted index: word -> list of document ids.
    indices = dict()
    for doc in docs:
        for word in doc['words']:
            if word not in indices:
                indices[word] = []
            indices[word].append(doc['id'])

    print(indices)

There is a small trick here, though. The check for whether the current word is already in the index dictionary,


    if word not in indices:
        indices[word] = []

can be replaced by the dict method setdefault(key, default=None). If the key is in the dictionary, the method returns the corresponding value; otherwise it inserts the key with default as its value and returns that value. From a design point of view, I do not understand why default falls back to None, which does not seem very useful. If you use this method at all, you will generally pass a default, as follows:


    for doc in docs:
        for word in doc['words']:
            indices.setdefault(word, []).append(doc['id'])

This removes the branch and makes the code noticeably shorter.
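To make the behaviour of setdefault concrete, here is a minimal sketch. The dictionary contents and the key 'word_q' are made up for illustration; only dict.setdefault itself comes from the text.

```python
indices = {'word_x': ['A']}

# Key present: the existing list is returned and the default is ignored.
existing = indices.setdefault('word_x', [])
existing.append('B')

# Key absent: the default is stored under the key and then returned.
created = indices.setdefault('word_z', [])
created.append('B')

# With no default given, a missing key is stored with the value None.
nothing = indices.setdefault('word_q')

print(indices)
```

Because the returned object is the one stored in the dictionary, appending to it updates the index in place.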

In some cases, however, setdefault is awkward to use: when the default is expensive to construct, when constructing the default has side effects, and in one more situation discussed later. The first two cases amount to saying that setdefault does not suit scenarios where the default needs lazy evaluation, because the default argument is always evaluated before the call. To accommodate that need, setdefault could have been designed as


    def setdefault(self, key, default_factory):
        if key not in self:
            # Only build the default when the key is actually missing.
            self[key] = default_factory()
        return self[key]

If that were the case, the code above would become


    for doc in docs:
        for word in doc['words']:
            indices.setdefault(word, list).append(doc['id'])

There are other alternatives, though, which are covered at the end.
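The lazy-evaluation concern described above can be observed directly. The sketch below is only an illustration: expensive_default and setdefault_lazy are hypothetical names, not part of Python or of the original code, and the helper is a free function rather than a real dict method.

```python
calls = []

def expensive_default():
    # Hypothetical stand-in for a costly or side-effecting constructor.
    calls.append('built')
    return []

def setdefault_lazy(mapping, key, default_factory):
    # Hypothetical helper: only call the factory when the key is missing.
    if key not in mapping:
        mapping[key] = default_factory()
    return mapping[key]

indices = {'word_x': ['A']}

# Built-in setdefault: the argument is evaluated before the call,
# so the default is built even though 'word_x' already exists.
indices.setdefault('word_x', expensive_default()).append('B')
eager_calls = len(calls)

# Lazy helper: the factory is not called for an existing key.
setdefault_lazy(indices, 'word_x', expensive_default).append('C')
lazy_calls = len(calls) - eager_calls

print(eager_calls, lazy_calls)
print(indices)
```

The built-in call constructs one throwaway list, while the helper constructs none, which is the difference the factory-based design is meant to capture.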

If that was only a foreseeable API flaw that you may never actually run into, the next one is more of a slap in the face.

Now consider word frequency statistics, that is, counting how many times each word appears in an article. Written directly with dict, it looks roughly like this:


    def word_count(words):
        count = dict()
        for word in words:
            # Invalid: a function call is not a valid assignment target.
            count.setdefault(word, 0) += 1
        return count

    print(word_count(['Hiiragi', 'Kagami', 'Hiiragi', 'Tukasa', 'Yosimizu', 'Kagami']))

When you eagerly run this code, Python throws an error in your face before you can blink: the expression count.setdefault(word, 0) on the left of the += operator is not an lvalue in Python, so the interpreter rejects the code with a SyntaxError before anything runs. Well, now the C++ type system starts to look good.
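The failure can be confirmed without crashing a script by passing the offending line to the built-in compile(). The sketch below also shows one plain-dict workaround based on dict.get; that workaround is an addition for illustration, not the approach the article goes on to recommend.

```python
# compile() lets us observe the error without breaking this script.
source = "count = {}\ncount.setdefault('Hiiragi', 0) += 1\n"
try:
    compile(source, '<example>', 'exec')
    rejected = False
except SyntaxError:
    rejected = True

def word_count(words):
    count = dict()
    for word in words:
        # get() returns 0 for an unseen word; a subscript is a valid target.
        count[word] = count.get(word, 0) + 1
    return count

result = word_count(['Hiiragi', 'Kagami', 'Hiiragi', 'Tukasa', 'Yosimizu', 'Kagami'])
print(rejected)
print(result)
```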

Just because Python offers the literal {} as an equivalent of dict() does not make dict a silver bullet. Python has plenty of data structures, and for statistics problems the ideal choice is the collections.defaultdict class. One look at the following code should be enough to understand it:


    from collections import defaultdict

    doc_a = {'id': 'A', 'words': ['word_w', 'word_x', 'word_y']}
    doc_b = {'id': 'B', 'words': ['word_x', 'word_z']}
    doc_c = {'id': 'C', 'words': ['word_y']}

    docs = [doc_a, doc_b, doc_c]

    # Missing keys are created on first access by calling list().
    indices = defaultdict(list)
    for doc in docs:
        for word in doc['words']:
            indices[word].append(doc['id'])

    print(indices)

    def word_count(words):
        # Missing keys start at int(), which is 0.
        count = defaultdict(int)
        for word in words:
            count[word] += 1
        return count

    print(word_count(['Hiiragi', 'Kagami', 'Hiiragi', 'Tukasa', 'Yosimizu', 'Kagami']))

This neatly solves both of the earlier problems.
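It may help to see that defaultdict gives the lazy behaviour that setdefault lacks, along with one thing to watch for. In this sketch make_list is a hypothetical factory that records its own calls, and the keys 'word_q' and 'word_z' are made up for the demonstration.

```python
from collections import defaultdict

calls = []

def make_list():
    # Hypothetical factory that records each time it is invoked.
    calls.append('built')
    return []

indices = defaultdict(make_list)
indices['word_x'].append('A')   # missing key: the factory runs once
indices['word_x'].append('B')   # existing key: the factory is not called

# Reading a missing key with [] also inserts it ...
_ = indices['word_q']
# ... whereas `in` and get() leave the dictionary unchanged.
has_z = 'word_z' in indices
got_z = indices.get('word_z')

print(len(calls))
print(dict(indices))
```

The factory runs only for keys that are actually missing, but a plain lookup of an unknown key grows the dictionary, so membership tests are safer when you only want to check for a word.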


The collections module also has a Counter class, which can be roughly regarded as an extension of defaultdict(int).
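As a brief sketch of how Counter handles the same word list used earlier (the word 'Konata' is made up here to show the behaviour for a missing key):

```python
from collections import Counter

words = ['Hiiragi', 'Kagami', 'Hiiragi', 'Tukasa', 'Yosimizu', 'Kagami']
count = Counter(words)

print(count['Hiiragi'])      # 2
print(count['Konata'])       # 0, and the missing key is not inserted
print(count.most_common(2))  # the two most frequent words with their counts
```

Unlike defaultdict(int), looking up an unseen word in a Counter returns 0 without adding the key, and most_common() gives a frequency ranking directly.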
