From word counting to interviews

Source: Internet
Author: User

Much of the content in this article comes from the Internet. If you spot any errors, please point them out.

Problem description

First, we define a word as a string delimited by spaces. The word-counting problem can be described as follows: given an ordinary English document (since this is an interview question, Chinese word segmentation is not considered here), count the number of times each word appears, then report the N most frequent words and their counts. The problem is simple and clear, so it needs no further explanation.

Why do interviewers like to ask such questions?

Most of these problems have one thing in common: they involve not only data storage but also sorting. More importantly, an ordinary approach can solve the problem, but the interviewer is looking for the best performance and the most elegant form.

How can this problem be solved?

After studying the problem carefully, we find there are only three things to do: 1. count the words; 2. sort; 3. output the first N items. Step 3 follows directly from step 2 and can be ignored. Further analysis shows that this is a typical top-k problem. Setting aside the classic solutions to the top-k problem, let's work through the solutions step by step.

1. Map

The C++ Standard Template Library provides an efficient container, map, which is a key-value container (an associative container). It can therefore easily store <word, count> pairs. The most concise solution is as follows:

#include <iostream>
#include <map>
#include <string>
using namespace std;

int main(void) {
    map<string, int> wc;               // word -> number of occurrences
    map<string, int>::iterator j;
    string w;
    while (cin >> w) {                 // operator>> splits the input on whitespace
        wc[w]++;
    }
    for (j = wc.begin(); j != wc.end(); j++) {
        cout << j->first << " : " << j->second << "\n";
    }
    return 0;
}

Because of how map works (insertion keeps the elements ordered), the output is sorted by key (that is, by word), not by count. Therefore, you still need to sort the output by count. As for how to sort (sort, qsort, or a hand-written heap sort), I will not go into it here.
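The sorting step is left open above. One possible way to do it, sketched here rather than taken from the original article, is to copy the map's entries into a vector and let std::partial_sort order only the first N by count. The helper name topN and the alphabetical tie-break are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

using WordCount = std::pair<std::string, int>;

// Return the n most frequent entries of wc, highest count first.
// topN is a hypothetical helper; ties are broken alphabetically (an assumption)
// so that the result is deterministic.
std::vector<WordCount> topN(const std::map<std::string, int>& wc, std::size_t n) {
    std::vector<WordCount> items(wc.begin(), wc.end());
    n = std::min(n, items.size());
    // Only the first n positions are fully ordered; the rest stay unsorted.
    std::partial_sort(items.begin(), items.begin() + n, items.end(),
                      [](const WordCount& a, const WordCount& b) {
                          if (a.second != b.second) return a.second > b.second;
                          return a.first < b.first;
                      });
    items.resize(n);
    return items;
}
```

Calling topN(wc, 10) after the counting loop would give the ten most frequent words. Sorting only the first N positions avoids fully ordering words that will never be printed.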

This is quite good. The map container is typically built on a red-black tree, so each insertion and lookup costs O(log n).

But is this really good enough? Don't forget that the interviewer expects more.

2. Hash table: direct access, higher efficiency

In addition to map, you can also use a custom hash table to map words to their counts. Each hash table node should include three basic fields: a pointer to the word (char * or String), an int holding the number of occurrences, and a pointer to the next node (node *). Let us assume that hash collisions are handled by separate chaining (linked lists). Then:

The hash table node is defined as:

typedef struct node {
    char *word;           /* the word itself */
    int count;            /* number of occurrences */
    struct node *next;    /* next node in the same bucket */
} node;

Assume that the number of distinct words will not exceed 50,000. We can then choose 50021 as the size of the hash table. For the hash function, we can use the common BKDR algorithm (with 31 as the multiplication factor). This maps each word to an unsigned int:

#define HASH_SIZE 50021
#define MUL_FACTOR 31

node *mmap[HASH_SIZE];    /* bucket heads; a global array starts out all NULL */

unsigned int hashCode(char *p) {
    unsigned int result = 0;
    while (*p != '\0') {
        result = result * MUL_FACTOR + (unsigned char)*p++;
    }
    return result % HASH_SIZE;
}
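To make the arithmetic concrete, here is a small C++ sketch of the same BKDR-style calculation. The names kHashSize, kMulFactor, and bucketIndex are illustrative and not from the original. For the word "ab" the running value is 97, then 97 * 31 + 98 = 3105, which is already below 50021, so the bucket index is 3105.

```cpp
#include <string>

// Assumed constants, matching the values chosen in the text.
constexpr unsigned int kHashSize = 50021;
constexpr unsigned int kMulFactor = 31;

// BKDR-style hash reduced to a bucket index (hypothetical helper name).
unsigned int bucketIndex(const std::string& word) {
    unsigned int result = 0;
    for (unsigned char c : word) {
        // Unsigned overflow wraps around, which is harmless for hashing.
        result = result * kMulFactor + c;
    }
    return result % kHashSize;
}
```

Reading each character as unsigned char keeps bytes above 127 from being added as negative numbers.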

The wordCount(char *w) function records one occurrence of a word in the hash table. If the table already has a node for the word, its count is incremented and the function returns; otherwise, a new node is added at the head of the bucket's chain with its count initialized to 1.

#include <stdlib.h>
#include <string.h>

void wordCount(char *w) {
    unsigned int h = hashCode(w);
    node *p;

    /* Walk the bucket's chain; bump the count if the word is already there. */
    for (p = mmap[h]; p != NULL; p = p->next) {
        if (strcmp(w, p->word) == 0) {
            (p->count)++;
            return;
        }
    }

    /* Not found: insert a new node at the head of the chain. */
    node *q = (node *)malloc(sizeof(node));
    q->count = 1;
    q->word = (char *)malloc(strlen(w) + 1);    /* strlen, not sizeof: w is a pointer */
    strcpy(q->word, w);
    q->next = mmap[h];
    mmap[h] = q;
}

Likewise, the hash table only stores the data; it does not sort it, so ordering by count remains a separate step.
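For that separate step, a classic top-k technique is a min-heap of size k, which avoids sorting every entry. The following is a minimal sketch, assuming the counts are held in a std::unordered_map instead of the hand-written table above. The name topKByHeap is hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <queue>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

using Entry = std::pair<int, std::string>;  // (count, word)

// Scan the table once, keeping only the k entries with the largest counts.
// topKByHeap is a hypothetical helper used for illustration.
std::vector<Entry> topKByHeap(const std::unordered_map<std::string, int>& counts,
                              std::size_t k) {
    // Min-heap: the smallest of the current top k sits at the top.
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;
    for (const auto& kv : counts) {
        heap.emplace(kv.second, kv.first);
        if (heap.size() > k) heap.pop();  // evict the current minimum
    }
    std::vector<Entry> result;
    while (!heap.empty()) {
        result.push_back(heap.top());
        heap.pop();
    }
    std::reverse(result.begin(), result.end());  // largest count first
    return result;
}
```

With n distinct words this takes O(n log k) time and O(k) extra space, compared with O(n log n) for sorting all the entries.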

This seems perfect.

Fine, but what if I don't know algorithms or C++ containers? What then?

3. Shell version of word counting

Linux provides many text tools, such as uniq, tr, and sort. Pipes remove the need to store the text, the words, and the intermediate results yourself. Most importantly, these tools are black boxes: you only need to know how to use them, not how they are implemented internally.

For this problem, the Linux tools can be chained like this:

cat word | tr -cs a-zA-Z\' '\n' | tr A-Z a-z | sort | uniq -c | sort -k1,1nr | head -n 10

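Explanation of each stage of the command:

cat word: writes the contents of the file word to standard output.
tr -cs a-zA-Z\' '\n': replaces every run of characters other than letters and apostrophes with a single newline, so each word ends up on its own line.
tr A-Z a-z: converts uppercase letters to lowercase.
sort: sorts the words so that identical words become adjacent.
uniq -c: collapses adjacent identical lines and prefixes each with its count.
sort -k1,1nr: sorts numerically on the first field (the count) in descending order.
head -n 10: keeps the first 10 lines, that is, the 10 most frequent words.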
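As a rough illustration of what the two tr stages of the command do, here is a C++ sketch that splits text on anything other than ASCII letters and apostrophes and lowercases the result. The function name tokenize is hypothetical, and the sketch assumes the default C locale. It only approximates tr: leading separators here produce no empty first entry.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Split text into lowercase words. Every character other than a letter or an
// apostrophe acts as a separator, and runs of separators are squeezed.
// tokenize is a hypothetical helper; it assumes the default "C" locale.
std::vector<std::string> tokenize(const std::string& text) {
    std::vector<std::string> words;
    std::string current;
    for (unsigned char c : text) {
        if (std::isalpha(c) || c == '\'') {
            current.push_back(static_cast<char>(std::tolower(c)));
        } else if (!current.empty()) {
            words.push_back(current);  // a separator ends the current word
            current.clear();
        }
    }
    if (!current.empty()) words.push_back(current);
    return words;
}
```

The remaining stages (sort, uniq -c, sort by count, head) correspond to the counting and top-N steps discussed in the earlier sections.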

This reminds me of a word-counting question I was asked when I interviewed at Baidu several years ago: a text contains many words, one per line, so how do you count the occurrences of each word? My answer was shaky: a PHP script that reads one line at a time and stores the words and counts in an associative array ....... The outcome was predictably tragic.

4. AWK, a powerful tool for text processing

Since we have mentioned the Linux tools, we have to mention awk, a powerful text analysis and processing tool. The Wiki says:

AWK is an excellent text processing tool and one of the most powerful data processing engines in Linux and Unix environments. AWK provides extremely powerful features: regular expression matching, pattern loading, flow control, mathematical operators, process control statements, and even built-in variables and functions. It has almost all the refinements of a complete language.

That may be one source's praise, but I have to admit that awk really is powerful.

awk -F ' |,' '{
    for (i = 1; i <= NF; i++) {
        a[tolower($i)]++;
    }
}
END {
    for (i in a)
        print i, a[i] | "sort -k2,2nr";
}' word

In gawk 3.1+, the built-in functions asort and asorti can sort arrays, but they are limited. For example, when asort(a) is called on an associative array a, only the values are sorted; the keys are discarded and replaced by new numeric indices 1 to n. You can work around this by passing a custom sorting function or by piping the results to the system sort, as above.

5. Database version

Let's assume that the text has already been split into words. The basic structure of the database table is:

CREATE TABLE `test` (
  `word` varchar(20) DEFAULT NULL
) ENGINE=MyISAM DEFAULT CHARSET=gbk;

Import the data (LOAD DATA INFILE could also be used):

awk 'BEGIN {
    sql = "insert into test(word) values ";
    mysql = "mysql -hxxxxxx -uxxxxx -pxxx test -e ";
}
{
    for (i = 1; i <= NF; i++) {
        sq = "\"" sql "('\''" $i "'\'')" "\"";
        print mysql sq | "sh -x";
    }
}' word

Simple query:

select word,count(word) as total from test group by word order by total desc;

This gives you the count of each word, most frequent first. The method is very simple because the database does all the counting and sorting for you, which is why many people turn to a database whenever a need arises.

A key-value cache system such as Redis can also be used to sort the data.

Thoughts

If you were the interviewer, which solution would you prefer: algorithms and data structures, or shell and awk? As far as I am concerned, an interview is about selecting talent, not about making things hard for candidates with so-called "clever tricks". If I saw someone use shell or awk, I would give them a high score. Algorithms may be king, but do you really need to "start everything from the source code"?

 

References:
1. http://www.cnblogs.com/ggjucheng/archive/2013/01/13/2858470.html
2. shell script Guide
3. Programming Pearls


How to count words in C

#include <stdio.h>

int main(void)
{
    char s[81], c;
    int i, num = 0, word = 0;
    printf("Enter a line of English:\n");
    fgets(s, sizeof(s), stdin);              // fgets instead of gets, which is unsafe
    for (i = 0; (c = s[i]) != '\0'; i++)     // loop until the end of the string
        if (c == ' ' || c == '\n') word = 0; // a space (or the trailing newline) ends the current word
        else if (word == 0)                  // a non-space character that follows a space
        {
            word = 1;                        // flag: we are now inside a word
            num++;                           // a new word has started, so add 1 to the word count
        }
    printf("%d words in the line.\n", num);
    return 0;
}

The principle is simple:

Examine each character in turn. If it is a space, set the flag word to 0. If it is not a space, check whether word is 0 (that is, whether the previous character was a space or this is the start of the line). If so, a new word has begun, so add 1 to num and set word to 1. Repeat until the end of the sentence.
