Original: How an algorithm problem that puzzled someone for a year went from 4 seconds to 10 milliseconds
In the first week after the May 1 holiday I sprained my lower back while moving house, ignored it until it started pressing on a nerve, and ended up resting in bed for several days. With nothing better to do I stayed logged in to QQ, and a netizen suddenly asked me an algorithm question. That is how this article came about. The experience left a deep impression on me, so I am writing it down as a keepsake, for newcomers and for non-professionals who love programming. The technical content may be modest, but it is all real. I spent very little time on it, yet it solved a problem that had puzzled this netizen for a year. He was extremely grateful, and I was left with rather mixed feelings. Grab a cup of tea and let's walk through what happened.
Original article: http://www.cnblogs.com/asxinyu/p/4504487.html
1. Background on the person
I only learned about this netizen's situation from chatting with him afterwards. He was born in 1977 and is from Hubei. He taught himself VB.NET in order to analyze his data; picking that up at his age is really not easy, and he can use VB.NET to build fairly complex data-analysis interfaces. Honestly, I felt a little ashamed when I learned this. So it is understandable that he ran into trouble with an algorithm question.
He has actually been one of my QQ friends for a long time. I knew he did data analysis, so whenever I published a new algorithm article I sent it to him, and we chatted occasionally, but he had never asked me a question. Last month I published an article, "Machine learning PageRank algorithm application and C# implementation (1): algorithm introduction", and it was only after reading it that he asked me about this problem.
As for me: I am really just a dabbler too. I am not proficient in algorithms; I only study them as a hobby. To be honest, if you asked me to write a binary search I could not produce one on the spot, but given papers and reference material I can write a Markov chain, a Bayesian model, or something similar. How should I put it? For many problems, space and time efficiency can be given a little less weight, especially now that hardware is so plentiful. That is not to say algorithms are useless. But for many ordinary people the problems they work on are small in scale, and for reasons of experience and background they have no foundation in algorithms and data structures, so they can only set those concerns aside and focus on solving the actual problem.
2. The original question
This is the netizen's original question, copied directly from the QQ chat log:
There are two sets of randomly generated Int32 data (0~99999), A and B. For each element of A, in order, I need to determine whether it is present in B and record the result in a Boolean array C. I tried both an array and List(Of T). In VS2010 on my old computer, the array takes about 4 seconds and List(Of T) takes 24 seconds. Below is my code using the array and List(Of T). Could an expert please give me some guidance, and by the way, is there an instant way to do this? (Note: I will not post his VB code; the idea is clear without it.)
Please help me see what I can do, thank you.
Some people say to use a hash. Unfortunately I don't know how, and I could not find it on Baidu either.
His development environment is VS2010 + VB.NET.
I received his message while using QQ on my phone. He had also pasted a chunk of VB code, and I rather dislike it when people just paste code at me. But I was lying in bed with nothing to do and was curious, so I read the problem carefully; the code I really did not look at.
3. Problem-solving process
Because I was on my phone, I did not turn on the computer to write any code. I just thought about it for a moment.
The netizen's original code used Array.IndexOf for the comparison. With arrays of 70,000 elements, each lookup is a linear scan of B, so in the worst case that is 70,000 × 70,000 comparisons; no wonder it is slow.
1. First, I dismissed the hash suggestion. In fact I had misunderstood: I thought he meant computing a hash value for each element and then comparing the hashes, which seemed pointless. You would spend time computing the hashes and still have to compare them, so why bother? Later I realized he probably meant a "hash table", which is a different thing. I will leave it aside for now. I was not sure how to apply a hash table here; it would probably work too, but let's look at my method first.
2. I first gave him a preliminary plan. Problems are not always solved in one step, so try something first. My reasoning was that searching with IndexOf wastes a lot of time. So sort B first (or keep B sorted as it is built), then go through A in turn and look each element up with binary search. If conditions allow, A can be sorted too, and each search can then start from the position where the previous one ended, which saves even more time. As for sorting A and B, in his situation the data can be kept in order as it is produced, rather than sorted after generation, which would cost more time.
The netizen was quick. About an hour later he had tested it and said: "I tested it with random numbers and the speedup is quite obvious; it is faster than Array.IndexOf."
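The preliminary plan (sort B once, then binary-search it for each element of A) can be sketched in C# roughly as follows. This is only an illustration, not the netizen's VB.NET code, which is not shown in this post; the names SortedLookup and MarkPresent are made up for the example.

```csharp
using System;

public static class SortedLookup
{
    // For each a[i], record in the result whether that value occurs anywhere in b.
    public static bool[] MarkPresent(int[] a, int[] b)
    {
        // Sort a copy so the caller's array keeps its original order.
        int[] sortedB = (int[])b.Clone();
        Array.Sort(sortedB);

        bool[] c = new bool[a.Length];
        for (int i = 0; i < a.Length; i++)
        {
            // Array.BinarySearch returns a non-negative index when the value
            // is found and a negative number when it is not.
            c[i] = Array.BinarySearch(sortedB, a[i]) >= 0;
        }
        return c;
    }
}
```

Each lookup now costs on the order of log2(length of B) comparisons instead of a full scan, which is consistent with the speedup the netizen reported.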
3. Talking over the phone was inconvenient and I had only described the idea casually, so I did not expect him to get it working so soon. It was a lot faster, though I did not ask for the actual timing. Then I went to take a shower, and it occurred to me that the problem was not that hard, that I had done something similar before, and that there should be a quicker way. After a few seconds of thought in the shower, an idea came. It felt rather crude to me, but I expected it to work very well in practice. So right after the shower I turned on the computer and told the netizen the idea. Since he might not follow pseudocode or a rigorous description (and I do not really know how to describe it rigorously anyway), I explained it with an analogy. For readers' convenience I also wrote up a rough outline of the idea, which should be readable. As for recording the result in C, I did not ask him how he records it; that is beside the point. The core is the earlier step of deciding whether an element is contained.
This is the analogy I gave him (the original chat was a bit messy, so I tidied it slightly for the blog). I am not sure whether it reads ambiguously, and I feel the pseudocode is easier to follow, but happily the netizen understood it:
Array A: leave it alone; random, unsorted.
Array B: [5,2,4,1], assuming the maximum value is 5. Note that 3 is missing.
Initialize a Boolean array with one slot per possible value (maximum 5), written here as lowercase a to distinguish it from the data array A: a[1], a[2], a[3], a[4], a[5]
Loop over B, using each value of B as an index into a, and mark that position true, for example:
a[5] = true;
a[2] = true;
a[4] = true;
a[1] = true;
Note that a[3] is never set, so it stays false.
Finally loop over A and simply check the index. If A = {2,3}, then:
a[2] = true, meaning 2 exists in B, so the corresponding entry of C is marked true.
a[3] = false, meaning 3 does not exist in B, so the corresponding entry of C is marked false.
If your maximum is 99999, the array will be that long; you can simply size it to cover 0 to 99999 and waste a bit of space.
If your business logic lets you find the maximum value directly, that is best. Try it yourself.
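The analogy translates almost line for line into C#. Here is a minimal sketch, assuming every value is non-negative and no larger than a known maxValue; FlagLookup, MarkPresent, and seen are illustrative names, not anything from the netizen's code.

```csharp
using System;

public static class FlagLookup
{
    // maxValue is the largest value that can occur in a or b.
    // Assumption: all values lie in the range 0..maxValue.
    public static bool[] MarkPresent(int[] a, int[] b, int maxValue)
    {
        // One flag per possible value; index 0 is included, hence maxValue + 1.
        bool[] seen = new bool[maxValue + 1];

        // Pass 1: use each value of b as an index and flag it.
        foreach (int value in b) seen[value] = true;

        // Pass 2: a value of a exists in b exactly when its flag is set.
        bool[] c = new bool[a.Length];
        for (int i = 0; i < a.Length; i++) c[i] = seen[a[i]];
        return c;
    }
}
```

With the numbers from the analogy, FlagLookup.MarkPresent(new[] { 2, 3 }, new[] { 5, 2, 4, 1 }, 5) yields { true, false }: 2 is in B and 3 is not.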
The idea is very simple to understand. The netizen also grasped it quickly and reported his result a little later.
The time dropped to about 10 milliseconds. He then expanded the data to 100,000 elements, and it was still very fast.
4. Postscript and C# implementation
After his problem was solved we chatted again the next day. He thanked me and said the method was very fast. He said that over the past year he had asked many people, including quite a few computer professionals, but nothing had worked well...
He said he had even asked someone with a Microsoft certification, who told him his computer was not good enough and he should replace it... That is a bit ridiculous. How much memory can arrays of a few tens of thousands of elements use? These are all simple comparisons; does that need a powerful CPU?
Later I also explained to him that other people may not have fully understood his problem. They were fixated on efficiency and speed, had many things to weigh, and so the advice they gave was not necessarily suitable. For small problems like this, sacrificing a little space (and it is not even much) is fine: memory is cheap, and machines now routinely have 2 GB or 4 GB. Trading space for time is well worth it. The space I mean here is that the Boolean array is allocated up front at a length that covers every possible value, because I do not know how his real data is produced. Of course, if the maximum value can be computed, that is best. A rough look at the time complexity: two loops solve the problem, one pass over B and one pass over A. As for the sorting and binary search I mentioned at first, that was just my initial thought, without deeper consideration; I also figured his data could be kept sorted as it is generated, which would save time. Anything beats calling IndexOf for every element; with that, it would be strange if it were not slow.
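To put rough numbers on the three approaches, the small helper below counts elementary steps for n elements in A and m elements in B. These are back-of-the-envelope, worst-case estimates rather than measurements, the binary-search figure ignores the one-off cost of sorting B, and the class and method names are hypothetical.

```csharp
using System;

public static class CostEstimate
{
    // Linear scan (IndexOf-style): up to m comparisons for each of n lookups.
    public static long LinearScan(long n, long m)
    {
        return n * m;
    }

    // Binary search on sorted B: about ceil(log2 m) comparisons per lookup.
    public static long BinarySearchLookups(long n, long m)
    {
        return n * (long)Math.Ceiling(Math.Log(m, 2));
    }

    // Flag array: one pass over B plus one pass over A.
    public static long FlagArray(long n, long m)
    {
        return n + m;
    }
}
```

For n = m = 70,000 this gives 4,900,000,000 for the linear scan, 1,190,000 for the binary-search lookups, and 140,000 for the flag array, which is in line with the drop from seconds to milliseconds described in this post.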
4.1 C# implementation of the original method
With nothing else to do, I implemented the netizen's original method in C#. The code is as follows:
    // Requires: using System; using System.Diagnostics; using System.Linq;
    static void ValidateArrayElement2()
    {
        Stopwatch sp = new Stopwatch();
        sp.Start(); // start timing
        Random rand = new Random();
        Int32 maxValue = 120000; // maximum element value (a hypothetical value)
        Int32 length = 70000;    // length of A and B
        Int32[] A = new Int32[length];
        Int32[] B = new Int32[length];
        Boolean[] C = new Boolean[length];
        // randomly initialize arrays A and B
        for (int i = 0; i < length; i++)
        {
            A[i] = rand.Next(maxValue);
            B[i] = rand.Next(maxValue);
        }
        // loop over A, check presence in B (a linear scan per element),
        // and mark the corresponding position of C as true
        for (int i = 0; i < A.Length; i++) if (B.Contains(A[i])) C[i] = true;
        sp.Stop(); // stop timing
        Console.WriteLine(sp.ElapsedMilliseconds);
    }
I tested this on my machine, an X200 with a T9400 and 3 GB of memory. The total time including data initialization is 4.3 seconds, so the actual search time is about 4 seconds, which matches the netizen's result. Now look at my method:
4.2 C# implementation of the algorithm above
Using the method presented in section 3, I measured the time:
    // Requires: using System; using System.Diagnostics;
    static void ValidateArrayElement()
    {
        Stopwatch sp = new Stopwatch();
        sp.Start(); // start timing
        Random rand = new Random();
        Int32 maxValue = 120000; // maximum element value (a hypothetical value)
        Int32 length = 70000;    // length of A and B
        Int32[] A = new Int32[length];
        Int32[] B = new Int32[length];
        Boolean[] C = new Boolean[length];
        Boolean[] aTemp = new Boolean[maxValue]; // temporary auxiliary array, one flag per possible value
        // randomly initialize arrays A and B
        for (int i = 0; i < length; i++)
        {
            A[i] = rand.Next(maxValue);
            B[i] = rand.Next(maxValue);
        }
        // loop over B, flagging every value that exists
        foreach (var item in B) aTemp[item] = true;
        // loop over A, check the flag, and mark the corresponding position of C as true
        for (int i = 0; i < A.Length; i++) if (aTemp[A[i]]) C[i] = true;
        sp.Stop(); // stop timing
        Console.WriteLine(sp.ElapsedMilliseconds);
    }
The actual time is only about 5 ms. Excluding data initialization, it is basically 1 ms. That differs a little from the netizen's 10 ms, probably because of the machine. Overall, the speed really has improved a great deal.
As for the so-called hash table method, I did not implement it here; this is already fast enough.
Finally, thanks to all the amateurs who, like me, love programming... We are not the regular army: we have not studied data structures, have not taken a systematic course on algorithms, and have had no professional programming training. But with some care and some thought we can still solve small-scale problems. As for the efficiency and algorithm problems of large-scale data, leave those to the experts.
The remaining time is for you, readers: is this problem simple? How would you solve it? I look forward to better answers in the comments... If you only want to flame and say the problem is too simple, please don't bother; there is no need.
4.3 HashSet test
Thanks to the netizen Passer.net, who suggested using HashSet. I knew of this class but had rarely used it. Since it was suggested, I tested it. The code is as follows:
    // Requires: using System; using System.Collections.Generic; using System.Diagnostics;
    Stopwatch sp = new Stopwatch();
    sp.Start(); // start timing
    Random rand = new Random();
    Int32 length = 70000; // length of A and B
    Int32[] A = new Int32[length];
    Int32[] B = new Int32[length];
    Boolean[] C = new Boolean[length];
    var tmp = new HashSet<int>();
    // randomly initialize arrays A and B, adding each value of B to the set
    for (int i = 0; i < length; i++)
    {
        A[i] = rand.Next();
        B[i] = rand.Next();
        if (!tmp.Contains(B[i]))
            tmp.Add(B[i]);
    }

    // loop over A, check presence in the set, and record the result in C
    for (int i = 0; i < A.Length; i++) C[i] = tmp.Contains(A[i]);
    sp.Stop(); // stop timing
    Console.WriteLine(sp.ElapsedMilliseconds);
In my test it takes about 17 ms, a little slower than the method in this article but still very fast, and in the same order of magnitude. A hash table may be more useful for more complex data of a similar kind, or for larger data volumes. But it does not matter: any method that solves the problem is fine, and there is no need to dwell on these details.
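As one illustration of the "more complex data" case, the same membership test can be written generically with HashSet<T>, so it also works for keys such as strings that cannot serve as array indexes. This is a sketch only; HashLookup and MarkPresent are hypothetical names, and it assumes the element type has sensible equality and hash code behavior.

```csharp
using System.Collections.Generic;

public static class HashLookup
{
    // For each a[i], record whether an equal element occurs anywhere in b.
    public static bool[] MarkPresent<T>(IList<T> a, IEnumerable<T> b)
    {
        // Building the set from b drops duplicates automatically.
        var set = new HashSet<T>(b);

        var c = new bool[a.Count];
        for (int i = 0; i < a.Count; i++) c[i] = set.Contains(a[i]);
        return c;
    }
}
```

For example, HashLookup.MarkPresent(new[] { "pear", "kiwi" }, new[] { "apple", "pear" }) yields { true, false }. Unlike the flag array, the memory used grows with the number of distinct elements in B rather than with the largest possible value.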