The shuffle problem: what is a good way to shuffle a deck of cards so that the result is both uniform and fast to produce? Equivalently, what is an efficient way to generate a random permutation of the lines of a file?
Chinaunix really is a place where shell masters gather; as long as you keep digging, you can usually find an answer there. r2007 posted a clever approach: use the shell's $RANDOM variable to append a random number to each line of the original file, sort on that random column, and then cut the temporary column back off. After this operation, the new file has effectively been given one random shuffle:
The code is as follows:
while read i; do echo "$i $RANDOM"; done < file | sort -k2n | cut -d " " -f1
Of course, if each line of your source file is more complex, you will have to adapt this code, but once you know the key trick, the remaining problems are not hard to solve.
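For instance, if the lines themselves contain spaces, the `cut -d " " -f1` step above mangles them. One workaround (a sketch of mine, not from the original post, assuming bash for $RANDOM) is to put the random key first and strip it with sed after sorting:

```shell
# Sample input whose lines contain spaces (made up for the demo)
printf 'Alice 555-0100\nBob 555-0101\nCarol 555-0102\n' > file

# Prepend the random key, sort on it, then strip the key column.
while read -r line; do
    echo "$RANDOM $line"
done < file | sort -k1n | sed 's/^[0-9]* //'
```

Because the key is removed with sed rather than cut, embedded spaces in the lines survive intact.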
Another article, Su Rongrong's analysis of awk code that achieves the shuffle effect by randomly reordering a file (originally posted here; see also the follow-up discussion in that thread, or, if you have no login account, a digest of the thread here), goes into more detail:
--------------------------------------------------------------------
On the shuffle problem there are in fact good shell solutions already; here are another three methods based on awk. If there are mistakes, please do not hesitate to point them out.
Method one: exhaustive search
Similar to brute force: build a hash that records which line numbers have already been printed; if a number has already occurred, it is not printed again, which prevents duplicates. The drawback is the extra system overhead.
The code is as follows:
awk -v N=`sed -n '$=' data` '
BEGIN{
    FS="\n";
    RS=""
}
{
    srand();
    while (t != N) {
        x = int(N*rand()+1);
        a[x]++;
        if (a[x] == 1)
        {
            print $x; t++
        }
    }
}' data
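Method one can be sanity-checked on a throwaway file: whatever order comes out, sorting the output must reproduce the sorted input. A minimal check (my sketch, assuming GNU awk; the five-line `data` file is invented for the demonstration):

```shell
# Build a small test file
printf 'one\ntwo\nthree\nfour\nfive\n' > data

# Run method one, capturing the shuffled result
awk -v N=`sed -n '$=' data` '
BEGIN{FS="\n"; RS=""}
{
    srand();
    while (t != N) {
        x = int(N*rand()+1);
        a[x]++;
        if (a[x] == 1) { print $x; t++ }
    }
}' data > shuffled

# The shuffle must be a permutation: same lines, possibly new order
sort data > want; sort shuffled > got
cmp want got
```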
Method two: swapping
Based on array-subscript swapping: store each line in an array, then shuffle by repeatedly exchanging the contents of two randomly chosen subscripts. Its efficiency is better than method one.
The code is as follows:
#!/usr/bin/awk -f
BEGIN{
    srand();
}
{
    b[NR] = $0;
}
END{
    c(b, NR);
    for (x in b)
    {
        print b[x];
    }
}
function c(arr, len,    i, j, t, x) {
    for (x in arr)
    {
        i = int(len*rand()) + 1;
        j = int(len*rand()) + 1;
        t = arr[i];
        arr[i] = arr[j];
        arr[j] = t;
    }
}
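Method two picks two random positions per element, which mixes well but is not quite the textbook shuffle. For comparison, a Knuth/Fisher-Yates shuffle in awk might look like this (my sketch, not from the original post):

```shell
# Sample input (made up for the demo)
printf 'one\ntwo\nthree\nfour\n' > data

# Fisher-Yates: walk i from the last line down to 2, swapping line i
# with a uniformly chosen line j in [1, i].
awk 'BEGIN{srand()}
{b[NR]=$0}
END{
    for (i = NR; i > 1; i--) {
        j = int(i*rand()) + 1
        t = b[i]; b[i] = b[j]; b[j] = t
    }
    for (i = 1; i <= NR; i++) print b[i]
}' data
```

Fisher-Yates makes every ordering equally likely (given a good rand()) and uses exactly n-1 swaps.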
Method three: hashing
The best of the three methods.
It uses the properties of awk's hashes (see section 7.x of the gawk info manual for details). As long as you construct a random hash key, then because each line's line number is unique, the mapping:
random number + line number of each line ------corresponds to------> line
is the random function we construct.
Thus:
The code is as follows:
awk 'BEGIN{srand()}{b[rand()NR]=$0}END{for (x in b) print b[x]}' data
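The one-liner can be checked the same way (a sketch; the three-line `data` file is invented for the demo). One caveat worth knowing: the key is the string rand() concatenated with NR, which is unique in practice but not strictly guaranteed to be, since two different pairs could in principle concatenate to the same string:

```shell
# Sample input
printf 'alpha\nbeta\ngamma\n' > data

# Shuffle via random hash keys, then verify no line was lost
awk 'BEGIN{srand()}{b[rand()NR]=$0}END{for (x in b) print b[x]}' data > shuffled
sort data > want; sort shuffled > got
cmp want got
```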
In fact, there is no need to worry too much about memory usage here; you can run a test:
Test environment:
Pentium M 1.4GHz CPU, 40GB hard disk, 256MB RAM laptop
SUSE 9.3, GNU bash 3.00.16, GNU awk 3.1.4
Generate a random file of just over 500,000 lines, about 38MB:
The code is as follows:
od /dev/urandom | dd count=75000 > data
First, the less efficient approach, method one. The time taken to shuffle:
The code is as follows:
time awk -v N=`sed -n '$=' data` '
BEGIN{
    FS="\n";
    RS=""
}
{
    srand();
    while (t != N) {
        x = int(N*rand()+1);
        a[x]++;
        if (a[x] == 1)
        {
            print $x; t++
        }
    }
}' data
Results (file contents omitted):
real    3m41.864s
user    0m34.224s
sys     0m2.102s
So efficiency is still barely acceptable.
Testing method two:
The code is as follows:
time awk -f awkfile datafile
Results (file contents omitted):
real    2m26.487s
user    0m7.044s
sys     0m1.371s
The efficiency is clearly better than method one.
Then examine the efficiency of method three:
The code is as follows:
time awk 'BEGIN{srand()}{b[rand()NR]=$0}END{for (x in b) print b[x]}' data
Results (file contents omitted):
real    0m49.195s
user    0m5.318s
sys     0m1.301s
It's pretty good for a 38M file.
--------------------------------------------------------------------
Finally, a Python version of the code, contributed by FlyFly:
The code is as follows:
#coding: gb2312
import sys
import random

def usage():
    print "usage: program srcfilename dstfilename"

filename = ""
try:
    filename = sys.argv[1]
except:
    usage()
    raise
# open the phonebook file
f = open(filename, 'r')
phonebook = f.readlines()
print phonebook
f.close()
# write to file randomly
try:
    filename = sys.argv[2]
except:
    usage()
    raise
f = open(filename, 'w')
random.shuffle(phonebook)
f.writelines(phonebook)
f.close()
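A closing footnote from today's perspective (this is an addition, not part of the original article): on systems with GNU coreutils, shuf(1) solves the whole problem directly, assuming it is installed:

```shell
# Sample input (made up for the demo)
printf 'one\ntwo\nthree\n' > data

# shuf emits a random permutation of its input lines
shuf data
```

GNU `sort -R` looks similar but sorts by a hash of each line, so duplicate lines end up adjacent rather than independently placed.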