Several ways to shuffle the lines of a file with shell scripts (the shuffle problem)

Source: Internet
Author: User
Tags: rand, shuffle

The shuffle problem: what is a good way to shuffle a deck of cards so that the result is both well mixed and fast to produce? Applied to a file, what is an efficient way to put its lines into random order?

ChinaUnix really is where the shell experts gather: whatever you are trying to do, you can usually find an answer there. r2007 posted a neat trick: use the shell's $RANDOM variable to attach a random number to every line of the original file, sort on that random number, and then strip the temporary number off again. The resulting file has effectively been randomly "shuffled" once:

while read -r i; do echo "$i $RANDOM"; done < file | sort -k2n | cut -d " " -f1

Of course, if the lines of your source file are more complex (for example, if they contain spaces, which breaks the sort -k2n and cut -f1 steps), you will have to rewrite this code. Once you know the key trick, though, the remaining problems are not hard to solve.
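As a rough sketch of such a rewrite (not r2007's original code), the random number can be placed in front of the line, separated by a tab, so that the line itself may contain spaces or tabs. The function name shuffle_lines is made up for illustration, and the sketch assumes bash (or another shell that provides $RANDOM):

```shell
#!/usr/bin/env bash
# shuffle_lines: hypothetical helper name, not from the original post.
# Reads lines on stdin and writes them to stdout in random order.
shuffle_lines() {
    # IFS= and -r keep each line exactly as it is; the "|| [ -n ... ]" part
    # also picks up a last line that has no trailing newline.
    while IFS= read -r line || [ -n "$line" ]; do
        printf '%s\t%s\n' "$RANDOM" "$line"   # random key, tab, original line
    done |
        sort -n -k1,1 |   # sort on the random key
        cut -f2-          # drop the key, keep everything after the first tab
}
```

It could be used as shuffle_lines < file > shuffled. Because $RANDOM only ranges from 0 to 32767, large files will contain many equal keys; sort then falls back to comparing the whole line, so the result is slightly less random than a true shuffle.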

Another article, by Su Rong Rong, analyzes awk code that shuffles a file by randomly reordering its lines (it was originally posted here, together with a follow-up discussion; if you do not have a login account, you can read the highlights of the thread here). It goes into more detail:
--------------------------------------------------------------------
There is in fact already a good shell solution to the shuffle problem. Here are three more methods based on awk; if you find any mistakes, please do not hesitate to point them out.

Method one: exhaustive search

This is similar to brute force: a hash records how many times each line number has been drawn, and a line is printed only the first time its number comes up, which prevents duplicates. The drawback is the extra overhead, because more and more draws are wasted as fewer unprinted lines remain.

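awk -v N="$(sed -n '$=' data)" '
BEGIN {
    FS = "\n"    # each line of the file becomes one field
    RS = ""      # paragraph mode: a file without blank lines is one record
    srand()      # seed the random number generator once
}
{
    while (t != N) {
        x = int(N * rand() + 1)    # random line number in 1..N
        a[x]++
        if (a[x] == 1) {           # print a line only the first time it is drawn
            print $x; t++
        }
    }
}
' data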

Method two: index swapping

This method is based on swapping array subscripts: every line is stored in an array, and the array's contents are then exchanged at randomly chosen subscripts. It is more efficient than method one.

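#!/usr/bin/awk -f

BEGIN {
    srand();
}

{
    b[NR] = $0;    # store every line under its line number
}

END {
    c(b, NR);
    for (x in b) {
        print b[x];
    }
}

# Swap two randomly chosen elements, once per element of arr.
# i, j, t and x are local variables.
function c(arr, len, i, j, t, x) {
    for (x in arr) {
        i = int(len * rand()) + 1;
        j = int(len * rand()) + 1;
        t = arr[i];
        arr[i] = arr[j];
        arr[j] = t;
    }
}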
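The swap routine above picks both positions at random. A common alternative is the Fisher-Yates shuffle, which walks the array once from the end and swaps each element with a randomly chosen element at or before it. The following is only a sketch of that technique, not the original author's code; the function name fisher_yates_shuffle is made up for illustration:

```shell
# fisher_yates_shuffle: hypothetical helper name, not from the original post.
# Shuffles the lines of the given files (or of stdin) with awk.
fisher_yates_shuffle() {
    awk '
    BEGIN { srand() }                  # seed from the time of day
    { line[NR] = $0 }                  # keep every line in memory
    END {
        for (i = NR; i > 1; i--) {
            j = int(i * rand()) + 1    # random index in 1..i
            t = line[i]; line[i] = line[j]; line[j] = t
        }
        for (i = 1; i <= NR; i++)      # print by index, not with "for (x in line)"
            print line[i]
    }' "$@"
}
```

It could be called as fisher_yates_shuffle data > shuffled. Since srand() without an argument seeds from the time of day, two runs started within the same second may produce the same order in some awk implementations.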


Method three: hashing

The best of the three methods.
It exploits the way awk arrays are implemented as hashes (for details, see section 7.x of info gawk). All you need is a random hash key; because every line's line number is unique, the key can be built as:

random number + line number  ------ corresponds to ------>  line content

This is the random key that gets constructed.
This gives:

awk 'BEGIN{srand()}{b[rand()NR]=$0}END{for(x in b)print b[x]}' data
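One caveat about the one-liner above: the order in which for (x in b) visits the keys is left to the awk implementation, so the output order depends on the internal hash layout rather than directly on the random numbers. As a hedged alternative sketch (not the original author's method), the random key can be made to decide the order explicitly by sorting on it. The function name keyed_shuffle is made up for illustration:

```shell
# keyed_shuffle: hypothetical helper name, not from the original post.
# Shuffles the lines of the given files (or of stdin).
keyed_shuffle() {
    # LC_ALL=C keeps the decimal point a "." for both awk and sort.
    LC_ALL=C awk 'BEGIN { srand() }
        { printf "%.17f\t%s\n", rand(), $0 }' "$@" |   # random key, tab, line
        LC_ALL=C sort -n -k1,1 |                       # order by the random key
        cut -f2-                                       # remove the key again
}
```

It could be called as keyed_shuffle data > shuffled. Unlike method three, this does not hold the whole file in awk's memory; the sorting work is handed to sort instead.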

In practice, there is no need to worry much about memory usage being too high. You can run a test to see:

Test environment:

Laptop: PM 1.4 GHz CPU, 40 GB hard disk, 256 MB memory
SUSE 9.3, GNU bash version 3.00.16, GNU Awk 3.1.4

Generate a random file of more than 500,000 lines, about 38 MB:

od /dev/urandom | dd count=75000 > data

Start with the less efficient method one.

Time taken by the shuffle:

time awk -v N="$(sed -n '$=' data)" '
BEGIN {
    FS = "\n"
    RS = ""
    srand()
}
{
    while (t != N) {
        x = int(N * rand() + 1)
        a[x]++
        if (a[x] == 1) {
            print $x; t++
        }
    }
}
' data

Results (file contents omitted):

real 3m41.864s
user 0m34.224s
sys 0m2.102s

So the efficiency is only barely acceptable.

Test of method two:

time awk -f awkfile datafile

Results (file contents omitted):

real 2m26.487s
user 0m7.044s
sys 0m1.371s

The efficiency is clearly better than that of method one.

Next, look at the efficiency of method three:

time awk 'BEGIN{srand()}{b[rand()NR]=$0}END{for(x in b)print b[x]}' data

Results (file contents omitted):

real 0m49.195s
user 0m5.318s
sys 0m1.301s

That is quite good for a 38 MB file.
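The timings above say nothing about whether the output is still complete. A small sanity check, sketched here under the assumption that mktemp is available, is to confirm that the shuffled file contains exactly the same lines as the input. The function name same_lines is made up for illustration:

```shell
# same_lines: hypothetical helper name, not from the original post.
# Succeeds (exit status 0) when both files contain the same lines,
# ignoring their order; fails with status 1 otherwise.
same_lines() {
    sl_a=$(mktemp) || return 2
    sl_b=$(mktemp) || { rm -f "$sl_a"; return 2; }
    LC_ALL=C sort "$1" > "$sl_a"       # sorting removes the effect of the shuffle
    LC_ALL=C sort "$2" > "$sl_b"
    if cmp -s "$sl_a" "$sl_b"; then sl_status=0; else sl_status=1; fi
    rm -f "$sl_a" "$sl_b"
    return "$sl_status"
}
```

For example, after redirecting the output of one of the methods to a file named shuffled (a name chosen here only for illustration), same_lines data shuffled && echo OK would report whether any lines were lost or duplicated.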
--------------------------------------------------------------------

Attached is a Python version of the code, written by FlyFly:

#coding: gb2312
import sys
import random

def usage():
    print("usage: program srcfilename dstfilename")

# read the phonebook (source) file
try:
    filename = sys.argv[1]
except IndexError:
    usage()
    sys.exit(1)

f = open(filename, 'r')
phonebook = f.readlines()
print(phonebook)
f.close()

# write the lines to the destination file in random order
try:
    filename = sys.argv[2]
except IndexError:
    usage()
    sys.exit(1)

f = open(filename, 'w')
random.shuffle(phonebook)
f.writelines(phonebook)
f.close()
