Shell scripting: multiple ways to rearrange the contents of a file (the shuffle problem)

Last Update:2016-04-29 Source: Internet

Author: User

Tags rand shuffle


The shuffle problem: what is a good way to shuffle a deck of cards so that the result is evenly mixed and the shuffle is fast? For a file, the equivalent question is how to put its lines into random order efficiently.

ChinaUnix is a place where shell experts gather; whatever problem you run into, you can usually find an answer there. R2007 gave a clever trick that uses the shell's $RANDOM variable: append a random number to each line of the original file, sort on that number, and then strip the temporary number off again. The resulting file is the original, randomly "shuffled" once:

while read i; do echo "$i $RANDOM"; done < file | sort -k2n | cut -d " " -f1

Of course, if the lines of your source file are more complex (for example, if they contain spaces), the code has to be rewritten, but once you know the key trick the remaining problems are not hard to solve.
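One way to handle lines that contain spaces is to put the random key first and separate it from the line with a tab, so the key can be cut off without touching the line itself. What follows is a minimal sketch, assuming bash (for $RANDOM); the function name shuffle_lines is made up for illustration and does not come from the original post:

```shell
#!/usr/bin/env bash
# shuffle_lines FILE: hypothetical helper, prints the lines of FILE in random order.
shuffle_lines() {
    # Prefix every line with a random key and a tab; the line itself is kept intact.
    while IFS= read -r line || [ -n "$line" ]; do
        printf '%s\t%s\n' "$RANDOM" "$line"
    done < "$1" |
        sort -k1,1n |   # order by the numeric key in the first field
        cut -f2-        # drop the key; cut splits on tabs by default
}
```

Calling shuffle_lines file > shuffled leaves the original file untouched. Because $RANDOM only takes values from 0 to 32767, a large file will contain many equal keys, and sort breaks such ties by comparing the rest of the line, so the result is only approximately uniform.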

Su Rong Rong wrote a more detailed analysis of shuffling a file's lines with awk (originally posted here, along with a follow-up discussion of the post; if you are not logged in, you can read the highlights section here):
--------------------------------------------------------------------
Good shell solutions to the shuffle problem already exist. Here are three more methods based on awk; if there are any mistakes, please point them out.

Method One: Exhaustive search

This works like brute-force enumeration: a hash records how many times each line has been drawn, and a line is printed only the first time it is drawn. This prevents duplicates, but the drawback is the extra overhead of the repeated draws.

awk -v N=`sed -n '$=' data` 'BEGIN{FS="\n"; RS=""}{srand(); while(t!=N){x=int(N*rand()+1); a[x]++; if(a[x]==1){print $x; t++}}}' data
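The same idea can also be written with the lines stored in an awk array instead of as fields of one giant record, which keeps blank lines intact (with RS="" blank lines act as record separators). This is a sketch under that assumption; exhaustive_shuffle and the variable names are illustrative, not from the original post:

```shell
# exhaustive_shuffle FILE: hypothetical helper, draws random line numbers
# until every line has been printed exactly once.
exhaustive_shuffle() {
    awk '
        BEGIN { srand() }
        { line[NR] = $0 }                 # keep every line in memory
        END {
            printed = 0
            while (printed < NR) {
                x = int(NR * rand()) + 1  # random line number in 1..NR
                if (!(x in seen)) {       # print only on the first draw
                    seen[x] = 1
                    print line[x]
                    printed++
                }
            }
        }
    ' "$1"
}
```

Every pass through the loop may draw a line that was already printed, and those wasted draws become more frequent as fewer lines remain, which is where the extra overhead comes from.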

Method Two: Transformation

This method is based on transforming array subscripts: the content of each line is stored in an array, and the array elements are then exchanged by swapping randomly chosen subscripts. It is more efficient than Method One.

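#!/usr/bin/awk -f
BEGIN { srand(); }
{ b[NR] = $0; }              # store every line, indexed by line number
END {
  C(b, NR);                  # swap random pairs of elements in place
  for (x in b)
  {
    print b[x];
  }
}
function C(arr, len, i, j, t, x) {
  for (x in arr)             # one random swap per array element
  {
    i = int(len*rand()) + 1;
    j = int(len*rand()) + 1;
    t = arr[i];
    arr[i] = arr[j];
    arr[j] = t;
  }
}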
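Swapping two random positions N times does mix the array, but it does not make every ordering equally likely, and for (x in b) does not promise to visit subscripts in numeric order. A common alternative, swapped in here and not taken from the original post, is the Fisher-Yates shuffle, which walks the array once and prints by explicit index. A minimal sketch (fisher_yates_shuffle is a hypothetical name):

```shell
# fisher_yates_shuffle FILE: hypothetical helper, prints the lines of FILE
# in random order using one pass of swaps.
fisher_yates_shuffle() {
    awk '
        BEGIN { srand() }
        { b[NR] = $0 }                      # store every line by line number
        END {
            for (i = NR; i > 1; i--) {
                j = int(i * rand()) + 1     # random index in 1..i
                t = b[i]; b[i] = b[j]; b[j] = t
            }
            for (i = 1; i <= NR; i++)       # print in explicit index order
                print b[i]
        }
    ' "$1"
}
```

The memory use is the same as in Method Two, since the whole file is still held in one array.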

Method Three: Hash

This is the best of the three methods.
It uses awk's hash (associative array) feature (for details see 7.x in info gawk). All that is needed is a random, non-repeating hash key. Because the line number of every line in a file is unique, use:

random number + line number ------ corresponds to ------> the contents of that line

as the random key.
This gives:

awk 'BEGIN{srand()}{b[rand() NR]=$0}END{for(x in b)print b[x]}' data
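To see what the array subscripts look like, the key can be printed next to each line. This is a small sketch, assuming any POSIX awk; show_keys is a made-up name. rand() is converted to a string and NR is appended directly, so line 3 might get a key such as 0.2377883:

```shell
# show_keys FILE: hypothetical helper, prints "key -> line" for every line,
# using the same subscript expression as b[rand() NR].
show_keys() {
    awk '
        BEGIN { srand() }
        {
            key = rand() NR          # random number with the line number appended
            print key " -> " $0
        }
    ' "$1"
}
```

The shuffle itself comes from the order in which for (x in b) walks these keys. POSIX leaves that order unspecified, so how well mixed the output is depends on the awk implementation.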

In fact, there is no need to worry much about memory use. You can run a test:

Test environment:

Laptop with a PM 1.4GHz CPU, 40G HDD, 256M memory
SUSE 9.3, GNU bash version 3.00.16, GNU Awk 3.1.4

Generate a random file of more than 500,000 lines, about 38M:

od /dev/urandom | dd count=75000 > data
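For a quicker trial run, the same kind of data can be produced at a smaller, exact size by cutting the od output off after a fixed number of lines. This is only a sketch, assuming /dev/urandom is available; make_test_data is a hypothetical helper:

```shell
# make_test_data LINES FILE: hypothetical helper, writes LINES lines of
# od-formatted random data to FILE.
make_test_data() {
    od /dev/urandom | head -n "$1" > "$2"
}
```

For example, make_test_data 1000 small_data writes a 1000-line file. With dd's default block size of 512 bytes, count=75000 in the command above allows at most 75000 × 512 = 38,400,000 bytes, which matches the 38M figure.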

Start with the least efficient approach, Method One.

Time taken to shuffle:

time awk -v N=`sed -n '$=' data` 'BEGIN{FS="\n"; RS=""}{srand(); while(t!=N){x=int(N*rand()+1); a[x]++; if(a[x]==1){print $x; t++}}}' data

Results (file contents omitted):

real    3m41.864s
user    0m34.224s
sys     0m2.102s

So the efficiency is barely acceptable.

Test of Method Two:

time awk -f awkfile datafile

Results (file contents omitted):

real    2m26.487s
user    0m7.044s
sys     0m1.371s

Efficiency is significantly better than the first one.

Then examine the efficiency of Method Three:

time awk 'BEGIN{srand()}{b[rand() NR]=$0}END{for(x in b)print b[x]}' data

Results (file contents omitted):

real    0m49.195s
user    0m5.318s
sys     0m1.301s

That is pretty good for a 38M file.
--------------------------------------------------------------------

There is also a Python version from flyfly that writes the lines out in random order:

#coding: gb2312
import sys
import random

def usage():
    print("usage: program srcfilename dstfilename")

# the source file name is the first argument
try:
    filename = sys.argv[1]
except IndexError:
    usage()
    sys.exit(1)

#open the phonebook file
f = open(filename, 'r')
phonebook = f.readlines()
print(phonebook)
f.close()

#write to file randomly
try:
    filename = sys.argv[2]
except IndexError:
    usage()
    sys.exit(1)

f = open(filename, 'w')
random.shuffle(phonebook)
f.writelines(phonebook)
f.close()


This article is an English version of an article originally written in Chinese on aliyun.com and is provided for information purposes only.
