I want to shuffle a large file with millions of lines of strings in Linux. I tried 'sort -R', but it is very slow (it takes about 50 minutes for a 16M file). Is there a faster utility that I can use in its place?
linux bash unix
asked Feb 6 '13 at 10:48
Shuf? en.wikipedia.org/wiki/Shuf – Anders Lindahl Feb 6 '13 at 10:51
millions of lines for a 16MB file: you have very short lines? BTW: 16 MB is not big. It will fit in core, and sorting will take less than a second, I guess. – wildplasser Feb 6 '13 at 10:56
@AndersLindahl : How much entropy does shuf introduce? Is it as random as 'sort -R'? – alpha_cod Feb 6 '13 at 11:05
@wildplasser : Oh... it's a 16-million-line file, not 16 MB. Plain sorting is quite fast on this file, but 'sort -R' is very slow. – alpha_cod Feb 6 '13 at 11:05
@alpha_cod: I would guess it's /dev/random. You can control the entropy source with --random-source. – Anders Lindahl Feb 6 '13 at 11:33
This is a similar thread stackoverflow.com/questions/2153882/… – Ifthikhan Feb 6 '13 at 12:10
@AndersLindahl How about suggesting that as an answer? – that other guy Feb 6 '13 at 20:02
Use shuf instead of sort -R (man page).
The slowness of sort -R is probably due to it hashing every line and then sorting by those hashes. shuf just generates a random permutation of the lines, so it doesn't have that problem.
(This was suggested in a comment but for some reason not written as an answer by anyone)
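For illustration, here is a minimal sketch of the swap, assuming GNU coreutils is installed. The file names input.txt and shuffled.txt are hypothetical, and seq only generates sample data so that the example is self-contained:

```shell
# Generate a small sample file (hypothetical name; stands in for the real data).
seq 1 1000 > input.txt

# Shuffle the lines and write the result to a new file.
# Equivalent to: shuf input.txt > shuffled.txt
shuf -o shuffled.txt input.txt

# Optionally pick the entropy source explicitly, as mentioned in the comments.
shuf --random-source=/dev/urandom -o shuffled2.txt input.txt
```

Both output files should contain exactly the same lines as input.txt, only in a different order. Timing on a 16-million-line file will depend on the machine, so it is worth measuring with time before relying on it.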