I heard someone mention that for a text-classification task on a 100 GB text file, with no big-data infrastructure at all, the processing approach was simply to use the shell's split command to break the file into a number of smaller files.
The split command
NAME
       split - split a file into pieces

SYNOPSIS
       split [OPTION] [INPUT [PREFIX]]

DESCRIPTION
       Output fixed-size pieces of INPUT to PREFIXaa, PREFIXab, ...;
       default size is 1000 lines, and default PREFIX is 'x'.  With no
       INPUT, or when INPUT is -, read standard input.

       Mandatory arguments to long options are mandatory for short
       options too.

       -a, --suffix-length=N    use suffixes of length N (default 2)
       -b, --bytes=SIZE         put SIZE bytes per output file
       -C, --line-bytes=SIZE    put at most SIZE bytes of lines per
                                output file
       -d, --numeric-suffixes   use numeric suffixes instead of
                                alphabetic
       -l, --lines=NUMBER       put NUMBER lines per output file
       --verbose                print a diagnostic to standard error
                                just before each output file is opened
       --help                   display this help and exit
       --version                output version information and exit

       SIZE may have a multiplier suffix: b for 512, k for 1K, m for
       1 Meg.
-l splits the file by number of lines
-b splits the file by the given size; SIZE accepts the b, k, and m suffixes
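For line-oriented text processing, -l is usually the safer choice: -b cuts the file at an exact byte offset and can leave a line broken across two chunks, while -l always keeps lines whole. A minimal sketch (the sample file and prefix names here are made up for illustration, not from the original post):

```shell
# Create a 10,000-line sample file, then split it into
# 2,500-line chunks named part_aa, part_ab, part_ac, part_ad.
seq 10000 > sample.txt
split -l 2500 sample.txt part_

# Every chunk holds complete lines, so line-oriented tools
# (grep, awk, a classifier script) can process them independently.
wc -l part_*
```

Because each chunk is a well-formed text file, the chunks can be fed to separate processes in parallel and the results merged afterwards.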
Example:

split -b 256m result_guid_active_train_all small

ll -lh
-rw-rw-r-- 1 256M Jun 20:29 smallaa
-rw-rw-r-- 1 256M Jun 20:29 smallab
-rw-rw-r-- 1 256M Jun 20:29 smallac
-rw-rw-r-- 1 256M Jun 20:29 smallad
-rw-rw-r-- 1 256M Jun 20:29 smallae
-rw-rw-r-- 1 256M Jun 20:29 smallaf
-rw-rw-r-- 1 256M Jun 20:29 smallag
-rw-rw-r-- 1 256M Jun 20:29 smallah
-rw-rw-r-- 1 256M Jun 20:29 smallai
-rw-rw-r-- 1 256M Jun 20:29 smallaj
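A useful sanity check after splitting by bytes is that concatenating the pieces in name order reproduces the original file exactly. A small self-contained demonstration (the file names are illustrative, not the ones from the listing above):

```shell
# Make 3 MiB of test data, split it into 1 MiB pieces,
# then verify the pieces reassemble byte-for-byte.
head -c 3145728 /dev/urandom > big.bin
split -b 1m big.bin piece_        # produces piece_aa, piece_ab, piece_ac
cat piece_* > rebuilt.bin
cmp big.bin rebuilt.bin && echo "identical"
```

This works because split emits the pieces with alphabetically increasing suffixes, so shell glob expansion (piece_*) already lists them in the right order for cat.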