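I heard that someone doing text classification had to process a 100 GB text file and, surprisingly, did it without any big-data tooling: they simply used the shell's split command to break the file into a number of smaller files.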
The split command
NAME
    split - split a file into pieces

SYNOPSIS
    split [OPTION] [INPUT [PREFIX]]

DESCRIPTION
    Output fixed-size pieces of INPUT to PREFIXaa, PREFIXab, ...; default size is 1000 lines, and default PREFIX is 'x'. With no INPUT, or when INPUT is -, read standard input.

    Mandatory arguments to long options are mandatory for short options too.

    -a, --suffix-length=N
        use suffixes of length N (default 2)
    -b, --bytes=SIZE
        put SIZE bytes per output file
    -C, --line-bytes=SIZE
        put at most SIZE bytes of lines per output file
    -d, --numeric-suffixes
        use numeric suffixes instead of alphabetic
    -l, --lines=NUMBER
        put NUMBER lines per output file
    --verbose
        print a diagnostic to standard error just before each output file is opened
    --help
        display this help and exit
    --version
        output version information and exit

    SIZE may have a multiplier suffix: b for 512, k for 1K, m for 1 Meg.
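-l splits the file by number of lines
-b splits the file by a given size; the size accepts the suffixes b (512 bytes), k (1K), and m (1 Meg)
Example: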
split -b 256m result_guid_active_train_all small
ll -lh
-rw-rw-r-- 1 256M Jun 17 20:29 smallaa
-rw-rw-r-- 1 256M Jun 17 20:29 smallab
-rw-rw-r-- 1 256M Jun 17 20:29 smallac
-rw-rw-r-- 1 256M Jun 17 20:29 smallad
-rw-rw-r-- 1 256M Jun 17 20:29 smallae
-rw-rw-r-- 1 256M Jun 17 20:29 smallaf
-rw-rw-r-- 1 256M Jun 17 20:29 smallag
-rw-rw-r-- 1 256M Jun 17 20:29 smallah
-rw-rw-r-- 1 256M Jun 17 20:29 smallai
-rw-rw-r-- 1 256M Jun 17 20:29 smallaj
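
The example above splits by size with -b, which can cut a line in half at a piece boundary. For line-oriented text it may be safer to split by line count with -l. The following is a minimal sketch, assuming GNU split (for the -d option listed in the man page above); the names sample.txt, part_, and rebuilt.txt are made up for illustration and the 2500-line file is only a stand-in for real data.

```shell
# Work in a scratch directory so no real data is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Build a hypothetical 2500-line sample file (stand-in for the real input).
seq 1 2500 > sample.txt

# Split into pieces of at most 1000 lines each, using 2-digit numeric
# suffixes (-d -a 2), so the pieces are named part_00, part_01, ...
split -l 1000 -d -a 2 sample.txt part_

# Show how many lines ended up in each piece.
wc -l part_*

# Concatenating the pieces in suffix order should reproduce the original.
cat part_* > rebuilt.txt
cmp sample.txt rebuilt.txt && echo "pieces reassemble to the original"
```

Because no line is ever cut in the middle, each piece can be processed on its own, and the cat/cmp check at the end is a cheap way to confirm that nothing was lost in the split.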