Linux File Sorting and FASTA File Operations
File Sorting

seq: generates a sequence of numbers. Run man seq to see the available options. Here we use seq to generate the input files for the examples that follow.
# Generate the numbers from 1 to 10 with a step size of 1
$ seq 1 10
1
2
3
4
5
6
7
8
9
10
# Generate the numbers from 1 to 10 with a step size of 1, separated by spaces
$ seq -s ' ' 1 10
1 2 3 4 5 6 7 8 9 10
# Generate the numbers from 1 to 10 with a step size of 2
# With three arguments, the middle one is the step size and the last one is always the maximum value
$ seq -s ' ' 1 2 10
1 3 5 7 9
$ cat <(seq 0 3 17) <(seq 3 6 18) >test
$ cat test
0
3
6
9
12
15
3
9
15
sort: sorts lines by character encoding order by default. To sort numerically, add the -n option.
# Lines are compared character by character: those starting with 0 come first, then 1, 3, 6, 9
$ sort test
0
12
15
15
3
3
6
9
9
# Sort by numeric value
$ sort -n test
0
3
3
6
9
9
12
15
15
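The difference between the two orderings is easy to reproduce on a small input. The following is a minimal sketch; the five sample numbers are made up for illustration, and LC_ALL=C is an added assumption that pins the text comparison to byte order so the result does not depend on the system locale.

```shell
# Hypothetical sample data: one number per line.
nums=$(printf '%s\n' 10 9 2 33 1)

# Default sort compares lines as text, character by character.
# LC_ALL=C (an assumption of this sketch) forces plain byte order.
by_text=$(printf '%s\n' "$nums" | LC_ALL=C sort)

# -n compares the lines as numbers instead.
by_number=$(printf '%s\n' "$nums" | sort -n)

# Unquoted expansion joins the lines with spaces for compact display.
echo "text order:    $(echo $by_text)"
echo "numeric order: $(echo $by_number)"
```

In text order, 10 sorts before 2 because the character 1 precedes 2; with -n the whole value is compared.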
sort -u: removes duplicate lines; equivalent to sort | uniq.
$ sort -nu test
0
3
6
9
12
15
sort file | uniq -d: prints only the lines that occur more than once (d = duplicate).
$ sort -n test | uniq -d
3
9
15
sort file | uniq -c: counts how many times each line occurs.
# The first column is the number of times each line appears; the second column is the original line
$ sort -n test | uniq -c
      1 0
      2 3
      1 6
      2 9
      1 12
      2 15
# Use a different file to see this more clearly
$ cat <<END >test2
a
b
c
b
a
e
d
a
END
# The first column is the number of times each line appears; the second column is the original line
$ sort test2 | uniq -c
      3 a
      2 b
      1 c
      1 d
      1 e
# The file must be sorted before running uniq, because uniq only compares adjacent lines; otherwise the result is misleading
$ cat test2 | uniq -c
      1 a
      1 b
      1 c
      1 b
      1 a
      1 e
      1 d
      1 a
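Because uniq only compares neighboring lines, counting without sorting first needs a different tool. The following is a sketch of one alternative (not the method used in this article): an awk array keyed by the whole line. The helper name count_lines is hypothetical.

```shell
# count_lines (hypothetical helper): count occurrences of each distinct line on stdin.
# The awk array c is keyed by the whole line ($0), so equal lines do not need to be adjacent.
# The trailing sort only makes the output order stable, because awk's "for (k in c)"
# visits keys in an unspecified order.
count_lines() {
    awk '{ c[$0]++ } END { for (k in c) print c[k], k }' | LC_ALL=C sort -k2
}

# Same unsorted letters as test2 above.
printf '%s\n' a b c b a e d a | count_lines
```

The counts match sort test2 | uniq -c, although the input was never sorted before counting.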
Next, rearrange the output of uniq -c so that the original line comes first and its count comes second. awk is a powerful text-processing tool that works line by line: it reads one line at a time and applies the given operations to it. OFS is the output field (column) separator; FS is the input field separator (whitespace by default). The fields of a line, from the 1st to the nth, are referred to as $1, $2, ..., $n. A BEGIN block sets basic parameters before any input is read. An END block is its counterpart and runs only after all input has been read. A {} block that is not preceded by BEGIN or END is the part that processes each line as it is read.
# awk operates on the output of the previous step: it drops the leading blanks and swaps the two columns
$ sort test2 | uniq -c | awk 'BEGIN{OFS="\t";}{print $2, $1}'
a	3
b	2
c	1
d	1
e	1
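To see when each of the three kinds of awk block runs, here is a small sketch under stated assumptions: the helper name swap_and_total and the three input lines are made up for illustration. BEGIN prints a header, the main block swaps the columns and accumulates a sum, and END prints the sum.

```shell
# swap_and_total (hypothetical helper): reads "count name" pairs on stdin.
swap_and_total() {
    awk 'BEGIN { OFS = "\t"; print "name", "count" }  # runs once, before any input is read
         { total += $1; print $2, $1 }                 # runs for every input line
         END { print "total", total }                  # runs once, after the last line
    '
}

printf '%s\n' '3 a' '2 b' '1 c' | swap_and_total
```

Setting OFS in BEGIN matters: it has to be in place before the first print, and the comma in print $2, $1 is what inserts it between the fields.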
To sort a two-column file by its second column, use sort -k2,2n.
# Sort numerically by the second column
$ sort test2 | uniq -c | awk 'BEGIN{OFS="\t";}{print $2, $1}' | sort -k2,2n
c	1
d	1
e	1
b	2
a	3
# Sort numerically by the second column;
# where the second column is equal, sort by the first column in reverse alphabetical order (-r)
# Note how the order of the first three lines differs from the previous result
$ sort test2 | uniq -c | awk 'BEGIN{OFS="\t";}{print $2, $1}' | sort -k2,2n -k1,1r
e	1
d	1
c	1
b	2
a	3
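The key syntax can be read as -kSTART,END followed by per-key flags. The sketch below repeats the two-key sort on an inline table and additionally passes -t with a literal tab, which is an assumption added here so that a field containing spaces would not be split; the variable names are hypothetical.

```shell
# Hypothetical two-column table: name, then count, separated by a tab.
table=$(printf '%s\t%s\n' c 1 a 3 e 1 b 2 d 1)

# A literal tab for -t, produced with printf so it works in any POSIX shell.
tab=$(printf '\t')

# -k2,2n : the key starts and ends at field 2, compared numerically
# -k1,1r : ties are broken by field 1 only, in reverse order
sorted=$(printf '%s\n' "$table" | LC_ALL=C sort -t "$tab" -k2,2n -k1,1r)
printf '%s\n' "$sorted"
```

Writing -k2 instead of -k2,2 would make the key run from field 2 to the end of the line, which is rarely what is wanted when a second key follows.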
FASTA sequence extraction

First create a FASTA file in which every sequence occupies a single line, then extract the sequence of a specific gene from it. The simplest way is the grep command, whose main purpose is to match strings in a file and act on the matches. Combined with regular expressions it becomes very powerful. There are many flavors of regular expressions, and almost every language has its own rules.
# Generate a single-line-sequence FASTA file
$ cat <<END >test.fasta
>SOX2
ACGAGGGACGCATCGGACGACTGCAGGACTGTC
>POU5F1
ACGAGGGACGCATCGGACGACTGCAGGACTGTC
>NANOG
CGGAAGGTAGTCGTCAGTGCAGCGAGTCCGT
END
$ cat test.fasta
>SOX2
ACGAGGGACGCATCGGACGACTGCAGGACTGTC
>POU5F1
ACGAGGGACGCATCGGACGACTGCAGGACTGTC
>NANOG
CGGAAGGTAGTCGTCAGTGCAGCGAGTCCGT
# grep matches the line containing SOX2
# -A 1 also prints the line after each matched line (A: after)
$ grep -A 1 'SOX2' test.fasta
>SOX2
ACGAGGGACGCATCGGACGACTGCAGGACTGTC
# First check whether the current line is a name line. The test $0~/>/ matches a > anywhere in the line,
# which is sufficient here because sequence lines never contain one. If it matches, strip the > to get the name.
# sub performs a substitution: sub(pattern to replace, replacement, string to operate on)
# If the line does not contain >, it is a sequence line and is stored.
# seq[name] works like a dictionary: name is the key and the sequence is the value, so the sequence can later be retrieved by name.
$ awk 'BEGIN{OFS=FS="\t"}{if($0~/>/) {name=$0; sub(">", "", name);} else seq[name]=$0;}END{print ">SOX2"; print seq["SOX2"]}' test.fasta
>SOX2
ACGAGGGACGCATCGGACGACTGCAGGACTGTC
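One caveat with grep 'SOX2' is that it matches any line containing that text, so a gene named SOX21 would be returned as well. A possible refinement, sketched below on a made-up two-record file (gene names and sequences here are hypothetical), is to match the whole header line with -x.

```shell
# Hypothetical single-line FASTA file with two similar names.
fasta=$(mktemp)
cat <<'END' >"$fasta"
>SOX2
ACGT
>SOX21
TTTT
END

# A plain substring match also picks up the SOX21 record.
loose=$(grep -A 1 'SOX2' "$fasta")

# -x only accepts lines that match the pattern in full,
# so only the >SOX2 header (plus the line after it) is selected.
exact=$(grep -x -A 1 '>SOX2' "$fasta")

printf '%s\n' "$exact"
rm -f "$fasta"
```

This assumes the header consists of the name only; headers that carry a description after the name would need a pattern such as '^>SOX2 ' instead of -x.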
Extracting a sequence from a multi-line FASTA file takes a little more work. One way is to convert the file into single-line sequences and then use the methods above. sed and tr are the most commonly used character replacement tools.
$ cat <<END >test.fasta
>SOX2
...
ACGAGGGACGCATCGGACGACTGCAGGAC
>POU5F1
CGGAAGGTAGTCGTCAGTGCAGCGAGTCCGT
CGGAAGGTAGTCGTCAGTGCAGCGAGTCC
>NANOG
...
ACGAGGGACGCATCGGACGACTGCAGG
...
END
# Append a TAB to the end of every line that starts with >, to separate names from sequences
# The TAB is invisible in the output, so look closely
# \(\) captures the matched content, and \1 refers to what the first \(\) matched
# sed is discussed in more detail later
$ sed 's/^\(>.*\)/\1\t/' test.fasta
>SOX2	
...
ACGAGGGACGCATCGGACGACTGCAGGAC
>POU5F1	
CGGAAGGTAGTCGTCAGTGCAGCGAGTCCGT
CGGAAGGTAGTCGTCAGTGCAGCGAGTCC
>NANOG	
...
ACGAGGGACGCATCGGACGACTGCAGG
...
# Use cat -A to display all the symbols in the file
# ^I represents the TAB
# $ marks the end of a line
$ sed 's/^\(>.*\)/\1\t/' test.fasta | cat -A
>SOX2^I$
...
ACGAGGGACGCATCGGACGACTGCAGGAC$
>POU5F1^I$
CGGAAGGTAGTCGTCAGTGCAGCGAGTCCGT$
CGGAAGGTAGTCGTCAGTGCAGCGAGTCC$
>NANOG^I$
...
ACGAGGGACGCATCGGACGACTGCAGG$
...
# Replace all newlines with spaces
# Note the second argument of tr: there is a space between the quotes
$ sed 's/^\(>.*\)/\1\t/' test.fasta | tr '\n' ' '
>SOX2	 ... ACGAGGGACGCATCGGACGACTGCAGGAC >POU5F1	 CGGAAGGTAGTCGTCAGTGCAGCGAGTCCGT CGGAAGGTAGTCGTCAGTGCAGCGAGTCC >NANOG	 ... ACGAGGGACGCATCGGACGACTGCAGG ...
# Replace the last space with a newline
$ sed 's/^\(>.*\)/\1\t/' test.fasta | tr '\n' ' ' | sed -e 's/ $/\n/'
>SOX2	 ... ACGAGGGACGCATCGGACGACTGCAGGAC >POU5F1	 CGGAAGGTAGTCGTCAGTGCAGCGAGTCCGT CGGAAGGTAGTCGTCAGTGCAGCGAGTCC >NANOG	 ... ACGAGGGACGCATCGGACGACTGCAGG ...
# Replace ' >' with a newline followed by >. Note that it is space + greater-than sign that is replaced.
# When several substitution commands are given, separate them with -e
$ sed 's/^\(>.*\)/\1\t/' test.fasta | tr '\n' ' ' | sed -e 's/ $/\n/' -e 's/ >/\n>/g'
>SOX2	 ... ACGAGGGACGCATCGGACGACTGCAGGAC
>POU5F1	 CGGAAGGTAGTCGTCAGTGCAGCGAGTCCGT CGGAAGGTAGTCGTCAGTGCAGCGAGTCC
>NANOG	 ... ACGAGGGACGCATCGGACGACTGCAGG ...
# Remove all spaces
$ sed 's/^\(>.*\)/\1\t/' test.fasta | tr '\n' ' ' | sed -e 's/ $/\n/' -e 's/ >/\n>/g' -e 's/ //g'
>SOX2	...ACGAGGGACGCATCGGACGACTGCAGGAC
>POU5F1	CGGAAGGTAGTCGTCAGTGCAGCGAGTCCGTCGGAAGGTAGTCGTCAGTGCAGCGAGTCC
>NANOG	...ACGAGGGACGCATCGGACGACTGCAGG...TGT
# Convert the TAB into a newline
$ sed 's/^\(>.*\)/\1\t/' test.fasta | tr '\n' ' ' | sed -e 's/ $/\n/' -e 's/ >/\n>/g' -e 's/ //g' -e 's/\t/\n/g'
>SOX2
...ACGAGGGACGCATCGGACGACTGCAGGAC
>POU5F1
CGGAAGGTAGTCGTCAGTGCAGCGAGTCCGTCGGAAGGTAGTCGTCAGTGCAGCGAGTCC
>NANOG
...ACGAGGGACGCATCGGACGACTGCAGG...TGT
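The same conversion can also be done in a single awk pass. This is a sketch of an alternative to the sed and tr pipeline, not a replacement the article itself proposes; the helper name linearize_fasta and the two sample records are made up for illustration.

```shell
# linearize_fasta (hypothetical helper): print each record as a name line
# followed by its whole sequence on one line. Reads the named files, or stdin if none.
linearize_fasta() {
    awk '/^>/ { if (NR > 1) printf "\n"; print; next }  # name line: end the previous sequence first
         { printf "%s", $0 }                              # sequence line: append without a newline
         END { if (NR > 0) printf "\n" }                  # terminate the last sequence
    ' "$@"
}

# Hypothetical multi-line input with two records.
printf '%s\n' '>geneA' 'ACGT' 'AC' '>geneB' 'GG' 'TT' | linearize_fasta
```

Because sequence lines are printed without a trailing newline and a newline is emitted only when the next name line (or the end of input) arrives, no temporary TAB or space markers are needed.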
Or, more simply, make a slight modification to the awk command used earlier.
# There is only one difference
# For a single-line FASTA file, only one line needs to be stored per name: seq[name]=$0
# For a multi-line FASTA file, each sequence line must be appended to what has been stored so far: seq[name]=seq[name]$0
$ awk 'BEGIN{OFS=FS="\t"}{if($0~/>/) {name=$0; sub(">", "", name);} else seq[name]=seq[name]$0;}END{print ">SOX2"; print seq["SOX2"]}' test.fasta
>SOX2
...ACGAGGGACGCATCGGACGACTGCAGGAC
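The gene name is hard-coded in the command above. One way to make it reusable, sketched here under the assumption that each header line holds nothing but the name, is to pass the name in with awk -v. The helper name fasta_get and the sample records are hypothetical.

```shell
# fasta_get (hypothetical helper): print one record from a single- or multi-line FASTA file.
# Usage: fasta_get NAME FILE
fasta_get() {
    awk -v want="$1" '
        /^>/ { name = substr($0, 2); next }   # name line: remember the name without the ">"
        { seq[name] = seq[name] $0 }          # sequence line: append to the stored sequence
        END { if (want in seq) { print ">" want; print seq[want] } }
    ' "$2"
}

# Hypothetical multi-line input.
fasta=$(mktemp)
printf '%s\n' '>geneA' 'ACGT' 'AC' '>geneB' 'GG' 'TT' >"$fasta"
fasta_get geneB "$fasta"
rm -f "$fasta"
```

If the requested name is not in the file, nothing is printed, which makes the function easy to test in a pipeline.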