Linux File Sorting and FASTA File Operations

Source: Internet
Author: User

File Sorting

seq: generates a sequence of numbers. Run man seq to see the available options. Here seq is used to generate the input files for the examples that follow.

# Generate the numbers from 1 to 10 with a step size of 1
$ seq 1 10
1
2
3
4
5
6
7
8
9
10
# Generate the numbers from 1 to 10 with a step size of 1, separated by spaces
$ seq -s ' ' 1 10
1 2 3 4 5 6 7 8 9 10
# Generate the numbers from 1 to 10 with a step size of 2
# With three arguments, the middle one is the step size and the last one is always the maximum value
$ seq -s ' ' 1 2 10
1 3 5 7 9
$ cat <(seq 0 3 17) <(seq 3 6 18) >test
$ cat test
0
3
6
9
12
15
3
9
15
sort: sorts by character encoding by default. To sort by numeric value, add the -n option.
# Lines are compared character by character: those starting with 0 come first, then 1, 3, 6, 9
$ sort test
0
12
15
15
3
3
6
9
9
# Sort by numeric value
$ sort -n test
0
3
3
6
9
9
12
15
15
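The difference between the two orderings can be checked on a tiny list. The following is a minimal sketch, assuming a POSIX shell; the three input numbers and the variable names are made up for illustration, and LC_ALL=C is set so that the character order is the plain byte order.

```shell
# Illustrative input (not from the article): three numbers, one per line
nums=$(printf '%s\n' 9 12 3)

# Default sort compares character by character, so "12" sorts before "3"
lexical=$(printf '%s\n' "$nums" | LC_ALL=C sort | tr '\n' ' ')

# -n compares the numeric values instead
numeric=$(printf '%s\n' "$nums" | LC_ALL=C sort -n | tr '\n' ' ')

echo "lexical: $lexical"
echo "numeric: $numeric"
```

The first line should list 12 before 3, the second should list 12 last.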
sort -u: removes duplicate lines, equivalent to sort | uniq
$ sort -nu test
0
3
6
9
12
15
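The equivalence can be verified directly. This is a small sketch, assuming a POSIX shell; the input lines are made up for illustration. It also shows one caveat: when -u is combined with -n, lines are treated as duplicates if they compare equal numerically, so the result can differ from sort -n | uniq, which compares the text.

```shell
# Illustrative input with repeated lines (not from the article)
data=$(printf '%s\n' b a b c a)

via_u=$(printf '%s\n' "$data" | LC_ALL=C sort -u)
via_uniq=$(printf '%s\n' "$data" | LC_ALL=C sort | uniq)
if [ "$via_u" = "$via_uniq" ]; then
    echo "sort -u and sort | uniq agree"
fi

# "1" and "01" are equal as numbers but different as text
n_u=$(printf '%s\n' 1 01 | sort -nu | wc -l)
n_uniq=$(printf '%s\n' 1 01 | sort -n | uniq | wc -l)
echo "sort -nu keeps $((n_u)) line(s); sort -n | uniq keeps $((n_uniq)) line(s)"
```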
sort file | uniq -d: gets the duplicated lines (d = duplicated)
$ sort -n test | uniq -d
3
9
15
sort file | uniq -c: counts the number of times each line occurs.
# The first column is the number of times each line appears; the second column is the original line
$ sort -n test | uniq -c
      1 0
      2 3
      1 6
      2 9
      1 12
      2 15
# Switch to another file to see this more clearly
$ cat <<END >test2
a
b
c
b
a
e
d
a
END
# The first column is the number of times each line appears; the second column is the original line
$ sort test2 | uniq -c
      3 a
      2 b
      1 c
      1 d
      1 e
# The file must be sorted before running uniq; otherwise the result looks odd, because uniq only compares adjacent lines
$ cat test2 | uniq -c
      1 a
      1 b
      1 c
      1 b
      1 a
      1 e
      1 d
      1 a
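The reason sorting matters is that uniq collapses only runs of identical adjacent lines. A minimal sketch, assuming a POSIX shell and a made-up three-line input:

```shell
# Illustrative input: "a" appears twice, but not on adjacent lines
data=$(printf '%s\n' a b a)

# Without sorting, the two "a" lines are not adjacent, so both survive
unsorted=$(printf '%s\n' "$data" | uniq | tr '\n' ' ')

# After sorting, the duplicates sit next to each other and are collapsed
sorted=$(printf '%s\n' "$data" | LC_ALL=C sort | uniq | tr '\n' ' ')

echo "uniq alone:  $unsorted"
echo "sort | uniq: $sorted"
```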
Rearrange the output of uniq -c so that the original line comes first and its count second. awk is a powerful text-processing tool that works line by line: it reads one line at a time and applies the given operations to it. OFS is the output field separator; FS is the input field separator (whitespace by default). In awk, columns 1 through n are referenced as $1, $2, ... $n. A BEGIN block runs before the file is read and is typically used to set basic parameters. The END block is its counterpart and runs only after the whole file has been read. A {} block that is not preceded by BEGIN or END is the part that reads and processes each line.
# awk operates on the output of the previous step
# It strips the extra whitespace and swaps the two columns
$ sort test2 | uniq -c | awk 'BEGIN{OFS="\t";}{print $2, $1}'
a	3
b	2
c	1
d	1
e	1
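The three kinds of awk blocks described above can be seen together in one short program. The sketch below assumes a POSIX shell and awk; the input lines and the running total are made up for illustration and are not part of the original example.

```shell
result=$(printf 'a 3\nb 2\nc 1\n' | awk '
    BEGIN { OFS = "\t"; total = 0 }   # runs once, before any input is read
    { total += $2; print $2, $1 }     # runs for every input line
    END { print "total", total }      # runs once, after the last line
')
printf '%s\n' "$result"
```

Each input line is printed with its two columns swapped and separated by a tab, and the END block prints the sum of the second column last.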
Sort the two-column output by its second column: sort -k2,2n.
# Sort numerically by the value in the second column
$ sort test2 | uniq -c | awk 'BEGIN{OFS="\t";}{print $2, $1}' | sort -k2,2n
c	1
d	1
e	1
b	2
a	3
# Sort numerically by the second column first
# When the second column is tied, sort by the first column in reverse alphabetical order (-r)
# Note how the order of the first three lines differs from the previous result
$ sort test2 | uniq -c | awk 'BEGIN{OFS="\t";}{print $2, $1}' | sort -k2,2n -k1,1r
e	1
d	1
c	1
b	2
a	3
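By default sort splits fields at blanks, which works here because the names contain no spaces. If a column could itself contain spaces, the separator can be stated explicitly with -t. The following is a sketch under that assumption, using a POSIX shell; the tab is produced with printf so the example does not rely on bash-only quoting.

```shell
tab=$(printf '\t')

# Same two-column data as above, written out directly
sorted=$(printf 'a\t3\nb\t2\nc\t1\nd\t1\ne\t1\n' |
    LC_ALL=C sort -t "$tab" -k2,2n -k1,1r)

printf '%s\n' "$sorted"
```

The order should match the second result above: e, d, c, then b, then a.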
FASTA Sequence Extraction

Generate a FASTA file with single-line sequences and extract the sequence of a specific gene. The simplest way is the grep command, whose main purpose is to match strings in a file and act on the matches. Combined with regular expressions it becomes very powerful. There are many flavors of regular expression, and almost every language has its own rules.
# Generate a FASTA file with single-line sequences
$ cat <<END >test.fasta
>SOX2
ACGAGGGACGCATCGGACGACTGCAGGACTGTC
>POU5F1
ACGAGGGACGCATCGGACGACTGCAGGACTGTC
>NANOG
CGGAAGGTAGTCGTCAGTGCAGCGAGTCCGT
END
$ cat test.fasta
>SOX2
ACGAGGGACGCATCGGACGACTGCAGGACTGTC
>POU5F1
ACGAGGGACGCATCGGACGACTGCAGGACTGTC
>NANOG
CGGAAGGTAGTCGTCAGTGCAGCGAGTCCGT
# grep matches the line containing SOX2
# -A 1 also prints the line after each matched line (A: after)
$ grep -A 1 'SOX2' test.fasta
>SOX2
ACGAGGGACGCATCGGACGACTGCAGGACTGTC
# First check whether the current line starts with >. If it does, it is a sequence name line: remove the > to get the name.
# sub performs a substitution: sub(pattern to replace, replacement, string to operate on)
# If the line does not start with >, it is a sequence line and is stored.
# seq[name] works like a dictionary: the name is the key and the sequence is the value, so the sequence can be looked up by name.
$ awk 'BEGIN{OFS=FS="\t"}{if($0~/^>/) {name=$0; sub(">", "", name);} else seq[name]=$0;}END{print ">SOX2"; print seq["SOX2"]}' test.fasta
>SOX2
ACGAGGGACGCATCGGACGACTGCAGGACTGTC
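One thing to watch with the grep approach is that an unanchored pattern matches any header containing the string. The sketch below assumes a POSIX shell and a grep that supports -A, as used above. The file, the gene name SOX21, and the short sequences are made up for illustration.

```shell
workdir=$(mktemp -d)
fasta="$workdir/demo.fasta"

# Toy single-line FASTA (hypothetical records, not the article's data)
cat <<'EOF' >"$fasta"
>SOX2
ACGT
>SOX21
TTTT
>NANOG
GGCC
EOF

# 'SOX2' also matches the SOX21 header, so two lines are counted
loose_hits=$(grep -c 'SOX2' "$fasta")

# Anchoring with ^ and $ restricts the match to the exact header line
exact=$(grep -A 1 '^>SOX2$' "$fasta")

echo "unanchored matches: $loose_hits"
printf '%s\n' "$exact"

rm -r "$workdir"
```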
Extracting a sequence from a multi-line FASTA file takes a little more work. One way is to convert the file to single-line sequences and then use the method above. sed and tr are the most commonly used character replacement tools.
$ cat <<END >test.fasta
>SOX2
...
ACGAGGGACGCATCGGACGACTGCAGGAC
>POU5F1
CGGAAGGTAGTCGTCAGTGCAGCGAGTCCGT
CGGAAGGTAGTCGTCAGTGCAGCGAGTCC
>NANOG
...
ACGAGGGACGCATCGGACGACTGCAGG
...
END
# Add a TAB at the end of each line that starts with ">", to separate the name from the sequence
# The TAB is invisible in the output; look closely
# \(\) records the matched content, and \1 refers to what \(\) matched
# sed will be covered in more detail later
$ sed 's/^\(>.*\)/\1\t/' test.fasta
>SOX2	
...
ACGAGGGACGCATCGGACGACTGCAGGAC
>POU5F1	
CGGAAGGTAGTCGTCAGTGCAGCGAGTCCGT
CGGAAGGTAGTCGTCAGTGCAGCGAGTCC
>NANOG	
...
ACGAGGGACGCATCGGACGACTGCAGG
...
# Use cat -A to display all the symbols in the file
# ^I represents the TAB character
# $ marks the end of a line
$ sed 's/^\(>.*\)/\1\t/' test.fasta | cat -A
>SOX2^I$
...$
ACGAGGGACGCATCGGACGACTGCAGGAC$
>POU5F1^I$
CGGAAGGTAGTCGTCAGTGCAGCGAGTCCGT$
CGGAAGGTAGTCGTCAGTGCAGCGAGTCC$
>NANOG^I$
...$
ACGAGGGACGCATCGGACGACTGCAGG$
...$
# Replace all line breaks with spaces
# In the second argument of tr, the character inside the quotes is a space
$ sed 's/^\(>.*\)/\1\t/' test.fasta | tr '\n' ' '
>SOX2	 ... ACGAGGGACGCATCGGACGACTGCAGGAC >POU5F1	 CGGAAGGTAGTCGTCAGTGCAGCGAGTCCGT CGGAAGGTAGTCGTCAGTGCAGCGAGTCC >NANOG	 ... ACGAGGGACGCATCGGACGACTGCAGG ... 
# Add a line break at the end of the line
$ sed 's/^\(>.*\)/\1\t/' test.fasta | tr '\n' ' ' | sed -e 's/$/\n/'
>SOX2	 ... ACGAGGGACGCATCGGACGACTGCAGGAC >POU5F1	 CGGAAGGTAGTCGTCAGTGCAGCGAGTCCGT CGGAAGGTAGTCGTCAGTGCAGCGAGTCC >NANOG	 ... ACGAGGGACGCATCGGACGACTGCAGG ... 
# Replace ' >' with a line break followed by '>'. Note that the pattern is a space plus the greater-than sign.
# When several substitution commands are used, separate them with -e
$ sed 's/^\(>.*\)/\1\t/' test.fasta | tr '\n' ' ' | sed -e 's/$/\n/' -e 's/ >/\n>/g'
>SOX2	 ... ACGAGGGACGCATCGGACGACTGCAGGAC
>POU5F1	 CGGAAGGTAGTCGTCAGTGCAGCGAGTCCGT CGGAAGGTAGTCGTCAGTGCAGCGAGTCC
>NANOG	 ... ACGAGGGACGCATCGGACGACTGCAGG ... 
# Remove all spaces
$ sed 's/^\(>.*\)/\1\t/' test.fasta | tr '\n' ' ' | sed -e 's/$/\n/' -e 's/ >/\n>/g' -e 's/ //g'
>SOX2	...ACGAGGGACGCATCGGACGACTGCAGGAC
>POU5F1	CGGAAGGTAGTCGTCAGTGCAGCGAGTCCGTCGGAAGGTAGTCGTCAGTGCAGCGAGTCC
>NANOG	...ACGAGGGACGCATCGGACGACTGCAGG...
# Convert the TAB to a line break
$ sed 's/^\(>.*\)/\1\t/' test.fasta | tr '\n' ' ' | sed -e 's/$/\n/' -e 's/ >/\n>/g' -e 's/ //g' -e 's/\t/\n/g'
>SOX2
...ACGAGGGACGCATCGGACGACTGCAGGAC
>POU5F1
CGGAAGGTAGTCGTCAGTGCAGCGAGTCCGTCGGAAGGTAGTCGTCAGTGCAGCGAGTCC
>NANOG
...ACGAGGGACGCATCGGACGACTGCAGG...
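The same conversion can also be done in a single awk pass instead of the sed and tr pipeline. This is an alternative sketch, not the method used above. It assumes a POSIX shell and awk, and that every record has at least one sequence line. The function name linearize_fasta, the file, and its short records are hypothetical.

```shell
workdir=$(mktemp -d)

# Toy multi-line FASTA (made-up records for illustration)
cat <<'EOF' >"$workdir/multi.fasta"
>geneA
ACGT
ACG
>geneB
TT
GG
CC
EOF

# Print each header on its own line and join the sequence lines that follow it
linearize_fasta() {
    awk '
        /^>/ { if (NR > 1) printf "\n"; print; next }  # close the previous record, print the header
        { printf "%s", $0 }                            # sequence line: print without a line break
        END { if (NR > 0) printf "\n" }                # close the last record
    ' "$1"
}

single=$(linearize_fasta "$workdir/multi.fasta")
printf '%s\n' "$single"

rm -r "$workdir"
```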
Or, more simply, make a slight modification to the earlier awk command.
# There is only one difference
# For a single-line FASTA file, only one line needs to be recorded: seq[name]=$0
# For a multi-line FASTA file, each sequence line has to be appended to what was stored before: seq[name]=seq[name]$0
$ awk 'BEGIN{OFS=FS="\t"}{if($0~/^>/) {name=$0; sub(">", "", name);} else seq[name]=seq[name]$0;}END{print ">SOX2"; print seq["SOX2"]}' test.fasta
>SOX2
...ACGAGGGACGCATCGGACGACTGCAGGAC
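The gene name is hard-coded in the command above. One possible way to make it a parameter is to pass it in with awk -v. The sketch below assumes a POSIX shell and awk; the function name extract_gene, the variable gene, the file, and its short records are all hypothetical.

```shell
workdir=$(mktemp -d)

# Toy multi-line FASTA (made-up records for illustration)
cat <<'EOF' >"$workdir/multi.fasta"
>geneA
ACGT
ACG
>geneB
TT
GG
CC
EOF

# Usage: extract_gene NAME FILE
extract_gene() {
    awk -v gene="$1" '
        /^>/ { name = substr($0, 2); next }   # header line: remember the name without ">"
        { seq[name] = seq[name] $0 }          # sequence line: append to the current record
        END { if (gene in seq) { print ">" gene; print seq[gene] } }
    ' "$2"
}

picked=$(extract_gene geneB "$workdir/multi.fasta")
missing=$(extract_gene geneX "$workdir/multi.fasta")
printf '%s\n' "$picked"

rm -r "$workdir"
```

A name that is not in the file prints nothing, so an empty result can be used to detect a missing gene.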
