Linux file sorting and FASTA file operations

Source: Internet
Author: User

File sorting

seq: generates a sequence of numbers; run man seq for details on its usage. We use seq to generate the input files for the examples that follow.

# Generate the numbers from 1 to 10 with a step of 1
$ seq 1 10
1
2
3
4
5
6
7
8
9
10
# Generate the numbers from 1 to 10 with a step of 1, separated by spaces
$ seq -s ' ' 1 10
1 2 3 4 5 6 7 8 9 10
# Generate the numbers from 1 to 10 with a step of 2
# The three arguments are start, step, and maximum; the last number never exceeds the maximum
$ seq -s ' ' 1 2 10
1 3 5 7 9
# Write two sequences (step 3 and step 6) into one test file
$ cat <(seq 0 3 15) <(seq 3 6 15) >test
$ cat test
0
3
6
9
12
15
3
9
15
sort: sorts lines, by character encoding by default. To sort by numeric value, add the -n option.

# The result may not match expectations: lines are compared character by character,
# so lines starting with 0 come first, then those starting with 1, then 3, 6, 9
$ sort test
0
12
15
15
3
3
6
9
9
# Sort by numeric value
$ sort -n test
0
3
3
6
9
9
12
15
15
sort -u: removes duplicate lines, equivalent to sort | uniq

$ sort -nu test
0
3
6
9
12
15
sort file | uniq -d: prints only the duplicated lines (d = duplicated)

$ sort -n test | uniq -d
3
9
15
sort file | uniq -c: counts how many times each line occurs.

# The first column is the number of occurrences, the second column is the original line
$ sort -n test | uniq -c
      1 0
      2 3
      1 6
      2 9
      1 12
      2 15
# Use a different file to see this more clearly
$ cat <<END >test2
> a
> b
> c
> b
> a
> e
> d
> a
> END
# The first column is the number of occurrences, the second column is the original line
$ sort test2 | uniq -c
      3 a
      2 b
      1 c
      1 d
      1 e
# Sort the file before running uniq; uniq only merges adjacent identical lines,
# so on unsorted input the result looks odd
$ cat test2 | uniq -c
      1 a
      1 b
      1 c
      1 b
      1 a
      1 e
      1 d
      1 a
Next, tidy up the output of uniq -c so that the original line comes first and its count comes second. awk is a powerful text-processing tool that handles data line by line: each time a line is read, the operations are applied to it. OFS is the output field separator (the column delimiter of the output), and FS is the field separator of the input (whitespace by default). Columns in awk are numbered 1 to n and written $1, $2 ... $n. BEGIN marks a block that runs before the file is read and is typically used to set basic parameters; its counterpart END runs only after the whole file has been read. A {} block that is not preceded by BEGIN or END is executed for every line that is read.

# awk polishes the result of the previous step: it drops the extra whitespace and swaps the two columns
$ sort test2 | uniq -c | awk 'BEGIN{OFS="\t";}{print $2, $1}'
a    3
b    2
c    1
d    1
e    1
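The three parts of an awk program described above (BEGIN, the per-line block, and END) can be seen together in one small sketch. It counts repeated lines with an awk array instead of uniq -c, so the input does not have to be sorted first. The file name test2_demo and the array name count are illustrative choices, not part of the original text. The final sort is there only because awk does not guarantee the order in which for (k in count) visits the keys.

```shell
# Recreate the example input (same lines as test2, under a hypothetical name).
printf '%s\n' a b c b a e d a > test2_demo

# BEGIN runs once before any input is read: set the output separator.
# The middle block runs for every line: increment the counter for that line.
# END runs once after the last line: print each distinct line with its count.
counts=$(awk 'BEGIN { OFS = "\t" }
              { count[$0]++ }
              END { for (k in count) print k, count[k] }' test2_demo | sort)

printf '%s\n' "$counts"
```

The result holds the same name and count pairs as the sort | uniq -c | awk pipeline, one tab-separated pair per line.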
For a two-column file, sort by the second column with sort -k2,2n.

# Sort by the second column numerically
$ sort test2 | uniq -c | awk 'BEGIN{OFS="\t";}{print $2, $1}' | sort -k2,2n
c    1
d    1
e    1
b    2
a    3
# Sort by the second column numerically;
# where the second column ties, sort by the first column in reverse alphabetical order (-r)
# Note how the first 3 lines differ from the previous result
$ sort test2 | uniq -c | awk 'BEGIN{OFS="\t";}{print $2,$1}' | sort -k2,2n -k1,1r
e    1
d    1
c    1
b    2
a    3
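As a further sketch of the -k syntax, the keys can also be combined to rank lines from most to least frequent. The file counts_demo.tsv and its contents are made up for illustration; the options used (-k, n, r) are the same ones shown above.

```shell
# Two-column, tab-separated demo input: name and count, in no particular order.
printf 'c\t1\na\t3\ne\t1\nb\t2\nd\t1\n' > counts_demo.tsv

# -k2,2nr : the key starts and ends at column 2, compared numerically, descending.
# -k1,1   : ties in column 2 are broken by column 1 in ascending order.
ranked=$(sort -k2,2nr -k1,1 counts_demo.tsv)

printf '%s\n' "$ranked"
```

Writing the key as -k2,2 rather than -k2 limits the comparison to column 2 only; -k2 alone would extend the key from column 2 to the end of the line.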
FASTA sequence extraction

First generate a FASTA file with single-line sequences. To extract the sequence of a specific gene from it, the simplest tool is the grep command. Its main purpose is to match strings in a file, which is the basis for a whole range of operations. Combined with regular expressions it becomes very powerful. There are many flavors of regular expressions, and almost every language has its own rules.

# Generate a FASTA file with single-line sequences
$ cat <<END >test.fasta
> >SOX2
> ACGAGGGACGCATCGGACGACTGCAGGACTGTC
> >POU5F1
> ACGAGGGACGCATCGGACGACTGCAGGACTGTC
> >NANOG
> CGGAAGGTAGTCGTCAGTGCAGCGAGTCCGT
> END
$ cat test.fasta
>SOX2
ACGAGGGACGCATCGGACGACTGCAGGACTGTC
>POU5F1
ACGAGGGACGCATCGGACGACTGCAGGACTGTC
>NANOG
CGGAAGGTAGTCGTCAGTGCAGCGAGTCCGT
# grep matches the lines containing SOX2
# -A 1 also prints the 1 line after each matching line (A: after)
$ grep -A 1 'SOX2' test.fasta
>SOX2
ACGAGGGACGCATCGGACGACTGCAGGACTGTC
# First check whether the current line contains >. If it does, the line is a sequence name;
# remove the greater-than sign to obtain the name.
# sub performs a substitution: sub(part to replace, replacement string, variable to modify)
# If the line does not contain >, it is a sequence line, so store it.
# seq[name]: works like a dictionary, with name as the key and the sequence as the value.
# The sequence can then be looked up by name.
$ awk 'BEGIN{OFS=FS="\t"}{if($0~/>/) {name=$0; sub(">", "", name);} else seq[name]=$0;}END{print ">SOX2"; print seq["SOX2"]}' test.fasta
>SOX2
ACGAGGGACGCATCGGACGACTGCAGGACTGTC
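One caveat with grep 'SOX2' is that it matches any line containing that string, so a gene named SOX21 would be returned as well. Below is a minimal sketch of a stricter match using the regular-expression anchors ^ and $. It assumes each header consists of > followed only by the gene name; the file demo_single.fasta and the SOX21 record are made up for illustration.

```shell
# Hypothetical single-line FASTA file with two similar gene names.
cat > demo_single.fasta <<'END'
>SOX2
ACGAGGGACGCATCGGACGACTGCAGGACTGTC
>SOX21
CGGAAGGTAGTCGTCAGTGCAGCGAGTCCGT
END

# Unanchored pattern: counts both the >SOX2 and the >SOX21 header.
loose=$(grep -c 'SOX2' demo_single.fasta)

# Anchored pattern: only the header that is exactly >SOX2, plus the next line.
strict=$(grep -A 1 '^>SOX2$' demo_single.fasta)

echo "headers matched without anchors: $loose"
printf '%s\n' "$strict"
```

If real headers carry a description after the name, the pattern would need to allow for it, for example by ending the match at a space instead of at the end of the line.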
Extracting sequences from a multi-line FASTA file is more troublesome. One approach is to convert it to single-line sequences first and then handle it as above. sed and tr are the most commonly used character-substitution tools.
$ cat <<END >test.fasta
> >SOX2
> ACGAGGGACGCATCGGACGACTGCAGGACTGTC
> ACGAGGGACGCATCGGACGACTGCAGGACTGTC
> ACGAGGGACGCATCGGACGACTGCAGGAC
> >POU5F1
> CGGAAGGTAGTCGTCAGTGCAGCGAGTCCGT
> CGGAAGGTAGTCGTCAGTGCAGCGAGTCC
> >NANOG
> ACGAGGGACGCATCGGACGACTGCAGGACTGTC
> ACGAGGGACGCATCGGACGACTGCAGG
> ACGAGGGACGCATCGGACGACTGCAGGACTGTC
> ACGAGGGACGCATCGGACGACTGCAGGACTGT
> END
# Append a tab to the end of every line that starts with >, to separate the name from the sequence
# A tab is not visible; it only shows up as a small gap
# \(\) records the matched content, and \1 refers to what the first \(\) recorded
# sed is covered in detail later
$ sed 's/^\(>.*\)/\1\t/' test.fasta
>SOX2    
ACGAGGGACGCATCGGACGACTGCAGGACTGTC
ACGAGGGACGCATCGGACGACTGCAGGACTGTC
ACGAGGGACGCATCGGACGACTGCAGGAC
>POU5F1    
CGGAAGGTAGTCGTCAGTGCAGCGAGTCCGT
CGGAAGGTAGTCGTCAGTGCAGCGAGTCC
>NANOG    
ACGAGGGACGCATCGGACGACTGCAGGACTGTC
ACGAGGGACGCATCGGACGACTGCAGG
ACGAGGGACGCATCGGACGACTGCAGGACTGTC
ACGAGGGACGCATCGGACGACTGCAGGACTGT
# cat -A displays all characters in the file, including invisible ones
# ^I means tab
# $ marks the end of a line
$ sed 's/^\(>.*\)/\1\t/' test.fasta | cat -A
>SOX2^I$
ACGAGGGACGCATCGGACGACTGCAGGACTGTC$
ACGAGGGACGCATCGGACGACTGCAGGACTGTC$
ACGAGGGACGCATCGGACGACTGCAGGAC$
>POU5F1^I$
CGGAAGGTAGTCGTCAGTGCAGCGAGTCCGT$
CGGAAGGTAGTCGTCAGTGCAGCGAGTCC$
>NANOG^I$
ACGAGGGACGCATCGGACGACTGCAGGACTGTC$
ACGAGGGACGCATCGGACGACTGCAGG$
ACGAGGGACGCATCGGACGACTGCAGGACTGTC$
ACGAGGGACGCATCGGACGACTGCAGGACTGT$
# Replace all newline characters with spaces
# Note the second argument: the quotes contain a single space
$ sed 's/^\(>.*\)/\1\t/' test.fasta | tr '\n' ' '
>SOX2     ACGAGGGACGCATCGGACGACTGCAGGACTGTC ACGAGGGACGCATCGGACGACTGCAGGACTGTC ACGAGGGACGCATCGGACGACTGCAGGAC >POU5F1     CGGAAGGTAGTCGTCAGTGCAGCGAGTCCGT CGGAAGGTAGTCGTCAGTGCAGCGAGTCC >NANOG     ACGAGGGACGCATCGGACGACTGCAGGACTGTC ACGAGGGACGCATCGGACGACTGCAGG ACGAGGGACGCATCGGACGACTGCAGGACTGTC ACGAGGGACGCATCGGACGACTGCAGGACTGT 
# Replace the last space with a newline character
$ sed 's/^\(>.*\)/\1\t/' test.fasta | tr '\n' ' ' | sed -e 's/ $/\n/'
>SOX2     ACGAGGGACGCATCGGACGACTGCAGGACTGTC ACGAGGGACGCATCGGACGACTGCAGGACTGTC ACGAGGGACGCATCGGACGACTGCAGGAC >POU5F1     CGGAAGGTAGTCGTCAGTGCAGCGAGTCCGT CGGAAGGTAGTCGTCAGTGCAGCGAGTCC >NANOG     ACGAGGGACGCATCGGACGACTGCAGGACTGTC ACGAGGGACGCATCGGACGACTGCAGG ACGAGGGACGCATCGGACGACTGCAGGACTGTC ACGAGGGACGCATCGGACGACTGCAGGACTGT
# Replace ' >' with a newline followed by >; note that the pattern is space + greater-than sign
# When using several substitution commands, separate them with -e
$ sed 's/^\(>.*\)/\1\t/' test.fasta | tr '\n' ' ' | sed -e 's/ $/\n/' -e 's/ >/\n>/g'
>SOX2     ACGAGGGACGCATCGGACGACTGCAGGACTGTC ACGAGGGACGCATCGGACGACTGCAGGACTGTC ACGAGGGACGCATCGGACGACTGCAGGAC
>POU5F1     CGGAAGGTAGTCGTCAGTGCAGCGAGTCCGT CGGAAGGTAGTCGTCAGTGCAGCGAGTCC
>NANOG     ACGAGGGACGCATCGGACGACTGCAGGACTGTC ACGAGGGACGCATCGGACGACTGCAGG ACGAGGGACGCATCGGACGACTGCAGGACTGTC ACGAGGGACGCATCGGACGACTGCAGGACTGT
# Remove all spaces
$ sed 's/^\(>.*\)/\1\t/' test.fasta | tr '\n' ' ' | sed -e 's/ $/\n/' -e 's/ >/\n>/g' -e 's/ //g'
>SOX2    ACGAGGGACGCATCGGACGACTGCAGGACTGTCACGAGGGACGCATCGGACGACTGCAGGACTGTCACGAGGGACGCATCGGACGACTGCAGGAC
>POU5F1    CGGAAGGTAGTCGTCAGTGCAGCGAGTCCGTCGGAAGGTAGTCGTCAGTGCAGCGAGTCC
>NANOG    ACGAGGGACGCATCGGACGACTGCAGGACTGTCACGAGGGACGCATCGGACGACTGCAGGACGAGGGACGCATCGGACGACTGCAGGACTGTCACGAGGGACGCATCGGACGACTGCAGGACTGT
# Convert the tabs to newlines
$ sed 's/^\(>.*\)/\1\t/' test.fasta | tr '\n' ' ' | sed -e 's/ $/\n/' -e 's/ >/\n>/g' -e 's/ //g' -e 's/\t/\n/g'
>SOX2
ACGAGGGACGCATCGGACGACTGCAGGACTGTCACGAGGGACGCATCGGACGACTGCAGGACTGTCACGAGGGACGCATCGGACGACTGCAGGAC
>POU5F1
CGGAAGGTAGTCGTCAGTGCAGCGAGTCCGTCGGAAGGTAGTCGTCAGTGCAGCGAGTCC
>NANOG
ACGAGGGACGCATCGGACGACTGCAGGACTGTCACGAGGGACGCATCGGACGACTGCAGGACGAGGGACGCATCGGACGACTGCAGGACTGTCACGAGGGACGCATCGGACGACTGCAGGACTGT
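The sed and tr pipeline above relies on GNU sed interpreting \t and \n inside the s command. As an alternative sketch, not taken from the original text, the same multi-line to single-line conversion can be written as a single awk program. The file demo_multi.fasta is a hypothetical, shortened input.

```shell
# Hypothetical multi-line FASTA input with short sequences.
cat > demo_multi.fasta <<'END'
>SOX2
ACGAGG
GACGCA
>NANOG
CGGAAG
GT
END

# Header lines: end the previous record with a newline (except before the
# first record), then print the header itself.
# Sequence lines: print without a newline so consecutive lines are joined.
# END: terminate the last record with a newline.
single=$(awk '/^>/ { if (NR > 1) printf "\n"; print; next }
              { printf "%s", $0 }
              END { if (NR > 0) printf "\n" }' demo_multi.fasta)

printf '%s\n' "$single"
```

The output is a single-line FASTA file that can be searched with grep -A 1 as in the earlier example.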
Or, more simply, make a slight change to the earlier awk command.

# There is only one small difference:
# for a single-line FASTA file, each sequence is one line, so seq[name]=$0
# for a multi-line FASTA file, each sequence line must be appended to the sequence read so far, so seq[name]=seq[name]$0
$ awk 'BEGIN{OFS=FS="\t"}{if($0~/>/) {name=$0; sub(">", "", name);} else seq[name]=seq[name]$0;}END{print ">SOX2"; print seq["SOX2"]}' test.fasta
>SOX2
ACGAGGGACGCATCGGACGACTGCAGGACTGTCACGAGGGACGCATCGGACGACTGCAGGACTGTCACGAGGGACGCATCGGACGACTGCAGGAC
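The awk command above hard-codes SOX2 in the END block. Below is a small sketch of how the gene name could be passed in as a variable with awk -v instead. The function name extract_seq and the file demo_extract.fasta are hypothetical, and each header is assumed to contain only > plus the gene name.

```shell
# extract_seq NAME FILE: print one record as a single-line FASTA entry.
# Prints nothing if NAME has no sequence in FILE.
extract_seq() {
    awk -v target="$1" '
        /^>/ { name = substr($0, 2); next }   # header: remember the name
        { seq[name] = seq[name] $0 }          # sequence line: append to record
        END { if (target in seq) { print ">" target; print seq[target] } }
    ' "$2"
}

# Hypothetical multi-line FASTA input with short sequences.
cat > demo_extract.fasta <<'END'
>SOX2
ACGAGG
GACGCA
>NANOG
CGGAAG
END

extract_seq NANOG demo_extract.fasta
```

Because the whole file is stored in the seq array before anything is printed, this approach suits small files; for very large files a version that prints while reading would use less memory.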
