Using BLAST under Linux (reposted)

1. Decompress the BLAST archive, then copy the files in its bin directory to /usr/local/bin;
2. Create a soft link that links the extracted bin directory into /home/username, e.g.: ln -s /home/username/blast/bin;
3. In the current user's home directory, edit the bashrc file and add to it: export PATH=/home/username/bin/=$PATH;
4. In the current directory, format the data file: $ formatdb -i filename.suffix -p F -o T
5. Put the sequence to be searched into a file named test.txt, with contents such as:
>test ....
acgtcagtcgatcgat .....
6. Run the comparison: $ blastall -p blastn -d filename.suffix -i test.txt -o test.out

$ formatdb -i /home/liuguiyou/landsberg_arabidopsis/ncbi_arab2.fna -o T -p F

$ blastall -p blastn -i /home/liuguiyou/landsberg_arabidopsis/cereon_ath_ler.fasta -d /home/liuguiyou/landsberg_arabidopsis/ncbi_arab2.fna -e 1e-10 -o /home/liuguiyou/landsberg_arabidopsis/result


Assume:
BLAST is installed in /opt/blast/

1. Decompress the BLAST archive, then copy the files in its bin directory to /usr/local/bin;
---
Purpose: Any user, from any working directory, can run the programs in the BLAST package from the command line without typing their full path.
Explanation: The system's PATH environment variable includes /usr/local/bin. When a BLAST program is invoked by name on the command line, the shell searches the PATH directories, including /usr/local/bin, for a matching executable.
Recommendation: For simplicity, you can skip this step unless you want to run your BLAST installation by program name when logged in as a different user.

2. Create a soft link that links the extracted bin directory into /home/username, e.g.: ln -s /home/username/blast/bin;
---
Purpose: Prepares for step 3. Run from /home/username, the command creates /home/username/bin, which points to the BLAST bin directory.
Recommendation: For simplicity, you can skip this step.

3. In the current user's home directory, edit the bashrc file and add to it: export PATH=/home/username/bin/=$PATH;
---
Correction: The step as written contains typos. The file to edit is .bashrc, and the line to add is: export PATH=/home/username/bin/:$PATH
Purpose: After the current user logs in, the BLAST programs can be run simply by typing their names.
Recommendation: For simplicity, you can skip this step.
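
The point of steps 1 to 3 is that the shell can find formatdb and blastall by name. A minimal sketch of that idea follows, assuming the /opt/blast install path used in this walkthrough. The BLAST_BIN variable and the prepend_path helper are names made up for this illustration.

```shell
#!/bin/sh
# Hypothetical install location; this walkthrough assumes /opt/blast.
BLAST_BIN="${BLAST_BIN:-/opt/blast/bin}"

# prepend_path DIR: print a PATH value with DIR in front,
# or the unchanged PATH if DIR is already listed.
prepend_path() {
    case ":$PATH:" in
        *":$1:"*) printf '%s\n' "$PATH" ;;
        *)        printf '%s\n' "$1:$PATH" ;;
    esac
}

PATH="$(prepend_path "$BLAST_BIN")"
export PATH

# Report whether the shell can now find the programs by name.
for prog in formatdb blastall; do
    if command -v "$prog" >/dev/null 2>&1; then
        echo "$prog found at $(command -v "$prog")"
    else
        echo "$prog not found; check that $BLAST_BIN exists"
    fi
done
```

This only affects the current shell session. To make it permanent, the corrected export line from step 3 would go into ~/.bashrc.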

4. In the current directory, format the data file: $ formatdb -i filename.suffix -p F -o T
---
Run:

Code: /opt/blast/bin/formatdb -i {data file here} -p F -o T



5. Put the sequence to be searched into a file named test.txt, with contents such as:
>test ....
acgtcagtcgatcgat .....
---
Nothing special here; just substitute your own file name.
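
As an illustration of step 5, the sketch below writes a one-record FASTA file and runs a basic sanity check on it. The sequence is a made-up placeholder, and QUERY_FILE is a hypothetical variable name.

```shell
#!/bin/sh
QUERY_FILE="${QUERY_FILE:-test.txt}"

# A FASTA record is a ">" header line followed by one or more sequence lines.
cat > "$QUERY_FILE" <<'EOF'
>test
ACGTCAGTCGATCGATACGTCAGTCGATCGAT
EOF

# Count header lines, and count sequence lines containing anything
# other than nucleotide codes (grep -c exits non-zero on a zero count).
headers=$(grep -c '^>' "$QUERY_FILE" || true)
bad=$(grep -v '^>' "$QUERY_FILE" | grep -c '[^ACGTNacgtn]' || true)
echo "$QUERY_FILE: $headers record(s), $bad line(s) with unexpected characters"
```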

6. Run the comparison:
$ blastall -p blastn -d filename.suffix -i test.txt -o test.out
---
Run:

Code: /opt/blast/bin/blastall -p blastn -d {database file here} -i test.txt -o test.out
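
Steps 4 and 6 together make up the whole workflow: build the database once, then search it. A hedged sketch follows, assuming the /opt/blast path from above. db.seq is a placeholder database name, and run_or_show is a hypothetical helper that only prints the command when the program is not installed.

```shell
#!/bin/sh
BLAST_BIN="${BLAST_BIN:-/opt/blast/bin}"   # install path assumed in this walkthrough
DB="${DB:-db.seq}"                          # placeholder FASTA database file
QUERY="${QUERY:-test.txt}"
OUT="${OUT:-test.out}"

# run_or_show CMD ARGS...: run CMD if it is an executable file,
# otherwise print the command line that would have been run.
run_or_show() {
    if [ -x "$1" ]; then
        "$@"
    else
        echo "would run: $*"
    fi
}

# Format the nucleic acid database, then search it with blastn.
run_or_show "$BLAST_BIN/formatdb" -i "$DB" -p F -o T &&
run_or_show "$BLAST_BIN/blastall" -p blastn -d "$DB" -i "$QUERY" -o "$OUT"
```

The database only needs to be formatted again when the underlying sequence file changes; repeated searches can reuse the same index files.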


Additional notes:

1. Running the database-building program formatdb:

Building a database means creating index files for the target sequences, and formatdb is the program that does it. It accepts input in FASTA or ASN.1 format; FASTA is the usual choice. If the FASTA file used to build the database is db.seq, the basic formatdb command is:

formatdb -i db.seq [-options]

Common parameters are as follows:

-p (T/F): Selects the database type. "T" means a protein database and "F" a nucleic acid database. The default is "T".

-o (T/F): Determines whether sequence names are parsed and indexed. "T" builds the sequence-name index and "F" does not. The default is "F".

Program output:

For a nucleic acid database, the output files are db.seq.nhr, db.seq.nin and db.seq.nsq; if "-o T" is given, db.seq.nsd, db.seq.nsi, db.seq.nni and db.seq.nnd are produced as well.

The output for a protein database is analogous: db.seq.phr, db.seq.pin and db.seq.psq, plus db.seq.psd, db.seq.psi, db.seq.pni and db.seq.pnd with "-o T".

In addition, the program writes a log file (formatdb.log by default) that records the run time, version number, number of sequences and similar information.
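
The file names listed above follow a regular pattern, so a quick check after running formatdb can be scripted. The sketch below assumes a single-volume database named db.seq; expected_db_files is a hypothetical helper, not part of BLAST.

```shell
#!/bin/sh
# expected_db_files NAME TYPE PARSE
#   TYPE:  T = protein, F = nucleic acid (same meaning as formatdb -p)
#   PARSE: T or F (same meaning as formatdb -o)
expected_db_files() {
    case "$2" in
        T) prefix=p ;;
        F) prefix=n ;;
        *) echo "TYPE must be T or F" >&2; return 1 ;;
    esac
    # Core index files, always written.
    for ext in hr "in" sq; do
        echo "$1.$prefix$ext"
    done
    # Sequence-name index files, written only with -o T.
    if [ "$3" = T ]; then
        for ext in sd si ni nd; do
            echo "$1.$prefix$ext"
        done
    fi
}

# Report which index files for db.seq (nucleic acid, -o T) are present.
for f in $(expected_db_files db.seq F T); do
    if [ -e "$f" ]; then
        echo "ok       $f"
    else
        echo "missing  $f"
    fi
done
```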


2. Running the blastall program:

The main BLAST program is blastall. The user specifies the query sequence file (-i), the database (-d), the comparison type (-p) and the output file (-o). The "-p" parameter takes one of 5 values:

-p blastp: Compares a protein sequence against a protein database.

-p blastx: Compares a nucleic acid sequence against a protein database.

-p blastn: Compares a nucleic acid sequence against a nucleic acid database.

-p tblastn: Compares a protein sequence against a nucleic acid database.

-p tblastx: Compares a nucleic acid sequence against a nucleic acid database at the protein level.

blastall is one of the most commonly used BLAST programs. It is very powerful and has many parameters (listed below), but in everyday use only a few are needed, such as -p, -i, -d, -o and -e.
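
The choice of "-p" value follows directly from the types of the query and the database. As a rough sketch of that mapping, the hypothetical helper below (pick_program, with made-up "nucl"/"prot" labels) prints the program name for a given combination.

```shell
#!/bin/sh
# pick_program QUERY DB [MODE]
#   QUERY, DB: "nucl" or "prot"
#   MODE: "translated" selects tblastx for a nucl-vs-nucl search
pick_program() {
    case "$1:$2:${3:-direct}" in
        prot:prot:*)          echo blastp ;;
        nucl:prot:*)          echo blastx ;;
        nucl:nucl:translated) echo tblastx ;;
        nucl:nucl:*)          echo blastn ;;
        prot:nucl:*)          echo tblastn ;;
        *) echo "unknown combination: $1 vs $2" >&2; return 1 ;;
    esac
}

pick_program nucl nucl
pick_program nucl prot
pick_program nucl nucl translated
```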

3. Parameter descriptions:

  • -p: Name of the program to run
  • -d: Name of the database to search
  • -i: Name of the file containing the query sequences (query file)
  • -e: Expectation value (E-value), a statistical threshold. The default is 10, meaning that about 10 of the reported matches may be expected to arise by chance. Only matches whose E-value is below the threshold are reported, so a lower threshold gives stricter, more reliable results.
  • -o: Name of the output file for the results
  • -m: Display format of the alignment results. The default is 0, the pairwise format; other values such as 1 to 6 can be chosen as needed.
  • -I: Show GI numbers in the description lines [T/F], default F
  • -v: Maximum number of one-line descriptions shown, default 500
  • -b: Maximum number of alignments shown, default 250
  • -F: Filter the query sequence for low-complexity regions (LCRs) [T/F], default T. blastn uses the DUST program; the other programs use SEG.
  • A "low-complexity region" is a stretch in which one or a few residues are heavily overrepresented, or which consists of short-period repeats. For genome sequences of higher mammals, RepeatMasker can be used to mask repetitive elements. In the output, nucleotides in an LCR are replaced by "N" and amino acids by "X".
  • -a: Number of processors used to run BLAST, default 1
  • -S: Strand of the nucleic acid query used to search the database; only valid for blastn, blastx and tblastx. 1 means top, 2 means bottom, 3 means both; default 3
  • -T: Produce HTML output [T/F], default F
  • -n: Use a MegaBlast search [T/F], default F
  • -G: Penalty for opening a gap (0 means use the default setting), default 0
  • -E: Penalty for extending a gap (0 means use the default setting), default 0
  • -q: Penalty for a nucleotide mismatch (blastn only), default -3
  • -r: Reward for a nucleotide match (blastn only), default 1
  • -M: Scoring matrix to use, default BLOSUM62
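
To show how several of these options combine, here is a small sketch of a blastn search with a stricter E-value, tabular output and two processors. The option values and file names are illustrative choices only, and blast_tabular is a hypothetical wrapper; when blastall is not installed it just prints the command.

```shell
#!/bin/sh
# blast_tabular DB QUERY OUT: blastn search with a stricter E-value (-e),
# tabular output (-m 8), two processors (-a) and low-complexity filtering (-F).
blast_tabular() {
    set -- -p blastn -d "$1" -i "$2" -o "$3" -e 1e-10 -m 8 -a 2 -F T
    if command -v blastall >/dev/null 2>&1; then
        blastall "$@"
    else
        echo "blastall $*"    # BLAST not installed: show the command instead
    fi
}

blast_tabular db.seq test.txt test.m8
```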

4. Output format options:

-m: Display format of the alignment results

0 = pairwise,
1 = query-anchored showing identities,
2 = query-anchored no identities,
3 = flat query-anchored, show identities,
4 = flat query-anchored, no identities,
5 = query-anchored no identities and blunt ends,
6 = flat query-anchored, no identities and blunt ends,
7 = XML Blast output,
8 = tabular,
9 = tabular with comment lines,
10 = ASN, text,
11 = ASN, binary

-m 8: Tabular output. From left to right, the columns are: query name, subject name, percent identity, alignment length, number of mismatches, number of gap openings, query start, query end, subject start, subject end, E-value, bit score.

Query1  sub24  91.11   45   3  1  198  241   502208   502252  2.7e-06    50.05
Query1  sub21  98.68  151   2  0  532  682  1360665  1360515  1.0e-76   284.0
Query1  sub21  86.17   94  12  1  198  290   479232   479139  4.8e-14    75.82
Query1  sub21  87.04   54   7  0  238  291  1297867  1297920  6.9e-07    52.03
Query2  sub21  99.44  892   3  2   28  918  1351055  1350165  0.0      1713.2
Query2  sub21  87.58  153  17  1  343  495  1358110  1357960  2.1e-35   147.2
Query2  sub21  84.11  107  16  1  699  805  1305723  1305618  4.0e-12    69.88
Query2  sub21  89.58   48   5  0  519  566  1305968  1305921  6.0e-08    56.00
Query2  sub14  88.24  153  16  1  343  495   145402   145252  8.7e-38   155.1
Query2  sub24  88.08  151  16  1  345  495   567561   567709  1.4e-36   151.2
Query2  sub24  87.80  123  14  1  686  808   563341   563220  1.9e-26   117.5

In -m 8 format, the orientation of an alignment can be read from the subject coordinates. In line 1 of the results above, the subject start coordinate is smaller than the end coordinate, so the two sequences align in the same orientation. In line 2, the subject start is greater than the end, so the query aligns to the complementary strand of the subject.
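
The strand rule just described is easy to apply mechanically with awk. In the sketch below, strand_of_hits is a hypothetical helper; the first sample row is invented for illustration, while the second is one of the rows from the results above.

```shell
#!/bin/sh
# strand_of_hits: read -m 8 rows on stdin and print query, subject and strand.
strand_of_hits() {
    awk '{
        # Column 9 is the subject start, column 10 the subject end.
        strand = ($9 + 0 <= $10 + 0) ? "plus" : "minus"
        print $1, $2, strand
    }'
}

printf '%s\n' \
    'Query0 sub00 100.00 20 0 0 1 20 101 120 1e-05 40.1' \
    'Query1 sub21 98.68 151 2 0 532 682 1360665 1360515 1.0e-76 284.0' |
strand_of_hits
```

For these two rows the script prints "Query0 sub00 plus" and "Query1 sub21 minus".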

-m 9: Tabular output with comment lines. The format is the same as -m 8, except that comment lines are added before each query's results to explain the meaning of the columns.

# BLASTN 2.2.8 [Jan-05-2004]
# Query: query1 Out.ace.1
# Database: database.seq
# Fields: Query id, Subject id, % identity, alignment length, mismatches, gap openings, q. start, q. end, s. start, s. end, e-value, bit score
Query1  sub24  91.11   45   3  1  198  241   502208   502252  2.7e-06    50.05
Query1  sub21  98.68  151   2  0  532  682  1360665  1360515  1.0e-76   284.0
Query1  sub21  86.17   94  12  1  198  290   479232   479139  4.8e-14    75.82
Query1  sub21  87.04   54   7  0  238  291  1297867  1297920  6.9e-07    52.03
# BLASTN 2.2.8 [Jan-05-2004]
# Query: query1 Out.ace.1
# Database: database.seq
# Fields: Query id, Subject id, % identity, alignment length, mismatches, gap openings, q. start, q. end, s. start, s. end, e-value, bit score
Query2  sub21  99.44  892   3  2   28  918  1351055  1350165  0.0      1713.2
Query2  sub21  87.58  153  17  1  343  495  1358110  1357960  2.1e-35   147.2
Query2  sub21  84.11  107  16  1  699  805  1305723  1305618  4.0e-12    69.88
Query2  sub21  89.58   48   5  0  519  566  1305968  1305921  6.0e-08    56.00
Query2  sub14  88.24  153  16  1  343  495   145402   145252  8.7e-38   155.1
Query2  sub24  88.08  151  16  1  345  495   567561   567709  1.4e-36   151.2
Query2  sub24  87.80  123  14  1  686  808   563341   563220  1.9e-26   117.5

-m 10 and 11: ASN text and binary output, respectively; they are not covered here.

Values 1 to 6 of the "-m" parameter make it easier to compare subjects with one another. Formats 8 and 9 keep every hit but reduce each one to a row in a table, which greatly reduces storage requirements and makes the results clearer and easier to read. The m8/m9 formats also have a drawback: some information is lost. Besides the sequence lengths and the alignment display, the reading-frame information that is critical for blastx, tblastn and tblastx comparisons is dropped, a loss that should be avoided where possible. For this reason, large-scale blastn tasks often use m8 output to save space, while small-scale, high-precision alignments usually use the default output format, with other programs used to extract the useful information from the results.
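
One reason the tabular formats suit large-scale tasks is that they can be filtered with standard text tools. As a sketch, the hypothetical best_hits helper below keeps, for each query, the highest-scoring row under an E-value cut-off. It accepts both -m 8 and -m 9 output. Two of the sample rows come from the results above and two are invented so that there is something to filter out.

```shell
#!/bin/sh
# best_hits MAX_EVALUE: read -m 8 or -m 9 output on stdin and, for each query,
# print the row with the highest bit score among rows with E-value <= MAX_EVALUE.
best_hits() {
    awk -v max="$1" '
        /^#/ { next }                      # skip -m 9 comment lines
        $11 + 0 <= max + 0 {               # column 11 is the E-value
            if (!($1 in best) || $12 + 0 > score[$1]) {
                best[$1]  = $0
                score[$1] = $12 + 0        # column 12 is the bit score
            }
        }
        END { for (q in best) print best[q] }
    ' | sort
}

# The sub99 and sub88 rows are invented; the sub21 rows are from the results above.
printf '%s\n' \
    '# BLASTN 2.2.8 [Jan-05-2004]' \
    'Query1 sub99 90.00 20 2 0 1 20 101 120 0.5 30.2' \
    'Query1 sub21 98.68 151 2 0 532 682 1360665 1360515 1.0e-76 284.0' \
    'Query2 sub21 99.44 892 3 2 28 918 1351055 1350165 0.0 1713.2' \
    'Query2 sub88 95.00 100 5 0 1 100 1 100 1e-40 170.0' |
best_hits 1e-5
```

With a cut-off of 1e-5, only the two sub21 rows remain: the sub99 row fails the E-value test, and the sub88 row has a lower bit score than the other Query2 hit.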

Transferred from: http://blog.sina.com.cn/s/blog_4af3f0d20100ene9.html
