1. Bowtie
Bowtie is a short-sequence alignment tool (BLAST is another sequence comparison tool). It is fast, and its results are easy to interpret.
The input can be a FASTQ or FASTA file.
The alignment results are written to a file in SAM format.
2. BWA
From: https://www.jianshu.com/p/1552cc6ac3be
BWA aligns DNA sequences against a reference genome. It contains three algorithms:
BWA-backtrack: for sequences no longer than 100 bp;
BWA-SW: for sequences from 70 bp to 1 Mbp;
BWA-MEM: also for sequences from 70 bp to 1 Mbp; on high-quality sequencing data it is faster and more accurate.
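The length ranges above can be turned into a quick rule of thumb. The sketch below is not part of BWA: mean_read_length and suggest_bwa_command are hypothetical helper names, and the 70 bp cut-off is simply the lower bound quoted in the list. It assumes uncompressed four-line FASTQ records; for a .gz file, pipe gzip -dc into mean_read_length.

```shell
# Hypothetical helpers (not part of BWA): pick a bwa sub-command from read length.

mean_read_length() {
    # A FASTQ record is 4 lines; the sequence is line 2 of each record.
    # With no file argument, awk reads standard input.
    awk 'NR % 4 == 2 { total += length($0); n++ }
         END { if (n > 0) printf "%d\n", total / n; else print 0 }' "$@"
}

suggest_bwa_command() {
    # Assumption: below 70 bp use BWA-backtrack (bwa aln + samse/sampe);
    # otherwise use BWA-MEM, which the bwa help text says to try first.
    if [ "$1" -lt 70 ]; then
        echo "aln"
    else
        echo "mem"
    fi
}

# Possible usage (file name is illustrative only):
#   suggest_bwa_command "$(mean_read_length reads.fq)"
```

Reads between 70 bp and 100 bp fall in the overlap between the algorithms; the sketch sends them to mem, which is a choice, not a rule.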
Use whereis bwa to find its installation path:
/data1/zzl$ whereis bwa
bwa: /usr/bin/bwa /usr/share/bwa /usr/share/man/man1/bwa.1.gz
Run bwa with no arguments to print the following help:
Usage:   bwa <command> [options]

Command: index         index sequences in the FASTA format
         mem           BWA-MEM algorithm
         fastmap       identify super-maximal exact matches
         pemerge       merge overlapping paired ends (EXPERIMENTAL)
         aln           gapped/ungapped alignment
         samse         generate alignment (single ended)
         sampe         generate alignment (paired ended)
         bwasw         BWA-SW for long queries

         shm           manage indices in shared memory
         fa2pac        convert FASTA to PAC format
         pac2bwt       generate BWT from PAC
         pac2bwtgen    alternative algorithm for generating BWT
         bwtupdate     update .bwt to the new format
         bwt2sa        generate SA from BWT and Occ

Note: To use BWA, you need to first index the genome with `bwa index'.
      There are three alignment algorithms in BWA: `mem', `bwasw', and
      `aln/samse/sampe'. If you are not sure which to use, try `bwa mem'
      first. Please `man ./bwa.1' for the manual.
Steps:
1. Index the reference genome:
bwa index -a bwtsw hg19.fasta
Here the bwtsw algorithm is used to build the index. bwa index writes five files next to the reference, with the extensions .amb, .ann, .bwt, .pac, and .sa. In the listing below (for a GRCm38 reference), the .fai file does not come from bwa index; it is the FASTA index written by samtools faidx:
/data1/GRCm38$ ls
GRCm38_68.fa      GRCm38_68.fa.amb  GRCm38_68.fa.ann  GRCm38_68.fa.bwt
GRCm38_68.fa.fai  GRCm38_68.fa.pac  GRCm38_68.fa.sa
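Since bwa mem fails when any of the five index files is missing, it can be worth checking for them before starting a long run. The following is a minimal sketch; bwa_index_missing is a hypothetical helper name, not a BWA command, and it only tests that the files exist, not that they are valid.

```shell
# Hypothetical helper (not part of BWA): list the bwa index files that are
# missing for a given reference FASTA path.
bwa_index_missing() {
    for ext in amb ann bwt pac sa; do
        # Print the expected file name only when it does not exist.
        [ -f "$1.$ext" ] || echo "$1.$ext"
    done
}

# Possible usage (reference name taken from the indexing step above):
#   if [ -n "$(bwa_index_missing hg19.fasta)" ]; then
#       bwa index -a bwtsw hg19.fasta
#   fi
```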
2. Align the reads with the BWA-MEM algorithm:
bwa mem -t 4 hg19.fasta read1.fq read2.fq > aln-pe.sam
I used this command:
bwa mem -t 4 ../hg19/hg19.fasta ERR580012_1.fastq.gz ERR580012_2.fastq.gz > aln-pe.sam
This runs the mem algorithm. -t sets the number of threads; using more threads shortens the running time. Next comes the FASTA file of the reference genome (the one that was indexed), followed by the read files. Another parameter:
-p
Treat the first input file as interleaved paired-end data. By default, a single input file is treated as single-end sequencing, and two input files are treated as paired-end sequencing, with the i-th read of each file forming a pair. With this parameter, consecutive reads in the first file (the 2i-th and the (2i+1)-th) are taken as a pair, and a second input file, if given, is ignored;
Finally, > redirects the result into the SAM file.
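The BWA manual describes -p as reading the first file as interleaved paired-end data, where read 1 and read 2 of each pair alternate in one file. As an illustration of what such a file looks like, here is a small sketch that builds one from two ordinary paired files. interleave_fastq is a hypothetical helper name; it assumes uncompressed four-line FASTQ records and the same read order in both files.

```shell
# Hypothetical helper: merge two paired FASTQ files into one interleaved
# stream (record 1 of file A, record 1 of file B, record 2 of file A, ...).
interleave_fastq() {
    awk -v mates="$2" '
        { print }                      # pass through each line of the first file
        NR % 4 == 0 {                  # a 4-line record from the first file is done
            for (i = 0; i < 4; i++) {  # so emit the matching record of the mate file
                if ((getline line < mates) > 0) print line
            }
        }' "$1"
}

# Possible usage (file names are illustrative only):
#   interleave_fastq read1.fq read2.fq > interleaved.fq
#   bwa mem -p hg19.fasta interleaved.fq > aln-pe.sam
```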
So what are single-end and paired-end sequencing?
From: https://www.cnblogs.com/Formulate0303/p/7843082.html
1. Single-end sequencing (single-read): the DNA sample is first fragmented, a primer sequence is ligated to one end of each DNA fragment, and an adapter is added to the end. The fragments are then fixed on the flowcell to generate DNA clusters, and the sequencer reads each fragment from one end only.
2. Paired-end sequencing: when the DNA library is constructed, sequencing primer binding sites are added to the adapters at both ends of the fragments. After the first round of sequencing is complete, the template strand used in the first round is removed, and the paired-end module guides the regeneration and amplification of the complementary strand in place, so that there is enough template for the second round. The complementary strand is then sequenced by synthesis in the second round.
// To be honest, this second point is still not quite clear to me. [1]
3. Compress the SAM file into BAM format:
samtools view -bS aln-pe_reorder.sam -o aln-pe.bam
Run samtools with no arguments to print its help:
Usage:   samtools <command> [options]

Command: view        SAM<->BAM conversion
         sort        sort alignment file
         mpileup     multi-way pileup
         depth       compute the depth
         faidx       index/extract FASTA
         tview       text alignment viewer
         index       index alignment
         idxstats    BAM index stats (r595 or later)
         fixmate     fix mate information
         flagstat    simple stats
         calmd       recalculate MD/NM tags and '=' bases
         merge       merge sorted alignments
         rmdup       remove PCR duplicates
         reheader    replace BAM header
         cat         concatenate BAMs
         bedcov      read depth per BED region
         targetcut   cut fosmid regions (for fosmid pool only)
         phase       phase heterozygotes
         bamshuf     shuffle and group alignments by name
-b writes the output in BAM format. -S declares that the input is a SAM file: older samtools releases, such as the one whose help is shown above, assume BAM input by default, so add -S when the input is a SAM file, otherwise an error is reported (samtools 1.0 and later detect the input format automatically and ignore -S). -o sets the output file name.
The result is a BAM file, where the B stands for binary; the binary format is faster to process.
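Because the right options depend on whether a file is text SAM or binary BAM, it can help to check before calling samtools view. BAM files are BGZF-compressed, which is a gzip-compatible format, so they begin with the gzip magic bytes 1f 8b. The sketch below uses only that fact; guess_alignment_format is a hypothetical helper name, and a gzip-compressed SAM file would also be reported as "bam".

```shell
# Hypothetical helper: guess whether an alignment file is BAM or SAM by
# looking at its first two bytes (gzip/BGZF data starts with 1f 8b).
guess_alignment_format() {
    # od prints the first 2 bytes as hex; tr removes spaces and newlines.
    magic=$(od -An -tx1 -N2 "$1" | tr -d ' \n')
    if [ "$magic" = "1f8b" ]; then
        echo "bam"
    else
        echo "sam"
    fi
}

# Possible usage with an older samtools that needs -S for SAM input:
#   if [ "$(guess_alignment_format aln-pe.sam)" = "sam" ]; then
#       samtools view -bS aln-pe.sam -o aln-pe.bam
#   fi
```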
Run the following command to view the file header:
samtools view -H ESCell#8.sam
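The header consists of lines that start with @. Among them, each @SQ line names one reference sequence (SN:) and gives its length (LN:). As a rough sketch, those two fields can be pulled out of a plain-text SAM file with awk alone; sam_reference_lengths is a hypothetical helper name, and it does not work on BAM files, which would first need samtools view -H.

```shell
# Hypothetical helper: print "name<TAB>length" for every @SQ header line
# of a plain-text SAM file.
sam_reference_lengths() {
    awk -F '\t' '
        /^@SQ/ {
            name = ""; len = ""
            for (i = 2; i <= NF; i++) {          # header fields are TAG:value pairs
                if ($i ~ /^SN:/) name = substr($i, 4)
                if ($i ~ /^LN:/) len = substr($i, 4)
            }
            print name "\t" len
        }
        !/^@/ { exit }    # header lines come first, so stop at the first alignment
    ' "$1"
}

# Possible usage (file name taken from the alignment step above):
#   sam_reference_lengths aln-pe.sam
```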
Introduction to some software functions in NGS