Sam/bam file Processing

Source: Internet
Author: User

When the sequenced FASTQ files are map to the genome, we usually get a file with a Sam or BAM extension. The full name of Sam is sequence Alignment/map format. BAM is the SAM binary file (b is taken from binary). So what is the format of the SAM file? If you want to get a real idea of the SAM file, you can view its documentation. The SAM header file and map results are composed. The header file consists of a line of comments at the start of the @. and the map result is something like this:

hwi-st1001:137:c12fpacxx:7:1115:14131:66670 0 chr1 12805 1 42m4i5m * 0 0 TTGGATGCCCCTCCACACCCTCTTGATCTTCCCTGTGATGTCACCAATATG CCCFFFFFHHGHHJJJJJHJJJJJJJJJJJJJJJJIJJJJJJJJJJJJIJJ as:i:-28 xn:i:0 Xm:i:2 xo:i:1xg:i:4 nm:i:6 md:z:2c41c2 yt:z:uu nh:i:3 cc:z:chr15 cp:i:102518319 xs:a:+ HI:i:0HWI-ST1001:137:C12FPACXX : 7:2313:17391:30032 272 CHR1 13494 1 51M * 0 0 actgcctggcgctgtgcccttcctttgctctgcccgctggagacagtgttt [email protected ] As:i:-3 xn:i:0 Xm:i:1 xo:i:0 Xg:i:0nm:i:1 md:z:44g6 yt:z:uu xs:a:+ nh:i:3 cc:z:chr15 cp:i:102517626 hi:i:0hwi-st1001:137:c12fpacxx : 7:1109:17518:53305 chr1 13528 1 51M * 0 0 CGCTGGAGCCGGTGTTTGTCATGGGCCTGGGCTGCAGGGATCCTGCTACAA ############ #AB =?:* B?; a?<2+233++; A+a2+<[email protected],a<a<=> as:i:-5 xn:i:0 Xm:i:2 xo:i:0 Xg:i:0nm:i:2 md:z:8a21t20 yt:z:uu xs:a:+ nh:i:4 cc:z:chr15 cp:i:102517592 hi:i:0 

Looks very similar to the FASTQ file, it also has read name, sequence, quality and other information, but not exactly the same. First, each read is only a row, but it is tabbed into a number of columns, a total of 12 columns, respectively, recorded:

1. Read name

2. Sam tags

3. Chromosome

4.5′ End Start position

5. MAPQ (mapping quality, describe the quality of the pair, the larger the number, the higher the specificity)

6. Cigar string, record insert, delete, mismatch and splice junctions (post-cut splicing connector)

7. Mate name, record mate pair information

8. Mate's Location

9. Length of the template

Ten. Read sequence

One. Read quality

12. Marking for programs

Obviously, the information that chromosome to cigar is very important. But it doesn't matter to us, we just need to know what the Sam/bam file is. It is important that you perform downstream operations. To manipulate the Sam/bam file, you first need to install Samtools. Its installation process, like all Linux/unix programs, generates an executable program after make, and then informs the system of its path, or places it where it can be found. Like what:

Tar zxvf samtools-0.1.18.tar.bz2 cd samtools-0.1.18/make samtoolpath= ' pwd ' Path=path: $samtoolpath

You can then follow the tools described on the Samtools home page for a variety of operations. Our most common steps, such as 0. Sam,bam Conversion

Samtools view-h file.bam > File.sam samtools view-b-S File.sam > File.bam

1. Sorting Bam file. Most downstream programs require that BAM files be queued.

Samtools sort File.bam Outputprefix

2. Create BAM Index. This is also required by most downstream procedures.

Samtools Index Sorted.bam

3. Index template genome. This is also required by most downstream procedures.

Samtools Faidx Homo_sapiens_assembly19.fasta

In many cases, we will also see a mapping file with the extension bed. Its specific format is also several changes, but now with the description of the UCSC prevail. Convert from Bam file to bed file, we need to install Bedtools. Download the installation is not much to say. Example of a command to convert from a BAM file to a bed file:

Bamtobed-i Reads.bam > reads.bed

More specific information can be found in its documentation. Of course, there are a number of formats to record the results of mapping, most of which are included in the UCSC help documentation. For example, the last time someone asked. BW is what file (bigwig file) and so on, you can find the answer there. When referring to the Fastq file last time, there were questions about its quality assessment, so how do you evaluate the results of mapping after mapping? The simplest is to evaluate the quality of mapping by Samtools.

Samtools Idxstats Aln.sorted.bam

Note that this step needs to go through sort and index. The results will show:

Chr1 195471971 6112404 0 chr10 130694993 3933316 0 chr11 122082543 6550325 0 chr12 120129022 3876527 0 chr13 120421639 551 1799 0 Chr14 124902244 3949332 0 chr15 104043685 3872649 0 chr16 98207768 6038669 0 chr17 94987271 13544866 0 chr18 907026 4739331 0 chr19 61431566 2706779 0 chr2 182113224 8517357 0 chr3 160039680 5647950 0 chr4 156508116 4880584 0 CHR5 1518 34684 6134814 0 chr6 149736546 7955095 0 chr7 145441459 5463859 0 chr8 129401213 5216734 0 chr9 124595110 7122219 0 ChrM 1 6299 1091260 0 Chrx 171031299 3248378 0 Chry 91744698 259078 0 * 0 0 0

The first column is the chromosome name, the second column is the sequence length, the third column is the mapped reads number, and the fourth column is the number of unmapped reads. If it is rnaseq, we can use Broad Institute RNA-SEQC to get a more complete report. After downloading to a file, you may need to install BWA for more accurate results, but you can analyze it if you do not install it. In general, this step does not require particularly accurate results, so I seldom use the BWA option. If the downloaded file is at the end of a. zip, simply change it to a. jar to run. Download the required example Rna-seq Data on its homepage. After the download is finished, unzip the extract. Next Run:

Samtools index Example/thousandreads.bam samtools faidx example/homo_sapiens_assembly19.fasta Java-xmx2048m-jar Rna-seqc_v1.1.7.jar-n 1000-s "testid|example/thousandreads.bam| Testdesc "-t example/gencode.v7.annotation_goodcontig.gtf-r example/homo_sapiens_assembly19.fasta-o./testReport/- Start GC-GC Example/gencode.v7.gc.txt

Only one of the above parameters is not the same as the description of the document is the use of-xmx2048m to specify the Java Virtual machine memory size of 2G. If you encounter Java.lang.OutOfMemoryError, you can also specify a larger size.

Of course, if it is your own file, it will take two more steps:

The genome names of 1.bam,reference and GTF files must be identical.

2. You need to use the createsequencedictionary in the Picard toolkit to build a dictionary file.

Originally from: http://pgfe.umassmed.edu/ou/archives/3050

Bioinformatics Exchange Forum Http://bbs.bbioo.com/forum-76-1.html

Sam/bam file Processing

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.