Why you should QC your reads and your assembly?

Carp genome: http://www.ntv.cn/a/20140923/52953.shtml (the data quality control behind the carp genome sequencing was questioned).
Original post: "Why you should QC your reads and your assembly?" by Graham Etherington, http://grahametherington.blogspot.co.uk/2014/09/why-you-should-qc-your-reads-and-your.html

The genome sequence of the common carp (Cyprinus carpio) was published in Nature last week. By coincidence, I was doing some QC on some domesticated ferret (Mustela putorius furo) reads, which had thrown some kmer warnings in the FastQC tool. I BLASTed the kmers and was quite perplexed by the number of hits that I found in the carp genome: nearly all of the first 150 hits were from it. I looked a bit further into my odd kmers, and it turns out that they were the ends of some Illumina adapter sequences that had presumably been incorporated into the paired-end reads on the shorter ends of the insert size. This then took me back to the carp genome. What had crept into that?


In the paper, the authors state that they used 454, Illumina and SOLiD sequencing and also used some previously published BAC-end sequences. The BAC-end and 454 sequences were assembled with the Celera Assembler, and the Illumina, SOLiD and 454 8 kb mate-pair sequences were mapped to the assembly to construct the scaffolds. Finally, they used the paired-end information from the short paired-end reads to fill the gaps between the scaffolds. The final assembly consists of 9377 scaffolds.

The only quality control they mention is "We then filtered out low-quality and short reads to obtain a set of usable reads".

So I thought I'd look at what was actually in their assembly. I downloaded the carp genome assembly (9377 scaffolds) and created a BLAST database from it. I then created a FASTA file of Illumina adapter sequences (found here) and used them as query sequences to BLAST against the carp genome. There is some redundancy in the Illumina adapter sequences, so I collapsed them, retaining only unique sequences, and then removed any adapter sequences that were sub-sequences of longer adapters (the final file consisted of 81 sequences). The BLAST search resulted in 3750 hits (e-value < 8.00e-06), of which 1009 had 100% identity.
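The collapsing step described above can be sketched in a few lines of Python. This is only an illustration, not the script used for the post: the function name `collapse_adapters` is made up, the input is assumed to be a plain list of sequence strings already read from the FASTA file, and reverse complements are not considered.

```python
def collapse_adapters(seqs):
    """Return unique sequences that are not contained in a longer kept sequence.

    Hypothetical helper for illustration; assumes plain DNA strings as input.
    """
    # Normalise case and drop exact duplicates.
    unique = {s.upper() for s in seqs if s}
    # Visit the longest sequences first, so each candidate is only ever
    # compared against sequences of the same length or longer.
    ordered = sorted(unique, key=lambda s: (-len(s), s))
    kept = []
    for candidate in ordered:
        # Discard the candidate if it is a sub-sequence of something kept.
        if not any(candidate in longer for longer in kept):
            kept.append(candidate)
    return kept


if __name__ == "__main__":
    # Toy sequences, not real adapters.
    example = ["ACGT", "acgt", "CG", "TTACGTAA", "GGG"]
    print(collapse_adapters(example))
```

Sorting by length first keeps the comparison simple: a sequence can only be a sub-sequence of something at least as long as itself, and exact duplicates have already been removed by the set.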

This gave me a final tally of at least 20 Illumina adapter sequences incorporated into the final common carp genome assembly. Out of the 9377 scaffolds, 277 appear to have Illumina adapter sequences in them. I've included the counts of the different Illumina adapter sequences (non-redundant) for the scaffolds at the bottom of the page.
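For anyone wanting to repeat this kind of tally, here is a rough sketch of how the hits could be summarised. It assumes the BLAST results were saved in the standard 12-column tabular format (query, subject, percent identity, alignment length, mismatches, gap opens, query start, query end, subject start, subject end, e-value, bit score); the output format actually used for the post is not stated. The function name `summarise_hits` and the example lines are invented for illustration.

```python
from collections import defaultdict


def summarise_hits(lines, max_evalue=8e-06):
    """Count hits, 100%-identity hits, and the adapters found per scaffold.

    Hypothetical helper; assumes tab-separated 12-column BLAST output.
    """
    total = 0
    perfect = 0
    adapters_by_scaffold = defaultdict(set)
    for line in lines:
        line = line.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        adapter, scaffold = fields[0], fields[1]
        identity = float(fields[2])
        evalue = float(fields[10])
        # Keep only hits below the e-value cut-off quoted in the post.
        if evalue >= max_evalue:
            continue
        total += 1
        if identity == 100.0:
            perfect += 1
        adapters_by_scaffold[scaffold].add(adapter)
    return total, perfect, dict(adapters_by_scaffold)


if __name__ == "__main__":
    # Made-up hits, not taken from the carp results.
    example = [
        "adapterA\tscaffold1\t100.00\t33\t0\t0\t1\t33\t500\t532\t1e-10\t62.1",
        "adapterB\tscaffold1\t96.97\t33\t1\t0\t1\t33\t900\t932\t2e-07\t55.0",
        "adapterA\tscaffold2\t100.00\t20\t0\t0\t1\t20\t10\t29\t1e-03\t30.0",
    ]
    total, perfect, by_scaffold = summarise_hits(example)
    print(total, perfect, len(by_scaffold))
```

The number of keys in the returned dictionary is the number of scaffolds with at least one adapter hit, and the union of its values gives the distinct adapters found. Whether the figures in the post were counted over all hits or only the 100% identity ones is not stated, so this sketch simply counts every hit that passes the e-value filter.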

I've not looked for adapter sequences used in SOLiD or 454 sequencing yet. It would be interesting to see what that throws up.

So, a lesson to be learned here: QC your assembly, especially if you're not overly stringent with your read QC.


Here's the data:
Common carp genome scaffolds
Illumina adapter sequences
Illumina adapter sequences collapsed
Illumina adapters v carp genome blast
