Jellyfish K-mer analysis and genome size estimate

Last Update:2015-09-29 Source: Internet

Author: User

Developer on Alibaba Coud: Build your first app with APIs, SDKs, and tutorials on the Alibaba Cloud. Read more ＞

Jellyfish can't use fq.gz first to FQ. gunzip-c *.fq.gz > *.fq $ jellyfish count-t 30-c-M 21-s 150G--min-quality=20--quali Ty-start=33./*.FASTQ assume a haploid genome, for simplicity. In the picture provided, the first peak at depth ~31 indicates amount of 1-copy content (in other words, the genome have ex actly 1 copy of this kmer, so it is unique). The weak peak at ~62x indicates the amount of 2-copy content. Everything under ~11x can is assumed to being error kmers, unrelated to genome size.

So, to estimate manually, take the sum of the Counts of a unique kmers under the first peak and multiply by 1; Add the sum of the counts of unique kmers under the peak at 2x the depth of the first peak and multiply by 2; etc, for all peaks. This would give you the haploid genome size. If your genome is tetraploid, the actual size would be the quarter of your result, since the first peak would correspond to Mutat Ions present on only 1 ploidy (1/0/0/0 genotype).

You can make this and accurate by modelling the peaks as a sum of Gaussian curves, but that probably won ' t change the Res Ult much. Of course, this method was subjective because calling peaks is subjective.

Please note-i think 17-mers is too short for this kind of analysis. I prefer 31-mers because they is the longest computationally-efficient kmers. Also, FYI, Bbnorm is faster than jellyfish and can Also generate kmer-frequency histograms:

khist.sh IN=READS.FQ Hist=khist.txt

Also, it makes more sense to plot these things as log-log rather than linear-linear; And the y-axis should are count, not frequency, which are useless for the purpose of genome-size estimation. Outline

Count K-mer occurence using jellyfish (jellyfish count)
Summarize as histogram (jellyfish histo)
Plot graph with R
Determine the total number of K-mer analyzed and the peak position
Compare the peak shape with Poisson distribution

Count K-mer occurence

In this example we had 5 pair of FASTQ files in three different subdirectories. The file to process can is specified with "*/*.QF.FASTQ" and veriied with LS.

$ ls */*.qf.fastqrun1/s_1_1_sequence.qf.fastq  run2/s_2_2_sequence.qf.fastqrun1/s_1_2_sequence.qf.fastq  RUN3/S_1_1_SEQUENCE.QF.FASTQRUN2/S_1_1_SEQUENCE.QF.FASTQ  run3/s_1_2_sequence.qf.fastqrun2/s_1_2_ SEQUENCE.QF.FASTQ  RUN3/S_2_1_SEQUENCE.QF.FASTQRUN2/S_2_1_SEQUENCE.QF.FASTQ  run3/s_2_2_ Sequence.qf.fastq

Next, we issue the jellyfish Count command

Jellyfish count-t 8-c-M 25-s 5g-o spec1_25mer--min-quality=20--quality-start=33 */*.QF.FASTQ


-T 8: Specifies the number of threads to be; used. This value should is equal to the number of cores on the machine or the number of slots you reserved through job Managemen T system ($NSLOTS in SGE or Uge).

-C: Specifies the both strands is considered. If You don't specify this, the apparent depth would being half,---that's undesirable

-M 25: specified that is counting for the Mer (i.e, K=25)

-S 5G: Is; some kind of magical number specification of hash size. This should is as high as the physical memory allows. The higher the faster, but exceeding the available memory leads to failure or extremely slow counting.

-O Spec1_25mer: Specifies the prefix of output file names.

--quality-start=33: specified that your FASTQ file has a based quality value string. Be careful on the dataformat. There is cases that your data is based depending on the sequending system and software versions.

--min-quality=20: specifies that nucleotide had qv lower than the should not included in the count. This selection reduces the k-mers derived from sequence errors and make the peak clearer.

*/*.qf.fastq: Would be expanded to the; ten filenames explained above by the shell and passed to jellyfish as input files

Summarize as histogram (jellyfish histo)

First confirm that got the output file

$ ls spec1_25mer*spec1_25mer_0

Now, there is a single file spec1_25mer_0

$ jellyfish Histo-o Spec1_25mer.histo spec1_25mer_0

Confirm that got the output

$ ls spec1_25mer*spec1_25mer_0  Spec1_25mer.histo

Examine the numbers by your eyes

$ head-25 spec1_25mer.histo1 4619385832 956060443 192804774 138367545 110184806 95550907 85579358 78632449 731950510 6920 88011 658972312 632192313 614863814 603612015 597226416 596223417 598769618 605117119 615442920 629737321 648513522 670057 923 693257024 721762725 7533211

Error running:
Terminate called after throwing an instance of ' Jellyfish::invertible_hash::errorallocation '
What (): Failed to allocate 628292358736 bytes of memoryaborted (core dumped) Resolution: 50G should is more than enough. That's the amount of memory I usually am using and I have a never had any problems. This also means so you probably does not need the high memory nodes. Freemaofafu

Jellyfish K-mer analysis and genome size estimate

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

Get Started for Free

Sales Support

1 on 1 presale consultation

Chat Contact Sales
After-Sales Support

24/7 Technical Support 6 Free Tickets per Quarter Faster Response

Open a Ticket
Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.

Learn More

Jellyfish K-mer analysis and genome size estimate

Contact Us

What's Trending

Top 10 Tags

Top 10 Keywords

A Free Trial That Lets You Build Big!

Sales Support

After-Sales Support

Jellyfish K-mer analysis and genome size estimate

Contact Us

What's Trending

Top 10 Tags

Top 10 Keywords

Trending Topic

A Free Trial That Lets You Build Big!

Sales Support

After-Sales Support