Jellyfish K-mer analysis and genome size estimate

Source: Internet
Author: User

Jellyfish can't use fq.gz first to FQ. gunzip-c *.fq.gz > *.fq $ jellyfish count-t 30-c-M 21-s 150G--min-quality=20--quali Ty-start=33./*.FASTQ assume a haploid genome, for simplicity. In the picture provided, the first peak at depth ~31 indicates amount of 1-copy content (in other words, the genome have ex actly 1 copy of this kmer, so it is unique). The weak peak at ~62x indicates the amount of 2-copy content. Everything under ~11x can is assumed to being error kmers, unrelated to genome size.

So, to estimate manually, take the sum of the Counts of a unique kmers under the first peak and multiply by 1; Add the sum of the counts of unique kmers under the peak at 2x the depth of the first peak and multiply by 2; etc, for all peaks. This would give you the haploid genome size. If your genome is tetraploid, the actual size would be the quarter of your result, since the first peak would correspond to Mutat Ions present on only 1 ploidy (1/0/0/0 genotype).

You can make this and accurate by modelling the peaks as a sum of Gaussian curves, but that probably won ' t change the Res Ult much. Of course, this method was subjective because calling peaks is subjective.

Please note-i think 17-mers is too short for this kind of analysis. I prefer 31-mers because they is the longest computationally-efficient kmers. Also, FYI, Bbnorm is faster than jellyfish and can Also generate kmer-frequency histograms:

khist.sh IN=READS.FQ Hist=khist.txt

Also, it makes more sense to plot these things as log-log rather than linear-linear; And the y-axis should are count, not frequency, which are useless for the purpose of genome-size estimation. Outline
    1. Count K-mer occurence using jellyfish (jellyfish count)
    2. Summarize as histogram (jellyfish histo)
    3. Plot graph with R
    4. Determine the total number of K-mer analyzed and the peak position
    5. Compare the peak shape with Poisson distribution
Count K-mer occurence

In this example we had 5 pair of FASTQ files in three different subdirectories. The file to process can is specified with "*/*.QF.FASTQ" and veriied with LS.

$ ls */*.qf.fastqrun1/s_1_1_sequence.qf.fastq  run2/s_2_2_sequence.qf.fastqrun1/s_1_2_sequence.qf.fastq  RUN3/S_1_1_SEQUENCE.QF.FASTQRUN2/S_1_1_SEQUENCE.QF.FASTQ  run3/s_1_2_sequence.qf.fastqrun2/s_1_2_ SEQUENCE.QF.FASTQ  RUN3/S_2_1_SEQUENCE.QF.FASTQRUN2/S_2_1_SEQUENCE.QF.FASTQ  run3/s_2_2_ Sequence.qf.fastq

Next, we issue the jellyfish Count command

Jellyfish count-t 8-c-M 25-s 5g-o spec1_25mer--min-quality=20--quality-start=33 */*.QF.FASTQ
-T 8
Specifies the number of threads to be
used. This value should is equal to the number of cores on the machine or the number of slots you reserved through job Managemen T system ($NSLOTS in SGE or Uge).
-C
Specifies the both strands is considered. If You don't specify this, the apparent depth would being half,---that's undesirable
-M 25
specified that is counting for the Mer (i.e, K=25)
-S 5G
Is
some kind of magical number specification of hash size. This should is as high as the physical memory allows. The higher the faster, but exceeding the available memory leads to failure or extremely slow counting.
-O Spec1_25mer
Specifies the prefix of output file names.
--quality-start=33
specified that your FASTQ file has a based quality value string. Be careful on the dataformat. There is cases that your data is based depending on the sequending system and software versions.
This is relevant if you specify--min-quality
--min-quality=20
specifies that nucleotide had qv lower than the should not included in the count. This selection reduces the k-mers derived from sequence errors and make the peak clearer.
*/*.qf.fastq
Would be expanded to the
ten filenames explained above by the shell and passed to jellyfish as input files
Summarize as histogram (jellyfish histo)

First confirm that got the output file

$ ls spec1_25mer*spec1_25mer_0

Now, there is a single file spec1_25mer_0

$ jellyfish Histo-o Spec1_25mer.histo spec1_25mer_0

Confirm that got the output

$ ls spec1_25mer*spec1_25mer_0  Spec1_25mer.histo

Examine the numbers by your eyes

$ head-25 spec1_25mer.histo1 4619385832 956060443 192804774 138367545 110184806 95550907 85579358 78632449 731950510 6920 88011 658972312 632192313 614863814 603612015 597226416 596223417 598769618 605117119 615442920 629737321 648513522 670057 923 693257024 721762725 7533211
Error running:
Terminate called after throwing an instance of ' Jellyfish::invertible_hash::errorallocation '
What (): Failed to allocate 628292358736 bytes of memoryaborted (core dumped) Resolution: 50G should is more than enough. That's the amount of memory I usually am using and I have a never had any problems.     This also means so you probably does not need the high memory nodes. Freemaofafu

Jellyfish K-mer analysis and genome size estimate

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.