Jellyfish can't use fq.gz first to FQ. gunzip-c *.fq.gz > *.fq $ jellyfish count-t 30-c-M 21-s 150G--min-quality=20--quali Ty-start=33./*.FASTQ assume a haploid genome, for simplicity. In the picture provided, the first peak at depth ~31 indicates amount of 1-copy content (in other words, the genome have ex actly 1 copy of this kmer, so it is unique). The weak peak at ~62x indicates the amount of 2-copy content. Everything under ~11x can is assumed to being error kmers, unrelated to genome size.
So, to estimate manually, take the sum of the Counts of a unique kmers under the first peak and multiply by 1; Add the sum of the counts of unique kmers under the peak at 2x the depth of the first peak and multiply by 2; etc, for all peaks. This would give you the haploid genome size. If your genome is tetraploid, the actual size would be the quarter of your result, since the first peak would correspond to Mutat Ions present on only 1 ploidy (1/0/0/0 genotype).
You can make this and accurate by modelling the peaks as a sum of Gaussian curves, but that probably won ' t change the Res Ult much. Of course, this method was subjective because calling peaks is subjective.
Please note-i think 17-mers is too short for this kind of analysis. I prefer 31-mers because they is the longest computationally-efficient kmers. Also, FYI, Bbnorm is faster than jellyfish and can Also generate kmer-frequency histograms:
khist.sh IN=READS.FQ Hist=khist.txt
Also, it makes more sense to plot these things as log-log rather than linear-linear; And the y-axis should are count, not frequency, which are useless for the purpose of genome-size estimation. Outline
- Count K-mer occurence using jellyfish (jellyfish count)
- Summarize as histogram (jellyfish histo)
- Plot graph with R
- Determine the total number of K-mer analyzed and the peak position
- Compare the peak shape with Poisson distribution
Count K-mer occurence
In this example we had 5 pair of FASTQ files in three different subdirectories. The file to process can is specified with "*/*.QF.FASTQ" and veriied with LS.
$ ls */*.qf.fastqrun1/s_1_1_sequence.qf.fastq run2/s_2_2_sequence.qf.fastqrun1/s_1_2_sequence.qf.fastq RUN3/S_1_1_SEQUENCE.QF.FASTQRUN2/S_1_1_SEQUENCE.QF.FASTQ run3/s_1_2_sequence.qf.fastqrun2/s_1_2_ SEQUENCE.QF.FASTQ RUN3/S_2_1_SEQUENCE.QF.FASTQRUN2/S_2_1_SEQUENCE.QF.FASTQ run3/s_2_2_ Sequence.qf.fastq
Next, we issue the jellyfish Count command
Jellyfish count-t 8-c-M 25-s 5g-o spec1_25mer--min-quality=20--quality-start=33 */*.QF.FASTQ
-
-
-T 8
-
Specifies the number of threads to be
-
used. This value should is equal to the number of cores on the machine or the number of slots you reserved through job Managemen T system ($NSLOTS in SGE or Uge).
-
-
-C
-
-
Specifies the both strands is considered. If You don't specify this, the apparent depth would being half,---that's undesirable
-
-
-M 25
-
-
specified that is counting for the Mer (i.e, K=25)
-
-
-S 5G
-
Is
-
some kind of magical number specification of hash size. This should is as high as the physical memory allows. The higher the faster, but exceeding the available memory leads to failure or extremely slow counting.
-
-
-O Spec1_25mer
-
-
Specifies the prefix of output file names.
-
-
--quality-start=33
-
-
specified that your FASTQ file has a based quality value string. Be careful on the dataformat. There is cases that your data is based depending on the sequending system and software versions.
This is relevant if you specify--min-quality
-
-
--min-quality=20
-
-
specifies that nucleotide had qv lower than the should not included in the count. This selection reduces the k-mers derived from sequence errors and make the peak clearer.
-
-
*/*.qf.fastq
-
Would be expanded to the
-
ten filenames explained above by the shell and passed to jellyfish as input files
Summarize as histogram (jellyfish histo)
First confirm that got the output file
$ ls spec1_25mer*spec1_25mer_0
Now, there is a single file spec1_25mer_0
$ jellyfish Histo-o Spec1_25mer.histo spec1_25mer_0
Confirm that got the output
$ ls spec1_25mer*spec1_25mer_0 Spec1_25mer.histo
Examine the numbers by your eyes
$ head-25 spec1_25mer.histo1 4619385832 956060443 192804774 138367545 110184806 95550907 85579358 78632449 731950510 6920 88011 658972312 632192313 614863814 603612015 597226416 596223417 598769618 605117119 615442920 629737321 648513522 670057 923 693257024 721762725 7533211
Error running:
Terminate called after throwing an instance of ' Jellyfish::invertible_hash::errorallocation '
What (): Failed to allocate 628292358736 bytes of memoryaborted (core dumped) Resolution: 50G should is more than enough. That's the amount of memory I usually am using and I have a never had any problems. This also means so you probably does not need the high memory nodes. Freemaofafu
Jellyfish K-mer analysis and genome size estimate