Last update: Feb 14, 2012
ERDS
(Estimation by Read Depth with SNVs)
Author: Mingfu Zhu.
Organization: Center for Human Genome Variation, Duke University School of Medicine.
Introduction
Download and release notes
Installation and usage
Format of output files
FAQs
Citing ERDS
Introduction
ERDS is free, open-source software
designed to infer copy number variants (CNVs) in high-coverage human genomes from next-generation
sequencing (NGS) data. When a CNV is present in a test genome, multiple signatures, weak
or strong, appear in the alignment data. ERDS starts from read depth (RD) information
and integrates other signatures to call CNVs sensitively and accurately.
It is important to read the FAQs before using ERDS. In particular, please note that
the current version of ERDS is NOT suitable for whole-exome data.
Download and release notes
An example project for impatient users.
Feb. 09 2012: ERDS1.04.01.
- Fixed a bug caused by different versions of the reference genome.
Jan. 12 2012: ERDS1.04.
Oct. 20 2011: ERDS1.04.
Aug. 09 2011: Important note: the current version of ERDS is not suitable for whole-exome data.
- ERDS estimates the expected RD of windows with a given GC percentage using an
Expectation-Maximization algorithm. In this step, ERDS assumes that the majority of RD signals
come from normal regions (CN=2). This assumption holds for whole-genome
sequence data but is invalid for exome data, where un-captured regions are the most abundant.
Oct. 19 2010: ERDS1.03.
- Special treatment for regions of segmental duplication (SD) has been implemented.
Jun. 28 2010: ERDS1.02.
Apr. 19 2010: ERDS1.01.
Installation and usage
The hmm packages were written in C and require compilation. Go to both hmm and phmm directories and type:
Using ERDS is straightforward as long as the parameters are set correctly. The input parameters
can be given in any order.
$perl $ERDS_dir/erds_pipeline.pl -e $ERDS_dir -n $sample_name -o $output_dir -b $bam_file -s $snps_indels_file -r $ref_file -a $ref_fai_file -v $ref_version OPTIONS
You must define the following parameters.
- -e <string>: specify the ERDS_dir
- -n <string>: specify the sample_name
- -o <string>: specify the output_dir
- -b <string>: specify the bam_file
- -s <string>: specify the variant file (can be in samtools format or in VCF format)
- -r <string>: specify the reference file
- -a <string>: specify the reference fai file
- -v <string>: specify any string indicating the version of the reference genome, e.g. NCBI_b37. It is used only to fill in the VCF output.
You can optionally define the following parameters. Indicator flags (those shown without <string>) take no argument.
- -c: if given, run on a cluster. By default, a single CPU is used.
- --vcfformat: if given, treat the input SNP/indel file as VCF format. By default, non-VCF format (samtools pileup format) is assumed.
- --non_sd: if given, apply no special treatment to SD regions. By default, SD regions are treated differently.
- --samtools <string>: specify the samtools to use. By default, the attached version 1.12 is used.
- --build <36|37>: specify the reference genome build, either 36 or 37. By default 36.
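For scripted runs, the flags listed above can be assembled programmatically. The following is a minimal Python sketch, not part of ERDS: the helper name build_erds_command and all paths in the example are hypothetical, and only flags documented on this page are used.

```python
import shlex


def build_erds_command(erds_dir, sample_name, output_dir, bam_file,
                       variant_file, ref_file, ref_fai_file, ref_version,
                       vcf_input=False, build=None):
    """Assemble the erds_pipeline.pl argument list from the documented flags.

    This helper is hypothetical; it only arranges the required parameters
    (-e, -n, -o, -b, -s, -r, -a, -v) and two optional ones.
    """
    cmd = [
        "perl", f"{erds_dir}/erds_pipeline.pl",
        "-e", erds_dir,
        "-n", sample_name,
        "-o", output_dir,
        "-b", bam_file,
        "-s", variant_file,
        "-r", ref_file,
        "-a", ref_fai_file,
        "-v", ref_version,
    ]
    if vcf_input:
        # Indicator flag: takes no argument.
        cmd.append("--vcfformat")
    if build is not None:
        if str(build) not in ("36", "37"):
            raise ValueError("build must be 36 or 37")
        cmd += ["--build", str(build)]
    return cmd


# Example with made-up paths; print the command for inspection before running it.
example_cmd = build_erds_command(
    erds_dir="/opt/erds",
    sample_name="sample1",
    output_dir="/data/erds_out",
    bam_file="/data/sample1.bam",
    variant_file="/data/sample1.vcf",
    ref_file="/ref/human_b37.fa",
    ref_fai_file="/ref/human_b37.fa.fai",
    ref_version="NCBI_b37",
    vcf_input=True,
    build=37,
)
print(shlex.join(example_cmd))
```

The resulting list can be passed to subprocess.run once the paths point to real files. Because the input parameters have no particular order, the ordering used here is only a convenience.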
You may also change the default settings of most parameters, either
by specifying a new parameter file or by giving individual parameters on the
command line. To get help, type:
$perl $ERDS_dir/erds_pipeline.pl
Format of output files
ERDS writes output files in .events and .vcf formats. Each row in the .events file corresponds to one detected CNV and has the following fields:
chromosome start end length CN_type summed_score precise_boundary reference_cn inferred_cn
- Scores are calculated using a Poisson model. Usually, the higher the score, the more reliable the CNV, but the length is also an important factor. Scores are missing for some small deletions. Those deletions were generated by scanning through low-coverage windows, and their scores would be less reliable even if they were calculated.
- Precise_boundary is 1 if ERDS was confident in inferring both the left and right boundaries at bp resolution, and 0 otherwise. It is always 0 for duplications.
- Reference_cn is the number of copies of the region's sequence that ERDS infers to be present in the reference genome. In some regions, this number differs between male and female genomes, even on autosomes.
- Inferred_cn is the number of copies of the region's sequence that ERDS infers to be present in the sequenced genome. It may equal reference_cn for a deletion because of repeated sequences or inaccurate boundaries.
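The row layout described above can be read with a few lines of Python. This is a minimal sketch under stated assumptions, not an official ERDS parser: it assumes the nine fields are whitespace-separated and that a missing score appears as a non-numeric placeholder token. The names ErdsEvent and parse_event_line, and the sample rows with their "DEL" label, are made up for illustration.

```python
import math
from typing import NamedTuple, Optional


class ErdsEvent(NamedTuple):
    chromosome: str
    start: int
    end: int
    length: int
    cn_type: str
    summed_score: Optional[float]   # None when the score is missing
    precise_boundary: bool          # True when the field is "1"
    reference_cn: Optional[float]
    inferred_cn: Optional[float]


def _to_float(text):
    """Return text as a float, or None if it is not a usable number."""
    try:
        value = float(text)
    except ValueError:
        return None
    return None if math.isnan(value) else value


def parse_event_line(line):
    """Parse one row of an .events file (assumed whitespace-separated)."""
    fields = line.split()
    if len(fields) != 9:
        raise ValueError(f"expected 9 fields, got {len(fields)}")
    chrom, start, end, length, cn_type, score, precise, ref_cn, inf_cn = fields
    return ErdsEvent(
        chromosome=chrom,
        start=int(start),
        end=int(end),
        length=int(length),
        cn_type=cn_type,
        summed_score=_to_float(score),
        precise_boundary=(precise == "1"),
        reference_cn=_to_float(ref_cn),
        inferred_cn=_to_float(inf_cn),
    )


# Made-up rows for illustration only; real ERDS output may differ in detail.
scored = parse_event_line("1 10000 20000 10000 DEL 35.2 1 2 1")
unscored = parse_event_line("1 50000 50400 400 DEL NA 0 2 1")
```

Keeping a missing score as None rather than 0 makes it easy to handle the small, unscored deletions separately when filtering calls by score and length.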
FAQs
Q1: What can ERDS do?
A1: ERDS can call deletions and duplications in whole-genome sequencing data from
human genomes sequenced at high coverage (above 20X recommended).
Q2: What can ERDS not do?
A2: ERDS does not work in the following situations:
(a) Non-human genomes.
(b) Human reference genomes other than build 36 or 37.
(c) Data other than whole-genome sequencing data.
(d) Sequencing coverage lower than 10X.
(e) Types of structural variation other than deletions and duplications.
Q3: How fast/slow is ERDS?
A3: The running time depends mainly on the machine, the network
connectivity to the drives, and the total number of reads in the alignment file.
For a sample sequenced at 40X with paired-end reads,
a read length of 100 bp, a library size of 300 bp, and data stored on a local drive, ERDS took
about 10 hours to call CNVs on a PC with 8 GB of memory.
Q4: I would like to specify more parameters for the input.
A4: You can manually change parameters in the parameter.txt file under the
software directory. This parameter.txt file serves as the template; the parameters
actually used by the program are those in the file parameter-$samplename.txt under the $output_dir.
Alternatively, please contact the author.
Some common requests will be addressed in later versions.
Q5: I received error messages when running ERDS.
A5: Please contact the author
with a copy of the error messages.
Q6: Will this tool be maintained regularly?
A6: For the foreseeable future, yes.
Q7: I was pissed off by ERDS. What are the other tools for detection of CNVs in whole genome sequencing data?
A7: Some well-known ones include, but are not limited to, Genome STRiP, CNVnator and BreakDancer.
Citing ERDS
Mingfu Zhu*, Anna C. Need*, Yujun Han, Dongliang Ge, Jessica M. Maia, Qianqian Zhu, Erin L. Heinzen,
Elizabeth T. Cirulli, Kimberly Pelak, Min He, Elizabeth K. Ruzzo, Curtis Gumbs, Abanish Singh, Sheng Feng,
Kevin V. Shianna and David B. Goldstein. Inferring copy number variants in high-coverage genomes using ERDS.
Submitted.