Last update: Feb 14, 2012

ERDS
(Estimation by Read Depth with SNVs)
Author: Mingfu Zhu.
Organization: Center for Human Genome Variation, Duke University School of Medicine.


  • Introduction
  • Download and release notes
  • Installation and usage
  • Format of output files
  • FAQs
  • Citing ERDS

    Introduction

    ERDS is free, open-source software designed for inferring copy number variants (CNVs) in high-coverage human genomes using next-generation sequencing (NGS) data. When a CNV is present in a test genome, multiple signatures, weak or strong, appear in the alignment data. ERDS starts from read depth (RD) information and integrates the other signatures to call CNVs sensitively and accurately.
    It is important to read the FAQs before using ERDS. In particular, please note that the current version of ERDS is NOT suitable for whole-exome data.

    Download and release notes

    An example project for impatient users.

    Feb. 09 2012: ERDS1.04.01.

    • Fixed a bug caused by different versions of the reference genome.

    Jan. 12 2012: ERDS1.04.

    Oct. 20 2011: ERDS1.04.

    • Released in house.

    Aug. 09 2011: Important note: The current version of ERDS is not for whole-exome data.

    • ERDS estimates the expected RD of windows with a given GC percentage using an Expectation-Maximization algorithm. In this step ERDS assumes that the majority of RD signals are contributed by normal regions (CN=2). This assumption holds for whole-genome sequence data but is invalid for exome data, where uncaptured regions are the most abundant.
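
    To see why the CN=2 assumption matters, here is a minimal sketch. It is not the Expectation-Maximization procedure ERDS uses; it swaps in a plain per-GC-bin median as a stand-in, and the function name and all depth values are made up for illustration. The median tracks the normal depth only when most windows in a bin are copy-number normal.

```python
from collections import defaultdict
from statistics import median


def expected_depth_by_gc(windows):
    """Return {gc_percent: typical read depth} from (gc_percent, read_depth) pairs.

    Stand-in for an EM estimate: the median of each GC bin, which matches the
    CN=2 depth only if most windows in that bin are copy-number normal.
    """
    bins = defaultdict(list)
    for gc_percent, read_depth in windows:
        bins[gc_percent].append(read_depth)
    return {gc: median(depths) for gc, depths in bins.items()}


# Hypothetical whole-genome-like bin: mostly CN=2 windows (about 40 reads),
# plus one deletion-like and one duplication-like window.
genome_like = [(40, d) for d in (39, 41, 40, 42, 38, 20, 60)]

# Hypothetical exome-like bin: mostly uncaptured windows with near-zero depth.
exome_like = [(40, d) for d in (0, 0, 1, 0, 0, 40, 41)]

print(expected_depth_by_gc(genome_like))  # typical depth stays near the CN=2 level
print(expected_depth_by_gc(exome_like))   # typical depth collapses toward zero
```

    In the exome-like case the "typical" depth is set by the uncaptured windows, so properly captured CN=2 regions would look like gains.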

    Oct. 19 2010: ERDS1.03.

    • Special treatment for regions of segmental duplication (SD) is now implemented.

    Jun. 28 2010: ERDS1.02.

    • This version was used to analyze CNVs in 20 samples in Pelak et al.

    Apr. 19 2010: ERDS1.01.


    Installation and usage

    The hmm packages are written in C and must be compiled. In each of the hmm and phmm directories, type:

      $make

    Using ERDS is straightforward once the parameters are set correctly. Input parameters may be given in any order.

      $perl $ERDS_dir/erds_pipeline.pl -e $ERDS_dir -n $sample_name -o $output_dir -b $bam_file -s $snps_indels_file -r $ref_file -a $ref_fai_file -v $ref_version OPTIONS

      You must define the following parameters.
      • -e <string>: specify the ERDS_dir
      • -n <string>: specify the sample_name
      • -o <string>: specify the output_dir
      • -b <string>: specify the bam_file
      • -s <string>: specify the variant file (can be in samtools format or in VCF format).
      • -r <string>: specify the reference file
      • -a <string>: specify the reference fai file
      • -v <string>: specify any string indicating the version of the reference genome, e.g. NCBI_b37. Used only to fill in the VCF output.

      You can optionally define the following parameters. Indicator flags take no following string.

      • -c: if indicated, run on a cluster. By default, run on a single CPU.
      • --vcfformat: if indicated, treat the input SNP/indel file as VCF format. By default, non-VCF format (samtools pileup format) is assumed.
      • --non_sd: if indicated, apply no special treatment to SD regions. By default, SD regions are treated differently.
      • --samtools <string>: specify the samtools to use. By default, version 1.12 as attached.
      • --build <36|37>: specify the reference build, either 36 or 37. By default 36.
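
      For scripted runs, the parameters listed above can be assembled programmatically. The following is a minimal sketch, assuming Python 3.8 or later. The helper name build_erds_command and every path and sample name in it are hypothetical, and it only prints the command rather than running ERDS.

```python
import shlex


def build_erds_command(erds_dir, sample_name, output_dir, bam_file,
                       variant_file, ref_file, ref_fai_file, ref_version,
                       vcf_input=False, build=None):
    """Assemble the erds_pipeline.pl argument list from the documented flags."""
    cmd = [
        "perl", f"{erds_dir}/erds_pipeline.pl",
        "-e", erds_dir,
        "-n", sample_name,
        "-o", output_dir,
        "-b", bam_file,
        "-s", variant_file,
        "-r", ref_file,
        "-a", ref_fai_file,
        "-v", ref_version,
    ]
    if vcf_input:
        # Indicator flag: takes no following string.
        cmd.append("--vcfformat")
    if build is not None:
        if str(build) not in ("36", "37"):
            raise ValueError("build must be 36 or 37")
        cmd += ["--build", str(build)]
    return cmd


# All paths and names below are placeholders, not files shipped with ERDS.
cmd = build_erds_command(
    erds_dir="/opt/erds",
    sample_name="sample01",
    output_dir="/data/erds_out/sample01",
    bam_file="/data/bam/sample01.bam",
    variant_file="/data/variants/sample01.vcf",
    ref_file="/data/ref/human_b37.fasta",
    ref_fai_file="/data/ref/human_b37.fasta.fai",
    ref_version="NCBI_b37",
    vcf_input=True,
    build=37,
)
print(shlex.join(cmd))
```

      Passing the list to subprocess.run(cmd, check=True) would launch the pipeline; building a list instead of one long string avoids shell-quoting problems with unusual file names.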

      You may also change the default settings of most parameters, either by specifying a new parameter file or by giving individual parameters on the command line. To get help, type:

      • $perl $ERDS_dir/erds_pipeline.pl

    Format of output files

    ERDS writes output files in .events and .vcf formats. Each row in the events file corresponds to one detected CNV and has the following fields:

      chromosome start end length CN_type summed_score precise_boundary reference_cn inferred_cn

    • Scores are calculated using a Poisson model. Usually, the higher the score, the more reliable the CNV, but the length is also an important factor. Scores are missing for some small deletions. Those deletions were generated by scanning through low-coverage windows, and their scores would be less reliable even if they were calculated.

    • Precise_boundary is 1 if ERDS is confident in both the left and right boundaries at base-pair resolution, and 0 otherwise. It is always 0 for duplications.

    • Reference_cn is the number of copies of the region's sequence that ERDS estimates to be present in the reference genome. In some regions this number differs between male and female genomes, even on autosomes.

    • Inferred_cn is the number of copies of the region's sequence that ERDS estimates to be present in the sequenced genome. It may equal reference_cn for a deletion because of repeated sequences or inaccurate boundaries.
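
    A row in the events file can be read into a small record for downstream filtering. The sketch below rests on several assumptions that the text above does not confirm: fields are whitespace-separated, a missing score appears as a non-numeric placeholder such as NA, and CN_type uses labels like DEL and DUP. The sample rows are made up and are not real ERDS output.

```python
from typing import NamedTuple, Optional


class CnvEvent(NamedTuple):
    chromosome: str
    start: int
    end: int
    length: int
    cn_type: str
    summed_score: Optional[float]  # None when the score is missing
    precise_boundary: bool
    reference_cn: float
    inferred_cn: float


def _to_float_or_none(text):
    """Parse a number, mapping any non-numeric placeholder to None."""
    try:
        return float(text)
    except ValueError:
        return None


def parse_event_line(line):
    """Parse one events row, assuming nine whitespace-separated fields."""
    fields = line.split()
    if len(fields) != 9:
        raise ValueError(f"expected 9 fields, got {len(fields)}")
    chrom, start, end, length, cn_type, score, precise, ref_cn, inf_cn = fields
    return CnvEvent(
        chrom, int(start), int(end), int(length), cn_type,
        _to_float_or_none(score), precise == "1",
        float(ref_cn), float(inf_cn),
    )


# Made-up rows for illustration only.
sample = """\
1 1000001 1005000 5000 DEL 123.4 1 2 1
7 2000001 2040000 40000 DUP 88.0 0 2 3
X 500001 500600 600 DEL NA 0 2 0
"""

events = [parse_event_line(row) for row in sample.splitlines() if row.strip()]
confident = [e for e in events if e.precise_boundary]
print(len(events), "events,", len(confident), "with precise boundaries")
```

    Keeping the missing score as None rather than 0 stops small deletions without a score from being silently ranked as the least reliable calls.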

    FAQs

    Q1: What can ERDS do?
    A1: ERDS can call deletions and duplications for whole genome sequencing data of human genomes sequenced at high coverage (above 20X recommended).

    Q2: What can ERDS not do?
    A2: ERDS does not work in the following situations:
    (a) Non-human genomes.
    (b) Human data aligned to a reference genome build other than 36 or 37.
    (c) Data other than whole-genome sequencing data.
    (d) Sequencing coverage lower than 10X.
    (e) Types of structural variation other than deletions and duplications.

    Q3: How fast/slow is ERDS?
    A3: The running time depends mainly on the machine, the network connectivity to the drives, and the total number of reads in the alignment file. For a sample sequenced at 40X with paired-end reads, read length 100bp, library size 300bp and data stored on a local drive, it took ERDS about 10 hours to call CNVs on a PC with 8 GB of memory.

    Q4: I would like to specify more parameters for the input.
    A4: You can manually change parameters in the parameter.txt file under the software directory. This parameter.txt file serves as the blueprint; the parameters actually used by the program are those in the file parameter-$samplename.txt under the $output_dir. Alternatively, please contact the author. Some common requests will be addressed in later versions.

    Q5: I received error messages when running ERDS.
    A5: Please contact the author with a copy of the error messages.

    Q6: Will this tool be maintained regularly?
    A6: In the foreseeable future, yes.

    Q7: I was pissed off by ERDS. What are the other tools for detection of CNVs in whole genome sequencing data?
    A7: Some well-known ones include, but are not limited to, Genome STRiP, CNVnator and BreakDancer.

    Citing ERDS

    Mingfu Zhu*, Anna C. Need*, Yujun Han, Dongliang Ge, Jessica M. Maia, Qianqian Zhu, Erin L. Heinzen, Elizabeth T. Cirulli, Kimberly Pelak, Min He, Elizabeth K. Ruzzo, Curtis Gumbs, Abanish Singh, Sheng Feng, Kevin V. Shianna and David B. Goldstein. Inferring copy number variants in high-coverage genomes using ERDS. Submitted.