Pool-HMM

Authors

  • S. Boitard (contact author), INRA
  • D. Robelin, INRA
  • R. Kofler, Institut für Populationsgenetik, Vetmeduni Vienna, Vienna, Austria.

Description

This program estimates allele frequencies and detects selective sweeps, using NGS data from a pool of individuals sampled from the same population. It implements the derivations of Boitard et al. (2012).

The estimation of allele frequencies is based on a probabilistic model, which accounts for differences of coverage and base quality among genomic positions. Using this probabilistic model, the program can estimate the allele frequency spectrum in any genomic region specified by the user. The allele frequency spectrum can also be estimated for any type of annotated feature (e.g. introns), using the script filter-pileup-by-feature.py.

The detection of selective sweeps is based on a Hidden Markov Model (HMM). In this model, each polymorphic site on the genome is assumed to have a hidden state, which can take one of three values: "Neutral", "Intermediate" or "Selection". These hidden states are inferred from the observed data, and the sites whose inferred hidden state is "Selection" are the sweep candidates.
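The decoding step of such a three-state HMM can be sketched as follows. This is only a minimal illustration, not the code of pool-hmm.py: the function names, the uniform initial distribution and the form of the transition matrix (a site leaves its current state with total probability k, split evenly between the two other states) are assumptions made for the example, and the emission probabilities are supplied by the caller.

```python
import math

STATES = ("Neutral", "Intermediate", "Selection")


def transition_matrix(k):
    """Assumed form: stay with probability 1 - k, move to each other state with k / 2."""
    n = len(STATES)
    return [[1.0 - k if i == j else k / (n - 1) for j in range(n)]
            for i in range(n)]


def viterbi(emissions, k, initial=None):
    """Return the most likely sequence of hidden states (Viterbi algorithm).

    emissions[t][s] is the probability of the data observed at site t given
    hidden state s; all values must be strictly positive, and 0 < k < 1.
    """
    n = len(STATES)
    if not emissions:
        return []
    if initial is None:
        initial = [1.0 / n] * n  # assumption: uniform initial distribution
    log_a = [[math.log(p) for p in row] for row in transition_matrix(k)]
    # Log-probability of the best path ending in each state at the first site.
    score = [math.log(initial[s]) + math.log(emissions[0][s]) for s in range(n)]
    back = []
    for obs in emissions[1:]:
        prev = score
        pointers = []
        score = []
        for s in range(n):
            # Best predecessor state for state s at the current site.
            best = max(range(n), key=lambda r: prev[r] + log_a[r][s])
            pointers.append(best)
            score.append(prev[best] + log_a[best][s] + math.log(obs[s]))
        back.append(pointers)
    # Trace the best path backwards from the best final state.
    state = max(range(n), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return [STATES[s] for s in path]
```

With a small k, the decoded path changes state only where the emission probabilities of several consecutive sites clearly favour another state, which is why isolated outlier sites are not reported as sweeps.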

Source code and detailed documentation

https://forge-dga.jouy.inra.fr/projects/pool-hmm

Infile

The NGS data in pileup format (samtools).

Outfiles

  • spectrum : the allele frequency spectrum estimated from the sample.
  • emit : emission probabilities of the HMM.
  • pred : predicted hidden states.
  • stat : summary of the predicted sweep windows.
  • post : posterior probabilities of hidden state "Selection".
  • estim : maximum a posteriori estimates of the number of derived (or minor, depending on the option used) alleles in the sample (from 0 to n).

Needs

  • This program needs Python version 2.5 or later, and strictly earlier than 3.
  • NumPy and SciPy have to be installed (you can download NumPy and SciPy from here : http://new.scipy.org/download.html)

Usage

  • Open a terminal and go into the directory where the Python code is stored.
  • Put the pileup file in this same directory.
  • Execute the command "python pool-hmm.py" followed by a list of input parameters.
  • The definition and usage of all input parameters can be obtained by executing the command "python pool-hmm.py -h"

Example

The file test_droso.pileup was obtained from a pool of 194 Drosophila haplotypes (97 flies). To detect selective sweeps from these data, the command could be :

 $ python pool-hmm.py --prefix test_droso -n 194 --pred -k 0.001 --theta 0.005

Parameter theta specifies a prior value for theta=4N*mu in the studied population, and parameter k specifies the transition rate between hidden states in the HMM.
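To give a feeling for the order of magnitude of theta, the sketch below assumes that theta denotes the usual population-scaled mutation rate per site, 4*N*mu, with N the effective population size and mu the per-site mutation rate. The function names and the numerical values of N and mu are hypothetical and chosen only so that the product equals 0.005. The second function uses Watterson's classical result for the expected proportion of polymorphic sites in a sample of n haplotypes; it is not taken from pool-hmm.py.

```python
def theta_per_site(effective_size, mutation_rate):
    """Scaled mutation rate theta = 4 * N * mu for a diploid population."""
    return 4.0 * effective_size * mutation_rate


def expected_polymorphic_fraction(n, theta):
    """Expected proportion of polymorphic sites among n sampled haplotypes
    under the standard neutral model: theta * sum_{i=1}^{n-1} 1/i."""
    return theta * sum(1.0 / i for i in range(1, n))


# Hypothetical values: N = 1,000,000 and mu = 1.25e-9 per site and generation.
theta = theta_per_site(1.0e6, 1.25e-9)
# Proportion of sites expected to be polymorphic in a sample of 194 haplotypes.
fraction = expected_polymorphic_fraction(194, theta)
```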

Parallelization

To speed up the computations, the program can be run on several processes simultaneously. The number of processes specified on the command line is not bounded, but the number of processes that can actually work at the same time is limited by the number of cores of the computer on which the program is run.
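The following sketch illustrates this limit. It is a generic illustration and not the scheduling code of pool-hmm.py: both function names are hypothetical, and splitting the work into contiguous blocks of sites is an assumption made for the example.

```python
import multiprocessing


def effective_processes(requested):
    """Number of processes that can really run at the same time."""
    return max(1, min(requested, multiprocessing.cpu_count()))


def split_into_chunks(n_items, n_chunks):
    """Return (start, end) index pairs covering range(n_items) as evenly as possible."""
    n_chunks = max(1, min(n_chunks, n_items))
    base, extra = divmod(n_items, n_chunks)
    bounds = []
    start = 0
    for c in range(n_chunks):
        # The first `extra` chunks receive one additional item.
        end = start + base + (1 if c < extra else 0)
        bounds.append((start, end))
        start = end
    return bounds
```

Requesting more processes than the value returned by effective_processes is harmless, but brings no further speed-up.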

References

  • S. Boitard, C. Schlötterer and A. Futschik (2009). Detecting selective sweeps: a new approach based on hidden Markov models. Genetics 181: 1567-1578.
  • S. Boitard, C. Schlötterer, V. Nolte, R. V. Pandey and A. Futschik (2012). Detecting selective sweeps from pooled next generation sequencing samples. Mol. Biol. Evol., doi : 10.1093/molbev/mss090.
  • S. Boitard, R. Kofler, P. Françoise, D. Robelin, C. Schlötterer and A. Futschik (submitted). Pool-hmm : a Python program for the detection of selective sweeps from pooled next generation sequencing samples.