FLK

Authors

  • M. Bonhomme, INRA
  • C. Chevalet, INRA
  • B. Servin(contact author), INRA
  • S. Boitard, INRA
  • J. Abdallah, Department of Animal Production, Faculty of Agriculture, An-Najah National, University, Nablus, Palestine
  • S. Blott, Department of Genetics, Centre for Preventive Medicine, Newmarket, UK
  • M. SanCristobal, INRA

Description

Detecting selection in population trees: the Lewontin and Krakauer test extended.

This page is a documentation on how to compute the extended LK test (named FLK) on a SNP (biallelic marker) dataset. Instructions are provided along with a set of input file examples containing the pig data analyzed in Bonhomme et al.

Principle

The principle of the test is to compare patterns of differences between allele frequencies in several populations to their expectation under a neutral evolution. The null hypothesis of neutral evolution assumes a tree structure with branch length corresponding to the amount of genetic drift in each population (F). This tree is estimated from the matrix of Reynold's genetic distances between populations, using the neighbor joining (NJ) algorithm.

Software requirements

All softwares needed are free softwares available on all common computer operating systems. Please install the following required packages to use the programs provided on this page:
The test calculations are performed using R. The ape package is needed to estimate the NJ tree.
To derive the empirical distributions of the test under neutrality, a python program is provided (see below). It requires the simuPOP and numpy packages to run.

Input

Main input

In order to perform the test, the user needs to provide data on allele frequencies for several populations. To build the population tree, the program needs an outgroup population used to root the NJ tree. This file contains one line per population. Each line starts with the population name, followed by the list of allele frequencies for this population.
As it is assumed markers are biallelic (SNP), only one allele frequency is needed per marker. It doesn't matter the allele which frequency is reported in the file, as long as it is the same allele for all populations.
Excerpt of the input file for the pig dataset (Download the file)

GBDU02 0.6875 0 0 0.40425532 0.23958333 ...
FRLR01 0.73958333 0.03125 0.11 0.27659574 0.23469388 ...
GBLW05 0.55 0.03703704 0.22916667 0.125 0 ...
DEPI03 0.65306122 0.15306122 0.01 0.5 0 ...
FRMS01 0.3125 0 0 0 0.375 ...


For this dataset, the outgroup population is FRMS01

Additional input

Additionnaly, the user may provide a file with the Reynolds genetic distances already computed. This is convenient (and recommended) if the SNP data is small and restricted to a few regions of the genome. The format of the file is as follows:
Each line contains first the population name and then the corresponding row of the matrix of reynolds genetic distances. It is assumed that the population order is the same for the row and the columns. Population names in this file must match the ones in the main input file, although the order might be different.
Reynolds Genetic Distances for the pig dataset (Download the file )

GBDU02 0.0000 0.2422 0.2850 0.3916 0.2647
FRLR01 0.2422 0.0000 0.1732 0.3396 0.1501
GBLW05 0.2850 0.1732 0.0000 0.3572 0.1774
FRMS01 0.3916 0.3396 0.3572 0.0000 0.3436
DEPI03 0.2647 0.1501 0.1774 0.3436 0.0000

In the pig data analysed by Bonhomme et al., the Reynolds genetic distances were computed from microsatellite data.

Computing FLK test

In order to compute the FLK test, you will need the R code provided in the file FLK.R (Download the file).
We provide instructions to use the code through an example R session on the pig data. An analysis of the input file must follow the same steps. The R statements are in bold and comments in italic:

## import the functions
source('FLK.R')

## Read the SNP frequency data
freq=read.table('pig.dat',row.names=1)

## Read the matrix of Reynolds Genetic Distances
DR=read.table('pig.dist',row.names=1)

## Estimate the population tree with provided Reynolds matrix
F=Fij(freq,outgroup='FRMS01',D=DR)
 
## Alternatively estimate the population tree using Reynolds distances
## computed on the SNP data (not recommended here)
Fsnp=Fij(freq,outgroup='FRMS01')

## Now compute the FLK and LK tests
tests=FLK(freq,F)

The FLK R function returns a data frame where each line corresponds to results for a SNP. The order of the SNPs in the data frame is the same as on input. For each SNP, the function returns the mean heterozygosity (Ht), the FLK statistic (F.LK), the associated asymptotic p-value (F.LK.p.val), the original LK statistic (LK) and associated asymptotic p-value (LK.p.val).
Excerpt of the data frame obtained on the pig data:

Ht F.LK F.LK.p.val LK LK.p.val
0.45036473 4.422247e-01 0.931388145 0.34041443 0.95225673
0.10454975 1.286550e+00 0.732329480 1.03240612 0.79341130
0.15934366 2.034670e+00 0.565242014 1.71449799 0.63371538
0.43976966 1.783754e+00 0.618476643 1.43776139 0.69670761
0.20902125 2.316534e+00 0.509360914 2.14718524 0.54242600
0.37367636 7.443688e+00 0.059023131 5.54232704 0.13612880

Additional output files are created by the Fij function. It returns the estimated F matrix in a file named fij.txt and the NJ tree in the file named tree.txt. These files are needed to derive the empirical null distribution of the FLK statistic (see below).

Empirical null distribution of FLK

We provide a program called FLKnull (Download the file) to derive the empirical null distribution of the FLK statistic. This program performs simulations conditional on the dataset analysed (that is the population tree estimated from the data). The program needs the fij.txt and tree.txt files created by the Fij R function (see above).

Running the program

To run the program, open a terminal and go the directory containing the results of the analysis. Then just run the program by typing python FLKnull. This will perform 10,000 simulations conditional on the estimated population tree. Optionnaly, more (or less although not recommended) simulations can be specified as an argument to the program. For example typing python FLKnull 50000 will lead to performing 50,000 simulations. Note that the simulation process can take some time.
FLKnull returns the empirical quantiles of the null distribution of the tests for different heterozygocities. The results are provided in an output file named 'envelope.txt'. Each line of the file is composed of:

  • Heterozygosity
  • 0.005, 0.025, 0.5 (median), 0.975, 0.995 quantiles of the null distribution

The output file has a header as first line indicating the values for the different columns.

Plotting the distribution

We provide another R code (Download the file) to plot the null distribution envelope. This is done by calling source('plotNull.R') within your session (provided the output file 'envelope.txt' is in the current working directory). The actual estimated quantiles are plotted in gray. Because of the variance due to the simulation process, the envelope is better represented by fitting a spline on the actual quantiles. These are the lines represented in black: the solid lines correspond to the 0.005 and 0.995 quantiles, the dashed lines the 0.025 and 0.975 quantiles and the doted line to the median. If you find the variance around the spline to be too large, perform more simulations as explained above.
You can then add the observed value of your data by calling points(tests$Ht,tests$F.LK,pch=16). On the pig data this results in the following figure:


envelope.png