CCEGA HAPMAP Simulator - Disease Model

This page describes the disease model file format.


Three disease model specification types are supported.

The absolute genotype (AG) format specifies P(genotype | disease) for each joint genotype at the L disease loci. There are 3^L such joint genotypes in total. The AG probabilities should sum to 1.0, although some allowance is made for rounding error.

The genotype relative risk (GRR) format specifies the ratio P(disease | genotype) / P(disease | referent genotype) for each joint genotype. GRR also requires the user to specify the overall probability of disease, and at least one joint genotype must have a relative risk of 1 (that genotype is the referent genotype).

The absolute risk (AR) format specifies P(disease | genotype) for each joint genotype.

For GRR and AR, HAP-SAMPLE uses the HapMap data to convert the specified values automatically into the corresponding P(genotype | disease) values, from which disease genotypes are drawn.
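The relationship between the three formats can be illustrated with Bayes' rule. The sketch below is not HAP-SAMPLE's own code: the function names and the input frequencies are hypothetical, and it assumes that the population genotype frequencies P(genotype) are already available (in HAP-SAMPLE they would presumably be derived from the HapMap data).

```python
def genotype_given_disease(risks, genotype_freqs):
    """Bayes' rule: P(g | disease) is proportional to P(disease | g) * P(g).

    `risks` may be absolute risks (AR) or relative risks (GRR); any common
    scale factor cancels when the weights are normalised.
    """
    weighted = [r * f for r, f in zip(risks, genotype_freqs)]
    total = sum(weighted)
    return [w / total for w in weighted]


def referent_risk(relative_risks, genotype_freqs, prevalence):
    """Absolute risk of the referent genotype implied by GRR values and prevalence."""
    return prevalence / sum(r * f for r, f in zip(relative_risks, genotype_freqs))


# Hypothetical single-locus inputs, not taken from HapMap.
freqs = [0.25, 0.5, 0.25]           # assumed P(genotype) for genotypes 0, 1, 2
relative_risks = [1.0, 2.0, 4.0]    # GRR values; genotype 0 is the referent

ag = genotype_given_disease(relative_risks, freqs)
base = referent_risk(relative_risks, freqs, prevalence=0.01)
print("P(genotype | disease):", ag)
print("P(disease | referent genotype):", base)
```

In this reading, the relative risks alone fix P(genotype | disease), while the prevalence is only needed to recover the absolute risk of the referent genotype.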

The 3^L joint genotypes are assumed to be ordered by sorting on the last disease locus, then on the second-to-last disease locus, and so on, so that the genotype at the first locus changes fastest (see Example 2 below). The order in which disease loci are specified is entirely up to the user. Currently HAP-SAMPLE allows only one disease locus per chromosome.
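The ordering rule can be reproduced with a few lines of Python. This is an illustrative sketch only; the helper name joint_genotype_order is hypothetical and is not part of HAP-SAMPLE.

```python
from itertools import product


def joint_genotype_order(num_loci):
    """Return the 3**num_loci joint genotypes, first locus varying fastest."""
    # product() varies its last element fastest, so each tuple is built
    # last-locus-first and then reversed to put the first locus in front.
    return [tuple(reversed(combo)) for combo in product(range(3), repeat=num_loci)]


# For two loci this lists the genotypes in the order 0 0, 1 0, 2 0, 0 1, ...
for genotype in joint_genotype_order(2):
    print(*genotype)
```

For L = 2 the loop prints the same nine rows, in the same order, as the genotype list given under Example 2.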

File formats are given in more detail below, with comments following "#" characters. The comments should not appear in the actual disease model files.
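For readers who want to check a file before running the simulator, the layout can be read with a small parser. This is a minimal sketch inferred from the examples on this page, not HAP-SAMPLE's own reader; parse_disease_model and the dictionary keys are hypothetical names. It tolerates "#" comments so that the annotated examples can be pasted in directly.

```python
def parse_disease_model(text):
    """Parse a disease model description into a dictionary.

    Assumed layout (taken from the examples on this page): number of loci L,
    L SNP identifiers, prevalence, format keyword, then 3**L values.
    """
    entries = []
    for line in text.splitlines():
        content = line.split("#", 1)[0].strip()  # drop comments and padding
        if content:
            entries.append(content)

    num_loci = int(entries[0])
    snps = entries[1:1 + num_loci]
    prevalence = float(entries[1 + num_loci])
    model_type = entries[2 + num_loci]
    values = [float(v) for v in entries[3 + num_loci:]]

    if model_type not in {"AG", "GRR", "AR"}:
        raise ValueError(f"unknown model type: {model_type!r}")
    if len(values) != 3 ** num_loci:
        raise ValueError(f"expected {3 ** num_loci} values, found {len(values)}")

    return {"snps": snps, "prevalence": prevalence,
            "type": model_type, "values": values}


example_1 = """\
1         # L=number of disease loci
rs868559  # the causal SNP
0.005     # disease prevalence
AG        # tells HAP-SAMPLE to use AG format
0.4225    # P(genotypes | disease) start here
0.455
0.1225
"""
model = parse_disease_model(example_1)
print(model)
```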

Example 1: AG format, one disease SNP

1	  # L=number of disease loci
rs868559  # the causal SNP.  Must be in the HapMap CEU data
0.005      # disease prevalence
AG	  # tells HAP-SAMPLE to use AG format
0.4225	  # P(genotypes | disease) start here
0.455	
0.1225

This is for the genotypes in the following order:
0
1
2

Example 2: AG format, two disease SNPs, one per chromosome

2          # L=number of disease loci
rs9439462  # first causal SNP
rs4662920  # second causal SNP
0.01       # disease prevalence
AG         # tells HAP-SAMPLE to use AG format
0.541696   # P(genotypes | disease) start here
0.270848
0.033856
0.094208
0.047104
0.005888
0.004096
0.002048
0.000256

This is for the genotypes in the following order:
0 0
1 0
2 0
0 1
1 1
2 1
0 2
1 2
2 2
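The AG requirements (3^L values that sum to 1.0 within rounding error) can be checked before a file is used. The sketch below applies the check to the Example 2 values; the function name and the tolerance of 1e-6 are assumptions, since this page does not state how much rounding error HAP-SAMPLE accepts.

```python
import math


def ag_probabilities_valid(probs, num_loci, tol=1e-6):
    """Check the count, sign and sum of AG probabilities.

    `tol` is an assumed tolerance; the value HAP-SAMPLE uses is not documented here.
    """
    return (
        len(probs) == 3 ** num_loci
        and all(p >= 0.0 for p in probs)
        and math.isclose(sum(probs), 1.0, abs_tol=tol)
    )


example_2 = [0.541696, 0.270848, 0.033856,
             0.094208, 0.047104, 0.005888,
             0.004096, 0.002048, 0.000256]
print(ag_probabilities_valid(example_2, num_loci=2))
```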

Example 3: Genotype Relative Risk

2          # L=number of disease loci
rs9439462  # first causal SNP
rs868559   # second causal SNP
0.001      # disease prevalence
GRR        # tells HAP-SAMPLE to use GRR format
1.1        # relative risks start here
2.2
3.3
1          # at least one relative risk must be 1
2
3
0.3
0.6
0.9
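To see which joint genotype acts as the referent in a GRR file, the position of each relative risk equal to 1 can be mapped back to a genotype using the ordering rule described earlier. This is an illustrative sketch; the function names are hypothetical and not part of HAP-SAMPLE.

```python
def index_to_genotype(index, num_loci):
    """Map a 0-based position in the value list to a joint genotype.

    The first locus varies fastest, so the index is read as a base-3 number
    whose least significant digit belongs to the first locus.
    """
    return tuple((index // 3 ** locus) % 3 for locus in range(num_loci))


def referent_genotypes(relative_risks, num_loci):
    """Return every joint genotype whose relative risk is exactly 1."""
    if len(relative_risks) != 3 ** num_loci:
        raise ValueError("expected 3**L relative risks")
    referents = [index_to_genotype(i, num_loci)
                 for i, rr in enumerate(relative_risks) if rr == 1.0]
    if not referents:
        raise ValueError("at least one relative risk must be 1")
    return referents


example_3 = [1.1, 2.2, 3.3, 1, 2, 3, 0.3, 0.6, 0.9]
print(referent_genotypes(example_3, num_loci=2))  # prints [(0, 1)]
```

In Example 3 the referent is therefore genotype 0 at the first SNP combined with genotype 1 at the second SNP.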

Example 4: Absolute Risk

2         # L=number of disease loci
rs9439462 # first causal SNP
rs868559  # second causal SNP
0.05      # disease prevalence (this value is actually ignored in AR format)
AR        # tells HAP-SAMPLE to use AR format
0.005     # P(disease|genotype) starts here
0.02
0.03
0.005
0.002
0.003
0.001
0.002
0.003
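One way to see why the prevalence line is redundant in AR format: the absolute risks, together with the population genotype frequencies, already determine the prevalence. The sketch below computes that implied prevalence under several labelled assumptions that are not taken from this page: genotype codes 0, 1, 2 are treated as allele counts, each locus is in Hardy-Weinberg equilibrium, the two loci are independent, and the allele frequencies are made up. HAP-SAMPLE itself would work from the HapMap data.

```python
def hwe_freqs(q):
    """Hardy-Weinberg frequencies for 0, 1, 2 copies of an allele with frequency q."""
    p = 1.0 - q
    return [p * p, 2.0 * p * q, q * q]


def joint_freqs(per_locus_freqs):
    """Joint genotype frequencies for independent loci, first locus varying fastest."""
    joint = [1.0]
    for freqs in per_locus_freqs:
        # The later locus is the outer loop, so earlier loci change fastest.
        joint = [j * f for f in freqs for j in joint]
    return joint


def implied_prevalence(absolute_risks, genotype_freqs):
    """P(disease) = sum over genotypes of P(disease | g) * P(g)."""
    return sum(r * f for r, f in zip(absolute_risks, genotype_freqs))


example_4 = [0.005, 0.02, 0.03, 0.005, 0.002, 0.003, 0.001, 0.002, 0.003]
freqs = joint_freqs([hwe_freqs(0.2), hwe_freqs(0.3)])  # hypothetical allele frequencies
print(implied_prevalence(example_4, freqs))
```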