Exact inference using auxiliary variables in a Bayesian network ... diagnostics; prequential monitor; triangulation. ..... 3.4 Posterior distribution ...

0 downloads 0 Views 546KB Size

Therese Graversen∗ University of Oxford

Steffen Lauritzen University of Oxford

arXiv:1307.4956v1 [stat.ME] 18 Jul 2013

February 10, 2018

Abstract Statistical analysis of DNA mixtures is known to pose computational challenges due to the enormous state space of possible DNA profiles. We propose a Bayesian network representation for genotypes, allowing computations to be performed locally involving only a few alleles at each step. In addition, we describe a general method for computing the expectation of a product of discrete random variables using auxiliary variables and probability propagation in a Bayesian network, which in combination with the genotype network allows efficient computation of the likelihood function and various other quantities relevant to the inference. Lastly, we introduce a set of diagnostic tools for assessing the adequacy of the model for describing a particular dataset. Keywords: Bayesian network; genotype representation; junction tree; model diagnostics; prequential monitor; triangulation.

1

Introduction

In this paper we demonstrate methods for exact computation in statistical analysis of DNA mixtures, where the need for summation over the space of possible DNA profiles for unknown contributors traditionally has involved some degree of approximation (Bill et al., 2005; Tvedebrink et al., 2010; Puch-Solis et al., 2012) and has only been made for two or three unknown contributors (Cowell et al., 2011). In contrast, using the methodology presented here and the corresponding implementation by Graversen (2013) in the R-package DNAmixtures, Cowell et al. (2013) were able to perform exact evaluation and subsequent numerical maximisation of the likelihood function for up to six unknown contributors. The present paper develops a suite of tools for inference in the statistical model described in Cowell et al. (2013) enabling evaluation of the likelihood function, computation of posterior probability of genotypes given a set of observed peak heights, and assessment of model adequacy. We exploit introduction of auxiliary variables combined with an efficient representation of the genotypes as a Bayesian network. The implementation in DNAmixtures interfaces the HUGIN API (Hugin Expert A/S, 2013) via RHugin (Konis, 2013). ∗

Corresponding author: Therese Graversen, Department of Statistics, University of Oxford, 1 South Parks Road, Oxford OX1 3TG, United Kingdom, email: [email protected]

1

1000 750 RFU 500 250 0 13

14

15

16

17 Allele

18

19

20

21

Figure 1: Stylized electropherogram exhibiting peaks for the alleles at one particular marker.

The plan of the paper is as follows: Section 2 briefly describes the relevant model for DNA mixture analysis and the computational methods are detailed in Section 3. In Section 4 we show how the methodology can be extended to calculate various quantities of interest; in particular we develop methods for assessing the adequacy of the model.

2

A statistical model for mixed traces of DNA

In statistical analysis of DNA mixtures it is of interest to draw inference about individual DNA profiles in a mixed trace of DNA. The observations consists of a set of peak heights in an electropherogram (EPG) produced after a chemical duplication process known as a polymerase chain reaction (PCR). Figure 1 represents a schematic illustration of part of an EPG. The DNA sequences at Short Tandem Repeat (STR) markers are characterised by a motif of base pairs repeated a number of times, so that a specific repeat number corresponding to an allele and a peak in the EPG typically indicates presence of the corresponding allele. A pair of alleles is called a genotype, and the genotypes across a set of markers constitute the DNA profile. The markers used for forensic identification are typically located at different chromosome-pairs or, if not, on well separated locations, rendering it reasonable to assume independence of genotypes across markers. The observed EPG is prone to artefacts known as stutter and dropout: Stutter refers to the phenomenon that some of the DNA alleles may lose a repeat motif during the PCR process and thus contribute to a peak at a lower repeat number. If there are allelic types {1, . . . , A} we thus assume that any allele a receives stutter from the amplification of allele a + 1. Dropout refers to the fact that peak heights occasionally are too small for the peak to be registered. We distinguish between known and unknown contributors to the sample, depending on whether their DNA profile is considered known or not. The computational complexity of the problem is directly associated with the huge number of possible allocations of genotypes to the unknown contributors. The genotypes of a DNA profile are assumed independent across markers so we briefly describe the model for one marker only, following Cowell et al. (2013).

2

2.1

Model for the genotypes of unknown contributors

We assume that the alleles of an unknown person are sampled from a reference population in Hardy–Weinberg equilibrium so that the two alleles can be considered sampled independently. Denote by nia the number of alleles of type a for contributor i. A genotype (ni1 , . . . , niA ) for an unknownPcontributor follows a multinomial distribution with allele frequencies (q1 , . . . , qA ) and a nia = 2. Unknown contributors are assumed unrelated so their genotypes are independent.

2.2

Peak height distribution for fixed genotypes

Analysing DNA that contains alleles of type a results in a peak at position a, and possibly also a smaller peak at position a − 1 due to stutter during the PCR process; thus the height of the peak Ha ≥ 0 for allele a depends on the presence of alleles of both type a and type a + 1. Peaks of height Ha below a chosen threshold C are not registered and so the observed peak heights are Za = Ha 1{Ha ≥C} . For given genotypes of the contributors we assume that the peak height Ha at allelic type a is gamma distributed with shape and scale parameters depending on the numbers nia , ni,a+1 of alleles of type a and a + 1 that each unknown contributor i possesses, as well as a set of model parameters, ψ = (ρ, η, ξ, φ); more precisely we assume that Ha ∼ Γ(λa , η), where k X λa = ρ {(1 − ξ)nia + ξni,a+1 } φi . (1) i=1

Here, and in the following, we have let ni,A+1 = 0; the parameter ξ is the mean stutter percentage, ρ is related to the general peak variability, φi denotes the fraction of DNA from individual i, and η is the scale. If λa = 0 the gamma distribution Γ(0, η) is considered degenerate at 0.

2.3

Likelihood function

The likelihood function is determined by the distribution of the observed peak heights. The observed peak heights are independent across markers m = 1, . . . , M , and thus the likelihood function factorises accordingly. Using this fact in combination with (1) we find `(ψ) =

=

=

M Y m=1 M Y m=1 M Y m=1

m fψ (Z1m , . . . , ZA ) m

m E fψ Z1m , . . . , ZA n m

E

Y Am

fψ (zam | na , na+1 )

,

(2)

a=1

where the expectation is taken with respect to the distribution of genotypes of the unknown contributors. Here and in the following n denotes the full set of genotypes for all individuals and na the vector na = (nia , i ∈ I) of allele-counts of type a. The expectation in (2) involves summation over all combinations of possible genotypes of potential contributors. There are {Am (Am + 1)/2}k possible combinations of genotypes 3

at a marker, and thus there are this many terms in the sum, each being a product of Am factors. Direct computation is typically infeasible when there are many alleles and many unknown contributors. We attack this computational problem by appropriate use of Bayesian network techniques, as detailed in Section 3 below. Note that our methodology can be used directly with other choices of distribution for the peak heights, provided that the distribution of the peak height for allele a depends only on the genotypes through the number of alleles of types a and a + 1.

3

Computational methods

As a consequence of (2), and for other purposes, the computational task in DNA mixture analysis involves repeated computation of the expectation E{h(X)} of non-negative functions h of a set of discrete variables X = {Xv }v∈V . We describe our computational approach in thid general setting before returning to the DNA mixture model in Section 3.2, where we give a network representation of a genotype for an unknown contributor to the trace.

3.1

Computation by auxiliary variables

Let X = {Xv }v∈V be a collection of discrete variables with a distribution represented by a Bayesian network. For B ⊆ V , we denote by XB the collection of variables {Xv }v∈B . Let h be a non-negative function which can be written on the form Y h(x) = hB (xB ), B∈B

for some set B of subsets of V and real-valued, non-negative functions hB . For each B ∈ B we introduce binary random variables Y B ∈ {0, 1} which are conditionally independent given the network and have conditional distributions (3) P Y B = 1 X = x = P Y B = 1 XB = xB = hB (xB )/k B . Here, the constant k B is chosen such that hB (xB )/k B ∈ [0, 1] over all states xB and so (3) defines a valid probability distribution. A simple choice would be k B = maxxB hB (xB ), i.e. the largest value that hB attains over the state space of XB . We use the state space {0, 1} for auxiliary variables, but note that this choice is unimportant for the method itself. Q The desired expectation E{ B∈B hB (XB )} can now be expressed as the probability of a specific configuration of the binary variables introduced. As Proposition 1 reveals, this is also the case for the expectation of a product of any subset of the variables hB (XB ). Proposition 1. For all B 0 ⊆ B it holds that \ Y Y B {Y = 1} kB . E hB (XB ) = P B∈B0

B∈B0

4

B∈B0

Proof. Using (3) and the fact that Y B are conditionally independent given X we get Y Y B E hB (XB ) = E P Y = 1 XB kB B∈B0

B∈B0

=E

Y

P Y

B

= 1 X

Y

B∈B0

kB

B∈B0

\ Y B =E P {Y = 1} X kB B∈B0

=P

\

B∈B0

{Y B = 1}

B∈B0

Y

kB

B∈B0

as desired. If the distribution of the variables {Xv }v∈V is modelled by a Bayesian network, this network can be extended to include the variables {Y B }B∈B by for each B adding Y B as a child of {Xv }v∈B with conditional distributions of Y B in (3). As the auxiliary variables are added as children of existing network nodes, no directed cycles are created and the extended network is a correct representation of the joint distribution of (X, Y ) since, given XB , Y B is conditionally independent of all other variables in the extended network. Figure 2 illustrates how the network is extended in case of a function h factorising over two sets of variables (X2 , X3 ) and (X3 , X4 , X5 ). X1

X4 X3

X2

X5

Y {2,3}

Y {3,4,5}

Figure 2: Extending a network with two binary variables for computation of E h{2,3} (X2 , X3 )h{3,4,5} (X3 , X4 , X5 ) . Here B = {{2, 3}, {3, 4, 5}}

3.1.1

Probability propagation

We now briefly describe probability propagation and explain how to exploit the normalising constants arising as a by-product of the propagation algorithm. We refer for example to Cowell et al. (1999) for further details. A computational structure is set up in the form of a so-called junction tree of subsets of the variables involved: first an undirected graph, the moralised graph, is constructed by adding undirected links between nodes that have a common child and removing directions for existing edges. Subsequently links are added to ensure that the resulting graph is chordal. This process is known as triangulation and can generally be done in many ways. Finally the cliques in the triangulated graph are arranged in a junction tree. 5

In the situation described above XB is the parent set of Y B in the extended network and the node set XB will thus be a complete set in the triangulated graph, hence contained in some clique. The efficiency of the method depends crucially on the size of cliques for the chosen triangulation, see further discussion in Section 3.5.1 below. A distribution p(x) is represented by an unnormalised probability function Q ζC (xC ) p(x) ∝ g(x) = QC∈C S∈S ζS (xS ) where S denotes the set of separators, i.e. intersections of pairs of neighbouring cliques P in the junction tree. The corresponding normalising constant is N1 = x g(x). The function g(x) is known as the charge and the functions ζ as potentials. A message passing operation referred to as propagation brings the charge on a canonical form, where all potentials of the charge are equal to the function g marginalised onto the corresponding clique or separator, i.e. X ζD (xD ) = g(y) for all D ∈ C ∪ S. y:yD =xD

P The normalising constant can then be computed efficiently after propagation as xD ζD (xD ), for instance choosing D as a separator S ∈ S with minimal state space. The charge g can be modified by entering so-called likelihood evidence `v (xv ) on single nodes leading to the charge Y g˜(x) = g(x) `v (xv ) v∈V

with normalising constant N2 =

X

g(x)

x

Y

`v (xv ).

v∈V

Taking the ratio of the normalising constants before and after propagating the likelihood evidence yields the expectation of the product of the likelihood evidence with respect to the distribution p(x): Q P `v (xv ) X g(x) Y N2 x g(x) P v∈V P = `v (xv ) = N1 y g(y) y g(y) v∈V x Y X Y `v (Xv ) . = p(x) `v (xv ) = E x

v∈V

v∈V

As shown in Proposition 2 below, this fact now ensures that the expectation of interest can be calculated by propagating likelihood evidence on the auxiliary variables. Proposition 2. Let likelihood evidence for each node Y B , B ∈ B 0 ⊆ B be given as: ( kB , Y B = 1 `B (Y B ) = 0, YB =0 and let N1 and N2 be the normalising constants before and after propagation of the likelihood evidence. Then we have Y N2 E hB (XB ) = . N1 0 B∈B

6

Proof. N2 =E N1 =E

Y

B

`B (Y )

B∈B

Y

kB 1{Y B =1}

B∈B0

=P

\

{Y

B

= 1}

B∈B0

Y

kB

B∈B0

which by Proposition 1 equals the desired expectation.

3.2

A Bayesian network representation of genotypes

The multinomial distribution of allele-counts (ni1 , . . . , niA ) representing the genotype of individual i doesP not in itself have Markovian properties. However, if we define the partial sums Sia = ab=1 nia counting the number of alleles of type up to and including a that person i possesses, we can represent the genotype in a Bayesian network as displayed in Figure 3. Si1

Si2

Si3

Si4

Si5

Si6

ni1

ni2

ni3

ni4

ni5

ni6

Figure 3: Network representation of a genotype at a marker with A = 6 allelic types.

If we imagine the two alleles in the genotype being allocated sequentially, then the number of alleles that a person has of type a + 1 only depends on how many alleles of the total two are left to allocate, and the allocation happens according to a binomial distribution. In Proposition 3 we establish the formal correctness of the network specification. Proposition 3. The distributions of genotypes and partial sums satisfy the following relations Si1 = ni1 , ni1 ∼ bin (2, q1 ) , and for a ∈ {2, . . . , A} Sia = Si,a−1 + nia , P nia | Si,a−1 ∼ bin 2 − Si,a−1 , qa / A q . b b=a

(4)

Finally, we have the conditional independence relations nia ⊥ ⊥ (ni1 , . . . , ni,a−1 , Si1 , . . . , Si,a−2 ) | Si,a−1 Sia ⊥ ⊥ (ni1 , . . . , ni,a−1 , Si1 , . . . , Si,a−2 ) | (Si,a−1 , nia ). 7

(5)

Proof. The unnumbered relations follow directly from the definition of the quantities involved. We further have p(ni1 , . . . , ni,a−1 , nia ) p(ni1 , . . . , ni,a−1 ) P 2−Si,a−1 −nia Q nib A a 2! Q q b=a+1 b b=1 qb (2−Si,a−1 −nia )! a b=1 nib ! P 2−Si,a−1 Q A a−1 nib 2! Q q b a−1 b=a b=1 qb (2−S )! n !

p(nia | ni1 , . . . , ni,a−1 ) =

=

i,a−1

=

b=1

ib

(2 − Si,a−1 )! nia !(2 − Si,a−1 − nia )! !2−Si,a−1 −nia qa × 1 − PA b=a qb

!nia

qa PA

b=a qb

.

The conditional independence (5) follows from the fact that the conditional distribution of nia given ni1 , . . . , ni,a−1 only depends on the condition through Si,a−1 ; inspection of the expression for the conditional distribution yields (4).

3.3

Auxiliary variables for computing the likelihood function

In order to compute the inner expectation in (2), we note that this is an expectation of a product over alleles, where each factor is a function of the variables na and na+1 , and so we can compute this expectation using auxiliary variables as described in Section 3.1: For each allele a, we add an auxiliary variable Oa with parents nia and ni,a+1 for all unknown contributors i, except for OA that is given only one parent niA per contributor. Figure 4 shows the network for modelling one marker of a mixture with two contributors and six alleles. Note that Oa and its parents nia , ni,a+1 , i ∈ {1, . . . , k} Si1

Si2

Si3

Si4

Si5

Si6

ni1

ni2

ni3

ni4

ni5

ni6

O1

O2

O3

O4

O5

O6

nj1

nj2

nj3

nj4

nj5

nj6

Sj1

Sj2

Sj3

Sj4

Sj5

Sj6

Figure 4: Bayesian network modelling the genotypes of 2 unknown contributors i and j for a marker with 6 possible allelic types.

are necessarily contained in the same clique, implying that any valid junction tree will 8

contain cliques with an associated state space that is exponential in the number k of unknown contributors. Unfortunately, as the moralised graph is not chordal – for instance (Si1 , ni1 , nj2 , ni3 , Si2 , Si1 ) is a cycle – further edges need to be added, resulting in an additional increase in the size of the cliques. We shall return to this issue in Section 3.5.1. The distribution of a peak height Za conditionally on the allele-counts is for a < A ( gψ (za | na , na+1 ), za ≥ C fψ (za | na , na+1 ) = (6) Gψ (C | na , na+1 ), za < C where g and G denotes the density respectively the cumulative distribution function for the gamma distribution with parameters as in (1). Define the distribution of Oa for an observed allele, where za ≥ C, as P(Oa = 1 | na , na+1 ) = gψ (za | na , na+1 )/kaψ ,

(7)

noting the dependence of the scaling factor kaψ on ψ. For an unobserved allele, where zam = 0, let the distribution of Oa be defined as P(Oa = 0 | na , na+1 ) = Gψ (C | na , na+1 ).

(8)

For convenience we have defined the auxiliary variables so that for all alleles the event Oa = 1 corresponds to the event that the peak at allele a is above the threshold C. Now Proposition 2 can readily be used to evaluate the contribution to the likelihood from marker m for a given value of ψ by propagating likelihood evidence ( kaψ 1{Oa =1} , if a is seen (9) `a (Oa ) = 1{Oa =0} , if a is unseen.

3.4

Posterior distribution of genotypes

When entering and propagating likelihood evidence as in (9) for a in a set of alleles B, we obtain a representation of the conditional distribution of the full network given the relevant state of the auxiliary variables Oa , a ∈ B. Furthermore, this distribution is identical to the conditional distribution of the nodes in the network given the peak height information {za }a∈B : \ \ p(x | {za }a∈B ) = p x {Oa = 1} {Oa = 0} (10) a∈B,

a∈B,

za >C

za =0

9

This follows from the following argument: Y p(x) `a (Oa ) a∈B

\ \ {Oa = 0} ∝ p x {Oa = 1} a∈B,

a∈B,

za >C

za =0

∝ p(x) P = p(x)

= p(x)

\

{Oa = 1}

\

a∈B,

a∈B,

za >C

za =0

Y

P(Oa = 1 | x)

Y

a∈B

a∈B

za >C

za =0

Y

P(Oa = 0 | x) Y

P(Oa = 1 | na , na+1 )

a∈B za >C

= p(x)

{Oa = 0} x

Y

{gψ (za | na , na+1 )/kaψ }

a∈B za >C

∝ p(x)

Y

P(Oa = 0 | na , na+1 )

a∈B za =0

Y

Gψ (C | na , na+1 )

a∈B za =0

fψ ({za }a∈B | x)

a∈B

∝ p(x | {za }a∈B ). As a consequence, we can easily sample from the conditional distribution of genotypes given peak height information, which we shall exploit in Sections 4.2 and 4.3 below.

3.5

Network complexity considerations

The main concerns when applying computation by auxiliary variables to a specific problem are that the junction tree representation of the network may not fit in the physical memory, and propagation and other network operations may take prohibitively long. Both of these issues are directly related to the total size of the network junction tree. An additional concern lies in finding a good triangulation, as this can be both timeand memory-consuming; we eliminate this additional cost by specifying triangulations directly. The total size of the junction tree is the sum of the sizes of state spaces for all cliques and separators and determines how many numbers are needed to store the clique and separator tables. Once a junction tree has been created for a network, computation by auxiliary variables involves setting the conditional probability tables for each auxiliary variable and propagating evidence. The number of elementary arithmetic operations for propagation is linear in the total size. Also, the number of cells that need updating when the conditional probability tables for the auxiliary variables change is, in the worst case, determined by the total size. In the following we study the relation of the total sizes of junction tree representations used for DNA mixture analysis to the number A of possible alleles at a marker and the number k of unknown contributors.

10

3.5.1

Junction tree sizes for DNA mixtures

We shall consider three different triangulations of networks of the type discussed in Section 3.3 and investigate the behaviour of the total sizes of the corresponding junction trees. We restrict attention to mixture networks where any allele a — apart from the last allele A — can receive stutter from a + 1. Any triangulation must necessarily have cliques that contain auxiliary variables with their parent sets as these are complete sets in the moralised graph. For all our junction trees we avoid adding additional variables to all such sets and simply combine any auxiliary variable with its parent set to form a clique. We can thus focus the discussion on triangulating the part of the moralised graph that does not involve auxiliary variables. If we have N binary auxiliary variables per allele, their cliques and corresponding separators contribute to the total size of the junction tree by n o T Saux = 3N (A − 1)32k + 3k , since there are N (A−1) cliques containing an auxiliary variable along with its 2k parents, and each is separated from the remaining junction tree by a separator containing the 2k parents. The N auxiliary variables for the last allele have only k parents. Bearing Figure 3 in mind, the structure of the genotype networks requires upper triangle sets {Si,a−1 , Sia , nia } to be in a clique as they are complete sets. If allele a − 1 receives stutter from a, then the lower triangle set {ni,a−1 , nia , Sia } is also complete in the moralised graph and must be contained in some clique. The first triangulation method we shall consider, uses the simple idea of slicing the network into cliques {Sia , Si,a+1 , nia , ni,a+1 }ki=1 for a = 1, . . . , A. The corresponding junction tree, which we shall refer to as the slice tree, is displayed in Figure 5. In addition to the cliques and separators arising from the auxiliary variables, the slice tree has A − 1 cliques each consisting of 4k nodes, and A − 2 separators between them, each consisting of 2k nodes. Thus the total size of the slice tree becomes T Sslice = (A − 1)34k + (A − 2)32k + T Saux . S11 S12 n11 n12 S21 S22 n21 n22 S31 S32 n31 n32

S12 S13 n12 n13 S22 S23 n22 n23 S32 S33 n32 n33

S13 S14 n13 n14 S23 S24 n23 n24 S33 S34 n33 n34

n11 n12 n21 n22 n31 n32 O1

n12 n13 n22 n23 n32 n33 O2

n13 n14 n23 n24 n33 n34 O3

n14 n24 n34 O4

Figure 5: Slice junction tree for k = 3 contributors, A = 4 alleles, and N = 1 auxiliary variable per allele.

However, we can improve on this triangulation by splitting each slice into two cliques as Figure 6 illustrates. The resulting triangle tree has 2(A − 1) cliques of each 3k nodes

11

S1,a S1,a+1 n1,a n1,a+1 S2,a S2,a+1 n2,a n2,a+1 S3,a S3,a+1 n3,a n3,a+1

S1,a n1,a n1,a+1 S2,a n2,a n2,a+1 S3,a n3,a n3,a+1

S1,a S1,a+1 n1,a+1 S2,a S2,a+1 n2,a+1 S3,a S3,a+1 n3,a+1

Figure 6: Splitting each slice into two cliques consisting of lower and upper for a reduction in total size.

and 2(A − 1) separators of each 2k nodes, and thus the total size T Striangle = 2(A − 1)33k + {2(A − 1) − 1}32k + T Saux grows significantly slower with the number of unknown contributors than the slice tree; see Figure 10. S11 n11 n12 S21 n21 n22 S31 n31 n32

S11 S12 n12 S21 S22 n22 S31 S32 n32

n11 n12 n21 n22 n31 n32 O1

S12 n12 n13 S22 n22 n23 S32 n32 n33

S12 S13 n13 S22 S23 n23 S32 S33 n33

n12 n13 n22 n23 n32 n33 O2

S13 n13 n14 S23 n23 n24 S33 n33 n34

S13 S14 n14 S23 S24 n24 S33 S34 n34

n13 n14 n23 n24 n33 n34 O3

n14 n24 n34 O4

Figure 7: Triangle junction tree for k = 3 contributors, A = 4 alleles, and N = 1 auxiliary variable per allele.

In the case of only one unknown contributor, the total size of the triangle tree cannot be reduced. However, with more than one unknown contributor, each clique containing k upper triangles can be further split into k cliques as in Figure 8. Note that the cliques S1,a S1,a+1 n1,a+1 S2,a S2,a+1 n2,a+1 S3,a S3,a+1 n3,a+1

S1,a S1,a+1 n1,a+1 S2,a n2,a+1 S3,a n3,a+1

S2,a

S1,a+1 n1,a+1 S2,a+1 n2,a+1

S3,a

S3,a n3,a+1

S1,a+1 n1,a+1 S2,a+1 n2,a+1 S3,a+1 n3,a+1

Figure 8: Splitting upper triangle cliques for a further reduction in total size.

containing k lower triangle sets cannot be split in a similar fashion. The resulting junction tree then has A − 1 cliques of each 3k nodes, a further k(A − 1) of each 2k + 1 nodes, and (k + 1)(A − 1) − 1 separators of 2k nodes between them. The total size of the tree is thus T Sopt = (A − 1)33k + {(4k + 1)(A − 1) − 1}32k + T Saux . A further slight reduction of the total size can be obtained by a small alteration in the cliques that cover nodes from the first two and last three alleles; the resulting tree is seen in Figure 9. We shall refer to this tree as the optimal tree, as this is the best 12

junction tree we have been able to construct. We have also investigated junction trees found by using triangulation algorithms implemented in HUGIN but none have smaller total size than our optimal tree.

S12 S12 n11 n12

S22 S22 n21 n22

S32 S32 n31 n32

S12 n11 n12

S12 n12 S22 n22

S12 n12 S22 n22 S32 n32

n21 n22

n21

n31 n32

n31 n32

n11 n12 n21 n22 n31 n32 O1

n31

S12 n12 n13 S22 n22 n23 S32 n32 n33 n12 n13 n22 n23 n32 n33 O2

S12 S13 n13 S22 n23 S32 n33

S13 n13 S22 S23 n23 S32 n33

S32

S13 n13 S23 n23 S33 n33

S13 n13 n14 S23 n23 n24 S33 n33 n34

S13 S14 n14 S23 n24 S33 n34

n13 n14 n23 n24 n33 n34 O3

S14 n14 S23 S24 n24 S33 n34

S33

S14 n14 S24 n24 S34 n34

S14 n14 n15 S24 n24 n25 S34 n34 n35

S15 S16 n16

S25 S26 n26

S35 S36 n36

S14 S15 n15 n16

S24 S25 n25 n26

S34 S35 n35 n36

n15 n16

n15 n16

S14 n15 n16 S24

S24 n25

S34

n25 n26 S34

n35

n14 n15 n24 n25 n34 n35 O4

n25 n26 S34

n35

n35 n36 n15 n16 n25 n26 n35 n36 O5

n16 n26 n36 O6

Figure 9: Optimal junction tree for a DNA mixture network with k = 3, A = 6, and N = 1.

The optimal junction tree can be generated by an elimination sequence which first eliminates all the auxiliary variables and then proceeds through the network nodes as SA , SA−1 , S1 , n1 , {na , Sa }A−2 a=2 , nA−1 , nA where Sa denotes {Sia }ki=1 etc. The exponential growth of the total size of the three types of junction tree is illustrated in Figure 10. Our numerical examples all include N = 3 auxiliary variables for each allele to reflect the size of the networks used in the R-package DNAmixtures. The choice of N makes little difference to the total size as this in all cases grows linearly with N . The network representations constructed for the genotypes have a large number of P state combinations that are impossible, for example due to the constraint that a nia = 2 for all i. In HUGIN there is a facility to compress the domain, such that only configurations of clique and separator states with non-zero probability are stored, thus reducing the effective size of the junction tree. There is a slight cost in terms of book-keeping, but for our purposes this cost is negligible. As is apparent from Figure 10, the exponential growth pattern prevails for the compressed domains. Note that after compression all three junction trees are approximately of the same size. Also, the reduction of total size obtained by compression is itself growing exponentially; ignoring any slight reduction in total size from compressing states with probability zero in the cliques with auxiliary variables, the total size for the compressed slice tree is T Scompr.slice = (A − 3)10k + {3N (A − 1) + A} 6k + 3N 3k . In general, to make a compression, one single propagation has to be performed and therefore the uncompressed networks set the limit for computational feasibility. When numbers are represented in single precision of each four bytes, the horisontal band in Figure 10 represents a range of capacities from 2GB to 512 GB of memory. Figure 10 indicates that using the optimal junction tree should enable computation for up to k = 6 unknown contributors, whereas using the slice tree restricts computation to around k = 4. There is a simple way of compressing the slice tree in that there are at most 10 possible configurations of the states in each of {Sia , Si,a+1 , nia , ni,a+1 }. So if the state 13

15

● ● ● ●

●

log10 of total size 10

● ● ● ●

● ● ● ●

5

● ● ●

●

●

● ●

0

●

Allele pair Slice tree Triangle tree Optimal tree

1

2

3

4 5 6 7 Unknown contributors

8

9

10

Figure 10: Total sizes of junction trees as a function of the number k of unknown contributors, in the case of A = 25 allelic types and N = 3 auxiliary variables per allele. Solid lines are uncompressed sizes and dashed lines compressed sizes. The horisontal band indicates total sizes ranging from 2GB to 512GB assuming numbers are represented in single precision.

space is defined by these from the outset, it would in principle be possible to handle up to k = 9 unknown contributors, as it the compressed network would determine the maximal capacity; however, the general flexibility of the representation would be reduced. 3.5.2

Other representations of genotypes

Clearly, the network that represents the genotype of an unknown contributor could be replaced by a different representation than the one suggested here and connected to the auxiliary variables in an appropriate way. We shall briefly consider two alternative representations of a genotype. Allele-pair representation More commonly, a genotype has been represented directly as an unordered pair of alleles; this representation has for example been used in Cowell et al. (2011). Including A alleles in the model there are A(A + 1)/2 possible unordered pairs. If an allele-pair is represented by a single node for each contributor, the parent set for each auxiliary variable in this network is the collection of the k unknown genotype-nodes, resulting in a junction tree where each clique and each separator contains all of the k genotype-nodes. Adding N auxiliary variables for each of A alleles yields the total size T Sallele-pair = (3N A − 1) {A(A + 1)/2}k . We note that this junction tree exhibits polynomial rather than linear growth in A, rendering the representation less efficient for markers with a large number of possible allelic types. For a fixed number of alleles, the growth in the number k of unknown 14

contributors is still exponential; see Figure 10. For junction trees based on the Markov representation of genotypes, the number of alleles makes a neglible impact on the total size. However, for the allele-pair representation the rate of growth depends heavily on the number of allelic types: For 25 alleles as in Figure 10 it is feasible to handle up to about 3 unknown contributors, whereas if only 10 allelic types are needed, then 4-5 unknown contributors can be handled. For 7 or more allelic types, the Markov representation in combination with optimal triangulation is superior to the allele-pair representation regardless of the number of unknown contributors. As the allele-pair representation is compressed by construction, there is no possibility of further compression of the junction tree. Single gene representation Another possibility, used for example in Dawid et al. (2002) and Mortera et al. (2003), is to model the genotype at the single gene level. A single gene can be represented by the same Markovian network structure as that in Figure 3 used for a genotype, just that each node nia or Sia has state space {0, 1} rather than {0, 1, 2}. However, there is a cost in that two such networks are needed per unknown contributor, resulting in a total size with growth-rate O(A × 23(2k) ) compared to O(A × 33k ) when using the genotype representation. Thus, the single gene network will always be inferior to the genotype network. The total size of the optimal single gene tree renders computations feasible for up to about 5 unknown contributors. Compression of the single gene slice tree yields a growth rate of O(A × 16k ), which still is considerably higher than O(A × 10k ) for the corresponding compressed genotype slice tree. It would stay feasible if k ≤ 7. For A ≥ 11 allelic types, the single gene representation compares favourably to the allele-pair representation. Although inefficient, the single gene network representation may be preferable for other reasons; for example in cases where the two genes might be selected from different populations, if sensitivity to uncertainty or population structure should be investigated as in Green and Mortera (2009), or if there is additional complexity involving family relations etc. as in Mortera et al. (2003).

4

DNA mixture analysis

The analysis of a mixed trace can have different objectives depending on the context. The objective can be a quantification of the strength of evidence for a given hypothesis over another, or the objective may be a deconvolution of the trace, i.e. that one wishes to predict genotypes of unknown contributors. As a generic example we consider a trace MC15 from Gill et al. (2008), also analysed in Cowell et al. (2013). The trace is believed to contain DNA from at least three contributors, and the victim, who we shall denote K1 , is assumed present along with another contributor K2 . We shall here deal with the question of the identity of the third contributor. The peak heights from one marker are given in Table 1 along with the allele-counts for each of three genotyped individuals. The available evidence E consists of the peak heights as observed in the EPG as well as the genotypes of individuals associated with the case. It is customary to assume relevant population gene frequencies to be known.

15

Table 1: Peak heights for marker D2S1338 above threshold in trace MC15, and genotypes of associated individuals. Allele a 16 17 23 24

Peak height Za 64 96 507 524

Allele-count K1 K2 K3 0 0 1 0 0 1 1 0 0 1 2 0

Strength of evidence. We now consider two competing explanations to the trace. The prosecution hypothesis Hp : K1 &K2 &K3 claims that the trace has exactly three contributors who are identical to the three known individuals K1 , K2 , and K3 . An alternative explanation of the trace is the defence hypothesis Hd : K1 &K2 &U that the trace contains the DNA of K1 , K2 , as well as that of an unknown and unrelated individual U , whereas K3 has not contributed. The strength of the evidence is reported as a likelihood ratio: ˆ p )/L(H ˆ d ) = Pr(E | H ˆ p )/Pr(E | H ˆ d) LR = L(H ˆ i indicates that we use the maximum likelihood estimates of the parameters where H under the hypothesis Hi , see Table 2 below. Deconvolution. Under the defence hypothesis we are interested in determining the identity of the unknown contributor U . This could for example be done by finding the most probable genotypes for U given the evidence, i.e. those with the highest values of ˆ d , E). We shall return to this issue in Section 4.3 below. Pr(U | H Estimation. In order to calculate the relevant quantities for any of the above questions, we need to estimate the unknown parameters of the model. Being able to evaluate the likelihood function, this can be done by numerical maximisation. The maximum likelihood estimates and standard errors obtained under the defence hypothesis Hd and prosecution hypothesis Hp are given in Table 2. The resulting likelihood ratio is log10 (LR) = 12.12. Table 2: Maximum likelihood estimates based on MC15. Defence hypothesis Parameter Estimate ρ 26.95 η 33.86 ξ 0.086 φK1 0.823 φK2 0.055 φU 0.122 ˆ log10 L(H) -130.21

4.1

Prosecution hypothesis Parameter Estimate ρ 33.86 η 26.94 ξ 0.076 φK1 0.825 φK2 0.049 φK3 0.126 ˆ log10 L(H) -118.09

Model Diagnostics

In the assessment of forensic evidence, little attention has been devoted to demonstrate the adequacy of a proposed model used to analyse a specific case or, of equal importance, 16

to assert that data have been correctly recorded for the analysis. This may partly be due to the unavailability of useful methods for the purpose. However, we believe this aspect to be of utmost importance; in particular we find it reasonable that one should not only compare the prosecution and defence hypothesis, but there should also be an effort to demonstrate that neither hypothesis represents an implausible explanation of the trace under analysis. Previously we have introduced auxiliary variables Oa , to enable simple computation of the likelihood function (2) and representation of evidence from observed peak heights (10). We shall in the following introduce further auxiliary variables such as binary variables Da which indicate whether or not a peak was observed for allele a, and variables Qa which indicate whether a peak observed at allele a was less than a specified value. Both of these types of auxiliary variables shall prove to be useful for model validation; in addition, the variables Da can be used in an analysis which refrains from exploiting the peak heights but is based only on peak presence; see Section 4.4 below. 4.1.1

Assessing peak height distributions

First, we wish to investigate whether our model appropriately predicts the observed peak heights. Given Za ≥ C, the peak height follows a continuous distribution and thus the probability transform P(Za ≤ za | Za ≥ C) follows a uniform distribution. To express the probability in a way suitable for computation with auxiliary variables we first note that for z ≥ C we have P(Za ≤ z | Za ≥ C) =

P(Za ≤ z) − P(Za < C) . P(Za ≥ C)

Thus all we need to evaluate is the distribution function in the observed value za and at the threshold C. The distribution function n o P(Za ≤ z) = E P(Za ≤ z | na , na+1 ) (11) is the expectation of a trivial product of one factor, and to compute this we add an auxiliary variable Qa with the same parents as for Oa and with conditional probability P(Qa = 1 | na , na+1 ) = P(Za ≤ z | na , na+1 ). Similarly, we add a binary variable Da allowing the evaluation of both P(Za ≥ C) and P(Za < C). It can be of interest to consider the distribution of the peak height in the light of other observed peaks, and not just the marginal distribution of the peak itself. For instance, we can condition on the peak heights of all other alleles to get P(Za ≤ z | Zb = zb , b 6= a, Za ≥ C), or we could include this information for only the preceding alleles in the ordering to get P(Za ≤ z | Zb = zb , b ≤ a, Za ≥ C). These distributions can all be obtained simply through conditioning on relevant variables Oa as described in Section 3.4. In Figure 11, quantile-quantile plots for the conditional distribution of a peak height given observed peak heights for all other alleles are shown for Hp and Hd using trace MC15 and the associated maximum likelihood estimates in Table 2. We note that in both diagrams the points are close to the identity line and there is no indication that the peak height distributions are inadequately modelled under either of the hypotheses. 17

P(Za ≤ za | Za ≥ C) 0.0 0.2 0.4 0.6 0.8 1.0

P(Za ≤ za | Za ≥ C) 0.0 0.2 0.4 0.6 0.8 1.0

Prosecution hypothesis ●● ●●●● ● ●●● ●●●● ● ●● ● ●●● ● ●●● ●● ● ●●●● ●●●● ● ●● ● ● ● ● ●●●

0.0

0.2 0.4 0.6 0.8 Uniform quantiles

1.0

Defence hypothesis ● ●●● ● ●● ● ● ●●● ● ● ●●● ●●● ●●●● ● ●●● ● ●● ●● ●● ●●● ●● ●● ●●●●

0.0

0.2 0.4 0.6 0.8 Uniform quantiles

1.0

Figure 11: Quantile-quantile plots for the prosecution and defence hypotheses for MC15.

We can also take a closer look at the distribution of the peak height at any single allele, for example to identify outlying observations. This is illustrated in Figure 12. Boxes indicate quartiles and whiskers indicate 0.5% and 99.5% prediction limits for the conditional distributions of peak heights P(Za ≤ z | Zb = zb , b 6= a, Za ≥ C). The quantiles are found by numerical inversion of the distribution function (11).

0

Peak height 500 1000

1500

Prosecution hypothesis

15

16

17

18

19

20

21 22 Allele

23

24

25

26

27

23

24

25

26

27

0

Peak height 500 1000

1500

Defence hypothesis

15

16

17

18

19

20

21 22 Allele

Figure 12: Comparison of observed peak heights to their predictive distribution conditionally on all other observed peak heights for marker D2S1338. The bar below each peak indicates the probabilities of observing (grey) and not observing (black) a peak at this allele.

We note that although the observed peak heights at alleles 23 and 24 are somewhat lower than expected, there are no observations that are clear outliers, conforming with the quantile-quantile plots in Figure 11. Note that the prosecution hypothesis predicts complete absence of peaks at alleles 18–21 and 25-27, whereas this is not the case for 18

the defence hypothesis involving alleles from unknown contributors; hence under this hypothesis peaks are a priori possible at any allele. 4.1.2

Prequential monitoring of peak presence

Next, we wish to investigate whether our model correctly predicts absence and presence of peaks in the EPG. We use the prequential theory of Dawid (1984) with so-called prequential monitors (Seillier-Moiseiwitsch and Dawid, 1993). Using some arbitrary ordering, we consider the set of alleles across all markers and the probability that a peak has been seen for allele a given the peak heights observed on all preceding alleles, pa = P(Za ≥ C | zi , i < a) = P(Da = 1 | zi , i < a) which can be obtained by propagation as described in Section 3.4. For each allele a, we then consider the logarithmic score ( − log pa , if za ≥ C Ya = − log(1 − pa ), if za < C so that Ya is always non-negative and higher values of Ya represent a large penalty for assigning a small probability (pa or 1 − pa ) to the event that actually happens. The cumulative logarithmic score, adjusted for incremental expectations, Ma =

a X

{Yi − E(Yi | Zb , b < i)}

i=1

is a martingale with respect to the sequence of peak heights. As V(Ma − Ma−1 | Zb , b < a) = V(Ya | Zb , b < a), the distribution of the normalised cumulative score P Pa Yi − ai=1 E(Yi | Zb , b < i) i=1 pPa i=1 V(Yi | Zb , b < i) approaches a standard normal distribution as the denominator becomes infinitely large (Seillier-Moiseiwitsch and Dawid, 1993). Thus for q1−α being the 1 − α quantile of the standard normal distribution, v u a uX q1−α t V(Yi | Zb , b < i) i=1

is an approximate pointwise 1 − α upper predictive limit for the cumulative score at allele a. The cumulative score can easily be calculated using that if pa ∈ {0, 1} we have Ya = 0 and otherwise E(Ya | Zb , b < a) = −pa log pa − (1 − pa ) log(1 − pa ), V(Ya | Zb , b < a) = pa (1 − pa ) {log pa − log(1 − pa )}2 . Prequential monitor plots of the prosecution and defence hypothesis for MC15 are displayed in Figure 13. 19

●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●● ●●●●● ●●●●●●●●●●●●●●● ●●●●●●●●● ●●●●●●●●●●●●●●●●●●●● ●●●● ●● ● ●●● ● ●●● ●● ● ●●●●●●●●●●●●●

−5

Cumulative score 0 5 10

Prosecution hypothesis

D19S433 D21S11

D2S1338 D8S1179 D3S1358 FGA

TH01 VWA

Defense hypothesis ●●● ● ●●●●●●●●● ●●● ●● ●●●●●●●●●●●●●●●●●●● ● ●●●● ●●●●●● ●●●●●●●●● ●●●●●●●● ●●●●●●●●● ●●●●●●● ●●●●●●● ●● ●●●●●●●●●●●●●●● ● ●●●●●●

−5

Cumulative score 0 5 10 15

D16S539 D18S51

D16S539 D18S51

D19S433 D21S11

D2S1338 D8S1179 D3S1358 FGA

TH01 VWA

Figure 13: Prequential monitor plots of the prosecution and defence hypotheses for MC15. The dashed horisontal lines indicate upper 95% and 99% pointwise predictive limits based on the approximating normal distribution.

A negative jump in the score means that we have observed what the model predicts as most likely, whereas a positive jump means that we have observed the opposite of what is most likely according to the model. If it is equally likely for a peak to fall above and below the threshold, or there is only one possible outcome — i.e. if pa ∈ {0, 1/2, 1} — there is no jump. The size of an upward jump indicates the level of disagreement between model and observations. Note that for the defence hypothesis, the monitors cross the upper limits towards the end of the plot, indicating that this hypothesis may not adequately describe the pattern of observed peaks. Further investigation may reveal whether upward jumps are due to observation of rare alleles or, for example, due to recording errors in the data.

4.2

Simulation

As noted in Section 3.4, introducing evidence on the auxiliary variables Oa yields a representation of the posterior distribution of the genotypes of the unknown contributors. This in turn enables simulation of a full DNA trace including peak heights, either marginally or conditionally on relevant subsets of the observed peak heights. More generally, we have for any event B that fψ ({za }a∈A , n | B) = fψ ({za }a∈A | n, B)p(n | B). If conditioning with B can be represented by propagation in our Bayesian network, for example if B = {Zb = zb , b 6= a}, we can easily simulate from p(n | B) by standard methods (Cowell et al., 1999, Section 6.4.3). Thus to sample a full DNA trace, we just further need a method for sampling from fψ ({za }a∈A | n, B). This method of simulation can for example be used in a bootstrap analysis of the estimation uncertainty as in Graversen and Lauritzen (2013). Simulation could also be 20

relevant for assessing the discriminatory ability of the calculated likelihood ratio, for illustration of peak height variability, and other forms of model validation. Below we are exploiting simulation in the prediction of profiles of unknown contributors.

4.3

Prediction of unknown profiles

In a model involving unknown contributors it can be relevant to investigate the distribution of genotypes for each of these conditionally on the evidence. Focusing on a single or few alleles, we can explore this distribution directly. For any combination of genotypes we can compute its probability exactly by probability propagation. We can identify those of highest probability by sampling genotypes until a proportion p of the probability mass has been visited as then each of the remaining combinations of genotypes must have probability at most 1 − p. Thus the r combinations with probability strictly greater than 1 − p must be among those sampled. They can then be ranked according to their probability and constitute the list of the r most probable combinations. Here the number r depends on the probability p chosen. Considering the defence hypothesis of trace MC15, we would like to identify the genotype of the unknown contributor U . If we consider the full genotype, at all markers, we often get a very diffuse distribution as for example reported in Cowell et al. (2013). One reason for this is that, due to dropout, there are generally many unseen alleles that could be present in the mixture without giving rise to a peak. However, if we focus on explaining the peaks actually seen in the EPG we get a more concentrated distribution, as displayed in Table 3, where the total probability of the six combinations add up to one. As the table shows, the probability that the unknown contributor has at Table 3: Probabilities of genotype at marker D2S1338 for the unknown contributor U under the defence hypothesis. The defendant K3 has genotype (16,17).

16 17 23 24 D 1 1 0 0 0 0 1 0 0 1 0 2 0 0 0 0 1 0 1 0 0 1 1 0 0 1 0 0 0 1 Total probability

Prob 0.5276 0.1861 0.1697 0.0640 0.0509 0.0017 1.0000

least one allele 17 is .9983, close to certainty. There is some uncertainty concerning the second allele which can be virtually anything although it is by far most probable that the genotype is (16, 17); this genotype is that of the defendant K3 . The second most probable explanation of the trace is that the other allele has dropped out.

4.4

Strength of evidence when ignoring peak heights

Another potential application of the auxiliary variables is to calculate a likelihood ratio which only uses information about peak presence or absence. This can be done by specifying evidence for the nodes Da introduced in Section 4.1 rather than for nodes Oa . It is still necessary to specify a set of model parameters, which for example could be estimated using peak heights. Using the estimates in Table 2 we obtain a likelihood ratio 21

of log10 LR = 9.85 which is weaker than the evidence obtained with full peak height information but it is still incriminating for the defendant. Such an analysis is analogous to the one used in likeLTD as suggested by Balding (2013), where peak heights are used only to classify peaks as present, absent, or uncertain. We have used peak heights to estimate the parameters of the model. In principle parameters could also be estimated solely on the peak presence information, possibly in combination with prior information on some of these, although such estimates would be ill-determined and therefore not useful.

4.5

Multiple mixed traces

By adding more auxiliary variables to the model, we can easily extend the model to handle multiple traces, either with independent unknown contributors or where some or all unknown contributors coincide. We assume that the peak heights across mixed traces are conditionally independent given the genotypes of common contributors. Peak height distributions are allowed to vary across traces through the model parameters. The network now models the set of all unknown contributors to the mixed traces. Denote by φji the proportion of DNA that contributor i has made to trace j. Then φji = 0 corresponds to contributor i not being present in trace j. Therefore, the case where some or all contributors are distinct to a particular mixed trace is a sub-model corresponding to φji = 0 for some (i, j). An advantage of this specification of the joint model is that we do not need to make assumptions about possible common unknown contributors to the traces, but we can let the maximisation of the likelihood point to the relevant scenario. This has been used in Cowell et al. (2013) for a combined analysis of MC15 with another trace pertaining to the same case. In the case where the traces have completely independent unknown contributors, it is recommendable to represent each trace as a separate network to limit the number of unknown contributors in each network.

5

Discussion

We note that our computational methods are exact throughout under the model adopted, and that the only approximations relate to the model representing an inevitable approximation to reality, and possible imprecision of numerical methods. Nevertheless, using the efficient junction tree representations and exact compression methods as described in Section 3.5.1, we are able to handle more contributors than what has previously been possible. We have far from exhausted the flexibility and the potential of the Bayesian network model and point out that simple modifications or elaborations of the basic network can readily be used to, say, incorporate the presence of silent alleles simply by including an extra allele in the genotype representation, or to enable the direct computation of the probability that a specific peak is due to stutter or an absent peak is due to random dropout or allele absence; see Cowell et al. (2013) for this and further examples.

22

References Balding, D. (2013). Evaluation of mixed-source, low-template DNA profiles in forensic science. Proceedings of the National Academy of Sciences of the United States of America. Published online doi:10.1073/pnas.1219739110. Bill, M., Gill, P., Curran, J., Clayton, T., Pinchin, R., Healy, M., and Buckleton, J. (2005). PENDULUM – a guideline-based approach to the interpretation of STR mixtures. Forensic Science International, 148:181–189. Cowell, R. G., Dawid, A. P., Lauritzen, S. L., and Spiegelhalter, D. J. (1999). Probabilistic Networks and Expert Systems. Springer-Verlag, New York. Cowell, R. G., Graversen, T., Lauritzen, S., and Mortera, J. (2013). Analysis of DNA mixtures with artefacts. arXiv:1302:4404. Cowell, R. G., Lauritzen, S. L., and Mortera, J. (2011). Probabilistic expert systems for handling artifacts in complex DNA mixtures. Forensic Science International: Genetics, 5:202–209. Dawid, A. P. (1984). Statistical theory. The prequential approach. Journal of the Royal Statistical Society, Series A, 147:277–305. Dawid, A. P., Mortera, J., Pascali, V. L., and van Boxel, D. W. (2002). Probabilistic expert systems for forensic inference from genetic markers. Scandinavian Journal of Statistics, 29:577–595. Gill, P., Curran, J., Neumann, C., Kirkham, A., Clayton, T., Whitaker, J., and Lambert, J. (2008). Interpretation of complex DNA profiles using empirical models and a method to measure their robustness. Forensic Science International: Genetics, 2:91– 103. Graversen, T. (2013). DNAmixtures: Statistical Inference for Mixed Traces of DNA. R package version 0.1-0, dnamixtures.r-forge.r-project.org/. Graversen, T. and Lauritzen, S. (2013). Estimation of parameters in DNA mixture analysis. Journal of Applied Statistics. Published online doi:10.1080/02664763.2013.817549. Green, P. J. and Mortera, J. (2009). Sensitivity of inferences in forensic genetics to assumptions about founder genes. Annals of Applied Statistics, 3:731–763. Hugin Expert A/S (2013). Hugin API Reference Manual, Version 7.7. Hugin Expert A/S, Aalborg, Denmark. Konis, K. (2013). RHugin. R package version 7.7-5, rhugin.r-forge.r-project.org. Mortera, J., Dawid, A. P., and Lauritzen, S. L. (2003). Probabilistic expert systems for DNA mixture profiling. Theoretical Population Biology, 63:191–205. Puch-Solis, R., Rodgers, L., Mazumder, A., Pope, S., Evett, I., Curran, J., and Balding, D. (2012). Evaluating forensic DNA profiles using peak heights, allowing for multiple donors, allelic dropout and stutters. Technical report, LGC Research Report LGC/P/2012/138. 23

Seillier-Moiseiwitsch, F. and Dawid, A. P. (1993). On testing the validity of sequential probability forecasts. Journal of the American Statistical Association, 88:355–359. Tvedebrink, T., Eriksen, P. S., Mogensen, H. S., and Morling, N. (2010). Evaluating the weight of evidence by using quantitative short tandem repeat data in DNA mixtures. Applied Statistics, 59:855 – 874.

24