
arXiv:cond-mat/0609015v3 [cond-mat.dis-nn] 7 Aug 2008

1 Dipartimento di Fisica, Università di Siena, Via Roma 56, 53100 Siena ITALY.
2 Dipartimento di Scienze Matematiche ed Informatiche, Università di Siena, Pian dei Mantellini 44, 53100 Siena ITALY.

The choice of free parameters in network models is subjective, since it depends on what topological properties are being monitored. However, we show that the Maximum Likelihood (ML) principle indicates a unique, statistically rigorous parameter choice, associated with a well defined topological feature. We then find that, if the ML condition is incompatible with the built-in parameter choice, network models turn out to be intrinsically ill-defined or biased. To overcome this problem, we construct a class of safely unbiased models. We also propose an extension of these results that leads to the fascinating possibility of extracting, from topological data alone, the ‘hidden variables’ underlying network organization, making them ‘no more hidden’. We test our method on the World Trade Web data, where we recover the empirical Gross Domestic Product using only topological information.

In complex network theory, graph models are systematically used either as null hypotheses against which real-world networks are analysed, or as testbeds for the validation of network formation mechanisms [1]. Until now there has been no rigorous scheme to define network models. However, here we use the Maximum Likelihood (ML) principle to show that undesired statistical biases naturally arise in graph models, which in most cases turn out to be ill-defined. We then show that the ML approach constructively indicates a correct definition of unbiased models. Remarkably, it also allows one to extract hidden information from real networks, with intriguing consequences for the understanding of network formation.

The framework that we introduce here allows us to solve three related, increasingly complicated problems. First, we discuss the correct choice of free parameters. Model parameters are fixed in such a way that the expected values (i.e. ensemble averages over many realizations) of some ‘reference’ topological properties match the empirically observed ones. But since there are virtually as many properties as we want to monitor in a network, and surely many more than the number of model parameters, it is important to ask whether the choice of the reference properties is arbitrary or whether a rigorous criterion exists. We find that the ML method provides us with a unique, statistically correct parameter choice.

Second, we note that the above ML choice may conflict with the structure of the model itself, if the latter is defined in such a way that the expected value of some property, which is not the correct one, matches the corresponding empirical one. We find that the ML method identifies such intrinsically ill-defined models, and can also be used to define safe, unbiased ones.

The third, and perhaps most fascinating, aspect regards the extraction of information from a real network. Many models are defined in terms of additional ‘hidden variables’ [2, 3, 4, 5] associated with vertices. The ultimate aim of these models is to identify the hidden variables with empirically observable quantities, so that the model will provide a mechanism of network formation driven by these quantities. While for a few networks this identification has been carried out successfully [6, 7], in most cases the hidden variables are assigned ad hoc.

However, since in this case the hidden variables play essentially the role of free parameters, one is led again to the original problem: if a non-arbitrary parameter choice exists, we can infer the hidden variables from real data. As a profound and exciting consequence, the quantities underlying network organization are ‘no more hidden’.

In order to illustrate how the ML method solves this three-fold problem successfully, we use equilibrium graph ensembles as an example. All network models depend on a set of parameters that we collectively denote by the vector $\vec{\theta}$. Let $P(G|\vec{\theta})$ be the conditional probability of occurrence of a graph $G$ in the ensemble spanned by the model. For a given topological property $\pi(G)$ displayed by a graph $G$, the expected value $\langle\pi\rangle_{\vec{\theta}}$ reads

$$\langle\pi\rangle_{\vec{\theta}} \equiv \sum_{G} \pi(G)\,P(G|\vec{\theta}) \tag{1}$$

In order to reproduce a real-world network $A$, one usually chooses some reference properties $\{\pi_i\}_i$ and then sets $\vec{\theta}$ to the ‘matching value’ $\vec{\theta}_M$ such that

$$\langle\pi_i\rangle_{\vec{\theta}_M} = \pi_i(A) \qquad \forall i \tag{2}$$
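As a minimal sketch of eqs. (1) and (2), assume the simplest one-parameter ensemble: a random graph in which each of the N(N-1)/2 vertex pairs is linked independently with probability p, with the number of links L as the reference property. The function names, the sample size and the "observed" network below are illustrative assumptions, not part of the original analysis. The ensemble average of eq. (1) is estimated by sampling, and the matching value of eq. (2) follows from the known average p N(N-1)/2.

```python
import random


def sample_link_count(n, p, rng):
    """Draw one random graph on n vertices and return its number of links."""
    return sum(1 for i in range(n) for j in range(i + 1, n) if rng.random() < p)


def expected_links_mc(n, p, samples=2000, seed=0):
    """Monte Carlo estimate of the ensemble average <L>_p of eq. (1)."""
    rng = random.Random(seed)
    return sum(sample_link_count(n, p, rng) for _ in range(samples)) / samples


def matching_p(n, links):
    """Matching value p_M of eq. (2), using <L>_p = p * n * (n - 1) / 2."""
    return 2.0 * links / (n * (n - 1))


if __name__ == "__main__":
    n, observed_links = 20, 57  # hypothetical observed network
    p_m = matching_p(n, observed_links)
    # The sampled ensemble average should be close to the observed link count.
    print(p_m, expected_links_mc(n, p_m))
```

Choosing a different reference property in place of L would change `matching_p`, which is exactly the arbitrariness discussed next.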

Our first problem is: is this method statistically rigorous? And which properties should be chosen? A simple example is a real undirected network $A$ with $N$ vertices and $L$ undirected links, compared with a random graph where the only parameter is the connection probability $\theta = p$. The common choice for $p$ is such that the expected number of links $\langle L\rangle_p = pN(N-1)/2$ equals the empirical value $L$, which yields $p_M = 2L/[N(N-1)]$. But one could alternatively choose $p$ in such a way that the expected value $\langle C\rangle$ of the clustering coefficient matches the empirical value $C$, resulting in the different choice $p_M = C$. Similarly, one could choose any other reference property $\pi$, and end up with different values of $p$. Therefore, in principle the optimal choice of $p$ is undetermined, due to the arbitrariness of the reference property. However, we now show that the ML approach indicates a unique, statistically correct parameter choice.

Consider a random variable $v$ whose probability distribution $f(v|\theta)$ depends on a parameter $\theta$. For a physically realized outcome $v = v'$, $f(v'|\theta)$ represents the likelihood that $v'$ is generated by the parameter choice $\theta$. Therefore, for fixed $v'$, the optimal choice for $\theta$ is the value $\theta^*$ maximizing $f(v'|\theta)$ or equivalently $\lambda(\theta) \equiv \log f(v'|\theta)$. The ML approach avoids the drawbacks of other fitting methods, such as the subjective choice of fitting curves and of the region where the fit is performed. This is particularly important for networks, often characterized by broad distributions that may look like power laws with a certain exponent (subject to statistical error) in some region, but that may be more closely reproduced by another exponent or even by different curves as the fitting region is changed. By contrast, the ML approach always yields a unique and rigorous parameter value. Examples of recent applications of the ML principle to networks can be found in [8, 9].

In our problem, the log-likelihood that a real network $A$ is generated by the parameter choice $\vec{\theta}$ is

$$\lambda(\vec{\theta}) \equiv \log P(A|\vec{\theta}) \tag{3}$$

and the ML condition for the optimal choice $\vec{\theta}^*$ is

$$\vec{\nabla}\lambda(\vec{\theta}^*) = \left[\frac{\partial\lambda(\vec{\theta})}{\partial\vec{\theta}}\right]_{\vec{\theta}=\vec{\theta}^*} = \vec{0} \tag{4}$$

This gives a unique solution to our first problem. For instance, in the random graph model we have

$$P(A|p) = p^{L}\,(1-p)^{N(N-1)/2-L} \tag{5}$$

Writing the log-likelihood function $\lambda(p) = \log P(A|p)$ and looking for the ML value $p^*$ such that $\lambda'(p^*) = 0$ yields

$$p^* = \frac{2L}{N(N-1)} \tag{6}$$

Therefore we find that the ML value for $p$ is the one we obtain by requiring $\langle L\rangle = L$. In general, different reference quantities (for instance the clustering coefficient) would not yield the statistically correct ML value. For the random graph model the above correct choice is also the most frequently used. However, more complicated models may be intrinsically ill-defined, as there may be no possibility to match expected and observed values of the desired reference properties without violating the ML condition. This is the second problem we anticipated.

To illustrate it, it is enough to consider a slightly more general class of models, obtained when the links between all pairs of vertices $i, j$ are drawn with different and independent probabilities $p_{ij}(\vec{\theta})$ [2, 3, 4, 5]. Now

$$P(A|\vec{\theta}) = \prod_{i<j} p_{ij}(\vec{\theta})^{a_{ij}}\,[1-p_{ij}(\vec{\theta})]^{1-a_{ij}} \tag{7}$$

where the product runs over vertex pairs $(i, j)$, and $a_{ij} = 1$ if $i$ and $j$ are connected in graph $A$, and $a_{ij} = 0$ otherwise. Then eq.(3) becomes

$$\lambda(\vec{\theta}) = \sum_{i<j} a_{ij}\log\frac{p_{ij}(\vec{\theta})}{1-p_{ij}(\vec{\theta})} + \sum_{i<j}\log[1-p_{ij}(\vec{\theta})] \tag{8}$$

For instance, in hidden variable models [2, 3, 4] $p_{ij}$ is a function of a control parameter $\theta \equiv z$ and of some quantities $x_i$, $x_j$ that we assume fixed for the moment. As a first example, consider the popular bilinear choice [2, 3, 4, 5]

$$p_{ij}(z) = z x_i x_j \tag{9}$$

Writing $\lambda(z) = \log P(A|z)$ as in eq.(8) and differentiating yields

$$\lambda'(z^*) = \sum_{i<j}\left[\frac{a_{ij}}{z^*} - \frac{(1-a_{ij})\,x_i x_j}{1 - z^* x_i x_j}\right] = 0 \tag{10}$$

Since $\sum_{i<j} a_{ij} = L$, the condition for $z^*$ becomes

$$L = \sum_{i<j} (1-a_{ij})\,\frac{z^* x_i x_j}{1 - z^* x_i x_j} \tag{11}$$

This shows that if we set $z = z^*$, then $L$ is in general different from the expected value $\langle L\rangle_{z^*} = \sum_{i<j} p_{ij}(z^*) = z^*\sum_{i<j} x_i x_j$.

Consider instead the alternative form

$$p_{ij}(z) = \frac{z x_i x_j}{1 + z x_i x_j} \tag{12}$$

Writing $\lambda(z)$ and setting $\lambda'(z^*) = 0$ now yields

$$L = \sum_{i<j} \frac{z^* x_i x_j}{1 + z^* x_i x_j} \tag{13}$$

which now coincides with $\langle L\rangle_{z^*} = \sum_{i<j} p_{ij}(z^*)$, so that in this case the ML condition is equivalent to requiring $\langle L\rangle_{z^*} = L$.
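The contrast between eqs. (9)-(11) and eqs. (12)-(13) can be checked numerically. The sketch below is only an illustration under stated assumptions: the four hidden variables and the single observed link are made-up inputs, the function names are hypothetical, and bisection is just one convenient way to solve $\lambda'(z^*) = 0$. It also includes the log-likelihood of eq. (5), so that the maximum at $p^*$ of eq. (6) can be verified directly.

```python
import math
from itertools import combinations


def loglik_random_graph(p, n, links):
    """Logarithm of eq. (5): L log p + [n(n-1)/2 - L] log(1 - p)."""
    pairs = n * (n - 1) // 2
    return links * math.log(p) + (pairs - links) * math.log(1.0 - p)


def bisect_decreasing(f, lo, hi, iters=200):
    """Root of a decreasing function on (lo, hi); tends to hi if f stays positive."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if f(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def ml_z_model9(edges, x):
    """ML value z* for p_ij = z x_i x_j, solving lambda'(z) = 0 as in eq. (10)."""
    pairs = list(combinations(range(len(x)), 2))
    z_max = 1.0 / max(x[i] * x[j] for i, j in pairs)  # keeps every p_ij below 1

    def dlam(z):
        total = 0.0
        for i, j in pairs:
            w = x[i] * x[j]
            total += 1.0 / z if (i, j) in edges else -w / (1.0 - z * w)
        return total

    return bisect_decreasing(dlam, 1e-12, z_max * (1.0 - 1e-12))


def ml_z_model12(edges, x):
    """ML value z* for p_ij = z x_i x_j / (1 + z x_i x_j), solving eq. (13).

    Assumes 0 < L < number of vertex pairs, otherwise no finite root exists.
    """
    def excess(z):  # L - <L>_z, a decreasing function of z
        return len(edges) - expected_links_model12(z, x)

    hi = 1.0
    while excess(hi) > 0.0:
        hi *= 2.0
    return bisect_decreasing(excess, 0.0, hi)


def expected_links_model9(z, x):
    return sum(z * x[i] * x[j] for i, j in combinations(range(len(x)), 2))


def expected_links_model12(z, x):
    return sum(z * x[i] * x[j] / (1.0 + z * x[i] * x[j])
               for i, j in combinations(range(len(x)), 2))


if __name__ == "__main__":
    x = [0.5, 1.0, 1.5, 2.0]  # hypothetical fixed hidden variables
    edges = {(2, 3)}          # hypothetical observed links (i < j), so L = 1
    z9, z12 = ml_z_model9(edges, x), ml_z_model12(edges, x)
    print("model (9):  <L> at z* =", expected_links_model9(z9, x), "vs L =", len(edges))
    print("model (12): <L> at z* =", expected_links_model12(z12, x), "vs L =", len(edges))
```

For these inputs the bilinear model gives an expected link count at $z^*$ that differs from the observed one, while the second model reproduces it to numerical precision.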

We now show that one large class of unbiased models can be constructively defined, namely the exponential random graphs traditionally used by sociologists [12, 13] and more recently considered by physicists [11, 14, 15, 16]. If $\{\pi_i\}_i$ is a set of topological properties, an exponential model is defined by the probability

$$P(G|\vec{\theta}) = e^{-H(G|\vec{\theta})}/Z(\vec{\theta}) \tag{14}$$

where $H(G|\vec{\theta}) \equiv \sum_i \pi_i(G)\,\theta_i$ is the graph Hamiltonian and $Z(\vec{\theta}) \equiv \sum_G \exp[-H(G|\vec{\theta})]$ is the partition function [11, 14, 15, 16]. In the standard approach, one chooses the matching value $\vec{\theta}_M$ fitting the properties of a real network. In order to check whether this violates the ML principle, we need to look for the value $\vec{\theta}^*$ maximizing the likelihood to obtain a network described by a given set $\{\pi_i\}_i$ of reference properties. The log-likelihood function we have defined reads $\lambda(\vec{\theta}) \equiv \log P(A|\vec{\theta}) = -H(A|\vec{\theta}) - \log Z(\vec{\theta})$ and eq.(4) gives for $\vec{\theta}^*$

$$\left[\frac{\partial\lambda(\vec{\theta})}{\partial\theta_i}\right]_{\vec{\theta}=\vec{\theta}^*} = -\pi_i(A) - \left[\frac{1}{Z(\vec{\theta})}\frac{\partial Z(\vec{\theta})}{\partial\theta_i}\right]_{\vec{\theta}=\vec{\theta}^*} = 0 \tag{15}$$

whose solution yields the ML condition

$$\pi_i(A) = \sum_{G} \pi_i(G)\,e^{-H(G|\vec{\theta}^*)}/Z(\vec{\theta}^*) = \langle\pi_i\rangle_{\vec{\theta}^*} \qquad \forall i \tag{16}$$

which is equivalent to eq.(2): remarkably, $\vec{\theta}^* = \vec{\theta}_M$ and the model is unbiased. We have thus proved a remarkable result: any model of the form in eq.(14) is unbiased under the ML principle, if and only if all the properties $\{\pi_i\}_i$ included in $H$ are simultaneously chosen as the reference ones used to tune the parameters $\vec{\theta}$. The statistically correct values $\vec{\theta}^*$ of the latter are the solution of the system of (in general coupled) equations (16). There are as many such equations as the number of free parameters.

This gives us the following recipe: if we are defining a model whose predictions will be matched to a set of properties $\{\pi_i(A)\}_i$ observed in a real-world network $A$, we should decide from the beginning what these reference properties are, include them in $H(G|\vec{\theta})$ and define $P(G|\vec{\theta})$ as in eq.(14). In this way we are sure to obtain an unbiased model. The random graph is a trivial special case where $\pi(A) = L$ and $H(G|\theta) = \theta L$ with $p \equiv (1 + e^{\theta})^{-1}$ [11], and this is the reason why it is unbiased, if $L$ is chosen as reference. The hidden-variable model defined by eq.(12) is another special case where $\pi_i(A) = k_i$ and $H(G|\vec{\theta}) = \sum_i \theta_i k_i$ with $x_i \equiv e^{-\theta_i}$ [11], and so it is unbiased too. By contrast, eq.(9) cannot be traced back to eq.(14), and the model is biased.

Once the general procedure is set out, one can look for other special cases. The field of research on exponential random graphs is currently very active [11, 14, 15, 16, 17, 18], and models including correlations and higher-order properties are being studied, for instance to explore graphs with nontrivial reciprocity [17] and clustering [18]. For each of these models, our result (16) directly yields the unbiased parameter choice in terms of the associated reference properties.

We can now address the third problem. In the cases considered so far we assumed that the values of the hidden variables $\{x_i\}_i$ were pre-assigned to the vertices. This occurs when we have a candidate quantity to identify with the hidden variable [6, 7]. However we can reverse the point of view and extend the ML approach so that, without any prior information, the hidden variables are included in $\vec{\theta}$ and treated as free parameters themselves, to be tuned to their ML values $\{x_i^*\}_i$. In this way, hidden variables will be no longer ‘hidden’, since they can be extracted from topological data. This is an exciting possibility that can be applied to any real network. Moreover, this extension of the parameter space also allows us to match $N$ additional properties besides the overall number of links. However, the unbiased choice of these properties must be dictated by the ML principle. For instance, let us look back at the model defined in eq.(12), now considering $x_i$ and $x_j$ not as fixed quantities, but as free parameters exactly as $z$, to be included in $\vec{\theta}$. Differentiating $\lambda(\vec{\theta}) = \lambda(z, x_1, \dots, x_N)$ with respect to $z$ gives again eq.(13) with $x_i$ replaced by $x_i^*$, and differentiating with respect to $x_i$ yields the $N$ additional equations

$$k_i = \sum_{j\neq i} \frac{z^* x_i^* x_j^*}{1 + z^* x_i^* x_j^*} \qquad i = 1, \dots, N \tag{17}$$
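Equations (17) are coupled and nonlinear, so in practice they have to be solved numerically. The following sketch uses a simple fixed-point iteration, which is an assumption of this example rather than a method stated in the text; it also absorbs $z^*$ into the $x_i^*$ (only the products $z^* x_i^* x_j^*$ enter the equations). The degree sequence, function names, starting guess and tolerance are all hypothetical, and convergence is not guaranteed for arbitrary degree sequences.

```python
import math


def solve_hidden_variables(degrees, tol=1e-10, max_iter=10000):
    """Fixed-point iteration for eqs. (17), with z* absorbed into the x_i*.

    Iterates x_i <- k_i / sum_{j != i} x_j / (1 + x_i x_j).
    Assumes every degree lies strictly between 0 and N - 1; otherwise the
    corresponding x_i* tends to 0 or to infinity.
    """
    n = len(degrees)
    norm = math.sqrt(sum(degrees))
    x = [k / norm for k in degrees]  # heuristic starting guess (assumption)
    for _ in range(max_iter):
        new_x = []
        for i in range(n):
            denom = sum(x[j] / (1.0 + x[i] * x[j]) for j in range(n) if j != i)
            new_x.append(degrees[i] / denom)
        change = max(abs(a - b) for a, b in zip(new_x, x))
        x = new_x
        if change < tol:
            return x
    raise RuntimeError("fixed-point iteration did not converge")


def expected_degrees(x):
    """Right-hand side of eqs. (17) for given hidden variables (z absorbed)."""
    n = len(x)
    return [sum(x[i] * x[j] / (1.0 + x[i] * x[j]) for j in range(n) if j != i)
            for i in range(n)]


if __name__ == "__main__":
    degrees = [4, 4, 4, 6, 6, 6, 6, 6, 6, 6]  # hypothetical observed degrees
    x_star = solve_hidden_variables(degrees)
    print(x_star)
    print(expected_degrees(x_star))  # should reproduce the observed degrees
```

Vertices with equal degree receive equal values, so in this example only two distinct hidden variables appear.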

Therefore we find that the $N$ correct reference properties for this model are the degrees: $\langle k_i\rangle_{\vec{\theta}^*} = \sum_{j\neq i} p_{ij}(\vec{\theta}^*) = k_i$. This is not true in general: the model (9) would imply different reference properties such that $\langle k_i\rangle \neq k_i$, so that choosing the degrees as the properties to match would bias the parameter choice. Again, this difference arises because eq.(17) corresponds to eq.(16) for the exponential model $H(G|\vec{\theta}) = \sum_i \theta_i k_i$ [11], while the model in eq.(9) cannot be put in an exponential form. We stress that, although eq.(17) is formally identical to the familiar expression yielding $\langle k_i\rangle$ as a function of $\{x_i\}_i$ if the latter are fixed [11], its meaning here is completely reversed: the degrees $k_i$ are fixed by observation and the unknown hidden variables are inferred from them through the ML condition. This is our key result.

Note that, although determining the $x_i^*$'s requires solving the $N + 1$ coupled equations (13) and (17), the number of independent expressions is much smaller since: i) eqs.(17) automatically imply eq.(13), so we can reabsorb $z^*$ in a redefinition of $x_i^*$ and discard eq.(13); ii) all vertices with the same degree $k$ obey equivalent equations and hence are associated with the same value $x_k^*$. So eqs.(17) reduce to

$$k = \sum_{k'} P(k')\,\frac{x_k^* x_{k'}^*}{1 + x_k^* x_{k'}^*} - \frac{(x_k^*)^2}{1 + (x_k^*)^2} \tag{18}$$

where $P(k)$ is the number of vertices with degree $k$, the last term removes the self-contribution of a vertex to its own degree, and $k$ and $k'$ take only their empirical values. Hence the number of nonequivalent equations equals the number of distinct degrees that are actually observed, which is always much less than $N$.

We can test our method on the World Trade Web (WTW) data, since from the aforementioned previous study we know that the GDP of each country plays the role of the hidden variable $x_i$, and that the real WTW is well reproduced by eq.(12) [6]. We can first use eq.(18) to find the values $\{x_i^*\}_i$ by exploiting only topological data (the degrees $\{k_i\}_i$), and then compare these values with the empirical GDP of each country $i$ (which is independent of topological data), rescaled to its mean to factor out physical units. As shown in fig.1, the two variables indeed display a linear trend over several orders of magnitude. Therefore our method successfully identifies the GDP as the hidden variable.

FIG. 1: ML hidden variables ($x_i^*$) versus GDP rescaled to the mean ($w_i$) for the WTW (year 2000), and linear fit.

Clearly, our approach can be used to uncover hidden factors from other real-world networks, such as biological and social webs. An example is that of food web models [19], where it is assumed that predation probabilities depend on hypothetical niche values $n_i$ associated with each species. Our formalism allows one to extract niche values directly from empirical food webs, and not from ad hoc statistical distributions [19]. Another interesting application is to gene regulatory networks, where the lengths of regulatory sequences and promoter regions have been shown to determine the connection probability $p_{ij}$ [20]. Similarly, our approach allows one to extract the vertex-specific quantities (such as expansiveness, attractiveness or mobility-related parameters) that are commonly assumed to determine the topology and community structure of social networks [12, 13, 21]. In all these cases, the hypotheses can be tested against real data by plugging any particular form of $p_{ij} = p(x_i, x_j)$ into eq.(8) and looking for the values $\{x_i^*\}_i$ that solve eq.(4), i.e.

$$\sum_{j\neq i} \frac{a_{ij} - p(x_i^*, x_j^*)}{p(x_i^*, x_j^*)\,[1 - p(x_i^*, x_j^*)]} \left[\frac{\partial p(x_i, x_j)}{\partial x_i}\right]_{\vec{x}=\vec{x}^*} = 0 \qquad \forall i \tag{19}$$

Note that for eq.(12) one correctly recovers eq.(17). Once obtained, the values $\{x_i^*\}_i$ can be compared with the (totally independent) empirical ones to check for significant correlations, as we have done for the GDP data. Clearly, an important open problem to address in the future is understanding the conditions under which eq.(19), and similarly eq.(18) for a generic $P(k)$, can be solved.

We have shown that the ML principle indicates the statistically correct parameter values of network models, making the choice of reference properties no longer arbitrary. It also identifies undesired biases in graph models, and allows one to overcome them constructively. Most importantly, it provides an elegant way to extract information from a network by uncovering the underlying hidden variables. This possibility, which we have empirically tested in the case of the World Trade Web, opens the way to a variety of applications in economics, biology, and social science.

After submission of this article, we became aware of later studies based on a similar idea [9, 22].
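As a worked check of the remark that eq.(19) reduces to eq.(17) for the model of eq.(12), the derivation can be sketched as follows. The shorthand $p_{ij}$ for $p(x_i, x_j)$ is introduced here only for compactness.

```latex
% Model of eq.(12): p_{ij} = z x_i x_j / (1 + z x_i x_j)
\frac{\partial p_{ij}}{\partial x_i} = \frac{z x_j}{(1 + z x_i x_j)^2},
\qquad
p_{ij}\,(1 - p_{ij}) = \frac{z x_i x_j}{(1 + z x_i x_j)^2}
\;\;\Longrightarrow\;\;
\frac{1}{p_{ij}(1 - p_{ij})}\,\frac{\partial p_{ij}}{\partial x_i} = \frac{1}{x_i}

% Inserting this ratio into eq.(19), evaluated at the ML values:
\frac{1}{x_i^*}\sum_{j \neq i}\left[a_{ij} - p_{ij}(z^*, x_i^*, x_j^*)\right] = 0
\;\;\Longrightarrow\;\;
k_i = \sum_{j \neq i} a_{ij} = \sum_{j \neq i}\frac{z^* x_i^* x_j^*}{1 + z^* x_i^* x_j^*}
```

The last equality is eq.(17). The simplification relies on the ratio in the first line being independent of $j$, which holds for eq.(12) but not for a generic $p(x_i, x_j)$.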

[1] G. Caldarelli, Scale-Free Networks: Complex Webs in Nature and Technology (Oxford University Press, Oxford, 2007).
[2] G. Caldarelli, A. Capocci, P. De Los Rios and M.A. Muñoz, Phys. Rev. Lett. 89, 258702 (2002).
[3] B. Söderberg, Phys. Rev. E 66, 066121 (2002).
[4] M. Boguñá and R. Pastor-Satorras, Phys. Rev. E 68, 036112 (2003).
[5] F. Chung and L. Lu, Ann. of Combin. 6, 125 (2002).
[6] D. Garlaschelli and M.I. Loffredo, Phys. Rev. Lett. 93, 188701 (2004).
[7] D. Garlaschelli, S. Battiston, M. Castri, V.D.P. Servedio and G. Caldarelli, Physica A 350, 491 (2005).
[8] J. Berg and M. Lässig, PNAS 101(41), 14689 (2004).
[9] M.E.J. Newman and E.A. Leicht, PNAS 104, 9564 (2007).
[10] J. Park and M.E.J. Newman, Phys. Rev. E 68, 026112 (2003).
[11] J. Park and M.E.J. Newman, Phys. Rev. E 70, 066117 (2004) and references therein.
[12] P.W. Holland and S. Leinhardt, J. Amer. Stat. Assoc. 76, 33 (1981).
[13] S. Wasserman and K. Faust, Social Network Analysis (Cambridge University Press, Cambridge, 1994).
[14] Z. Burda, J.D. Correia and A. Krzywicki, Phys. Rev. E 64, 046118 (2001).
[15] J. Berg and M. Lässig, Phys. Rev. Lett. 89, 228701 (2002).
[16] A. Fronczak, P. Fronczak and J.A. Hołyst, Phys. Rev. E 73, 016108 (2006).
[17] D. Garlaschelli and M.I. Loffredo, Phys. Rev. E 73, 015101(R) (2006).
[18] P. Fronczak, A. Fronczak and J.A. Hołyst, Eur. Phys. J. B 59, 133 (2007).
[19] R.J. Williams and N.D. Martinez, Nature 404, 180 (2000).
[20] D. Balcan and A. Erzan, CHAOS 17, 026108 (2007).
[21] M.C. González, P.G. Lind and H.J. Herrmann, Phys. Rev. Lett. 96, 088702 (2006).
[22] J.J. Ramasco and M. Mungan, Phys. Rev. E 77, 036122 (2008).