Quantification of parameter uncertainty

The quantification of parameter uncertainty is achieved by following a protocol [1] which aims to translate biological knowledge into informative probability distributions for every parameter, without the need for excessive mathematical and computational effort from the user. It assumes that a functioning kinetic model of the relevant biological system is already available and focuses on the procedures specific to the application of an ensemble modelling approach.

Overview of the protocol

The major protocol steps are the following:

  • Collection and documentation of parameter information from the literature, databases and experiments.
  • Plausibility assessment of each value collected from the literature using a standardized evidence weighting system. This includes assigning weights with regard to the quality, reliability and relevance of each experimental data point.
  • Calculation of the Mode (most plausible value) and Spread (multiplicative standard deviation) from the weighted parameter data sets. These values are found using a function which calculates the weighted median and weighted standard deviation of the log-transformed and normalised parameter values.
  • Calculation of the location (μ) and scale (σ) parameters of the log-normal distribution based on the previously defined Mode and Spread.
  • If required, ensuring the thermodynamic consistency of the model by creating multivariate distributions for interdependent parameters (e.g., the rates of forward and reverse reactions are connected via the equilibrium constant of the reaction and must not be sampled independently).
Figure 1: Workflow of the protocol from parameter collection to the generation of probability distributions and thermodynamic consistency for interconnected parameters. (Abbreviations: CDF: cumulative distribution function; erf: error function; N/A: not available; SD: standard deviation)


Procedure

Step 1: Find and document parameter values

  1. Collect and document all available parameter values from the literature, databases (e.g., BRENDA and BioNumbers) and experiments (a hypothetical example is displayed in Table 1).
    1. No values: If no values can be retrieved for a reaction occurring in a particular organism, broaden the search to include data from phylogenetically related organisms (the uncertainty should then be adjusted accordingly). When no values can be retrieved for a certain parameter, broaden the search constraints; e.g., if the Michaelis–Menten constant, K_M, for a particular substrate is unknown, it might be reasonable to retrieve all K_M values from a database like BRENDA in order to determine the range of generally plausible values for this parameter. In cases where no parameter values are available, an indirect estimate is often possible, based on back-of-the-envelope calculations and fundamental biophysical constraints; such an estimate will have a larger uncertainty than a direct experimental measurement but will still be informative – situations where absolutely no information about a plausible parameter range is available are extremely rare.
    2. One value: If only a single experimental value is available for a particular parameter, use this value as the Mode (most plausible value) of the probability distribution. Sometimes an experimental error (e.g., a standard deviation) is reported for the value and can be used to estimate the Spread of plausible values; if this is not available, an arbitrary multiplicative error can be assigned, based on general knowledge of the reliability of the particular experimental technique. This information can, of course, be complemented using the previous strategy of retrieving related values from databases, to refine the estimate of the Spread of plausible values and avoid over-confident reliance on a single observation. Proceed to Step 4.
    3. More than one value: Collect as many experimental data points as possible that have implications for the range of plausible values, including those from a diverse range of methodologies, experimental conditions and biological sources. These will be pruned and weighted in the next step. Proceed to Step 2.
  2. Document the reported experimental uncertainty (if available), in addition to the actual parameter value (Table 2).
    1. Uncertainty is reported in literature: Uncertainty is often reported as an (additive) standard deviation (SD) or a standard error of the mean (SEM). As the functions employed in the subsequent steps require standard deviations as inputs, reported standard errors should be converted to standard deviations for consistency. This can be achieved by multiplying by the square root of the sample size (n), as per the equation: SD=SEM \cdot \sqrt{n}. In cases where the sample size is not reported, a conservative value of n=3 can be assumed.
    2. Uncertainty is not reported in literature:
      1. If the modeller is absolutely certain about a parameter value, the standard deviation can be set to zero. Although it is extremely rare that there is no uncertainty about a parameter at all, this option can be useful when designing the backbone topology of a model, prior to performing ensemble modelling, or when parts of the model are to be kept constant in all simulations.
      2. If the modeller has no information about the uncertainty of a reported parameter value, an arbitrary standard deviation can be assigned (e.g., the script provided for the next step assumes a default multiplicative standard deviation of 10% in such cases).

Note: Although uncertainty is usually reported in an additive form (x ± SD/SEM), theoretical biophysical considerations indicate that multiplicative errors often provide a better description of our actual uncertainty about experimental data in biology. For example, additive errors can imply that the confidence interval includes negative values. Thus, whenever possible, multiplicative standard deviations (Value */÷ SD) should be recorded. The script provided for the calculation of the probability distributions in Step 2 can handle both additive and multiplicative standard deviations. In order for the script to be able to distinguish between the two, an extra column is added to the input file, where the uncertainty type is reported: 0 for an additive standard deviation or 1 for a multiplicative standard deviation.
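As a concrete illustration of this standardization, the following minimal Python sketch (illustrative only; the published protocol provides its own scripts) converts the reported uncertainties of Table 1 into the format of Table 2, turning SEMs into SDs and flagging each entry as additive (0) or multiplicative (1):

  from math import sqrt

  # Reported K_m values from Table 1; None marks an unreported quantity.
  values = [0.5, 4.4, 10.0, 20.0]
  sds    = [None, 0.2, None, None]    # reported standard deviations
  sems   = [None, None, None, 1.5]    # reported standard errors of the mean
  ns     = [None, None, None, 10]     # sample sizes accompanying the SEMs

  rows = []
  for v, sd, sem, n in zip(values, sds, sems, ns):
      if sd is not None:                    # SD reported directly: additive (0)
          rows.append((v, sd, 0))
      elif sem is not None:                 # convert SEM to SD: SD = SEM * sqrt(n)
          rows.append((v, sem * sqrt(n if n is not None else 3), 0))
      else:                                 # nothing reported: default 10%
          rows.append((v, 1.1, 1))          # multiplicative SD (factor 1.1)

  print(rows)  # [(0.5, 1.1, 1), (4.4, 0.2, 0), (10.0, 1.1, 1), (20.0, 4.7434..., 0)]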

Table 1: Hypothetical example of documentation for hexokinase parameters in chicken liver cells retrieved from the literature

  Parameter Name | Reported values | Uncertainty          | Sources
  K_m (μM)       | 0.5             | N/A                  | J. Poultry 2(3), 2002
                 | 4.4             | ±0.2 (S.D.)          | Quack Res. 4(6), 1998
                 | 10              | N/A                  | Antarctica Biol. 1(2), 2004
                 | 20              | ±1.5 (S.E.M.) (n=10) | Croc. Int. 10(3), 2015

Table 2: Standardized description of parameter information in Table 1, for use in Step 2

  Parameter Name | Reported values | Standard Deviation | Multiplicative/Additive
  K_m (μM)       | 0.5             | 1.1                | Multiplicative (1)
                 | 4.4             | 0.2                | Additive (0)
                 | 10              | 1.1                | Multiplicative (1)
                 | 20              | 4.7434             | Additive (0)

Step 2: Judge plausibility of values and assign weights

When multiple parameter values are reported in the literature, assign appropriate weights according to their source. For the construction of larger models it can be helpful to use a standardized weighting scheme, to increase the internal consistency and reproducibility of the process. The weights will still be arbitrary, but they will consistently capture the modeller’s experience regarding the relative reliability of individual types of data. For example, standard weights can be assigned by answering a standard set of questions on the origin and properties of each parameter value:

  • Is the value derived from an in vitro or in vivo experiment, and is this the same environment as in the experiment that will be simulated?
  • Is the value derived from the modelled organism, from a phylogenetically related species, or from a completely unrelated system?
  • Does the value concern the same enzyme (or protein / gene / receptor etc. depending on the case), a related class/superclass of enzyme, or a generic enzyme completely unrelated to the one used in the model?
  • Are the experimental conditions (pH, temperature) the same, closely similar, or completely different from the ones simulated in the model?

According to the answer to each question, a standard partial weight can be assigned, and these weights can be multiplied to yield a total evidence weight for each experimental value (see Tables 3 and 4, and the short calculation sketch after Table 4).

Table 3: Documentation of the experimental conditions reported in the literature for the hypothetical example in Tables 1 and 2. The modelled system is: in vivo, chicken, hexokinase, pH=7-7.5, 41.8±0.8 °C.

  K_m (μM) | In vivo/in vitro | Organism  | Enzyme      | pH & Temperature
  0.5      | In vitro         | Chicken   | Hexokinase  | pH=8, 25 °C
  4.4      | In vivo          | Duck      | Glucokinase | pH=7, 25 °C
  10       | In vitro         | Penguin   | Hexokinase  | pH=6.8, 40 °C
  20       | In vitro         | Crocodile | Glucokinase | pH=7, 41 °C


Table 4: Calculation of standardized weights from the information in Table 3. The four middle columns contain the partial weights; the Total Weight is their product.

  K_m (μM) | In vivo/in vitro same as model? | Organism (same/related/unrelated) | Enzyme      | Conditions (pH & temperature) | Total Weight
  0.5      | 1 (In vitro)                    | 4 (Same)                          | 4 (Same)    | 4 (Same)                      | 64
  4.4      | 2 (In vivo)                     | 2 (Related)                       | 2 (Related) | 1 (Unrelated)                 | 8
  10       | 1 (In vitro)                    | 2 (Related)                       | 4 (Same)    | 2 (Related)                   | 16
  20       | 1 (In vitro)                    | 1 (Unrelated)                     | 2 (Related) | 4 (Same)                      | 8
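The total weight is simply the product of the partial weights; in Python, for the hypothetical values of Table 4:

  from math import prod

  # Partial weights per K_m value: (in vivo/in vitro, organism, enzyme, conditions)
  partial_weights = {
      0.5: (1, 4, 4, 4),
      4.4: (2, 2, 2, 1),
      10:  (1, 2, 4, 2),
      20:  (1, 1, 2, 4),
  }
  total = {value: prod(w) for value, w in partial_weights.items()}
  print(total)  # {0.5: 64, 4.4: 8, 10: 16, 20: 8}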

Step 3: Calculate the Mode and Spread of the probability distribution

  1. Create a final input table which will include the parameter values, errors and weights that have been documented in Steps 1 and 2 (as shown in Table 5).
    Note: When no uncertainty estimate (e.g., standard deviation or standard error) is provided for an experimental parameter value (i.e., the input for the standard deviation is “NaN”), a multiplicative standard deviation of 10% (multiplicative factor 1.1) is automatically assigned to the parameter by the CalcModeSpread function, unless the uncertainty is explicitly set to zero. The standard assignment of the 10% multiplicative standard deviation is based on an extensive survey of reported error values in biology, but it remains arbitrary and can and should be modified by the user as appropriate.

    Table 5: Final input table required for Step 3 for the hypothetical example in Tables 1-4

      Parameter Name | Reported values | Standard Deviation | Total Weight | Multiplicative/Additive
      K_m (μM)       | 0.5             | NaN                | 64           | 1
                     | 4.4             | 0.2                | 8            | 0
                     | 10              | NaN                | 16           | 1
                     | 20              | 4.7434             | 8            | 0


  2. Calculate the Mode and Spread of the probability distribution best describing our knowledge of the parameter values (Box 1) using the function CalcModeSpread.


Box 1: How the CalcModeSpread function works

Using the summary table (Table 5) of the information gathered in Steps 1 and 2, the CalcModeSpread function calculates the Mode and Spread based on the weighted median and weighted standard deviation of the log-transformed experimental parameter values. The output of CalcModeSpread is used to create the final log-normal distribution in the next step, but its main use is to allow the model constructor an intuitive plausibility check of the translation process: the Mode represents the value most likely to be sampled, and the Spread by default represents a “multiplicative standard deviation”, the log-normal equivalent of the (additive) standard deviation of a normal distribution, i.e. it is calculated as a multiplicative error so that 68.27% of the sampled values will lie within the interval [Mode/Spread, Mode*Spread].

The CalcModeSpread function starts by calculating the weighted median and the weighted standard deviation of the log-transformed literature values, taking into account their uncertainty (e.g., standard deviation or multiplicative error) and reliability weighting specified in Step 2.

The calculations are based on the weighted median of the literature values rather than the (weighted) mean, as it is less likely to be biased by outliers and thus provides a more accurate description of the central tendency of the dataset. The algorithm implemented in CalcModeSpread defines the weighted median as the unique value m with the property that exactly half of the probability weight is assigned to values in the interval (−∞, m]. This approach, which extends earlier published algorithms for the calculation of weighted medians, circumvents the restriction that the weighted median can only be one of the values of the dataset (or the arithmetic mean of two adjacent values). The effects of different parameter weights and uncertainties on the weighted median are illustrated in Figure 2.

Figure 2: Calculation of the weighted median of two values (modes) 10^3 and 10^6 (log-transformed to 6.9078 and 13.82, respectively) with varying uncertainty and weights. The modes of the distributions are indicated by the blue lines and the weighted medians (WMedian) by the green lines. The weighted median moves away from data that are particularly uncertain and moves towards parameters with higher perceived relevance. This subtle adjustment would not be possible with previous implementations of the weighted median.

Once the weighted median and the weighted standard deviation have been calculated, they are exponentiated to yield the Mode and Spread (multiplicative standard deviation) of the final log-normal distribution.
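A weighted median defined in this way can be computed as the point where the weighted mixture of the per-value uncertainty distributions reaches a cumulative probability of 0.5. The sketch below is an illustrative Python re-implementation of that idea applied to the Table 5 data, not the authors' CalcModeSpread code; in particular, the conversion of additive standard deviations to the log scale via SD/value (a first-order approximation) is our assumption:

  import numpy as np
  from scipy.stats import norm
  from scipy.optimize import brentq

  # Data from Table 5; the NaN standard deviations have already been
  # replaced by the default multiplicative SD of 1.1.
  values  = np.array([0.5, 4.4, 10.0, 20.0])
  sds     = np.array([1.1, 0.2, 1.1, 4.7434])
  is_mult = np.array([True, False, True, False])  # 1 = multiplicative, 0 = additive
  weights = np.array([64.0, 8.0, 16.0, 8.0])

  mu = np.log(values)
  # Log-scale spread: a multiplicative SD s becomes ln(s); an additive SD
  # is propagated approximately as SD/value (first-order assumption).
  sigma = np.where(is_mult, np.log(sds), sds / values)
  w = weights / weights.sum()

  def mixture_cdf(x):
      # CDF of the weighted mixture of normals describing the log-values.
      return np.sum(w * norm.cdf(x, loc=mu, scale=sigma))

  # The weighted median m is the unique root of CDF(m) = 0.5 (the mixture
  # CDF is strictly increasing, so the root exists and is unique).
  m = brentq(lambda x: mixture_cdf(x) - 0.5, mu.min() - 10.0, mu.max() + 10.0)
  print("Mode =", np.exp(m))  # exponentiation returns the Mode on the original scale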


Step 4: Calculate the location and scale parameters (μ and σ) of the log-normal distribution

The next part of the protocol is the calculation of the location (μ) and scale (σ) parameters of the log-normal distribution describing the parameter uncertainty (Box 2). The function CreateLogNormDist performs this surprisingly non-trivial transformation of the intuitive values of the Mode and Spread into the mathematically relevant properties of the log-normal distribution.


Box 2: How the CreateLogNormDist function works

Using the Mode and Spread calculated by CalcModeSpread in Step 3 (or, equivalently, a most plausible value and a confidence interval factor supplied directly by the modeller), the CreateLogNormDist function derives the location (μ) and scale (σ) parameters of the corresponding log-normal distribution, as described below.


Figure 3: Example of defining the log-normal distribution using the confidence interval factor. The red line shows a log-normal distribution with \mu=1.5653 and \sigma=0.4231. The mode (4) is represented by the blue line, the mean (5.2) by the green line and the median (4.8) by the yellow line. The grey area highlights the range between 1.6 and 10 within which 95.45% of the values lie, calculated by choosing a confidence interval factor of 2.5. The values 10 and 1.6 have equal probability density (f(4 \cdot 2.5)=f(4/2.5)=0.02066).

In order to create the probability distributions, the location and scale parameters \mu and \sigma are required. These can easily be calculated from the mean and standard deviation of the available sample data. However, in many cases there are very few or no reported values for a parameter, or only a minimum and maximum reported value. It is therefore necessary to derive the parameters in an alternative way, one that is understandable to experimentalists and does not demand complicated mathematical terms and calculations.

In order to achieve this, the mode of the log-normal distribution (global maximum) and its symmetry properties are employed. Log-normal distributions are symmetrical in the sense that values that are x times larger than the most likely estimate are just as plausible as values that are x times smaller. More specifically, the mode of the distribution is the value x_0 for which the condition f(x_0 \cdot \delta)=f(x_0 / \delta) is fulfilled for all positive real numbers \delta (where f is the probability density function). Hence, the user has to decide on a most plausible value for each parameter, which is set as the mode (global maximum) of the corresponding probability density function (PDF), and on a range within which 95.45% of the values lie. The latter is linked to the mode via a multiplicative factor, which we call the "Confidence Interval Factor" (CI factor): multiplying and dividing the mode by the CI factor yields the bounds of the range within which 95.45% of the values are found. For instance, if the most plausible value for a parameter is X and the confidence interval multiplicative factor is y, then the mode of the distribution is set as X and the range where 95.45% of the plausible values are found is [\frac{X}{y},X\cdot y].
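This symmetry is easy to verify numerically. The snippet below (an illustration, using the distribution from Figure 3) checks that the density at x_0 \cdot \delta equals the density at x_0/\delta; for \delta=2.5 both equal 0.02066, as stated in the figure caption:

  import numpy as np
  from scipy.stats import lognorm

  mu, sigma = 1.5653, 0.4231                   # distribution from Figure 3
  x0 = np.exp(mu - sigma**2)                   # mode of the log-normal (= 4.0)
  dist = lognorm(s=sigma, scale=np.exp(mu))    # SciPy's parametrization

  for d in (1.5, 2.5, 10.0):
      print(dist.pdf(x0 * d), dist.pdf(x0 / d))   # each pair is identical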

Based on the mode and the confidence interval bounds, a system of two equations containing the cumulative distribution function (CDF) and the mode is solved, in order to derive the location parameter \mu and the scale parameter \sigma of the corresponding log-normal distribution. The equations are the following:

\begin{cases}CDF(x_{max})-CDF(x_{min})=0.9545\\
Mode=e^{\mu-\sigma^{2}}\end{cases}

where CDF(x)= \frac{1}{2}+\frac{1}{2} \mathrm{erf} \Big[\frac{\ln x-\mu}{\sqrt{2}\sigma}\Big], and x_{min} and x_{max} are the lower and upper bounds of the confidence interval. Substituting this expression into the previous system yields its final form:

\begin{cases}\frac{1}{2} \mathrm{erf} \Big[\frac{\ln x_{max}-\mu}{\sqrt{2}\sigma}\Big]-\frac{1}{2} \mathrm{erf} \Big[\frac{\ln x_{min}-\mu}{\sqrt{2}\sigma}\Big]=0.9545\\
Mode=e^{\mu-\sigma^{2}}\end{cases}

In this way, the \mu and \sigma parameters are obtained, and from them it is easy to calculate any property of the distribution (e.g., geometric mean, variance, etc.).
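For illustration, the system can be solved with a one-dimensional root search after substituting \mu = \ln(Mode)+\sigma^{2} (from the second equation) into the first. The Python sketch below is an assumed re-implementation, not the protocol's CreateLogNormDist code; it reproduces the values shown in Figure 3:

  import numpy as np
  from scipy.special import erf
  from scipy.optimize import brentq

  def lognorm_params(mode, ci_factor, mass=0.9545):
      # Solve for (mu, sigma) given the mode and the confidence interval
      # factor y, with x_min = mode/y and x_max = mode*y.
      ln_y = np.log(ci_factor)

      def residual(sigma):
          # After substituting mu = ln(mode) + sigma^2, the CDF equation
          # depends on sigma alone.
          a = (ln_y - sigma**2) / (np.sqrt(2.0) * sigma)
          b = (-ln_y - sigma**2) / (np.sqrt(2.0) * sigma)
          return 0.5 * erf(a) - 0.5 * erf(b) - mass

      sigma = brentq(residual, 1e-6, 10.0)
      return np.log(mode) + sigma**2, sigma

  print(lognorm_params(4.0, 2.5))  # approx. (1.5653, 0.4231), as in Figure 3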

Parameter dependency and thermodynamic consistency

In some cases, parameters cannot be chosen separately, either because they are statistically dependent, subject to thermodynamic constraints, or dependent on another common parameter. Therefore, thermodynamic consistency is also an important factor that needs to be considered when deciding whether a combination of parameters is plausible. For instance, a very common occurrence in biological systems is the pairing of forward and backward reactions. The source of dependency is the equilibrium constant, which denotes the relationship between the kinetic parameters for the "on" and "off" components of the reaction. Let us assume a reaction that is known to have an equilibrium constant very close to 1, i.e. its standard Gibbs free energy \Delta G^o = 0. There is not much information about the rate of the reaction, so each of the two parameters is sampled from a very broad distribution. If the additional thermodynamic information is not taken into account, there will often be cases where values are sampled from the "fast" end of the spectrum for the forward reaction rate and from the "slow" end for the backward rate (or vice versa). Thus, inconsistent pairs of the two parameters will be generated. In this case, thermodynamic consistency requires that we discard such samples and only keep those where the two reaction rates are very similar (how similar will in turn depend on our uncertainty about the equilibrium constant).

In order to address this problem, we employ a joint probability distribution (multivariate distribution) for the two parameters (i.e. k_{on} and k_{off}), so that the generated values for both of them are constrained within a specified range. Additionally, this approach ensures that their dependency on each other and on the equilibrium constant K_{D} is taken into account and quantified appropriately.

For instance, if the two marginal distributions are k_{on} and k_{on}\cdot K_D (=k_{off}), then k_{off} is dependent on the values of k_{on} and K_D. The parameter with the largest geometric coefficient of variation (GCV=e^{\sigma}-1) is usually set as the dependent one. The product of two independent log-normal random variables is also log-normally distributed. Therefore, for the two log-normal distributions \ln k_{on}\ \sim\ \mathcal{N}(\mu_{\ln k_{on}},\, \sigma^{2}_{\ln k_{on}}) and \ln K_D\ \sim\ \mathcal{N}(\mu_{\ln K_D},\, \sigma^{2}_{\ln K_D}), their product k_{off} will follow the log-normal distribution \ln k_{off}\ \sim\ \mathcal{N}(\mu_{\ln k_{on}}+\mu_{\ln K_D},\, \sigma^{2}_{\ln k_{on}}+\sigma^{2}_{\ln K_D}), i.e. its parameters will be \mu_{\ln k_{off}}=\mu_{\ln k_{on}}+\mu_{\ln K_D} and \sigma^{2}_{\ln k_{off}}=\sigma^{2}_{\ln k_{on}}+\sigma^{2}_{\ln K_D}.
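For example (with hypothetical numbers), if \ln k_{on} \sim \mathcal{N}(13.82,\ 0.5^{2}) and \ln K_D \sim \mathcal{N}(-6.91,\ 0.2^{2}), then \ln k_{off} \sim \mathcal{N}(13.82-6.91,\ 0.25+0.04)=\mathcal{N}(6.91,\ 0.29), i.e. \sigma_{\ln k_{off}}=\sqrt{0.29}\approx 0.54.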

A similar strategy applies for the quotient of two log-normal distributions, although in this case the parameter \mu will be derived by the formula \mu_{quotient}=\mu_{dividend}-\mu_{divisor}. The formula for the calculation of the parameter \sigma does not change.

Thus, it is easy to transform the two marginal distributions k_{on} and k_{off} into normal distributions through the natural logarithm. The problem can therefore be reduced to the case of a bivariate normal distribution with the density

f(x,y)= \frac{1}{2 \pi \sigma_X \sigma_Y \sqrt{1-\rho^2}} \exp\left( -\frac{1}{2(1-\rho^2)}\left[ \frac{(x-\mu_X)^2}{\sigma_X^2} + \frac{(y-\mu_Y)^2}{\sigma_Y^2} - \frac{2\rho(x-\mu_X)(y-\mu_Y)}{\sigma_X \sigma_Y}\right] \right)

where \rho is the correlation between X (=\ln k_{on}) and Y (=\ln k_{off}), and \sigma_X > 0 and \sigma_Y > 0. In this case, \boldsymbol\mu = \begin{pmatrix} \mu_X \\ \mu_Y \end{pmatrix} and \boldsymbol\Sigma = \begin{pmatrix} \sigma_X^2 & \rho \sigma_X \sigma_Y \\ \rho \sigma_X \sigma_Y & \sigma_Y^2 \end{pmatrix} (the covariance matrix).

If \rho = 0, there is no correlation between the parameters and they are treated as independent. Otherwise, the resulting bivariate iso-density loci plotted in the x,y-plane are ellipses. As the magnitude of the correlation parameter \rho increases, these loci are squeezed towards the following line:

 y(x)= \mathop{\rm sgn} (\rho)\frac{\sigma_Y}{\sigma_X} (x- \mu_X) + \mu_Y

Figure 4: Sampling strategy for maintaining thermodynamic consistency of parameter sets. In order to ensure the thermodynamic feasibility of a parameter combination in the case of interconnected parameters (e.g. forward and backward reaction rates, k_{on} and k_{off}), a bivariate system of distributions is created. When the two marginal distributions are uncorrelated, the generated points, which represent parameter pairs, form a circle. When K_D is exactly known (the parameters k_{on} and k_{off} are tightly correlated through K_D), the points form a straight line. Finally, if K_D is approximately known (a distribution of values for K_D exists), the resulting points of the bivariate system form an ellipse. The thickness and orientation of the ellipse depend on the magnitude of the correlation between the two marginal distributions and on the degree of uncertainty in the values of k_{on}, k_{off} and K_D. This case represents the realistic scenario when modelling, as parameter values are usually only approximately known. The first case (no correlation) does not respect thermodynamic consistency and is therefore undesirable. The second case, although taking into account the dependency of the two parameters, is in most cases unrealistic, since the value of a parameter is rarely exactly known.

The required parameter values are obtained by generating samples from the multivariate normal distribution and then exponentiating the results. In order to avoid the distortions that exponentiation introduces into the correlation matrix, a MATLAB function called Multivariate Lognormal Simulation with Correlation (MVLOGNRAND) is used, which corrects for these errors.
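The following Python sketch illustrates the overall scheme under assumed hyperparameter values (a simplified analogue, not the MVLOGNRAND code itself). Constructing \ln k_{off} = \ln k_{on} + \ln K_D induces a correlation \rho = \sigma_{\ln k_{on}}/\sigma_{\ln k_{off}} between the two log-scale marginals; sampling the corresponding bivariate normal and then exponentiating yields thermodynamically consistent (k_{on}, k_{off}) pairs:

  import numpy as np

  rng = np.random.default_rng(1)

  # Hypothetical log-scale parameters for k_on and K_D.
  mu_on, s_on = np.log(1e6), 0.5
  mu_kd, s_kd = np.log(1e-3), 0.2

  # ln k_off = ln k_on + ln K_D: means add, variances add.
  mu_off = mu_on + mu_kd
  s_off = np.hypot(s_on, s_kd)
  rho = s_on / s_off          # correlation induced by the shared ln k_on term

  mean = [mu_on, mu_off]
  cov = [[s_on**2,            rho * s_on * s_off],
         [rho * s_on * s_off, s_off**2          ]]

  ln_samples = rng.multivariate_normal(mean, cov, size=100_000)
  k_on, k_off = np.exp(ln_samples[:, 0]), np.exp(ln_samples[:, 1])

  # The implied equilibrium constants recover the assumed K_D distribution.
  ln_kd = np.log(k_off / k_on)
  print(np.exp(ln_kd.mean()), ln_kd.std())  # approx. 1e-3 and 0.2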

References

  1. Tsigkinopoulou A., Hawari A., Uttley M., Breitling R. "Defining informative priors for ensemble modelling in systems biology" (in press)