Abstract

Modern biological experiments are becoming increasingly complex, and designing these experiments to yield the greatest possible quantitative insight is an open challenge. Increasingly, computational models of complex stochastic biological systems are being used to understand and predict biological behaviors or to infer biological parameters. Such quantitative analyses can also help to improve experiment designs for particular goals, such as to learn more about specific model mechanisms or to reduce prediction errors in certain situations. A classic approach to experiment design is to use the Fisher information matrix (FIM), which quantifies the expected information a particular experiment will reveal about model parameters. The finite state projection-based FIM (FSP-FIM) was recently developed to compute the FIM for discrete stochastic gene regulatory systems, whose complex response distributions do not satisfy standard assumptions of Gaussian variations. In this work, we develop the FSP-FIM analysis for a stochastic model of stress response genes in S. cerevisiae under time-varying MAPK induction. We verify this FSP-FIM analysis and use it to optimize the number of cells that should be quantified at particular times to learn as much as possible about the model parameters. We then extend the FSP-FIM approach to explore how different measurement times or genetic modifications help to minimize uncertainty in the sensing of extracellular environments, and we experimentally validate the FSP-FIM to rank single-cell experiments for their abilities to minimize estimation uncertainty of NaCl concentrations during yeast osmotic shock. This work demonstrates the potential of quantitative models to not only make sense of modern biological datasets but to close the loop between quantitative modeling and experimental data collection.

1. Introduction

The standard approach to design experiments has been to rely entirely on expert knowledge and intuition. However, as experimental investigations become more complex and seek to examine systems with more subtle nonlinear interactions, it becomes much harder to improve experimental designs using intuition alone. This issue has become especially relevant in modern single-cell-single-molecule investigations of gene regulatory processes. Performing such powerful, yet complicated, experiments involves the selection from among a large number of possible experimental designs, and it is often not clear which designs will provide the most relevant information. A systematic approach to solve this problem is model-driven experiment design, in which one combines existing knowledge or experience to form an assumed (and partially incorrect) mathematical model of the system to estimate and optimize the value of potential experimental settings. In practice, such preliminary models would be defined by existing data taken in simpler or more general settings such as inexpensive bulk experiments or would be estimated from literature values conducted on similar genes, pathways, or organisms. When parameter or model structures are uncertain, these could be described according to a prior distribution, and experiments would need to be selected according to which performs best on average across the many possible model/parameter combinations.

In recent years, model-driven experiment design has gained traction for biological models of gene expression, whether in the Bayesian setting [1] or using Fisher information for deterministic models [2], and even in the stochastic, single-cell setting [37]. Despite the promise and active development of model-driven experiment design from the theoretical perspective, more general, yet biologically inspired, approaches are needed to make these methods suitable for the experimental community at large. In this work, we apply model-driven experiment design to an experimentally validated model of stochastic transcription that is activated by time-varying high osmolarity glycerol (HOG) mitogen-activated protein kinase (MAPK) induction in yeast [810]. To demonstrate a concrete and practical application of model-driven experiment design, we find the optimal measurement schedule (i.e., when measurements ought to be taken) and the appropriate number of individual cells to be measured at each time point.

In our computational analyses, we consider the experimental technique of single-molecule mRNA fluorescence in situ hybridization (smFISH), where specific fluorescent oligonucleotide probes are hybridized to mRNA of interest in fixed cells [11, 12]. Cells are then imaged, and the mRNA abundance in each cell is counted, either by hand or using automated software such as [13]. Such counting can be a cumbersome process, but little thought has been given typically to how many cells should be measured and analyzed at each time. Furthermore, when a dynamic response is under investigation, the specific times at which measurements should be taken (i.e., the times after induction at which cells should be fixed and analyzed) are also unclear. In this work, we use the newly developed finite state projection-based Fisher information matrix (FSP-FIM, [6]) to optimize these experimental quantities for osmotic stress response genes in yeast.

The first part of our current study introduces a discrete stochastic model to analyze time-varying MAPK-induced gene expression response in yeast and then demonstrates the use of FSP-based Fisher information to optimize experiments to minimize the uncertainty in model parameters. In the second part of this study, we expand upon this result to find and experimentally verify the optimal smFISH measurement times and cell numbers to minimize uncertainty about unknown environmental inputs (e.g., salt concentrations) to which the cells are subjected. In this way, we are presenting a new methodology by which one can optimally examine behaviors of natural cells to obtain accurate estimations of environmental changes.

2. Background

Gene regulation is the process by which small molecules, chromatin regulators, and general and gene-specific transcription factors interact to regulate the transcription of DNA into RNA and the translation of mRNA into proteins. Even within populations of genetically identical cells, these single-molecule processes are stochastic and give rise to cell-to-cell variability in gene expression levels. Adequate descriptions of such variable responses can only be achieved through the use of stochastic computational models [1417]. In the following sections, we first introduce a nonequilibrium discrete stochastic model of HOG1-MAPK-induced gene expression, and we then discuss how this model can be analyzed and compared to data using finite state project analyses. All analysis codes are available at https://github.com/MunskyGroup/Fox_Complexity_2020.

2.1. Discrete Stochastic Model of HOG1-MAPK-Induced Gene Expression

To motivate and demonstrate our new approach, we focus our examination on the dynamics of the HOG1-MAPK pathway in yeast, which is a model system to study osmotic stress driven dynamics of signal transduction and gene regulation in single cells [1823]. Discrete stochastic models of HOG1-MAPK-activated transcription have been used successfully to predict the variability in adaptive transcription responses across yeast cell populations [9, 10, 24]. In particular, the authors in [9] used smFISH data to fit and cross validate a number of different potential models with different numbers of gene states and time-varying parameters. They found that dynamics of two stress response genes, STL1 and CTT1, could each be described accurately by the model depicted in Figure 1(a).

In brief, the model [9] consists of transitions between four different gene states (S1, S2, S3, and S4). The probability of a transition from the to the gene state in the infinitesimal time is given by the propensity function, . Most of the rates are constant in time, except for the transition from S2 to S1, which is controlled by the time-varying level of the HOG1-MAPK signal in the nucleus, . The resulting time-varying rate is defined using a linear threshold function:where α and β set the threshold for activation/deactivation. The function was calibrated at several NaCl concentrations by fitting the HOG1-MAPK nuclear localization signals as measured using a yellow fluorescence protein reporter [10]. Figure 1(b) shows for osmotic stress responses to 0.2 M and 0.4 M NaCl, and Figure 1(c) shows the corresponding values of . In addition to the state transition rates, each state also has a corresponding mRNA transcription rate, . All mRNA molecules degrade with rate γ, independent of gene state. Further descriptions and validations of this model are given in Supplementary Note 1 and in [9, 10, 24]. All experimentally determined parameters for the STL1 and CTT1 transcription regulation models are provided in Supplemental Table S1, and experimentally determined parameters for the HOG1-MAPK signal model are listed in Supplemental Table S2 [10].

2.2. The Finite State Projection Analysis of Stochastic Gene Expression

To analyze the model described above, we apply the chemical master equation (CME) framework of stochastic chemical kinetics [25]. Combining the time-varying and constant state transition rates , transcription rates , and degradation rate γ from above, the CME can be written in matrix form as a linear ordinary differential equation, , where the time-varying matrix is known as the infinitesimal generator (see Supplementary Note 1). The CME has been the workhorse of stochastic modeling of gene expression, and it is usually analyzed using simulated sample paths of its solution via the stochastic simulation algorithm [26] or with moment approximations [8, 27]. Alternatively, the CME can also be solved with guaranteed errors using the FSP approach [28, 29], which reduces the full CME only to describe the flow of probability among the most likely observable states of the system. Details of the FSP approach to solving chemical kinetic systems are provided in Supplementary Note 1. Application of the FSP analysis to the model (Figure 1(a)) with dynamic Hog1 (Figure 1(b)) modulates time-varying rates k21 (Figure 1(c)) and predicts time-evolving probability distributions at 0.2 M and 0.4 M NaCl, as shown in Figure 1(d) [10].

2.3. Likelihood of smFISH Data for FSP Models

Recently, it has come to light that for some systems, it is critical to consider the full distribution of biomolecules across cellular populations when fitting CME models [6, 10]. To match CME model solutions to single-cell smFISH data, one needs to compute and maximize the likelihood of the data given the CME model [9, 10, 24, 30]. Fortunately, the FSP approach allows for computation of the likelihood with guaranteed accuracy bounds [28]. We assume that measurements at each time point are independent, as justified by the fact that fixation of cells for measurement precludes temporal cell-to-cell correlations. Measurements of cells can be concatenated into a matrix of the observable mRNA species at each measurement time t.

The likelihood of making the independent observations for all measured cells is the product of the probabilities of observing each cell’s measured state. For most gene expression models, however, states are only partially observable, and we define the observed state as the marginalization (or lumping) over all full states that are indistinguishable from based on the observation. For example, the model of STL1 transcription consists of four gene states (S1–S4, shown in Figure 1(a)), which are unobserved, and the measured number of mRNA, which is observed. If we let index i denote the number of mRNA, then the observed state would lump together the full states (S1, i), (S2, i), (S3, i), and (S4, i). We next define as the number of experimental cells that match at time t. Under these definitions, the likelihood of the observed data (and its logarithm) given the model can be written aswhere is the set of states observed in the data, M is a combinatorial prefactor (i.e., from a multinomial distribution) that comes from the arbitrary reordering of measured data, and is the marginalized probability mass of the observable species:

The vector of model parameters is denoted by . Neglecting the term , which is independent of the model, the summation in equation (2) can be rewritten as a product , where is the vector of the binned data and is the corresponding marginalized probability mass vector. One may then maximize equation (2) with respect to to find the maximum likelihood estimate (MLE) of the parameters, , which will vary depending on each new set of experimental data. We next demonstrate how this likelihood function and the FSP model of the HOG1-MAPK-induced gene expression system can be used to design optimal smFISH experiments using the FSP-based Fisher information matrix [6].

3. Results

3.1. The Finite State Projection-Based Fisher Information for Models of Signal-Activated Stochastic Gene Expression

The Fisher information matrix (FIM) is a common tool in engineering and statistics to estimate parameter uncertainties prior to collecting data, which allows one to find experimental settings that can make these uncertainties as small as possible [3, 4, 3134]. Recently, it has been applied to biological systems to estimate kinetic rate parameters in stochastic gene expression systems [36, 35]. In general, the FIM for a single measurement is defined aswhere the vector contains the log-probabilities of each potential observation and the expectation is taken over the probability distribution of states assuming the specific parameter set . As the number of measurements, , is increased such that maximum likelihood estimates (MLE) of parameters are unbiased, the distribution of MLE estimates is known to approach a multivariate Gaussian distribution with a covariance given by the inverse of the FIM, i.e.,

In [6], we developed the FSP-based Fisher information matrix (FSP-FIM), which allows one to use the FSP solution , and its sensitivity , to find the FIM for stochastic gene expression systems. For a general FSP model, the dynamics of the sensitivity to each kinetic parameter can be calculated according towhere . Solving equation (6) requires integrating a coupled set of ODEs that is twice as large as the original FSP system. The FSP-FIM at a single time t is then given bywhere the summation is taken over all states included in the FSP analysis (or over all observed states in the case of lumped observations). We note that the FSP computation of the FIM should be computationally tractable for problems for which the FSP solution itself is tractable. However, since the size of the FSP sensitivity matrix (equation (6)) scales exponentially with the number of species, practical applications of the presented formulation of the FSP-FIM are currently restricted to models that have, or can be reduced to have, three or fewer distinct chemical species.

The FIM for a sequence of measurements taken independently (e.g., for smFISH data) at times can be calculated as the sum across the measurement times:where is the number of cells measured at each measurement time. For smFISH experiments, the vector plays an important role in the design of the study. By optimizing over all vectors that sum to , one can find how many cells should be measured at each time point and which time points should be skipped entirely (i.e., ).

In the next section, we verify the FSP-FIM for this stochastic model with a time-varying parameter and later find the optimal for STL1 mRNA in yeast cells.

3.2. The FSP-FIM Can Quantify Experimental Information for Stochastic Gene Expression under Time-Varying Inputs

Our work in [6] was limited to models of stochastic gene expression that had piecewise constant reaction rates. Here, we extend this to time-varying reaction rates that affect the promoter switching in the system and which lead to time-varying in equation (6). For example, in the model depicted in Figure 1(a), the temporal addition of osmotic shock causes nuclear translocation of HOG1-MAPK, according to the time-varying function in equation (1).

Model parameters simultaneously fit to experimentally measured 0.2 M and 0.4 M STL1 mRNA were adopted from [10] and used as a reference set of parameters (yellow dots in Figure 2(a) and S1), which we define as . These reference parameters were used to generate 50 unique and independent simulated datasets, and each simulated dataset was fit to find the parameter set, , that maximizes the likelihood for that simulated dataset. This process was repeated for two different experiment designs, including the original intuitive design from [10] (results shown in Figure 2) and an optimized design discussed below (results shown in Figure S1). To ease the computational burden of this fitting, the four parameters with the smallest sensitivities and largest uncertainties (i.e., those parameters that had the least effect on the model predictions and which were most difficult to identify) were fixed at their baseline values. The resulting MLE estimates for the remaining five parameters were collected into a set of and are shown as yellow dots in Figures 2 and S1. Using the asymptotic normality of the maximum likelihood estimator and its relationship to the FIM (equation (5)), we then compared the 95% confidence intervals (CIs) of the inverse of the Fisher information (i.e., the Cramér–Rao bound) to those of the MLE estimates (compare the purple and orange ellipses in Figures 2(a) and S1a). We also compared the eigenvalues of the inverse of the Fisher information, , to the correspondingly ranked eigenvalues of the covariance matrix of MLE estimates, , in Figures 2(b) and S1b. For further validation, we noted that the principle directions of the ellipses in Figures 2(a) and S1a also match for the FIM and MLE analyses, as quantified by the angle between the paired FIM and eigenvectors (Figures 2(b) and S1b). For comparison, the angles between rank-matched eigenvectors of the FIM and were all less than 12°, whereas non-rank-matched eigenvectors were all greater than 79.9°. With the FSP-FIM verified for the HOG1-MAPK-induced gene expression model, we next explore how the FSP-FIM can be used to optimally allocate the number of cells to measure at each time after osmotic shock.

3.3. Designing Optimal Measurements for the HOG1-MAPK Pathway in S. cerevisiae

To explore the use of the FSP-FIM for experiment design in a realistic context of MAPK-activated gene expression, we again utilize simulated time-course smFISH data for the osmotic stress response in yeast.

We start with a known set of underlying model parameters that were taken from simultaneous fits to 0.2 M and 0.4 M data in [10] (nonspatial model) to establish a baseline parameter set that is experimentally realistic. These parameters are then used to optimize the allocation of measurements at different time points minutes after NaCl induction. Specifically, we ask what fraction of the total number of cells should be measured at each time to maximize the information about a specific subset of important model parameters. We use a specific experiment design objective criteria referred to as -optimality, which corresponds to minimizing the expected volume of the parameter space uncertainty for the specific parameters of interest [35] and which is found by maximizing the product of the eigenvalues of the FIM for those same parameters.

Mathematically, our goal is to find the optimal cell measurement allocation:where is the fraction of total measurements to be allocated at and the metric refers to the product of the eigenvalues for the total FIM (equation (8)). The fraction of cells to be measured at each time point, , was optimized using a greedy search, in which single-cell measurements were chosen one at a time according to which time point predicted the greatest improvement in the optimization criteria (see Supplementary Note 3 for more information).

To illustrate our approach, we first allocated cell measurements according to -optimality as found through this greedy search. Figure 3 shows the optimal fraction of cells to be measured at each time following a 0.2 M NaCl input and compares these fractions to the experimentally measured number of cells from [10]. While each available time point was allocated a nonzero fraction of measurements, three time points at minutes were vastly more informative than the other potential time points. To verify this result, we simulated 50 datasets of 1,000 cells each and found the MLE estimates for each subsampled dataset. We compared the spread of these MLE estimates to the inverse of the optimized FIM, shown in Figure S1.

Comparing Figure S1 with Figure 2 illustrates the extent by which the design of optimal measurement times for a 0.2 M NaCl experiment can increase information collection and reduce parameter uncertainties compared to the intuitive measurment design from [10]. In addition to providing much higher Fisher information, the optimal experiment requires measurement of only three time points compared to the 16 time points that were measured in the original experiment. Furthermore, we note that the FIM prediction of the MLE uncertainty is more accurate for the simpler optimal design, which is likely related to our observation that MLE estimates converge more easily for the optimized experiment design than they do for the original intuitive design.

Figure 4 next compares the -optimality criteria for the optimal (solid horizontal lines) and intuitive ([10], dashed horizontal lines) experiment designs to 1,000 randomly designed experiments for the 0.2 M (black) and 0.4 M (gray) conditions. To generate these random experiment designs, we selected a random subset of the measurement times and allocated the total 1,000 cells among chosen time points using a multinomial distribution with equal probability for each time point. Figure 4(a) shows that the intuitive experiment is more informative than most random experiments but is still substantially less informative than the optimal experiment.

In many practical applications, a scientist would be unlikely to have precise a priori knowledge of model parameters prior to conducting experiments. Rather, they would have some estimate of these parameters, such as rough knowledge of appropriate time scales or existing data from another type of experiment. Such estimates could come from previous analyses of the system response to simpler experimental conditions, for measurements taken on slightly different cell lines or organisms, or considering results from different genes in related regulatory pathways. To explore the importance of knowing the exact process parameters or input dynamics prior to designing the experiment, we asked how well an experiment design optimized using parameters from one gene at a given level osmotic shock (e.g., STL1 at 0.2 M NaCl) would do to estimate parameters for another gene in a different osmotic shock condition (e.g., CTT1 at 0.4 M NaCl). Figure 4(b) demonstrates the impact of such mismatched experiment designs, where each row corresponds to a different intuitive or optimized experiment design (i.e., a specific allocation of cells to be measured at each time), and each column corresponds to a specific gene and specific osmotic shock condition to which that design could be applied. In all cases, the much simpler FIM-based optimal experiment designs perform as well or better than the more difficult intuitive designs, even when these FIM designs were computed assuming different environmental conditions and assuming genes whose parameters differ considerably from one another (see Supplemental Tables 1 and 2 for parameter sets). In other words, these results suggest that if one can compute a simple yet optimal experiment design based on one well-analyzed gene in a previously studied environmental condition, then that design may be equally effective when applied to new investigations for related genes in similar biological contexts.

3.4. Using the FSP-FIM to Design Optimal Biosensor Measurements

Thus far, and throughout our previous work in [6], we have sought to find the optimal set of experiments to reduce uncertainty in the estimates of model parameters. In this section, we discuss how the FSP-FIM allows for the optimization of experiment designs to address a more general problem of inferring environmental variables from cellular responses. Toward this end, we assume a known and parametrized model (i.e., the model defined above, which was identified previously in [10]), but which is now subject to unknown environmental influences. We explore what would be the optimal experimental measurements to take to characterize these influences. Specifically, we ask how many cells should be measured using smFISH, and at what times, to determine the specific concentration of NaCl to which the cells have been subjected—or, equivalently, we ask what experiments would be best suited to measure the effective stress induction level caused by addition of an unknown solution to the cells.

Recall from above that in the HOG1-MAPK transcription model, extracellular osmolarity ultimately affects stress response gene transcription levels through the time-varying parameter (equation (1)) as illustrated in Figure 1(c) for 0.2 M and 0.4 M salt concentrations. Higher salt concentrations delay the time at which returns to its nonzero value. The function in equation (1) can be coarsely approximated by the sum of three Heaviside step functions, aswhere is the fixed delay of the time it takes for nuclear kinase levels to reach the deactivation threshold (about 1 minute or less, [9, 10]) and is the variable time it takes for the nuclear kinase to drop back below that threshold. In practice, the threshold-crossing time, , should be directly related to the salt concentration experienced by the cell under reasonable salinity levels. This relationship is shown in Figures 1(b), 1(c), and 5(b), where a 0.2 M NaCl input exhibits a shorter than does a 0.4 M input. For our analyses, we assume a prior uncertainty such that time can be any value uniformly distributed between and minutes, and our goal is to find the experiment that best reduces the posterior uncertainty in (and therefore could provide an estimate for the concentration of NaCl).

To reformulate the FSP-FIM to estimate uncertainty in given our model, the first step is to compute the sensitivity of the distribution of mRNA abundance to changes in the variable using equation (5), in which is replaced with as follows:

As is the only parameter in that depends explicitly on , all entries of are zero except for those which depend on , andand therefore is nonzero only at . Using this fact, the equation for the sensitivity dynamics is uncoupled from the FSP dynamics for and can be written simply as

If the Fisher information at each measurement time is written into a vector (noting that the Fisher information at any time is the scalar quantity, ) and the number of measurements per time point is the vector, , then the total information for a given value of can be computed as the dot product of these two vectors:

Our goal is to find an experiment that is optimal to determine the value of , given an assumed prior that is sampled from a uniform distribution between and . To find the experiment that will reduce our posterior uncertainty in , we integrate the inverse of the FIM in equation (14) over the prior uncertainty in :

For later convenience, we define the integral in equation (15) (i.e., the objective function of the minimization) by the symbol , which corresponds to the expected uncertainty about the value of for a given .

Next, we apply the greedy search from above to solve the minimization problem in equation (15) to find the experiment design that minimizes the estimation error of . Figure 6 shows examples of seven different experiments to accomplish this task, ranked according to the FSP-FIM value from most informative (top left) to least informative (bottom left), but all using the same number of measured cells. For each experiment, the FSP-FIM was used to estimate the posterior uncertainty (i.e., expected standard deviation) in the estimation of , which is shown by the orange bars in Figure 6. To verify these estimates, we then chose 64 uniformly spaced values of , which we denote as the set , and for each , we simulated 50 random datasets of 1,000 cells distributed according to the specified experiment designs. For each of the 64  50 simulated datasets, we then determined the value between and that maximized the likelihood of the simulated data according to equation (2). The root mean squared estimate (RMSE) error over all random values of and estimates, , was then computed for each of the six different experiment designs. Figure 6 shows that the FIM-based estimation of uncertainty and the actual MLE-based uncertainty are in excellent agreement for all experiments (compare purple and orange bars). Moreover, it is clear that the optimal design selected by the FIM analysis performed much better to estimate than did the uniform or random experimental designs. A slightly simplified design, which uses the same time points as the optimal, but with equal numbers of measurements at each time, performed nearly as well as the optimal design.

The set of experiment designs shown in Figure 6 includes the best design that only uses STL1 (second from top), the best design that uses only CTT1 (fourth from top), and the best design that uses some cells with CTT1 and some with STL1 (top design). To find the best experiment design for measurement of two different genes, we assumed that at each time, either STL1 mRNA or CTT1 mRNA (but not both) could be measured, corresponding to using smFISH oligonucleotides for either STL1 or CTT1. To determine which gene should be measured at each time, we compute the Fisher information for CTT1 and STL1 for every measurement time and averaged this value over the range of . For each measurement time , the gene is selected that has the higher average Fisher information for . The number of cells per measurement time was then optimized as before, except the choice to measure CTT1 or STL1 was based on which mRNA had the larger Fisher information (equation (14)) at that specific point in time. The best STL1-only experiment design was found to yield uncertainty of 10.5 seconds (standard deviation); the best CTT1-only experiment was found to yield an uncertainty of 15.2 seconds and the best mixed STL1/CTT1 experiment design was found to yield an uncertainty of 10.4 seconds. In other words, for this case, the STL1 gene was found to be much more informative of the environmental condition than was CTT1, and the use of both STL1 and CTT1 provides only minimal improvement beyond the use of STL1 alone. We note that although measurement times in the optimized experiment design were restricted to a resolution of five minutes or more, the value of could be estimated with an error of only 10 seconds, corresponding to a roughly 30-fold improvement of temporal resolution beyond the allowable sampling rate.

3.5. Experimental Validation for FSP-FIM-Based Designs of Biosensor Measurements

To experimentally validate our FSP-FIM-based approach to design optimal measurement times, we next examined experimental smFISH data taken for the STL1 and CTT1 genes at different times following yeast osmotic shock [10]. These data include a total of 535–4808 cells measured at each of 16 time points following osmotic shocks of 0.2 M or 0.4 M NaCl. We asked how well could we identify the concentration of the osmotic shock from the experimental data using only 75 individual cells per experiment. We again proposed the six different potential experiments depicted in Figure 6, including the optimal STL1 and CTT1 design, the optimal STL1 design, the simplified STL1 design with 15 cells for each of the optimal five time points, the optimal CTT1 design, the uniform STL1 design, and the random STL1 design. For each design, we created 1,000 different experimental replica datasets, each consisting of 100 cells randomly chosen from the original data. For each replica dataset, we then used the CME model (Supplementary Note 1) with a parametrized form of the HOG1-MAPK nuclear localization signal (Supplementary Note 2) to find the NaCl concentration that maximizes the likelihood of the data given the model.

Figure 7 shows the resulting histograms for the estimated NaCl concentrations for each of the six experiment designs, when the cells were actually subjected to experimental osmotic shocks of 0.2 M NaCl (Figure 7(a)) or 0.4 M NaCl (Figure 7(c)). From Figures 7(a) and 7(c), it is clear that the FSP analysis provides an accurate estimate for the level of the osmotic shock input using a relatively small number of cells, despite the fact that producing such estimates was not an intended use of the model in its original formulation or parameter inference [9, 10]. Figures 7(b) and 7(d) show the uncertainty (standard deviation) in the experimental estimate of NaCl concentration (light bars), when cells are collected according to the six specific experiment designs, and compare these results to the FSP-FIM uncertainty estimates (dark bars) using the simplified step input function (equation (10)). With the exception of the suboptimal CTT1-only design, the close matches between the relative trends of the variance in experimental estimation of NaCl and the variance predicted by the FSP-FIM analysis with the approximated step-function input give further experimental validation that the FSP-FIM approach can be used to choose more informative experiment designs, even in cases where the FSP analyses use inexact assumptions for model kinetics. The single discrepancy in trends led us to more closely examine the model and experimental data for CTT1 expression at the 35-minute time point that dominates the CTT1-only design. By examining Supplemental Figure S7 from [10], we found that this specific combination of CTT1 at 35 minutes following 0.4 M NaCl osmotic shock showed a greater discrepancy between model and data than any of the other 63 combinations of 16 times, two genes, and two conditions, yet it is unclear if that difference was an artifact of the experiment or an actual transient effect that only affected that specific combination of gene, time, and environmental condition.

4. Discussion

The methods developed in this work present a principled, model-driven approach to allocate how many snapshot single-cell measurements should be taken at each time during analysis of a time-varying stochastic gene regulation system. We demonstrate and verify these theories on a well-established model of osmotic stress response in yeast cells, which is activated upon the nuclear localization of phosphorylated HOG1 [9, 10]. For this system, we showed how to optimally allocate the number of cells measured at each time so as to maximize the information about a subset of model parameters. We found that the optimal experiment design to estimate model parameters for the STL1 gene only required three time points. Moreover, these three time points ( minutes, highlighted by blue in Figure 3(b)) are at biologically meaningful time points. At and 15 minutes, the system is increasing to maximal expression, and the probability to measure a cell with elevated mRNA content is high, which helps reduce uncertainty about the parameters in the model that control maximal expression. Similarly, at the final experiment time of minutes, the system is starting to shut down gene expression, and therefore this time is valuable to learn about the time scale of deactivation in the system as well as the mRNA degradation rate. These effects are clearly illustrated in Figure 3(a), which shows that times and minutes provide the most information about parameters , , and , whereas measurements at minutes provide the most information about γ. Because γ is the easiest parameter to estimate (e.g., its information is greater), not as many cells are needed at minutes to constrain that parameter. Similarly, because is the most difficult parameter to estimate (e.g., it has the lowest information across all experiments) and because minutes is one of the few time points to provide information about , the optimal experimental design selects a large number of cells at the time minutes. This analysis demonstrates that the optimal experiment design can change depending upon which parameters are most important to determine (e.g., γ or in this case), a fact that we expect will be important to consider in future experiment designs.

Because we constrained all potential experiment designs to be within the subset of experiments performed in our previous work [10], we are able to compare the information of optimal experiment designs to intuitive designs that have actually been performed. We found that while the intuitive experiments were almost always better than could be expected by random chance, they still provided several orders of magnitude lower Fisher information than would be possible with optimal experiments (Figure 4(a)). Moreover, in our analyses, we found that optimal designs could require far fewer time points than those designed by intuition (e.g., only three time points were needed in Figure 3), and therefore these designs can be much easier and less expensive to conduct. We also found that utility of optimal experiment designs could be relatively insensitive to variation in the experimental conditions or the specific model parameters used for the experiment design. For example, we found that experiments optimized for one gene at one level of osmotic shock were still at least as good—and in most cases better—than intuitive designs, even when conducted using different genes and at a different level of osmotic shock (Figure 4(b)). In practice, this fact would allow for effective experiment designs despite inaccurate prior assumptions.

In addition to suggesting optimal experiments to identify model parameters, we showed that the FSP approach could be used to infer parameters of fluctuating extracellular environments from single-cell data and that the FSP-FIM combined with an existing model could be used to design optimal experiments to improve this inference (Figures 5 and 6). We experimentally verified this potential by examining many small sets of single-cell smFISH measurements for different genes and different measurement times, and we showed that an FSP-FIM analysis could correctly rank which experiment designs would give the best estimates of osmotic shock environmental conditions. Along a very similar line of reasoning, one can also adapt the FSP-FIM analysis to learn what biological design parameters would be optimal to reduce uncertainty in the estimate of important environmental variables. For example, Figure 8 shows the expected uncertainty in as a function of the degradation rate of the STL1 gene assuming that 50 cells could be measured at each experimental measurement time minutes using the smFISH approach. We found that the best choice for STL1 degradation rate to most accurately determine the extracellular fluctuations would be mRNA/min, which is about half of the experimentally determined value of from [10]. This result is consistent with our earlier finding that the faster degrading STL1 mRNA is a much better determinant of the HOG1 dynamics than the slower-degrading CTT1 mRNA and suggests that other less stable mRNA could be more effective still. We expect that similar, future applications of the FSP-based Fisher information will be valuable in other systems and synthetic biology contexts where scientists seek to explore how different cellular properties affect the transmission of information between cells or from cells to human observers. Indeed, similar ideas have been explored recently using classical information theory in [3639], and recent work in [7, 40] has noted the close relationship between Fisher information and the channel capacity of biochemical signaling networks.

We expect that computing optimal experiment designs for time-varying stochastic gene expression will create opportunities that could extend well beyond the examples presented in this work. Modern experimental systems are making it much easier for scientists and engineers to precisely perturb cellular environments using chemical induction [4143] or optogenetic control [4446]. Many such experiments involve stochastic bursting behaviors at the mRNA or protein level [810, 45], and precise optimal experiment design will be crucial to understand the properties of stochastic variations in such systems. A related field that is also likely to benefit from such approaches is biomolecular image processing and feedback control, for which one may need to decide in real time which measurements to make and in what conditions.

Data Availability

All data and codes associated with this article are available at https://github.com/MunskyGroup/Fox_Complexity_2020.

Disclosure

The content is solely the responsibility of the authors and does not necessarily represent the official views of the funding agencies.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

ZRF and BM were supported by National Institutes of Health (R35 GM124747). ZRF was also supported by the Agence Nationale de la Recherche (ANR-18-CE91-0002, CyberCircuits). GN was supported by the National Institutes of Health (DP2 GM11484901 and R01GM115892) and Vanderbilt Startup Funds. The presented analyses used the computational resources of the WM Keck High Performance Computing Cluster supported under a WM Keck Foundation Award.

Supplementary Materials

Supplementary note 1: stochastic model of yeast stress response. Supplementary note 2: nuclear localization of HOG-MAPK. Supplementary note 3: optimization of cell measurements. Table I: HOG-MAPK model parameters. Table II: HOG-signaling model parameters. Figure 1: verification of the FSP-FIM for the time-varying HOG-MAPK model. (Supplementary Materials)