National Bioterrorism Syndromic Surveillance Demonstration Project 
Supported by the National Center for Infectious Diseases
Centers for Disease Control and Prevention
(Grant UR/CCU115079)

Centers for Disease Control and Prevention | Harvard Pilgrim Health Care | Channing Laboratory | Harvard Medical School | American Association of Health Plans



 

Statistical Modeling

This surveillance system detects unusual clusters of illness in single and contiguous 3- and 5-digit zip code areas in the United States. Each data-providing organization’s data are modeled separately. Two different signal detection methods are used, one based on generalized linear mixed models (known as the "SMART score" method) and the other based on a model-adjusted spatial and temporal scan statistical technique (known as the "SaTScan" method). These are described briefly below. Both use the "recurrence interval" to express the degree of statistical aberration associated with each cluster, so the recurrence interval is explained first.

 

The Recurrence Interval

The recurrence interval (RI) is the metric, expressed in days, that this syndromic surveillance system uses to convey how improbable an observed event is. It is a model-based estimate of the length of time that surveillance would have to continue in order to expect exactly one event as rare as, or rarer than, the one observed. The chart below depicts this concept.

Graphical explanation of Recurrence Interval

The left side of the chart shows that a frequently observed count is by definition small, on the order of 1 encounter or even 0. We expect to see such a count at least once per day; that is, its recurrence interval is about one day, and the probability of its occurring on any given day approaches 100%.

The highlighted RI and probability in the chart correspond to a count with a model-based probability of 0.01, which by definition has a 1% chance of occurring on any particular day just by chance. In any 100-day period of continuous surveillance, we would expect one event with a probability this small, or smaller, to occur.

Please Note: A given count will correspond to different RIs, depending on the syndrome, zip code, day of the week, season, and data provider.
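
The relationship between a daily probability and its recurrence interval described above can be sketched in a few lines of Python. This is a minimal illustration only; the function name recurrence_interval_days is hypothetical and is not part of the surveillance system.

```python
def recurrence_interval_days(p: float) -> float:
    """Days of surveillance needed to expect one event whose
    probability on any single day is p (an illustrative helper)."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in the interval (0, 1]")
    return 1.0 / p


# A count with a model-based probability of 0.01 recurs about
# once in every 100 days of continuous surveillance.
for p in (1.0, 0.01, 0.001):
    print(f"p = {p}: RI = {recurrence_interval_days(p):.0f} days")
```

Smaller probabilities map to longer recurrence intervals, which is why a large RI marks a more unusual count.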

 

The Statistical Approach

Overview
Each data-providing entity supplies historical data at the beginning of its participation in the syndromic surveillance system. We fit a model to these data and use the results to estimate distributional parameters for each day of the year. Using these distributional parameters, we calculate the probability of seeing as many or more cases as were seen on each day. We then correct for multiple testing. The same parameters are also used to generate the expected counts that serve as input to SaTScan.

Model
The model is a generalized linear or generalized linear mixed model. The part of the model pertaining to the day of the observation includes:

  • A secular linear trend over time
  • Sine and cosine effects for seasonality
  • Month indicators (11) for non-trigonometric effects of season
  • Day-of-week indicators (6) for day-to-day variability
  • Holiday and day-after holiday indicators
No assertion is made that this is an exhaustive or parsimonious list of useful covariates.
Currently, this is a Poisson generalized linear model with a different intercept for each region (5-digit or 3-digit zip code). Model assessment continues.
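
As an illustration of the day-level covariates listed above, the following Python sketch builds one row of covariates for a given date. The function name design_row, the choice of January and Monday as reference categories, the annual period of 365.25 days, and the holiday set are all assumptions made for this example; the region-specific intercept is handled separately and is not included in the row.

```python
import math
from datetime import date, timedelta


def design_row(day, start, holidays):
    """Day-level covariates for one date (illustrative sketch).

    day      -- the date being modeled
    start    -- first day of the historical period (trend origin)
    holidays -- set of dates treated as holidays (assumed input)
    """
    trend = float((day - start).days)              # secular linear trend
    angle = 2.0 * math.pi * day.timetuple().tm_yday / 365.25
    row = [trend, math.sin(angle), math.cos(angle)]
    # 11 month indicators; January is the assumed reference month
    row += [1.0 if day.month == m else 0.0 for m in range(2, 13)]
    # 6 day-of-week indicators; Monday is the assumed reference day
    row += [1.0 if day.weekday() == d else 0.0 for d in range(1, 7)]
    # holiday and day-after-holiday indicators
    row.append(1.0 if day in holidays else 0.0)
    row.append(1.0 if (day - timedelta(days=1)) in holidays else 0.0)
    return row


row = design_row(date(2005, 11, 25), date(2005, 1, 1), {date(2005, 11, 24)})
print(len(row))  # 3 + 11 + 6 + 2 covariates
```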

Distributional parameter estimation
To find the distributional parameters, we calculate the value of the linear predictor from the model for each area for each day. Then we invert the link function (e.g. we exponentiate, for the Poisson) to find the estimated parameter (mean, for Poisson) from the distribution. Our approach does not incorporate information regarding the variability of this estimate.
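
For the Poisson case, inverting the log link amounts to exponentiating the linear predictor. A minimal sketch follows; the function and argument names are hypothetical, and the numbers in the usage line are made up for illustration rather than taken from any fitted model.

```python
import math


def poisson_mean(region_intercept, coefficients, covariates):
    """Estimated Poisson mean for one area on one day (sketch).

    The linear predictor is the region-specific intercept plus the
    sum of coefficient * covariate; the log link is inverted with exp.
    """
    if len(coefficients) != len(covariates):
        raise ValueError("coefficients and covariates differ in length")
    eta = region_intercept + sum(b * x for b, x in zip(coefficients, covariates))
    return math.exp(eta)


# Made-up values: intercept log(2), one covariate with coefficient 0.1
print(poisson_mean(math.log(2.0), [0.1], [1.0]))
```

As in the approach described above, this point estimate carries no information about its own variability.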

Probability assessment
We use the distribution to calculate the probability of seeing each number of cases. The probability of seeing as many or more cases than were observed is then 1 minus the sum of the probabilities of all smaller counts.
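
The calculation just described can be sketched for the Poisson distribution as follows. The function names are hypothetical, and the sketch assumes a strictly positive mean.

```python
import math


def poisson_pmf(k, mean):
    """Poisson probability of exactly k cases (mean must be > 0)."""
    return math.exp(-mean + k * math.log(mean) - math.lgamma(k + 1))


def prob_as_many_or_more(observed, mean):
    """P(X >= observed): 1 minus the summed probability of fewer cases."""
    if observed <= 0:
        return 1.0  # every count is at least 0
    below = sum(poisson_pmf(k, mean) for k in range(observed))
    return max(0.0, 1.0 - below)  # guard against tiny negative rounding


# With an expected count of 2, how unusual are 7 or more cases?
print(prob_as_many_or_more(7, 2.0))
```

Subtracting from 1 loses precision when the tail probability is extremely small; a production implementation would more likely call a survival function from a statistics library.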

Correction for multiple testing
Since many areas are tested each day, the nominal probability generated in the previous step is misleading. The number of counts with a given probability, or smaller, that we expect to see per day is equal to probability × number of tests. The recurrence interval is the reciprocal of this quantity, 1 / (probability × number of tests): the number of days of doing that many tests that would be required in order to expect exactly one count with a probability equal to the observed.
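
A short sketch of this correction, assuming the same number of tests is carried out every day. The function name and the example figures are illustrative only.

```python
def adjusted_recurrence_interval(p, tests_per_day):
    """Days of surveillance, at tests_per_day tests per day, needed to
    expect one count with probability p or smaller (sketch)."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in the interval (0, 1]")
    if tests_per_day < 1:
        raise ValueError("tests_per_day must be at least 1")
    return 1.0 / (p * tests_per_day)


# A nominal probability of 0.0001 looks like a 10,000-day event,
# but with 100 areas tested daily it recurs about every 100 days.
print(adjusted_recurrence_interval(0.0001, 100))
```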

SaTScan analysis
The space-time scan statistic uses a cylindrical window with a circular geographic base and with height corresponding to time, where both the base and height vary in size, with a maximum geographical size of 50% of the population at risk and a maximum height of three days. The cylindrical window is then moved in space and time, so that for each possible geographical location and size, it also visits each possible time period. In effect, we obtain an infinite number of overlapping cylinders of different size and shape, jointly covering the entire study region, where each cylinder reflects a possible outbreak.

The estimated Poisson mean from the modeling described above is used as the expected count for each region and day. This corrects the SaTScan for seasonal, day-of-week, and other patterns observed in the historical period.

For each cylinder, the observed and expected numbers of cases are noted, and these are used to calculate a Poisson-based log likelihood ratio reflecting how 'unusual' it is to observe what was observed. The p-values are adjusted for the multiple testing inherent in the many cylinders considered.
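
The exact form of the likelihood ratio is not given on this page. As an illustration, the sketch below uses the usual form of the Poisson log likelihood ratio from the scan statistic literature, and assumes the expected counts have been scaled so that they sum to the total number of observed cases. The function and argument names are hypothetical.

```python
import math


def poisson_llr(observed_in, expected_in, observed_total):
    """Poisson log likelihood ratio for one cylinder (sketch).

    observed_in    -- cases observed inside the cylinder
    expected_in    -- cases expected inside the cylinder (> 0)
    observed_total -- total cases in the study region and period;
                      expected counts are assumed to sum to this total
    """
    c, e, total = observed_in, expected_in, observed_total
    if c <= e:
        return 0.0  # only excesses of cases are of interest
    llr = c * math.log(c / e)
    if total > c:
        # contribution of the cases outside the cylinder
        llr += (total - c) * math.log((total - c) / (total - e))
    return llr


# 10 cases observed where 5 were expected, out of 100 cases overall
print(poisson_llr(10, 5.0, 100))
```

The cylinder with the largest ratio is the most likely cluster. The sketch does not cover how its p-value is obtained and adjusted for the many cylinders considered.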

For a full technical discussion and details, please contact the head statistician for the project.


© Copyright 2000-2010 Channing Laboratory
All Rights Reserved.
last modified: Nov 20, 2005 5:46 pm