
# Statistics for Epidemiology-Nicholas P. Jewell-1584884339-CRC-2003-352-$94

(Part **4** of 7)

The incidence proportion is the proportion of a defined population, all of whom are at risk for the disease at the beginning of a specified time interval, who become new cases of the disease before the end of the interval. Since this quantity includes all individuals who become cases over the entire interval, it is sometimes referred to as the cumulative incidence proportion. To be “at risk” can mean that an individual has previously been unaffected by the disease, or that susceptibility has been regained after previously contracting the disease and recovering (e.g., as with the common cold to which no sufferer becomes fully immune). There are situations where certain individuals cannot be affected by a disease, e.g., women cannot develop prostate cancer, and so are never at risk.

Figure 2.1 demonstrates schematically the calculation of the incidence proportion and point prevalence in a contrived example of a population of 100 individuals, 6 of whom become cases of a disease during the time period from t0 to t1. Using data from the figure, the point prevalence at time t is either 4/100 or 4/99, depending on whether case 4 is considered to be at risk of the disease at t or not, respectively. The incidence proportion in the interval [t0, t1] is 4/98, since cases 1 and 4 are not at risk for the disease at the beginning of the interval. This simple scenario reflects that calculations of disease occurrence vary according to definitions of who is “at risk”; take care to compute these quantities according to the appropriate definition!
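The effect of the "at risk" definition on the denominator can be checked with a few lines of arithmetic. The following is a minimal sketch that only uses the counts quoted in the text for Figure 2.1; the function and variable names are illustrative and not part of the original.

```python
POPULATION = 100  # contrived population size from the text


def proportion(cases, denominator):
    """Return cases / denominator as a plain proportion (no units)."""
    return cases / denominator


# Point prevalence at time t: 4 prevalent cases; the denominator depends on
# whether case 4 is still counted in the population at risk at t.
prevalence_case4_counted = proportion(4, POPULATION)
prevalence_case4_excluded = proportion(4, POPULATION - 1)

# Incidence proportion over [t0, t1]: 4 new cases; cases 1 and 4 are not
# at risk at t0, so they are removed from the denominator.
incidence_proportion = proportion(4, POPULATION - 2)

print(f"point prevalence (case 4 counted):  {prevalence_case4_counted:.4f}")
print(f"point prevalence (case 4 excluded): {prevalence_case4_excluded:.4f}")
print(f"incidence proportion [t0, t1]:      {incidence_proportion:.4f}")
```

The three results differ only in the third decimal place here, but the same choice of denominator can matter much more when a large share of the population is not at risk.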

Neither prevalence (or interval prevalence) nor an incidence proportion carries any units—they are all proportions, sometimes expressed as percentages, that must lie between 0 and 1. The simplest use of these measures of disease occurrence is their comparison across subgroups that have experienced different levels of exposure. For example, one might compare the prevalence of lung cancer among adult males who have smoked at any point in their lives against adult males who have never smoked.

Measures of Disease Occurrence 1

Table 2.1 Prevalence and incidence data (proportions) on CHD in males

| Cholesterol | Incidence (10 year): CHD | Incidence (10 year): No CHD | Prevalence: CHD | Prevalence: No CHD |
|---|---|---|---|---|
| High | 85 (75%) | 462 (47%) | 38 (54%) | 371 (52%) |
| Low | 28 (25%) | 516 (53%) | 33 (46%) | 347 (48%) |

Source: Friedman et al. (1966).

The principal disadvantage with the use of prevalence measures to investigate the etiology of a disease is that they depend not only on initiation, but also on the duration of disease. That is, a population might have a low disease prevalence when (1) the disease rarely occurs or (2) it occurs with higher frequency, but affected individuals stay diseased for only a short period of time (either because of recovery or death). This complicates the role of risk factors, because duration may be influenced by many factors (such as medical treatment) that are unrelated to those that cause the disease in the first place. In addition, risk factors may change during the risk interval and so may assume different values at various times. Thus, prevalence difference across subgroups of a population is often difficult to interpret.

These points are well illustrated by coronary heart disease (CHD), from which a significant proportion of cases has high early mortality. Data (from the Framingham Heart Study, Friedman et al., 1966) relating levels of cholesterol and CHD for men aged 30 to 59 years are shown in Table 2.1. Here, incidence data refer to a group of men, initially free of CHD, whose cholesterol was measured at the beginning of a 10-year follow-up period, during which incident cases of CHD were counted. Cholesterol levels were categorized into four quartiles (“high” and “low” in the table refer to the highest and lowest quartiles). Soon we will discuss methods of analyzing such data with regard to the issue of whether, and by how much, the higher cholesterol group suffers from an elevated risk for CHD. Yet even without analysis, it is immediately clear from the incidence data that there is a substantially larger fraction of CHD cases in the high cholesterol group as compared with the low cholesterol group. This is not apparent in the prevalence data, where cholesterol and CHD measurements were taken at the end of the 10-year monitoring period. This discrepancy in the two results might then arise if high cholesterol is associated only with those CHD cases who suffered rapid mortality (dying before the end of the interval) and thus were not included in the prevalence analysis. An alternative explanation is that surviving CHD patients modified their cholesterol levels after becoming incident cases so that their levels at the end of the follow-up period became more similar to the levels of the CHD-free men. (The latter possibility is supported by a more detailed analysis of the Framingham data [Friedman et al., 1966].) This example illustrates the dangers of using prevalence data in attempts to establish a causal association between an exposure and initiation of a disease.
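The contrast between the incidence and prevalence halves of Table 2.1 can be made concrete by computing the fraction of each cholesterol group with CHD. This is only a descriptive sketch, not the analysis methods the text defers to later chapters. The low-cholesterol prevalent CHD count of 33 is an assumption: it is the value consistent with the 46% column share reported in the table. Function names are illustrative.

```python
# Counts from Table 2.1 (Friedman et al., 1966), stored as (CHD, no CHD).
incidence = {"high": (85, 462), "low": (28, 516)}
# ASSUMPTION: the low-cholesterol prevalent CHD count is taken to be 33,
# the value consistent with the table's stated 46% column share.
prevalence = {"high": (38, 371), "low": (33, 347)}


def chd_fraction(chd, no_chd):
    """Fraction of a cholesterol group that has CHD."""
    return chd / (chd + no_chd)


def compare(table):
    """Return (high fraction, low fraction, crude high/low ratio)."""
    high = chd_fraction(*table["high"])
    low = chd_fraction(*table["low"])
    return high, low, high / low


inc_high, inc_low, inc_ratio = compare(incidence)
prev_high, prev_low, prev_ratio = compare(prevalence)

print(f"incidence:  high {inc_high:.3f}, low {inc_low:.3f}, ratio {inc_ratio:.2f}")
print(f"prevalence: high {prev_high:.3f}, low {prev_low:.3f}, ratio {prev_ratio:.2f}")
```

Under these counts the 10-year incidence proportion in the high cholesterol group is roughly three times that of the low group, whereas the two prevalence proportions are nearly equal.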

While the statistical methods introduced apply equally to prevalence and incidence data, for most of the discussion and examples we focus on incidence proportions. Why? Because if causality is of prime concern, it is almost always necessary to use incidence, rather than prevalence, as a measure of disease occurrence.

2.2 Disease rates

Before leaving this brief introduction to disease occurrence measures, it is worth broadening the discussion to introduce the concept of a rate. If the time interval underlying the definition of an incidence proportion is long, an incidence proportion may be less useful if, for some groups, cases tend to occur much earlier in the interval than for other groups. First, this suggests the need for a careful choice of an appropriate interval when incidence proportions will be calculated. It does not make sense to use an age interval from 0 to 100 years if we want to compare mortality patterns. In the other direction, a study of food-related infections showed much higher mortality effects when individuals were followed for a full year after the time of infection, rather than considering only acute effects (Helms et al., 2003; see Question 5.5). Second, with long risk intervals, there may be substantial variation in risk over the entire period; for example, a person over 65 is at a far higher risk of mortality than an 8-year-old. Third, when time periods are long, not all individuals may be at risk over the entire interval; when studying breast cancer incidence, for example, it may make little sense to include premenarcheal childhood years as time at risk. To address the latter point, rates that adjust for the amount of time at risk during the interval are often used.

Specifically, the (average) incidence rate of a disease over a specified time interval is given by the number of new cases during the interval divided by the total amount of time at risk for the disease accumulated by the entire population over the same interval. The units of a rate are thus cases per unit of time at risk (e.g., cases per year).

Figure 2.2 is a schematic that allows comparison of the computation of a point prevalence, incidence proportion, and incidence rate. If we assume that the disease under study is chronic in the sense that there is no recovery, then

• The point prevalence at t=0 is 0/5=0; at t=5 it is 1/2=0.5.

• The incidence proportion from t=0 to t=5 is 3/5=0.6.

• The incidence rate from t=0 to t=5 is 3/(5+1+4+3+1)=3/14=0.21 cases per year.

Figure 2.2 Schematic illustrating calculation of an incidence proportion, point prevalence, and incidence rate. Population of 5; the symbols represent: O, death; x, incident case of disease. Here, lines represent time alive.


If population size and follow-up periods are unknown or unclear, or if multiple events per person are possible, the above incidence rate is often expressed as 0.21 cases per person-year, or 0.21 cases per person per year. Note that if the disease is acute, with the property that individuals who recover immediately return to being at risk, then the incidence rate would be 3/(5+1+5+3+4.5)=0.16 cases per year.
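The Figure 2.2 calculations can be reproduced from the years at risk quoted in the text. The sketch below assumes that each list entry is one individual's time at risk in years, taken directly from the sums in the text; the function and variable names are illustrative.

```python
def incidence_rate(new_cases, years_at_risk):
    """Average incidence rate: new cases divided by total time at risk."""
    return new_cases / sum(years_at_risk)


# Years at risk for each of the 5 individuals, taken from the sums in the text.
chronic_years = [5, 1, 4, 3, 1]    # chronic disease: cases leave the risk set
acute_years = [5, 1, 5, 3, 4.5]    # acute disease: recovered cases return to risk

new_cases = 3

incidence_proportion = new_cases / len(chronic_years)
rate_chronic = incidence_rate(new_cases, chronic_years)
rate_acute = incidence_rate(new_cases, acute_years)

print(f"incidence proportion:     {incidence_proportion:.2f}")
print(f"incidence rate (chronic): {rate_chronic:.2f} cases per person-year")
print(f"incidence rate (acute):   {rate_acute:.2f} cases per person-year")
```

The number of cases is the same in both scenarios; only the accumulated time at risk changes (14 versus 18.5 person-years), which is why the acute rate is lower.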

2.2.1 The hazard function

Returning to the second point of the first paragraph of Section 2.2, if either the population at risk or the incidence rate changes substantially over the relevant time interval, it will be necessary to consider shorter subintervals in order to capture such phenomena. That is, both an incidence proportion (a cumulative measure) and an incidence rate (an average rate over the risk interval) are summary measures for the entire interval and, as such, conceal within-interval dynamics. Unfortunately, limited population size often means that there will be very few incident events in short intervals of time. Nevertheless, in sufficiently large populations it may be possible to measure the incidence rate over smaller and smaller intervals. Such calculations yield a plot of incidence rate against, say, the midpoint of the associated interval on which the incidence rate was based. This kind of graph displays the changes in the incidence rate over time, much as a plot of speed against time might track the progress of an automobile during a journey. This process, in the hypothetical limit of ever smaller intervals, yields the hazard function, h(t), which is thus seen as an instantaneous incidence rate.

Figure 2.3 shows a schematic of the hazard function for human mortality among males, where the time variable is age. Looking at mortality hazard curves may feel morbid, particularly for those who find themselves on the right-hand incline of Figure 2.3. However, it is worth remembering that, as Kafka said, the point of life is that it ends. (I have always thought that one of the fundamental points of The Odyssey is also that the finiteness of life is what imbues it with meaning.) In Figure 2.3, the hazard function, plotted on the Y-axis, yields the mortality rate (per year) associated with a given age. In a population of size N, a simple interpretation of the hazard function at time t is that the number of cases expected in a small and unit increment of time is Nh(t). For example, if N=1000 and the hazard function at time t is 0.005/year, then we roughly anticipate five cases in a year including the time t somewhere near the middle.
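The idea of the hazard as the limit of incidence rates over ever smaller intervals can be illustrated numerically. The sketch below assumes a purely hypothetical linear hazard, h(t) = 0.01 + 0.002t per year, which is not taken from Figure 2.3. It also assumes a closed population in which becoming a case is the only way to leave the risk set. Under those assumptions the incidence rate over [a, b] is the proportion of new cases divided by the per-person time at risk.

```python
import math


def hazard(t):
    """Hypothetical hazard (per year); an assumption for illustration only."""
    return 0.01 + 0.002 * t


def at_risk_fraction(t):
    """Fraction still at risk at time t: exp(-integral of hazard from 0 to t)."""
    return math.exp(-(0.01 * t + 0.001 * t * t))


def interval_incidence_rate(a, b, steps=10_000):
    """New cases in [a, b] divided by person-time at risk (midpoint rule)."""
    width = (b - a) / steps
    person_time = width * sum(
        at_risk_fraction(a + (i + 0.5) * width) for i in range(steps)
    )
    new_cases = at_risk_fraction(a) - at_risk_fraction(b)
    return new_cases / person_time


t = 10.0
for half_width in (5.0, 1.0, 0.05):
    rate = interval_incidence_rate(t - half_width, t + half_width)
    print(f"interval width {2 * half_width:5.1f} years: rate {rate:.5f} per year")
print(f"hazard at t = {t}: {hazard(t):.5f} per year")

# Expected cases in one year for N = 1000 and h(t) = 0.005 per year (from the text).
expected_cases = 1000 * 0.005
print(f"expected cases: {expected_cases:.1f}")
```

As the interval around t = 10 shrinks, the interval incidence rate approaches the hazard at that time. The last line reproduces the text's rough expectation of five cases per year in a population of 1000.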

If we write the time interval of interest as [0, T], there is a direct link between the hazard function, h(t), for 0 ≤ t ≤ T, and the incidence proportion over the interval [0, t], which we denote by I(t). (We assume here that an incident case is no longer at risk after contracting the disease, and that this is the only way in which an individual ceases to be at risk.) The plot of I(t) against t is necessarily increasing, since I(t) measures the cumulative incidence proportion up to time t, and therefore has a positive slope at any time t. It can be shown that

$$h(t) = \frac{dI(t)/dt}{1 - I(t)},$$

where dI(t)/dt represents the slope of I(t) at time t. Note that the term in the denominator accounts for the proportion of the population still at risk at time t. This relationship provides a way of uniquely linking any cumulative incidence proportion, I(t), to a specific hazard function h(t), and vice versa. Oops, I promised no calculus, but this correspondence between incidence and hazard is worth an exception.

Figure 2.3 Schematic of the hazard function based on mortality data for Caucasian males in California in 1980.

When the outcome of interest is disease mortality rather than incidence, we often look at the function S(t), simply defined by S(t) = 1 − I(t). Known as the survival function, S(t) measures the proportion of the population that remains alive at age t. Figure 2.4 shows the survival function corresponding to the hazard function of Figure 2.3.
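The correspondence between hazard, incidence proportion, and survival function can be written out in a few lines. This is a sketch under the assumptions already stated in the text (an incident case leaves the risk set, and nothing else removes individuals from risk), together with the starting condition I(0) = 0. The hazard is taken to be the slope of I(t) divided by the proportion still at risk, 1 − I(t).

```latex
\begin{aligned}
h(t) &= \frac{dI(t)/dt}{1 - I(t)}
      = -\frac{dS(t)/dt}{S(t)}
      = -\frac{d}{dt}\log S(t), \\[4pt]
S(t) &= \exp\!\left(-\int_0^t h(u)\,du\right), \qquad
I(t) = 1 - \exp\!\left(-\int_0^t h(u)\,du\right).
\end{aligned}
```

For a constant hazard h, this reduces to I(t) = 1 − exp(−ht). With h = 0.005 per year, the one-year incidence proportion is approximately 0.00499, in line with the rough expectation of five cases per year in a population of 1000.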

One of the appeals of hazard functions is the ease with which we can extract dynamic information from a plot of the hazard function, as compared with corresponding plots of the incidence proportion or survival function, against time. Even though Figures 2.3 and 2.4 contain exactly the same information, the hazard curve is considerably easier to interpret. For example, to quantify mortality risk for males in the first year of life, observe in Figure 2.3 that this level is roughly the same (yearly) mortality risk faced by 60-year-old males. This comparison is extremely difficult to extract from Figure 2.4. Note the steep increase in mortality risk after age 65 that restricts the chance of extremely long lives. While this phenomenon can be inferred from the graph of the survival function, where a drop occurs at later ages, specific comparative information regarding mortality risks is harder to interpret from the survival function than the hazard function.


Figure 2.4 Schematic of the survival function (for mortality) among Caucasian males in California in 1980.

2.3 Comments and further reading

While we do not consider this matter further here, we cannot overstate the value of effective assessment of exposures. With common exposures (tobacco and alcohol consumption, for example), there is substantial literature from previous studies that provides detailed methodology and examples. As a general guide, we refer to the book by Armstrong et al. (1994). In the context of a specific example, there is an excellent discussion of many relevant issues regarding the measurement of exposure to environmental tobacco smoke in the National Academy report on this topic (National Research Council, 1986). For nutritional exposures, the book by Willett (1998) is very useful. Some exposures can only be exactly measured by complex and perhaps invasive procedures so that accurate proxies must often be sought. There is considerable expertise available in questionnaire design and interview procedures to assess exposures determined from a survey instrument.

The definition of an incidence proportion assumes a closed population; that is, no new individuals at risk are allowed to enter after the beginning of the risk period. That this restriction is relaxed when using an incidence rate is one of its principal advantages. That said, it is still possible to estimate an incidence proportion when some individuals enter the population during the risk period. This is sometimes referred to as delayed entry, or left truncation. An important example arises in studying risks during pregnancy, where the risk period commences at conception but most study participants do not begin observation until a first prenatal visit, or at least until a pregnancy has been detected. The issue of closed vs. open populations will be echoed in our description of study designs in Chapter 5. Variations in incidence rate by exposure are not studied further here, but such data are widely available. The use of Poisson regression models is an attractive approach to rate data, corresponding to the use of regression models for incidence proportions that are studied extensively in Chapters 12 to 15. For further discussion of Poisson regression, see Selvin (1996) and Jewell and Hubbard (to appear).

Comments about exposure assessment are extremely cursory here, and fail to demonstrate many of the complex issues in classifying individuals into differing levels of exposure. Exposure information is often only available in proxy form including self-reports, job records, biomarkers, and ecologic data. To some degree, all exposure measurements are likely to only approximate levels of a true biological agent, even when using standard exposures such as smoking or alcohol consumption histories. Often, it may be valuable to obtain several proxies for exposure, at least on a subgroup of study participants. Validation information—for example, using an expensive but highly accurate exposure measurement on a subset of sampled individuals—is often a crucial component for estimating the properties of exposure measurement error.
