
# Statistics for Epidemiology-Nicholas P. Jewell-1584884339-CRC-2003-352-$94

(Part **5** of 7)

In the following chapters, a fixed level of exposure over time is generally assumed.

However, chronic exposures, including smoking and occupational conditions, accumulate, implying that the risk of disease will also change over the risk period, for this reason if for no other. Composite summary measures of exposure, like pack-years of smoking, should be used with extreme care, since they capture only average exposure information. For example, in studies of fetal alcohol syndrome, episodes of binge drinking may be a better measure of alcohol exposure than average consumption measures. Exposures that vary over time are often best handled with hazard models, which are examined in Chapter 17, where regression methods for incidence proportions are briefly compared with those commonly used for hazard functions. Further details on estimation and inference for hazard functions, and related topics in survival analysis, can be found in Kalbfleisch and Prentice (2002), Hosmer and Lemeshow (1999), and Collett (1994).

2.4 Problems

Question 2.1

Figure 2.5 illustrates observations on 20 individuals in a study of a disease D. The time (i.e., horizontal) axis represents age. The lines, one per individual, represent the evolution of follow-up: the left endpoint signifies the start of follow-up, while the right endpoint indicates the age at onset of D, except in cases of withdrawal for a variety of reasons, marked with a W. For example, the first subject (the lowest line above the axis) started follow-up before his 50th birthday and developed the disease early in his 67th year, i.e., just after turning 66. Calculate for this small population:


Figure 2.5 Schematic showing onset of disease at different ages in population of 20 individuals.

1. The incidence proportion for the disease between the ages 50 and 60 (assume that the disease is chronic so that those with the disease are no longer at risk).

2. The incidence proportion between (a) ages 60 and 70, and (b) ages 70 and 75.

3. The incidence rate for the intervals (a) ages 50 to 60, (b) ages 60 to 70, and (c) ages 70 to 75.

Comment on your findings.

Question 2.2

Indicate whether each of the following computed indices should be considered a point prevalence, incidence proportion, or an incidence rate:

1. The number of children born with congenital heart defects in California in 2002, divided by the number of live births in California in 2002.

2. The number of persons who resided in California on January 1, 2002, and who developed colon cancer during 2002, divided by the total number of disease-free persons who were California residents on January 1, 2002.

3. The number of myopic children under the age of 13 in California on July 1, 2002, divided by the total number of children under the age of 13 in California on July 1, 2002.

4. The number of 60 to 64-year-old California residents who had a stroke in 2002, divided by the total number of 60 to 64-year-old residents on July 1, 2002.

Question 2.3

Describe plausible disease scenarios, with the relevant risk intervals, that suggest (1) an increasing hazard function; (2) a decreasing hazard function; (3) initially increasing hazard, followed by decreasing hazards; and (4) initially decreasing hazard, followed by increasing hazards.

CHAPTER 3 The Role of Probability in Observational Studies

As indicated in the introduction, it is assumed that the reader is familiar with the basic concepts and manipulation of probability and random variables. In this chapter the use of probabilistic terms in epidemiological studies is discussed and some of the simplest ideas are reviewed. One goal of this chapter is to understand what we mean by the risk or probability of a disease. Does it mean that there is some random mechanism inside our bodies that decides our fate? While rejecting that notion, the source of randomness in epidemiological investigations, a key step in quantifying the uncertainty inherent in such studies, is also described. First, some basic understanding of the language and meaning surrounding a probability statement is needed.

Two fundamental components necessary to describe the probability of an occurrence are (1) a random experiment and (2) an event. A random experiment is a process that produces an identifiable outcome not predetermined by the investigator. An event is a collection of one or more distinct possible outcomes. An event occurs if the observed outcome of the experiment is contained in the collection of outcomes defining the event. For example, in tossing a coin one usually thinks of only two possible outcomes —“heads” and “tails.” Here, the experiment is the toss of a coin, and an event might be that the coin comes up heads. A qualitatively similar situation occurs with the administration of a specific therapy to a patient with a certain disease. Here, the random experiment is application of treatment, with possible events being “patient cured” or “patient not cured”; note that “not cured” may be defined in terms of combinations of simple outcomes such as blood pressure reading or amount of inflammation.

What is the probability that a tossed coin comes up heads? More generally, in any random experiment, what is the probability of a particular event? Denote the probability of an event A by the term P(A). A heuristic definition of probability is as follows:

In a random experiment, P(A) is the fraction of times the event A occurs when the experiment is repeated many times, independently and under the exact same conditions.

To express this formally, suppose that a random experiment is conducted K times and the event A occurs in $K_A$ of the total K experiments. As K grows larger and larger, the fraction of times the event A occurs, $K_A/K$, approaches a constant value. This value is P(A), the probability of A occurring in a single experiment.
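The long-run definition can be illustrated with a small simulation. The following is only a sketch: it assumes a fair coin (so the limiting value is 0.5), and the function name `long_run_fraction` and the seed are illustrative choices, not part of the text.

```python
import random


def long_run_fraction(num_trials, seed=0):
    """Return K_A / K: the fraction of K simulated fair-coin tosses that land heads."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    heads = sum(1 for _ in range(num_trials) if rng.random() < 0.5)
    return heads / num_trials


# As K grows, the observed fraction should settle near P(heads) = 0.5.
for k in (10, 1_000, 100_000):
    print(k, long_run_fraction(k))
```

For small K the fraction can be far from 0.5; only for large K does it stabilize, which is the sense in which P(A) is a long-run quantity.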

3.1 Simple random samples

A special type of random experiment is illustrated by the sampling of individuals from a population that contains N distinct members; suppose you wish to select a sample of n of these members for study. If the members are selected “at random,” then the sample is called a random sample. As in a random experiment, “at random” implies that although the investigator sets the sample size n, she does not predetermine which n individuals will be selected. An important kind of random sample is the simple random sample, with the defining property that every possible sample of size n is equally likely to occur, or is chosen with equal probability. Consider the special case n=1; since there are N members in the population, there are exactly N samples of size n=1, each distinct sample containing one particular member of the population. Since the N samples are equally likely to occur, each sample occurs with probability 1/N. We refer to this below as random selection of a population member.
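A tiny example may help fix the defining property of a simple random sample. The sketch below assumes a hypothetical population of N = 5 labeled members and a sample size of n = 2; it enumerates every possible sample and then draws one using Python's standard `random.sample`, which selects without replacement.

```python
import itertools
import math
import random

population = ["a", "b", "c", "d", "e"]  # hypothetical population, N = 5
n = 2                                   # sample size set by the investigator

# Every possible sample of size n (order ignored).
all_samples = list(itertools.combinations(population, n))

# Under simple random sampling each of these is equally likely.
prob_each_sample = 1 / len(all_samples)

# Draw one simple random sample; the seed only makes the run reproducible.
rng = random.Random(0)
sample = rng.sample(population, n)

print(len(all_samples), prob_each_sample, sample)
```

With n = 1 the same enumeration yields N samples, each with probability 1/N, which is the random selection of a single population member described above.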

To fix the ideas, consider a simple example. Imagine drawing a single card at random from a deck of 52 playing cards. We wish to take a simple random sample of size n=1 (one card) from a population of N=52 cards. Now we wish to calculate the probability of a particular event associated with our sample of one card. For example, suppose we wish to compute the probability that the selected card is from the clubs suit; we will call this event A. Using the “long run” definition of probability above, we must ask what fraction of K independent simple random samples of size n=1 will produce the outcome of a club card. Since each possible sample (i.e., each of 52 distinct playing cards) is equally likely, we would expect each distinct playing card to appear in a fraction 1/52 of the draws in the long run, and thus about K/52 times in K draws. Thirteen of the cards are clubs; therefore, in the long run, the fraction of K draws that produces a club will be ((K/52) × 13)/K = 1/4. This is thus the probability of drawing a club on a single draw.
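The card calculation can be reproduced by counting. This sketch builds a standard 52-card deck and computes the fraction of cards with a given attribute; the helper name `probability` is an illustrative choice.

```python
from fractions import Fraction

SUITS = ("clubs", "diamonds", "hearts", "spades")
RANKS = ("A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K")

# The population: N = 52 distinct cards.
deck = [(rank, suit) for suit in SUITS for rank in RANKS]


def probability(population, has_attribute):
    """Fraction of population members that possess the attribute."""
    count = sum(1 for member in population if has_attribute(member))
    return Fraction(count, len(population))


p_club = probability(deck, lambda card: card[1] == "clubs")
print(p_club)  # 1/4
```

Using `Fraction` keeps the result exact, so 13/52 reduces to 1/4 rather than a rounded decimal.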

It is useful to draw a general conclusion from this example. When randomly selecting a single object from a group of N objects, the probability that the selected object has a specific attribute (event A) is just the fraction of the N objects that possess this attribute. We illustrate this by referring to Table 3.1, which lists the vital status of all births in the United States in 1991, 1 year after date of birth, categorized by the marital status of the mother at the birth and by the birthweight of the infant. By convention, low birthweight is defined as a birth where the newborn’s weight is less than 2500 g. Suppose our “experiment” is now the random selection of one of these 4,111,059 births. If the event A refers to the death of this randomly chosen infant within 1 year of its birth, then P(A) is the probability of an infant death in 1991 and is the fraction of the entire population of infants who died in their first year. The population information derived from the table shows this probability to be 35,496/4,111,059 = 0.0086, or 8.6 deaths per 1000 births. Similarly, if event B refers to a randomly chosen infant with normal birthweight, then P(B) is the probability of a normal birthweight infant, given by the fraction of all births that have a normal birthweight, that is, 3,818,736/4,111,059 = 0.93.

Table 3.1 1991 U.S. infant mortality by mother’s marital status and by birthweight

Mother’s Marital Status

| Infant Mortality | Unmarried | Married | Total |
|---|---|---|---|
| Death | 16,712 | 18,784 | 35,496 |
| Live at 1 year | 1,197,142 | 2,878,421 | 4,075,563 |
| Total | 1,213,854 | 2,897,205 | 4,111,059 |

Birthweight

| Infant Mortality | Low Birthweight | Normal Birthweight | Total |
|---|---|---|---|
| Death | 21,054 | 14,442 | 35,496 |
| Live at 1 year | 271,269 | 3,804,294 | 4,075,563 |
| Total | 292,323 | 3,818,736 | 4,111,059 |

Source: National Center for Health Statistics.
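The two population probabilities quoted from Table 3.1 can be recomputed from the cell counts. In this sketch the counts are copied from the table, the overall total is taken as deaths plus survivors, and the variable names are illustrative.

```python
# Cell counts from Table 3.1 (1991 U.S. births, vital status at 1 year).
deaths_by_marital = {"unmarried": 16_712, "married": 18_784}
alive_by_marital = {"unmarried": 1_197_142, "married": 2_878_421}
deaths_by_weight = {"low": 21_054, "normal": 14_442}
alive_by_weight = {"low": 271_269, "normal": 3_804_294}

total_deaths = sum(deaths_by_marital.values())
total_births = total_deaths + sum(alive_by_marital.values())

# P(A): a randomly selected infant dies within 1 year of birth.
p_death = total_deaths / total_births

# P(B): a randomly selected infant has normal birthweight.
normal_births = deaths_by_weight["normal"] + alive_by_weight["normal"]
p_normal = normal_births / total_births

print(total_births, round(p_death, 4), round(p_normal, 2))
```

Both cross-classifications describe the same births, so the marital-status and birthweight counts must add to the same overall total.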

3.2 Probability and the incidence proportion

These simple probability ideas can now be related to the measures of disease occurrence introduced in Chapter 2. As a general example, consider the incidence proportion for a disease. Each member of the relevant population at risk either is or is not an incident case in the specified time interval underlying the definition of the incidence proportion. For convenience, label the characteristic of being an incident case D. By definition, the incidence proportion is just the fraction of the population that possesses characteristic D. As discussed in the last section, if an individual is drawn at random from the population, the probability that he will have characteristic D is P(D), that is, the incidence proportion. Thus the incidence proportion can be interpreted either as the proportion of a population who are incident cases in a given interval or, equivalently, as the probability that a randomly chosen member of the population is an incident case.

Similar statements can, of course, be made if the characteristic D is defined in terms of a point or interval prevalence. However, for reasons discussed in Section 2.1, we assume that D refers to incident cases so that P(D) is interpreted as an incidence proportion. Further, we also use the terminology P(E) to reflect the probability that a randomly selected individual from a population has an exposure characteristic labeled by E; E might be a qualitative or quantitative measure of exposure or risk. For convenience, with any event A, we sometimes use $\bar{A}$ to refer to the event “not A.”

In the following, we often use the terms P(D) and P(E) to refer explicitly or implicitly to the “probability of being diseased” or the “probability of being exposed.” This language may mislead us into thinking that a disease or exposure characteristic is random in some sense. In fact, as seen above, the randomness referred to in these statements arises entirely from random sampling from a population. Although it is possible that some disease outcomes or exposures might have a purely random component (for example, a disease dependent on the occurrence of a specific random mutation, or exposure to certain kinds of infectious agents), this issue is not germane to our treatment that is based solely on the randomness introduced by taking a random sample from a population. Specifically, in the following chapters, techniques are introduced that use sample quantities to convey information on population characteristics of interest such as P(D) and P(E), the incidence proportion and population exposure frequency, respectively. Understanding random sampling allows quantification of the uncertainty inherent in using samples to infer properties of a larger population from which the sample is drawn. The next section briefly reviews estimation of the probability of a single characteristic based on a simple random sample and the uncertainty that is inherent to this sample estimator.

3.3 Inference based on an estimated probability

We rarely observe an entire population with appropriate risk factor information as was possible for the infant mortality data reported in Table 3.1. Our strategy will instead be to draw a (simple) random sample that will provide us with appropriate data we can use to estimate a population probability or proportion. We now briefly review methods to compute confidence intervals, one approach to describing the uncertainty associated with a sample estimator of a population proportion.

Suppose we are interested in estimating P(A), the probability of a characteristic A in a given population. We draw a simple random sample of size n from the population and let $n_A$ denote the number in the sample with characteristic A. For simplicity, write p = P(A).

An obvious estimate of p is $\hat{p} = n_A/n$, the proportion of the sample that shares characteristic A. From sample to sample, the random number $n_A$ follows a binomial sampling distribution with expectation (or mean) given by $np$ and variance by $np(1-p)$. This is just the sampling distribution of $n_A$. (In fact, this statement assumes we sample with replacement or that the population is infinitely large; this is only an issue when n represents a substantial fraction of the population.) For a sufficiently large n, this sampling distribution is close to a Normal distribution with the same expectation and variance. Thus, $\hat{p}$ has an approximate Normal sampling distribution with expectation p and variance $p(1-p)/n$. The variance can be estimated from our sample data by plugging in $\hat{p}$ for p, yielding an approximate Normal sampling distribution with expectation p and variance $\hat{p}(1-\hat{p})/n$.
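The stated expectation and variance of the binomial count can be checked numerically. This sketch sums over the binomial probabilities directly, assuming illustrative values n = 100 and p = 0.3 that do not come from the text.

```python
import math


def binomial_pmf(k, n, p):
    """Probability that exactly k of n sampled members have the characteristic."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


n, p = 100, 0.3  # hypothetical sample size and population proportion

# Expectation and variance of the count, computed from the distribution itself.
mean_count = sum(k * binomial_pmf(k, n, p) for k in range(n + 1))
var_count = sum((k - mean_count) ** 2 * binomial_pmf(k, n, p) for k in range(n + 1))

# The sample proportion is the count divided by n, so its variance scales by 1/n^2.
var_proportion = var_count / n**2

print(mean_count, var_count, var_proportion)
```

Up to floating-point error, the three printed values match np = 30, np(1 - p) = 21, and p(1 - p)/n = 0.0021.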

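We now have a simple method to construct a confidence interval for the unknown proportion p using our sample. Using the approximate sampling distribution,

$$P\left(p - z_\alpha\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \le \hat{p} \le p + z_\alpha\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\right) \approx 1-\alpha, \tag{3.1}$$

where $z_\alpha$ is the $[1-(\alpha/2)]$th percentile of a standard Normal distribution. Note that the probability in this statement refers to the experiment of repeatedly drawing simple random samples of size n from the population.

Equation 3.1 tells us, probabilistically, how close the sample estimate $\hat{p}$ is to the unknown p. But knowing this is the same as knowing how close p is to $\hat{p}$. That is,

$$P\left(\hat{p} - z_\alpha\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \le p \le \hat{p} + z_\alpha\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\right) \approx 1-\alpha. \tag{3.2}$$

The interval inside the probability statement of Equation 3.2 then defines a $100(1-\alpha)$% confidence interval for p.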
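The repeated-sampling interpretation can be illustrated by simulation. The sketch below assumes a hypothetical population proportion p = 0.3, samples of size n = 100 drawn as independent Bernoulli trials (that is, an effectively infinite population), and the usual value 1.96 for the 97.5th percentile of the standard Normal; the function name `coverage` is illustrative.

```python
import math
import random


def coverage(p, n, repetitions, z=1.96, seed=1):
    """Fraction of simulated samples whose interval p_hat +/- z*SE contains p."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    hits = 0
    for _ in range(repetitions):
        # One simple random sample: count members with the characteristic.
        n_a = sum(1 for _ in range(n) if rng.random() < p)
        p_hat = n_a / n
        margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
        if p_hat - margin <= p <= p_hat + margin:
            hits += 1
    return hits / repetitions


print(coverage(p=0.3, n=100, repetitions=5_000))
```

The observed fraction should be close to, though not exactly, 0.95, since the Normal approximation is only approximate at finite n.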

For example, suppose that a random sample of 100 births were drawn from infants born in the U.S. in 1991 (see the data in Table 3.1) and that in this sample 35 births were associated with unmarried mothers. The estimate of the population proportion of unmarried mothers is then 35/100 = 0.35, and a simple 95% confidence interval is

$$0.35 \pm 1.96\sqrt{\frac{0.35(1-0.35)}{100}} = 0.35 \pm 0.093 = (0.257,\ 0.443).$$

We might report this 95% confidence interval as (0.26, 0.44); it clearly makes no sense to report results to the third decimal place here, given the size of the margin of error ±0.093. (The true population probability from Table 3.1 is 0.295.) Recall that the correct interpretation of this (random) confidence interval is that it includes the true value, 0.295, with probability 0.95.
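The worked interval can be reproduced in a few lines. This is a minimal sketch of the Normal-approximation interval described in this section (often called a Wald interval, a name not used in the text); the function name `wald_interval` is illustrative, and `statistics.NormalDist` from the Python standard library supplies the Normal percentile.

```python
import math
from statistics import NormalDist


def wald_interval(n_a, n, level=0.95):
    """Normal-approximation confidence interval for a proportion.

    n_a is the number in the sample with the characteristic, n the sample size.
    """
    p_hat = n_a / n
    # z is the [1 - (alpha/2)]th percentile of the standard Normal.
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


low, high = wald_interval(35, 100)
print(round(low, 2), round(high, 2))  # 0.26 0.44
```

For 35 unmarried mothers in a sample of 100 births, the margin of error is about 0.093, matching the hand calculation.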
