
# Statistics for Epidemiology-Nicholas P. Jewell-1584884339-CRC-2003-352-$94

(Part **6** of 7)

A popular interpretation of a confidence interval is that it provides values for the unknown population proportion that are “compatible” with the observed data. But we must be careful not to fall into the trap of assuming that each value in the interval is equally compatible. In the example of the above paragraph, the value p=0.35 is much more plausible than the value p=0.4, although the data do not allow us to definitively rule out that possibility. Computing (and reporting) two confidence intervals with differing confidence coefficients reinforces this point. Using the data above ($\hat{p}=0.35$, n=100), the 90% confidence interval for the population proportion of unmarried mothers is

$$0.35 \pm 1.645\sqrt{\frac{0.35(1-0.35)}{100}} = (0.272,\ 0.428).$$

We have been using an approximate sampling distribution for $\hat{p}$ that is effective when the sample size n is “large.” Most investigators agree that the question of whether n is large enough can be checked by ensuring that $n\hat{p}$ and $n(1-\hat{p})$ are both greater than 5. In cases that fail to meet this criterion, because the sample is small or the estimate $\hat{p}$ is very close to either 0 or 1, the confidence interval described in Equation 3.2 should not be used.
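As a rough numerical companion to the discussion above, the following Python sketch computes the simple interval at two confidence coefficients and applies the rule of thumb. It assumes that Equation 3.2 is the usual large-sample interval $\hat{p} \pm z\sqrt{\hat{p}(1-\hat{p})/n}$ and that the running example has $\hat{p}=0.35$ with n=100; the function names are illustrative and are not taken from any package.

```python
from math import sqrt
from statistics import NormalDist


def simple_interval(p_hat, n, level=0.95):
    """Large-sample interval p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.645 for 90%, 1.960 for 95%
    half_width = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width


def sample_is_large_enough(p_hat, n, threshold=5):
    """Rule of thumb: n * p_hat and n * (1 - p_hat) both exceed the threshold."""
    return n * p_hat > threshold and n * (1 - p_hat) > threshold


# Assumed running example: 35 unmarried mothers in a sample of 100 births.
p_hat, n = 0.35, 100
ci90 = simple_interval(p_hat, n, level=0.90)
ci95 = simple_interval(p_hat, n, level=0.95)
print(f"90%: ({ci90[0]:.3f}, {ci90[1]:.3f})")
print(f"95%: ({ci95[0]:.3f}, {ci95[1]:.3f})")
print("large enough:", sample_is_large_enough(p_hat, n))
```

Under these assumptions the 90% interval is roughly (0.272, 0.428) and the 95% interval roughly (0.257, 0.443). The narrower interval sits inside the wider one, which is one way to see that values near the center of an interval are more plausible than values near its edges.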

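The simple technique is not useful for small or large $\hat{p}$, particularly since it introduces the possibility that the confidence interval will stretch beyond the allowable interval (from 0 to 1) for a probability or proportion. Fortunately, there is a more complex method for calculating the interval that avoids this concern. Return for a moment to the approximate sampling distribution for $\hat{p}$ given by the Normal distribution with expectation p and variance p(1−p)/n. Without trying to estimate the variance, this approximation tells us that

$$P\left(-1.96 < \frac{\hat{p}-p}{\sqrt{p(1-p)/n}} < 1.96\right) \approx 0.95.$$

Now

$$P\left(-1.96 < \frac{\hat{p}-p}{\sqrt{p(1-p)/n}} < 1.96\right) = P\left((\hat{p}-p)^2 < 1.96^2\,\frac{p(1-p)}{n}\right).$$

The second inequality of this statement is quadratic in p and can be solved to yield

$$P\left(\frac{\hat{p} + \frac{1.96^2}{2n} - 1.96\sqrt{\frac{\hat{p}(1-\hat{p})}{n} + \frac{1.96^2}{4n^2}}}{1 + \frac{1.96^2}{n}} < p < \frac{\hat{p} + \frac{1.96^2}{2n} + 1.96\sqrt{\frac{\hat{p}(1-\hat{p})}{n} + \frac{1.96^2}{4n^2}}}{1 + \frac{1.96^2}{n}}\right) \approx 0.95.$$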

This provides a more accurate confidence interval for p that can never give values outside of the range [0, 1]. For the example used above to illustrate the simple method, with $\hat{p}=0.35$ and n=100, we have 95% confidence limits given by (0.264, 0.447) or, more appropriately, (0.26, 0.45) to two decimal places. In this calculation, since n is reasonably large and $\hat{p}$ is well away from zero, the complex method gives almost exactly the same result as the simpler technique.
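To make the comparison concrete, the sketch below solves the quadratic inequality $(\hat{p}-p)^2 < 1.96^2\,p(1-p)/n$ in closed form and sets the result beside the simple interval $\hat{p} \pm 1.96\sqrt{\hat{p}(1-\hat{p})/n}$. The running example ($\hat{p}=0.35$, n=100) is taken from the text; the second case, 1 event in 50 trials, is hypothetical and is included only to show behavior near zero. The function names are ours.

```python
from math import sqrt


def simple_interval(p_hat, n, z=1.96):
    """Simple interval: p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)."""
    half_width = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width


def quadratic_interval(p_hat, n, z=1.96):
    """Roots in p of (p_hat - p)**2 = z**2 * p * (1 - p) / n."""
    centre = p_hat + z**2 / (2 * n)
    spread = z * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2))
    scale = 1 + z**2 / n
    return (centre - spread) / scale, (centre + spread) / scale


# Running example, plus a hypothetical small-proportion case (1 event in 50).
for p_hat, n in [(0.35, 100), (0.02, 50)]:
    s_lo, s_hi = simple_interval(p_hat, n)
    q_lo, q_hi = quadratic_interval(p_hat, n)
    print(f"p_hat={p_hat}, n={n}: simple ({s_lo:.3f}, {s_hi:.3f}), "
          f"quadratic ({q_lo:.3f}, {q_hi:.3f})")
```

In the hypothetical second case the simple interval has a negative lower limit, whereas the quadratic solution stays inside [0, 1].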

Both methods have the advantage that they can be computed without specialized software. Many computing packages now offer the possibility of an exact confidence interval for p, that is, one that uses the exact sampling distribution of $\hat{p}$ based on the binomial distribution. STATA® calculates for our example an exact 95% confidence interval corresponding to the estimate $\hat{p}=0.35$ as (0.257, 0.452), or (0.26, 0.45) to two decimal places. The exact interval is always more effective than the simple method when p is close to 0 or 1, or when the sample size, n, is small, and precludes the need for a continuity correction that improves the Normal approximation to the binomial (see Moore and McCabe, 2002, Chapter 5.1).
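For readers without access to a package that offers exact intervals, the calculation can be sketched with the Python standard library alone. The sketch assumes that “exact” refers to the Clopper-Pearson construction, in which each limit is the value of p that places probability (1 − level)/2 in the corresponding binomial tail, and that the example corresponds to 35 events in 100 trials. The helper names are ours.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))


def bisect(f, lo, hi, tol=1e-10):
    """Root of a monotone function f that changes sign on [lo, hi]."""
    f_lo_positive = f(lo) > 0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if (f(mid) > 0) == f_lo_positive:
            lo = mid  # the sign change lies to the right of mid
        else:
            hi = mid  # the sign change lies to the left of mid
    return (lo + hi) / 2


def exact_interval(x, n, level=0.95):
    """Clopper-Pearson limits: each tail probability equals (1 - level) / 2."""
    alpha = 1 - level
    # Lower limit solves P(X >= x | p) = alpha / 2 (taken as 0 when x == 0).
    lower = 0.0 if x == 0 else bisect(
        lambda p: (1 - binom_cdf(x - 1, n, p)) - alpha / 2, 0.0, 1.0)
    # Upper limit solves P(X <= x | p) = alpha / 2 (taken as 1 when x == n).
    upper = 1.0 if x == n else bisect(
        lambda p: binom_cdf(x, n, p) - alpha / 2, 0.0, 1.0)
    return lower, upper


lower, upper = exact_interval(35, 100)
print(f"exact 95% interval: ({lower:.3f}, {upper:.3f})")
```

For 35 events in 100 trials this gives limits close to the (0.257, 0.452) quoted above.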

The Role of Probability in Observational Studies

3.4 Conditional probabilities

In Section 3.1, we discussed the probability that a randomly selected infant dies within 1 year of birth for the population summarized in Table 3.1. But what if we want to know this probability for important subgroups of the population? For example, we may be interested in the probability of death within a year of birth for a birth associated with an unmarried mother or for a normal birthweight infant. In either case we are looking for the conditional probability of the outcome. The conditional probability of event A given that event B occurs, denoted P(A|B), is the “long run” fraction of times that event A occurs, the fraction being restricted to only those events for which B occurs. For example, if A represents the event that a randomly chosen infant dies within a year from birth, and B is the event that a randomly chosen birth is associated with an unmarried mother, then P(A|B) is the probability that a randomly chosen infant dies within a year of birth, given that this infant has an unmarried mother.

To further understand a conditional probability, let us look at the heuristic definition of probability given earlier in this chapter. To compute the conditional probability P(A|B), we have to calculate the long run fraction of times the event A occurs among events where B occurs. In a series of K independent equivalent experiments, let $K_B$ denote the number of those experiments where event B occurs, and let $K_{A\&B}$ denote the number where both events A and B occur. Then the fraction of times that event A occurs amongst those experiments where event B occurs is just $K_{A\&B}/K_B$. Thus, the conditional probability P(A|B) is the “long run” value of $K_{A\&B}/K_B$. But, by dividing both the numerator and denominator of this expression by K, we see that this conditional probability is given by the “long run” value of $K_{A\&B}/K$ divided by the “long run” value of $K_B/K$. More simply,

$$P(A|B) = \frac{P(A\&B)}{P(B)}. \tag{3.3}$$

An immediate consequence of this expression is a formulation for the probability of the composite event, A and B:

$$P(A\&B) = P(A|B)\,P(B). \tag{3.4}$$

By reversing the roles of the events A and B, it follows that P(B|A)=P(A&B)/P(A) so that

$$P(A\&B) = P(A|B)\,P(B) = P(B|A)\,P(A).$$
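The long-run argument can be checked by simulation. The sketch below runs K independent experiments and compares the ratio of counts, $K_{A\&B}/K_B$, with a known conditional probability. All three probabilities in it are hypothetical values chosen only for illustration, and the variable names are ours.

```python
import random

random.seed(0)  # fixed seed so the run is reproducible

# Hypothetical probabilities, chosen only for illustration.
P_B = 0.3              # P(B)
P_A_GIVEN_B = 0.5      # P(A | B)
P_A_GIVEN_NOT_B = 0.1  # P(A | not B)

K = 200_000
k_b = 0   # experiments where B occurs
k_ab = 0  # experiments where both A and B occur
for _ in range(K):
    b = random.random() < P_B
    a = random.random() < (P_A_GIVEN_B if b else P_A_GIVEN_NOT_B)
    k_b += b
    k_ab += a and b

ratio_of_counts = k_ab / k_b                 # K_{A&B} / K_B
ratio_of_fractions = (k_ab / K) / (k_b / K)  # estimates P(A&B) / P(B)
print(f"K_AB/K_B = {ratio_of_counts:.4f}, target P(A|B) = {P_A_GIVEN_B}")
```

Dividing both counts by K leaves the ratio unchanged, which is exactly the step that turns the long-run fraction into P(A&B)/P(B).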

For conditional probabilities, there is an analog to the statement in Section 3.1 that when we select a random member of a population, the probability that the selected individual has a specific characteristic is just the population proportion of individuals with this attribute. That is, for a randomly selected individual, the conditional probability that an individual has characteristic A, given that they possess characteristic B, namely, P(A|B), is the population proportion of individuals with characteristic A amongst the subpopulation who have characteristic B. Thus, referring to Table 3.1, if A denotes death in the first year and B denotes having an unmarried mother, then P(A|B), the conditional probability of death within a year of birth, given the infant has an unmarried mother, is 16,712/(1,197,142+16,712)=16,712/1,213,854=0.014, or 14 per 1,000 births. This conditional probability can also be calculated using Equation 3.3: since P(A&B)=16,712/4,111,059=0.0041, and P(B)=1,213,854/4,111,059=0.295, we derive P(A|B)=0.0041/0.295=0.014.

It is important to note that the conditional probability of A|B is quite different from that of B|A. If A is infant death within the first year and now B is a normal birthweight infant, then the conditional probability P(A|B) is the probability of an infant death, given that the child has normal birthweight. From Table 3.1, this conditional probability is given by 14,442/3,818,736=0.0038, or 3.8 deaths per 1,000 births. On the other hand, the conditional probability P(B|A) is the probability that an infant had normal birthweight, given that the infant died within 1 year from birth. Again, the data in Table 3.1 show that this conditional probability is 14,442/35,496=0.41. These two conditional probabilities are quite different.
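These calculations can be reproduced directly from the counts quoted from Table 3.1. In the sketch below the total of 4,111,059 births is an assumption, taken as the denominator consistent with the probabilities quoted in the text; the variable names are ours.

```python
# Counts quoted in the text from Table 3.1.
total_births = 4_111_059  # assumed total, consistent with the quoted probabilities
deaths = 35_496
unmarried_deaths = 16_712
unmarried_survivors = 1_197_142
normal_bw_births = 3_818_736
normal_bw_deaths = 14_442

unmarried_births = unmarried_deaths + unmarried_survivors  # 1,213,854

# Route 1: proportion of deaths within the subpopulation of unmarried mothers.
p_death_given_unmarried = unmarried_deaths / unmarried_births

# Route 2: Equation 3.3, P(A|B) = P(A&B) / P(B).
p_joint = unmarried_deaths / total_births
p_unmarried = unmarried_births / total_births
via_equation = p_joint / p_unmarried

# P(A|B) versus P(B|A) when B is normal birthweight.
p_death_given_normal = normal_bw_deaths / normal_bw_births
p_normal_given_death = normal_bw_deaths / deaths

print(f"P(death | unmarried)      = {p_death_given_unmarried:.4f}")
print(f"P(death | normal weight)  = {p_death_given_normal:.4f}")
print(f"P(normal weight | death)  = {p_normal_given_death:.2f}")
```

The two routes to P(death | unmarried) agree because the total number of births cancels in Equation 3.3, so the assumed total affects only the intermediate values.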

3.4.1 Independence of two events

A natural consequence of looking at the conditional probability of an infant death within a year of birth, given that the mother is unmarried, is to examine the same conditional probability for married mothers. Many questions then follow: Are these two conditional probabilities the same in this population? If not, how different are they? The two conditional probabilities P(A|B) and P(A|not B) being identical reflects that the frequency of event A is not affected by whether B occurs or not. When A and B are defined in terms of the presence of certain characteristics, this can be interpreted as the two characteristics being unrelated. Formally, an event A is said to occur independently of an event B if the conditional probability P(A|B) is equal to the (unconditional) probability of the event A, P(A). That is, event A is independent of event B if and only if

$$P(A|B) = P(A).$$

In other words, an infant’s 1-year mortality is independent of its mother’s marital status if the probability of an infant’s death in the first year of life is not influenced by the marital status of the mother.

From Equation 3.4, it follows that if the event A is independent of event B, then P(A&B)=P(A)×P(B). On the other hand, if P(A&B)=P(A)×P(B), then, from the expression for the conditional probability P(A|B), we have P(A|B)=P(A&B)/P(B)=[P(A)×P(B)]/P(B)=P(A), so that the event A is independent of event B. Also, by reversing the roles of the events A and B, we can see that the event B is independent of event A if and only if the event A is independent of event B. Hence, events A and B are independent if and only if P(A&B)=P(A)×P(B). For infant mortality and a mother’s marital status, recall that P(infant death)=0.0086 and P(unmarried mother)=0.295. If these two characteristics were independent, then P(unmarried mother and infant death)=0.0086×0.295=0.0025. In fact, Table 3.1 yields P(unmarried mother and infant death)=16,712/4,111,059=0.0041. In this population, therefore, the two characteristics are clearly not independent. Here, the two characteristics, unmarried mother and infant death, occur together much more frequently than would be predicted if they were independent.
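The comparison can be reproduced in a few lines. The counts are those quoted from Table 3.1 in the text; the total of 4,111,059 births is an assumption, taken as the denominator consistent with the quoted probabilities.

```python
total_births = 4_111_059  # assumed total, consistent with the quoted probabilities

p_death = 35_496 / total_births         # P(infant death)
p_unmarried = 1_213_854 / total_births  # P(unmarried mother)
p_both_observed = 16_712 / total_births # P(unmarried mother and infant death)

# Joint probability that independence would predict.
p_both_if_independent = p_death * p_unmarried

# How many times more often the two characteristics occur together.
excess = p_both_observed / p_both_if_independent

print(f"predicted under independence: {p_both_if_independent:.4f}")
print(f"observed:                     {p_both_observed:.4f}")
print(f"observed / predicted:         {excess:.2f}")
```

The ratio of observed to predicted joint probability equals P(A|B)/P(A), so a value well above 1 is another way of stating that the conditional probability of death is higher for births to unmarried mothers than for births overall.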

Finally, we note some other useful identities. First, if the events A and B are independent, P(A or B)=P(A)+P(B)−(P(A)×P(B)). Second, for any two events A and B, P(A)=P(A|B)P(B)+P(A|not B)P(not B).

Third, again for any two events A and B, P(A|B)=P(B|A)P(A)/P(B); this relationship is known as Bayes’ formula and allows us to link the two distinct conditional probabilities P(A|B) and P(B|A). These identities follow directly from the definitions of probability and conditional probability.
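Bayes’ formula can be verified exactly on the counts quoted from Table 3.1, here with A denoting infant death and B an unmarried mother. The sketch also checks the law of total probability, P(A)=P(A|B)P(B)+P(A|not B)P(not B), a standard identity. Two assumptions are made: the total of 4,111,059 births is the denominator consistent with the quoted probabilities, and “not B” is taken to be all remaining births.

```python
from fractions import Fraction

total_births = 4_111_059  # assumed total, consistent with the quoted probabilities
deaths = 35_496
unmarried = 1_213_854
unmarried_deaths = 16_712

# Exact rational arithmetic avoids any rounding in the comparison.
p_a = Fraction(deaths, total_births)            # P(A)
p_b = Fraction(unmarried, total_births)         # P(B)
p_a_given_b = Fraction(unmarried_deaths, unmarried)
p_b_given_a = Fraction(unmarried_deaths, deaths)
p_a_given_not_b = Fraction(deaths - unmarried_deaths, total_births - unmarried)

# Bayes' formula: P(A|B) = P(B|A) P(A) / P(B).
bayes = p_b_given_a * p_a / p_b

# Law of total probability: P(A) = P(A|B) P(B) + P(A|not B) P(not B).
total_probability = p_a_given_b * p_b + p_a_given_not_b * (1 - p_b)

print(f"P(A|B) directly:  {float(p_a_given_b):.4f}")
print(f"P(A|B) via Bayes: {float(bayes):.4f}")
print(f"P(B|A):           {float(p_b_given_a):.2f}")
```

Both identities hold exactly here because every probability is a ratio of counts from the same table.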

3.5 Example of conditional probabilities—Berkson’s bias

In the early part of the twentieth century it was believed that a possible cause or promoter of diabetes was a condition known as cholecystitis, or inflammation of the gall bladder. At one point, some physicians were removing the gall bladder as a treatment for diabetes. Berkson (1946) considered whether hospital or clinic records could be used to investigate the association between the presence of cholecystitis and diabetes. In particular, it seemed useful to look at the occurrence of cholecystitis amongst diabetic patients as compared to those without diabetes. Recognizing that cholecystitis might be associated with other diseases that lead to clinic use, he suggested that a condition known to be unrelated to cholecystitis be the reference group, and so he used individuals who came to the clinic with various refractive errors in an eye.

Table 3.2 Cholecystitis and diabetes in hypothetical hospital records using refractive error as reference group

| Cholecystitis | Diabetes | Refractive Error | Total |
|---|---|---|---|
| C | 626 | 9,504 | 10,130 |
| not C | 6,693 | 192,060 | 198,753 |
| Total | 7,319 | 201,564 | 208,883 |

Source: Berkson (1946).

Table 3.3 Cholecystitis and diabetes in hypothetical population using refractive error as reference group

| Cholecystitis | Diabetes | Refractive Error | Total |
|---|---|---|---|
| C | 3,000 | 29,700 | 32,700 |
| not C | 97,000 | 960,300 | 1,057,300 |
| Total | 100,000 | 990,000 | 1,090,000 |

Source: Berkson (1946).

Berkson constructed some hypothetical data to illustrate possible findings. Table 3.2 shows information on the prevalence of cholecystitis amongst diabetic patients and in individuals with refractive error. For convenience we have labeled the three conditions, cholecystitis, diabetes, and refractive error, as C, D, and RE, respectively. If our population of interest is individuals with hospital records, then P(C|RE)=9,504/201,564=0.0472, and P(C|D)=626/7,319=0.0855. If we assume that P(C|RE) is a reasonable proxy for the population value of P(C), these two probabilities suggest that diabetes and cholecystitis are not independent, with cholecystitis occurring much more frequently in individuals with diabetes than in nondiabetic patients.

However, Table 3.3 expands Berkson’s hypothetical data to the entire population. Note that the data of Table 3.2 are contained in Table 3.3. In the general population, we now see that P(C|D)=3,000/100,000=0.0300, and P(C|RE)=29,700/990,000=0.0300, also. So, here we see no evidence of association between cholecystitis and diabetes. This apparent contradiction has arisen because of variation in clinic/hospitalization use depending on the various factors affecting an individual.
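The contrast between the two tables can be computed side by side. The sketch below uses the counts from Tables 3.2 and 3.3; the dictionary layout and function name are ours.

```python
# Counts from Table 3.2 (hospital records) and Table 3.3 (general population).
hospital = {"C": {"D": 626, "RE": 9_504},
            "not C": {"D": 6_693, "RE": 192_060}}
population = {"C": {"D": 3_000, "RE": 29_700},
              "not C": {"D": 97_000, "RE": 960_300}}


def p_c_given(table, group):
    """P(C | group): share of the group's column that has cholecystitis."""
    with_c = table["C"][group]
    without_c = table["not C"][group]
    return with_c / (with_c + without_c)


for label, table in [("hospital", hospital), ("population", population)]:
    p_d = p_c_given(table, "D")
    p_re = p_c_given(table, "RE")
    print(f"{label}: P(C|D)={p_d:.4f}, P(C|RE)={p_re:.4f}, "
          f"ratio={p_d / p_re:.2f}")
```

In the hospital records the ratio of the two conditional probabilities is well above 1, while in the population it is 1, which is the apparent contradiction described above.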

Table 3.4 Hypothetical data on cholecystitis, diabetes, and refractive error in both a hospitalized and general population

| | C and D | C and RE | Not C and D | Not C and RE |
|---|---|---|---|---|
| H | 626 | 9,504 | 6,693 | 192,060 |
| Not H | 2,374 | 20,196 | 90,307 | 768,240 |
| Total | 3,000 | 29,700 | 97,000 | 960,300 |

Source: Berkson (1946). Note: H refers to the existence of hospitalization records; C, D, and RE to the presence of cholecystitis, diabetes, and refractive error, respectively.

For each of four combinations of conditions (cholecystitis or not by diabetes and refractive error), Table 3.4 restates the information in Tables 3.2 and 3.3 in terms of the frequency of hospitalization or clinic use, labeled by H. Note that the hospitalization probabilities are quite different: P(H|C&D)=626/3,000=0.21, P(H|C&RE)=9,504/29,700=0.32, P(H|not C&D)=6,693/97,000=0.07, and P(H|not C&RE)=192,060/960,300=0.20.
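One way to see how these unequal hospitalization probabilities produce the distortion is to start from the population counts and thin each group by its own P(H|·). The sketch below does this for the diabetes column using the counts of Table 3.4; the variable names are ours.

```python
columns = ["C and D", "C and RE", "not C and D", "not C and RE"]
hospitalized = [626, 9_504, 6_693, 192_060]
not_hospitalized = [2_374, 20_196, 90_307, 768_240]

# Population total and hospitalization probability for each combination.
population = {c: h + nh
              for c, h, nh in zip(columns, hospitalized, not_hospitalized)}
p_h = {c: h / population[c] for c, h in zip(columns, hospitalized)}

# In the population, P(C|D) uses the raw counts.
p_c_given_d = population["C and D"] / (
    population["C and D"] + population["not C and D"])

# Among hospital records, each count is thinned by its own P(H | .).
selected_c_d = population["C and D"] * p_h["C and D"]
selected_notc_d = population["not C and D"] * p_h["not C and D"]
p_c_given_d_hospital = selected_c_d / (selected_c_d + selected_notc_d)

for c in columns:
    print(f"P(H | {c}) = {p_h[c]:.2f}")
print(f"P(C|D) in population: {p_c_given_d:.4f}")
print(f"P(C|D) in hospital:   {p_c_given_d_hospital:.4f}")
```

Because diabetic individuals with cholecystitis are hospitalized about three times as often as those without, cholecystitis is over-represented among hospitalized diabetic patients even though its population frequency is the same in both groups.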

This example illustrates the danger of using hospital or clinic users as a population to study the association of characteristics. It is interesting to note that the varied hospitalization rates represented in Table 3.4 were generated using different hospitalization frequencies separately for each of the three conditions—cholecystitis, diabetes, and refractive error—and then combining such probabilities with the assumption that hospitalization for any condition is independent of hospitalization from another. That is, no additional likelihood of hospitalization was assumed to arise for a subject suffering from more than one condition than expected from having each condition separately. Additional distortion inevitably arises when hospitalization rates for individuals with multiple conditions are greater than predicted from looking at each condition separately. Roberts et al. (1978) give an example of Berkson’s fallacy with real data arising from household surveys of health care utilization.

3.6 Comments and further reading

The discussion in this and subsequent chapters is based on the assumption that data arise from a simple random sample of the population. There are other, often more effective sampling techniques, including stratified and cluster sampling, that are used to obtain estimates of a population proportion or probability. In more complex sampling schemes, the basic philosophy for constructing interval estimates remains the same, but expressions for both proportion estimators and their associated sampling variability must be modified to incorporate relevant sampling properties. We refer to Levy and Lemeshow (1999) for an extensive discussion of the advantages and disadvantages of various sampling schemes, with thorough descriptions of the appropriate estimation and confidence interval methods for a specific sampling design. A particular kind of stratified sampling forms the basis of matched studies (as covered in Chapter 16).

In many nonexperimental studies, participants are often not selected by any form of random sampling. Nevertheless, confidence intervals are usually calculated using the exact same techniques, with the tacit assumption that the data are being treated as if they arose from a simple random sample. This is a risky assumption to rely on consistently, since factors influencing a participant’s selection are often unknown and could be related to the variables of interest. Such studies are thus subject to substantial bias in estimating probabilities and proportions. Of special concern is when study subjects are self-selected, as in volunteer projects. In restricted populations, sometimes all available population members are selected for study, with the same confidence interval procedures used. Here, confidence intervals are used to apply findings to a conceptually larger population, not to describe characteristics of a highly specified population. In fact, when the entire population is sampled, there is no sampling error, but the data may not be of broad interest. We return to this issue again when we discuss study designs in Chapter 5.
