# Statistics for Epidemiology-Nicholas P. Jewell-1584884339-CRC-2003-352-\$94

(Part 2 of 7)

17.5.1 Time-dependent exposures

17.5.2 Differential loss to follow-up

17.6 Comments and further reading

17.7 Problems

18 Epilogue: The Examples

References

Glossary of Common Terms and Abbreviations

Index

Acknowledgments

The material in this book has grown out of a graduate course in statistical methods for epidemiology that I have taught for more than 20 years in the School of Public Health at Berkeley. I wish to express my appreciation for the extraordinary students that I have met through these classes, with whom I have had the privilege of sharing and learning simultaneously. My thanks also go to Richard Brand, who first suggested my teaching this material, and to Steve Selvin, a lifelong friend and colleague, who has contributed enormously both through countless discussions and as my local S-Plus expert. The material on causal inference depended heavily on many helpful conversations with Mark van der Laan. Several colleagues, especially Alan Hubbard, Madukhar Pai, and Myfanwy Callahan, have selflessly assisted by reading parts or all of the material, diligently pointing out many errors in style or substance. I am forever grateful to Bonnie Hutchings, who prepared the earliest versions of handouts of some of this material long before a book was ever conceived of, and who has been a constant source of support throughout. I also owe a debt of gratitude to Kate Robertus for her incisive advice on writing issues throughout the text.

Finally, my enjoyment of this project was immeasurably enhanced by the love and support of my wife, Debra, and our daughter, Britta. Their presence is hidden in every page of this work, representing the true gift of life.

CHAPTER 1 Introduction

In this book we describe the collection and analysis of data that speak to relationships between the occurrence of diseases and various descriptive characteristics of individuals in a population. Specifically, we want to understand whether and how differences in individuals might explain patterns of disease distribution across a population. For most of the material, I focus on chronic diseases, the etiologic processes of which are only partially understood compared with those of many infectious diseases. Characteristics related to an individual’s risk of disease will include (1) basic measures (such as age and sex), (2) specific risk exposures (such as smoking and alcohol consumption), and (3) behavioral descriptors (including educational or socioeconomic status, behavior indicators, and the like). Superficially, we want to shed light on the “black box” that takes “inputs”—risk factors such as exposures, behaviors, genetic descriptors—and turns them into the “output,” some aspect of disease occurrence.

1.1 Disease processes

Let us begin by briefly describing a general schematic for a disease process that provides a context for many statistical issues we will cover. Figure 1.1, an adapted version of Figure 2.1 in Kleinbaum et al. (1982), illustrates a very simplistic view of the evolution of a disease in an individual.

Note the three distinct stages of the disease process: induction, promotion, and expression. The etiologic process essentially begins with the onset of the first cause of the resulting disease; for many chronic diseases, this may occur at birth or during fetal development. The end of the promotion period is often associated with a clinical diagnosis. Since we rarely observe the exact moment when a disease “begins,” induction and promotion are often considered as a single phase. This period, from the start of the etiologic process until the appearance of clinical symptoms, is often called the latency period of the disease. Using AIDS as an example, we can define the start of the process as exposure to the infectious agent, HIV. Disease begins with the event of an individual’s infection; clinical symptoms appear around the onset and diagnosis of AIDS, with the expression of the disease being represented by progression toward the outcome, often death. In this case, the induction period is thought to be extremely short in time and is essentially undetectable; promotion and expression can both take a considerable length of time. Epidemiological study of this disease process focuses on the following questions:

• Which factors are associated with the induction, promotion, and expression of a disease? These risk factors are also known as explanatory variables, predictors, covariates, independent variables, and exposure variables. We will use such terms interchangeably as the context of our discussion changes.

• In addition, are certain factors (not necessarily the same ones) associated with the duration of the induction, promotion, and expression periods?

Figure 1.1 Schematic of disease evolution.

For example, exposure to the tubercle bacillus is known to be necessary (but not sufficient) for the induction of tuberculosis. Less is known about factors affecting promotion and expression of the disease. However, malnutrition is a risk factor associated with both these stages. As another example, consider coronary heart disease. Here, we can postulate risk factors for each of the three stages; for instance, dietary factors may be associated with induction, high blood pressure with promotion, and age and sex with expression. This example illustrates how simplistic Figure 1.1 is in that the development of coronary heart disease is a continuous process, with no obvious distinct stages. Note that factors may be associated with the outcome of a stage without affecting the duration of the stage. On the other hand, medical treatments often lengthen the duration of the expression of a chronic disease without necessarily altering the eventual outcome.

Disease intervention is, of course, an important mechanism to prevent the onset and development of diseases in populations. Note that intervention strategies may be extremely different depending on whether they are targeted to prevent induction, promotion, or expression. Most public health interventions focus on induction and promotion, whereas clinical treatment is designed to alter the expression or final stage of a disease.

1.2 Statistical approaches to epidemiological data

Rarely is individual information on disease status and possible risk factors available for an entire population. We must be content with only having such data for some fraction of our population of interest, and with using statistical tools both to elucidate the selection of individuals to study in detail (sampling) and to analyze data collected through a particular study. Issues of study design and analysis are crucial for two reasons. First, we wish to use sample data as effectively as possible to make applicable statements about the larger population from which the sample is drawn. Second, since accurate data collection is often expensive and time-consuming, we want to ensure that we make the best use of available resources. Analysis of sample data from epidemiological studies presents many statistical challenges since the outcome of interest, disease status, is usually binary. This book is intended to extend familiar statistical approaches for continuous outcome data (for example, population mean comparisons and regression) to the binary outcome context.


1.2.1 Study design

A wide variety of techniques can be used to generate data on the relationship between explanatory factors and a putative outcome variable. I mention briefly only three broad classes of study designs used to investigate these questions, namely, (1) experimental studies, (2) quasi-experimental studies, and (3) observational studies. The crucial feature of an experimental study is the investigator’s ability to manipulate the factor of interest while maintaining control of other extraneous factors. Even if the latter is not possible, control of the primary risk factor allows its randomization across individual units of observation, thereby limiting the impact of uncontrolled influences on the outcome. Randomized clinical trials are a type of experimental study in which the main factor of interest, treatment type, is under the control of the investigator and is randomly assigned to patients suffering from a specific disease; other influencing factors, such as disease severity, age, and sex of the patient, are not directly controlled.
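The role of randomization described above can be illustrated with a small simulation. The following is only a sketch under stated assumptions: the severity scores are hypothetical values drawn uniformly between 0 and 1, the function name `randomize` is invented for illustration, and assignment is by a fair coin. It shows that an uncontrolled factor such as disease severity tends to be balanced across randomly formed treatment arms, even though the investigator never controls it directly.

```python
import random
from statistics import mean


def randomize(severities, seed=0):
    """Assign each patient to 'treatment' or 'control' by a fair coin flip.

    `severities` holds one (hypothetical) disease-severity score per patient.
    Returns a dict mapping arm name to the severity scores in that arm.
    """
    rng = random.Random(seed)
    arms = {"treatment": [], "control": []}
    for score in severities:
        arm = "treatment" if rng.random() < 0.5 else "control"
        arms[arm].append(score)
    return arms


# Hypothetical severity scores on a 0-1 scale (an assumption for illustration).
data_rng = random.Random(42)
severities = [data_rng.random() for _ in range(10_000)]

arms = randomize(severities, seed=1)

# Severity was never controlled, yet its average is similar in both arms.
imbalance = abs(mean(arms["treatment"]) - mean(arms["control"]))
print(f"patients per arm: {len(arms['treatment'])} / {len(arms['control'])}")
print(f"difference in mean severity between arms: {imbalance:.4f}")
```

With a large sample the difference in mean severity is close to zero. With a small sample, chance imbalances can be substantial, which is one reason trial reports usually tabulate baseline characteristics by arm.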

Quasi-experimental studies share some features of an experimental study but differ on the key point of randomization. Although groups may appear to differ only in their level of the risk factor of interest, these groups are not formed by random assignment of this factor. For example, comparison of accident fatality rates in states before and after the enactment of seat-belt laws provides a quasi-experimental look at related safety effects. However, the interpretation of the data is compromised to some extent by other changes that may have occurred in similar time periods (did drivers increase their highway speeds once seat belts were required?). A more subtle example involved an Austrian study of the efficacy of the PSA (prostate-specific antigen) test in reducing mortality from prostate cancer; investigators determined that, within 5 years, the death rate from prostate cancer declined 42% below expected levels in the Austrian state of Tirol, the only state in the country that offered free PSA screening. Again, comparisons with other areas in the country are compromised by the possibility that the states differ in health-related respects other than the one of interest. Many ecologic studies share similar vulnerabilities. The absence of randomization, together with the inability to control the exposure of interest and related factors, makes this kind of study less desirable for establishing a causal relationship between a risk factor and an outcome.

Finally, observational studies are fundamentally based on sampling populations with subsequent measurement of the various factors of interest. In these cases, there is not even the advantage of a naturally occurring experiment that changed risk factors in a convenient manner. Later in the book we will focus on several examples including studies of the risk of coronary heart disease where primary risk factors, including smoking, cholesterol levels, blood pressure, and pregnancy history, are neither under the control of the investigator nor usually subject to any form of quasi-experiment. Another example considers the role of coffee consumption on the incidence of pancreatic cancer, again a situation where study participants self-select their exposure categories.

In this book, we focus on the design and analysis of observational epidemiological studies. This is because, at least in human populations, it is simply not ethical to randomly assign risk factors to individuals. Although many of the analytic techniques are immediately applicable and useful in randomized studies, we spend a considerable amount of effort dealing with additional complications that arise because of the absence of randomization.

1.2.2 Binary outcome data

In studying the relationship between two variables, it is most effective to have refined measures of both the explanatory and the outcome variables. Happily, substantial progress is now being made on more refined assessment of the “quantity” of disease present for many major diseases, allowing a sophisticated statistical examination of the role of an exposure in producing given levels of disease. On the other hand, with many diseases, we are still unable to accurately quantify the amount of disease beyond its presence or absence. That is, we are limited to a simple binary indicator of whether an individual is diseased or not.
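To make the binary indicator concrete, the sketch below codes disease status as 1 (diseased) or 0 (not diseased). The records and the helper name `proportion_diseased` are hypothetical, chosen only for illustration. With this coding, the familiar sample mean of the outcome is simply the proportion diseased, which hints at how mean comparisons for continuous outcomes carry over to the binary setting.

```python
from statistics import mean

# Hypothetical records: (exposed, diseased), each coded 1 = yes, 0 = no.
records = [
    (1, 1), (1, 0), (1, 1), (1, 0),   # exposed individuals
    (0, 0), (0, 1), (0, 0), (0, 0),   # unexposed individuals
]


def proportion_diseased(records, exposed):
    """Mean of the 0/1 disease indicator among those with the given exposure."""
    outcomes = [diseased for e, diseased in records if e == exposed]
    return mean(outcomes)


p_exposed = proportion_diseased(records, exposed=1)      # 2 of 4 diseased
p_unexposed = proportion_diseased(records, exposed=0)    # 1 of 4 diseased

print(f"proportion diseased, exposed:   {p_exposed}")
print(f"proportion diseased, unexposed: {p_unexposed}")
```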

Similarly, in mortality studies, while death is a measurable event, the level and quality of health of surviving individuals are notoriously elusive, thus limiting an investigator to use of the binary outcome, alive or not. For this reason, we focus on statistical techniques designed for a binary outcome variable. On the other hand, we allow the possibility that risk factors come in all possible forms, varying from binary (e.g., sex), to unordered discrete (e.g., ethnicity), to ordered discrete (e.g., coffee consumption in cups per day), to continuous (e.g., infant birthweight). However, we assume that risk factors or exposures have a fixed value and therefore do not vary over time (although composite values of time-varying measurements, such as cumulative number of cigarette pack-years smoked, are acceptable). Methods to accommodate exposures that change over time, in the context of longitudinal data, provide attractive extensions to the ideas of this book and, in particular, permit a more effective examination of the causal effects of a risk factor. We briefly touch on this again in Chapter 17, and also refer to Jewell and Hubbard (to appear) for an extensive discussion of this topic.
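As an illustration of a composite value of a time-varying measurement, the sketch below collapses a smoking history into a single fixed exposure, cumulative pack-years, using the usual convention that one pack-year is one pack per day smoked for one year. The function name `pack_years` and the example history are hypothetical.

```python
def pack_years(history):
    """Cumulative pack-years from a list of (packs_per_day, years) periods."""
    return sum(packs_per_day * years for packs_per_day, years in history)


# Hypothetical history: 1 pack/day for 10 years, then half a pack/day for 4 years.
smoking_history = [(1.0, 10), (0.5, 4)]

total = pack_years(smoking_history)  # 1.0 * 10 + 0.5 * 4
print(f"cumulative exposure: {total} pack-years")
```

The summary discards the timing of exposure: two individuals with the same pack-years may have smoked at very different ages, which is the kind of information that methods for time-varying exposures are designed to retain.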

Statistical methodology for binary outcome data is applicable to a wide variety of other kinds of data. Some examples from economics, demography, and other social sciences and public health fields are listed in Table 1.1. In these examples, the nature of a risk factor may also be quite different from traditional disease risk factors.

Table 1.1 Examples of binary outcomes and associated risk factors

| Binary Outcome | Possible Risk Factors |
| --- | --- |
| Use/no use of mental health services in calendar year 2003 | Cost of mental health visit, sex |
| Moved/did not move in calendar year 2003 | Family size, family income |
| Low/normal birthweight of newborn | Health insurance status of mother |
| Vote Democrat/Republican in 2004 election | Parental past voting pattern |
| Correct/incorrect diagnosis of patient | Place and type of medical training |
| Covered/not covered by health insurance | Place of birth, marital status |

1.3 Causality

As noted in Section 1.2.1, observational studies preclude, by definition, the randomization of key factors that influence the outcome of interest. This may severely limit our ability to attribute a causal pathway between a risk factor and an outcome variable. In fact, selecting from among the three design strategies discussed in Section 1.2.1 hinges on their ability to support a causal interpretation of the relationship of a risk factor or intervention with a disease outcome. This said, most statistical methods are not based, a priori, on a causal frame of thinking but are designed for studying associations between factors not necessarily distinguished as input “risk factors” or outcome; for example, the association between eye color and hair color. In short, observational data alone can rarely be used to separate a causal from a noncausal explanation. Nevertheless, we are keenly interested in establishing causal relationships from observational studies; fortunately, even without randomization, there are simple assumptions, together with statistical aspects of the data, that shed light on a putative causal association. In Chapter 8, we introduce much recent work in this regard, including the use of counterfactuals and causal graphs. As noted above, longitudinal observational studies provide greater possibilities for examining causal relationships.
