
# Statistics for Epidemiology-Nicholas P. Jewell-1584884339-CRC-2003-352-$94

(Part **3** of 7)

1.4 Overview

Our goal is to introduce current statistical techniques used to collect and analyze binary outcome data (sometimes referred to as categorical data) taken from epidemiological studies. The first 11 chapters set the context of these ideas and cover simple methods for preliminary data analysis. Further chapters cover regression models that can be used for the same data. Chapter 16 discusses the special design technique known as matching and describes the particular analytic methods appropriate for matched data.

I assume that readers are familiar with basic ideas from a first course in statistics including random variables and their properties (in particular, expectation and variance), sampling, population parameters, and estimation of a population mean and proportion. Of particular importance is the concept of the sampling distribution of a sample estimator that underpins the ideas of interval estimation (confidence intervals) and hypothesis testing (including Type I and Type II errors, p-values, and power). I further anticipate that readers have previously encountered hypothesis tests to compare population means and proportions (for example, the various t-tests and, at least, the one degree of freedom χ² test). Familiarity with the binomial, normal, and χ² distributions is expected, and experience with the techniques associated with multiple linear regression, while not essential, will make Chapters 12 to 15 much easier to follow. Moore and McCabe (1998) provides an excellent source to review these topics. While mathematical proofs are eschewed throughout, some algebra is used where it can bolster insight and intuition. Fear not, however. Readers are not assumed to have knowledge of techniques that use calculus. The overall goal here is to give some basic driving lessons, not to get under the hood and tinker with the mechanics of the internal combustion engine!

Regression models, found in the second half of the book, can be used to incorporate the simpler analyses from earlier chapters. Some readers may be tempted to jump directly to these methods, arguing that the earlier stratification methods are really only of historical interest. My view is that basic tabulations with related analyses are important not only to develop a general basis for understanding more complex regression models, but also for gaining a sense of what a particular data set is “saying” before launching a full-scale regression analysis.

I want to say a few words here about developing a personal philosophy about data analysis.

Although statistical methodology has an inevitable feel of mathematics, it is more than simply the application of a set of mathematical rules and recipes. In fact, having the perfect recipe is a wonderful advantage, but it does not guarantee a perfect meal. It is crucial that each data analyst construct his own “artistic” principles that can be applied when unraveling the meaning of data. Asking the right questions and having a deep understanding of the context in which a study is designed and implemented are, of course, terrific help as you begin a data analysis. But some general feel for numbers and how to manipulate and illustrate them will also bear considerable fruit. For instance, a sense for the appropriate level of precision in numerical quantities is a valuable tool. As a rough rule of thumb, I do not pay attention to discrepancies between two quantities that are less than 10% of their size. This is not useful in some contexts (knowing a telephone number to within 10% does not get you too far), but in epidemiological studies, this level of difference is often much less than the size of random, let alone other systematic, error. Focusing on such comparisons in the presence of substantial imprecision is putting the priority in the wrong place; it is my statistical version of “don’t sweat the small stuff!” Each reader needs a personal style in deciding how best to approach and report a data analysis project. Many other statistical rules of thumb can be found in van Belle (2002), which includes an entire section on epidemiology.

As a brief add-on to the last paragraph, results of data analyses in this book are frequently given as numerical quantities to the second or third decimal place. This is to allow the reader to reconstruct numerical computations and is not meant to reflect how these quantities should be reported in publications or other forms of dissemination.

This book uses a case study approach to illustrate the statistical ideas in ever-expanding generality. Three primary examples are used: the Western Collaborative Group Study of risk factors for coronary heart disease in men (Rosenman et al., 1975), a case-control study of coffee drinking and pancreatic cancer (MacMahon et al., 1981), and a matched pair case-control study of pregnancy and spontaneous abortion in relation to coronary heart disease in women (Winkelstein et al., 1958). There is nothing particular to these choices; most similar examples would be equally effective pedagogically. Because of the use of case studies, the book is intended to be read through chapter by chapter. While much can be gleaned by a quick glance at an isolated chapter, the material is deliberately constructed so that each chapter builds on the previous material.

Analyzing the same examples repeatedly allows us to see the impact of increasingly complex statistical models on our interpretation and understanding of a single data set. The disadvantage is that readers may not be interested in the specific topics covered in these examples and prefer to see the generality of the methods in the context of a wide range of health issues. Therefore, as you follow the ideas, I encourage you to bring an epidemiological “sketchbook” in which you can apply the ideas to studies of immediate interest, and in which you can note down questions—and perhaps some answers—that arise from your reading of the epidemiological literature. How did an investigator sample a population? How did they measure exposure? How did they deal with other relevant factors? Was matching involved? How were the risk factors coded in a regression model? What assumptions did these choices involve? How is uncertainty reported? What other issues might affect the accuracy of the results? Has causality been addressed effectively?


This is an appropriate time to emphasize that, like most skills, statistical understanding is gained by “doing” rather than “talking.” At the end of each chapter, a set of questions is posed to provide readers an opportunity to apply some of the ideas conveyed during the chapter. To return to the analogy above, these assignments give us a chance to take the car out for a spin after each lesson! They should help you differentiate which points you understand from those you might like to review or explore further. They also give illustrative examples that expand on ideas from the chapter.

1.4.1 Caution: what is not covered

The material in this book is loosely based on a one-semester class of interest to beginning graduate students in epidemiology and related fields. As such, the choice of topics is personal and limited. Inevitably, there is much more to implementing, analyzing, and interpreting observational studies than covered here. We spend little or no time on the conceptual underpinnings of epidemiological thinking or on crucial components of many field investigations, including disease registries, public databases, questionnaire design, data collection, and design techniques. Rothman and Greenland (1998) provides a superb introduction to many of these topics.

There are also many additional statistical topics that are not explored in this book. We will not spend time on the appropriate interpretation of p-values, confidence intervals, and power. The Bayesian approach to statistical inference, though particularly appealing with regard to interpretation of parameter uncertainty, is not discussed here. Nor will we delve much into such issues as the impact of measurement error and missing data, standardization, sample size planning, selection bias, repeated observations, survival analysis, spatial or genetic epidemiology, meta-analysis, or longitudinal studies. At the end of each chapter, we include a section on further reading that provides extensions to the basic ideas.

1.5 Comments and further reading

The material in this book has been influenced by three excellent books on the analysis of epidemiological data: Fleiss (1981), Breslow and Day (1980), and Rothman and Greenland (1998). Fleiss (1981) covers the material through Chapter 9, but does not include any discussion of regression models. Breslow and Day (1980) is a beautifully written account of all the methods we discuss, albeit targeted at the analysis of case-control studies. The statistical level is higher and assumes a familiarity with likelihood methods and the theory thereof. Rothman and Greenland (1998) provides an overview of epidemiological methods extending far beyond the territory visited here, but spends less time on the development and underpinnings of the statistical side of the subject.

Hosmer and Lemeshow (2000) discusses in detail the logistic regression model, but does not include simpler stratification analysis techniques. Collett (2002) is at a similarly high level, and both books approach the material for a general rather than epidemiological application. Schlesselman (1982) has considerably more material on other aspects of epidemiological studies, but a slimmer account of the analysis techniques that are the primary focus here. Kleinbaum et al. (1982) covers all the topics we consider and more, and is encyclopedic in its treatment of some of these ideas. In addition to the above books, more advanced topics can be found in Selvin (1996) and Breslow and Day (1987). Two recent books on similar topics are Woodward (1999) and Newman (2001). A plethora of statistical packages exist that analyze epidemiological data using methods described in this book. Some of these, including SAS®, SPSS®, GLIM®, and BMDP®, are well known and contain many statistical techniques not covered here; other programs, such as EGRET®, are more tailored to epidemiological data. Collett (2002) contains a brief comparison of most of these packages. A free software package, Epi Info 2000, is currently available online from the Centers for Disease Control. In this book, all data analyses were performed using STATA® or S-Plus®. All data sets used in the text and in chapter questions are available online at http://www.crcpress.com/e_products/downloads/.

I love teaching this material, and whenever I do, I take some quiet time at the beginning of the term to remind myself about what lies beneath the numbers and the formulae. Many of the studies I use as examples, and many of the epidemiological studies I have had the privilege of being a part of, investigate diseases that are truly devastating. Making a contribution, however small, to understanding these human conditions is at the core of every epidemiological investigation. As a statistician, it is easy to be immersed in the numbers churned up by data and the tantalizing implications from their interpretation. But behind every data point there is a human story, there is a family, and there is suffering. To remind myself of this, I try to tap into this human aspect of our endeavor each time I teach. One ritual I have followed in recent years is to either read or watch the Pulitzer Prize-winning play, Wit by Margaret Edson (or W;t: a Play, the punctuation being key to the subject matter). A video/DVD version of the film starring Emma Thompson is widely available. The play gives forceful insight into a cancer patient enrolled in a clinical trial, with the bonus of touching on the exquisite poetry of John Donne.

CHAPTER 2 Measures of Disease Occurrence

A prerequisite in studying the relationship between a risk factor and disease outcome is the ability to produce quantitative measures of both input and output factors. That is, we need to quantify both an individual’s exposure to a variety of factors and his level of disease.

Exposure measurement depends substantially on the nature of the exposure and its role in the disease in question. For the purposes of this book, we assume that accurate exposure or risk factor measurements are available for all study individuals. See Section 2.3 for literature on exposure assessment. On the other hand, though fine levels of an outcome variable enhance the understanding of disease and exposure associations, most disease outcomes—and all here—are represented as binary; quantifying a continuous level of disease can involve invasive methods and therefore might be impractical or unethical in human studies. As a consequence, epidemiologists are often reduced to assessing an outcome as disease present or absent. As with risk factors, we assume that accurate binary measurements are available for the disease of interest. Underlying this approach is the simplistic assumption that a disease occurs at a single point of time, so that before this time the disease is not present, and subsequently it is. Disease in exposed and unexposed subgroups of a population is usually measured over an interval of time so that disease occurrences can be observed. This allows for a variety of definitions of the amount of disease in subgroups.

In light of these introductory comments, it is important to note that error in measurement of either exposure or disease or both will compromise the statistical techniques we develop. If such errors are present, this effect must be addressed for us to retain a valid assessment of the disease-exposure relationship. Fortunately, substantial advances have been made in this area and, although beyond the scope of this book, we point out available literature when possible.

2.1 Prevalence and incidence

Disease prevalence and incidence both represent proportions of a population determined to be diseased at certain times. Before we give informal definitions, note that the time scale used in either measurement must be defined carefully before calculation of either quantity. Time could be defined as (1) the age of an individual, (2) the time from exposure to a specific risk factor, (3) calendar time, or (4) time from diagnosis. In some applications, a different kind of time scale might be preferred to chronological time; for example, in infectious disease studies, a useful “time” scale is often defined in terms of the number of discrete contacts with an infectious agent or person.

Figure 2.1 Schematic illustrating calculation of an incidence proportion and point prevalence. Six cases of disease in a population of, say, 100 individuals are represented. Lines represent the duration of disease.

The point prevalence of a disease is the proportion of a defined population at risk for the disease that is affected by it at a specified point on the time scale. Interval prevalence, or period prevalence, is the proportion of the population at risk affected at any point in an interval of time.
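The two definitions above can be made concrete with a short Python sketch. The six disease episodes below are hypothetical values chosen only for illustration; they are not the data behind Figure 2.1. The function names are likewise invented here. The sketch assumes that each episode belongs to a different individual and that the population at risk stays fixed at 100 throughout, which is a simplification.

```python
from typing import List, Tuple

# Hypothetical (onset, end) times of disease for six distinct individuals.
# These values are illustrative assumptions, not taken from Figure 2.1.
CASES: List[Tuple[float, float]] = [
    (0.0, 3.0),
    (1.0, 2.0),
    (2.5, 6.0),
    (4.0, 5.0),
    (4.5, 9.0),
    (7.0, 8.0),
]
POPULATION_SIZE = 100  # assumed fixed population at risk


def point_prevalence(cases: List[Tuple[float, float]], t: float,
                     population_size: int) -> float:
    """Proportion of the population affected at the single time point t."""
    affected = sum(1 for onset, end in cases if onset <= t <= end)
    return affected / population_size


def period_prevalence(cases: List[Tuple[float, float]], t0: float, t1: float,
                      population_size: int) -> float:
    """Proportion of the population affected at any point in [t0, t1]."""
    # An episode counts if its duration overlaps the interval at all.
    affected = sum(1 for onset, end in cases if onset <= t1 and end >= t0)
    return affected / population_size


if __name__ == "__main__":
    print(point_prevalence(CASES, 4.75, POPULATION_SIZE))        # 3 of 100 affected
    print(period_prevalence(CASES, 4.0, 8.0, POPULATION_SIZE))   # 4 of 100 affected
```

With these made-up episodes, three individuals are diseased at time 4.75, while four are diseased at some point between times 4 and 8, so the period prevalence over an interval is at least as large as the point prevalence at any time inside it.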
