Mathematical Statistics


Mathematical Statistics: A Unified Introduction

George R. Terrell

Springer

Teacher’s Preface

Why another textbook? The statistical community generally agrees that at the upper undergraduate level, or the beginning master’s level, students of statistics should begin to study the mathematical methods of the field. We assume that by then they will have studied the usual two-year college sequence, including calculus through multiple integrals and the basics of matrix algebra. Therefore, they are ready to learn the foundations of their subject, in much more depth than is usual in an applied, “cookbook,” introduction to statistical methodology.

There are a number of well-written, widely used textbooks for such a course. These seem to reflect a consensus for what needs to be taught and how it should be taught. So, why do we need yet another book for this spot in the curriculum?

I learned mathematical statistics with the help of the standard texts. Since then, I have taught this course and similar ones many times, at several different universities, using well-thought-of textbooks. But from the beginning, I felt that something was wrong. It took me several years to articulate the problem, and many more to assemble my solution into the book you have in your hand.

You see, I spend the rest of my day in statistical consulting and statistical research. I should have been preparing my mathematical statistics students to join me in this exciting work. But from seeing what the better graduating seniors and beginning graduate students usually knew, I concluded that the standard curriculum was not teaching them to be sophisticated citizens of the statistical community. These able students seemed to be well informed about a set of narrow, technical issues and at the same time embarrassingly lacking in any understanding of more fundamental matters. For example, many of them could discourse learnedly on which sources of variation were testable in complicated linear models. But they became tongue-tied when asked to explain, in English, what the presence of some interaction meant for the real-world experiment under discussion!

What went wrong? I have come to believe that the problem lies in our history. The first modern textbooks were written in the 1950s. This was at the end of the Heroic Age of statistics, roughly, the first half of the twentieth century. Two bodies of magnificent achievements mark that era. The first, identified with Student, Fisher, Neyman, Pearson, and many others, developed the philosophy and formal methodology of what we now call classical inference. The analysis of scientific experiments became so straightforward that these techniques swept the world of applications. Many of our clients today seem to believe that these methods are statistics.

The second, associated with Liapunov, Kolmogorov, and many others, was the formal mathematicization of probability and statistics. These researchers proved precise central limit theorems, strong laws of large numbers, and laws of the iterated logarithm (let me call these advanced asymptotics). They axiomatized probability theory and placed distribution theory on a rigorous foundation, using Lebesgue integration and measure theory.

By the 1950s, statisticians were dazzled by these achievements, and to some extent we still are. The standard textbooks of mathematical statistics show it. Unfortunately, this causes problems for teachers. Measure theory and advanced asymptotics are still well beyond the sophistication of most undergraduates, so we cannot really teach them at this level. Furthermore, too much classical inference leads us to neglect the preceding two centuries of powerful but less formal methods, not to mention the broad advances of the last 50 years: Bayesian inference, conditional inference, likelihood-based inference, and so forth.

So the standard textbooks start with long, dry introductions to abstract probability and distribution theory, almost devoid of statistical motivations and examples (poker problems?!). Then there is a frantic rush, again largely unmotivated, to introduce exactly those distributions that will be needed for classical inference. Finally, two-thirds of the way through, the first real statistical applications appear (means tests, one-way ANOVA, etc.), but rigidly confined within the classical inferential framework. (An early reader of the manuscript called this “the cult of the t-test.”) Finally, in perhaps Chapter 14, the books get to linear regression. Now, regression is 200 years old, easy, intuitive, and incredibly useful. Unfortunately, it has been made very difficult: “conditioning of multivariate Gaussian distributions” as one cultist put it. Fortunately, it appears so late in the term that it gets omitted anyway.

We distort the details of teaching, too, by our obsession with graduate-level rigor. Large-sample theory is at the heart of statistical thinking, but we are afraid to touch it. “Asymptotics consists of corollaries to the central limit theorem,” as another cultist puts it. We seem to have forgotten that 200 years of what I shall call elementary asymptotics preceded Liapunov’s work. Furthermore, the fear of saying anything that will have to be modified later (in graduate classes that assume measure theory) forces undergraduate mathematical statistics texts to include very little real mathematics.

As a result, most of these standard texts are hardly different from the cookbooks, with a few integrals tossed in for flavor, like jalapeno bits in cornbread. Others are spiced with definitions and theorems hedged about with very technical conditions, which are never motivated, explained, or applied (remember “regularity conditions”?). Mathematical proofs, surely a basic tool for understanding, are confined to a scattering of places, chosen apparently because the arguments are easy and “elegant.” Elsewhere, the demoralizing refrain becomes “the proof is beyond the scope of this course.”

How is this book different? In short, this book is intended to teach students to do mathematical statistics, not just to appreciate it. Therefore, I have redesigned the course from first principles. If you are familiar with a standard textbook on the subject and you open this one at random, you are very likely to find either a surprising topic or an unexpected treatment or placement of a standard topic. But everything is here for a reason, and its order of appearance has been carefully chosen.

First, as the subtitle implies, the treatment is unified. You will find here no artificial separation of probability from statistics, distribution theory from inference, or estimation from hypothesis testing. I treat probability as a mathematical handmaiden of statistics. It is developed, carefully, as it is needed. A statistical motivation for each aspect of probability theory is therefore provided.

Second, I have updated the range of subjects covered. You will encounter introductions to such important modern topics as loglinear models for contingency tables and logistic regression models (very early in the book!), finite population sampling, branching processes, and small-sample asymptotics.

More important are the matters I emphasize systematically. Asymptotics is a major theme of this book. Many large-sample results are not difficult and quite appropriate to an undergraduate course. For example, I had always taught that with “large n, small p” one may use the Poisson approximation to binomial probabilities. Then I would be embarrassed when a student asked me exactly when this worked. So we derive here a simple, useful error bound that answers this question. Naturally, a full modern central limit theorem is mathematically above the level of this course. But a great number of useful yet more elementary normal limit results exist, and many are derived here.
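The question of exactly when “large n, small p” is good enough can be explored numerically. The sketch below does not reproduce the bound derived in the book, which this preface does not state. As an assumption, it uses Le Cam's inequality, under which the total variation distance between Binomial(n, p) and Poisson(np) is at most np². The function names are hypothetical, chosen only for illustration.

```python
import math

def binomial_pmf(n, p, k):
    # P(X = k) for X ~ Binomial(n, p)
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(lam, k):
    # P(Y = k) for Y ~ Poisson(lam), computed in log space to avoid a huge k!
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))

def total_variation(n, p):
    # Half the sum of |binomial - Poisson| over all counts k.
    # The binomial puts no mass above n, so the Poisson tail beyond n
    # contributes in full.
    lam = n * p
    diff = sum(abs(binomial_pmf(n, p, k) - poisson_pmf(lam, k))
               for k in range(n + 1))
    tail = max(0.0, 1.0 - sum(poisson_pmf(lam, k) for k in range(n + 1)))
    return 0.5 * (diff + tail)

# Same mean (np = 5) with progressively smaller p.
# Assumed bound (Le Cam): distance <= n * p**2.
for n, p in [(20, 0.25), (100, 0.05), (500, 0.01)]:
    print(n, p, total_variation(n, p), n * p * p)
```

Holding np fixed while p shrinks shows that the size of p, rather than the size of n alone, controls the quality of the approximation.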

I emphasize those methods and concepts that are most useful in statistics in the broad sense. For example, distribution theory is motivated by detailed study of the most widely useful families of random variables. Classical estimation and hypothesis testing are still dealt with, but as applications of these general tools. Simultaneously, Bayesian, conditional, and other styles of inference are introduced as well. The standard textbooks, unfortunately, tend to introduce very obscure and abstract subjects “cold” (where did a horrible expression like $\frac{1}{\sqrt{2\pi}} e^{-x^2/2}$ come from?), then only belatedly get around to motivating them and giving examples. Here we insist on concreteness. The book precedes each new topic with a relevant statistical problem. We introduce abstract concepts gradually, working from the special to the general. At the same time, each new technique is applied as widely as possible. Thus, every chapter is quite broad, touching on many connections with its main topics.

The book’s attitude toward mathematics may surprise you: We take it seriously. Our students may not know measure theory, but they do know an enormous amount of useful mathematics. This text uses what they do know and teaches them more. We aim for reasonable completeness: Every formula is derived, every property is proved (often, students are asked to complete the arguments themselves as exercises). The level of mathematical precision and generality is appropriate to a serious upper-level undergraduate course.

At the same time, students are not expected to memorize exotic technicalities, relevant only in graduate school. For example, the book does not burden them with the infamous “triple” definition of a random variable; a less obscure definition is adequate for our work here. (Those students who go on to graduate mathematical statistics courses will be just the ones who will have no trouble switching to the more abstract point of view later.) Furthermore, we emphasize mathematical directness: Those short, elegant proofs so prized by professors are often here replaced by slightly longer but more constructive demonstrations. Our goal is to stimulate understanding, not to dazzle with our brilliance.

What is in the book? These pedagogical principles impose an unconventional order of topics. Let me take you on a brief tour of the book:

The “Getting Started” chapter motivates the study of statistics, then prepares the student for hands-on involvement: completing proofs and derivations as well as working problems.

Chapter 1 adopts an attitude right away: Statistics precedes probability. That is, models for important phenomena are more important than models for measurement and sampling error. The first two chapters do not mention probability. We start with the linear data-summary models that make up so much of statistical practice: one-way layouts and factorial models. Fundamental concepts such as additivity and interaction appear naturally. The simplest linear regression models follow by interpolation. Then we construct simple contingency-table models for counting experiments and thereby discover independence and association. Then we take logarithms, to derive loglinear models for contingency tables (which are strikingly parallel to our linear models). Again, logistic regression models arise by interpolation. In this chapter, of course, we restrict ourselves to cases for which reasonable parameter estimates are obvious.

Chapter 2 shows how to estimate ANOVA and regression models by the ancient, intuitive method of least squares. We emphasize geometrical interpretation of the method: shortest Euclidean distance. This motivates sample variance, covariance, and correlation. Decomposition of the sum of squares in ANOVA and insight into degrees of freedom follow naturally.
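The geometric reading of least squares can be illustrated with a small sketch. It is not taken from the book: the data are made up, and the function names are hypothetical. It fits a straight line using the sample covariance and variance, then checks that the residual vector is perpendicular to both the constant vector and the predictor, which is what "shortest Euclidean distance" amounts to.

```python
def mean(values):
    return sum(values) / len(values)

def least_squares_line(x, y):
    # Slope = (sum of cross-products) / (sum of squares of x about its mean),
    # i.e. sample covariance over sample variance; the divisors cancel.
    xbar, ybar = mean(x), mean(y)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    return intercept, slope

def sum_of_squares(intercept, slope, x, y):
    # Squared Euclidean distance from the data vector y to the fitted vector.
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))

# Made-up illustrative data (an assumption, not from the book).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = least_squares_line(x, y)
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]

# Perpendicularity: both dot products are zero up to rounding error.
dot_with_ones = sum(residuals)
dot_with_x = sum(r * xi for r, xi in zip(residuals, x))
print(a, b, dot_with_ones, dot_with_x)
```

Any other choice of intercept and slope gives a fitted vector farther from the data, so `sum_of_squares` is larger there.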

That is as far as we can go without models for errors, so Chapter 3 begins with a conventional introduction to combinatorial probability. It is, however, very concrete: We draw marbles from urns. Rather than treat conditional probability as a later, artificially difficult topic, we start with the obvious: All probabilities are conditional. It is just that a few of them are conditional on a whole sample space. Then the first asymptotic result is obtained, to aid in the understanding of the famous “birthday problem.” This leads to insight into the difference between finite population and infinite population sampling.
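As a rough illustration of how an asymptotic result aids the birthday problem, the sketch below compares the exact probability of a shared birthday with a standard exponential approximation. Both the choice of approximation, 1 − exp(−k(k−1)/(2N)), and the function names are assumptions made here; the preface does not say which form the book derives.

```python
import math

def birthday_exact(k, days=365):
    # Probability that at least two of k people share a birthday,
    # assuming birthdays are equally likely across `days` days.
    p_no_match = 1.0
    for i in range(k):
        p_no_match *= (days - i) / days
    return 1.0 - p_no_match

def birthday_approx(k, days=365):
    # Assumed approximation: replace each factor (1 - i/days) by exp(-i/days).
    return 1.0 - math.exp(-k * (k - 1) / (2 * days))

for k in (10, 23, 40):
    print(k, birthday_exact(k), birthday_approx(k))
```

The approximation is close when k is small relative to the number of days, which is the regime where sampling with and without replacement nearly agree.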


Chapter 4 uses geometrical examples to introduce continuous probability models. Then we generalize to abstract probability. The axioms we use correspond to how one actually calculates probability. We go on to general discrete probability, and Bayes’s theorem. The chapter ends with an elementary introduction to Borel algebra as a basis for continuous probabilities.

Chapter 5 introduces discrete random variables. We start with finite population sampling, in particular, the negative hypergeometric family. You may not be familiar with this family, but the reasons to be interested are numerous: (1) Many common random variables (binomial, negative binomial, Poisson, uniform, gamma, beta, and normal) are asymptotic limits of this family; (2) it possesses in transparent ways the symmetries and dualities of those families; and (3) it becomes particularly easy for the student to carry out his own simulations, via urn models. Then the Fisher exact test gives us the first example of an hypothesis test, for independence in the 2 × 2 tables we studied in Chapter 1. We introduce the expectation of discrete random variables as a generalization of the average of a finite population. Finally, we give the first estimates for unknown parameters and confidence bounds for them.
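A minimal sketch of the Fisher exact test for a 2 × 2 table follows, assuming the usual formulation in which all margins are held fixed and the top-left count follows a hypergeometric distribution (the distribution of draws from an urn without replacement). The function names and the example table are hypothetical and are not drawn from the book.

```python
import math

def hypergeom_pmf(k, row1, row2, col1):
    # P(top-left cell = k) when row totals are row1, row2 and the
    # first column total is col1.
    return (math.comb(row1, k) * math.comb(row2, col1 - k)
            / math.comb(row1 + row2, col1))

def fisher_one_sided(table):
    # One-sided p-value: probability of a top-left count at least as
    # large as the one observed, given the margins.
    (a, b), (c, d) = table
    row1, row2, col1 = a + b, c + d, a + c
    k_max = min(row1, col1)
    return sum(hypergeom_pmf(k, row1, row2, col1)
               for k in range(a, k_max + 1))

# Illustrative table: 3 of 4 correct in each row, all margins equal to 4.
p_value = fisher_one_sided([[3, 1], [1, 3]])
print(p_value)  # 17/70, about 0.243
```

With so small a table the p-value can be checked by hand: the two qualifying tables have probabilities 16/70 and 1/70.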

Chapter 6 introduces the geometric, negative binomial, binomial, and Poisson families. We discover that the first three arise as asymptotic limits in the negative hypergeometric family and also as sequences of Bernoulli experiments. Thus, we have related finite and infinite population sampling. We investigate just when the Poisson family may be used as an asymptotic approximation in the binomial and negative binomial families. General discrete expectations and the population variance are then introduced. Confidence intervals and two-sided hypothesis tests provide natural applications.

Chapter 7 introduces random vectors and random samples. Here is where marginal and conditional distributions appear, and from these, population covariance and correlation. This tells us some things about the distribution of the sample mean and variance, and leads to the first laws of large numbers. The study of conditional distributions permits the first examples of parametric Bayesian inference.

Chapter 8 investigates parameter estimation and evaluation of fit in complicated discrete models. We introduce the discrete likelihood and the log-likelihood ratio statistic. This turns out often to be asymptotically equivalent to Pearson’s chi-squared statistic, but it is much more generally useful. Then we introduce maximum likelihood estimation and apply it to loglinear contingency table models; estimates are computed by iterative proportional fitting. We estimate linear logistic models by maximum likelihood, evaluated by Newton’s method.
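To make iterative proportional fitting concrete, here is a sketch for the simplest case: the independence model in a two-way table. The table is made up, the function names are hypothetical, and the book's own presentation may differ. The fitted counts are then compared with the observed counts using both the log-likelihood ratio statistic and Pearson's chi-squared statistic.

```python
import math

def ipf_independence(table, cycles=5):
    # Start from a table of ones, then alternately rescale rows and
    # columns so the fitted margins match the observed margins.
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    fit = [[1.0] * len(col_totals) for _ in row_totals]
    for _ in range(cycles):
        for i, row in enumerate(fit):
            s = sum(row)
            fit[i] = [v * row_totals[i] / s for v in row]
        for j in range(len(col_totals)):
            s = sum(fit[i][j] for i in range(len(row_totals)))
            for i in range(len(row_totals)):
                fit[i][j] *= col_totals[j] / s
    return fit

def likelihood_ratio_stat(observed, fitted):
    # G^2 = 2 * sum of O * log(O / E); cells with O = 0 contribute 0.
    return 2.0 * sum(o * math.log(o / e)
                     for orow, erow in zip(observed, fitted)
                     for o, e in zip(orow, erow) if o > 0)

def pearson_stat(observed, fitted):
    # X^2 = sum of (O - E)^2 / E
    return sum((o - e) ** 2 / e
               for orow, erow in zip(observed, fitted)
               for o, e in zip(orow, erow))

# Made-up counts (an assumption for illustration only).
observed = [[30, 10], [20, 40]]
fitted = ipf_independence(observed)
print(fitted)
print(likelihood_ratio_stat(observed, fitted), pearson_stat(observed, fitted))
```

For the independence model the procedure settles after one cycle at (row total × column total) / grand total; the iteration matters for richer loglinear models, where no closed form is available.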
