# Statistics

## Metadata
- Author: David Freedman, Robert Pisani, and Roger Purves
- Full Title: Statistics
- Category: #books
## Highlights
- Investigators will make generalizations from the part to the whole. In more technical language, they make inferences from the sample to the population. (Page 351)
- A systematic tendency on the part of the sampling procedure to exclude one kind of person or another from the sample is called selection bias. (Page 353)
- When a selection procedure is biased, taking a large sample does not help. This just repeats the basic mistake on a larger scale. (Page 353)
- non-response bias (Page 354)
- Non-respondents can be very different from respondents. When there is a high non-response rate, look out for non-response bias. (Page 354)
- Some samples are really bad. To find out whether a sample is any good, ask how it was chosen. Was there selection bias? Non-response bias? You may not be able to answer these questions just by looking at the data. (Page 354)
- This process is called simple random sampling: tickets have simply been drawn at random without replacement. At each draw, every ticket in the box has an equal chance to be chosen. The interviewers have no discretion at all in whom they interview, and the procedure is impartial—everybody has the same chance to get into the sample. Consequently, the law of averages guarantees that the percentage of Democrats in the sample is likely to be close to the percentage in the population. (Page 357)
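The draw-from-a-box picture can be sketched in a few lines of Python. The box below is hypothetical (100,000 tickets, 55% of them Democrats); the point is that a simple random sample, drawn without replacement, tends to give a sample percentage close to the population percentage:

```python
import random

random.seed(0)

# Hypothetical box: 100,000 tickets, 55% marked 1 (Democrat), 45% marked 0.
population = [1] * 55_000 + [0] * 45_000

# Simple random sample: 1,600 tickets drawn at random without replacement.
sample = random.sample(population, 1_600)

pct = 100 * sum(sample) / len(sample)
print(f"sample percentage of Democrats: {pct:.1f}%")  # likely within a point or two of 55%
```

Re-running with different seeds shows the chance variability: the sample percentage bounces around 55%, but not by much.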
- quota sampling guarantees that the percentage of men in the sample will be equal to the percentage of men in the population. With probability sampling, we can only say that the percentage of men in the sample is likely to be close to the percentage in the population: certainty is reduced to likelihood. (Page 360)
- estimate = parameter + bias + chance error. (Page 367)
- Summary (Pages 371–372):
  1. A sample is part of a population.
  2. A parameter is a numerical fact about a population. Usually a parameter cannot be determined exactly, but can only be estimated.
  3. A statistic can be computed from a sample, and used to estimate a parameter. A statistic is what the investigator knows. A parameter is what the investigator wants to know.
  4. When estimating a parameter, one major issue is accuracy: how close is the estimate going to be?
  5. Some methods for choosing samples are likely to produce accurate estimates. Others are spoiled by selection bias or non-response bias. When thinking about a sample survey, ask yourself: What is the population? The parameter? How was the sample chosen? What was the response rate?
  6. Large samples offer no protection against bias.
  7. In quota sampling, the sample is hand-picked by the interviewers to resemble the population in some key ways. This method seems logical, but often gives bad results. The reason: unintentional bias on the part of the interviewers when they choose subjects to interview.
  8. Probability methods for sampling use an objective chance process to pick the sample, and leave no discretion to the interviewer. The hallmark of a probability method: the investigator can compute the chance that any particular individuals in the population will be selected for the sample. Probability methods guard against bias, because blind chance is impartial.
  9. One probability method is simple random sampling. This means drawing subjects at random without replacement.
  10. Even when using probability methods, bias may come in. Then the estimate differs from the parameter, due to bias and chance error: estimate = parameter + bias + chance error. Chance error is also called "sampling error," and bias is "non-sampling error."
## New highlights added September 8, 2022 at 2:39 PM
- percentage in sample = percentage in population + chance error. (Page 376)
- A statistician would call this inference from the sample to the population. (Page 393)
- The bootstrap. When sampling from a 0–1 box whose composition is unknown, the SD of the box can be estimated by substituting the fractions of 0’s and 1’s in the sample for the unknown fractions in the box. The estimate is good when the sample is reasonably large. (Page 396)
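The bootstrap estimate described above is a one-liner: substitute the sample fractions for the unknown box fractions and take √(p(1−p)). A minimal sketch:

```python
import math

def bootstrap_sd(sample):
    """Estimate the SD of a 0-1 box from a sample of draws.

    Substitutes the sample fraction of 1's for the unknown
    fraction in the box: SD estimate = sqrt(p * (1 - p)).
    """
    p = sum(sample) / len(sample)  # fraction of 1's in the sample
    return math.sqrt(p * (1 - p))

# Example: a sample with 573 ones and 427 zeros.
sd_hat = bootstrap_sd([1] * 573 + [0] * 427)
print(f"bootstrap SD estimate: {sd_hat:.2f}")  # about 0.49, i.e. roughly 0.5
```

As the highlight notes, this substitution is trustworthy only when the sample is reasonably large.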
- This is a confidence interval for the population percentage, with a confidence level of about 95% (Page 399)
- The fraction of 1’s in the box (translation—the fraction of Democrats among the 25,000 eligible voters) is unknown, but can be estimated by 0.573, the fraction of Democrats in the sample. Similarly, the fraction of 0’s in the box is estimated as 0.427. So the SD of the box is estimated by the bootstrap method as √(0.573 × 0.427) ≈ 0.5. The SE for the number of Democrats in the sample is √1,600 × 0.5 = 20. The 20 gives the likely size of the chance error in the 917 (the observed number of Democrats in the sample). Now convert to percent, relative to the size of the sample: (20/1,600) × 100% = 1.25%. The SE for the percentage of Democrats in the sample is 1.25%. The percentage of Democrats in the sample is likely to be off the percentage of Democrats in the population by 1.25 percentage points or so. A 95%-confidence interval for the percentage of Democrats among all 25,000 eligible voters is 57.3% ± 2 × 1.25%. (Page 400)
- That is the answer. We can be about 95% confident that between 54.8% and 59.8% of the eligible voters in this town are Democrats. (Page 400)
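The arithmetic in this worked example can be checked directly. The sketch below uses the book's numbers (a sample of 1,600 with fraction 0.573 Democrats) and exact arithmetic rather than the rounded SD of 0.5, which is why it reproduces the same interval:

```python
import math

n = 1_600
p_hat = 0.573  # fraction of Democrats in the sample

sd = math.sqrt(p_hat * (1 - p_hat))  # bootstrap estimate of the SD of the box, ~0.49
se_number = math.sqrt(n) * sd        # SE for the number of Democrats, ~20
se_percent = se_number / n * 100     # SE for the sample percentage, ~1.24

low = p_hat * 100 - 2 * se_percent
high = p_hat * 100 + 2 * se_percent
print(f"95%-confidence interval: {low:.1f}% to {high:.1f}%")
```

The interval printed matches the highlight: roughly 54.8% to 59.8%.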
- The chances are in the sampling procedure, not in the parameter. (Page 402)
- A confidence interval is used when estimating an unknown parameter from sample data. The interval gives a range for the parameter, and a confidence level that the range covers the true value. (Page 403)
- Confidence levels are a bit difficult, because they involve thinking not only about the actual sample but about other samples that could have been drawn. (Page 403)
- Interpreting confidence intervals (Figure 1): the 95%-confidence interval is shown for 100 different samples. The interval changes from sample to sample. For about 95% of the samples, the interval covers the population percentage, marked by a vertical line. (Page 403)
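The coverage claim in this highlight can be checked by simulation. The sketch below assumes a hypothetical true population percentage of 55% and draws many samples (with replacement, a close approximation to simple random sampling from a large population); roughly 95% of the ±2 SE intervals should cover the truth:

```python
import math
import random

random.seed(1)

p_true = 0.55   # hypothetical population fraction (not from the book)
n = 1_600
trials = 1_000
covered = 0

for _ in range(trials):
    # Draw a sample of size n; count the 1's.
    hits = sum(random.random() < p_true for _ in range(n))
    p_hat = hits / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)   # SE for the sample fraction
    if abs(p_hat - p_true) <= 2 * se:          # does the interval cover the truth?
        covered += 1

print(f"{100 * covered / trials:.0f}% of intervals covered the true percentage")
```

Note the direction of the reasoning: each interval either covers the true percentage or it does not; the 95% describes the procedure across repeated samples, not any one interval.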
- Probabilities are used when you reason forward, from the box to the draws; confidence levels are used when reasoning backward, from the draws to the box (Page 404)
- A sample percentage will be off the population percentage, due to chance error. The SE tells you the likely size of the amount off. (Page 404)
- Tags: [[purple]]
- In statistics, as in old-fashioned capitalism, the responsibility is on the consumer. (Page 406)
- Summary (Page 412):
  1. With a simple random sample, the sample percentage is used to estimate the population percentage.
  2. The sample percentage will be off the population percentage, due to chance error. The SE for the sample percentage tells you the likely size of the amount off.
  3. When sampling from a 0–1 box whose composition is unknown, the SD of the box can be estimated by substituting the fractions of 0’s and 1’s in the sample for the unknown fractions in the box. This bootstrap estimate is good when the sample is large.
  4. A confidence interval for the population percentage is obtained by going the right number of SEs either way from the sample percentage. The confidence level is read off the normal curve. This method should only be used with large samples.
  5. In the frequency theory of probability, parameters are not subject to chance variation. That is why confidence statements are made instead of probability statements.
  6. The formulas for simple random samples may not apply to other kinds of samples. If the sample was not chosen by a probability method, watch out: SEs computed from the formulas may not mean very much.