Statistical Inference


The main idea of statistical inference is to take a random sample from a population and then to use the information from the sample to make inferences about particular population characteristics, such as the mean (a measure of central tendency), the standard deviation (a measure of spread), or the proportion of units in the population that have a certain characteristic. Sampling saves money, time, and effort.


Additionally, a sample can, in some cases, provide as much information as a corresponding study that would attempt to investigate an entire population: careful collection of data from a sample will often provide better information than a less careful study that tries to look at everything. Because a sample examines only part of a population, the sample mean will not exactly equal the corresponding mean of the population. Thus, an important consideration for those planning and interpreting sampling results is the degree to which sample estimates, such as the sample mean, will agree with the corresponding population characteristic.


In practice, only one sample is usually taken (in some cases such as "survey data analysis" a small "pilot sample" is used to test the data-gathering mechanisms and to get preliminary information for planning the main sampling scheme). However, for purposes of understanding the degree to which sample means will agree with the corresponding population mean, it is useful to consider what would happen if 10, or 50, or 100 separate sampling studies, of the same type, were conducted. How consistent would the results be across these different studies? If we could see that the results from each of the samples would be nearly the same (and nearly correct!), then we would have confidence in the single sample that will actually be used. On the other hand, seeing that answers from the repeated samples were too variable for the needed accuracy would suggest that a different sampling plan (perhaps with a larger sample size) should be used.


A sampling distribution is used to describe the distribution of outcomes that one would observe from replication of a particular sampling plan.
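This replication idea can be simulated directly. The sketch below uses only the Python standard library and an entirely hypothetical population (an exponentially distributed variable); it draws many samples under the same plan and records the sample mean from each replication, and the resulting collection of means approximates the sampling distribution of the mean.

```python
import random
import statistics

random.seed(42)

# Hypothetical population: 10,000 values from a skewed (exponential)
# distribution with mean 50 - purely illustrative.
population = [random.expovariate(1 / 50) for _ in range(10_000)]
pop_mean = statistics.mean(population)

# Replicate the sampling plan: draw 1,000 samples of size 30 and
# record the sample mean from each replication.
sample_means = [
    statistics.mean(random.sample(population, 30)) for _ in range(1_000)
]

# The collection of sample means approximates the sampling distribution.
print(f"population mean:        {pop_mean:.2f}")
print(f"mean of sample means:   {statistics.mean(sample_means):.2f}")
print(f"spread of sample means: {statistics.stdev(sample_means):.2f}")
```

The sample means cluster around the population mean, and their spread is far smaller than the spread of the population itself, which is exactly the behaviour the text describes.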


Estimates computed from one sample will differ from estimates computed from another sample.

·        Understand that estimates are expected to differ from the population characteristics (parameters) that we are trying to estimate, but that the properties of sampling distributions allow us to quantify, probabilistically, how they will differ.

·        Understand that different statistics have different sampling distributions with distribution shapes depending on (a) the specific statistic, (b) the sample size, and (c) the parent distribution.

·        Understand the relationship between sample size and the distribution of sample estimates.

·        Understand that the variability in a sampling distribution can be reduced by increasing the sample size.
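The last point above can be checked empirically. In this sketch (again with a hypothetical population, here normal with mean 100 and standard deviation 15), the spread of the sampling distribution of the mean is estimated for several sample sizes; it shrinks roughly like the population standard deviation divided by the square root of the sample size.

```python
import random
import statistics

random.seed(0)

# Hypothetical population: normal with mean 100 and SD 15.
population = [random.gauss(100, 15) for _ in range(10_000)]

def spread_of_sample_means(n, reps=1_000):
    """Standard deviation of the sample mean across `reps` replications."""
    means = [statistics.mean(random.sample(population, n)) for _ in range(reps)]
    return statistics.stdev(means)

# Larger samples give less variable estimates, roughly sigma / sqrt(n).
spreads = {n: spread_of_sample_means(n) for n in (10, 40, 160)}
for n, s in spreads.items():
    print(f"n = {n:4d}  spread of sample means = {s:.2f}")
```

Quadrupling the sample size roughly halves the spread, consistent with the square-root relationship.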


See that in large samples, many sampling distributions can be approximated with a normal distribution.
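The normal approximation can also be checked by simulation. Even when the parent distribution is strongly skewed, the sampling distribution of the mean for a reasonably large sample size looks approximately normal; one rough diagnostic, used in this sketch (hypothetical exponential parent), is that about 95% of sample means should fall within two standard deviations of their centre.

```python
import random
import statistics

random.seed(1)

# A strongly skewed parent distribution (exponential with mean 1).
population = [random.expovariate(1.0) for _ in range(10_000)]

# Sampling distribution of the mean for samples of size 50.
means = [statistics.mean(random.sample(population, 50)) for _ in range(2_000)]

m, s = statistics.mean(means), statistics.stdev(means)
within_2sd = sum(abs(x - m) < 2 * s for x in means) / len(means)

# For a normal distribution, roughly 95% of values fall within 2 SDs.
print(f"fraction of sample means within 2 SDs: {within_2sd:.3f}")
```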


Many problems in analysing data involve describing how variables are related. The simplest of all models describing the relationship between two variables is a linear, or straight-line, model. The simplest method of fitting a linear model is to "eye-ball" a line through the data on a plot. A more elegant, and conventional, method is that of "least squares", which finds the line minimising the sum of squared vertical distances between the observed points and the fitted line.
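The least-squares line has a simple closed form: the slope is the sum of cross-products of deviations from the means divided by the sum of squared deviations in x, and the intercept follows from the fact that the line passes through the point of means. A minimal sketch, using made-up data that roughly follow y = 2x:

```python
# Hypothetical (x, y) measurements, roughly following y = 2x.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2]

x_bar = sum(x) / len(x)
y_bar = sum(y) / len(y)

# Least squares in closed form:
#   slope     = sum((x_i - x_bar)(y_i - y_bar)) / sum((x_i - x_bar)^2)
#   intercept = y_bar - slope * x_bar
slope = (sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
         / sum((a - x_bar) ** 2 for a in x))
intercept = y_bar - slope * x_bar

print(f"fitted line: y = {intercept:.2f} + {slope:.2f} x")
```

For these data the fitted slope comes out very close to 2 and the intercept close to 0, matching the pattern built into the numbers.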


Realise that fitting the "best" line by eye is difficult, especially when there is a lot of residual variability in the data.


Know that there is a simple connection between the numerical coefficients in the regression equation and the slope and intercept of the regression line.


Know that a single summary statistic like a correlation coefficient does not tell the whole story. A scatter plot is an essential complement when examining the relationship between two variables.
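A classic illustration is the first two of Anscombe's data sets: one is a roughly linear scatter, the other a smooth curve, yet their correlation coefficients are essentially identical. The sketch below computes the Pearson correlation from first principles for both sets; only a scatter plot reveals how different they are.

```python
# Anscombe's data sets I and II: same x values, very different shapes.
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y_linear = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y_curved = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

def correlation(xs, ys):
    """Pearson correlation coefficient, computed from first principles."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / (sxx * syy) ** 0.5

r_linear = correlation(x, y_linear)
r_curved = correlation(x, y_curved)

# Both correlations come out around 0.82, yet a scatter plot shows a
# linear cloud for the first set and a smooth curve for the second.
print(f"r (linear set): {r_linear:.3f}")
print(f"r (curved set): {r_curved:.3f}")
```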