Margin of error as popularly understood overstates the reliability of research results in at least three key ways.

First, those interpreting margin of error forget an important caveat. The results are estimates and typically vary within a narrow range around the actual value that would be produced by completing a census of everyone in a population. On occasion (1 out of 20 times, at a 95% confidence level), the result for a particular question may fall entirely outside the margin of error. So in a 20-question survey, probability sampling theory expects one question, on average, to have no relationship to the true value; such an outlier is to be expected.
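As a back-of-the-envelope check, here is the arithmetic behind that expectation. This is a sketch that assumes the 20 questions are statistically independent, which real survey questions rarely are:

```python
# Back-of-the-envelope odds that a 20-question survey produces at least
# one result outside its 95% interval, assuming independent questions.
confidence = 0.95
k = 20

expected_misses = k * (1 - confidence)  # about 1 question, on average
p_any_miss = 1 - confidence ** k        # chance of at least one outlier

print(round(expected_misses, 2), round(p_any_miss, 2))  # 1.0 0.64
```

In other words, under the theory's own assumptions, a 20-question survey is more likely than not to contain at least one such outlier.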

Second, many other types of error occur. Generally categorized as non-sampling error, these include mistakes in how a question was asked (e.g., leading questions, incomplete choice lists) or interpreted (e.g., misread or misheard by the respondent), among many others. The Total Survey Error framework recognizes that multiple sources of error can reduce the validity of survey research: besides sampling error, the five types of non-sampling error are specification error, frame error, nonresponse error, measurement error, and processing error. All of which means that the results for many questions may fall outside the margin of error.

Third, many survey questions are about intent rather than attitude or past behavior. Such questions require interpretation through the development of voting models and purchase-likelihood models, which introduce their own sources of error.

So why the fixation on the margin of error?

The margin of sampling error is widely reported in public opinion surveys because it is the only error that can be easily calculated.
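For reference, the calculation in question is the textbook formula for a proportion from a simple random sample. This is a minimal sketch; the function name and defaults are mine, not any standard library's:

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Textbook margin of sampling error for a proportion.

    Valid only for a simple random (probability) sample; p = 0.5
    yields the widest, most conservative interval at 95% confidence.
    """
    return z * math.sqrt(p * (1 - p) / n)

# The oft-quoted "plus or minus 3 points" for n = 1,000:
print(round(margin_of_error(1000) * 100, 1))  # 3.1
```

The simplicity of this formula, requiring nothing beyond the sample size, is precisely why it gets reported while the harder-to-quantify errors go unmentioned.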

The average absolute error can only be estimated by comparing the results to a known quantity – predicting consumer demographics that have been researched by a national census, say, or predicting a field already collected in a house list. But many businesses conduct so few surveys that they can’t develop empirical observations of absolute error. And many consumer surveys are about aspects of topics that haven’t been researched before.

While probability sampling is great for e-commerce companies, which have a record of every customer and can therefore conduct rigorous research, most organizations acquire some mix of customers offline or even through third parties (e.g., resellers, franchisees, etc.). In such cases, the total population can’t be probability sampled, and the margin of error can’t be calculated.

The fact that margin of error can only be calculated for probability sampling doesn’t keep organizations from trying to calculate it, despite the protestations of AAPOR.

In fact, many researchers will just “do the math” to calculate sampling error, ignoring the fact that the assumptions behind the calculation aren’t being met. As my mentor Reg Baker pointed out, “MOE is pretty meaningless with online non-probability samples. It is based on the distribution of values one would get on a particular measure by drawing repeated samples from the same sample frame and assumes that the frame contains all or mostly all of the target population. Online samples almost never have near 100% coverage.”

Other organizations (including my own) will distract us with credibility intervals, despite the additional protestations of AAPOR.

Credibility-interval calculations are often proprietary, because the model of the target population being used is proprietary. With the calculation a black box, it becomes difficult to assess the reliability and validity of a credibility interval.

Sometimes aspects of the workings inside this black box are discussed, and those individual aspects can be reviewed. For instance, one firm relies on an implementation of the bootstrap confidence interval that takes 1,000 random re-samples of the responses collected (so some responses might be included multiple times and others none). Sounds impressive, until faced with a low-incidence population. For instance, suppose no one says that they did not complete high school (this group is commonly underrepresented in online surveys); the bootstrap confidence interval on this becomes plus or minus zero percentage points, as every random re-sample reports that 100% have completed high school.
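A minimal sketch of that failure mode, using Python's standard library rather than any firm's proprietary implementation:

```python
import random

def bootstrap_ci(responses, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the share of 1s in `responses`.

    Each re-sample draws len(responses) values with replacement, so
    some responses appear multiple times and others not at all.
    """
    rng = random.Random(seed)
    n = len(responses)
    shares = sorted(
        sum(rng.choices(responses, k=n)) / n for _ in range(n_resamples)
    )
    return (shares[int(alpha / 2 * n_resamples)],
            shares[int((1 - alpha / 2) * n_resamples) - 1])

# 500 respondents, every one reporting they completed high school:
print(bootstrap_ci([1] * 500))  # (1.0, 1.0) -- plus or minus zero points
```

Because the resamples can only draw from what was collected, a group absent from the sample is absent from every resample, and the interval collapses to zero width: false certainty about the very group the survey missed.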

However many of the black box’s gears are put on display, another way to evaluate a credibility interval is to calculate the margin of sampling error for the same sample size. If that margin is wider than the credibility interval, then the interval is not credible. It is in fact bovine excrement. No academic research suggests that nonprobability samples are more accurate than probability samples of the same size – and plenty of evidence points to the opposite.
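That sanity check is easy to sketch. The vendor numbers below are hypothetical, chosen only to illustrate the comparison:

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    return z * math.sqrt(p * (1 - p) / n)

def passes_sanity_check(credibility_halfwidth, n):
    """A credibility interval narrower than the margin of sampling error
    for the same n claims more precision than a probability sample of
    that size could deliver -- a red flag."""
    return credibility_halfwidth >= margin_of_error(n)

# Hypothetical vendor claim: plus or minus 2 points on n = 1,000,
# versus the ~3.1-point margin a probability sample would carry.
print(passes_sanity_check(0.02, 1000))  # False
```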

Full confession: we publish credibility intervals for our work using nonprobability samples. Our customers expect us to have some estimate of the validity of our results. But we make sure our credibility interval does not exaggerate the validity obtainable through nonprobability sampling: it always produces a wider range than a margin-of-error calculation would.

The fixation on margin of error, as popularly understood, erodes faith in public opinion research and market research. Given the many sources of potential error, I’d argue that margin of error is bad for reporting the results of probability samples, as well. I’d rather see the entire industry move to credibility intervals that estimate absolute error instead, as incredible as that may seem.

*Jeffrey Henning, IPC has been the chief research officer of Researchscape International for the past 12 years, plus or minus 1 month.*

*Originally published 2017-10-13.*