Sampling (Statistics)
Sampling is a statistical methodology that uses a portion of a total population to represent the full population. As a subject, sampling considers the different methodologies one could use to survey a portion of the population and seeks to find a sample that is most indicative of the overall population. A sample's "representativeness" is also known as its "generalizability": how well a specific set of a population's members can be generalized to the whole population. To achieve a highly generalizable result, sampling can weight certain factors or certain respondents differently to better represent the overall population. For instance, a highly generalizable sample of an elementary school would not just include responses from students in the first grade, but from students in all grades, each in proportion to its share of the school. If there are $100$ students in an elementary school, and $15$ are in first grade, then about $15\%$ of the responses in the sample should come from first graders.
A population to be sampled could be any number of things. It could be a literal population of people sampled for political opinion; a plant species sampled to determine the prevalence of a pest; the entire clothing production of a factory, checked for quality; or the outputs of a computer simulation, checked to determine whether the simulation is a fair approximation of reality. Studies that include the entire population are referred to as census studies, not samples. Such comprehensive studies are rarely carried out, as the cost (in time or money) might be too high, or the feasibility might be too low. For instance, collecting data on the detailed religious views of every citizen in the United States would be very expensive and possibly infeasible, and determining the average age of bears in Siberia might be impossible (as it would require locating every bear across a vast geography).
Sampling is a key tool in the scientific method, allowing researchers to provide evidence that generalizable theories are true simply from a sample of specific facts. Sampling also allows mathematicians to resolve complex paradoxes, for instance, the St. Petersburg paradox, and is used commonly in political polling to understand current social, philosophical, and moral trends.
Probability versus Non-probability Sampling
Sampling methods are categorized as either probability or non-probability sampling. In probability samples, every member of a population has a known, non-zero chance of being included in the sample. This allows researchers to calculate and report the sampling error, or the degree to which the sample might stray from the overall population. In non-probability samples, by contrast, some members have a zero chance or an unknown chance of being included.
Imagine a polltaker visits a supermarket at 7 pm every Thursday to survey the careers that the shoppers hold. By its very nature, this poll will exclude, at some unknown probability, those people who are working at that time. This could be bartenders, nighttime TV news anchors, nurses with evening shifts, etc. Or it might exclude people of a certain ethnic group who prefer to shop at their ethnic grocer. If the goal were to represent the population of people who buy groceries from that type of store at that time, the poll could have a low sample bias. But if its purpose were to represent the entire population in that geographic region, it would be a non-probability sample with an unknown (possibly high) sample bias.
Part of the reason for choosing one type of sampling methodology over another is cost. As a general rule, the larger the population to be sampled from, the larger the sample tends to be (to ensure that it is representative) and the harder it can be to eliminate bias. If a researcher is trying to select $5$ people randomly from a classroom of $20$ people, there are many easy ways to do this that will ensure the sample is random and representative. If that same researcher is trying to select from a population of $200$ million, they're unlikely to select $\frac{1}{4}$ of the population, and a sample of any size would be considerably more expensive in time and money.
One thing that is rarely addressed is that samples usually do not account for time. That is, the sample reflects the population only at the time the researchers gathered the data. Ten years earlier or ten years later, that population could be different.
Probability Sampling
Random sampling is the most basic type of sampling: a researcher takes a truly random $x\%$ of the population they wish to study. The challenge is that true randomness is difficult to implement, that is, to ensure that every single member of the population has an equal probability of being selected. In theory, a random sample works like a raffle: every member of the population is given a single raffle ticket, and then the sample is drawn from those tickets. However, the larger the population gets, the harder it is to ensure that each member has the same chance of being selected as any other. That is, it can be hard to assign each member exactly one raffle ticket and to draw from all the tickets with equal probability.
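For a small population that fits in a list, the raffle idea can be sketched in a few lines of Python. This is only an illustration, assuming a hypothetical classroom of $20$ numbered students; the function name `simple_random_sample` and the `seed` parameter are not from any particular library.

```python
import random

def simple_random_sample(population, k, seed=None):
    """Draw k distinct members; every member is equally likely to be chosen."""
    rng = random.Random(seed)  # a seed makes the draw reproducible
    return rng.sample(list(population), k)

# Hypothetical classroom: students numbered 1 through 20; pick 5 of them.
classroom = range(1, 21)
chosen = simple_random_sample(classroom, 5, seed=42)
print(chosen)
```

Drawing without replacement mirrors the raffle: each ticket can be pulled only once.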
It can be relatively easy to check whether a sample is plausibly random by comparing the sample to known data about the tested population (provided such data is available). For example, if the population of Australia is roughly $50\%$ male and $50\%$ female, a random sample of that population should be close to 50-50 as well.
Systematic sampling is a method for implementing random sampling by selecting every $n^\text{th}$ member of the population, where $n$ is either a fixed integer or a randomly generated one. For instance, a bakery could choose to select every $99^\text{th}$ cake for testing and quality control, or it could pick every $n^\text{th}$ cake where $n$ is a randomly generated number between $50$ and $200$. Note: this only works if the population is itself in a random order, or if the count begins at a random position between $1$ and $n$. If the population is ordered in some meaningful way, then selecting every $99^\text{th}$ one may not be random.
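The bakery example might be sketched as follows. The production run of $1{,}000$ cakes and the function name `systematic_sample` are assumptions made for illustration; the random starting offset corresponds to beginning the count at a random position.

```python
import random

def systematic_sample(items, n, seed=None):
    """Select every n-th item, starting at a random offset within the first n."""
    rng = random.Random(seed)
    start = rng.randrange(n)  # random index in 0 .. n-1
    return items[start::n]

# Hypothetical production run of 1,000 cakes, numbered in production order.
cakes = list(range(1, 1001))
tested = systematic_sample(cakes, 99, seed=1)
print(tested)
```

If the cakes were ordered in a way that lines up with the interval (say, one oven producing every $99^\text{th}$ cake), this sketch would inherit that bias regardless of the random start.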
Stratified sampling is a method to help ensure randomness in larger populations. The entire population is divided into strata, or different groups, based on some particular criteria. For instance, if a surveyor wanted to sample an entire city population, they might divide the city into strata by geography, occupation, education, wealth, etc. The number of sample members selected from each stratum is proportional to that stratum's share of the entire population. For instance, if District A represents $10\%$ of the city's population, and a researcher wants a total sample of $1000$, then they will select $100$ people from District A. Within each stratum, sample members are selected randomly. For instance, it might be possible to randomly select a house and then randomly select one member of that household to sample.
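A minimal sketch of proportional allocation is shown below. The city of $10{,}000$ residents, the district sizes, and the function name `stratified_sample` are hypothetical; only the $10\%$ share of District A and the total sample of $1000$ come from the example above.

```python
import random

def stratified_sample(strata, total_sample_size, seed=None):
    """Sample each stratum in proportion to its share of the population."""
    rng = random.Random(seed)
    population_size = sum(len(members) for members in strata.values())
    sample = {}
    for name, members in strata.items():
        # Rounding can make the overall total differ slightly from the target.
        k = round(total_sample_size * len(members) / population_size)
        sample[name] = rng.sample(members, k)  # random selection within stratum
    return sample

# Hypothetical city of 10,000 residents split into three districts.
city = {
    "District A": [f"A{i}" for i in range(1000)],   # 10% of the city
    "District B": [f"B{i}" for i in range(3000)],   # 30%
    "District C": [f"C{i}" for i in range(6000)],   # 60%
}
picked = stratified_sample(city, 1000, seed=0)
print({name: len(members) for name, members in picked.items()})
# {'District A': 100, 'District B': 300, 'District C': 600}
```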
Cluster sampling is a method whereby researchers pick random clusters for ease of sampling. For instance, a study of students in a school district might study every student in each of $20$ randomly chosen schools. This is as opposed to random sampling, which would likely end up with a few students from every school. Researchers may choose to study clusters for ease of implementation (for instance, it might be easier to get $20$ school principals to agree to some minor school-day disruption than to get every principal to agree).
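The school example could be sketched like this, assuming a hypothetical district of $60$ schools with $30$ students each; the function name `cluster_sample` and the data layout are illustrative only.

```python
import random

def cluster_sample(clusters, num_clusters, seed=None):
    """Pick whole clusters at random and keep every member of each one."""
    rng = random.Random(seed)
    chosen_names = rng.sample(sorted(clusters), num_clusters)
    return {name: list(clusters[name]) for name in chosen_names}

# Hypothetical district: 60 schools, 30 students in each.
schools = {
    f"school-{s}": [f"school-{s}/student-{i}" for i in range(30)]
    for s in range(60)
}
studied = cluster_sample(schools, 20, seed=3)
print(len(studied), sum(len(students) for students in studied.values()))
# 20 600
```

Only the choice of schools is random here; within a chosen school, no student is left out.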
Multistage sampling combines other sampling techniques in stages, whereby the researchers winnow the final sample down from successively smaller samples. For instance, in the above example of studying students in a school district, researchers used cluster sampling to select $20$ schools. They could then conduct a second stage and use cluster sampling again to select $15$ classrooms in each school, followed by a third stage where they use random sampling to choose $10$ students in each class. Again, this helps to decrease costs but introduces opportunities at each stage for the researchers to deviate from a truly random or representative sample of the total population. For instance, if they were studying student performance and they happened to choose the two best performing or two worst performing schools in the population, then $10\%$ of their sample would come from the biggest outliers in the total population.
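The three stages described above might look like the following sketch. The district size ($60$ schools, $25$ classrooms per school, $30$ students per classroom) and all names are assumptions for illustration; only the stage sizes ($20$, $15$, $10$) come from the example.

```python
import random

def multistage_sample(district, n_schools, n_classrooms, n_students, seed=None):
    """Stage 1: pick schools. Stage 2: pick classrooms. Stage 3: pick students."""
    rng = random.Random(seed)
    sample = []
    for school in rng.sample(sorted(district), n_schools):
        classrooms = district[school]
        for room in rng.sample(sorted(classrooms), n_classrooms):
            sample.extend(rng.sample(classrooms[room], n_students))
    return sample

# Hypothetical district: 60 schools x 25 classrooms x 30 students.
district = {
    f"school-{s}": {
        f"room-{r}": [f"s{s}-r{r}-student{i}" for i in range(30)]
        for r in range(25)
    }
    for s in range(60)
}
final_sample = multistage_sample(district, 20, 15, 10, seed=7)
print(len(final_sample))  # 3000, i.e. 20 * 15 * 10
```

Each stage narrows the pool, so an unlucky draw at the first stage (such as two outlier schools) carries through to every later stage.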
Non-probability Sampling
Convenience sampling, also known as opportunity sampling, refers to a method where a researcher uses a non-random sample to approximate the truth. This is often used at the beginning of a study, or as a means of testing whether a study is worthwhile. A convenient sample (for instance, the first five people a researcher finds walking down the street) may be much cheaper than a truly random sample.
Judgment sampling is a form of convenience sampling where the researcher applies some judgment in selecting a convenient sample. For instance, a researcher studying coffee shop patrons could justifiably conclude that the patrons of their nearest coffee shop are representative of the patrons of all the coffee shops in the area they're studying, because that coffee shop is in a central, high-traffic commuter area and serves people across a wide range of education levels, genders, occupations, wealth, ethnicities, and sexual orientations.
Snowball sampling is a type of non-probability sampling used when members of the desired population are hard to find, such as patients with a rare disease, holders of an opinion that few like to publicly profess, or rare species. Researchers might ask members of the sample to recommend others to be surveyed, or follow a member of a rare species to find its kin. This methodology potentially introduces bias, as the sample may not represent a truly random section of the overall population, but it has the benefit of reducing costs significantly.
Sampling Errors
Errors in sampling usually manifest themselves as selection bias or random sampling error. Such errors can occur
- when the researchers conducting the sampling have accidentally or deliberately constructed the study to be misrepresentative of the total population, or
- because the researchers used a truly random methodology and that random sample happened to not be representative of the overall population.
The key here is that there is an error because the sample does not represent the population.
Other sorts of errors are a product not of design, but of execution. These include things like non-responses by surveyed members of the population, data entry and processing errors, and measurement errors that occur when questions posed to respondents are misunderstood or tests conducted on sample members do not measure what they were intended to measure. For instance, "How much do you like this product?" seems like a good question, but may not actually collect the data the surveyor wanted. It may reflect how unwilling the respondent is to say anything negative, or it might inflate small issues they have with the product. A better question would be, "How likely are you to recommend this product to a friend or family member?" which is commonly used in surveying as the basis of the Net Promoter Score. The key with these errors is that even though the sample is random, the data processed may not accurately reflect the sample or the population.
In both cases, it can be difficult to determine if the sample is statistically valid. One solution is to reverse-engineer the results to see if they match the intended methodology (not necessarily if they match any particular result). For instance, one could see a sample with 75% women and think this is an error (that it should be closer to 50%). They could then test what the probability is that their sampling methodology produces that result.
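As a rough illustration of that check, the sketch below asks how likely a simple random sample is to contain at least $75\%$ women when the population is $50\%$ women. The sample sizes ($20$ and $100$) and the function name `prob_at_least` are assumptions; the text does not specify them.

```python
from math import comb

def prob_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): k or more 'successes' in n draws."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Assumed sample sizes; the population is taken to be 50% women.
small = prob_at_least(15, 20, 0.5)    # at least 15 women out of 20
large = prob_at_least(75, 100, 0.5)   # at least 75 women out of 100
print(f"n=20:  {small:.4f}")
print(f"n=100: {large:.2e}")
```

With $20$ respondents the result is unusual but not implausible (roughly a $2\%$ chance), whereas with $100$ respondents it is so improbable that the methodology, rather than chance, is the more likely explanation.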
Calculating Sampling Error
Sampling error can also refer to a calculation of error in a sample, also known as the margin of error. For instance, most published scientific studies that involve experimentation report a margin of error. They might say that at a $95\%$ confidence level the margin of error is $5\%$, meaning that they can say with $95\%$ confidence that the true value lies within $\pm 5$ percentage points of their result.
The basic formula for the margin of error is $z\sqrt{\frac{p(1-p)}{n}},$ where $z$ is the z-score for the study's confidence level, sometimes written as $z^*$ and called the critical value; $p$ is the proportion of the sample that has the characteristic being tested for; and $n$ is the sample size. Sometimes this formula is expressed as $1.96 \sqrt{\frac{p(1-p)}{n}},$ which assumes a confidence level of $95\%$.
Suppose you are a researcher who has surveyed a sample of 1,000 respondents to determine whether they approve of their congressman. You use a confidence level of 95%, and 700 respondents say they do not approve of their congressman. What is your margin of error?
We have $z\sqrt{\frac{p(1-p)}{n}} = 1.96 \sqrt{\frac{(0.700)(0.300)}{1000}} \approx 0.0284 = 2.84\%,$ which means that you can say, with 95% confidence, that 70% of Americans do not approve of their congressman, plus or minus 2.84 percentage points. $_\square$
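The same arithmetic can be checked with a short script. The function name `margin_of_error` is illustrative, and the default $z = 1.96$ assumes a $95\%$ confidence level.

```python
from math import sqrt

def margin_of_error(p, n, z=1.96):
    """Margin of error z * sqrt(p * (1 - p) / n) for a sample proportion p."""
    return z * sqrt(p * (1 - p) / n)

# 700 of 1,000 respondents disapprove, so p = 0.70 and n = 1000.
moe = margin_of_error(0.70, 1000)
print(f"{moe:.2%}")  # 2.84%

# The resulting interval around the sample proportion.
low, high = 0.70 - moe, 0.70 + moe
print(f"{low:.3f} to {high:.3f}")
```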
A $z$-score table for other confidence levels is below:
Confidence Level | $z$-score |
80% | 1.28 |
90% | 1.645 |
95% | 1.96 |
99% | 2.58 |
99.9% | 3.29 |
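Critical values like these can be recomputed from the standard normal distribution. The sketch below uses `NormalDist` from Python's standard `statistics` module and assumes a two-sided interval; the function name `critical_value` is illustrative.

```python
from statistics import NormalDist

def critical_value(confidence_level):
    """Two-sided critical value z* for a confidence level such as 0.95."""
    # The central region holds `confidence_level` of the probability,
    # so each tail holds half of what remains.
    upper_tail = (1 - confidence_level) / 2
    return NormalDist().inv_cdf(1 - upper_tail)

for level in (0.80, 0.90, 0.95, 0.99, 0.999):
    print(f"{level:.1%}: {critical_value(level):.3f}")
```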
The margin of error increases significantly as you decrease the sample size or increase the confidence level.
Suppose you are a pollster for a major political organization and you want to say with 99.9% certainty that the majority of the country disapproves of the current administration, but you don't have very much time.
So you conduct a study with $50$ respondents. $35$ of them say they do not approve. This seems like great news: $70\%$ of the sample does not approve. Even if these $50$ are a truly random sample of the population, what is your margin of error at a $99.9\%$ confidence level?
Express your answer as a decimal. For instance, $10\%$ would be $0.1.$