
Practical Statistics for Data Scientists: Sampling, Bias, and Sampling Distributions (Central Limit Theorem)


Book Source: Practical Statistics for Data Scientists, 2nd Edition, by Peter Bruce, Andrew Bruce, Peter Gedeck / Released May 2020 / Publisher(s): O’Reilly Media, Inc.

A common misunderstanding is that the era of big data renders sampling unnecessary. In reality, the flood of data, much of it inconsistent in quality and relevance, makes sampling more important than ever: it is an essential tool for working effectively with diverse datasets and for minimizing bias.

Traditional statistics focused on the population, using theories that rest on strong assumptions about that population. Modern statistics, by contrast, has moved to the study of samples, which does not require those assumptions.

Random Sampling and Sample Bias

Key Terms for Random Sampling

  • Sample
    • A subset from a more extensive data set.
  • Population
    • The more extensive data set or idea of a data set.
  • n (N)
    • The size of the sample (population).
  • Random Sampling
    • Drawing elements into a sample at random.
  • Stratified Sampling
    • Dividing the population into strata and randomly sampling from each stratum.
  • Stratum (pl., strata)
    • A homogeneous subgroup of a population with common characteristics.
  • Simple Random Sample
    • The sample results from random sampling without stratifying the population.
  • Bias
    • Systematic error.
  • Sample Bias
    • A sample that misrepresents the population.

Self-selection samples are not always a reliable gauge of the actual situation, but they can be useful for comparing similar establishments, since the same self-selection bias affects each of them.

Bias

Statistical bias refers to errors in measurement or sampling arising from the systems used or the measurement and sampling processes themselves.

It is crucial to differentiate between errors arising from random chance and those stemming from bias.

Bias comes in various forms; it may be visible or hidden. When a result does suggest bias, for example through comparison with a benchmark or with actual values, it is often a sign that a statistical or machine learning model has been misspecified or that an important variable has been omitted.

Random Selection

Random sampling is harder than it sounds, because real situations and conditions vary widely. It is therefore crucial to define the accessible population precisely. Suppose we need to conduct a pilot customer survey:

  1. Define who a customer is (e.g., those who make purchases once a month, twice a month, etc.).
  2. Specify a sampling procedure (e.g., randomly select 100 customers or visitors from 10 am to 10 pm).

In stratified sampling, the population is divided into strata, and random samples are drawn from each stratum; a short sketch of both approaches follows.
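
As a quick sketch (assuming a hypothetical pandas DataFrame of customers in which a `region` column defines the strata), the two sampling approaches might look like this:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical customer data: three regions of very different sizes
customers = pd.DataFrame({
    "customer_id": range(1000),
    "region": rng.choice(["north", "south", "west"], size=1000, p=[0.7, 0.2, 0.1]),
})

# Simple random sample: every customer has the same chance of selection
simple_sample = customers.sample(n=100, random_state=42)

# Stratified sample: draw 10% from each region, so every stratum is represented
stratified_sample = customers.groupby("region").sample(frac=0.1, random_state=42)

print(simple_sample["region"].value_counts())
print(stratified_sample["region"].value_counts())
```

The stratified draw guarantees that even the smallest region contributes its proportional share, whereas a simple random sample may under-represent it by chance.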

Size vs. Quality

  • Surprisingly, smaller is sometimes better: with a smaller sample, each record can be examined more carefully, and outliers, which may contain helpful information, can be investigated individually.
  • A larger dataset pays off when the data is large but sparse, so that enough relevant records are captured.
    • We must also evaluate the costs and effectiveness of managing big data.
    • Data quality is typically much more important than data quantity, and random sampling can help lower bias and promote quality enhancements that would otherwise be excessively challenging or expensive.

Sample Mean vs. Population Mean

  • $\bar{x}$ (“x-bar”) represents the mean of a sample from a population, while μ denotes the mean of a population.
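
A minimal numeric illustration of the distinction, using synthetic data so that the full population (and hence μ) is actually known:

```python
import numpy as np

rng = np.random.default_rng(0)

population = rng.normal(loc=50, scale=10, size=100_000)   # the whole population
mu = population.mean()                                    # population mean, mu

sample = rng.choice(population, size=100, replace=False)  # one random sample
x_bar = sample.mean()                                     # sample mean, x-bar

print(f"mu = {mu:.2f}, x-bar = {x_bar:.2f}")  # x-bar approximates mu, with sampling error
```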

Selection Bias

Selection bias occurs when data is chosen selectively, resulting in misleading or temporary conclusions.

Key Terms for Selection Bias

  • Selection Bias
    • Bias that results from the way in which observations are selected.
  • Data Snooping
    • Extensive hunting through data in search of something interesting.
  • Vast Search Effect
    • Bias or unreliability resulting from repeated data modeling, or from modeling data with large numbers of predictor variables.

Because repeated analysis of large datasets is central to data science, selection bias deserves particular attention. If you keep testing different models and asking different questions of a large dataset, you are bound to find something interesting. But is the discovery genuinely significant, or just a chance outlier?

Regression To the Mean

Regression to the mean describes a phenomenon involving consecutive measurements of a specific variable: extreme observations are likely to be succeeded by more central ones.

  • This outcome arises from a particular form of selection bias: when we pick the rookie with the best performance, skill and luck have both contributed. In the following season the skill remains, but the luck usually does not, so performance declines; it regresses. The small simulation below makes this concrete.
  • Regression, which literally means “to go back,” should not be confused with the statistical modeling method of linear regression, in which a linear relationship is estimated between predictor variables and an outcome variable.
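
A small simulation, under the toy assumption that observed performance is skill plus luck (all quantities made up for illustration), shows why the top performer of one season tends to fall back in the next:

```python
import numpy as np

rng = np.random.default_rng(1)

n_players = 500
skill = rng.normal(0, 1, n_players)            # persistent component
season1 = skill + rng.normal(0, 1, n_players)  # skill plus this season's luck
season2 = skill + rng.normal(0, 1, n_players)  # same skill, fresh luck

rookie = season1.argmax()  # "rookie of the year" by season-1 performance
print(f"season 1: {season1[rookie]:.2f}, season 2: {season2[rookie]:.2f}")
# Season 2 is almost always worse for this player: the extreme season-1
# score reflected good luck as well as skill, and luck does not carry over.
```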

Sampling Distribution of a Statistic

The sampling distribution of a statistic is defined as the distribution of a sample statistic across numerous samples taken from the population.

Key Terms for Sampling Distribution

  • Sample Statistic
    • A measurement derived from a sample of data collected from a larger population.
  • Data Distribution
    • The frequency distribution of individual values within a data set.
  • Sampling Distribution
    • The frequency distribution of a sample statistic across numerous samples or resamples.
  • Central Limit Theorem
    • The sampling distribution assumes a normal shape as the sample size increases.
  • Standard Error
    • The variability (standard deviation) of a sample statistic across many samples; not to be confused with the standard deviation, which refers to the variability of individual data values.

A sample is typically taken in order to measure something or to model something. Because the estimate or model is based on a sample, it may be in error, and it would come out differently if we drew a different sample. How different it might come out is the key question; this variation is called sampling variability. If we had abundant data, we could draw additional samples and observe the distribution of the sample statistic directly. In practice, however, we usually use as much data as is available for the estimate or model itself, so drawing extra samples from the population is rarely an option.

The distribution of a sample statistic such as the mean is usually more regular and bell-shaped than the distribution of the data itself. The larger the sample the statistic is based on, the more this holds, and the narrower the sampling distribution becomes. The simulation below illustrates both effects.
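
A short simulation makes this concrete, using a right-skewed synthetic population in place of the income data the book uses: histograms of sample means become more bell-shaped and narrower as the sample size grows.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(2)
population = rng.exponential(scale=1.0, size=100_000)  # strongly right-skewed

fig, axes = plt.subplots(1, 3, figsize=(12, 3), sharex=True)
for ax, n in zip(axes, [1, 5, 20]):
    # 1,000 sample means, each from a sample of size n (n=1 is the raw data)
    idx = rng.integers(0, population.size, size=(1000, n))
    ax.hist(population[idx].mean(axis=1), bins=40)
    ax.set_title(f"means of samples of size {n}")
plt.tight_layout()
plt.show()
```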



Central Limit Theorem

The central limit theorem states that the means drawn from multiple samples resemble the familiar bell-shaped normal curve, even when the source population is not normally distributed, provided the sample size is large enough and the departure of the data from normality is not too great. The theorem is what allows normal-approximation formulas, such as the t-distribution, to be used in calculating sampling distributions for inference, i.e., confidence intervals and hypothesis tests.

Standard Error

The standard error is a single metric that summarizes the variability in the sampling distribution of a statistic. It can be estimated from the sample standard deviation $s$ and the sample size $n$; as the sample size increases, the standard error decreases.

$\text{Standard Error} = SE = \frac{s}{\sqrt{n}}$
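
As a quick worked example with made-up numbers: a sample of $n = 400$ values with standard deviation $s = 20$ gives

$SE = \frac{20}{\sqrt{400}} = \frac{20}{20} = 1$

Note the square root in the denominator: to cut the standard error in half, the sample size must be quadrupled.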

The validity of the standard error formula follows from the central limit theorem. However, you do not need the theorem to understand the concept; consider the following approach to measuring the standard error:

  1. Gather several fresh samples from the population.

  2. For each sample, compute the statistic (like the mean).

  3. Determine the standard deviation of the statistics obtained in the previous step; this will serve as your standard error estimate.
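
A direct translation of these steps into code, sketched with a synthetic population (in practice we would not have the population itself, which is exactly the problem the next paragraph addresses):

```python
import numpy as np

rng = np.random.default_rng(3)
population = rng.exponential(scale=1.0, size=100_000)  # stand-in population

n = 100
# Steps 1-2: draw many fresh samples and compute each one's mean
sample_means = [rng.choice(population, size=n).mean() for _ in range(1000)]

# Step 3: the standard deviation of those means estimates the standard error
print(f"empirical SE:     {np.std(sample_means):.4f}")
# Compare with the formula (using the population sigma, known here by construction)
print(f"sigma / sqrt(n):  {population.std() / np.sqrt(n):.4f}")
```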

In practice, collecting new samples to estimate the standard error is typically not feasible, and it is statistically wasteful. Fortunately, we do not have to collect completely new samples; bootstrap resamples, sketched below, provide a valid alternative.
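
A minimal bootstrap sketch along these lines (synthetic data; one observed sample, resampled with replacement):

```python
import numpy as np

rng = np.random.default_rng(4)
sample = rng.exponential(scale=1.0, size=100)  # the one sample we actually have

# Bootstrap: resample WITH replacement from the sample itself, many times,
# and treat the spread of the resampled means as the standard error
boot_means = [
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(1000)
]

print(f"bootstrap SE: {np.std(boot_means):.4f}")
print(f"formula SE:   {sample.std(ddof=1) / np.sqrt(sample.size):.4f}")
```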


