Day113 - STAT Review: Data & Sampling Distributions (3)
Practical Statistics for Data Scientists: t-Dist, Binomial, Chi-Square, F-Dist, and Poisson Distribution.
Long-Tailed Distributions
Although the normal distribution has historically been significant in statistics, data is typically not normally distributed, contrary to what the name "normal" might imply.
Key Terms for Long-Tailed Distributions
- Tail
- The extended narrow portion of a frequency distribution, where relatively extreme values occur at low frequency.
- Skew
- Where one tail of a distribution is longer than the other.
Key Essentials
Image Source: Medium- Solving For the Long Tail of Intent Distribution by Cobus Greyling

Although the normal distribution is frequently suitable and beneficial for error and sample statistics, it generally doesn't accurately represent the raw data distribution.
Distributions can be skewed (asymmetric), like income data, or discrete, like binomial data. Both symmetric and asymmetric distributions may have long tails reflecting extreme values. Recognizing and guarding against long tails is essential in practical work.
When long-tailed data is compared with a normal distribution (for example, in a QQ-plot), the observed values are far lower than expected at the low end and far higher than expected at the high end. The data is therefore not normally distributed, and extreme values are more likely than a normal distribution would predict.
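As a rough illustration of the point about skewed, long-tailed data, the sketch below simulates income-like values and compares their upper tail with what a normal distribution would predict. The lognormal generator, its parameters, and the variable names are assumptions chosen for illustration; they are not taken from the book or the original post.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical income-like data: lognormal values are right-skewed with a long upper tail.
incomes = [random.lognormvariate(0, 1) for _ in range(10_000)]

mean = statistics.mean(incomes)
median = statistics.median(incomes)
sd = statistics.pstdev(incomes)

# Share of observations more than 3 standard deviations above the mean.
cutoff = mean + 3 * sd
tail_share = sum(x > cutoff for x in incomes) / len(incomes)

# A normal distribution puts only about 0.135% of its mass beyond +3 sd.
normal_share = 1 - statistics.NormalDist().cdf(3)

print(f"mean={mean:.2f}, median={median:.2f}")
print(f"share beyond mean+3sd: data={tail_share:.4f}, normal={normal_share:.4f}")
```

A mean well above the median signals right skew, and a tail share far above the normal benchmark signals a long tail.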
Student’s t-Distribution
The t-distribution resembles a normal distribution with thicker tails. It’s commonly used for sample statistics. Sample means typically follow a t-distribution that changes shape with sample size, approaching a normal shape as size increases.
Key Terms for Student’s t-Distribution
- n
- Sample Size
- Degrees of Freedom
- A parameter that allows the t-distribution to adjust to different sample sizes, statistics, and number of groups.
“What is the sampling distribution of the mean of a sample drawn from a larger population?” The t-distribution is valuable for answering this question.
Image Source: Gosset’s resampling experiment results and fitted t-curve (from his 1908 Biometrika paper)

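Different statistics can be compared, after standardization, to the t-distribution to estimate confidence intervals that account for sampling variation. Consider a sample of size $n$, for which the sample mean $\bar{x}$ has been calculated. If $s$ is the sample standard deviation, a 90% confidence interval around the sample mean is $\bar{x} \pm t_{n-1}(0.05) \cdot \frac{s}{\sqrt{n}}$, where $t_{n-1}(0.05)$ is the value of the t-statistic with $n-1$ degrees of freedom that cuts off 5% of the t-distribution at either end.
The t-distribution has been used as a reference for the distribution of a sample mean, the difference between two sample means, regression parameters, and other statistics.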
Additionally, the t-distribution’s accuracy in reflecting the behavior of **a sample statistic requires that the distribution of that statistic for the sample approximates a normal distribution.**
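The standard t-based interval for a sample mean, $\bar{x} \pm t_{n-1}(0.05) \cdot s/\sqrt{n}$ for 90% confidence, can be sketched as follows. The sample values are hypothetical, and the variable names are only for illustration.

```python
import math
import statistics
from scipy import stats

# Hypothetical sample (e.g., session lengths in minutes); not from the original post.
sample = [12.1, 9.8, 11.4, 10.3, 13.0, 8.7, 10.9, 11.8, 9.5, 12.4]

n = len(sample)
x_bar = statistics.mean(sample)
s = statistics.stdev(sample)  # sample standard deviation (n - 1 denominator)

# t-value that cuts off 5% in the upper tail, with n - 1 degrees of freedom.
t_crit = stats.t.ppf(0.95, df=n - 1)

margin = t_crit * s / math.sqrt(n)
ci_low, ci_high = x_bar - margin, x_bar + margin
print(f"90% CI for the mean: ({ci_low:.2f}, {ci_high:.2f})")
```

The interval is only as trustworthy as the assumption that the sample mean is approximately normally distributed.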
T-Distribution vs. Normal Distribution
Image & Explanation Source: Investopedia: t-Distribution

The normal distribution is used when the population is assumed to be normally distributed. The t-distribution resembles the normal distribution but has fatter tails; both assume a normally distributed population. Because of its fatter tails, the t-distribution has higher kurtosis than the normal distribution, so values far from the mean are more likely under a t-distribution than under a normal distribution.
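One way to see the fatter tails, and how they shrink as the sample grows, is to compare critical values. The sketch below uses `scipy.stats`; the particular degrees of freedom are arbitrary choices for illustration.

```python
from scipy import stats

# Two-sided 95% critical value of the standard normal distribution (about 1.96).
z_crit = stats.norm.ppf(0.975)

# The matching t critical value shrinks toward z_crit as degrees of freedom grow.
t_crits = {df: stats.t.ppf(0.975, df) for df in (2, 5, 30, 1000)}

for df, t_crit in t_crits.items():
    print(f"df={df:>4}: t={t_crit:.3f}  (normal: {z_crit:.3f})")
```

A larger critical value means a wider interval is needed to capture the same probability, which is the practical effect of the heavier tails at small sample sizes.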
Limitations of Using a t-Distribution
The t-distribution can be less exact than the normal distribution, although this shortcoming matters only when perfect normality is required. It should be used only when the population standard deviation is unknown. Conversely, if the population standard deviation is known and the sample size is large enough, the normal distribution should be used for more reliable results.
Binomial Distribution
Yes/no outcomes are fundamental to analytics as they reflect decisions, such as buy/don’t buy or click/don’t click. The binomial distribution involves a series of trials, each with two possible outcomes and specific probabilities.
In statistics, it is customary to refer to the outcome “1” as the success outcome, and it is also common to assign “1” to the less frequent outcome.
Key Terms for Binomial Distribution
- Trial
- An event with a discrete outcome (e.g., a coin flip).
- Success
- The outcome of interest for a trial
- = $1$ (as opposed to “$0$”)
- Binomial
- Having two outcomes
- = yes/no, 0/1, binary
- Binomial Trial
- A trial with two outcomes
- = Bernoulli Trial
- Binomial Distribution
- Distribution of the number ($x$) of successes in $n$ trials, each with a specified probability of success ($p$).
- = Bernoulli Distribution
Example Code Snippet
Q: If the probability of a click resulting in a sale is 0.02, what is the probability of having no sales in 200 clicks?
- In R, `dbinom` calculates binomial probabilities. For the question above, we want $x = 0$ successes in $n = 200$ trials:
  `dbinom(x=0, size=200, p=0.02)`
  It would return 0.0176.
- Often, we are interested in the probability of $x$ or fewer successes in $n$ trials. In this case, we use `pbinom`:
  `pbinom(2, 5, 0.1)`
  It would return 0.9914, the probability of observing two or fewer successes in five trials, where the probability of success for each trial is 0.1.
- In Python, the `scipy.stats` module offers various statistical distributions. Use `stats.binom.pmf` and `stats.binom.cdf` for the binomial distribution:
  `stats.binom.pmf(2, n=5, p=0.1)`
  `stats.binom.cdf(2, n=5, p=0.1)`
Key Essentials
- The mean of binomial distribution is: $n \times p$
- The variance is: $n \times p(1-p)$
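The mean and variance formulas can be checked directly from the binomial probabilities. The sketch below uses only the Python standard library; the helper names `binom_pmf` and `binom_cdf` are hypothetical and simply mirror what `dbinom` and `pbinom` compute.

```python
from math import comb

def binom_pmf(x, n, p):
    """Probability of exactly x successes in n trials."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

def binom_cdf(x, n, p):
    """Probability of x or fewer successes in n trials."""
    return sum(binom_pmf(k, n, p) for k in range(x + 1))

n, p = 5, 0.1
mean = sum(k * binom_pmf(k, n, p) for k in range(n + 1))
variance = sum((k - mean) ** 2 * binom_pmf(k, n, p) for k in range(n + 1))

print(round(binom_pmf(0, 200, 0.02), 4))  # no sales in 200 clicks
print(round(binom_cdf(2, n, p), 4))       # two or fewer successes in five trials
print(mean, n * p)                        # both close to 0.5
print(variance, n * p * (1 - p))          # both close to 0.45
```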
Chi-Square Distribution
A key statistical concept is departure from expectation, especially with respect to category counts. Expectation here means “nothing unusual in the data” (no correlations or patterns) and is called the “null hypothesis” or “null model.” The chi-square statistic measures how far the observed counts depart from this expectation, for example under the null hypothesis of independence. For each category, take the difference between the observed and expected values, divide it by the square root of the expected value, and square the result; then sum across all categories: $\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$. This standardizes the statistic so that it can be compared with a reference distribution. More generally, the chi-square statistic measures how well a set of observed values fits a specified distribution (a “goodness-of-fit” test). It is useful for determining whether multiple treatments (an “A/B/C… test”) differ in their effects.
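As a minimal sketch of a goodness-of-fit calculation, the example below computes the chi-square statistic by hand and compares it with `scipy.stats.chisquare`. The click counts are hypothetical and assume three headlines that are expected to perform equally under the null model.

```python
from scipy import stats

# Hypothetical click counts for three headlines (A, B, C), with 20 expected for
# each under the null model that all headlines perform the same.
observed = [30, 14, 16]
expected = [20, 20, 20]

# Chi-square statistic: squared difference over expected, summed across categories.
chi2_manual = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# scipy computes the same statistic and a p-value from the chi-square
# distribution with len(observed) - 1 degrees of freedom.
chi2_scipy, p_value = stats.chisquare(observed, f_exp=expected)

print(chi2_manual, chi2_scipy, p_value)
```

A small p-value indicates that counts this far from the expected values would be unusual if the null model were true.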
F-Distribution
In scientific experimentation, testing multiple treatments (such as various fertilizers on field blocks) resembles the A/B/C test described for the chi-square distribution, but it uses measured continuous values instead of counts. In this case, we are interested in the extent to which differences among group means are greater than we might expect under normal random variation.
The F-statistic measures this and is the ratio of the variability among the group means to the variability within each group. This comparison is termed an analysis of variance (ANOVA).
The distribution of the F-statistic represents the frequency distribution of all values generated by randomly permuting data where all group means are equal. There are various F-distributions linked to different degrees of freedom.
The F-statistic is also used in linear regression to compare the variation explained by the regression model to the total variation in the data. (In R and Python, it is produced automatically as part of regression and ANOVA routines.)
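The ratio behind the F-statistic can be sketched by hand and checked against `scipy.stats.f_oneway`. The three groups of measurements below are hypothetical, and the variable names are only for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical measurements for three treatments; not from the original post.
groups = [
    np.array([164.0, 172.0, 177.0, 156.0, 195.0]),
    np.array([178.0, 191.0, 182.0, 185.0, 177.0]),
    np.array([175.0, 193.0, 171.0, 163.0, 176.0]),
]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k, n_total = len(groups), len(all_values)

# Variability among the group means (between-group mean square).
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ms_between = ss_between / (k - 1)

# Variability within each group (within-group mean square).
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
ms_within = ss_within / (n_total - k)

f_manual = ms_between / ms_within
f_scipy, p_value = stats.f_oneway(*groups)
print(f_manual, f_scipy, p_value)
```

Note that `f_oneway` takes its p-value from the theoretical F-distribution with $k-1$ and $N-k$ degrees of freedom rather than from a permutation procedure.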
Poisson and Related Distributions
Many processes produce events randomly at a given overall rate—like visitors arriving at a website or cars arriving at a toll plaza (events spread over time).
Key Terms for Poisson and Related Distributions
- Lambda
- The rate (per unit of time or space) at which events occur.
- Poisson Distribution
- The frequency distribution of the number of events in sampled units of time or space.
- Exponential Distribution
- The frequency distribution of the time or distance from one event to the next event.
- Weibull Distribution
- A generalized version of the exponential distribution in which the event rate is allowed to shift over time.
Poisson Distributions
We can estimate the average number of events per unit of time or space from prior aggregate data. We may also want to know how much this number varies from one unit of time or space to another.
The Poisson distribution tells us the distribution of events per unit of time or space when we sample many such units.
It answers questions like, “How much capacity do we need to be 95% sure of fully processing the internet traffic that arrives on a server in any five-second period?”
The essential parameter in a Poisson Distribution is $\lambda$ (lambda), representing the average number of events occurring in a specific time or space interval. Additionally, the variance of a Poisson Distribution is equal to $\lambda$ as well.
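A capacity question of this kind can be sketched with the Poisson quantile function in `scipy.stats`. The rate of 2 requests per five-second period is a hypothetical value chosen for illustration.

```python
from scipy import stats

lam = 2  # hypothetical average number of requests per five-second period

# Smallest capacity c with P(X <= c) >= 0.95 for X ~ Poisson(lam).
capacity = int(stats.poisson.ppf(0.95, lam))

# Mean and variance of a Poisson distribution are both equal to lambda.
mean, variance = stats.poisson.stats(lam, moments="mv")

print(capacity, stats.poisson.cdf(capacity, lam))
print(float(mean), float(variance))
```

With this rate, a capacity of 4 covers only about 94.7% of periods, so the 95% target requires a capacity of 5.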
- In R, the `rpois` function generates random numbers from a Poisson distribution:
  `rpois(100, lambda=2)`
- In Python, we use the `stats.poisson.rvs` function from `scipy`:
  `stats.poisson.rvs(2, size=100)`
This code generates 100 random numbers from a Poisson distribution with λ=2. For instance, if customer service calls average two per minute, it simulates 100 minutes, showing the number of calls each minute.
Exponential Distribution
Using the same parameter $\lambda$ as in the Poisson distribution, we can also model the distribution of the time between events, such as the time between visits to a website or between cars arriving at a toll plaza.
- In R:
  `rexp(n=100, rate=0.2)`
- In Python, the `scipy` implementation specifies the exponential distribution using `scale` instead of rate. Since `scale` is the inverse of the rate, the two calls below are equivalent:
  `stats.expon.rvs(scale=1/0.2, size=100)`
  `stats.expon.rvs(scale=5, size=100)`
This code generates 100 random numbers from an exponential distribution with a rate of 0.2 events per time period, that is, a mean of 5 time periods between events. It simulates 100 intervals, in minutes, between service calls that arrive at an average rate of 0.2 per minute.
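The link between the rate and the mean gap can be checked with a small simulation. This sketch uses the standard library's `random.expovariate`, which takes the rate directly; the seed and sample size are arbitrary choices.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

rate = 0.2  # events per minute, as in the example above
gaps = [random.expovariate(rate) for _ in range(10_000)]  # minutes between events

# The average gap should be close to 1 / rate = 5 minutes.
mean_gap = statistics.mean(gaps)

# Dividing the number of events by the total elapsed time recovers the rate.
events_per_minute = len(gaps) / sum(gaps)

print(f"mean gap: {mean_gap:.2f} minutes, events per minute: {events_per_minute:.3f}")
```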
A key assumption in simulation studies for Poisson or exponential distributions is that the rate, λ, remains constant over the period being considered. This is rarely reasonable in a global sense; for instance, traffic on roads or data networks varies by time of day and day of week. However, periods or areas can often be segmented into sufficiently homogeneous parts, allowing valid analysis or simulation within them.
Estimating the Failure Rate
In many applications, the event rate, $\lambda$, is known or can be estimated from prior data. For rare events, however, this is not necessarily so.
Aircraft engine failures are notably rare, which means there may be limited data available to estimate the time between failures for a specific engine type. Nevertheless, we can make some educated guesses: if no failures occur after 20 hours, it is reasonable to conclude that the failure rate is not one per hour. By using simulations or directly calculating probabilities, we can evaluate various hypothetical event rates and identify threshold values beyond which the rate is very unlikely to be consistent with the observed data.
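The direct calculation can be sketched as follows, assuming failures follow a Poisson process with a constant rate, so that the chance of no failure in $t$ hours is $e^{-\lambda t}$. The candidate rates and the 5% cutoff are illustrative assumptions, not values from the original post.

```python
import math

hours_observed = 20  # failure-free hours, as in the example above

def prob_no_failure(rate, hours=hours_observed):
    """P(no events in `hours`) when events follow a Poisson process with `rate` per hour."""
    return math.exp(-rate * hours)

for rate in (1.0, 0.5, 0.2, 0.1, 0.05):
    print(f"rate={rate:<4} per hour -> P(no failure in 20 h) = {prob_no_failure(rate):.4g}")

# Largest rate that still leaves a 5% chance of seeing no failures in 20 hours.
# The 5% cutoff is an illustrative choice, not something fixed by the method.
threshold_rate = -math.log(0.05) / hours_observed
print(f"rates above about {threshold_rate:.3f} per hour are hard to reconcile with the data")
```

A rate of one failure per hour gives a probability of roughly two in a billion for 20 failure-free hours, which is why that rate can be ruled out.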
Weibull Distribution
When the event rate varies during an interval, exponential (or Poisson) distributions become ineffective. This is often true in the context of mechanical failure, as the likelihood of failure grows over time. The Weibull distribution is an extension of the exponential distribution in which the event rate is allowed to change.
- In R, three arguments are required to implement the Weibull distribution: `n` (the number of values to be generated), `shape`, and `scale`.
  `rweibull(100, 1.5, 5000)`
  This code generates 100 random values from a Weibull distribution, with a shape parameter of 1.5 and a characteristic life of 5,000.
- In Python, we use `stats.weibull_min.rvs` as below.
  `stats.weibull_min.rvs(1.5, scale=5000, size=100)`
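To see what the shape parameter does, the sketch below evaluates the Weibull hazard (the instantaneous event rate), $h(t) = \frac{\beta}{\eta}\left(\frac{t}{\eta}\right)^{\beta - 1}$, where $\beta$ is the shape and $\eta$ is the scale. The symbols, the function name `weibull_hazard`, and the time points are assumptions for illustration.

```python
def weibull_hazard(t, shape, scale):
    """Instantaneous event rate at time t for a Weibull distribution."""
    return (shape / scale) * (t / scale) ** (shape - 1)

scale = 5000  # characteristic life, as in the example above

for t in (1000, 5000, 10000):
    constant = weibull_hazard(t, shape=1.0, scale=scale)    # exponential case
    increasing = weibull_hazard(t, shape=1.5, scale=scale)  # wear-out case
    print(f"t={t:>5}: shape=1.0 -> {constant:.6f}, shape=1.5 -> {increasing:.6f}")
```

With a shape of 1 the rate stays constant, which is the exponential case. A shape above 1 makes the rate grow over time, as with mechanical wear, and a shape below 1 makes it decline.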