Day116 - STAT Review: Statistical Experiments and Significance Testing (3)
Practical Statistics for Data Scientists: t-Tests, Multiple Testing, False Discovery, and Degrees of Freedom
t-Tests
Different types of significance tests exist based on the nature of the data—whether it is count data or measured data—along with the number of samples and the measurement focus. A frequently used test is the t-test, based on Student’s t-distribution, which was originally developed to approximate the distribution of a single sample mean.
Key Terms for t-Tests
- Test Statistic
- A measure of the difference or effect of interest.
- t-statistic
- A standardized version of common test statistics such as means.
- t-distribution
- A reference distribution (derived, in this case, from the null hypothesis) to which the observed t-statistic can be compared.
All significance tests require a defined test statistic that quantifies the effect of interest and helps determine whether the observed effect falls within the range of typical random variation. (In a resampling test, the scale of the data is not an issue. We generate the reference (null hypothesis) distribution directly from the data and use the test statistic in its original form.)
A classic statistics text would, at this point, show formulas that incorporate Gosset’s distribution and demonstrate how to standardize the data so it can be compared with the standard t-distribution.
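For reference (a standard textbook form, not taken verbatim from the book), the two-sample t-statistic standardizes the observed difference in group means by its estimated standard error. In the Welch form, which does not assume equal variances and matches the `equal_var=False` setting in the Python example below, it is

$$
t = \frac{\bar{x}_A - \bar{x}_B}{\sqrt{\dfrac{s_A^2}{n_A} + \dfrac{s_B^2}{n_B}}},
$$

where $\bar{x}$, $s^2$, and $n$ denote the sample mean, sample variance, and sample size of each group.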
We continue with the session-times-per-page example from the previous post.
- In R, the function is `t.test`:

```r
t.test(Time ~ Page, data=session_times, alternative='less')

	Welch Two Sample t-test

data:  Time by Page
t = -1.0983, df = 27.693, p-value = 0.1408
alternative hypothesis: true difference in means is less than 0
95 percent confidence interval:
     -Inf 19.59674
sample estimates:
mean in group Page A mean in group Page B
            126.3333             162.0000
```
- In Python, the function `scipy.stats.ttest_ind` can be used:

```python
from scipy import stats

res = stats.ttest_ind(session_times[session_times.Page == 'Page A'].Time,
                      session_times[session_times.Page == 'Page B'].Time,
                      equal_var=False)
print(f'p-value for single sided test: {res.pvalue / 2:.4f}')
```
The alternative hypothesis is that the mean session time for page A is shorter than that for page B. The p-value of $0.1408$ is reasonably close to the permutation-test p-values of $0.121$ and $0.126$.
In the resampling mode of thinking, we structure the solution to reflect the observed data and the hypothesis being tested, without worrying about the data type, sample balance, variances, or other factors. Before the advent of computers, resampling tests were impractical, so statisticians relied on standard reference distributions.
A test statistic could then be standardized and compared to the reference distribution. One such widely used standardized statistic is the *t-statistic.*
Multiple Testing
A saying in statistics goes: “Torture the data long enough, and it will confess.” This means that if you analyze the data from enough perspectives and ask enough questions, you are almost certain to uncover a statistically significant effect.
For example, suppose you have $20$ predictor variables and one outcome variable, all randomly generated. If you conduct $20$ significance tests at the $\alpha = 0.05$ level, at least one predictor is quite likely to appear (falsely) statistically significant; this mistake is called a Type 1 error. To see why, note that the chance that a single predictor correctly tests nonsignificant is $0.95$, so the chance that all $20$ test nonsignificant is $0.95^{20} \approx 0.36$. The chance that at least one predictor shows a false significant result is therefore $1 - 0.36 = 0.64$. This effect is known as alpha inflation: adding more variables or models increases the chance of something appearing “significant” purely by chance.
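As a quick sanity check of the $0.64$ figure, here is a small simulation of my own (not from the book; the variable names and seed are illustrative) that generates $20$ pure-noise predictors and a random outcome, tests each predictor at $\alpha = 0.05$, and counts how often at least one test comes out “significant”:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)                      # arbitrary seed for reproducibility
n_obs, n_predictors, n_trials = 100, 20, 1000

trials_with_false_positive = 0
for _ in range(n_trials):
    y = rng.normal(size=n_obs)                      # random outcome
    X = rng.normal(size=(n_obs, n_predictors))      # 20 predictors that are pure noise
    pvals = []
    for j in range(n_predictors):
        r, p = stats.pearsonr(X[:, j], y)           # correlation test for predictor j
        pvals.append(p)
    if min(pvals) < 0.05:                           # at least one false "discovery"
        trials_with_false_positive += 1

print(trials_with_false_positive / n_trials)        # roughly 0.64
```

Because the $20$ tests are approximately independent, the simulated fraction lands near $1 - 0.95^{20} \approx 0.64$.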
Key Terms for Multiple Testing
- Type 1 Error
- Incorrectly determining that an effect holds statistical significance.
- False Discovery Rate
- The rate of making a Type 1 error across multiple tests.
- Alpha inflation
- The multiple testing phenomenon in which alpha, the probability of making a Type 1 error, increases as more tests are conducted.
- Adjustment of p-values
- Accounting for doing multiple tests on the same data.
- Overfitting
- Fitting the noise.
In supervised learning, a holdout set allows models to be evaluated on unseen data, which reduces this risk. For statistical and machine learning tasks that do not involve a labeled holdout set, the risk of drawing conclusions from statistical noise remains.
In a clinical trial, for example, we may want to examine the results of the therapy at several stages. In each case, we are asking multiple questions.
Adjustment procedures in statistics tackle this by setting a stricter threshold for significance than usual for a single test. They often involve “dividing the alpha” by the number of tests, resulting in a smaller alpha for each test. The Bonferroni adjustment, for example, divides the alpha by the number of comparisons.
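As a minimal sketch (my own illustration, not code from the book), a Bonferroni adjustment simply compares each p-value against $\alpha / m$, where $m$ is the number of tests:

```python
# Bonferroni adjustment: compare each p-value to alpha / m (m = number of tests)
alpha = 0.05
pvalues = [0.01, 0.04, 0.03, 0.20]        # hypothetical p-values from four tests
m = len(pvalues)

threshold = alpha / m                      # 0.0125 instead of 0.05
significant = [p < threshold for p in pvalues]
print(significant)                         # [True, False, False, False]
```

Libraries such as `statsmodels` (via `statsmodels.stats.multitest.multipletests`) offer this and several other adjustment methods.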
Another adjustment procedure, Tukey’s honest significant difference (Tukey’s HSD), assesses the maximum difference among group means against a benchmark based on the t-distribution. It is roughly equivalent to shuffling all the values together, resampling groups of the same sizes as the originals, and finding the maximum difference among the resampled group means.
However, the issue of multiple comparisons extends beyond structured cases and relates to repeated data “dredging,” leading to the adage about torturing data.
With more data being collected and more journal articles being published, there are many opportunities to uncover something that looks interesting, and multiplicity issues can arise in many forms, as illustrated below.
- Checking for multiple pairwise differences across groups.
- Examining multiple subgroup results (“we found no significant treatment effect overall, but we did find an effect for unmarried individuals”).
- Trying various statistical models.
- Including a wide range of variables in the models.
- Asking numerous different questions (different possible outcomes).
False Discovery Rate
The term “false discovery rate” refers to how often hypothesis tests incorrectly identify significant effects. It became valuable in genomic research with numerous statistical tests in gene sequencing. A false discovery relates to a hypothesis test result (e.g., comparing samples). Researchers sought to set parameters to control the false discovery rate at a specified level.
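As a sketch (my own example, using hypothetical p-values), the Benjamini–Hochberg procedure available in `statsmodels` adjusts a set of p-values so that the false discovery rate is controlled at a chosen level:

```python
from statsmodels.stats.multitest import multipletests

# hypothetical p-values from a batch of hypothesis tests
pvalues = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205]

# Benjamini-Hochberg adjustment controlling the false discovery rate at 5%
reject, pvals_adj, _, _ = multipletests(pvalues, alpha=0.05, method='fdr_bh')
print(reject)       # which tests are still declared significant
print(pvals_adj)    # FDR-adjusted p-values
```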
More research does not always lead to better outcomes, largely because of the overarching challenge of “multiplicity.” In any case, the adjustment procedures for highly defined and structured statistical tests are too specific and inflexible to be of general use to data scientists. For data scientists, the key points are:
- Cross-validation and holdout samples mitigate the risk of illusory predictive models that appear effective due to random chance.
- For procedures without a labeled holdout set, we rely on an awareness that the more we query and manipulate the data, the greater the role chance can play, and on resampling and simulation heuristics that provide random benchmarks against which observed results can be compared.
Degrees of Freedom
The concept, commonly encountered in statistics, applies to statistics derived from sample data and refers to the number of values within a dataset that are free to vary without impacting the overall results.
For instance, if you have the mean of a sample of 10 values, there are 9 degrees of freedom. Knowing 9 sample values allows us to calculate the 10th, which cannot vary independently. The degrees of freedom parameter influences the shape of various probability distributions.
When sample statistics are standardized for use in conventional statistical formulas, degrees of freedom are involved in the standardization calculation to ensure that the standardized data corresponds to the appropriate reference distribution (such as the t-distribution, F-distribution, etc.).
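For example (a small illustration of my own), the sample standard deviation divides by $n - 1$ degrees of freedom rather than $n$; in NumPy this choice is controlled by the `ddof` argument:

```python
import numpy as np

x = np.array([4.0, 8.0, 6.0, 5.0, 3.0])    # a small sample of n = 5 values

# Population formula divides by n; the sample formula divides by n - 1 (degrees of freedom)
print(np.std(x, ddof=0))   # divides by n     -> about 1.72
print(np.std(x, ddof=1))   # divides by n - 1 -> about 1.92
```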
The concept of degrees of freedom underlies the conversion of categorical variables into $n - 1$ indicator or dummy variables during regression analysis (to prevent multicollinearity).
Think about the variable “day of week.” There are seven days in the week, but we only need six indicators to represent them: if we know it is not Monday through Saturday, it must be Sunday. Including indicators for Monday through Saturday and then adding one for Sunday as well would cause a problem in regression analysis because of multicollinearity.
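As a quick sketch of my own, pandas can produce the $n - 1$ indicator columns directly by dropping one category:

```python
import pandas as pd

days = pd.DataFrame({'day': ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']})

# drop_first=True keeps n - 1 = 6 indicator columns, avoiding perfect multicollinearity
dummies = pd.get_dummies(days['day'], drop_first=True)
print(dummies.shape)   # (7, 6): seven rows, six indicator columns
```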