Day117 - STAT Review: Statistical Experiments and Significance Testing (4)
Practical Statistics for Data Scientists: ANOVA(One & Two-Way), F-statistic, and Chi-Square Test
ANOVA
Imagine that, rather than conducting an A/B test, we assessed several groups—A, B, C, and D—each containing numeric data. The statistical method used to determine whether there is a significant difference between these groups is known as analysis of variance, or ANOVA.
Key Terms for ANOVA
- Pairwise Comparison
- A hypothesis test (e.g., of means) between two groups among multiple groups.
- Omnibus Test
- A single hypothesis test of the overall variance among multiple group means.
- Decomposition of Variance
- Separating the components contributing to an individual value (e.g., the contribution from the overall average, from a treatment mean, and from a residual error).
- F-Statistic
- A standardized statistic that measures how much the differences between group means exceed expectations based on a chance model.
- SS
- “Sum of squares,” referring to deviations from some average value.
Let’s assume we have a table displaying the number of seconds each visitor spent on four pages. The four pages are rotated so each web visitor receives one randomly. Each page has five visitors in the table, and each column represents an independent data set.
| – | Page 1 | Page 2 | Page 3 | Page 4 |
|---|---|---|---|---|
|  | 164 | 178 | 175 | 155 |
|  | 172 | 191 | 193 | 166 |
|  | 177 | 182 | 171 | 164 |
|  | 156 | 185 | 163 | 170 |
|  | 195 | 177 | 176 | 162 |
| Average | 172 | 185 | 176 | 162 |
| Grand Average | 173.75 |  |  |  |
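The R and Python snippets later in this post operate on a `four_sessions` data frame, which the original presumably loads from the book's accompanying data files. As a minimal sketch only, assuming the values in the table above, such a frame could be built in long format like this:

```python
import pandas as pd

# Hypothetical reconstruction of four_sessions from the table above:
# one (Page, Time) row per observation
times = {
    'Page 1': [164, 172, 177, 156, 195],
    'Page 2': [178, 191, 182, 185, 177],
    'Page 3': [175, 193, 171, 163, 176],
    'Page 4': [155, 166, 164, 170, 162],
}
four_sessions = pd.DataFrame(
    [(page, t) for page, ts in times.items() for t in ts],
    columns=['Page', 'Time'])
```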
Comparing two groups was a simple matter; with four means, there are $\binom{4}{2} = 6$ pairwise comparisons between them:
- Page 1 compared to page 2
- Page 1 compared to page 3
- Page 1 compared to page 4
- Page 2 compared to page 3
- Page 2 compared to page 4
- Page 3 compared to page 4
Instead of making all the pairwise comparisons, we can conduct a single omnibus test that asks: could all four pages share the same underlying stickiness, with the observed differences arising merely from the random way a common set of session times was allocated among them?
The procedure used to test this is ANOVA. It can be demonstrated using the following resampling method for the A/B/C/D web page stickiness test:
1. Combine all the data together in a single box.
2. Shuffle and draw out four resamples of five values each.
3. Record the mean of each of the four groups.
4. Record the variance among the four group means.
5. Repeat steps 2–4 many times (e.g., 1,000).
The p-value indicates the proportion of the time the resampled variance exceeded the observed variance. This permutation test is more complicated than the one we discussed in earlier posts. (Check out here: Day114 - STAT Review: …Permutation Test)
- In R, we can use the `aovp` function in the `lmPerm` package.

```r
> library(lmPerm)
> summary(aovp(Time ~ Page, data=four_sessions))
[1] "Settings:  unique SS "
Component 1 :
            Df R Sum Sq R Mean Sq Iter Pr(Prob)
Page         3    831.4    277.13 3104  0.09278 .
Residuals   16   1618.4    101.15
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```
The p-value, `Pr(Prob)`, is $0.09278$. This means that, if the four pages had the same underlying stickiness, about $9.3\%$ of the time the variance among page means would be as large as or larger than what was observed. The column `Iter` indicates the number of iterations completed in the permutation test. The remaining columns correspond to a standard ANOVA table.
- In Python, we can compute the permutation test as follows.

```python
import numpy as np
import pandas as pd

# four_sessions: DataFrame with columns 'Page' and 'Time'
observed_variance = four_sessions.groupby('Page').mean().var().iloc[0]
print('Observed Means:', four_sessions.groupby('Page').mean().values.ravel())
print('Variance:', observed_variance)

def perm_test(df):
    df = df.copy()
    # Randomly reassign the session times to the pages
    df['Time'] = np.random.permutation(df['Time'].values)
    return df.groupby('Page').mean().var().iloc[0]

perm_variance = [perm_test(four_sessions) for _ in range(3000)]
# p-value: fraction of permuted variances exceeding the observed one
print('Pr(Prob)', np.mean([var > observed_variance for var in perm_variance]))
```
F-Statistic
Just as the t-test can be used instead of a permutation test to compare the means of two groups, there is a statistical test for ANOVA based on the F-statistic. The F-statistic is the ratio of the variance among group means (the treatment effect) to the variance due to residual error. The higher this ratio, the more statistically significant the result.
- In R, we use the `aov` function to produce an ANOVA table.

```r
> summary(aov(Time ~ Page, data=four_sessions))
            Df Sum Sq Mean Sq F value Pr(>F)
Page         3  831.4   277.1    2.74 0.0776 .
Residuals   16 1618.4   101.2
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```
- In Python, we use the `statsmodels` package.

```python
import statsmodels.api as sm
import statsmodels.formula.api as smf

model = smf.ols('Time ~ Page', data=four_sessions).fit()
aov_table = sm.stats.anova_lm(model)
aov_table
```
`Df` is degrees of freedom, `Sum Sq` is the sum of squares, `Mean Sq` is the mean squared deviations, and `F value` is the F-statistic. For the grand average, the sum of squares is the squared departure of the grand average from $0$, multiplied by $20$ (the number of observations). The degrees of freedom for the grand average is $1$ by definition.
Treatment means have $3$ degrees of freedom (given the grand average, once three treatment means are set, the fourth is determined). Their sum of squares is the total of squared deviations of the treatment means from the grand average.
Residuals have $16$ degrees of freedom ($20$ observations, of which $16$ can still vary once the grand mean and treatment means are fixed). Their SS is the sum of squared differences between the individual observations and the treatment means. Mean squares (MS) is the sum of squares divided by the degrees of freedom.
The F-statistic is the ratio MS(Treatment)/MS(Error). This value is compared to the F-distribution to determine whether the differences among treatment means are larger than expected under random chance variation.
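Reading the numbers off the ANOVA table above makes the arithmetic concrete:

$$
F = \frac{MS_{\text{treatment}}}{MS_{\text{error}}} = \frac{831.4/3}{1618.4/16} = \frac{277.1}{101.2} \approx 2.74
$$

which matches the `F value` reported by `aov`.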
Two-Way ANOVA
The described A/B/C/D test is a “one-way” ANOVA with one varying factor. Adding a second factor, like “weekend versus weekday,” would create a “two-way ANOVA.” (group A weekend, group A weekday, etc.)
We treat it similarly to a one-way ANOVA, except that we can now also identify the "interaction effect." We separate the weekend and weekday observations for each group, then calculate the difference between their averages and the treatment average.
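For illustration only: the `four_sessions` data has no day-of-week column, so the `Day` variable below is hypothetical, but this sketch shows how a two-way ANOVA with an interaction term could be fit in `statsmodels`:

```python
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical: assumes four_sessions gained a 'Day' column
# ('weekend' / 'weekday') alongside 'Page' and 'Time'
model = smf.ols('Time ~ Page + Day + Page:Day', data=four_sessions).fit()

# The Page:Day row of the resulting table is the interaction effect
sm.stats.anova_lm(model)
```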
Chi-Square Test
The chi-square test is applied to count data to determine how well it aligns with an expected distribution.
Web testing often goes beyond A/B testing to evaluate multiple treatments simultaneously. The chi-square statistic is most commonly used in statistical practice with $r \times c$ contingency tables to determine if the null hypothesis of independence among variables is valid.
Key Terms for Chi-Square Test
- Chi-Square Statistic
- A measure of how much observed data differs from what is expected.
- Expectation or expected
- How we expect the data to perform under certain assumptions, typically the null hypothesis.
Chi-Square Test: A Resampling Approach
Imagine testing three distinct headlines—A, B, and C—each presented to 1,000 visitors, with the results displayed in the table below.
| – | Headline A | Headline B | Headline C |
|---|---|---|---|
| Click | 14 | 8 | 12 |
| No-Click | 986 | 992 | 988 |
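The code examples below operate on a `clicks` object (in the original, presumably loaded from the book's data files). A minimal sketch of building the same contingency table in pandas:

```python
import pandas as pd

# Contingency table of the click counts above
clicks = pd.DataFrame({'Headline A': [14, 986],
                       'Headline B': [8, 992],
                       'Headline C': [12, 988]},
                      index=['Click', 'No-click'])
```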
A resampling procedure can test whether the click rates differ by more than chance. The test requires the "expected" distribution of clicks under the null hypothesis that all three headlines share the same click rate, namely an overall rate of 34 clicks per 3,000 impressions, i.e., $34/3 \approx 11.33$ expected clicks (and $988.67$ expected no-clicks) per headline. The table below shows the Pearson residuals for each cell:
| – | Headline A | Headline B | Headline C |
|---|---|---|---|
| Click | 0.792 | -0.990 | 0.198 |
| No-click | -0.085 | 0.106 | -0.021 |
The Pearson residual is defined as

$$
R = \frac{\text{Observed} - \text{Expected}}{\sqrt{\text{Expected}}}
$$

and the chi-square statistic is the sum of the squared Pearson residuals:

$$
\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} R_{ij}^2
$$

where $r$ and $c$ are the number of rows and columns, respectively. The chi-square statistic for this example is $1.666$.
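As a quick check, squaring and summing the residuals from the table above reproduces this value:

$$
\chi^2 = 0.792^2 + (-0.990)^2 + 0.198^2 + (-0.085)^2 + 0.106^2 + (-0.021)^2 \approx 1.666
$$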
We can test with this resampling algorithm:
- Create a box containing 34 ones (clicks) and 2,966 zeros (no-clicks).
- Shuffle, sample three groups of 1,000, and count clicks.
- Calculate squared differences between shuffled and expected counts, then sum them.
- Repeat steps 2–3 many times (say, 1,000 times).
- Determine how often the resampled sum of squared deviations exceeds the observed value to find the p-value.
- In R, the `chisq.test` function is used with `simulate.p.value=TRUE`.

```r
> chisq.test(clicks, simulate.p.value=TRUE)

	Pearson's Chi-squared test with simulated p-value (based on 2000
	replicates)

data:  clicks
X-squared = 1.6659, df = NA, p-value = 0.4853
```
- In Python, we can run the permutation test directly.

```python
import random
import numpy as np

# Box with 34 ones (clicks) and 2,966 zeros (no-clicks)
box = [1] * 34
box.extend([0] * 2966)
random.shuffle(box)

def chi2(observed, expected):
    """Sum of squared Pearson residuals."""
    pearson_residuals = []
    for row, expect in zip(observed, expected):
        pearson_residuals.append([(observe - expect) ** 2 / expect
                                  for observe in row])
    return np.sum(pearson_residuals)

# Under the null hypothesis, each headline expects 34/3 clicks
expected_clicks = 34 / 3
expected_noclicks = 1000 - expected_clicks
expected = [expected_clicks, expected_noclicks]
chi2observed = chi2(clicks.values, expected)

def perm_fun(box):
    # Draw three samples of 1,000 and count the clicks in each
    sample_clicks = [sum(random.sample(box, 1000)),
                     sum(random.sample(box, 1000)),
                     sum(random.sample(box, 1000))]
    sample_noclicks = [1000 - n for n in sample_clicks]
    return chi2([sample_clicks, sample_noclicks], expected)

perm_chi2 = [perm_fun(box) for _ in range(2000)]
resampled_p_value = sum(v > chi2observed for v in perm_chi2) / len(perm_chi2)
print(f'Observed chi2: {chi2observed:.4f}')
print(f'Resampled p-value: {resampled_p_value:.4f}')
```
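For comparison, the classical (non-resampling) chi-square test is also available. A minimal sketch using `scipy`, assuming the same `clicks` table:

```python
from scipy import stats

# Classical chi-square test of independence on the contingency table;
# returns the statistic, the p-value, the degrees of freedom, and the
# expected counts under the null hypothesis
chi2_stat, p_value, dof, expected = stats.chi2_contingency(clicks.values)
print(f'chi2: {chi2_stat:.4f}, p-value: {p_value:.4f}, df: {dof}')
```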
The test shows that this result could easily have been obtained by chance.