Day117 - STAT Review: Statistical Experiments and Significance Testing (4)
Practical Statistics for Data Scientists: ANOVA(One & Two-Way), F-statistic, and Chi-Square Test
ANOVA
Imagine that, rather than conducting an A/B test, we assessed several groups—A, B, C, and D—each containing numeric data. The statistical method used to determine whether there is a significant difference between these groups is known as analysis of variance, or ANOVA.
Key Terms for ANOVA
- Pairwise Comparison
- A hypothesis test (e.g., of means) between two groups among multiple groups.
- Omnibus Test
- A single hypothesis test of the overall variance among multiple group means.
- Decomposition of Variance
- Separating the components contributing to an individual value (e.g., the contribution from the overall average, from a treatment mean, and from a residual error).
- F-Statistic
- A standardized statistic that measures how much the differences between group means exceed expectations based on a chance model.
- SS
- “Sum of squares,” referring to deviations from some average value.
Let’s assume we have a table displaying the number of seconds each visitor spent on four pages. The four pages are rotated so each web visitor receives one randomly. Each page has five visitors in the table, and each column represents an independent data set.
| – | Page 1 | Page 2 | Page 3 | Page 4 |
|---|---|---|---|---|
|  | 164 | 178 | 175 | 155 |
|  | 172 | 191 | 193 | 166 |
|  | 177 | 182 | 171 | 164 |
|  | 156 | 185 | 163 | 170 |
|  | 195 | 177 | 176 | 162 |
| Average | 172 | 185 | 176 | 162 |
| Grand Average | 173.75 |  |  |  |
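The R and Python snippets later in this post operate on a `four_sessions` data frame, which the original presumably loads from the book's accompanying data files. As a minimal sketch only, assuming the values in the table above, such a frame could be built in long format like this:

```python
import pandas as pd

# Hypothetical reconstruction of four_sessions from the table above:
# one (Page, Time) row per observation
times = {
    'Page 1': [164, 172, 177, 156, 195],
    'Page 2': [178, 191, 182, 185, 177],
    'Page 3': [175, 193, 171, 163, 176],
    'Page 4': [155, 166, 164, 170, 162],
}
four_sessions = pd.DataFrame(
    [(page, t) for page, ts in times.items() for t in ts],
    columns=['Page', 'Time'])
```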
Comparing two groups was a simple matter; with four means, there are $\binom{4}{2} = 6$ pairwise comparisons between them:
- Page 1 compared to page 2
- Page 1 compared to page 3
- Page 1 compared to page 4
- Page 2 compared to page 3
- Page 2 compared to page 4
- Page 3 compared to page 4
Instead of making all the pairwise comparisons, we can conduct a single omnibus test that asks: could all four pages share the same underlying stickiness, with the observed differences arising merely from the random way a common set of session times was allocated among them?
The procedure used to test this is ANOVA. It can be demonstrated using the following resampling method for the A/B/C/D web page stickiness test:
1. Combine all the data together in a single box.
2. Shuffle and draw out four resamples of five values each.
3. Record the mean of each of the four groups.
4. Record the variance among the four group means.
5. Repeat steps 2–4 many times (e.g., 1,000).
The p-value indicates the proportion of the time the resampled variance exceeded the observed variance. This permutation test is more complicated than the one we discussed in earlier posts. (Check out here: Day114 - STAT Review: …Permutation Test)
- In R, we can use the `aovp` function in the `lmPerm` package.

```r
> library(lmPerm)
> summary(aovp(Time ~ Page, data=four_sessions))
[1] "Settings:  unique SS "
Component 1 :
            Df R Sum Sq R Mean Sq Iter Pr(Prob)
Page         3    831.4    277.13 3104  0.09278 .
Residuals   16   1618.4    101.15
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```
The p-value, `Pr(Prob)`, is $0.09278$. This means that, if the four pages had the same underlying stickiness, about $9.3\%$ of the time the variance among page means would be as large as or larger than what was observed. The column `Iter` indicates the number of iterations completed in the permutation test. The remaining columns correspond to a standard ANOVA table.
- In Python, we can compute the permutation test as follows.

```python
import numpy as np
import pandas as pd

# four_sessions: DataFrame with columns 'Page' and 'Time'
observed_variance = four_sessions.groupby('Page').mean().var().iloc[0]
print('Observed Means:', four_sessions.groupby('Page').mean().values.ravel())
print('Variance:', observed_variance)

def perm_test(df):
    df = df.copy()
    # Randomly reassign the session times to the pages
    df['Time'] = np.random.permutation(df['Time'].values)
    return df.groupby('Page').mean().var().iloc[0]

perm_variance = [perm_test(four_sessions) for _ in range(3000)]
# p-value: fraction of permuted variances exceeding the observed one
print('Pr(Prob)', np.mean([var > observed_variance for var in perm_variance]))
```
F-Statistic
Just as the t-test can be used instead of a permutation test to compare the means of two groups, there is a statistical test for ANOVA based on the F-statistic. The F-statistic is the ratio of the variance among group means (the treatment effect) to the variance due to residual error. The higher this ratio, the more statistically significant the result.
- In R, we use the `aov` function to produce an ANOVA table.

```r
> summary(aov(Time ~ Page, data=four_sessions))
            Df Sum Sq Mean Sq F value Pr(>F)
Page         3  831.4   277.1    2.74 0.0776 .
Residuals   16 1618.4   101.2
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```
- In Python, we use the `statsmodels` package.

```python
import statsmodels.api as sm
import statsmodels.formula.api as smf

model = smf.ols('Time ~ Page', data=four_sessions).fit()
aov_table = sm.stats.anova_lm(model)
aov_table
```
`Df` is degrees of freedom, `Sum Sq` is the sum of squares, `Mean Sq` is the mean squared deviations, and `F value` is the F-statistic. For the grand average, the sum of squares is the squared departure of the grand average from $0$, multiplied by $20$ (the number of observations). The degrees of freedom for the grand average is $1$ by definition.
Treatment means have $3$ degrees of freedom (given the grand average, once three treatment means are set, the fourth is determined). Their sum of squares is the total of squared deviations of the treatment means from the grand average.
Residuals have $16$ degrees of freedom ($20$ observations, of which $16$ can still vary once the grand mean and treatment means are fixed). Their SS is the sum of squared differences between the individual observations and the treatment means. Mean squares (MS) is the sum of squares divided by the degrees of freedom.
The F-statistic is the ratio MS(Treatment)/MS(Error). This value is compared to the F-distribution to determine whether the differences among treatment means are larger than expected under random chance variation.
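Reading the numbers off the ANOVA table above makes the arithmetic concrete:

$$
F = \frac{MS_{\text{treatment}}}{MS_{\text{error}}} = \frac{831.4/3}{1618.4/16} = \frac{277.1}{101.2} \approx 2.74
$$

which matches the `F value` reported by `aov`.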
Two-Way ANOVA
The described A/B/C/D test is a “one-way” ANOVA with one varying factor. Adding a second factor, like “weekend versus weekday,” would create a “two-way ANOVA.” (group A weekend, group A weekday, etc.)
We treat it similarly to a one-way ANOVA, except that we can now also identify the "interaction effect." We separate the weekend and weekday observations for each group, then calculate the difference between their averages and the treatment average.
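For illustration only: the `four_sessions` data has no day-of-week column, so the `Day` variable below is hypothetical, but this sketch shows how a two-way ANOVA with an interaction term could be fit in `statsmodels`:

```python
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical: assumes four_sessions gained a 'Day' column
# ('weekend' / 'weekday') alongside 'Page' and 'Time'
model = smf.ols('Time ~ Page + Day + Page:Day', data=four_sessions).fit()

# The Page:Day row of the resulting table is the interaction effect
sm.stats.anova_lm(model)
```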
Chi-Square Test
The chi-square test is applied to count data to determine how well it aligns with an expected distribution.
Web testing often goes beyond A/B testing to evaluate multiple treatments simultaneously. The chi-square statistic is most commonly used in statistical practice with $r \times c$ contingency tables to determine if the null hypothesis of independence among variables is valid.
Key Terms for Chi-Square Test
- Chi-Square Statistic
- A measure of how much observed data differs from what is expected.
- Expectation or expected
- How we expect the data to perform under certain assumptions, typically the null hypothesis.
Chi-Square Test: A Resampling Approach
Imagine testing three distinct headlines—A, B, and C—each presented to 1,000 visitors, with the results displayed in the table below.
| – | Headline A | Headline B | Headline C |
|---|---|---|---|
| Click | 14 | 8 | 12 |
| No-Click | 986 | 992 | 988 |
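The code examples below operate on a `clicks` object (in the original, presumably loaded from the book's data files). A minimal sketch of building the same contingency table in pandas:

```python
import pandas as pd

# Contingency table of the click counts above
clicks = pd.DataFrame({'Headline A': [14, 986],
                       'Headline B': [8, 992],
                       'Headline C': [12, 988]},
                      index=['Click', 'No-click'])
```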
A resampling procedure can test whether the click rates differ by more than chance. The test requires the "expected" distribution of clicks under the null hypothesis that all three headlines share the same click rate, namely an overall rate of 34 clicks per 3,000 impressions, i.e., $34/3 \approx 11.33$ expected clicks (and $988.67$ expected no-clicks) per headline. The table below shows the Pearson residuals for each cell:
| – | Headline A | Headline B | Headline C |
|---|---|---|---|
| Click | 0.792 | -0.990 | 0.198 |
| No-click | -0.085 | 0.106 | -0.021 |
The Pearson residual is defined as

$$
R = \frac{\text{Observed} - \text{Expected}}{\sqrt{\text{Expected}}}
$$

and the chi-square statistic is the sum of the squared Pearson residuals:

$$
\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} R_{ij}^2
$$

where $r$ and $c$ are the number of rows and columns, respectively. The chi-square statistic for this example is $1.666$.
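As a quick check, squaring and summing the residuals from the table above reproduces this value:

$$
\chi^2 = 0.792^2 + (-0.990)^2 + 0.198^2 + (-0.085)^2 + 0.106^2 + (-0.021)^2 \approx 1.666
$$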
We can test with this resampling algorithm:
- Create a box containing 34 ones (clicks) and 2,966 zeros (no-clicks).
- Shuffle, sample three groups of 1,000, and count clicks.
- Calculate squared differences between shuffled and expected counts, then sum them.
- Repeat steps 2–3 many times (say, 1,000 times).
- Determine how often the resampled sum of squared deviations exceeds the observed value to find the p-value.
- In R, the `chisq.test` function is used with `simulate.p.value=TRUE`.

```r
> chisq.test(clicks, simulate.p.value=TRUE)

	Pearson's Chi-squared test with simulated p-value (based on 2000
	replicates)

data:  clicks
X-squared = 1.6659, df = NA, p-value = 0.4853
```
- In Python, we can run the permutation test directly.

```python
import random
import numpy as np

# Box with 34 ones (clicks) and 2,966 zeros (no-clicks)
box = [1] * 34
box.extend([0] * 2966)
random.shuffle(box)

def chi2(observed, expected):
    """Sum of squared Pearson residuals."""
    pearson_residuals = []
    for row, expect in zip(observed, expected):
        pearson_residuals.append([(observe - expect) ** 2 / expect
                                  for observe in row])
    return np.sum(pearson_residuals)

# Under the null hypothesis, each headline expects 34/3 clicks
expected_clicks = 34 / 3
expected_noclicks = 1000 - expected_clicks
expected = [expected_clicks, expected_noclicks]
chi2observed = chi2(clicks.values, expected)

def perm_fun(box):
    # Draw three samples of 1,000 and count the clicks in each
    sample_clicks = [sum(random.sample(box, 1000)),
                     sum(random.sample(box, 1000)),
                     sum(random.sample(box, 1000))]
    sample_noclicks = [1000 - n for n in sample_clicks]
    return chi2([sample_clicks, sample_noclicks], expected)

perm_chi2 = [perm_fun(box) for _ in range(2000)]
resampled_p_value = sum(v > chi2observed for v in perm_chi2) / len(perm_chi2)
print(f'Observed chi2: {chi2observed:.4f}')
print(f'Resampled p-value: {resampled_p_value:.4f}')
```
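For comparison, the classical (non-resampling) chi-square test is also available. A minimal sketch using `scipy`, assuming the same `clicks` table:

```python
from scipy import stats

# Classical chi-square test of independence on the contingency table;
# returns the statistic, the p-value, the degrees of freedom, and the
# expected counts under the null hypothesis
chi2_stat, p_value, dof, expected = stats.chi2_contingency(clicks.values)
print(f'chi2: {chi2_stat:.4f}, p-value: {p_value:.4f}, df: {dof}')
```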
The test shows that this result could easily have been obtained by chance.