4 minute read

Mathematical Principles: Non-parametric Inference (Wilcoxon Signed-Rank Test & Wilcoxon Rank-Sum Test)

7AB0D54F-7D96-4B5C-996E-501DF19C2B5D_1_105_c

Nan-parametric Inference

Nonparametric Methods

Consider the statistical tests we discussed in previous posts (z-test, t-test, $\chi^2$-test, F-test, etc.). We understood the population distribution in these tests and only needed to make inferences about the unknown parameters. However, what do we do if we can’t observe the population distribution? Should we then use a nonparametric method?

We turn to nonparametric methods when the population distribution is unknown or the data doesn't meet the necessary assumptions for specific parametric techniques (such as when the Central Limit Theorem is inapplicable or the normal approximation for proportions is invalid). Nonparametric methods, which require fewer assumptions about the underlying distribution, are often referred to as distribution-free methods.

Why use nonparametric tests?

  • They require fewer assumptions about the underlying data, making them more robust in various contexts.
  • They work effectively with ordinal data or non-normal distributions, allowing for flexibility in statistical analysis.
  • They use ranks instead of raw values, making them more interpretable and less sensitive to outliers.

Wilcoxon Signed-Rank Test (Paired Data)

  • It is used to compare two samples from populations that are not independent. Because we are considering paired data, we may look at the value difference for each pair of observations.
  • It does not require populations to be normally distributed.
  • Null hypothesis: In the underlying population, the difference is equal to 0
  • Note that we consider medians for non-parametric tests as opposed to means.

Motivating Example

To assess if a new drug affects tumor size, we can’t assume that the tumor sizes follow a normal distribution. Consider a sample of $n$ pairs of observations (tumor size before the drug versus tumor size after the drug), where$ n = 13$. Since this sample size is small, the Central Limit Theorem (CLT) is not applicable. Therefore, we must utilize a parametric method known as the “Wilcoxon Signed-Rank Test.”

  • $H_0$: The median difference in tumor sizes equals 0

  • $H_1$: The median difference in tumor sizes is different from 0

  • Test at the $\alpha=0.05$ significance level.

  • Next, take the difference for each pair of observations (before/after). Ignoring the sign of these observations rank their absolute values from smallest to largest.

    • A difference of $0$ is not ranked.
    • Remove pair from data set and reduce number of pairs by 1.
  • Tied observations are assigned an average rank.

  • Finally, separate the ranks by sign to either + or -

  • Calculate the sum of the positive ranks, $T^+$, and the sum of the negative ranks, $T^-$

  • Calculate $T=T^+ - T^-$

  • Under the null hypothesis, the median of the underlying population differences equals 0. Thus, we expect approximately equal numbers of positive and negative ranks.

  • Additionally, the sum of the positive ranks should be approximately equal to the sum of the negative ranks, so $T$ should be approximately 0.

  • Evaluate the null hypothesis using the test statistic:

    $Z_T = \frac{T- \mu_T}{\sigma_T}$

  • Note that

    $\mu_T = 0$, $\sigma_T = \sqrt{\frac{n(n+1)(2n+1)}{6}}$

  • $Z_T \sim N(0,1)$, given that $n$ is large enough (typically $n>12$)

    • We cannot use the normal approximation if the sample size is $n \leq 12$. We cannot use psignrank(T,n) in R to calculate the exact p-value.
  • If $Z_T$ is large, we reject $H_0$meaning the drug significantly reduces tumor size.

  • In R

    wilcox.test(before, after, paired=T, Exact=F, correct=F)
    



Wilcoxon Rank-Sum Test (Independent Samples)

  • Used to compare samples from independent populations. Nonparametric analog to the two-sample t-test.
  • It does not require populations to be normally distributed. However, it does require the two populations to have the same general shape.
    • $H_0$: The medians of the two populations are identical.

Test Stepss

  • Combine all data from the two samples and rank the observations from smallest to largest.
  • If ranks are tied, we assign the average rank to those values. We then find the sum of ranks for each of the two original samples, denoted $W_1$ and $W_2$, and then let $W= \text{min}(W_1, W_2)$
  • Under $ H_0 $, the underlying populations have the same median, so we expect ranks to be randomly distributed between the two groups.
  • Thus, the average ranks for the two samples (i.e., $W1/n_1$ and $W_2/n_2$) should be approximately equal.
  • Evaluate the null hypothesis using the test statistic $z_w  = frac{W- \mu_W}{\sigma_W}$.
  • Let $n_1$ be the number of observations in the sample with the smaller sum of ranks. Let $n_2$ be the number of observations in the sample with the large sum of ranks.
  • Then, $\mu_w = \frac{n_1(n_1+n_2+1)}{12}$
    • $Z_W \sim N(0,1)$ when $n_1$ and $n_2$ are large enough $(n_1, n_2>10)$
  • If $Z_W$ is large, we reject $H_0$, meaning the two groups have different medians.

Example

  • Suppose we want to determine whether a particular disease is associated with a raised body temperature. We cannot assume that body temperature is normally distributed. Take a sample of $n_1=12$ people who do not have the disease. Then, take another sample of $n_2=15$ people with the disease. How can we compare the median body temperature for these two populations? $\Rightarrow$ use Wilcoxon Rank-Sum Test
  • $H_0$: The median body temperature for those without the disease is greater than or equal that of those with the disease.
  • $H_1$: The median body temperature for those without the disease is less than that of those with the disease.
  • Test at the $\alpha=0.05$ significance level.

  • In R

    wilcox.test(sample1, sample2, exact=F, correct=F, alt="less")
    



Leave a comment