Day110 - STAT Review: Exploratory Data Analysis (2)

January 26, 2025 5 minute read

Practical Statistics for Data Scientists: Data Distribution, Correlation, and Various Data Visualization Plots


Book Source: Practical Statistics for Data Scientists, 2nd Edition, by Peter Bruce, Andrew Bruce, Peter Gedeck / Released May 2020 / Publisher(s): O’Reilly Media, Inc.

Data Distribution

Key Terms for Exploring the Distribution

Boxplot: A plot to visualize the distribution of data; also called a box-and-whiskers plot.
Frequency Table: A tally of the count of numeric data values that fall into a set of intervals (bins).
Histogram: A plot of a frequency table with the bins on the x-axis and count on the y-axis.
Density Plot: A smooth version of the histogram.

Percentiles and Boxplots

Boxplots are based on percentiles and give a quick way to visualize the distribution of data. The snippets below compute percentiles and draw a boxplot, first in R and then in Python.

In R, use the quantile function for the percentiles and boxplot for the plot.

quantile(state[['Murder.Rate']], p=c(.05, .25, .5, .75, .95))
  
boxplot(state[['Population']]/1000000, ylab='Population (millions)')

In Python, a pandas DataFrame column provides the quantile and plot.box() methods.

state['Murder.Rate'].quantile([0.05, 0.25, 0.5, 0.75, 0.95])
  
ax = (state['Population']/1_000_000).plot.box()
ax.set_ylabel("Population (millions)")

How to Read Boxplot

The top and bottom of the box are the 75th and 25th percentiles, respectively.
The line inside the box marks the median.
The dashed lines, called whiskers, extend from the top and bottom of the box to indicate the range for the bulk of the data.
By default, the R function extends each whisker to the furthest point beyond the box, but no further than 1.5 times the IQR from the box.
Points outside the whiskers are plotted individually as points or circles; these are often considered outliers.
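The reading rules above can be checked numerically. The following is a minimal sketch with NumPy on a made-up sample (the `values` array is hypothetical, not from the book); note that plotting functions may compute the quartiles slightly differently, so whisker positions can differ a little from this calculation.

```python
import numpy as np

# Hypothetical sample; 30 is deliberately far from the rest of the data.
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 30], dtype=float)

# Box edges and median line.
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Whiskers may reach at most 1.5 * IQR beyond the box.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme data points inside the fences.
inside = values[(values >= lower_fence) & (values <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Everything beyond the fences would be drawn as an individual point.
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(q1, median, q3, iqr)
print(whisker_low, whisker_high, outliers)
```

Here the upper whisker stops at 9, the largest value within the fence, and 30 is the only point plotted separately.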

Image Source: Wikipedia Boxplot

Frequency Tables and Histograms

In R

breaks <- seq(from=min(state[['Population']]),
              to=max(state[['Population']]), length=11)
pop_freq <- cut(state[['Population']], breaks=breaks,
                right=TRUE, include.lowest=TRUE)
table(pop_freq)

In Python

binnedPopulation = pd.cut(state['Population'], 10)  # 10 equal-width bins
binnedPopulation.value_counts()

Histograms are generally plotted such that:
- Empty bins are included in the graph.
- Bins are of equal width.
- The number of bins (or, equivalently, bin size) is up to the user.
- Bars are contiguous: no empty space shows between bars, unless there is an empty bin.
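To make these properties concrete, here is a small sketch using `numpy.histogram` on a made-up array (the data and the choice of five bins are assumptions for illustration). It shows that the bins have equal width and that empty bins still appear in the result with a count of zero.

```python
import numpy as np

# Hypothetical values with a gap between 4 and 9.
values = np.array([1, 1, 2, 2, 2, 3, 4, 9, 10, 10], dtype=float)

# Five equal-width bins spanning the data range (the bin count is the user's choice).
counts, edges = np.histogram(values, bins=5)
widths = np.diff(edges)

print(counts)  # empty bins are kept as zeros
print(edges)
print(widths)  # every bin has the same width
```

The two middle bins contain no values, which is what produces the visible gap between bars in the plotted histogram.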

Binary and Categorical Data

Key Terms for Categorical Data

Mode: The most commonly occurring category or value in a data set.
Expected Value: When the categories can be associated with a numeric value, this gives an average value based on a category’s probability of occurrence.
Bar Chart: The frequency or proportion for each category plotted as bars.
Pie Chart: The frequency or proportion for each category plotted as wedges in a pie.
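The mode and the expected value can be computed directly with pandas. The sketch below uses illustrative numbers (the categories, values, and probabilities are assumptions, not data from this post): the expected value is the probability-weighted sum of the values attached to each category.

```python
import pandas as pd

# Illustrative categories with an associated numeric value and probability.
outcomes = pd.DataFrame({
    'category': ['premium', 'basic', 'none'],
    'value': [300, 50, 0],
    'probability': [0.05, 0.15, 0.80],
})

# Expected value: sum over categories of value * probability.
expected_value = (outcomes['value'] * outcomes['probability']).sum()

# Mode and proportions from a hypothetical sample of observed categories.
observed = pd.Series(['none', 'basic', 'none', 'premium', 'none', 'basic'])
mode = observed.mode()[0]
proportions = observed.value_counts(normalize=True)

print(expected_value)
print(mode)
print(proportions)
```

The `proportions` series is what a bar chart or pie chart of this sample would display.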

Example Code Snippet

In R

barplot(as.matrix(dfw) / 6, cex.axis=0.8, cex.names=0.7,
        xlab='Cause of delay', ylab='Count')

In Python

ax = dfw.transpose().plot.bar(figsize=(4, 4), legend=False)
ax.set_xlabel('Cause of delay')
ax.set_ylabel('Count')

Correlation

In many modeling projects, exploratory data analysis involves examining the correlation among predictors, and between predictors and a target variable. Two variables X and Y, each with measured data, are positively correlated if high values of X go with high values of Y and low values of X go with low values of Y. They are negatively correlated if high values of X go with low values of Y, and vice versa.

Key Terms for Correlation

Correlation Coefficient
- A measure indicating how closely numeric variables are related, with a range from -1 to +1.
Correlation Matrix
- A table that displays variables along the rows and columns, with cell values representing the correlations between them.
Scatterplot
- A plot in which the x-axis is the value of one variable, and the y-axis the value of another.
Pearson’s Correlation Coefficient
- $r=\frac{\sum^n_{i=1}(x_i-\bar{x})(y_i-\bar{y})}{(n-1)s_x s_y}$
- We multiply deviations from the mean for variable 1 by those for variable 2, sum the products, and divide by $(n-1)$ times the product of the standard deviations $s_x$ and $s_y$.
- Note that we divide by $(n-1)$, the degrees of freedom, rather than $n$.
- The result always lies between $+1$ (perfect positive correlation) and $-1$ (perfect negative correlation); $0$ indicates no linear correlation.
  
  Image Source: Vitaflux : Correlation Heatmap with Seaborn
Other correlation coefficients: Spearman’s rho and Kendall’s tau, which are based on the ranks of the data.
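As a check on the formula for Pearson's correlation coefficient, the following sketch computes $r$ term by term with NumPy and compares it with `numpy.corrcoef`. The arrays `x` and `y` are made-up values chosen only for illustration.

```python
import numpy as np

# Hypothetical paired measurements.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

n = len(x)

# Sample standard deviations (ddof=1 divides by n - 1).
s_x = x.std(ddof=1)
s_y = y.std(ddof=1)

# Sum of products of deviations, divided by (n - 1) * s_x * s_y.
r = np.sum((x - x.mean()) * (y - y.mean())) / ((n - 1) * s_x * s_y)

# Reference value from NumPy's correlation matrix.
r_numpy = np.corrcoef(x, y)[0, 1]

print(r, r_numpy)
```

Both values agree (about 0.77 for this sample), and the off-diagonal entries of `np.corrcoef` are the same numbers a correlation matrix would show.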

Example Code Snippet

In R, the corrplot package makes it easy to plot a correlation matrix.

etfs <- sp500_px[row.names(sp500_px) > '2012-07-01',
                sp500_sym[sp500_sym$sector == 'etf', 'symbol']]
library(corrplot)
corrplot(cor(etfs), method='ellipse')

In Python, the common packages do not provide an equivalent of this ellipse plot; the closest option is a heatmap of the correlation matrix with seaborn.heatmap.

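import seaborn as sns

# Select rows after 2012-07-01 and the columns whose symbol belongs to the ETF sector
etfs = sp500_px.loc[sp500_px.index > '2012-07-01',
                    sp500_sym[sp500_sym['sector'] == 'etf']['symbol']]
sns.heatmap(etfs.corr(), vmin=-1, vmax=1,
            cmap=sns.diverging_palette(20, 220, as_cmap=True))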

Scatterplots

The typical method for visualizing the relationship between two measured data variables is through a scatterplot.

In R

plot(telecom$T, telecom$VZ, xlab='ATT (T)', ylab='Verizon (VZ)' )

In Python

ax = telecom.plot.scatter(x='T', y='VZ', figsize=(4, 4), marker='$\u25EF$')  # open-circle marker
ax.set_xlabel('ATT (T)')
ax.set_ylabel('Verizon (VZ)')
ax.axhline(0, color='grey', lw=1)
ax.axvline(0, color='grey', lw=1)

Image Source: Wikipedia - Scatterplot

Exploring Two or More Variables

Key Terms for Two or More Variables

Contingency Table
- A summary that counts occurrences for two or more categorical variables.
Hexagonal Binning
- A visualization of two numeric variables, where data points are grouped into hexagons.
Contour Plot
- A graphical representation of the density of two numeric variables, resembling a topographical map.
Violin Plot
- Comparable to a boxplot, but displays the density estimate of the distribution.

The suitable type of bivariate or multivariate analysis depends on whether the data is numeric or categorical.

Hexagonal Binning and Contours (Numeric vs. Numeric)

In R, the ggplot2 package is excellent for conducting advanced exploratory visual analyses.

ggplot(kc_tax0, (aes(x=SqFtTotLiving, y=TaxAssessedValue))) +
	stat_binhex(color='white') +
	theme_bw() +
	scale_fill_gradient(low='white', high='black') +
	labs(x="Finished Square Feet", y="Tax-Assessed Value")

In Python, the hexbin method can be used.

ax = kc_tax0.plot.hexbin(x='SqFtTotLiving', y='TaxAssessedValue',
                         gridsize=30, sharex=False, figsize=(5,4))
ax.set_xlabel('Finished Square Feet')
ax.set_ylabel('Tax Assessed Value')
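The idea behind binning two numeric variables is to count records per cell instead of drawing every point. The sketch below illustrates that counting step with `numpy.histogram2d`, which uses rectangular cells rather than hexagons; it is a simplified stand-in for the concept, not what `hexbin` does internally, and the `x` and `y` arrays are made up.

```python
import numpy as np

# Hypothetical paired values, clustered in three regions.
x = np.array([0.5, 0.6, 0.7, 1.5, 2.5, 2.6])
y = np.array([0.5, 0.4, 0.6, 1.5, 2.5, 2.4])

# Count points in a 3 x 3 grid of rectangular cells over [0, 3] x [0, 3].
counts, x_edges, y_edges = np.histogram2d(x, y, bins=3, range=[[0, 3], [0, 3]])

print(counts)  # counts[i, j] is the number of points in x-bin i and y-bin j
```

In a hexagonal binning plot, each count like these would determine the color of one hexagon.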

For contour plots, example code snippets follow.

In R, use the geom_density2d function in ggplot2.

ggplot(kc_tax0, (aes(x=SqFtTotLiving, y=TaxAssessedValue))) +
	theme_bw() +
	geom_point(alpha=0.1) +
	geom_density2d(color='white') +
	labs(x="Finished Square Feet", y="Tax-Assessed Value")

In Python, seaborn's kdeplot can be used.

fig, ax = plt.subplots(figsize=(4, 4))  # create the axes that kdeplot draws on
ax = sns.kdeplot(data=kc_tax0, x='SqFtTotLiving', y='TaxAssessedValue', ax=ax)
ax.set_xlabel('Finished Square Feet')
ax.set_ylabel('Tax-Assessed Value')

Two Categorical Variables

A useful way to summarize two categorical variables is a contingency table, which shows counts by category. We can also add column percentages and overall totals for further insight.

In R, the CrossTable function in the descr package produces contingency tables.

library(descr)
x_tab <- CrossTable(lc_loans$grade, lc_loans$status,
                   prop.c=FALSE, prop.chisq=FALSE, prop.t=FALSE)

In Python, the pivot_table method with the aggfunc argument produces the table of counts.

crosstab = lc_loans.pivot_table(index='grade', columns='status',
                                aggfunc=lambda x: len(x), margins=True)
# `margins=True` adds the column and row sums

df = crosstab.loc['A':'G', :].copy()  # copy the grade rows of the pivot table
# Divide each row by its row sum
df.loc[:, 'Charged Off':'Late'] = df.loc[:, 'Charged Off':'Late'].div(df['All'], axis=0)
df['All'] = df['All'] / sum(df['All'])  # divide the 'All' column by its sum
perc_crosstab = df
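As an alternative sketch, `pandas.crosstab` builds the same kind of table in one call, and its `normalize='index'` option returns row proportions directly. The small `loans` frame below is hypothetical; only its column names mirror the `lc_loans` example above.

```python
import pandas as pd

# Hypothetical loan records with two categorical variables.
loans = pd.DataFrame({
    'grade':  ['A', 'A', 'A', 'B', 'B', 'B', 'B', 'C'],
    'status': ['Fully Paid', 'Fully Paid', 'Current', 'Charged Off',
               'Fully Paid', 'Current', 'Current', 'Charged Off'],
})

# Counts per (grade, status) pair, with row and column totals in 'All'.
counts = pd.crosstab(loans['grade'], loans['status'], margins=True)

# Row proportions: each grade's row sums to 1.
row_pct = pd.crosstab(loans['grade'], loans['status'], normalize='index')

print(counts)
print(row_pct)
```

This is not the approach used in the snippet above, but it can be a shorter route when only counts and proportions are needed.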

Categorical and Numerical Data

Boxplots provide a straightforward method for visually comparing the distributions of a numeric variable grouped by a categorical variable.

In R

boxplot(pct_carrier_delay ~ airline, data=airline_stats, ylim=c(0, 50))

In Python

ax = airline_stats.boxplot(by='airline', column='pct_carrier_delay')
ax.set_xlabel('')
ax.set_ylabel('Daily % of delayed Flights')
plt.suptitle('')

Violin plots enhance boxplots by providing a density estimate along the y-axis. They reveal distribution nuances often missed by traditional boxplots. In contrast, boxplots effectively highlight outliers within the data set.

Visualizing Multiple Variables

Charts for comparing two variables (scatterplots, hexagonal binning, and boxplots) can be readily extended to more variables through conditioning: the data is split by the values of one or more additional variables, and the same plot is drawn for each subset. This lets us dig deeper once a two-variable view has raised a question, as shown below.
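Conditioning can be sketched without any plotting by repeating the same two-variable summary within each subset. The example below uses a hypothetical `homes` frame (column names and values are assumptions) and pandas `groupby`; for actual plots, a faceting tool such as seaborn's `FacetGrid` draws one panel per subset in the same spirit.

```python
import pandas as pd

# Hypothetical housing records: two numeric variables plus a zip code to condition on.
homes = pd.DataFrame({
    'zip_code': ['98105', '98105', '98105', '98126', '98126', '98126'],
    'sqft':     [1000, 1500, 2000, 1000, 1500, 2000],
    'value':    [400_000, 600_000, 800_000, 200_000, 300_000, 400_000],
})

# The two-variable relationship of interest: value per square foot.
homes['value_per_sqft'] = homes['value'] / homes['sqft']

# Conditioning: compute the same summary separately for each zip code.
by_zip = homes.groupby('zip_code')['value_per_sqft'].mean()

print(by_zip)
```

In this made-up data the overall relationship hides a clear difference between the two zip codes, which is the kind of structure conditioning is meant to reveal.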

Image source is from the book. (Practical Statistics for Data Scientists, 2nd Edition)


Wonha Leah Shin
