Day109 - STAT Review: Exploratory Data Analysis (1)

January 25, 2025 5 minute read

Practical Statistics of Data Scientists: Elements of Statistical Terminologies- Data Statistics Fundamentals, Data Types, and Estimates of Location & Variability

D6B69366-54F9-40C8-B0F7-E5B051CDA73C_1_105_c

Book Source: Practical Statistics for Data Scientists, 2nd Edition, by Peter Bruce, Andrew Bruce, Peter Gedeck / Released May 2020 / Publisher(s): O’Reilly Media, Inc.

Statisticians commonly refer to an “estimate” as a value derived from the available data, clearly distinguishing between observed data and the theoretical actual state of affairs. In contrast, data scientists and business analysts call this value a “metric.”

Elements of Structured Data

Key Terms for Data Types

Numeric: Data that are expressed on a numerical scale
- Continuous: Data that can take on any value in an interval. (Synonyms: interval, float, numeric)
- Discrete: Data that can take on only integer values, such as counts. (Synonyms: integer, count)
Categorical: Data that can take on only a specific set of values representing a set of possible categories (Synonyms: Enums, enumerated, factors, nominal)
- Binary: A special case of categorical data with just two categories of values, e.g., 0/1, true/false. (Synonyms: dichotomous, logical, indicator, boolean)
- Ordinal: Categorical data that has an explicit ordering (Synonym: ordered factor)

Rectangular Data

Rectangular data is the general term for a two-dimensional matrix with rows indicating records (cases) and columns indicating features (variables); data frame is the specific format in R and Python.

The data doesn’t always start in this form: unstructured data(e.g., text) must be processed and manipulated to represent a set of features in the rectangular data.

Key Terms for Rectangular Data

Data Frame: Rectangular data (like a spreadsheet) is the basic data structure for statistical and machine learning models.
Feature: A column within a table is commonly called a feature.
- Synonyms: attribute, input, predictor, variable
Outcome: Many data science projects focus on predicting outcomes, frequently in a yes/no format. Features are sometimes employed to predict results in experiments or studies.
- Synonyms: dependent variables, response, target, output
Records: A row within a table is commonly called a record.
- Synonyms: case, instance, observation, pattern, sample

Data Frames and Indexes

In Python, with the pandas library, the basic rectangular data structure is a DataFrame object.
- Default: an automatic integer index is created
In R, the basic rectangular data structure: data.frame object.
- Default: an implicit integer index

Nonrectangular Data Structure

Time Series Data: raw material for statistical forecasting methods such as IoT.
Spatial Data: Used in mapping and location analytics
- Object: the focus of the data is an object (e.g., a house) and its spatial coordinates
- Fiield: focuses on small units of space and the value of a relevant metric (e.g., pixel brightness)
Graph or Network: physical, social, and abstract relationships

Estimates of Location

Key Terms for Estimates of Location

Mean: The sum of all values divided by the number of values.
- = average
Weighted mean: The sum of all values times a weight divided by the sum of the weights.
- = weighted average
Median: The value such that one-half of the data lies above and below.
- = 50% percentile
Percentile: The value such that $P$ percent of the data lies below.
- = quantile
Weighted median: The values are defined such that one-half of the sum of the weights lies above and below the sorted data.
Trimmed mean: The average of all values after dropping a fixed number of extreme values.
- = truncated mean.
Robust: Not sensitive to extreme values.
- = resistant
Outlier: A data value that is very different from most of the data
- = extreme value

Median and Robust Estimates

Median: Unlike the mean, which considers all observations, the median focuses solely on the central values of sorted data. Although this might appear to be a drawback, given that the mean is more responsive to variations, there are numerous cases where the median serves as a more effective measure of central tendency.
- The same applies to the Weighted Mean.
Weighted Median: Similar to calculating the median, we start by sorting the data and assigning each value a specific weight. Rather than identifying the middle number, the weighted median represents a value where the total of the weights is balanced between the lower and upper halves of the ordered list.
- Like the median, the weighted median is robust to the outliers.

Estimates of Variability

It is known as dispersion and assesses whether data values are closely grouped or widely distributed.

At the core of statistics is variability, which involves measuring, reducing, and differentiating between random and genuine variability. It also includes identifying the various sources of true variability and making informed decisions in its presence.

Key Terms for Estimates of Variability

Deviations: The difference between the observation and estimate of the location.
- = errors, residuals
Variance: The sum of squared deviations from the mean divided by $n-1$ where $n$ is the number of data values.
- = mean-squared-error
Standard deviation: The square root of the variance
Mean Absolute Deviation: The mean of the absolute values of the deviations from the mean.
- = L1-norm, Manhattan Norm
Mean absolute deviation from the median: The median of the absolute values of the deviations from the median.
Range: The difference between the largest and the smallest value in a data set.
Order statistics: Metrics based on the data values sorted from smallest to biggest.
- = ranks
Percentile: The value such that $P$ percent of the values take on this value or less, and $(100-P)$ Percent take on this value or more.
- = quantile
Interquartile Range: The difference between the 75th percentile and the 25th percentile
- = IQR

The deviations indicate the spread of the data around the central value.

The typical value for these deviations is the average, but it is not particularly meaningful.
Mean Absolute Deviation = $\sum^n_{i=1} \vert x_i-\hat{x} \vert $, where $\hat{x}$ is the sample mean.
Variance = $s^2$ = $\frac{\sum^{n}_{i=1}(x_i-\hat{x})^2}{n-1}$
- Note that the denominator is $n-1$, not $n$ (due to the degree of freedom)
Standard Deviation = $s$ = $\sqrt{\text{Variance}}$
- The standard deviation is much easier to understand than the variance because it shares the same scale as the original data.
The variance and standard deviation are especially sensitive to outliers since they are based on the squared deviations.
Median Absolute Deviation = $\text{Median} = ( \vert x_1-m \vert,\ \vert x_2-m \vert,\ \dots \ ,\vert x_N-m \vert)$
- More robust estimate of variability as MAD is not influenced by extreme values like the median.

Example Code Snippet

Use R’s built-in functions for sd, IQR, MAD(Median Absolute Deviation)

sd(state[['population']])
IQR(state[['population']])
MAD(state[['population']])

The pandas data frame provides SD, and quantile, so we can calculate IQR based on it. For the MAD, we use the function robust.scale.mad from the statsmodel package.
```
state['population'].std()
state['population'].quantile(0.75) - state['population'].quantile(0.25)
robust.scale.mad(state('population'))
```

Share on

Twitter Facebook LinkedIn

Wonha Leah Shin

Day109 - STAT Review: Exploratory Data Analysis (1)

Practical Statistics of Data Scientists: Elements of Statistical Terminologies- Data Statistics Fundamentals, Data Types, and Estimates of Location & Variability

Elements of Structured Data

Key Terms for Data Types

Rectangular Data

Key Terms for Rectangular Data

Data Frames and Indexes

Nonrectangular Data Structure

Estimates of Location

Key Terms for Estimates of Location

Median and Robust Estimates

Estimates of Variability

Key Terms for Estimates of Variability

Example Code Snippet

Share on

Leave a comment

You may also enjoy

Day175 - MLOps Review: Data Distribution Shifts And Monitoring (2)

Day174 - MLOps Review: Data Distribution Shifts And Monitoring (1)

Day173 - MLOps Review: Model Deployment And Prediction Service (3)

Day172 - MLOps Review: Model Deployment and Prediction Service (2)

Wonha Leah Shin

Practical Statistics of Data Scientists: Elements of Statistical Terminologies- Data Statistics Fundamentals, Data Types, and Estimates of Location & Variability

Elements of Structured Data

Key Terms for Data Types

Rectangular Data

Key Terms for Rectangular Data

Data Frames and Indexes

Nonrectangular Data Structure

Estimates of Location

Key Terms for Estimates of Location

Median and Robust Estimates

Estimates of Variability

Key Terms for Estimates of Variability

Standard Deviation and Related Estimates

Example Code Snippet

Share on

Leave a comment

You may also enjoy

Day175 - MLOps Review: Data Distribution Shifts And Monitoring (2)

Day174 - MLOps Review: Data Distribution Shifts And Monitoring (1)

Day173 - MLOps Review: Model Deployment And Prediction Service (3)

Day172 - MLOps Review: Model Deployment and Prediction Service (2)