Day109 - STAT Review: Exploratory Data Analysis (1)
Practical Statistics of Data Scientists: Elements of Statistical Terminologies- Data Statistics Fundamentals, Data Types, and Estimates of Location & Variability
Book Source: Practical Statistics for Data Scientists, 2nd Edition, by Peter Bruce, Andrew Bruce, Peter Gedeck / Released May 2020 / Publisher(s): O’Reilly Media, Inc.
Statisticians commonly refer to an “estimate” as a value derived from the available data, clearly distinguishing between observed data and the theoretical actual state of affairs. In contrast, data scientists and business analysts call this value a “metric.”
Elements of Structured Data
Key Terms for Data Types
- Numeric: Data that are expressed on a numerical scale
- Continuous: Data that can take on any value in an interval. (Synonyms: interval, float, numeric)
- Discrete: Data that can take on only integer values, such as counts. (Synonyms: integer, count)
- Categorical: Data that can take on only a specific set of values representing a set of possible categories (Synonyms: Enums, enumerated, factors, nominal)
- Binary: A special case of categorical data with just two categories of values, e.g., 0/1, true/false. (Synonyms: dichotomous, logical, indicator, boolean)
- Ordinal: Categorical data that has an explicit ordering (Synonym: ordered factor)
Rectangular Data
Rectangular data is the general term for a two-dimensional matrix with rows indicating records (cases) and columns indicating features (variables); data frame is the specific format in R and Python.
The data doesn’t always start in this form: unstructured data(e.g., text) must be processed and manipulated to represent a set of features in the rectangular data.
Key Terms for Rectangular Data
- Data Frame: Rectangular data (like a spreadsheet) is the basic data structure for statistical and machine learning models.
- Feature: A column within a table is commonly called a feature.
- Synonyms: attribute, input, predictor, variable
- Outcome: Many data science projects focus on predicting outcomes, frequently in a yes/no format. Features are sometimes employed to predict results in experiments or studies.
- Synonyms: dependent variables, response, target, output
- Records: A row within a table is commonly called a record.
- Synonyms: case, instance, observation, pattern, sample
Data Frames and Indexes
- In Python, with the pandas library, the basic rectangular data structure is a DataFrame object.
- Default: an automatic integer index is created
- In R, the basic rectangular data structure: data.frame object.
- Default: an implicit integer index
Nonrectangular Data Structure
- Time Series Data: raw material for statistical forecasting methods such as IoT.
- Spatial Data: Used in mapping and location analytics
- Object: the focus of the data is an object (e.g., a house) and its spatial coordinates
- Fiield: focuses on small units of space and the value of a relevant metric (e.g., pixel brightness)
- Graph or Network: physical, social, and abstract relationships
Estimates of Location
Key Terms for Estimates of Location
- Mean: The sum of all values divided by the number of values.
- = average
- Weighted mean: The sum of all values times a weight divided by the sum of the weights.
- = weighted average
- Median: The value such that one-half of the data lies above and below.
- = 50% percentile
- Percentile: The value such that $P$ percent of the data lies below.
- = quantile
- Weighted median: The values are defined such that one-half of the sum of the weights lies above and below the sorted data.
- Trimmed mean: The average of all values after dropping a fixed number of extreme values.
- = truncated mean.
- Robust: Not sensitive to extreme values.
- = resistant
- Outlier: A data value that is very different from most of the data
- = extreme value
- = extreme value
Median and Robust Estimates
- Median: Unlike the mean, which considers all observations, the median focuses solely on the central values of sorted data. Although this might appear to be a drawback, given that the mean is more responsive to variations, there are numerous cases where the median serves as a more effective measure of central tendency.
- The same applies to the Weighted Mean.
- Weighted Median: Similar to calculating the median, we start by sorting the data and assigning each value a specific weight. Rather than identifying the middle number, the weighted median represents a value where the total of the weights is balanced between the lower and upper halves of the ordered list.
- Like the median, the weighted median is robust to the outliers.
- Like the median, the weighted median is robust to the outliers.
Estimates of Variability
It is known as dispersion and assesses whether data values are closely grouped or widely distributed.
At the core of statistics is variability, which involves measuring, reducing, and differentiating between random and genuine variability. It also includes identifying the various sources of true variability and making informed decisions in its presence.
Key Terms for Estimates of Variability
- Deviations: The difference between the observation and estimate of the location.
- = errors, residuals
- Variance: The sum of squared deviations from the mean divided by $n-1$ where $n$ is the number of data values.
- = mean-squared-error
- Standard deviation: The square root of the variance
- Mean Absolute Deviation: The mean of the absolute values of the deviations from the mean.
- = L1-norm, Manhattan Norm
- Mean absolute deviation from the median: The median of the absolute values of the deviations from the median.
- Range: The difference between the largest and the smallest value in a data set.
- Order statistics: Metrics based on the data values sorted from smallest to biggest.
- = ranks
- Percentile: The value such that $P$ percent of the values take on this value or less, and $(100-P)$ Percent take on this value or more.
- = quantile
- Interquartile Range: The difference between the 75th percentile and the 25th percentile
- = IQR
- = IQR
Standard Deviation and Related Estimates
The deviations indicate the spread of the data around the central value.
- The typical value for these deviations is the average, but it is not particularly meaningful.
- Mean Absolute Deviation = $\sum^n_{i=1} \vert x_i-\hat{x} \vert $, where $\hat{x}$ is the sample mean.
- Variance = $s^2$ = $\frac{\sum^{n}_{i=1}(x_i-\hat{x})^2}{n-1}$
- Note that the denominator is $n-1$, not $n$ (due to the degree of freedom)
- Standard Deviation = $s$ = $\sqrt{\text{Variance}}$
- The standard deviation is much easier to understand than the variance because it shares the same scale as the original data.
- The variance and standard deviation are especially sensitive to outliers since they are based on the squared deviations.
- Median Absolute Deviation = $\text{Median} = ( \vert x_1-m \vert,\ \vert x_2-m \vert,\ \dots \ ,\vert x_N-m \vert)$
- More robust estimate of variability as MAD is not influenced by extreme values like the median.
- More robust estimate of variability as MAD is not influenced by extreme values like the median.
Example Code Snippet
-
Use R’s built-in functions for sd, IQR, MAD(Median Absolute Deviation)
sd(state[['population']]) IQR(state[['population']]) MAD(state[['population']])
-
The pandas data frame provides SD, and quantile, so we can calculate IQR based on it. For the MAD, we use the function
robust.scale.mad
from thestatsmodel
package.state['population'].std() state['population'].quantile(0.75) - state['population'].quantile(0.25) robust.scale.mad(state('population'))
Leave a comment