Day121 - STAT Review: Regression and Prediction (3)
Practical Statistics for Data Scientists: Factor Variables in Regression, Dummy Variables, Many Levels, and Ordered Factor Variables
Factor Variables in Regression
Factor variables, or categorical variables, have a limited number of discrete values. A binary (yes/no) variable, or indicator variable, is a specific type of factor variable. Since regression requires numerical inputs, factor variables must be converted into binary dummy variables for modeling.
Key Terms for Factor Variables
- Dummy Variables
- Binary variables, represented as 0 and 1, are created by collecting factor data for regression and various modeling applications.
- Reference Coding
- Statisticians commonly use coding where one-factor level serves as a reference for comparisons.
- = treatment coding
- One Hot Encoder
- A common coding type in machine learning retains all factor levels. While useful for some algorithms, it is unsuitable for multiple linear regression.
- Deviation Coding
- A type of coding that compares each level against the overall mean instead of the reference level.
- = sum contrasts
Dummy Variables Representation
For example, previous King County housing data includes a property type variable; below is a subset of six records.
-
In R
head(house[, 'PropertyType']) --- Source: local data frame [6 x 1] PropertyType (fctr) 1 Multiplex 2 Single Family 3 Single Family 4 Single Family 5 Single Family 6 Townhouse
-
In Python
house.PropertyType.head()
There are three possible values: Multiplex
, Single Family
, and Townhouse
. To use this factor variable, we need to convert it into a set of binary variables.
-
In R, we use the
model.matrix
function.prop_type_dummies <- model.matrix(~PropertyType -1, data=house) head(prop_type_dummies) --- PropertyTypeMultiplex PropertyTypeSingle Family PropertyTypeTownhouse 1 1 0 0 2 0 1 0 3 0 1 0 4 0 1 0 5 0 1 0 6 0 0 1
We can confirm that the function
model.matrix
converts a data frame into a matrix suitable for a linear model. The factor variable PropertyType, which has three distinct levels, is represented as a matrix with three columns. -
In Python, we can convert categorical variables to dummies using the pandas method
get_dummies
:pd.get_dummies(house['PropertyType']).head() pd.get_dummies(house['PropertyType'], drop_first=True).head()
The keyword argument
drop_first
returns $P – 1$ columns. Use this to prevent multicollinearity issues.
For specific machine learning models like nearest neighbors and tree models, one-hot encoding is the typical method for representing categorical variables.
In regression, a factor variable with $P$ levels is represented by a matrix of $P – 1$ columns due to the intercept term. Defining $P – 1$ binaries determine the $P$th value, making it redundant. Including the $P$th column leads to multicollinearity errors.
-
In R, the default representation uses the first factor level as a reference for the other levels.
lm(AdjSalePrice ~ SqFtTotLiving + SqFtLot + Bathrooms + Bedrooms + BldgGrade + PropertyType, data=house) --- Call: lm(formula = AdjSalePrice ~ SqFtTotLiving + SqFtLot + Bathrooms + Bedrooms + BldgGrade + PropertyType, data = house) Coefficients: (Intercept) SqFtTotLiving -4.468e+05 2.234e+02 SqFtLot Bathrooms -7.037e-02 -1.598e+04 Bedrooms BldgGrade -5.089e+04 1.094e+05 PropertyTypeSingle Family PropertyTypeTownhouse -8.468e+04 -1.151e+05
-
In Python, the get_dummies method includes the optional keyword argument drop_first to omit the first factor as a reference.
predictors = ['SqFtTotLiving', 'SqFtLot', 'Bathrooms', 'Bedrooms', 'BldgGrade', 'PropertyType'] X = pd.get_dummies(house[predictors], drop_first=True) house_lm_factor = LinearRegression() house_lm_factor.fit(X, house[outcome]) print(f'Intercept: {house_lm_factor.intercept_:.3f}') print('Coefficients:') for name, coef in zip(X.columns, house_lm_factors.coef_): print(f' {name}: {coef}')
The R regression results display two PropertyType coefficients: PropertyTypeSingle Family
and PropertyTypeTownhouse
. Multiplex
has no coefficient, as it’s implicitly defined when PropertyTypeSingle Family == 0
and PropertyTypeTownhouse == 0
. These coefficients are interpreted relative to Multiplex
; therefore, a Single Family
home is valued at nearly $85,000 less, while a Townhouse
is valued at over $150,000 less.
Various methods exist to encode factor variables, called contrast coding systems. For instance, deviation coding, or sum contrasts, evaluates each level to the overall mean. Another method is polynomial coding, which is suitable for ordered factors.
Factor Variables with Many Levels
Factor variables can create numerous binary dummies; zip codes, for example, number 43,000 in the US. Thus, analyzing how predictor variables relate to the outcome is crucial to determine if the categories provide valuable information. If they do, decide whether to retain all factors or combine some levels.
In the previous example dataset, King County has 80 zip codes with house sales.
ZipCode
is crucial as it proxies the effect of location on house value. Including all levels requires $79$ coefficients, representing $79$ degrees of freedom. Many zip codes have only one sale. Consolidating a zip code to the first two or three digits is one way to combine it into some levels. However, in King County, most sales are in 980xx or 981xx, making this method less useful.
Grouping zip codes by sales price or model residuals is a better analysis method.
-
The
dplyr
code in R consolidates 80 zip codes into five groups based on the median residual from thehouse_lm
regression:zip_groups <- house %>% mutate(resid = residuals(house_lm)) %>% group_by(ZipCode) %>% summarize(med_resid = median(resid), cnt = n()) %>% arrange(med_resid) %>% mutate(cum_cnt = cumsum(cnt), ZipGroup = ntile(cum_cnt, 5)) house <- house %>% left_join(select(zip_groups, ZipCode, ZipGroup), by='ZipCode')
The median residual is calculated for each zip code, and the
ntile
function divides the zip codes, ordered by the median, into five groups. -
In Python, we can compute this information as follows.
zip_groups = pd.Dataframe([ *pd.DataFrame({ 'ZipCode': house['ZipCode'], 'residual' : house[outcome] - house_lm.predict(house[predictors]), }) .groupby(['ZipCode']) .apply(lambda x: { 'ZipCode': x.iloc[0,0], 'count': lent(x), 'median_residual': x.residual.median() }) ]).sort_values('median_residual') zip_groups['cum_count'] = np.cumsum(zip_groups['count']) zip_groups['ZipGroup'] = pd.qcut(zip_groups['cum_count'], 5, labels=False, retbins=False) to_join = zip_groups[['ZipCode', 'ZipGroup']].set_index('ZipCode') house = house.join(to_join, on='ZipCode') house['ZipGroup'] = house['ZipGroup'].astype('category')
Ordered Factor Variables
Some factor variables denote levels of a factor and are referred to as ordered factor variables or ordered categorical variables. For example, a loan grade may be A, B, C, etc.—with each grade representing more risk than the preceding one. Often, ordered factor variables can be converted to numerical values and utilized as such. Treating ordered factors as numeric variables maintains the information contained in the ordering that would be lost if it were transformed into a mere factor.
Leave a comment