
Practical Statistics for Data Scientists: Factor Variables in Regression, Dummy Variables, Many Levels, and Ordered Factor Variables


Factor Variables in Regression

Factor variables, or categorical variables, have a limited number of discrete values. A binary (yes/no) variable, or indicator variable, is a specific type of factor variable. Since regression requires numerical inputs, factor variables must be converted into binary dummy variables for modeling.

Key Terms for Factor Variables

  • Dummy Variables
    • Binary variables, coded as 0 and 1, created by recoding factor data for use in regression and other models.
  • Reference Coding
    • The coding most commonly used by statisticians, in which one level of a factor serves as the reference against which the other levels are compared.
    • = treatment coding
  • One Hot Encoder
    • A common coding type in machine learning retains all factor levels. While useful for some algorithms, it is unsuitable for multiple linear regression.
  • Deviation Coding
    • A type of coding that compares each level against the overall mean instead of the reference level.
    • = sum contrasts

Dummy Variables Representation

For example, the King County housing data used earlier includes a property type variable; below is a subset of six records.

  • In R

    head(house[, 'PropertyType'])
    ---
    Source: local data frame [6 x 1]
      
       PropertyType
             (fctr)
    1     Multiplex
    2 Single Family
    3 Single Family
    4 Single Family
    5 Single Family
    6     Townhouse
    
  • In Python

    house.PropertyType.head()
    

There are three possible values: Multiplex, Single Family, and Townhouse. To use this factor variable, we need to convert it into a set of binary variables.

  • In R, we use the model.matrix function.

    prop_type_dummies <- model.matrix(~PropertyType -1, data=house)
    head(prop_type_dummies)
    ---
      PropertyTypeMultiplex PropertyTypeSingle Family PropertyTypeTownhouse
    1                     1                         0                     0
    2                     0                         1                     0
    3                     0                         1                     0
    4                     0                         1                     0
    5                     0                         1                     0
    6                     0                         0                     1
    

    The function model.matrix converts a data frame into a matrix suitable for a linear model. Here the factor variable PropertyType, which has three distinct levels, is represented as a matrix with three columns.

  • In Python, we can convert categorical variables to dummies using the pandas method get_dummies:

    pd.get_dummies(house['PropertyType']).head()
    pd.get_dummies(house['PropertyType'], drop_first=True).head()
    

    Setting the keyword argument drop_first=True returns $P - 1$ columns instead of $P$. Use this to avoid multicollinearity when the model includes an intercept.

For certain machine learning algorithms, such as nearest neighbors and tree models, one-hot encoding is the standard way to represent factor variables.

In regression, a factor variable with $P$ levels is represented by a matrix with only $P - 1$ columns, because the model typically includes an intercept term. Once the values of $P - 1$ binaries are known, the $P$th is determined and is therefore redundant. Including the $P$th column as well causes a multicollinearity error.
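The redundancy can be checked numerically. The sketch below uses a made-up four-row column of property types (toy data, not the King County file) and compares the rank of the design matrix with and without the extra dummy column. With an intercept, the full set of dummies sums to the column of ones, so one column adds no information.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({'PropertyType': ['Multiplex', 'Single Family',
                                     'Townhouse', 'Single Family']})

# P = 3 dummy columns versus P - 1 = 2 dummy columns
full = pd.get_dummies(toy['PropertyType'], dtype=float)
reduced = pd.get_dummies(toy['PropertyType'], drop_first=True, dtype=float)

# Prepend an intercept column of ones to each design matrix
intercept = np.ones((len(toy), 1))
X_full = np.hstack([intercept, full.to_numpy()])
X_reduced = np.hstack([intercept, reduced.to_numpy()])

rank_full = np.linalg.matrix_rank(X_full)
rank_reduced = np.linalg.matrix_rank(X_reduced)

print(f'full:    {X_full.shape[1]} columns, rank {rank_full}')
print(f'reduced: {X_reduced.shape[1]} columns, rank {rank_reduced}')
```

The full matrix has four columns but only rank three, so its coefficients are not uniquely determined. The reduced matrix has full column rank.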

  • In R, the default representation uses the first factor level as a reference for the other levels.

    lm(AdjSalePrice ~ SqFtTotLiving + SqFtLot + Bathrooms +
       Bedrooms + BldgGrade + PropertyType, data=house)
    ---
    Call:
    lm(formula = AdjSalePrice ~ SqFtTotLiving + SqFtLot + Bathrooms + Bedrooms + BldgGrade + PropertyType, data = house)
      
    Coefficients:
                  (Intercept)              SqFtTotLiving
                   -4.468e+05                  2.234e+02
                      SqFtLot                  Bathrooms
                   -7.037e-02                 -1.598e+04
                     Bedrooms                  BldgGrade
                   -5.089e+04                  1.094e+05
    PropertyTypeSingle Family      PropertyTypeTownhouse
                   -8.468e+04                 -1.151e+05
    
  • In Python, set the get_dummies keyword argument drop_first=True to omit the first factor level, which then serves as the reference.

    import pandas as pd
    from sklearn.linear_model import LinearRegression

    predictors = ['SqFtTotLiving', 'SqFtLot', 'Bathrooms', 'Bedrooms', 'BldgGrade', 'PropertyType']
    outcome = 'AdjSalePrice'

    # drop_first=True drops the first level (Multiplex), which becomes the reference
    X = pd.get_dummies(house[predictors], drop_first=True)

    house_lm_factor = LinearRegression()
    house_lm_factor.fit(X, house[outcome])

    print(f'Intercept: {house_lm_factor.intercept_:.3f}')
    print('Coefficients:')
    for name, coef in zip(X.columns, house_lm_factor.coef_):
        print(f' {name}: {coef}')
    

The R regression results display two PropertyType coefficients: PropertyTypeSingle Family and PropertyTypeTownhouse. Multiplex has no coefficient because it is implicitly defined when PropertyTypeSingle Family == 0 and PropertyTypeTownhouse == 0. The coefficients are interpreted relative to Multiplex: holding the other predictors fixed, a Single Family home is valued at almost $85,000 less and a Townhouse at over $115,000 less.
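The arithmetic behind this interpretation can be written out as a short sketch. The coefficients are copied from the R output above; the dictionary and the helper price_shift are names made up for illustration.

```python
# Coefficients from the R output; Multiplex is the reference level (effect 0)
property_type_effect = {
    'Multiplex': 0.0,
    'Single Family': -8.468e+04,
    'Townhouse': -1.151e+05,
}

def price_shift(from_type, to_type):
    """Predicted change in AdjSalePrice when only the property type changes."""
    return property_type_effect[to_type] - property_type_effect[from_type]

multiplex_to_single = price_shift('Multiplex', 'Single Family')
single_to_townhouse = price_shift('Single Family', 'Townhouse')

print(multiplex_to_single)
print(single_to_townhouse)
```

Comparing two non-reference levels means subtracting their coefficients, which is why the choice of reference level changes the reported coefficients but not the predicted differences between levels.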

There are several other ways to encode factor variables, known as contrast coding systems. For instance, deviation coding, also called sum contrasts, compares each level against the overall mean. Another is polynomial coding, which is suitable for ordered factors.
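As a rough illustration of deviation coding, the sketch below builds the sum-contrast columns by hand for a toy factor with levels A, B, and C (made-up, balanced data) and fits the model with ordinary least squares. The helper sum_contrast_column is a name invented for this example. With this coding, the intercept is the mean of the level means and each coefficient is that level's deviation from it.

```python
import numpy as np

# Toy outcome values for three levels (illustrative numbers only)
levels = np.array(['A', 'A', 'B', 'B', 'C', 'C'])
y = np.array([10.0, 12.0, 20.0, 22.0, 30.0, 32.0])

def sum_contrast_column(level):
    """+1 for rows at `level`, -1 for rows at the last level 'C', 0 otherwise."""
    return np.where(levels == level, 1.0,
                    np.where(levels == 'C', -1.0, 0.0))

# Design matrix: intercept plus P - 1 = 2 sum-contrast columns
X = np.column_stack([np.ones(len(y)),
                     sum_contrast_column('A'),
                     sum_contrast_column('B')])
intercept, coef_a, coef_b = np.linalg.lstsq(X, y, rcond=None)[0]

# Compare with the level means computed directly
level_means = {lvl: y[levels == lvl].mean() for lvl in ['A', 'B', 'C']}
grand_mean = np.mean(list(level_means.values()))

print(f'intercept={intercept:.2f}, grand mean={grand_mean:.2f}')
print(f'coef A={coef_a:.2f}, mean A - grand mean={level_means["A"] - grand_mean:.2f}')
```

The deviation for level C is not estimated directly; it equals the negative of the sum of the other coefficients.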

Factor Variables with Many Levels

A factor variable can produce a huge number of binary dummies: zip codes are a factor variable, and there are 43,000 of them in the US. In such cases, explore how the predictor relates to the outcome to determine whether the categories carry useful information. If they do, decide whether to retain all levels or to combine some of them.

In the King County data used in the earlier examples, 80 zip codes have house sales.

ZipCode is an important variable because it is a proxy for the effect of location on house value. Including all levels requires $79$ coefficients, corresponding to $79$ degrees of freedom. Many zip codes have only one sale. One way to consolidate the levels is to keep only the first two or three digits of each zip code. However, in King County, most sales are in 980xx or 981xx, so this approach does not help much.
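A quick sketch of the prefix approach shows why it collapses too far here. The zip codes below are hypothetical values in the 980xx and 981xx ranges, not the actual sales records.

```python
import pandas as pd

# Hypothetical zip codes for illustration only
sales = pd.DataFrame({'ZipCode': [98001, 98002, 98052, 98103, 98105, 98199]})

# Keep only the first three digits as a coarser grouping
sales['ZipPrefix'] = sales['ZipCode'].astype(str).str[:3]

n_zip = sales['ZipCode'].nunique()
n_prefix = sales['ZipPrefix'].nunique()

print(f'{n_zip} zip codes collapse into {n_prefix} prefixes')
print(sales['ZipPrefix'].value_counts().to_dict())
```

With only two prefixes left, nearly all of the location detail is gone.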

A better approach is to group the zip codes by another variable, such as sale price, or by the residuals from an initial model.

  • The dplyr code in R consolidates 80 zip codes into five groups based on the median residual from the house_lm regression:

    library(dplyr)

    zip_groups <- house %>%
      mutate(resid = residuals(house_lm)) %>%
      group_by(ZipCode) %>%
      summarize(med_resid = median(resid),
                cnt = n()) %>%
      # sort the zip codes by median residual, then cut them into five groups
      arrange(med_resid) %>%
      mutate(cum_cnt = cumsum(cnt),
             ZipGroup = ntile(cum_cnt, 5))
    house <- house %>%
      left_join(select(zip_groups, ZipCode, ZipGroup), by='ZipCode')
    

    The median residual is calculated for each zip code, and the ntile function divides the zip codes, ordered by the median, into five groups.

  • In Python, we can compute this information as follows.

    import pandas as pd

    # Residuals of the earlier house_lm fit, one row per sale
    # (predictors must be the columns house_lm was fitted on)
    residuals = pd.DataFrame({
        'ZipCode': house['ZipCode'],
        'residual': house[outcome] - house_lm.predict(house[predictors]),
    })
    # Number of sales and median residual per zip code, sorted by the median
    zip_groups = (residuals.groupby('ZipCode')['residual']
                  .agg(count='size', median_residual='median')
                  .reset_index()
                  .sort_values('median_residual'))
    zip_groups['cum_count'] = zip_groups['count'].cumsum()
    zip_groups['ZipGroup'] = pd.qcut(zip_groups['cum_count'], 5, labels=False,
                                     retbins=False)

    # Attach each sale's ZipGroup and treat it as a factor
    to_join = zip_groups[['ZipCode', 'ZipGroup']].set_index('ZipCode')
    house = house.join(to_join, on='ZipCode')
    house['ZipGroup'] = house['ZipGroup'].astype('category')
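
To make the grouping step concrete, here is a small sketch on a made-up per-zip summary table (the zip labels, counts, and residuals are all hypothetical). Because the cumulative count increases along the sorted rows, cutting it at quantiles splits the zip codes, in order of median residual, into groups containing the same number of zip codes.

```python
import pandas as pd

# Hypothetical per-zip summary, already in order of median residual
zip_summary = pd.DataFrame({
    'ZipCode': [f'z{i}' for i in range(10)],
    'count': [5, 1, 30, 12, 1, 50, 8, 20, 3, 40],
    'median_residual': [-90_000, -60_000, -40_000, -20_000, -5_000,
                        0, 10_000, 25_000, 50_000, 120_000],
}).sort_values('median_residual')

# The running total of sales is strictly increasing along the sorted rows
zip_summary['cum_count'] = zip_summary['count'].cumsum()

# Five quantile bins over cum_count: two zip codes per group here
zip_summary['ZipGroup'] = pd.qcut(zip_summary['cum_count'], 5, labels=False)

print(zip_summary[['ZipCode', 'median_residual', 'cum_count', 'ZipGroup']])
```

Group 0 holds the zip codes with the most negative median residuals, where the model overpredicts the most, and group 4 those where it underpredicts the most.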
    



Ordered Factor Variables

Some factor variables have levels that follow a natural order; these are called ordered factor variables or ordered categorical variables. For example, a loan grade may be A, B, C, and so on, with each grade carrying more risk than the one before. Ordered factor variables can often be converted to numerical values and used as such. Treating an ordered factor as a numeric variable preserves the information in the ordering, which would be lost if it were converted to an unordered factor.
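One way to do this conversion in pandas is sketched below, using hypothetical loan grades. The grade order is passed explicitly so that the numeric codes follow the risk ordering.

```python
import pandas as pd

# Hypothetical loan grades; A is the least risky grade
grades = pd.Series(['B', 'A', 'C', 'A', 'D'])

# Declare the order of the levels explicitly
ordered = pd.Categorical(grades, categories=['A', 'B', 'C', 'D'], ordered=True)

# Integer codes start at 0, so add 1 to get A -> 1, B -> 2, ...
grade_num = pd.Series(ordered.codes) + 1

print(grade_num.tolist())  # [2, 1, 3, 1, 4]
```

Using the codes as a numeric predictor assumes that the steps between adjacent grades are equally spaced, which is a modeling choice rather than something the data guarantees.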


