
SSE (Sum of Squared Errors) and Choosing a Cost Function


The Sum of Squared Errors (SSE)

SSE is another common measure used to evaluate the performance of a linear regression model, closely related to the Mean Squared Error (MSE). While MSE is the average of the squared differences between the predicted and actual values, SSE is the total of these squared differences.

The SSE can be mathematically represented as follows:

$$ \text{SSE} = \sum_{i=1}^n (y_i-\hat{y_i})^2 $$


Where:

  • $y_i$ is the actual value of the dependent variable for the $i^{th}$ observation,
  • $\hat{y_i}$ is the predicted value for the $i^{th}$ observation,
  • $n$ is the number of observations.
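
A minimal sketch of this computation (hypothetical values, plain NumPy) looks like this:

```python
import numpy as np

# Hypothetical actual and predicted values for illustration
y_actual = np.array([3.0, 5.0, 7.5, 9.0])
y_pred = np.array([2.8, 5.4, 7.0, 9.5])

# SSE: sum of the squared residuals
residuals = y_actual - y_pred
sse = np.sum(residuals ** 2)
print(f"SSE = {sse:.2f}")  # 0.70
```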


Purpose of SSE

  • Model Accuracy: SSE measures the total error of the predictions. A lower SSE indicates a model that more accurately predicts the dependent variable.
  • Model Optimization: When training a linear regression model, minimizing the SSE is equivalent to finding the line of best fit. This is typically done through optimization techniques such as gradient descent.
  • Comparing Models: SSE can be used to compare the performance of different models on the same dataset. The model with the lower SSE is generally considered better at prediction.


Why Use SSE?

  • Simplicity and Differentiability: SSE is mathematically simple and smooth, which makes it straightforward to apply optimization algorithms that require derivatives, such as gradient descent.

  • Emphasis on Larger Errors: SSE gives greater weight to larger errors by squaring the residuals. This characteristic makes the model sensitive to outliers but also ensures that the fitting process focuses on reducing larger errors more significantly.
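
To see how squaring emphasizes large residuals, consider this toy comparison (the numbers are made up purely for illustration): a single residual of 10 contributes as much to the SSE as one hundred residuals of 1.

```python
import numpy as np

# One residual of 10 contributes as much to SSE as 100 residuals of 1
small_errors = np.ones(100)       # 100 residuals of size 1
large_error = np.array([10.0])    # a single residual of size 10

print(np.sum(small_errors ** 2))  # 100.0
print(np.sum(large_error ** 2))   # 100.0
```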


Characteristics of SSE

  • Scale Dependency: Unlike MSE, SSE depends on the number of data points. Larger datasets tend to have a larger SSE simply because more errors are being summed (illustrated in the sketch after this list).
  • Sensitivity to Outliers: Like MSE, SSE is sensitive to outliers because the errors are squared. This means outliers disproportionately affect the total SSE, which can skew the model fitting.
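
The scale dependency is easy to demonstrate: in the sketch below (hypothetical residuals), duplicating every observation doubles the SSE while the MSE stays unchanged.

```python
import numpy as np

residuals = np.array([0.5, -1.0, 0.25, 0.75])
doubled = np.tile(residuals, 2)  # the same errors, twice as many observations

print(np.sum(residuals ** 2), np.mean(residuals ** 2))  # 1.875  0.46875
print(np.sum(doubled ** 2), np.mean(doubled ** 2))      # 3.75   0.46875
```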


Relation to Other Metrics

SSE is directly related to MSE and the coefficient of determination ($R^2$):

  • MSE Relation: MSE is simply SSE divided by the number of observations $n$ $(MSE = SSE/n)$. MSE is therefore the average squared error, which makes it comparable across datasets of different sizes.
  • $R^2$ Relation: $R^2$ indicates how well the observed outcomes are replicated by the model, based on the proportion of the total variation in the outcomes that the model explains. It is calculated as $R^2 = 1 - SSE/SST$, where $SST$ (the Total Sum of Squares) measures the total variability of the dataset around its mean.
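
Putting these relations together, here is a minimal sketch (hypothetical data, plain NumPy) that computes the SSE and then derives MSE and $R^2$ from it:

```python
import numpy as np

# Hypothetical observations and model predictions
y = np.array([3.0, 5.0, 7.5, 9.0, 11.0])
y_hat = np.array([2.8, 5.4, 7.0, 9.5, 10.6])

sse = np.sum((y - y_hat) ** 2)        # sum of squared errors
mse = sse / len(y)                    # MSE = SSE / n
sst = np.sum((y - np.mean(y)) ** 2)   # total sum of squares around the mean
r2 = 1 - sse / sst                    # coefficient of determination

print(f"SSE={sse:.3f}, MSE={mse:.3f}, R^2={r2:.4f}")
```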


How can we estimate the lowest cost?

  • Direct Minimization: in linear regression, this typically means solving the normal equations obtained by setting the gradient of the cost function to zero, which gives the best-fit coefficients in closed form (a sketch comparing both approaches follows this list).

  • Gradient Descent: an iterative optimization algorithm used in machine learning to find the minimum of a function. In linear regression, it finds the coefficients that minimize the cost function by repeatedly stepping in the direction of steepest descent, defined by the negative of the gradient. (The gradient is a vector-valued function that gives the slope of the tangent to the function's graph and points in the direction of the greatest rate of increase.)

  • Choosing Between Them

    • Dataset Size and Features: Direct minimization is more suitable for small to moderate-sized datasets, where the required matrix operations are computationally affordable. Gradient descent is preferred for very large datasets or when the number of features is very large, since solving the normal equations scales poorly as the feature count grows.
    • Precision vs. Scalability: Direct minimization is ideal if precision is crucial and the dataset is well-conditioned (e.g., no issues with multicollinearity). Gradient descent offers scalability and flexibility at the cost of requiring careful tuning and potentially slower convergence.
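
As a rough comparison of the two approaches, the sketch below (synthetic data; the learning rate and iteration count are arbitrary choices for illustration) fits a simple linear regression once by solving the normal equations and once with batch gradient descent; both should arrive at essentially the same coefficients.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = 2 + 3x + noise
x = rng.uniform(0, 10, size=100)
y = 2.0 + 3.0 * x + rng.normal(0.0, 1.0, size=100)
X = np.column_stack([np.ones_like(x), x])        # design matrix with an intercept column

# 1) Direct minimization: solve the normal equations (X^T X) beta = X^T y
beta_direct = np.linalg.solve(X.T @ X, X.T @ y)

# 2) Batch gradient descent on the MSE cost (SSE / n)
beta_gd = np.zeros(2)
learning_rate = 0.01                             # hypothetical choice; must be small enough to converge
for _ in range(20_000):
    residuals = X @ beta_gd - y
    gradient = (2 / len(y)) * (X.T @ residuals)  # gradient of the MSE with respect to beta
    beta_gd -= learning_rate * gradient

print("Normal equations :", beta_direct)
print("Gradient descent :", beta_gd)             # should closely match the direct solution
```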

