12 Classification Using the Logistic Model

In this section, we dive into classification models.

12.1 Overview

The linear models we have constructed so far have proven effective at forecasting numerical outcomes, such as housing prices or the number of customers who may click on a hyperlink. These models, known as regression models, are well suited for such applications. Nonetheless, there are situations where we need to predict events or categories rather than a numerical quantity. For instance, we may want to determine the species of a flower, whether a person is likely to click on an emailed link, or whether a tree will die in a given year. For these tasks, we need a classification model.

For logistic regression, we use a “soft threshold” by choosing a logistic function, \(\theta\), that has a sigmoidal shape. The sigmoid can take on various forms, such as the following:

\[\theta\left(s\right) = \frac{e^s}{1+e^s}\]

This model produces an output with a genuine probability interpretation. The likelihood of any dataset, \(\mathcal{D} = \left(\mathbf{x_1},y_1\right), \dots, \left(\mathbf{x_N},y_N\right)\), that we wish to maximize is given by:

\[\prod\limits_{i=1}^N P\left(y_i | \mathbf{x_i}\right) = \prod\limits_{i=1}^N \theta\left(y_i \mathbf{w^T x_i}\right)\]
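As a small aside, this particular sigmoid is easy to write directly in R; the function name sigmoid below is just a convenience for this sketch (base R's plogis() computes the same quantity):

# Logistic (sigmoid) function: maps any real-valued score s into (0, 1)
sigmoid <- function(s) exp(s) / (1 + exp(s))

# It equals 0.5 at s = 0 and approaches 0 and 1 in the tails
sigmoid(c(-5, -1, 0, 1, 5))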

It is possible to derive an error measure whose minimization maximizes the above likelihood. It has a probabilistic interpretation and is called the in-sample “cross-entropy” error; it is obtained by treating the logistic regression hypothesis as if it were the target function:

\[E_{in}\left(\mathbf{w}\right) = \frac{1}{N}\sum\limits_{n=1}^N \ln\left[1 + \exp\left(-y_n \mathbf{w^T x_n}\right)\right]\]

Equivalently, written as a pointwise error measure \(e\) averaged over the examples:

\[E_{in}\left(\mathbf{w}\right) = \frac{1}{N}\sum\limits_{n=1}^N e\left[ h\left(\mathbf{x_n}\right), y_n \right]\]

While the above does not have a closed-form solution, it is a convex function, so we can find the weights that minimize this error measure using iterative techniques. These include gradient descent and its variants (such as stochastic and mini-batch gradient descent), as well as second-order and related methods such as Newton's method (which uses the Hessian) and the conjugate gradient method.
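To make the iterative step concrete, here is a minimal batch gradient descent sketch in R for the cross-entropy error above. It assumes labels coded as -1/+1 (matching the formulas), a fixed learning rate eta, and a fixed number of iterations; the function and variable names are ours, and in practice you would use glm(), introduced below.

# Batch gradient descent for the in-sample cross-entropy error (illustrative sketch)
logistic_gd <- function(X, y, eta = 0.1, iters = 5000) {
  # X: numeric matrix whose first column is all 1s (intercept); y: labels in {-1, +1}
  w <- rep(0, ncol(X))
  for (i in seq_len(iters)) {
    s <- as.vector(X %*% w)                    # scores w^T x_n
    grad <- -colMeans(y * X * plogis(-y * s))  # gradient of E_in(w)
    w <- w - eta * grad                        # step along the negative gradient
  }
  w
}

# Tiny simulated example: one predictor, labels coded as -1 / +1
set.seed(1)
x <- rnorm(100)
y <- ifelse(x + rnorm(100) > 0, 1, -1)
logistic_gd(cbind(1, x), y)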

To walk through how these work, let’s start with a new dataset that ships with R: mtcars, which contains information on different car models, including their fuel efficiency (mpg), number of cylinders (cyl), engine displacement (disp), horsepower (hp), weight (wt), and other features. To obtain more information on the dataset, you can type ?mtcars in the console.
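Because mtcars ships with R’s datasets package, we can take a quick (optional) look at the columns used in this chapter before plotting anything:

# Peek at the variables used below; am is the transmission type (0 = automatic, 1 = manual)
head(mtcars[, c("mpg", "am", "hp", "wt")])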

library(ggplot2)
library(plotly)
#> 
#> Attaching package: 'plotly'
#> The following object is masked from 'package:ggplot2':
#> 
#>     last_plot
#> The following object is masked from 'package:stats':
#> 
#>     filter
#> The following object is masked from 'package:graphics':
#> 
#>     layout
# Load mtcars dataset
data(mtcars)

# Plot transmission type (am) against fuel efficiency with a straight-line (lm) fit
p <- ggplot(mtcars, aes(mpg, am)) + 
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE) +
  theme_minimal()
ggplotly(p)
#> `geom_smooth()` using formula = 'y ~ x'

There are a few statistical reasons why a simple straight line is not appropriate for classification problems. When the data has heavy tails, meaning some observations have a very high probability (say, 95%) of belonging to one group, the linear fit becomes less effective. Similarly, when the relationship is non-linear, for example when high and low values of a predictor make group 0 more likely while middle values tend to be 1s, the linear model performs poorly. A logistic model is preferred instead, and produces a curve that looks more like this:

# Same plot, but with a flexible (loess) smoother to suggest the S-shaped fit
p <- ggplot(mtcars, aes(mpg, am)) + 
  geom_point() + 
  geom_smooth(method = "loess", se = FALSE) +
  theme_minimal()

ggplotly(p)
#> `geom_smooth()` using formula = 'y ~ x'
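The curve above uses a loess smoother simply to suggest the S-shape. If you would rather draw the actual fitted logistic curve, geom_smooth() accepts the glm method with a binomial family; this is a sketch of the same plot with that substitution:

# Same scatterplot, with the fitted logistic curve instead of a loess smoother
p <- ggplot(mtcars, aes(mpg, am)) +
  geom_point() +
  geom_smooth(method = "glm", method.args = list(family = binomial), se = FALSE) +
  theme_minimal()
ggplotly(p)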

Logistic Regression vs. Linear Regression

Logistic regression can be used for classification in much the same way as linear regression: observations with a predicted probability above 50% are assigned to class “1”, and the rest to class “0”. Comparing the two, the logistic model demonstrates slightly better classification performance. For example, using the same model formula, logistic regression achieved 81% accuracy, outperforming the linear model.
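One way to carry out such a comparison is to threshold each model’s fitted values at 0.5 and compute in-sample accuracy; this is a sketch (the object names are ours, and the exact figures depend on the threshold chosen):

# Fit both models with the same formula
lin_fit <- lm(am ~ mpg, data = mtcars)
log_fit <- glm(am ~ mpg, data = mtcars, family = binomial)

# Classify an observation as 1 when its fitted value exceeds 0.5
lin_class <- as.numeric(fitted(lin_fit) > 0.5)
log_class <- as.numeric(fitted(log_fit) > 0.5)

# In-sample accuracy of each model
mean(lin_class == mtcars$am)
mean(log_class == mtcars$am)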

While some argue that the simplicity of linear models makes them the best choice for the majority of datasets, logistic models are generally just as easy to implement and interpret. Moreover, logistic regression usually provides better classification results. As such, logistic models are commonly relied upon for classification tasks.

Creating Logistic Regression Models

The process of creating logistic regression models is quite similar to that of linear regression models. Instead of the lm() function, the glm() function is used, as logistic regression belongs to the family of algorithms known as generalized linear models. The formula and data arguments still need to be supplied, just as with lm(). The main difference when using glm() is the additional requirement to specify the family argument: setting family = binomial fits the logistic model.

We can build a logistic regression model using the following line of code:

glm(am ~ mpg, data = mtcars, family = binomial)
#> 
#> Call:  glm(formula = am ~ mpg, family = binomial, data = mtcars)
#> 
#> Coefficients:
#> (Intercept)          mpg  
#>      -6.604        0.307  
#> 
#> Degrees of Freedom: 31 Total (i.e. Null);  30 Residual
#> Null Deviance:       43.23 
#> Residual Deviance: 29.68     AIC: 33.68

The model summary can be viewed just as we did with lm():

logmod <- glm(am ~ mpg, data = mtcars, family = binomial)
summary(logmod)
#> 
#> Call:
#> glm(formula = am ~ mpg, family = binomial, data = mtcars)
#> 
#> Coefficients:
#>             Estimate Std. Error z value Pr(>|z|)   
#> (Intercept)  -6.6035     2.3514  -2.808  0.00498 **
#> mpg           0.3070     0.1148   2.673  0.00751 **
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 43.230  on 31  degrees of freedom
#> Residual deviance: 29.675  on 30  degrees of freedom
#> AIC: 33.675
#> 
#> Number of Fisher Scoring iterations: 5

The output presents various metrics that help us understand and interpret the logistic regression model (a sketch after this list shows how to extract them programmatically):

  • Deviance Residuals: These measure the difference between the observed values and the values predicted by the model; smaller values indicate a better fit. Older versions of R print their minimum, quartiles, and maximum at the top of the summary, while recent versions omit that block by default (which is why it does not appear above). They can always be retrieved with residuals(model, type = "deviance").

  • Coefficients: These are the estimated coefficients for the model predictors (intercept and mpg in this case). The Estimate column shows the estimated value of each coefficient, while the Std. Error column presents the standard error for the estimate. The z value is the test statistic (estimate divided by standard error), and Pr(>|z|) is the p-value, which indicates the significance of each predictor in the model. Significance codes are provided to help interpret the p-value (e.g., *** for p < 0.001, ** for p < 0.01, and * for p < 0.05).

  • Dispersion parameter: For the binomial family, the dispersion parameter is taken to be 1. This means the variance of the response is assumed to be fully determined by its mean (for a binary outcome, \(p\left(1-p\right)\)) rather than estimated separately, so no overdispersion is modeled.

  • Null deviance: This is the deviance of the null model (a model with no predictors). In this case, the null deviance is 43.230, calculated based on 31 degrees of freedom.

  • Residual deviance: This is the deviance of the fitted model, which measures the goodness-of-fit. The residual deviance is 29.675, calculated based on 30 degrees of freedom. A lower residual deviance compared to the null deviance indicates a better model fit.

  • AIC (Akaike Information Criterion): This is a measure of the model’s quality in terms of the trade-off between goodness-of-fit and model complexity. A lower AIC value suggests a better model.

  • Number of Fisher Scoring iterations: This is the number of iterations taken by the Fisher Scoring algorithm to converge. In this case, the algorithm converged in 5 iterations.
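As noted above, most of these quantities can also be pulled directly from the fitted model object; here is a quick sketch using the logmod object from earlier:

# Extract the main pieces of the summary from the fitted glm object
coef(summary(logmod))                        # estimates, std. errors, z values, p-values
logmod$null.deviance                         # null deviance
logmod$deviance                              # residual deviance
logmod$aic                                   # AIC
logmod$iter                                  # number of Fisher Scoring iterations
head(residuals(logmod, type = "deviance"))   # deviance residuals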

We can further refine our model by adding more variables and changing the formula, just as we did with linear models:

summary(glm(am ~ hp + wt, data = mtcars, family="binomial"))
#> 
#> Call:
#> glm(formula = am ~ hp + wt, family = "binomial", data = mtcars)
#> 
#> Coefficients:
#>             Estimate Std. Error z value Pr(>|z|)   
#> (Intercept) 18.86630    7.44356   2.535  0.01126 * 
#> hp           0.03626    0.01773   2.044  0.04091 * 
#> wt          -8.08348    3.06868  -2.634  0.00843 **
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 43.230  on 31  degrees of freedom
#> Residual deviance: 10.059  on 29  degrees of freedom
#> AIC: 16.059
#> 
#> Number of Fisher Scoring iterations: 8

We can see that the AIC dropped substantially (from about 33.7 to 16.1) with this alternative formula, which uses hp and wt in place of mpg.
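For a side-by-side check, the AIC() function can compare the two formulas directly (a quick sketch):

# Compare the AIC of the two candidate formulas
AIC(glm(am ~ mpg,     data = mtcars, family = binomial),
    glm(am ~ hp + wt, data = mtcars, family = binomial))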

Predicting with Logistic Regression Models

Logistic models created with the glm() function work similarly to linear models made with lm(), but there are some key differences to consider when making predictions. To see them, let’s assign the model to mtmod:

mtmod <- glm(am ~ hp + wt, data = mtcars, family="binomial")

Recall how we generated predictions from our models using predict(model, data). Here is what happens if we do something similar with logistic regression:

head(predict(mtmod, mtcars))
#>         Mazda RX4     Mazda RX4 Wag        Datsun 710    Hornet 4 Drive 
#>         1.6757093        -0.3855769         3.4844067        -3.1339584 
#> Hornet Sportabout           Valiant 
#>        -2.5961266        -5.2956878

This does not produce the expected probabilities. The reason is that, by default, predict() for a glm returns values on the scale of the linear predictor, that is, the log-odds rather than probabilities. To obtain probabilities, we must specify type = "response" in our predict() call:

head(predict(mtmod, mtcars, type = "response"))
#>         Mazda RX4     Mazda RX4 Wag        Datsun 710    Hornet 4 Drive 
#>       0.842335537       0.404782533       0.970240822       0.041728035 
#> Hornet Sportabout           Valiant 
#>       0.069388122       0.004988159

With this adjustment, we can generate the correct probabilities.
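The two sets of predictions are linked by the logistic function itself: applying plogis() to the link-scale values reproduces the probabilities returned with type = "response". From there, one common convention (an assumption of this sketch, not something predict() does for you) is to assign class 1 when the probability exceeds 0.5:

# Link-scale predictions are log-odds; plogis() maps them back to probabilities
probs <- predict(mtmod, mtcars, type = "response")
all.equal(plogis(predict(mtmod, mtcars)), probs)

# Classify at a 0.5 threshold and cross-tabulate against the observed transmission type
pred_class <- as.numeric(probs > 0.5)
table(predicted = pred_class, observed = mtcars$am)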