Fits a support vector machine[1] for classification or regression.
Examples
Categorical outcome
The table below shows the Accuracy as computed by a Support Vector Machine. The Overall Accuracy is the percentage of instances that are correctly categorized by the model. The accuracies of each individual class are also displayed. In this example, the model is best at correctly predicting Group C.
The Prediction-Accuracy Table gives a more complete picture of the output, showing the number of observed examples for each class that were predicted to be in each class. In this example, 33 instances of Group B are wrongly predicted to be Group A.
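The counts in such a table can be reproduced directly from vectors of observed and predicted class labels. In the sketch below, the vectors observed and predicted are hypothetical placeholders for your own data:

# Hypothetical observed and predicted class labels
observed <- factor(c("Group A", "Group B", "Group B", "Group C", "Group A", "Group C"))
predicted <- factor(c("Group A", "Group A", "Group B", "Group C", "Group A", "Group B"))

# Prediction-accuracy (confusion) table: rows are observed classes, columns are predicted classes
confusion <- table(Observed = observed, Predicted = predicted)

# Overall accuracy and the accuracy of each individual class
overall.accuracy <- sum(diag(confusion)) / sum(confusion)
class.accuracy <- diag(confusion) / rowSums(confusion)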
Numeric outcome
The table below shows the Support Vector Machine outputs for a numeric outcome variable. Accuracy displays two measures of performance: Root Mean Square Error (the square root of the average squared difference between the predicted and target outcomes) and R-squared (the fraction of the variation in the data that is explained by the model).
For a numeric outcome variable, the Prediction-Accuracy Table is generated by bucketing the predicted and target outcomes and indicating when the bucket of a predicted example does or does not match its observed bucket.
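For example, both measures can be computed directly from vectors of observed and predicted values. The vectors below are made-up, and the R-squared shown uses the usual 1 - SSE/SST definition, which may differ slightly from the figure reported in the output:

# Hypothetical observed and predicted values for a numeric outcome
observed <- c(2.1, 3.4, 5.0, 6.2, 7.9)
predicted <- c(2.3, 3.1, 4.8, 6.5, 7.5)

# Root Mean Square Error: square root of the average squared prediction error
rmse <- sqrt(mean((predicted - observed)^2))

# R-squared: fraction of the variation in the outcome explained by the model
r.squared <- 1 - sum((observed - predicted)^2) / sum((observed - mean(observed))^2)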
Options
Outcome The variable to be predicted by the predictor variables. It may be either a numeric or categorical variable.
Predictors The variable(s) used to predict the outcome.
Algorithm The machine learning algorithm. Defaults to Support Vector Machine but may be changed to other machine learning methods.
Output
- Accuracy Produces measures of the goodness of model fit, as illustrated above.
- Prediction-Accuracy Table Produces a table relating the observed and predicted outcome. Also known as a confusion matrix.
- Detail Returns the default output from svm in the e1071 package[2].
Missing data See Missing Data Options.
Variable names Displays Variable Names in the output instead of labels.
Cost Controls the extent to which the model correctly predicts the outcome for each training example. Low values of cost maximize the margin between the classes when searching for a separating hyperplane, with the trade-off that some examples may be misclassified (i.e. lie on the wrong side of the hyperplane). High values of cost result in a smaller margin of separation between the classes and fewer misclassifications. Lowering the cost increases the regularization, which implies higher bias and lower variance, and thus controls overfitting. Raising the cost increases the flexibility of the model, but extreme values reduce its ability to generalize predictions to unseen data. A typical range of cost to explore would be 0.0001 to 10000 (a sketch of how such a range might be searched appears after the option descriptions below).
Random seed Seed used to initialize the (pseudo)random number generator for the model fitting algorithm. Different seeds may lead to slightly different answers, but should normally not make a large difference.
Increase allowed output size Check this box if you encounter a warning message "The R output had size XXX MB, exceeding the 128 MB limit..." and you need to reference the output elsewhere in your document; e.g., to save predicted values to a Data Set or examine diagnostics.
Maximum allowed size for output (MB) This control only appears if Increase allowed output size is checked. Use it to set the maximum allowed size of the output in megabytes (MB). The warning referred to above about the R output size will state the minimum size you need to set in order to return the full output. Note that having very many large outputs in one document or page may slow down the performance of your document and increase load times.
Weight Where a weight has been set for the R Output, a new data set is generated via resampling, and this new data set is used in the estimation.
Filter The data is automatically filtered using any filters prior to estimating the model.
Additional options are available by editing the code.
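As a rough sketch of what such edits can look like when working directly with the svm function from the e1071 package (the iris data, argument values and cost grid below are purely illustrative and are not the code generated by this item), the first call fits a model with an explicit cost, and the second searches the range of cost values suggested above using cross-validation:

library(e1071)

# Illustrative direct fit with an explicit cost and the default radial kernel
fit <- svm(Species ~ ., data = iris, cost = 1, probability = TRUE)

# Cross-validated grid search over the suggested range of cost values
set.seed(123)
tuned <- tune(svm, Species ~ ., data = iris,
              ranges = list(cost = 10^seq(-4, 4)))
tuned$best.parameters   # cost value with the lowest cross-validated error
tuned$performances      # cross-validated error for every cost value tried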
DIAGNOSTICS
Prediction-Accuracy Table Creates a table showing the observed and predicted values, as a heatmap.
SAVE VARIABLE(S)
Predicted Values Creates a new variable containing predicted values for each case in the data.
Probabilities of Each Response Creates new variables containing predicted probabilities of each response.
Additional Notes for classification
Typically, classification algorithms estimate the probability that an observation belongs to each category and then predict the category with the highest probability. However, Support Vector Machines do not do this. As a result, the predicted category for an observation (as obtained using the Predicted Values option above) may not be the category with the highest probability (as obtained using the Probabilities of Each Response option above). The sections below describe the technical details of how the Support Vector Machine determines the predicted categories and probabilities.
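This difference can be checked directly with svm from the e1071 package. The sketch below uses the iris data purely as an illustration; the class returned by predict corresponds to the Predicted Values option, while the probabilities attribute corresponds to the Probabilities of Each Response option:

library(e1071)

# Fit with probability estimation enabled
fit <- svm(Species ~ ., data = iris, probability = TRUE)
pred <- predict(fit, newdata = iris, probability = TRUE)

# Predicted class for each case (Predicted Values)
predicted.class <- as.character(pred)

# Probabilities of each response, and the class with the highest probability
probs <- attr(pred, "probabilities")
highest.prob.class <- colnames(probs)[max.col(probs)]

# Cases, if any, where the predicted class is not the highest-probability class
which(predicted.class != highest.prob.class)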
The Support Vector Machine determines a hyperplane by solving a constrained optimization problem for a binary classification problem. The extension of Support Vector Machines to the multi-class situation (more than two classes) uses a One vs. One approach, in which a Support Vector Machine is fitted for every pairwise combination of class labels and the labels and probabilities are then estimated by aggregating the results of the pairwise classifiers.
Binary Classification
The standard Support Vector Machine (binary classifier) solves the following constrained optimization problem.
\[\begin{align} \max_{\beta_0, \beta_1, \beta_2, \ldots, \beta_p, \epsilon_1, \ldots, \epsilon_n} M \quad \text{such that} \quad \sum_{j = 1}^p \beta_j^2 &= 1\\
y_i ( \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots +\beta_p x_{ip}) &\ge M(1 - \epsilon_i)\\
\epsilon_i \ge 0, \sum_{i = 1}^n \epsilon_i &\le C\end{align}\]
where the data and parameters are defined with
- The \(y_i\) values denote the observed class labels, which are −1 or +1 depending on whether an observation belongs to the negative or positive class, respectively.
- The \(x_{ij}\) values denote the observed value for respondent \(i\) for variable \(j\).
- The decision value is defined as \( f(x) = \beta_0 + \beta_1 x_{1} + \beta_2 x_{2} + \dots + \beta_p x_{p} \), the equation of the separating hyperplane, and gives the signed distance of a point from the hyperplane.
- The \(M\) parameter is the width of the margin on either side of the hyperplane, i.e. the distance from the hyperplane to the margin boundary.
- The \(C\) parameter is a non-negative budget that allows points to violate the margin and the hyperplane boundary: no violation is allowed if \(C\) is zero, and increasing amounts of violation are allowed as \(C\) increases.
Once the hyperplane equation is estimated, the labels and the probabilities of label membership are determined from the signed distance to the hyperplane. Nonlinear boundaries are handled with the standard ‘kernel trick’, and everything below still applies.
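When fitting directly in R with e1071, for example, the kernel is selected through the kernel argument; "linear", "polynomial", "radial" (the default) and "sigmoid" are available, and the data below is again only illustrative:

library(e1071)

# Linear boundary versus a nonlinear boundary obtained via the kernel trick
fit.linear <- svm(Species ~ ., data = iris, kernel = "linear")
fit.radial <- svm(Species ~ ., data = iris, kernel = "radial", gamma = 0.5)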
Class prediction
Labels are estimated from an observation's position relative to the hyperplane. The signed distance from the hyperplane (the decision value in the algorithm) determines the estimated label: if \( f(x) \lt 0 \) the negative class is predicted, and if \( f(x) \gt 0 \) the positive class is predicted.
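In e1071 the decision values can be requested from predict. The sketch below builds a two-class data set from iris purely for illustration:

library(e1071)

# Two-class subset of iris, purely for illustration
binary <- droplevels(subset(iris, Species != "setosa"))
fit <- svm(Species ~ ., data = binary)

# Request the decision values f(x) alongside the predicted classes
pred <- predict(fit, newdata = binary, decision.values = TRUE)
f <- as.vector(attr(pred, "decision.values"))

# The sign of f(x) determines which of the two classes is predicted
table(Predicted = pred, PositiveSide = f > 0)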
Probability prediction
To estimate the probability that an observation belongs to each class, the logistic mapping is used to convert the signed distances (decision values) to probabilities. That is, each observation produces a decision value \(f(x)\). These decision values are mapped using the logistic function,
\[\begin{align} P(Y = 1|x) = \frac{1}{1 + \exp\left( A f(x) + B\right)}\end{align}\]
The \(A\) and \(B\) parameters are estimated by maximizing the log-likelihood of the data (equivalently, minimizing the negative log-likelihood). This is equivalent to fitting a logistic regression with the decision values as the single predictor.
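The e1071/libsvm implementation estimates this mapping internally (on cross-validated decision values), so the simplified sketch below, which refits the logistic mapping on in-sample decision values, is only an illustration and will not reproduce its numbers exactly:

library(e1071)

# Two-class data and decision values, as in the previous sketch
binary <- droplevels(subset(iris, Species != "setosa"))
fit <- svm(Species ~ ., data = binary)
pred <- predict(fit, newdata = binary, decision.values = TRUE)
f <- as.vector(attr(pred, "decision.values"))

# Simplified Platt-style scaling: logistic regression with f(x) as the single predictor.
# glm fits P(Y = 1|x) = 1 / (1 + exp(-(a + b f(x)))), so A = -b and B = -a in the
# notation of the formula above.
y <- as.numeric(binary$Species == levels(binary$Species)[1])   # one class treated as positive
platt <- glm(y ~ f, family = binomial)
estimated.probabilities <- predict(platt, type = "response")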
Multi-class Classification
The Support Vector Machine cannot directly handle multi-class classification data. Instead, a binary Support Vector Machine is fitted separately to each pair of class labels, and this One vs. One approach is used to determine the predictions for each class.
Class predictions
Each pairwise comparison is inspected, and the class that wins the most pairwise comparisons is taken as the predicted class.
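A minimal sketch of the voting step, using made-up pairwise winners for a single observation with three classes (A vs. B, A vs. C and B vs. C):

# Hypothetical winners of the three pairwise comparisons for one observation
pairwise.winners <- c("A", "C", "C")

# Count the votes and choose the class with the most wins
votes <- table(pairwise.winners)
predicted.class <- names(votes)[which.max(votes)]   # "C" in this example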
Probability prediction
Each pairwise model generates, for each observation, a probability of belonging to one class of the pair. These pairwise probabilities are then combined to estimate the multi-class probabilities via the second method in Wu, Lin and Weng (2003)[3].
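The sketch below illustrates the idea on made-up pairwise probabilities for a single observation with three classes. It solves the constrained least-squares problem that underlies the second method of Wu, Lin and Weng directly via its linear (KKT) system; this is a simplified illustration of the same optimization problem rather than a reproduction of the iterative algorithm used by the package:

# Hypothetical pairwise probabilities r[i, j] = P(class i | class i or class j)
# for one observation and three classes, with r[i, j] + r[j, i] = 1
r <- matrix(c( NA, 0.6, 0.8,
              0.4,  NA, 0.7,
              0.2, 0.3,  NA), nrow = 3, byrow = TRUE)
k <- nrow(r)

# Q matrix of the constrained least-squares problem:
# minimize p' Q p subject to sum(p) = 1
Q <- matrix(0, k, k)
for (i in 1:k)
  for (j in 1:k)
    Q[i, j] <- if (i == j) sum(r[-i, i]^2) else -r[j, i] * r[i, j]

# Solve the KKT system for the multi-class probabilities p
A <- rbind(cbind(Q, 1), c(rep(1, k), 0))
p <- solve(A, c(rep(0, k), 1))[1:k]
p   # sums to 1; class 1 receives the largest probability here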
Acknowledgments
- Cortes, C., Vapnik, V. Support-vector networks. Machine Learning, 20, 273–297 (1995). doi: https://doi.org/10.1007/BF00994018
- Meyer D, Dimitriadou E, Hornik K, Weingessel A, Leisch F (2023). _e1071: Misc Functions of the Department of Statistics, Probability Theory Group (Formerly: E1071), TU Wien_. R package version 1.7-13, <https://CRAN.R-project.org/package=e1071>
- Wu, T.-F., Lin, C.-J., Weng, R. (2003). Probability estimates for multi-class classification by pairwise coupling. Journal of Machine Learning Research, 5, 975–1005. doi: https://doi.org/10.5555/1005332.1016791
More information
This blog post explains the concept of support vector machines.
The process of determining the cost parameter is described here.
Next
How to Run Support Vector Machine in Displayr
Machine Learning - Support Vector Machine in Q