Fits a random forest of classification or regression trees.
Examples
Categorical outcome
The table below shows the variable importance as computed by a random forest. The MeanDecreaseAccuracy column measures the extent to which a variable improves the accuracy of the forest in predicting the classification. Higher values mean that the variable improves prediction more. In a rough sense, it can be interpreted as the increase in classification accuracy gained by including the variable in the model (a more precise statement of the meaning is complicated and requires a detailed understanding of the underlying mechanics of random forests). In this example, x1 is clearly the most important variable, followed by x2 and then x3.
The first three columns show the importance of each variable for improving accuracy, by category of the outcome variable. In this example, x1's importance as a predictor is largely due to its usefulness in predicting membership of Group C, whereas x2 primarily improves prediction of Group A, followed by Group C, and has a marginally deleterious impact on prediction of Group B.
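The idea behind an accuracy-based importance measure can be sketched as a permutation test: shuffle one predictor, leave everything else alone, and see how far accuracy falls. The sketch below is only an illustration under stated assumptions. `toy_model` is a hypothetical stand-in for a fitted forest, and the data are simulated. The randomForest package computes the measure tree by tree on out-of-bag data and averages the result, which this single-model sketch does not attempt.

```python
import random

def accuracy(predict, rows, labels):
    """Share of rows whose predicted label matches the observed label."""
    hits = sum(predict(row) == label for row, label in zip(rows, labels))
    return hits / len(rows)

def permutation_importance(predict, rows, labels, column, seed=0):
    """Drop in accuracy after shuffling the values of one predictor column."""
    rng = random.Random(seed)
    shuffled = [row[column] for row in rows]
    rng.shuffle(shuffled)
    permuted = [row[:column] + [value] + row[column + 1:]
                for row, value in zip(rows, shuffled)]
    return accuracy(predict, rows, labels) - accuracy(predict, permuted, labels)

# Hypothetical stand-in for a fitted forest: the label depends only on x1.
def toy_model(row):
    return "C" if row[0] > 0.5 else "A"

# Simulated data: two predictors (x1, x2), labels generated by the toy model.
rng = random.Random(1)
rows = [[rng.random(), rng.random()] for _ in range(200)]
labels = [toy_model(row) for row in rows]

importance_x1 = permutation_importance(toy_model, rows, labels, column=0)
importance_x2 = permutation_importance(toy_model, rows, labels, column=1)
```

Because the toy model ignores x2, shuffling x2 leaves accuracy unchanged and its importance is zero, while shuffling x1 reduces accuracy.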
Importance (MeanDecreaseGini) provides a more nuanced measure of importance, which factors in both the contribution that the variable makes to accuracy and the degree of misclassification (e.g., if a variable improves the probability of an observation being classified to a segment from 55% to 90%, this will show up in Importance (MeanDecreaseGini), but not in MeanDecreaseAccuracy). As with MeanDecreaseAccuracy, high numbers indicate that a variable is more important as a predictor.
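The 55% versus 90% example can be made concrete with the Gini impurity of a two-class node. This is a minimal sketch that assumes the percentages are read as the share of the majority class in a node. The function names are illustrative and are not part of the randomForest package.

```python
def gini(proportions):
    """Gini impurity of a node, given its class proportions (summing to 1)."""
    return 1.0 - sum(p * p for p in proportions)

def majority_class(proportions):
    """Index of the class that would be predicted for the node."""
    return max(range(len(proportions)), key=proportions.__getitem__)

before = [0.55, 0.45]
after = [0.90, 0.10]

# Impurity falls from about 0.495 to about 0.18 ...
gini_decrease = gini(before) - gini(after)

# ... but the predicted class is the same, so accuracy alone shows no change.
same_prediction = majority_class(before) == majority_class(after)
```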
Numeric outcome
The table below shows the random forest outputs for a numeric outcome variable. The first column can be interpreted as indicating the extent to which each variable explains the variance in the dependent variable. The second column can be interpreted as showing the extent to which each variable reduces uncertainty in the predictions of the model. As with the categorical outcome above, these are only rough "translations" of the true meaning of these metrics. It is not clear which metric is better for judging importance.
Outcome variables which are numeric but only have two non-missing unique values will be treated as categorical.
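The rule above can be sketched as a small check. This is only an illustration: the function name is hypothetical, missing values are assumed to be represented as `None` or NaN, and "only two" is read as exactly two distinct values.

```python
def treated_as_categorical(values):
    """True if a numeric outcome has exactly two distinct non-missing values."""
    # v == v is False for NaN, so this drops both None and NaN.
    non_missing = {v for v in values if v is not None and v == v}
    return len(non_missing) == 2

binary_outcome = treated_as_categorical([0, 1, 1, None, 0])
continuous_outcome = treated_as_categorical([1.5, 2.0, 3.7])
```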
Options
Outcome The variable to be predicted by the predictors. It may be numeric, in which case a forest of regression trees is estimated, or categorical, in which case a forest of classification trees is estimated.
Predictors The variable(s) to predict the outcome.
Algorithm The machine learning algorithm. Defaults to Random Forest but may be changed to other machine learning methods.
Output
- Importance Produces importance tables, as illustrated above.
- Detail This returns the default output from randomForest in the randomForest package. It includes a confusion matrix for classification trees, and the percentage of variance explained for regression trees. Note that the confusion matrix reported here is for predictions on "out-of-bag" data only, i.e., observations that were held out when fitting each tree.
- Prediction-Accuracy Table Produces a table relating the observed and predicted outcome, also known as a confusion matrix. Note that this table includes results for in-sample predictions, not just the out-of-bag samples used by the Detail option.
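The distinction between out-of-bag and in-sample predictions comes from how each tree in a random forest gets its data: rows are drawn with replacement, and the rows that are never drawn are out-of-bag for that tree. The sketch below illustrates that split and the tabulation behind a confusion matrix. The function names and the small observed/predicted vectors are made up for illustration and are not output from the software.

```python
import random
from collections import Counter

def bootstrap_split(n_rows, rng):
    """Draw n_rows indices with replacement; undrawn indices are out-of-bag."""
    in_bag = [rng.randrange(n_rows) for _ in range(n_rows)]
    out_of_bag = sorted(set(range(n_rows)) - set(in_bag))
    return in_bag, out_of_bag

def confusion_matrix(observed, predicted):
    """Count how often each (observed, predicted) pair occurs."""
    return Counter(zip(observed, predicted))

rng = random.Random(42)
in_bag, out_of_bag = bootstrap_split(10, rng)

# Made-up observed and predicted categories for five cases.
observed = ["A", "A", "B", "B", "C"]
predicted = ["A", "B", "B", "B", "C"]
table = confusion_matrix(observed, predicted)
```

An out-of-bag confusion matrix uses, for each case, only the trees for which that case was out-of-bag. An in-sample table uses predictions from the whole forest, including trees that saw the case during fitting.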
Missing data See Missing Data Options.
Variable names Displays Variable Names in the output instead of labels.
Sort by importance Sort the rows by importance (the last column in the table).
Random seed Seed used to initialize the (pseudo)random number generator for the model fitting algorithm. Different seeds may lead to slightly different answers, but should normally not make a large difference.
Increase allowed output size Check this box if you encounter a warning message "The R output had size XXX MB, exceeding the 128 MB limit..." and you need to reference the output elsewhere in your document; e.g., to save predicted values to a Data Set or examine diagnostics.
Maximum allowed size for output (MB). This control only appears if Increase allowed output size is checked. Use it to set the maximum allowed size for the output in megabytes. The warning referred to above about the R output size will state the minimum size you need to increase to in order to return the full output. Note that having many large outputs in one document or page may slow down the performance of your document and increase load times.
Weight. Where a weight has been set for the R Output, a new data set is generated via resampling, and this new data set is used in the estimation. This causes the resulting measures of prediction accuracy (R-squared and out-of-bag estimates) to be overly optimistic. The unweighted model should be used when evaluating prediction accuracy.
Filter The data is automatically filtered using any filters that have been applied, prior to estimating the model.
Additional options are available by editing the code.
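The resampling described under Weight can be sketched as drawing cases with probability proportional to their weight. This is an assumption-laden illustration rather than the software's actual procedure: the function name is hypothetical, and the sketch assumes the resampled data set has the same number of rows as the original.

```python
import random

def resample_by_weight(rows, weights, seed=0):
    """Draw len(rows) cases with replacement, proportional to their weights."""
    rng = random.Random(seed)
    return rng.choices(rows, weights=weights, k=len(rows))

# Six cases identified by index; the last case has zero weight.
rows = list(range(6))
weights = [5, 1, 1, 1, 1, 0]
resampled = resample_by_weight(rows, weights)

# Six draws from five eligible cases must repeat at least one case.
has_duplicates = len(set(resampled)) < len(resampled)
```

One plausible reading of the optimism noted under Weight is that these duplicated cases can land both in a tree's fitting sample and in its out-of-bag sample, so the out-of-bag data is no longer truly held out.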
DIAGNOSTICS
Prediction-Accuracy Table Creates a table showing the observed and predicted values, as a heatmap.
SAVE VARIABLE(S)
Predicted Values Creates a new variable containing predicted values for each case in the data.
Probabilities of Each Response Creates new variables containing predicted probabilities of each response.
Acknowledgments
Uses the randomForest algorithm from the randomForest package.
Breiman, L. (2001), Random Forests, Machine Learning 45(1), 5-32.
More information
This blog post explains random forests.
This post describes the data fitting process.
The calculation of variable importance is described here.