t-Distributed Stochastic Neighbor Embedding (t-SNE) is a technique for compressing high-dimensional data to a small number of dimensions. See this post for further explanation. It attempts to preserve local structure by maintaining the distribution of the neighbors of each point. This can be contrasted with Principal Components Analysis (PCA), which preserves large-scale relationships.
t-SNE is used primarily to compress data to 2 dimensions for visualization. It does not produce a predictive model such that unseen data can be mapped to 2-D. Nor is it typically used to produce numerical output that is used as input to predictive models.
The algorithm tends to compress sparse regions and separate dense regions to produce a balanced and visually appealing output. If a t-SNE visualization shows a clear separation between categories, it is likely that predictive machine learning models are capable of accurately predicting those categories. Changing the seed (via the R code) may lead to visually different charts and different nearest-neighbor accuracies.
Example
Output Example:
The projection of 14 variables onto 2 dimensions, with a grouping category.
Input Example:
Either a list of variables or a distance matrix between points can be given as input. In the former case, a further variable may be specified to classify the output into groups and the probability that each point has the same class as its nearest neighbor is calculated. This blog post describes an application of t-SNE to visualize a distance matrix.
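The nearest-neighbor figure described above can be illustrated with a small sketch. This is not the code used by the feature: it assumes the calculation is done on the 2-D output coordinates with Euclidean distance, and the names `nearest_neighbor_accuracy`, `embedding` and `groups` are hypothetical.

```python
import math

def nearest_neighbor_accuracy(points, labels):
    """Fraction of points whose nearest other point shares their label."""
    matches = 0
    for i, p in enumerate(points):
        best_j, best_d = None, math.inf
        for j, q in enumerate(points):
            if i == j:
                continue  # a point is never its own neighbor
            d = math.dist(p, q)
            if d < best_d:
                best_j, best_d = j, d
        if labels[best_j] == labels[i]:
            matches += 1
    return matches / len(points)

# Hypothetical 2-D coordinates and grouping categories (illustration only).
embedding = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9), (0.3, 0.1), (2.6, 2.5)]
groups = ["a", "a", "b", "b", "a", "b"]

accuracy = nearest_neighbor_accuracy(embedding, groups)
print(f"nearest-neighbor accuracy: {accuracy:.2f}")
```

In this made-up data, five of the six points sit next to a point of the same group, and the isolated sixth point does not, so the accuracy is five sixths.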
Dimension Reduction - Plot - Goodness of Fit can be used to assess the accuracy of the fit.
Options
Algorithm Either t-SNE, PCA, MDS - Metric or MDS - Non-metric.
The input data can be provided via one of three options:
- Variables The variables or a question containing variables that you would like to analyze. Cases with missing data are ignored.
- Distance matrix Select an existing distance matrix. This should be a symmetric matrix of distances, such as the output of Correlation - Distances.
- Paste or type distance matrix Opens up a blank spreadsheet into which tabular data can be manually entered or pasted.
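As a rough illustration of what a suitable distance matrix looks like, the sketch below builds one from a few hypothetical cases and checks it. Only symmetry is stated as a requirement above. The square shape, zero diagonal and non-negativity checks are assumptions about typical distance matrices, and the function names are made up for this example.

```python
import math

def distance_matrix(rows):
    """Pairwise Euclidean distances between rows, as a list of lists."""
    return [[math.dist(a, b) for b in rows] for a in rows]

def is_valid_distance_matrix(m, tol=1e-9):
    """Check the matrix is square, symmetric, non-negative, with a zero diagonal."""
    n = len(m)
    if any(len(row) != n for row in m):
        return False
    for i in range(n):
        if abs(m[i][i]) > tol:  # assumed: distance from a case to itself is 0
            return False
        for j in range(n):
            if m[i][j] < 0 or abs(m[i][j] - m[j][i]) > tol:
                return False
    return True

# Three hypothetical cases measured on two variables.
cases = [(1.0, 2.0), (4.0, 6.0), (1.0, 6.0)]
d = distance_matrix(cases)
print(is_valid_distance_matrix(d))
```

A matrix pasted in by hand can be checked the same way before it is used as input.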
Create binary variables from unordered categories If selected, unordered categorical Variables with N categories are converted into N-1 binary indicator variables. Otherwise, each such variable is converted to a single numeric variable with integers representing the categories (as happens for ordered categories). This option is only available if Variables are provided.
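The two encodings can be sketched as follows. This is an illustration, not the feature's implementation: which category is dropped as the reference level, and the integer codes starting at 1, are assumptions, and the function names are hypothetical.

```python
def to_indicators(values):
    """Encode N categories as N-1 binary columns.

    Assumption: the first category in sorted order is the reference
    level and gets no column of its own.
    """
    categories = sorted(set(values))
    kept = categories[1:]
    rows = [[1 if v == c else 0 for c in kept] for v in values]
    return kept, rows

def to_integer_codes(values):
    """Encode categories as a single numeric variable (assumed codes 1..N)."""
    categories = sorted(set(values))
    index = {c: i + 1 for i, c in enumerate(categories)}
    return [index[v] for v in values]

colors = ["red", "green", "blue", "green"]
columns, indicators = to_indicators(colors)
codes = to_integer_codes(colors)
print(columns, indicators, codes)
```

The integer coding implies an ordering and spacing between categories that unordered data does not have, which is the reason to prefer indicator variables for such data.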
Group variable A variable to categorize the output. If numeric, the data are shaded from light (lowest values) to dark (highest). If categorical, data points are colored according to their category. This option is only available if Variables are provided.
Normalize variables For Variables input, whether to normalize the data.
- For t-SNE and MDS each variable is standardized to the range [0, 1].
- For PCA the correlation matrix is used rather than the covariance matrix.
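For the t-SNE and MDS case, standardizing a variable to the range [0, 1] is commonly done with min-max scaling. The sketch below assumes that approach. The handling of a constant variable is also an assumption, and `min_max_scale` is a hypothetical name.

```python
def min_max_scale(column):
    """Rescale a numeric variable linearly so its minimum is 0 and its maximum is 1."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # Assumption: a constant variable carries no information, so map it to 0.
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]

ages = [20, 30, 40, 60]                   # hypothetical variable in years
incomes = [30000, 45000, 120000, 60000]   # hypothetical variable in dollars

scaled_ages = min_max_scale(ages)
scaled_incomes = min_max_scale(incomes)
print(scaled_ages)
print(scaled_incomes)
```

Without this step, a variable with a large numeric range, such as income, would dominate the distances between points.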
Perplexity A parameter used by the t-SNE algorithm and related to the number of nearest neighbors considered when placing each data point. The typical useful range is from 5 to 50.
- Low values imply that the immediate local structure is most important.
- High values increase the impact of more distant neighbors and global structure.
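To make the link between perplexity and the number of neighbors concrete, the sketch below follows the definition in the t-SNE paper: each point has a Gaussian-weighted distribution over the other points, and perplexity is 2 raised to the entropy of that distribution. The distances and bandwidths (`sigma`) are invented for illustration. In practice the algorithm searches for the bandwidth that gives the requested perplexity, rather than taking it as an input.

```python
import math

def conditional_probabilities(sq_distances, sigma):
    """Gaussian similarities from one point to the others, normalized to sum to 1."""
    weights = [math.exp(-d / (2.0 * sigma ** 2)) for d in sq_distances]
    total = sum(weights)
    return [w / total for w in weights]

def perplexity(probabilities):
    """2 ** (Shannon entropy in bits): the effective number of neighbors."""
    entropy = -sum(p * math.log2(p) for p in probabilities if p > 0)
    return 2.0 ** entropy

# Hypothetical squared distances from one point to five others.
sq_distances = [1.0, 4.0, 9.0, 16.0, 25.0]

narrow = perplexity(conditional_probabilities(sq_distances, sigma=0.5))
wide = perplexity(conditional_probabilities(sq_distances, sigma=10.0))
print(f"narrow bandwidth: {narrow:.2f}, wide bandwidth: {wide:.2f}")
```

With the narrow bandwidth almost all of the weight falls on the closest point, so the perplexity is close to 1. With the wide bandwidth the five points are weighted almost equally, so it approaches 5.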
Additional Properties
When using this feature you can obtain additional information that is stored by the R code which produces the output.
- To do so, in Q select Create > R Output; in Displayr select Calculation > Custom Code.
- In the R CODE, paste: item = YourReferenceName
- Replace YourReferenceName with the reference name of your item. Find this in the Report tree or by selecting the item and then going to Properties > General > Name from the object inspector on the right.
- Below the first line of code, you can paste in snippets from below or type in str(item) to see a list of available information.
For a more in-depth discussion on extracting information from objects in R, check out our blog post here.
Acknowledgements
van der Maaten, L. and Hinton, G. Visualizing Data using t-SNE. Journal of Machine Learning Research 9 (2008) 2579-2605.