Automatically Combine Categories - By Pattern (CHAID) - Any Categories – Technical Documentation

This tool automatically combines categories in one variable based on how similar they are in distribution when compared to another variable. It is used when you have a variable with a large number of categories and you want to combine the categories of that variable by considering the patterns present when compared to a second variable. For instance, you may have a variable that contains categories for the occupation of respondents in a survey, and you may want to group those occupations based on those with similar income distributions or age distributions.

The CHAID algorithm is used to obtain the solution and only considers adjacent categories as possible combined categories. The algorithm also supports identifying some categories to be handled differently. The ones identified as unordered are free to combine with any other category.

Example

Two examples are considered below. The first demonstrates the algorithm with only adjacent categories allowed to combine. The second adds two categories identified as unordered, which are free to combine with any other category. Both examples use an input variable that measures Family income and use the Education level of the respondent to deduce the pattern for the combined categories in CHAID.

Example 1 (Purely ordered categories)

Consider the following table which shows categories of Family incomes in the columns and the Education level obtained in the rows:

We may want to create new categories by combining those Family Income categories that have the most similar pattern of Education level. Applying the Automatically Combine Categories - By Pattern (CHAID) - Adjacent Categories feature results in combined categories as shown in the new table:

We see that the Family Incomes are combined into two compound categories. The first compound category combines all incomes below $50,000, which tend to have a higher proportion of families with school or college level education and a lower proportion of tertiary education. The second compound category combines all incomes of at least $50,000, which have a higher proportion of tertiary level education at a university.

Example 2 (Ordered categories with two unordered categories)

This example uses the input variable called Family income with unordered categories. It is very similar to the Family income variable above but has two extra categories, "Don't know" and "I refuse to answer this question" which a respondent may wish to choose if they don't know their income precisely or don't want to communicate it. These two categories are not suitable for the income scale and it is possible to allow them the freedom to combine with any other category in the Family income scale.

The Automatically Combine Categories - By Pattern (CHAID) - Adjacent Categories feature can handle this situation. Categories that are identified as unordered are allowed to combine with any other category to form the final compound categories resulting in the new table below:

We see that the resulting combined categories are similar to those in Example 1: the solution reduces to two compound categories that split the incomes at a boundary of $50,000, based on the higher or lower proportion of individuals who complete tertiary education. The addition here is that the unordered categories "Don't know" and "I refuse to answer this question" are free to combine with any other category. In this case, the respondents who didn't know or refused to respond seem to have a similar pattern to those with a lower proportion of tertiary education and lower incomes.

Usage

In Displayr:

From the Data Sources tree, select the variable whose categories you want to combine and click + > Combine Categories > By Pattern (CHAID) > Any Categories.
From the following menu, select the variable you want to use to determine which categories are combined. The categories that are most similar when compared against this second variable are combined.
Click OK. The new variable will appear immediately below the variable in the Data Sources tree whose categories we are combining (main_category).

In Q:

In the Variables and Questions tab, select the variable whose categories you wish to be combined, and the variable which should be compared against. The variables should be from Pick One or Pick One - Multi questions.
Select Automate > Browse Online Library > Automatically Combine Categories > By Pattern (CHAID) > Adjacent Categories.
To change how the categories are combined:
1. Select the new variable or question in the Variables and Questions tab.
2. Right-click and select Edit R Variable.
3. Choose the desired options in the Inputs section on the right.
4. Click Update R Variable.

Options

Variable This is the variable whose categories you wish to combine.

Combine by Choose the approach you wish to use for combining categories. If you do not wish to use CHAID and want to use an alternative approach, you can change this to By Value if combining numeric data, or By Geography if your data contains geographic locations (zip codes, states, cities, etc).

Based on This is the variable that you want to compare with the first Variable above. The categories of the variable selected in Variable will be combined based on the similarity of their distributions of this Based on variable.

Weight Select a weight variable here if you wish to apply the weighted version of CHAID. This will combine categories based on the weighted distributions.

CHAID ALGORITHM SETTINGS

Combine The option to choose which pairs of categories are permissible to combine. The options are:

Any categories: It is permissible for each category to combine with any other category.
Adjacent categories: It is only permissible for each category to combine with adjacent categories, unless one or more categories are specified in the Unordered categories control. In that case, the categories specified in that control are permitted to combine with any other category and are not restricted to adjacent categories.
Adjacent categories unless missing value code: The same behavior as Adjacent categories, except that any categories coded with a value of NaN in the Value Attributes will always be considered as unordered.
Using variable set structure: The permissible combine options are determined by the Variable type of the input variable. If the input variable is Categorical then Any categories are permissible to combine. If the input variable is Ordered Categorical then Adjacent categories are permissible to combine.

Unordered categories This control only appears if the Combine option is Adjacent categories or Adjacent categories unless missing value code. This control gives the ability to specify if particular categories should be considered as not ordered and allowed to combine with any other category. Other categories not entered here can only be combined with adjacent ones. This is appropriate if the input variable contains ordinal values on a scale but some options are not ordered. For example, if the categories are on a scale with options, 'Strongly agree', 'Agree', 'Neutral', 'Disagree', 'Strongly disagree', "Don't know" and 'I refuse to answer' then "Don't know" and 'I refuse to answer' can be identified and then permitted to combine with any of the other categories. The two category labels would need to be typed into this control and separated with a ';' or ','. Note that if Combine is set to Adjacent categories unless missing value code, then any category that is coded with a value of NaN in the Value Attributes will always be considered as unordered.
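
The rules for which pairs may combine can be sketched in a few lines of Python. This is only an illustration of the behavior described above, not the code used by the feature: the function names `parse_labels` and `permissible_pairs` are hypothetical, and the sketch assumes that adjacency is judged among the ordered categories after the unordered ones are set aside.

```python
import re

def parse_labels(text):
    """Split the text typed into the control on ';' or ',' (hypothetical helper)."""
    return [label.strip() for label in re.split(r"[;,]", text) if label.strip()]

def permissible_pairs(categories, unordered=()):
    """Return the set of category pairs that are allowed to combine."""
    unordered = set(unordered)
    # Assumption: adjacency is evaluated on the ordered categories only.
    ordered = [c for c in categories if c not in unordered]
    pairs = {frozenset(pair) for pair in zip(ordered, ordered[1:])}
    # Unordered categories may combine with every other category.
    for free in unordered:
        for other in categories:
            if other != free:
                pairs.add(frozenset((free, other)))
    return pairs

scale = ["Strongly agree", "Agree", "Neutral", "Disagree", "Strongly disagree",
         "Don't know", "I refuse to answer"]
free_labels = parse_labels("Don't know; I refuse to answer")
pairs = permissible_pairs(scale, free_labels)
```

With this scale there are 4 adjacent pairs among the five ordered options and 11 pairs involving at least one unordered option, so `pairs` holds 15 permissible combinations.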

Use Exhaustive CHAID This controls whether the Exhaustive CHAID algorithm will be used. Exhaustive CHAID will take longer than a standard CHAID because it searches a larger set of category combinations, but it tends to produce a better result. The default value is Usually, which means that Exhaustive CHAID will always be used unless your Variable has so many categories that the exhaustive algorithm is likely to be really slow. If you do have a large number of categories and exhaustive CHAID is not applied, you will receive a message in the top right of your screen. In this case, you can ensure the exhaustive algorithm is applied by changing this setting to Yes.

Minimum category size The CHAID algorithm will not produce new categories that have fewer than this many cases. It will always ensure smaller categories are combined with their most similar category regardless of the statistical significance of that particular combination.

Alpha level to combine categories This is the significance level for combining categories. Each potential pair of categories to be combined is associated with a p-value, and two categories will not be combined if their p-value is lower than this level. This setting is not used in the exhaustive CHAID algorithm (so it will only have an effect if you change Use Exhaustive CHAID to No).

Alpha level to validate final combined categories This is the significance level to assess the final CHAID solution. If the p-value for the final solution is larger than this value, all of the categories will be combined into a single category because there is insufficient variation between the categories at this level. If you obtain a single category from this feature then you should consider using a different selection in the Based on menu which has a greater level of variation with the main Variable, or you can increase the value of this setting.

Multiple Comparison Adjustment This setting determines whether or not a Bonferroni correction is made when evaluating the final combined category solution. That is, it affects the p-value used to check against the Alpha level to validate final combined categories. This correction will tend to be more conservative when using the exhaustive CHAID algorithm as it conducts a much greater number of statistical tests.
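
The interaction between the validation alpha level and the multiple comparison adjustment can be sketched as follows. This is a minimal illustration, assuming the Bonferroni-type adjustment is applied by multiplying the p-value by an adjustment count and capping the result at 1; the function name `validate_solution` and the numbers used are hypothetical.

```python
def validate_solution(p_value, alpha, multiplier=1):
    """Return the adjusted p-value and whether the combined solution is kept.

    Assumption: the Bonferroni-type adjustment multiplies the p-value by
    `multiplier` and caps it at 1. A multiplier of 1 means no adjustment.
    """
    adjusted = min(1.0, p_value * multiplier)
    # A solution is kept only if its (adjusted) p-value does not exceed alpha;
    # otherwise all categories collapse into a single category.
    return adjusted, adjusted <= alpha

kept = validate_solution(0.004, alpha=0.05, multiplier=7)
collapsed = validate_solution(0.010, alpha=0.05, multiplier=7)
```

In this sketch the same raw p-value can pass without the adjustment and fail with it, which is why a larger adjustment count makes the check more conservative.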

Technical details

CHAID stands for Chi-square automatic interaction detection. It is an algorithm which has traditionally been used to create decision trees with multi-way splits of categorical data. It employs repeated application of Chi-squared tests to evaluate how similar pairs of categories are when compared to a second variable. See Kass, G. V. (1980)^[1] and Biggs, D., Ville, B., and Suen, E. (1991)^[2] for more details.
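
As an illustration of the pairwise comparison, the sketch below computes an unweighted Pearson Chi-squared statistic for two categories, each described by its counts across the levels of the second variable. The function name `chi_square` and the counts are hypothetical, and this is not the implementation used by the feature.

```python
def chi_square(rows):
    """Pearson Chi-squared statistic and degrees of freedom for a table of counts.

    Each row holds one category's counts across the levels of the second variable.
    """
    col_totals = [sum(col) for col in zip(*rows)]
    grand_total = sum(col_totals)
    stat = 0.0
    for row in rows:
        row_total = sum(row)
        for observed, col_total in zip(row, col_totals):
            expected = row_total * col_total / grand_total
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
    df = (len(rows) - 1) * (len(col_totals) - 1)
    return stat, df

# Two categories with the same distribution (1:2), at different sample sizes.
similar = chi_square([[10, 20], [20, 40]])
# Two categories with opposite distributions.
different = chi_square([[10, 20], [20, 10]])
```

Categories with identical distributions give a statistic of zero, so they are the strongest candidates to be combined; the larger the statistic, the stronger the evidence that the two categories differ.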

The standard CHAID algorithm uses a fixed level of significance to determine if a merge should be conducted, and whether or not to stop merging categories.

The exhaustive CHAID algorithm generates a set of potential solutions by always merging the two least significantly different categories until only two categories remain. It then chooses from all of those solutions by identifying the solution with the smallest p-value.
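
The exhaustive search described above can be sketched as follows. This is a simplified illustration under several assumptions: it uses the unweighted Pearson test, allows any pair of categories to combine, reads "least significantly different" as the pair with the largest pairwise p-value, and ignores minimum category size, tie-breaking, and the Bonferroni adjustment. All function names and counts are hypothetical.

```python
import math
from itertools import combinations

def chi_square(rows):
    """Pearson Chi-squared statistic and degrees of freedom for a table of counts."""
    col_totals = [sum(col) for col in zip(*rows)]
    grand = sum(col_totals)
    stat = 0.0
    for row in rows:
        row_total = sum(row)
        for observed, col_total in zip(row, col_totals):
            expected = row_total * col_total / grand
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
    return stat, (len(rows) - 1) * (len(col_totals) - 1)

def chi2_sf(x, df):
    """Upper-tail Chi-squared probability via the regularized incomplete gamma function."""
    if x <= 0:
        return 1.0
    a, z = df / 2.0, x / 2.0
    scale = math.exp(-z + a * math.log(z) - math.lgamma(a))
    if z < a + 1.0:
        # Series expansion of the lower tail, then take the complement.
        term = total = 1.0 / a
        n = a
        for _ in range(1000):
            n += 1.0
            term *= z / n
            total += term
            if term < total * 1e-15:
                break
        return 1.0 - total * scale
    # Continued fraction (modified Lentz method) for the upper tail.
    tiny = 1e-300
    b = z + 1.0 - a
    c, d = 1.0 / tiny, 1.0 / b
    h = d
    for i in range(1, 1000):
        an = -i * (i - a)
        b += 2.0
        d = an * d + b
        d = tiny if abs(d) < tiny else d
        c = b + an / c
        c = tiny if abs(c) < tiny else c
        d = 1.0 / d
        h *= d * c
        if abs(d * c - 1.0) < 1e-15:
            break
    return h * scale

def exhaustive_merge(groups):
    """groups maps a tuple of category labels to that group's counts."""
    groups = dict(groups)
    solutions = []
    while True:
        # Record the overall p-value of the current set of groups.
        solutions.append((chi2_sf(*chi_square(list(groups.values()))), sorted(groups)))
        if len(groups) <= 2:
            break
        # Merge the pair with the largest pairwise p-value.
        a, b = max(combinations(groups, 2),
                   key=lambda pair: chi2_sf(*chi_square([groups[pair[0]], groups[pair[1]]])))
        groups[a + b] = [x + y for x, y in zip(groups.pop(a), groups.pop(b))]
    # Choose the recorded solution with the smallest overall p-value.
    return min(solutions, key=lambda solution: solution[0])

# Hypothetical counts: [no tertiary education, tertiary education] per income band.
counts = {("<25k",): [40, 10], ("25-50k",): [38, 12],
          ("50-100k",): [15, 35], (">100k",): [12, 38]}
p_value, best_groups = exhaustive_merge(counts)
```

With these made-up counts the two lower bands merge first, then the two higher bands, and the two-group solution has the smallest overall p-value of the three recorded solutions.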

When weights are used, the second-order survey weight-adjusted test of independence of Rao and Scott (1984)^[3] is used instead of the standard Pearson Chi-squared test.

Bonferroni adjustments

If the Multiple comparison adjustment option is selected, then the significance test used to assess the final state of the combined categories from the CHAID algorithm is adjusted by the number of pairwise tests conducted while combining categories. This adjustment is a Bonferroni-type adjustment that is computed differently for the standard CHAID algorithm and the exhaustive CHAID algorithm. The standard algorithm terminates when no pairwise test has a p-value above the significance level, whereas the exhaustive algorithm keeps combining categories until only two remain. From the set of states generated in the exhaustive algorithm, the state with the smallest p-value is considered the optimal configuration and becomes the final combined category solution.

The sections below give details of the Bonferroni adjustments used for the standard CHAID algorithm and the exhaustive CHAID algorithm. Each section gives the adjustment for the Combine option allowing Any categories to combine and for only Adjacent categories. The latter also has a more refined adjustment when some Unordered categories are specified in the Adjacent categories option in Combine: more possibilities are explored, and therefore a larger Bonferroni adjustment is required, when some categories are allowed to combine with any other category. Define the initial number of categories in the variable as $c$ and the final number of reduced categories from the combined solution as $r$. The Bonferroni adjustment is then denoted $B(c,r)$ for each of the possible scenarios below (assuming that $c$ and $r$ are integers with $1 \le r \le c$).

Standard algorithm

The standard algorithm follows the Bonferroni adjustment approach used in Kass (1980)^[1]. Here the adjustment considers the number of possible arrangements from reducing $c$ categories into $r$ categories. In the case of all categories being allowed to combine with any other category (Any categories selected in the Combine control), this count is given by a standard result on set partitions. Stirling numbers of the second kind^[4] give the number of ways to partition a set of $c$ categories into $r$ non-empty subsets, written $\left\{ \begin{smallmatrix} c\\ r \end{smallmatrix} \right\}$, and this number takes the role of the Bonferroni adjustment value for the case of Any categories. In particular:

\[\begin{align} B(c,r) = \left\{ \begin{matrix} c\\ r \end{matrix} \right\} = \frac{1}{r!}\sum_{i = 0}^r (-1)^i\binom{r}{i}(r - i)^c, \qquad \left\{ \begin{matrix} c\\ c \end{matrix} \right\} = 1, \quad \text{ and for } c \ge 1, \quad \left\{ \begin{matrix} c\\ 1 \end{matrix} \right\} = 1.\end{align}\]
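
The closed form above is straightforward to evaluate. The sketch below is a direct transcription of the formula into Python; the function name `stirling2` is an illustrative choice.

```python
from math import comb, factorial

def stirling2(c, r):
    """Stirling number of the second kind: partitions of c items into r non-empty subsets."""
    total = sum((-1) ** i * comb(r, i) * (r - i) ** c for i in range(r + 1))
    # The alternating sum is always an exact multiple of r!.
    return total // factorial(r)

# Reducing 4 categories to 2 when any categories may combine: 7 arrangements.
adjustment = stirling2(4, 2)
```

For example, four categories can be reduced to two groups in 7 ways: 4 ways with one category on its own, plus 3 ways of splitting into two pairs.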

For the case of purely adjacent categories being permissible to combine (that is, Adjacent categories is selected in Combine, no Unordered categories are specified, and no missing values have been coded), the adjustment is given by:

\[\begin{align} B(c,r) = \binom{c - 1}{r - 1}.\end{align}\]
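
One way to see where this binomial coefficient comes from: grouping $c$ ordered categories into $r$ runs of adjacent categories amounts to choosing $r - 1$ cut points among the $c - 1$ gaps between neighbors. The sketch below enumerates those groupings and checks the count; the function name and the income labels are hypothetical.

```python
from itertools import combinations
from math import comb

def adjacent_partitions(categories, r):
    """List every way to split ordered categories into r runs of adjacent categories."""
    c = len(categories)
    partitions = []
    # Choose r - 1 cut points among the c - 1 gaps between neighboring categories.
    for cuts in combinations(range(1, c), r - 1):
        bounds = (0, *cuts, c)
        partitions.append([categories[a:b] for a, b in zip(bounds, bounds[1:])])
    return partitions

bands = ["<25k", "25-50k", "50-100k", ">100k"]
two_groups = adjacent_partitions(bands, 2)
```

Here `two_groups` contains 3 groupings, matching $\binom{3}{1} = 3$.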

In the case when there are Unordered categories specified when an Adjacent categories combine option is selected (and/or missing values coded in the case of Adjacent categories unless missing value code), then the Bonferroni adjustment becomes a combination of the two above. Assuming there are $u$ unordered categories (including categories coded as missing), with $1\le u \lt c$ then the Bonferroni adjustment is:

\[\begin{align} B(c, r, u) = \sum_{s = 0}^u \binom{c - u - 1}{r - s - 1}\sum_{i = 0}^{u-s}\binom{u}{i}\left\{ \begin{matrix} u - i\\ s \end{matrix} \right\} (r - s)^i\end{align}\]
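
The mixed formula can be evaluated directly. The sketch below transcribes it into Python and checks it against small cases that can be counted by hand; the function names are illustrative, and binomial coefficients with an out-of-range lower index are treated as zero, which is an assumption about how the sum is meant to be read.

```python
from math import comb, factorial

def stirling2(n, k):
    """Stirling number of the second kind."""
    return sum((-1) ** i * comb(k, i) * (k - i) ** n for i in range(k + 1)) // factorial(k)

def binom(n, k):
    """Binomial coefficient, taken as zero when k is out of range (assumption)."""
    return comb(n, k) if 0 <= k <= n else 0

def bonferroni_mixed(c, r, u):
    """B(c, r, u): c categories, r final groups, u unordered categories."""
    total = 0
    for s in range(u + 1):  # s = number of groups made up of unordered categories only
        # Ways to split the c - u ordered categories into r - s adjacent runs.
        ordered_ways = binom(c - u - 1, r - s - 1)
        # i unordered categories join the ordered runs; the rest form s groups of their own.
        inner = sum(comb(u, i) * stirling2(u - i, s) * (r - s) ** i
                    for i in range(u - s + 1))
        total += ordered_ways * inner
    return total
```

For instance, with two ordered categories A and B and one unordered category X, `bonferroni_mixed(3, 2, 1)` returns 3, corresponding to the groupings {A, B}{X}, {A, X}{B} and {B, X}{A}. Setting `u = 0` gives back the purely adjacent value $\binom{c - 1}{r - 1}$.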

Exhaustive algorithm

The exhaustive algorithm follows the Bonferroni adjustment approach used in Biggs, D., Ville, B., and Suen, E. (1991)^[2]. Here the adjustment considers the number of tests conducted as the algorithm traverses from the full set of $c$ categories down to two categories.

In the case of all categories being allowed to combine with any other category (Any categories selected in the Combine control), the Bonferroni adjustment is:

\[\begin{align} B(c,r) = \sum_{k = 2}^c \binom{k}{2}\end{align}\]

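For the case of purely adjacent categories being permissible to combine (Adjacent categories selected in Combine, with no Unordered categories specified and no missing values coded), the adjustment is:

\[\begin{align} B(c,r) = \binom{c}{2}.\end{align}\]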

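In the case when there are $u$ Unordered categories specified with an Adjacent categories combine option (including categories coded as missing), with $1\le u \lt c$, the Bonferroni adjustment is:

\[\begin{align} B(c, r, u) = \binom{c - u}{2} + \sum_{i = 0}^{u - 1}\frac{c - i}{2} \left( 2 c - u - 1 - i\right).\end{align}\]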
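
The first two exhaustive adjustments in this section can be read as counts of pairwise tests, as sketched below. The reading is an assumption rather than something stated in the references: with $k$ categories remaining there are $\binom{k}{2}$ candidate pairs when any categories may combine, and $k - 1$ candidate pairs when only neighbors may combine. The function names are hypothetical, and the formula involving $u$ is not covered by this sketch.

```python
from math import comb

def exhaustive_any(c):
    """Tests counted from c categories down to 2 when any pair may combine."""
    return sum(comb(k, 2) for k in range(2, c + 1))

def exhaustive_adjacent(c):
    """Tests counted from c categories down to 2 when only neighbors may combine."""
    return sum(k - 1 for k in range(2, c + 1))

any_count = exhaustive_any(5)            # 1 + 3 + 6 + 10
adjacent_count = exhaustive_adjacent(5)  # 1 + 2 + 3 + 4
```

Under this reading the adjacent count sums to $\binom{c}{2}$, and the any-categories sum has the closed form $\binom{c + 1}{3}$.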

Differences to SPSS CHAID

When the exhaustive CHAID algorithm evaluates very small p-values, the SPSS algorithm can in some cases stop searching for solutions earlier than the one available here. As a result, the algorithm we use here will tend to find solutions that are more significant than those produced in SPSS. The result is that the algorithm used here will combine more categories. This situation tends to arise when there is a very high level of significance between the two variables before the algorithm begins.

In some cases, the exhaustive CHAID algorithm can encounter two possible category merges which have equal p-values, which we refer to as a tie. This algorithm will attempt to break the tie by re-examining these merges within the larger set of categories at that stage of the algorithm (i.e. given the current set of merges that have happened so far). SPSS has not documented the mechanism that their algorithm uses to break ties. Such ties are rare in practice as they require identical test statistics.

References

1. Kass, G. V. (1980). An Exploratory Technique for Investigating Large Quantities of Categorical Data. Applied Statistics, 29, 2, 119-127. doi: https://doi.org/10.2307/2986296
2. Biggs, D., Ville, B., and Suen, E. (1991). A Method of Choosing Multiway Partitions for Classification and Decision Trees. Journal of Applied Statistics, 18, 1, 49-62. doi: https://doi.org/10.1080/02664769100000005
3. Rao, J. N. K. and Scott, A. J. (1984). On Chi-Squared Tests for Multiway Contingency Tables with Cell Proportions Estimated from Survey Data. The Annals of Statistics, 12, 1, 46-60. doi: https://doi.org/10.1214/aos/1176346391
4. Stirling Numbers of the second kind (2022). Retrieved June 9, 2022, from https://en.wikipedia.org/wiki/Stirling_numbers_of_the_second_kind

How to apply this QScript

Start typing the name of the QScript into the Search features and data box in the top right of the Q window.
Click on the QScript when it appears in the QScripts and Rules section of the search results.

Select Automate > Browse Online Library.
Select this QScript from the list.

Customizing the QScript

This QScript is written in JavaScript and can be customized by copying and modifying the JavaScript.

Start typing the name of the QScript into the Search features and data box in the top right of the Q window.
Hover your mouse over the QScript when it appears in the QScripts and Rules section of the search results.
Press Edit a Copy (bottom-left corner of the preview).
Modify the JavaScript (see QScripts for more detail on this).
Either:
- Run the QScript, by pressing the blue triangle button.
- Save the QScript and run it at a later time, using Automate > Run QScript (Macro) from File.
