Statistical Testing with Overlapping Data

By default, the Advanced settings for significance testing in both Q and Displayr remove overlapping respondents from the groups being compared (see Overlaps in Advanced Statistical Testing Assumptions). This is because the software, by default, compares each sub-group against the total/average/NET, which itself contains that sub-group. This article explains in detail why overlaps are removed in testing and what other options are available for including those respondents.

Why the software removes overlapping data

Using this table as an example:

At an intuitive level, this example can be thought of as testing the 50% for Blue amongst the Young versus the NET of 40%. Although this intuitive understanding is essentially correct in terms of how to interpret the data, it is not, at a technical level, a valid way to describe the test. To see the problem with this, it is useful to work through the maths.

Amongst the Young, 50 out of 100 people preferred Blue. Amongst the total sample of 300, 120 (40%) preferred Blue. However, the 300 respondents in the total sample include the 100 Young people, so if we compare the 100 with the 300, we would be double-counting (or, to use the more formal statistical language, we would violate the assumption of independent samples).

The standard solution to this problem is to subtract the data of the Young from the total and then perform the test. So, as 120 people in total preferred Blue and 50 of these were Young, this means that 70 (35%) of the 200 people who are not Young prefer Blue. When at its default settings, the software performs the test comparing the 50% of the 100 Young with the 35% of the 200 people who are not Young.
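The arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function name is hypothetical, and the pooled two-proportion z-test is an assumption made for the example, since the test the software actually applies depends on its statistical testing settings.

```python
import math


def two_proportion_z_test(x1, n1, x2, n2):
    """Pooled two-sample z-test for proportions (an illustrative choice of test).

    Returns the z statistic and the two-sided p-value.
    """
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Figures from the example: the Young, and the total sample that contains them.
young_blue, young_n = 50, 100
total_blue, total_n = 120, 300

# Remove the overlap: subtract the Young from the total.
rest_blue = total_blue - young_blue  # 70 people who are not Young prefer Blue
rest_n = total_n - young_n           # 200 people who are not Young

z, p_value = two_proportion_z_test(young_blue, young_n, rest_blue, rest_n)
print(f"Not Young: {rest_blue / rest_n:.0%} of {rest_n}")
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

With these figures the two groups being compared (100 Young and 200 not Young) share no respondents, so the independence assumption holds.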

Note that although the test does not explicitly compare the 100 with the 300, at a conceptual level we can interpret it as if it did. This is because the only way that Young people can differ from the total is if they differ from the people who are not Young.
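This equivalence can be written out algebraically. The symbols are introduced here for illustration and do not appear in the software: p_Y and n_Y are the proportion and sample size for the Young, p_R and n_R are those for the remaining respondents, and p_T and N are those for the total.

```latex
\begin{aligned}
N   &= n_Y + n_R, \qquad p_T = \frac{n_Y\,p_Y + n_R\,p_R}{N} \\[4pt]
p_Y - p_T &= \frac{N\,p_Y - n_Y\,p_Y - n_R\,p_R}{N}
           = \frac{n_R}{N}\,\bigl(p_Y - p_R\bigr) \\[4pt]
0.50 - 0.40 &= \frac{200}{300}\,\bigl(0.50 - 0.35\bigr) = 0.10
\end{aligned}
```

The difference between the Young and the total is always a fixed fraction of the difference between the Young and everyone else, so one is zero exactly when the other is.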

Dependent/overlapping/related samples tests

A number of statistical tests exist for testing samples where the groups overlap. They are variously known as dependent, overlapping, and related sample tests. These tests have not been developed for the problem described above. Rather, they have been developed for the following two problems:

Where all the columns in a table are permitted to overlap, for example, where the columns represent websites a person has visited in the last month. A person may have visited multiple websites, meaning that they appear in multiple columns. Note that with such data the software does not, by default, use dependent/overlapping/related samples tests. Instead, it performs the test amongst respondents who do not overlap.
Where the table is constructed from repeated measurements (e.g., a Number - Grid or Pick Any - Grid question).

The Overlaps section in Advanced Statistical Testing Assumptions explains the other available options for how to treat overlapping data in testing.

In theory, it is possible to apply a dependent/overlapping/related samples test to the data in the example above. However, if the test is conducted in a valid way, it will give the same result as the strategy described above, because the underlying maths of the test would simply remove the effect of the Young being in the total sample.
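A minimal sketch of this equivalence, using the figures from the example, is shown below. Both function names are hypothetical, and the pooled z-test with a covariance-corrected variance is an assumption chosen for illustration, not necessarily the test the software offers. Under the null hypothesis of a common proportion p, the covariance between the sub-group proportion and the total proportion equals the variance of the total proportion, so the variance of their difference is p(1 - p)(1/n_part - 1/n_whole).

```python
import math


def part_vs_whole_z(x_part, n_part, x_whole, n_whole):
    """z statistic for a sub-group versus the total that contains it.

    Assumption: pooled null variance, corrected for the covariance created
    by the sub-group being part of the total.
    """
    p_part = x_part / n_part
    p_whole = x_whole / n_whole
    var = p_whole * (1 - p_whole) * (1 / n_part - 1 / n_whole)
    return (p_part - p_whole) / math.sqrt(var)


def part_vs_rest_z(x_part, n_part, x_whole, n_whole):
    """z statistic for a sub-group versus everyone else (overlap removed)."""
    x_rest, n_rest = x_whole - x_part, n_whole - n_part
    p_part = x_part / n_part
    p_rest = x_rest / n_rest
    pooled = x_whole / n_whole
    var = pooled * (1 - pooled) * (1 / n_part + 1 / n_rest)
    return (p_part - p_rest) / math.sqrt(var)


# Young: 50 of 100 prefer Blue; total sample: 120 of 300 prefer Blue.
z_overlap = part_vs_whole_z(50, 100, 120, 300)
z_rest = part_vs_rest_z(50, 100, 120, 300)

print(f"Young vs total (overlap accounted for): z = {z_overlap:.4f}")
print(f"Young vs not Young (overlap removed):   z = {z_rest:.4f}")
```

The two statistics agree, which is why removing the overlapping respondents loses nothing compared with a test that models the overlap directly.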

Column Comparisons

To compare against the main NET column when using column comparisons, see How to Include the Main NET Column in Column Comparisons.
