For a more general definition of Sample Size or Base, see Sample Size in the Data Story Guide.
How the base is computed
The base is computed separately for every cell in a table.
Categorical data
When a categorical question is used in computing a statistic, a case is included in the base if:
- Missing Data is not selected in the Value Attributes for the Value of that case.
- The case is not filtered out.
- The weight is greater than 0, or there is no weight applied (i.e., each case is weighted as 1).
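The three conditions above can be sketched in Python. This is only an illustration of the rule, not the software's implementation: the `Case` class, its field names, and the `in_categorical_base` function are hypothetical, and the Missing Data setting is represented here as a plain set of values.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass
class Case:
    """Hypothetical record for one case (respondent) in a categorical question."""
    value: str
    filtered_out: bool = False      # True if a filter excludes the case
    weight: Optional[float] = None  # None means no weight is applied


def in_categorical_base(case: Case, missing_values: Set[str]) -> bool:
    """Return True if the case counts toward the base for a categorical question."""
    if case.value in missing_values:   # Missing Data is selected for this value
        return False
    if case.filtered_out:              # the case is excluded by a filter
        return False
    # With no weight applied, each case is weighted as 1.
    effective_weight = 1.0 if case.weight is None else case.weight
    return effective_weight > 0


# Assumed example: "Don't know" has Missing Data selected.
missing = {"Don't know"}
cases = [
    Case("Yes"),
    Case("No", weight=1.5),
    Case("Don't know"),             # excluded: missing value
    Case("Yes", filtered_out=True), # excluded: filtered out
    Case("No", weight=0.0),         # excluded: zero weight
]
base = sum(in_categorical_base(c, missing) for c in cases)
```

In this example only the first two cases satisfy all three conditions.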
Numeric data
When a numeric question is used in computing a statistic, a case is included in the base if:
- Missing Data is not selected in the Value Attributes for the Value of that case.
- The Value of that case is not NaN.
- The case is not filtered out.
- The weight is greater than 0, or there is no weight applied (i.e., each case is weighted as 1).
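The numeric rule adds one check to the categorical one: the value must not be NaN. A minimal Python sketch follows, assuming each case is described by a value, a Missing Data flag, a filter flag, and an optional weight. The function name and its parameters are hypothetical and are not part of the software.

```python
import math
from typing import Optional


def in_numeric_base(
    value: float,
    is_missing: bool = False,        # Missing Data selected for this value
    filtered_out: bool = False,      # a filter excludes the case
    weight: Optional[float] = None,  # None means no weight is applied
) -> bool:
    """Return True if the case counts toward the base for a numeric question."""
    if is_missing:
        return False
    if math.isnan(value):
        return False
    if filtered_out:
        return False
    # With no weight applied, each case is weighted as 1.
    effective_weight = 1.0 if weight is None else weight
    return effective_weight > 0


# Assumed example data: (value, is_missing, filtered_out, weight)
rows = [
    (34.0, False, False, None),
    (float("nan"), False, False, None),  # excluded: NaN
    (99.0, True, False, None),           # excluded: Missing Data selected
    (51.0, False, True, None),           # excluded: filtered out
    (27.0, False, False, 0.8),
]
numeric_base = sum(in_numeric_base(*row) for row in rows)
```

Here the base is made up of the first and last rows only.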
Difference from other programs
The software's concept of the base is essentially the same as that used in all modern statistics programs. However, some traditional crosstabbing programs and algorithms instead define the base as the total number of cases (or the sum of the weights of the cases) that have data in at least one cell of the table. The two definitions give different answers in the following situations:
- Where there is multiple response data (Binary-Multi and Binary-Grid) and some cases have no selections (e.g., people who have not chosen any option in a multiple response question), the percentages that the software shows will be systematically lower. This is usually evident from the NET showing a value other than 100%.
- Where different cells have different sample sizes (e.g., if randomization was used, so that people saw only a subset of the answers), many results will differ, because traditional crosstabbing programs are not designed for such data and so produce incorrect results.
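The first situation can be illustrated with a small worked example in Python. The data is invented, and "has data in at least one cell" is interpreted here as "selected at least one option", which is an assumption about how a traditional crosstabbing program would treat these cases.

```python
# Five hypothetical respondents answering a two-option multiple response question.
# 1 = option selected, 0 = not selected. The last respondent selected nothing.
responses = [
    {"A": 1, "B": 0},
    {"A": 1, "B": 1},
    {"A": 0, "B": 1},
    {"A": 0, "B": 1},
    {"A": 0, "B": 0},
]


def percent(count: int, base: int) -> float:
    """Express count as a percentage of base."""
    return 100.0 * count / base


# Per-cell base: every case with non-missing data is counted.
per_cell_base = len(responses)

# Traditional base (assumed interpretation): only cases with at least one selection.
traditional_base = sum(1 for r in responses if any(r.values()))

count_a = sum(r["A"] for r in responses)
net_count = sum(1 for r in responses if any(r.values()))  # selected any option

pct_a_per_cell = percent(count_a, per_cell_base)        # 2 of 5 cases
pct_a_traditional = percent(count_a, traditional_base)  # 2 of 4 cases
net_per_cell = percent(net_count, per_cell_base)        # 4 of 5 cases
net_traditional = percent(net_count, traditional_base)  # 4 of 4 cases
```

With the per-cell base, option A is 40% and the NET is 80%. With the traditional base, option A is 50% and the NET is 100%. The respondent with no selections accounts for the whole difference.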
The software can be made to replicate the results of other programs by rebasing tables. However, caution should be exercised before doing this, as the approach used in the software is, in general, more valid. The traditional crosstabbing programs' computations are preferable only when the NaNs and/or Missing Data selections in the data are incorrect. Even then, it is generally safer to correct these problems in either the software or the raw data file than to use approaches based on the assumption that the data file is incorrect.