The Data Preparation Agent feature uses Displayr AI (which needs to be enabled) to review a Data Set and run a series of data checks (listed below) to clean and prepare the data for analysis. It will modify existing variables that need attention, create new variables to analyze, and review data quality. It is integrated into Research Agent, and can also be run alone by right-clicking on a data set name and selecting Data Preparation Agent. See How to Automatically Check and Prepare Your Data with Data Preparation Agent for more information about running the feature.
Options
The following table lists the series of checks run in the Data Preparation Agent. The order and wording match the prompt; however, a more general explanation is also provided.
| Data Preparation Step | Description |
| Automatic splitting of grids, Nominal - Multi and Ordinal - Multi based on variable names. | Split categorical grid variable sets (Nominal-Multi and Ordinal-Multi) that are incorrectly grouped together. |
| Automatic combining of variable sets (e.g., into grids). | Find and combine individual variables that are similar and should be combined into a grid variable set. |
| Change a variable set's structure (e.g., Nominal to Ordinal). | Ensure categorical variable sets are correctly marked as ordered (Ordinal) or not ordered (Nominal). |
| Convert dates saved as text or categorical variables to Date/Time variable sets. | Automatically identify dates that are stored in your data set as Text, Nominal, or Ordinal variables and convert them to a Date/Time structure. |
| Identify nominal/ordinal/binary - multi variables to treat as waves in statistical tests. | Identify tracking period variables that should be treated as waves and automatically toggle the Treat as wave in statistical tests setting on. This setting will be automatically triggered if you try to convert a nominal or ordinal variable to Date/Time and it can’t be converted. |
| Create a better name for the variable sets (e.g., How old are you → Age). | Create better names for variable sets (e.g., How old are you → Age). |
| Create a better name for the variable sets, including a number if it exists (e.g., How old are you → Q3. Age). | In addition to creating better names, add question number if it exists to variable set name (e.g., How old are you → Q3. Age). |
| Where an Ordinal-Multi contains don't know or NA options, create a Binary - Multi variable set of these categories. | For ordered categorical variable sets, create variables to show how many Don't Know or Not Applicable observations were excluded. |
| Set don't know and missing data categories to missing for Ordinal - Multis. | For ordered categorical variable sets, change value attributes to exclude Don't Knows and Not Applicable responses from analyses. |
| Reverse scales of Ordinal - Multi variable sets, so that the highest category has the highest value. | For ordered categorical variable sets, ensure those with a rating scale use ascending values from worst to best (i.e., Strongly disagree = 1 to Strongly agree = 5). |
| Create Binary - Multis of the top 2 categories of Ordinal - Multis. | For ordered categorical variable sets with a rating scale, create a variable set for the top 2 boxes of scaled variable sets. |
| Create Numeric - Multi from Ordinal - Multi. | For ordered categorical variable sets with a rating scale, create a numeric (Numeric - Multi) version of the variable set. |
| Create NPS variable from 11-point likelihood to recommend data. | Identify Net Promoter Score (NPS) variable sets, specifically, and create a new numeric NPS version that is recoded to produce NPS scores when used in tables. |
| Identify and set the 'Unique identifier'; if there isn't one, identify ID variables and flag duplicates. | Set the Unique Identifier for the data set (required to manually edit and/or delete raw data). If there isn't a suitable variable found, it will flag duplicates for other ID variables for your review. |
| Flag poor quality text responses for text variables. | Flag text variable responses that are blank or don't have at least one letter in the response. |
| Flag straight-lining behavior across multi-item rating questions, based on the number of items each person answered. | For ordered categorical variable sets with a rating scale, create a variable that flags straight-lining behavior, based on the number of items each person answered. You can then use this flag to filter and review and/or delete those cases. |
| Flag respondents who have a high proportion of low information responses | Flag respondents who gave a high proportion of low information responses (e.g., Don't know, Refused, None, N/A). Respondents who selected low information responses more than 25% of the time receive one flag. Those who selected low information responses more than 35% of the time receive two flags. Those who selected low information responses more than 45% of the time receive three flags. These flags help to identify lazy respondents and bots in your data set, and are taken into consideration when the agent deletes poor quality cases from your data set. |
| Flag respondents who completed the survey in an abnormally short time | Respondents who completed the survey in less than half the median duration receive one flag. Those who completed in less than a third of the median receive two flags, and those who completed in less than a quarter of the median receive three flags. These flags help to identify lazy respondents and bots in your data set, and are taken into consideration when the agent deletes poor quality cases from your data set. |
| Delete rows of data where 30% or more of data quality flags are failed. | Delete rows of data where 30% or more of data quality flags (straight-lining or poor text) failed. If you wish to restore deleted cases, see Identifying and Restoring Deleted Cases. |
| Set empty age categories at the top and bottom of the range to missing. | Set Missing Values to Exclude from analyses for the top and bottom (i.e., first and last) age categories that do not contain any data. Examples are an "Under 18" category in a study targeting adults, or an "Over 70" category in a study targeting working adults. This only applies to nominal or ordinal age variables where the categories represent ordered age bands. |
| Hide variable sets that contain no responses | Hide variable sets that contain no responses (i.e., all of the data is missing, NaN). Note that this does not delete variable sets. They can be unhidden by right-clicking and selecting Unhide. |
| Identify themes in unstructured text data and classify text data into the themes. | Use text categorization to create binary - multi variables from text variables. If you want to check and/or edit the categorization, see How to Refine and Edit Text Themes After Classification. |
| Identify and flag weight variable sets | Evaluate the variable's name and label, looking for the word "weight" or an abbreviation of it, and the raw data values, looking for numeric values that are equal to or greater than 0 and contain decimals. If the variable passes those checks, the Data Preparation Agent will automatically tag the variable as Usable as a weight. See Data Preparation Agent - FAQs for more information about how weight variables are identified. See How to Use a Weight Created Outside of Displayr for steps to apply weights to outputs. |
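
The flagging thresholds described in the table (poor quality text, low information responses, and abnormally short completion times) can be illustrated with a short Python sketch. This is not Displayr's implementation. The function names are hypothetical, and the handling of values that sit exactly on a threshold is an assumption based on the wording "more than" and "less than" in the descriptions.

```python
from statistics import median


def is_poor_text(response):
    """Return True if a text response is blank or contains no letters."""
    if response is None:
        return True
    return not any(ch.isalpha() for ch in response)


def low_information_flags(proportion):
    """Count flags for the share of low information answers (0.0 to 1.0).

    More than 25% gives one flag, more than 35% two, more than 45% three.
    Treating the thresholds as strict inequalities is an assumption.
    """
    return sum(proportion > threshold for threshold in (0.25, 0.35, 0.45))


def speeding_flags(duration, median_duration):
    """Count flags for a completion time relative to the median duration.

    Less than half the median gives one flag, less than a third two,
    less than a quarter three.
    """
    return sum(duration < median_duration / divisor for divisor in (2, 3, 4))


# Hypothetical completion times in seconds for five respondents.
durations = [600, 620, 580, 250, 140]
typical = median(durations)

for seconds in durations:
    print(seconds, speeding_flags(seconds, typical))
```

In this sketch each check returns a count of flags per respondent, mirroring the one, two, or three flags described above. How the agent combines these counts with the other quality flags when deleting cases is not specified here, so that step is left out.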
Next
How to Automatically Check and Prepare Your Data with Data Preparation Agent