Text Analysis - Automatic Categorization - Entity Extraction – Technical Documentation

Automatically performs named entity recognition on a text variable containing unstructured text. Named entities belong to pre-defined categories that include real-world objects such as people, locations, and organizations; temporal and numeric expressions such as dates, money, and other numeric measures; and abstract concepts such as religion, ideology, and criminal charge.

Example

The output below shows the extracted entities for a text variable asking people where they would like to travel for their next holiday outside of the U.S. Here is a snapshot of how the output was set up:

The final result is below. Each row in the example column can be expanded to show all the variants of each entity type extracted.

Usage

To create an Unstructured Text output in Q, go to How to Automatically Extract Entities and Sentiment from Text.

In Displayr, go to How to Automatically Classify Unstructured Text Data Into an Entity List.

Options

Text variable The text variable on which to run the entity extraction.

Minimum number of cases to save The minimum number of observed named entities of a single type that must be identified in the text variable before the extracted entities of that type can be saved using Create > Text Analysis > Advanced > Save Variable(s) > First Category or Create > Text Analysis > Advanced > Save Variable(s) > Categories. For example, if the minimum number of cases to save is set to 3, there must be at least 3 observed entities of a type before the variable for that type can be saved and added to the dataset through the menu. If 5 Person entities were extracted in the R output, the Person entities can be saved and added to the dataset. However, if only 2 Location entities were extracted, the Location variable would not be created.
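The threshold described above can be sketched in a few lines of Python. This is only an illustration of the rule, not the code the feature runs: the function name saveable_entity_types and the list-of-pairs layout of the extracted entities are assumptions made for the example.

```python
from collections import Counter

def saveable_entity_types(extracted, min_cases=3):
    """Return the entity types that meet the minimum-count threshold.

    `extracted` is a list of (entity_type, entity_text) pairs. This layout
    is a hypothetical stand-in, not the format of the R output.
    """
    counts = Counter(entity_type for entity_type, _ in extracted)
    return sorted(t for t, n in counts.items() if n >= min_cases)

# Five Person entities and two Location entities, as in the example above.
extracted = [
    ("Person", "Ann"), ("Person", "Bob"), ("Person", "Cy"),
    ("Person", "Dee"), ("Person", "Eli"),
    ("Location", "Paris"), ("Location", "Rome"),
]
print(saveable_entity_types(extracted, min_cases=3))  # ['Person']
```

With the threshold at 3, only Person qualifies; lowering min_cases to 2 would also allow the Location variable to be saved.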

Add named entities to extraction Add custom named entities to be extracted from the text via a data entry form. The data entry form requires a named entity type to be specified in a cell in the first row. In the column below each named entity type, list the words (named entities) to include in the entity extraction, with one word in each cell. See Technical Information for constraints on adding named entities.

Remove named entities from extraction Similar to above, named entities can be excluded from the extraction by populating a similar data entry form. The first row states the entity type to remove from. In the column below each entity type, list the words or entities to remove, with one word in each cell.

An example of the expected input is given below for the Remove named entities from extraction control to remove five entities. In the example, the following are specified: one Person entity named "Wall St", one Country entity named "fold", and three Number entities: "007cigarjoe", "@tzard000" and "1sonny12".
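As a rough illustration of the column layout described above, the following Python sketch reads a form whose first row holds entity types and whose columns hold one word per cell, then drops the listed entities. The function names, the grid representation, and the exact-match comparison are all assumptions for the example; the actual matching rules used by the feature are not documented here.

```python
def parse_entity_form(grid):
    """Turn a column-oriented form into {entity_type: [words]}.

    `grid` is a list of rows: row 0 holds the entity types and each column
    below holds one word per cell. Empty cells are skipped.
    """
    header, *rows = grid
    form = {}
    for col, entity_type in enumerate(header):
        if not entity_type:
            continue
        form[entity_type] = [
            row[col] for row in rows if col < len(row) and row[col]
        ]
    return form

def remove_entities(extracted, removals):
    """Drop (entity_type, text) pairs listed in the removal form.

    Matching is exact and case-sensitive here, which is an assumption.
    """
    blocked = {(t, w) for t, words in removals.items() for w in words}
    return [pair for pair in extracted if pair not in blocked]

# The five removals from the example above, laid out as the form expects.
grid = [
    ["Person", "Country", "Number"],
    ["Wall St", "fold", "007cigarjoe"],
    ["", "", "@tzard000"],
    ["", "", "1sonny12"],
]
removals = parse_entity_form(grid)
kept = remove_entities(
    [("Person", "Wall St"), ("Person", "Ann"), ("Country", "fold")],
    removals,
)
print(kept)  # [('Person', 'Ann')]
```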

Save Variable(s)

Maximum number of unique entity levels to save Maximum number of unique levels in each entity type to save from the text. If the number of unique entities for one entity type exceeds this limit, the most popular entities are chosen, with ties broken alphabetically.
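The selection rule above (most popular first, ties broken alphabetically) can be sketched as a sort on two keys. The function name top_entity_levels and the sample data are hypothetical, and "popular" is assumed here to mean the number of mentions.

```python
from collections import Counter

def top_entity_levels(mentions, max_levels):
    """Keep the `max_levels` most frequent entities for one entity type.

    Ties in frequency are broken alphabetically by sorting on
    (descending count, ascending name).
    """
    counts = Counter(mentions)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [entity for entity, _ in ranked[:max_levels]]

mentions = ["Italy", "France", "Japan", "France", "Italy", "Chile"]
print(top_entity_levels(mentions, 3))  # ['France', 'Italy', 'Chile']
```

France and Italy tie on two mentions, so France comes first; Chile and Japan tie on one mention, and Chile takes the last slot.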

Maximum number of entities per case to save Maximum number of entity mentions per case to save when using Save Variable(s) Categories. If this control is set to a value n, but a case contains m > n entity mentions, then only the first n entity mentions are saved.
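A minimal sketch of the per-case limit, assuming each case is represented as a list of entity mentions in the order they appear in the text (the function name and data layout are illustrative only):

```python
def truncate_mentions(cases, max_per_case):
    """Keep only the first `max_per_case` entity mentions in each case."""
    return [mentions[:max_per_case] for mentions in cases]

cases = [["Paris", "Rome", "Tokyo"], ["Lima"], []]
print(truncate_mentions(cases, 2))  # [['Paris', 'Rome'], ['Lima'], []]
```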

Technical Information

Possible named entity types are Person, Location, Organization, Misc, Money, Number, Ordinal, Percent, Date, Time, Duration, Set, Email, URL, City, State or province, Country, Nationality, Religion, Title, Ideology, Criminal charge, and Cause of death. The entity extraction uses the Stanford Core Natural Language Processing (CoreNLP) Named Entity Recognition (NER) annotator, which combines machine learning sequence models, rule-based extraction such as regular expressions (regex), and other classifiers. These run in a sequence, and once an entity has been identified by one method, it cannot be changed or modified by later steps in the sequence. This has consequences for adding named entities to the extraction in the user settings. If an entity has already been identified by Stanford CoreNLP, it cannot be assigned to another entity type via the user settings for adding named entities to the extraction. However, this constraint doesn't apply to the removal of named entities in the user settings. Any entity identified by the CoreNLP NER annotator can be removed from the extraction.
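The "earlier steps win" behavior described above can be illustrated with a toy pipeline. This sketch does not call CoreNLP: builtin_ner and custom_additions are hypothetical stand-ins for the built-in annotator and the user-supplied additions, and the token-to-type dictionaries are an assumed representation.

```python
def run_pipeline(tokens, annotators):
    """Apply annotators in order; a token keeps the first type assigned."""
    labels = {}
    for annotate in annotators:
        for token, entity_type in annotate(tokens).items():
            labels.setdefault(token, entity_type)  # earlier steps win
    return labels

def builtin_ner(tokens):
    """Hypothetical stand-in for the built-in NER step."""
    known = {"Paris": "City"}
    return {t: known[t] for t in tokens if t in known}

def custom_additions(tokens):
    """Hypothetical stand-in for user-added named entities."""
    added = {"Paris": "Person", "Bali": "Location"}
    return {t: added[t] for t in tokens if t in added}

def remove_labels(labels, removals):
    """Removal runs on the final labels, so any identified entity can go."""
    return {tok: typ for tok, typ in labels.items()
            if (typ, tok) not in removals}

labels = run_pipeline(["Paris", "Bali"], [builtin_ner, custom_additions])
print(labels)  # {'Paris': 'City', 'Bali': 'Location'}
print(remove_labels(labels, {("City", "Paris")}))  # {'Bali': 'Location'}
```

The custom addition cannot relabel "Paris" as a Person because the earlier step already tagged it as a City, but it can add "Bali", which no earlier step identified. Removal is not subject to that ordering.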

How to apply this QScript

This feature is also available as a QScript called Text Analysis - Automatic Categorization - Extract entities.

Either:
- Start typing the name of the QScript into the Search features and data box in the top right of the Q window, then click on the QScript when it appears in the QScripts and Rules section of the search results.
- Select Automate > Browse Online Library, then select this QScript from the list.

Customizing the QScript

This QScript is written in JavaScript and can be customized by copying and modifying the JavaScript.

Customizing QScripts in Q4.11 and more recent versions

Start typing the name of the QScript into the Search features and data box in the top right of the Q window.
Hover your mouse over the QScript when it appears in the QScripts and Rules section of the search results.
Press Edit a Copy (bottom-left corner of the preview).
Modify the JavaScript (see QScripts for more detail on this).
Either:
- Run the QScript, by pressing the blue triangle button.
- Save the QScript and run it at a later time, using Automate > Run QScript (Macro) from File.

Customizing QScripts in older versions

Contact support@q-researchsoftware.com to obtain a copy of the JavaScript code.
Create a new text file, giving it a file extension of .QScript. See here for more information about how to do this.
Modify the JavaScript (see QScripts for more detail on this).
Run the file using Automate > Run QScript (Macro) from File.

More Information

Automatic Coding of Unstructured Text Data
At Last, Machine Learning Can Accurately Categorize Text Data

References

How to Automatically Code Unstructured Data Into an Entity List in Displayr

How to Automatically Extract Entities and Sentiment from Text in Q
