How extraction works
When extracting key concepts and ideas from your responses, Text Analytics relies on linguistics-based text analysis. This approach offers the speed and cost effectiveness of statistics-based systems but delivers far higher accuracy while requiring far less human intervention. Linguistics-based text analysis is grounded in the field of study known as natural language processing, also called computational linguistics.
Understanding how the extraction process works can help you make key decisions when fine-tuning your linguistic resources (libraries, types, synonyms, and more). Steps in the extraction process include:
- Converting source data to a standard format
- Identifying candidate terms
- Identifying equivalence classes and integrating synonyms
- Assigning a type
- Indexing
- Matching patterns and extracting events
Step 1. Converting source data to a standard format
In this first step, the data you import is converted to a uniform format that can be used for further analysis. This conversion is performed internally and does not change your original data.
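The conversion itself happens inside the product, but the general idea can be illustrated in a few lines. In this minimal sketch, the function name and the specific normalization choices (UTF-8 decoding, Unicode NFC normalization, line-ending cleanup) are assumptions for illustration, not the product's actual implementation:

```python
import unicodedata

def to_standard_format(raw_bytes: bytes, encoding: str = "utf-8") -> str:
    """Decode source data and normalize it into one uniform representation."""
    text = raw_bytes.decode(encoding, errors="replace")
    # Normalize Unicode so visually identical characters compare as equal.
    text = unicodedata.normalize("NFC", text)
    # Unify line endings; the original source data itself is left untouched.
    return text.replace("\r\n", "\n")

# 'e' + combining acute composes to 'é' after normalization.
print(to_standard_format("re\u0301sume\u0301 survey".encode("utf-8")))
```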
Step 2. Identifying candidate terms
It is important to understand the role of linguistic resources in the identification of candidate terms during linguistic extraction. Linguistic resources are used every time an extraction is run. They exist in the form of templates, libraries, and compiled resources. Libraries include lists of words, relationships, and other information used to specify or tune the extraction. The compiled resources cannot be viewed or edited. However, the remaining resources (templates) can be edited in the Template Editor or, if you're in a Text Analytics Workbench session, in the Resource editor.
Compiled resources are core, internal components of the extraction engine. These resources include a general dictionary containing a list of base forms with a part-of-speech code (noun, verb, adjective, adverb, participle, coordinator, determiner, or preposition). The resources also include reserved, built-in types used to assign many extracted terms to the following types: <Location>, <Organization>, or <Person>.
In addition to those compiled resources, several libraries are delivered with the product and can be used to complement the types and concept definitions in the compiled resources, as well as to offer other types and synonyms. These libraries—and any custom ones you create—are made up of several dictionaries. These include type dictionaries, substitution dictionaries (synonyms and optional elements), and exclude dictionaries.
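To make the relationship between these dictionaries concrete, here is a rough model of a library in code. The class and field names are illustrative assumptions; the product's internal structures are not exposed:

```python
from dataclasses import dataclass, field

@dataclass
class Library:
    # Type dictionaries: term -> semantic type.
    types: dict[str, str] = field(default_factory=dict)
    # Substitution dictionaries: variant -> preferred form (synonyms).
    substitutions: dict[str, str] = field(default_factory=dict)
    # Exclude dictionaries: terms that should never be extracted.
    excludes: set[str] = field(default_factory=set)

# A hypothetical custom library complementing the compiled resources.
survey_lib = Library(
    types={"acme corp": "<Organization>", "paris": "<Location>"},
    substitutions={"acme": "acme corp"},
    excludes={"etc"},
)
```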
After the data is imported and converted, the extraction engine begins identifying candidate terms for extraction. Candidate terms are words or groups of words that are used to identify concepts in the text. During the processing of the text, single words (uni-terms) that are not in the compiled resources are considered candidate terms for extraction. Candidate compound words (multi-terms) are identified using part-of-speech pattern extractors. For example, the multi-term sports car, which follows the adjective-noun part-of-speech pattern, has two components. The multi-term fast sports car, which follows the adjective-adjective-noun part-of-speech pattern, has three components.
Finally, a special algorithm is used to handle uppercase letter strings, such as job titles, so that these special patterns can be extracted.
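The part-of-speech pattern idea can be sketched as follows, assuming a pre-tagged token stream and only two hard-coded patterns; the actual extractors draw on much richer resources:

```python
# Patterns follow the examples above: adjective-noun and
# adjective-adjective-noun. The tags and pattern set are assumptions.
PATTERNS = [
    ("ADJ", "ADJ", "NOUN"),  # e.g. "fast sports car"
    ("ADJ", "NOUN"),         # e.g. "sports car"
]

def candidate_multiterms(tagged):
    """tagged: list of (word, pos) pairs; yields matching word groups."""
    for i in range(len(tagged)):
        for pattern in PATTERNS:
            window = tagged[i:i + len(pattern)]
            if len(window) == len(pattern) and all(
                pos == want for (_, pos), want in zip(window, pattern)
            ):
                yield " ".join(word for word, _ in window)

tokens = [("fast", "ADJ"), ("sports", "ADJ"), ("car", "NOUN")]
print(list(candidate_multiterms(tokens)))  # ['fast sports car', 'sports car']
```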
Step 3. Identifying equivalence classes and integrating synonyms
After candidate uni-terms and multi-terms are identified, the software uses a set of algorithms to compare them and identify equivalence classes. An equivalence class is a base form of a phrase, or a single form chosen to represent two variants of the same phrase. The purpose of assigning phrases to equivalence classes is to ensure that, for example, president of the company and company president are not treated as separate concepts. To determine which concept to use for the equivalence class (that is, whether president of the company or company president is used as the lead term), the extraction engine applies the following rules in the order listed; a sketch of these rules in code follows the list:
- The user-specified form in a library.
- The most frequent form in the full body of text.
- The shortest form in the full body of text (which usually corresponds to the base form).
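A minimal sketch of how these three rules might be applied in order; the function signature and the tie-breaking details are assumptions for illustration:

```python
def choose_lead_term(variants, frequencies, user_specified=None):
    """Pick the lead term of an equivalence class by the rules above."""
    # Rule 1: a user-specified form in a library always wins.
    if user_specified in variants:
        return user_specified
    # Rule 2: the most frequent form in the full body of text.
    top_count = max(frequencies.get(v, 0) for v in variants)
    top = [v for v in variants if frequencies.get(v, 0) == top_count]
    if len(top) == 1:
        return top[0]
    # Rule 3: the shortest form (usually the base form) breaks ties.
    return min(top, key=len)

freqs = {"company president": 7, "president of the company": 7}
print(choose_lead_term(list(freqs), freqs))  # 'company president'
```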
Step 4. Assigning a type
Next, types are assigned to extracted concepts. A type is a semantic grouping of concepts. Both compiled resources and the libraries are used in this step. Types include such things as higher-level concepts, positive and negative words, first names, places, organizations, and more. Additional types can be defined by the user.
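As a rough illustration, type assignment can be thought of as a lookup over type dictionaries. The dictionary contents and the <Unknown> fallback below are assumptions, not the product's built-in behavior:

```python
# A toy merger of compiled and library type dictionaries.
TYPE_DICTIONARIES = {
    "paris": "<Location>",
    "acme corp": "<Organization>",
    "maria": "<Person>",
}

def assign_type(concept: str, default: str = "<Unknown>") -> str:
    """Look up a concept's semantic type, falling back to a default."""
    return TYPE_DICTIONARIES.get(concept.lower(), default)

print(assign_type("Paris"))       # <Location>
print(assign_type("sports car"))  # <Unknown>
```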
Step 5. Indexing
The entire set of records or documents is indexed by establishing a pointer between a text position and the representative term for each equivalence class. This means that all inflected-form instances of a candidate concept are indexed under the candidate base form. The global frequency is calculated for each base form.
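Conceptually, the index maps each base form to the positions of its instances, which also yields the global frequency. A toy version, with an assumed function name and a hand-made lemma table:

```python
from collections import defaultdict

def build_index(tokens, base_form):
    """tokens: list of surface words; base_form: maps a word to its base form."""
    index = defaultdict(list)
    for position, word in enumerate(tokens):
        # Inflected instances are indexed under their base form.
        index[base_form(word)].append(position)
    # Global frequency is simply the number of recorded positions.
    return {base: (positions, len(positions)) for base, positions in index.items()}

lemma = {"cars": "car", "car": "car", "fast": "fast"}.get
idx = build_index(["fast", "cars", "car"], lambda w: lemma(w, w))
print(idx)  # {'fast': ([0], 1), 'car': ([1, 2], 2)}
```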
Step 6. Matching patterns and events extraction
Text Analytics can discover not only types and concepts but also relationships among them. Several algorithms and libraries are available with this tool and provide the ability to extract relationship patterns between types and concepts. They are particularly useful when attempting to discover specific opinions (for example, product reactions) or the relational links between people or objects (for example, links between political groups or genomes).
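A toy example of the idea behind such pattern matching: pair an opinion word with the next typed concept that follows it. The token format, the opinion lexicon, and the <Product> type here are illustrative assumptions, not the product's actual pattern rules:

```python
OPINIONS = {"love": "+", "hate": "-"}  # assumed sentiment lexicon

def extract_opinions(typed_tokens):
    """typed_tokens: list of (token, type) pairs; yields (concept, type, polarity)."""
    for i, (token, _) in enumerate(typed_tokens):
        if token in OPINIONS:
            # Link the opinion word to the next typed concept after it.
            for concept, ctype in typed_tokens[i + 1:]:
                if ctype is not None:
                    yield concept, ctype, OPINIONS[token]
                    break

tokens = [("i", None), ("love", None), ("this", None), ("sports car", "<Product>")]
print(list(extract_opinions(tokens)))  # [('sports car', '<Product>', '+')]
```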