Oracle8 ConText Cartridge Application Developer's Guide
Release 2.3

A58164-01

7
Linguistic Concepts

This chapter describes the approach used by ConText linguistics to provide advanced analysis of English-language text.

The following topics are covered in this chapter:

Overview of ConText Linguistics

ConText linguistics is used to analyze the content of English-language documents. You use ConText linguistics to create different views of the contents of documents that allow the user to quickly review the essential content of documents and determine their relevance.

Because these services are separate and distinct from text and theme indexing, you can incorporate linguistic analysis and functionality in a text application, independent of the text/theme indexing process.

ConText linguistics can generate the following forms of linguistic output for documents:

Output Type                    Description
Themes                         The main concepts of a document.
Gist                           Paragraph or paragraphs in a document that best represent what the document is about as a whole.
Theme Summary                  Paragraph or paragraphs in a document that best represent a given theme in the document.
Sentence-Level Gist            Sentence or sentences in a document that best represent the themes in the document as a whole.
Sentence-Level Theme Summary   Sentence or sentences in a document that best match a single theme in the document.

You obtain linguistic output by submitting a linguistic request using the CTX_LING PL/SQL package. Linguistic requests can only be processed by ConText servers running with the Linguistic personality.

Requirements

The requirements for using ConText linguistics are:

Application Program Interface (API)

Linguistic and queue management functions are invoked by using PL/SQL procedures called or executed within the programming language in which the application is developed. If the application is developed in PL/SQL, these procedures may be invoked directly as PL/SQL execute statements. If the application is developed in another language, such as C, the PL/SQL procedures for linguistic and queue management functions are accessed through the Oracle Call Interface (OCI).

ConText provides the following PL/SQL packages for generating linguistic output and managing the Services Queue, respectively:

CTX_LING Package

The stored procedures in CTX_LING are used to request linguistic output and submit the requests to the Services Queue. CTX_LING also provides procedures for specifying user settings for generating linguistic output and enabling logging of parse information generated during the processing of a request.

The model for submitting requests and querying the linguistic output is similar to the two-step query model (CONTAINS procedure) provided within the ConText framework for content-based text retrieval.

For example, to generate themes for a document, you first create a table to store the results of the theme generation, then call the CTX_LING.REQUEST_THEMES procedure followed by the CTX_LING.SUBMIT function. ConText stores the results in the theme table. To view the results, issue a SELECT statement to select the themes from the theme table.

See Also:

For more information about the procedures in the CTX_LING package, see "CTX_LING: Linguistics" in Chapter 10.

 

CTX_SVC Package

The stored procedures in CTX_SVC are used to monitor the Services Queue for the status of specific requests. CTX_SVC can be used to check the status of pending requests, and to display errors encountered. You can also cancel the request if it has not been picked up for processing by a ConText server or clear the request if the request encountered an error.
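The cancel and clear rules above can be pictured as a small state machine. The following Python sketch is a toy model only: the `Status` names and the `Request` class are assumptions made for illustration, not CTX_SVC identifiers, and nothing here calls ConText.

```python
from enum import Enum


class Status(Enum):
    """Hypothetical request states; the names are illustrative."""
    PENDING = "pending"    # queued, not yet picked up by a server
    RUNNING = "running"    # picked up by a server
    SUCCESS = "success"    # finished without error
    ERROR = "error"        # finished with an error


class Request:
    """Toy stand-in for one Services Queue entry (hypothetical)."""

    def __init__(self, handle):
        self.handle = handle
        self.status = Status.PENDING
        self.removed = False

    def cancel(self):
        # Cancelling is only allowed before a server picks the request up.
        if self.status is not Status.PENDING:
            raise ValueError("only pending requests can be cancelled")
        self.removed = True

    def clear(self):
        # Clearing is only allowed for requests that ended in an error.
        if self.status is not Status.ERROR:
            raise ValueError("only errored requests can be cleared")
        self.removed = True


pending = Request(handle=1)
pending.cancel()              # allowed: still pending

failed = Request(handle=2)
failed.status = Status.ERROR
failed.clear()                # allowed: ended in an error

print(pending.removed, failed.removed)
```

In this model, a request that is already running can be neither cancelled nor cleared, which mirrors the two conditions stated in the text.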

See Also:

For more information about procedures in the CTX_SVC package, see "CTX_SVC: Services Queue Administration" in Chapter 10.

 

Linguistic Personality

To process requests for linguistic output (themes and Gists), a ConText server with the Linguistic personality must be running. A ConText server with the Linguistic personality can also have other personalities in its personality mask.

Starting up ConText servers is the task of the ConText administrator, through the CTXSYS Oracle user.

See Also:

For more information about the Linguistic personality and about specifying personality masks for ConText servers, see Oracle8 ConText Cartridge Administrator's Guide.

 

Services Queue

The Services Queue is used for managing ConText linguistic requests. A request is cached in memory until the requestor submits it, at which time the request is added to the Services Queue. If more than one request is cached in memory when the requestor submits, ConText stores all of the requests as a single batch job.

If a ConText server has the Linguistic personality, the server monitors the Services Queue for requests and processes the next request in the queue.


Note:

If no ConText servers with the 'L' personality are running, the Services Queue still accepts requests and holds the requests for the next available ConText server with the appropriate personality.
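The request, submit, and pickup behavior described above can be sketched as a small in-memory model. This is an illustration under stated assumptions, not the ConText implementation: the `ServicesQueue` class, its method names, and the personality strings other than 'L' are hypothetical.

```python
from collections import deque

LINGUISTIC = "L"  # personality letter taken from the text above


class ServicesQueue:
    """Toy in-memory model of the request/submit flow (hypothetical)."""

    def __init__(self):
        self._cached = []        # requests made but not yet submitted
        self._queue = deque()    # submitted batches awaiting a server
        self._next_handle = 1

    def request(self, description):
        # A request is only cached until the requestor submits.
        self._cached.append(description)

    def submit(self):
        # Everything cached at submit time is stored as one batch job.
        handle = self._next_handle
        self._next_handle += 1
        self._queue.append((handle, list(self._cached)))
        self._cached.clear()
        return handle

    def process_next(self, personality_mask):
        # Only a server whose mask includes 'L' takes linguistic requests;
        # otherwise the batch stays queued for a later server.
        if LINGUISTIC not in personality_mask or not self._queue:
            return None
        return self._queue.popleft()


queue = ServicesQueue()
queue.request("themes for document 1")
queue.request("gist for document 1")
handle = queue.submit()

print(queue.process_next("DQ"))  # None: no Linguistic personality, batch is held
print(queue.process_next("QL"))  # (1, ['themes for document 1', 'gist for document 1'])
```

The two cached requests travel together under one handle, and the batch survives the first call because that server's mask lacks 'L'.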

 

The ConText administration tool can be used to perform all administration functions on the Services Queue, such as cleaning up entries. In addition, the CTX_SVC PL/SQL package can be used to perform ConText administration from the command line.

Creating Linguistic Output

You can generate linguistic output in batch during the text indexing process or generate it as needed. Because the generation of linguistic output is independent of the text-indexing process, ConText places no restrictions on when you can create themes and Gists.

See Also:

For more information about generating linguistic output at indexing time versus generating linguistic output on demand, see "Combining Theme/Text Queries with Linguistic Output" in Chapter 8.

 

Linguistic Core

The linguistic core is made up of the following components:

Lexicon

The lexicon is a static knowledge base that provides word and phrase information for the parsing engine. The lexicon recognizes over one million English words and phrases and defines hundreds of lexical characteristics for each word.


Note:

The lexicon is specific to the English language, but it recognizes the difference between American and British usage and spelling.

 

Linguistic information about words in the lexicon is divided into the following types:

Information Type   Description
Syntax             Syntax flags provide surface level assessments of a word or phrase isolated from its grammatical context.
Theme              Theme flags identify the thematic qualities of a word (e.g. weak noun/needs support, strong verb). The parser uses these flags to determine how a word contributes to the thematic construction of the sentence as a whole.

Knowledge Catalog

The knowledge catalog is a language-independent organization of industries, fields of study, special terms and jargon, and abstract concepts. It creates a classification scheme that defines ConText's semantic view of the world.

ConText uses the knowledge catalog to generate linguistic output, to classify documents by theme during theme indexing, and to normalize theme queries.

See Also:

For more information about the knowledge catalog, see "Understanding Theme Queries" in Chapter 4.

 

Parsing Engine

ConText uses the linguistic parsing engine whenever you request thematic analysis of text, whether through CTX_LING.REQUEST_GIST or CTX_LING.REQUEST_THEMES, or through theme indexing and querying.

The parsing engine grammatically analyzes text, identifying phrase, sentence, and paragraph boundaries. It then interprets meaning, selecting the high-information content to produce theme output. The lexicon and knowledge catalog provide the reference information necessary to do this processing.

If case conversion is enabled, the parsing engine converts all the text to lowercase and processes the text through the case-sensitivity routines to determine capitalization.


Note:

Case conversion does not affect the original text of the documents being processed; only the output of the parsing engine is stored in mixed case.

 

Linguistic Output

ConText linguistics produces the following output:

List of Themes

You can generate a list of themes, that is, a list of the main concepts of a document, on a per-document basis. Because themes present a profile of the main subjects of a document, a list of themes provides a snapshot of what the document is about. To generate a list of themes, use CTX_LING.REQUEST_THEMES. You can generate a list of themes in two ways:

Single Themes

You can generate up to 16 themes for each document, using the CTX_LING.REQUEST_THEMES procedure. This procedure writes a single word or phrase that represents the theme to a row in the theme table. The words or phrases that represent the themes are normalized themes derived from the knowledge catalog.

Theme Hierarchies

You can also generate each theme name accompanied by its parent themes. To enable hierarchical theme output, you must call CTX_LING.SET_FULL_THEMES before you call CTX_LING.REQUEST_THEMES.

Including theme hierarchy information in the theme table helps you match themes with the theme summaries generated by CTX_LING.REQUEST_GIST.


Note:

ConText linguistics produces only document-level themes; paragraph-level themes cannot be produced.

 

See Also:

For more information about generating themes, see "Generating Themes and Gists" in Chapter 8.

 

Theme Weight

When you generate document themes, ConText assigns a weight that measures the strength of the theme relative to the other themes in the document.

The cumulative weight of a theme also reflects the overall thematic content of the document. As such, theme weights can be used to compare a document theme to other themes within the same document or to other documents with the same theme.
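To make the two comparisons concrete, the sketch below ranks the themes within one document and then ranks the documents that share a theme. The document keys, theme names, and weights are invented sample data, and the (key, theme, weight) row layout is an assumed shape for theme output rather than the documented theme table definition.

```python
# (document key, theme, weight) rows; all values are invented sample data.
rows = [
    ("doc1", "banking", 42),
    ("doc1", "interest rates", 17),
    ("doc1", "politics", 5),
    ("doc2", "banking", 12),
    ("doc2", "agriculture", 30),
]


def themes_for(doc, rows):
    """Themes of one document, strongest first."""
    found = [(theme, weight) for pk, theme, weight in rows if pk == doc]
    return sorted(found, key=lambda item: item[1], reverse=True)


def documents_for(theme, rows):
    """Documents that carry a theme, ordered by that theme's weight."""
    found = [(pk, weight) for pk, t, weight in rows if t == theme]
    return sorted(found, key=lambda item: item[1], reverse=True)


print(themes_for("doc1", rows))
# [('banking', 42), ('interest rates', 17), ('politics', 5)]
print(documents_for("banking", rows))
# [('doc1', 42), ('doc2', 12)]
```

With this sample data, "banking" is the strongest theme in doc1, and doc1 carries that theme more heavily than doc2.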

Theme Summaries

A theme summary for a document provides a short summary of the document from a specific point of view. You can generate two types of theme summaries:

A paragraph-level theme summary consists of the paragraph or paragraphs that best represent a single document theme.

A sentence-level theme summary consists of the sentence or sentences that best match a single document theme.

To create either paragraph-level or sentence-level theme summaries, use CTX_LING.REQUEST_GIST.

Because it provides a concise, focused summary for a particular theme in a document, a theme summary can be used to compare documents with similar themes.

You can control the size of sentence-level and paragraph-level theme summaries with linguistic settings.


Note:

The settings for theme summaries can only be modified by creating custom setting configurations in the GUI administration tool.

 

See Also:

For more information about how to generate theme summaries, see "Generating Themes and Gists" in Chapter 8.

For more information on specifying linguistic settings, see "Specifying Linguistic Settings" in Chapter 8.

For a complete list of ConText's predefined labels, see the specification for CTX_LING.SET_SETTINGS_LABEL in Chapter 10.

 

Gists

A Gist for a document provides a summary that reflects all of the themes in the document. You can generate two types of Gists:

A paragraph-level Gist consists of the document paragraphs that best represent the themes in a document as a whole.

A sentence-level Gist is the sentence or sentences that best represent the themes in a document as a whole.

To generate either a paragraph-level or sentence-level Gist, use CTX_LING.REQUEST_GIST.

Because a Gist is generally longer than a theme summary, it serves better as a document reading tool than a document selection tool. For example, it can be used to quickly scan a document and to extract the most meaningful thematic information.


Note:

The settings for Gists can only be modified by creating custom setting configurations in the GUI administration tool.

 

See Also:

For more information about how to generate Gists, see "Generating Themes and Gists" in Chapter 8.

For more information on specifying linguistic settings, see "Specifying Linguistic Settings" in Chapter 8.

For a complete list of ConText's predefined labels, see the specification for CTX_LING.SET_SETTINGS_LABEL in Chapter 10.

 

Theme Indexes

Theme indexes are created as a prerequisite for issuing theme queries. Given a theme policy, you can create a theme index for all documents in an entire text column using CTX_DDL.CREATE_INDEX.


Note:

A theme index is the only type of linguistic output you can generate without running an 'L' server.

See Also:

For more information about creating theme indexes, see "Understanding Theme Queries" in Chapter 4 and the Oracle8 ConText Cartridge Administrator's Guide.

 

Linguistic Settings

You can perform linguistic processing of documents to generate themes and Gists only when a ConText server with the Linguistic personality is running. ConText provides two predefined linguistic settings, one for mixed-case text and one for all upper-case text:

Setting               Description
GENERIC               Default configuration. Parses mixed-case English text. Produces theme output.
Case Sensitive (SA)   Same as GENERIC except that ConText converts all upper-case or all lower-case text to case-sensitive text before performing theme analysis. When your text is all upper-case or all lower-case and you use this setting to convert the text, Oracle Corporation does not recommend creating theme indexes or issuing theme queries. Creating theme indexes with the SA setting does not produce consistent results.

You can set these options with the CTX_LING.SET_SETTINGS_LABEL procedure. You can also define your own settings with the administration tool and set these settings with CTX_LING.SET_SETTINGS_LABEL. With the administration tool, you can create settings to control the following options:

When you use the administration tool to create your own settings, Oracle Corporation recommends using one of the ConText predefined settings as a starting point, depending on whether your text is mixed case, all upper-case, or all lower-case.

See Also:

For more information on how to specify linguistic settings, see "Specifying Linguistic Settings" in Chapter 8.

For more information about using the administration tool to set your own labels, see the help file for the administration tool.

 




Copyright © 1997 Oracle Corporation.

All Rights Reserved.
