6
Text Concepts

This chapter introduces the concepts necessary for understanding how text is setup and managed by ConText.

The following topics are discussed in this chapter:

The ConText Data Dictionary

ConText utilizes a data dictionary, separate from the Oracle data dictionary, for storing indexing options. The following objects are registered in the ConText data dictionary:

policies (template and user-created) and the table and column to which each policy is assigned
indexing preferences (default and user-created)
sources for automated batch-loading of text into database columns
text loading preferences
Tiles

The ConText data dictionary also stores resource limits and the status of all ConText servers that are currently running.

The ConText data dictionary is owned by the Oracle user CTXSYS. CTXSYS and the data dictionary tables and views are created during installation of ConText.

Text Operations

ConText supports five types of operations that are processed by ConText servers:

Text Loading
DDL
DML
Text/Theme Queries

Linguistics Requests

Note:

The personality mask for a ConText server determines which operations the server can process.

For more information about personality masks, see "Personality Masks" in Chapter 2, "Administration Concepts".

Text Loading

Text loading is an ongoing operation performed by ConText servers running with the Loader personality. It differs from the other text operations in that a request is not made to the Text Request Queue for handling by the appropriate ConText server.

Instead, ConText servers with the Loader personality regularly scan a document repository (i.e. operating system directory) for documents to be loaded into text columns for indexing.

If a file is found in the directory, the contents of the file are automatically loaded by the ConText server into the appropriate table and column.

See Also:

For more information about text loading using ConText servers, see "Automated Batch Loading" in this chapter.

DDL

A ConText DDL operation is a request for the creation, deletion, or optimization of a text/theme index on a text column. DDL requests are sent to the DDL pipe in the Text Request Queue, where available ConText servers with the DDL personality pick up the requests and perform the operation.

DDL operations are requested through the GUI administration tools (System Administration or Configuration Manager) or the CTX_DDL package.

See Also:

For more information about the CTX_DDL package, see "CTX_DDL: Text Setup and Management" in Chapter 11, "PL/SQL Packages - Text Management".

DML

A text DML operation is a request for the ConText index (text or theme) of a column to be updated. An index update is necessary for a column when the following modifications have been made to the table:

insertion of a new row
deletion of an existing row
update of the primary key or text column(s) for an existing row

Requests for index updates are stored in the DML Queue where they are picked up and processed by available ConText servers. The requests can be placed on the queue automatically by ConText or they can be placed on the queue manually.

In addition, the system can be configured so DML requests in the queue are processed immediately or in batch mode.

Automatic DML Queue Notification

DML requests are automatically placed in the queue via an internal trigger that is created on a table the first time a ConText index is created for a text column in the table.

ConText supports disabling automatic DML at index creation time through a parameter, create_trig, for CTX_DDL.CREATE_INDEX. The create_trig parameter specifies whether the DML trigger is created/updated during indexing of the text column in the column policy.

In addition, a DML trigger can be removed at any time from a table using CTX_DDL.DROP_INTTRIG.

Note:

DROP_INTTRIG deletes the trigger for the table. If the table contains more than one text column with existing ConText indexes, automatic DML is disabled for all the text columns.

DROP_INTTRIG is provided mainly for maintaining backward compatibility with previous releases of ConText and should be used only when it is absolutely necessary to disable automatic DML for all the text columns in a table.

If the DML trigger is not created during indexing or is dropped, the ConText index is not automatically updated when subsequent DML occurs for the table. Manual DML can always be performed, but automatic DML can only be reenabled by first dropping, then recreating the ConText index or creating your own trigger to handle updates.

Manual DML Queue Notification

DML operations may be requested manually at any time using the CTX_DML.REINDEX procedure, which places a request in the DML Queue for a specified document.

Immediate DML Processing

In immediate mode, one or more ConText servers are running with the DML personality. The ConText servers regularly poll the DML Queue for requests, pick up any pending requests (up to 10,000 at a time) for an indexed column and update the index in real-time.

In this mode, an index is only briefly out of synchronization with the last insert, delete, or update that was performed on the table; however, immediate DML processing can use considerable system resources and create index fragmentation.

Batch DML Processing

If a text table has frequent updates, you may want to process DML requests in batch mode. In batch mode, no ConText servers are running with the DML personality. The queue continues to accept requests, but the requests are not processed because no DML servers are available.

To start DML processing, the CTX_DML.SYNC procedure is called. This procedure batches all the pending requests for an indexed column in the queue and sends them to the next available ConText server with a DDL personality. Any DML requests that are placed in the queue after SYNC is called are not included in the batch. They are included in the batch that is created the next time SYNC is called.

SYNC can be called with a level of parallelism. The level of parallelism determine the number of batches into which the pending requests are grouped. For example, if SYNC is called with a parallelism level of two, the pending requests are grouped into two batches and the next two available DDL ConText servers process the batches.

Calling SYNC in parallel speeds up the updating of the indexes, but may increase the degree of index fragmentation.

Concurrent Index Creation

A text column within a table can be updated while a ConText server is creating an index on the same text column. Any changes to the table being indexed by a ConText server are stored as entries in the DML Queue, pending the completion of the index creation.

After index creation completes, the entries are picked up by the next available DML ConText server and the index is updated to reflect the changes. This avoids a race condition in which the DML Queue request might be processed, but then overwritten by index creation, even though the index creation was processing an older version of the document.

Text/Theme Queries

A text query is any query that selects rows from a table based on the contents of the text stored in the text column(s) of the table.

A theme query is any query that selects rows from a table based on the themes generated for the text stored in the text column(s) of the table.

Note:

Theme queries are only supported for English-language text.

ConText supports three query methods for text/theme queries:

In addition, ConText supports Stored Query Expressions (SQEs).

Before a user can perform a query using any of the methods, the column to be queried must be defined as a text column in the ConText data dictionary and a text and/or theme index must be generated for the column.

See Also:

For more information about text columns, see "Text Columns" in this chapter.

For more information about text/theme queries and creating/using SQEs, see Oracle8 ConText Cartridge Application Developer's Guide.

Two-step Queries

In a two-step query, the user performs two distinct operations. First, the ConText PL/SQL procedure, CONTAINS, is called for a column. The CONTAINS procedure performs a query of the text stored in a text column and generates a list of the textkeys that match the query expression and a relevance score for each document. The results are stored in a user-defined table.

Then, a SQL statement is executed on the result table to return the list of documents (hitlist) or some subset of the documents.

One-step Queries

In a one-step query, the ConText SQL function, CONTAINS, is called directly in the WHERE clause of a SQL statement. The CONTAINS function accepts a column name and query expression as arguments and generates a list of the textkeys that match the query expression and a relevance score for each document.

The results generated by CONTAINS are returned through the SELECT clause of the SQL statement.

In-memory Queries

In an in-memory query, PL/SQL stored procedures and functions are used to query a text column and store the results in a query buffer, rather than in the result tables used in two-step queries.

The user opens a CONTAINS cursor to the query buffer in memory, executes a text query, then fetches the hits from the buffer, one at a time.

Stored Query Expressions

In a stored query expression (SQE), the results of a query expression for a text column, as well as the definition of the SQE, are stored in database tables. The results of a SQE can be accessed within a query (one-step, two-step, or in-memory) for performing iterative queries and improving query response.

The results of an SQE are stored in an internal table in the index (text or theme) for the text column. The SQE definition is stored in a system-wide, internal table owned by CTXSYS. The SQE definitions can be accessed through the views, CTX_SQES and CTX_USER_SQES.

See Also:

For more information about the SQE result table, see "SQR Table" in Appendix C, "ConText Index Tables and Indexes".

Linguistics Requests

The ConText Linguistics are used to analyze the content of English-language documents. The application developer uses the Linguistics output to create different views of the contents of documents.

The Linguistics currently provide two types of output, on a per document basis, for English-language documents stored in an Oracle database:

list of themes
document Gist and/or theme summaries

See Also:

For more information about themes, Gists, and theme summaries, as well as using the Linguistics in applications, see Oracle8 ConText Cartridge Application Developer's Guide.

Text Columns

A text column is any column used to store either text or text references (pointers) in a database table or view. ConText recognizes a column as a text column if one or more policies are defined for the column.

Supported Datatypes

Text columns can be any of the supported Oracle datatypes; however, text columns are usually one of the following datatypes:

CHAR
VARCHAR2
LONG
LONG RAW
BLOB
CLOB
BFILE

A table can contain more than one text column; however, each text column requires a separate policy.

See Also:

For more information about policies and text columns, see "Policies" in Chapter 7, "Understanding the ConText Data Dictionary: Indexing".

For more information about Oracle datatypes, see Oracle8 Server Concepts.

For more information about managing LOBs (BLOB, CLOB, and BFILE), see Oracle8 Application Developer's Guide and PL/SQL User's Guide and Reference.

Textkeys

ConText uses textkeys to uniquely identify a document in a text column. The textkey for a text column usually corresponds to the primary key for the table or view in which the column is located; however, the textkey for a column can also reference unique keys (columns) that have been defined for the table.

When a policy is defined for a column, the textkey for the column is specified.

Composite Textkeys

A textkey for a text column can consist of up to sixteen primary or unique key columns.

During policy definition, the primary/unique key columns are specified, using a comma to separate each column name.

In two-step queries, the columns in a composite textkey are returned in the order in which the columns were specified in the policy.

In an in-memory queries, the columns in a composite textkey are returned in encoded form (e.g. 'p1,p2,p3'). This encoded textkey must be decoded to access the individual columns in the textkey.

Note:

There are some limits to composite textkeys that must be considered when setting up your tables and columns, and when creating policies for the columns.

See Also:

For more information about encoding and decoding composite textkeys, see Oracle8 ConText Cartridge Application Developer's Guide.

Column Name Limitations

There is a 256 character limit, including the comma separators, on the string of column names that can be specified for a composite textkey.

Because the comma separators are included in this limit, the actual limit is 256 minus (no. of columns minus 1), with a maximum of 241 characters (256 - 15), for the combined length of all the column names in the textkey.

This limit is enforced during policy creation.

Column Length Limitations

There is a 256 character limit on the combined lengths of the columns in a composite textkey. This is due to the way the textkey values for composite textkeys are stored in the index.

For a given row, ConText concatenates all of the values from the columns that constitute the composite textkey into a single value, using commas to separate the values from each column.

As such, the actual limit for the lengths of the textkey columns is 256 minus (no. of columns minus 1), with a maximum of 241 characters (256 - 15), for the combined length of all the columns.

Note:

If you allow values that contain commas (e.g. numbers, dates) in your textkey columns, the commas are escaped automatically by ConText during indexing. The escape character is the backslash character.

In addition, if you allow values that contain backslashes (e.g. dates or directory structures in Windows) in your textkey columns, ConText uses the backslash character to escape the backslashes.

As a result, when calculating the limit for the length of columns in a composite textkey, the overall limit of 256 (241) characters must include the backslash characters used to escape commas and backslashes contained in the data.

Text Loading

The loading of text into database tables is required for using ConText to perform queries and generate linguistic output. This task can be performed within an application; however, if you have a large document set, you may want to perform loading as a batch process.

See Also:

For more information about building text loading capabilities into your applications, see Oracle8 ConText Cartridge Application Developer's Guide.

Plain Text Loading

For loading strings of plain (ASCII) text into individual rows (documents), you can use the INSERT command in SQL.

See Also:

For more information about the INSERT command, see Oracle8 Server SQL Reference.

Batch Loading

Either SQL*Loader or ctxload can be used to perform batch loading of text into a database column.

SQL*Loader

To perform batch loading of plain (ASCII) text into a table, you can use SQL*Loader, a data loading utility provided by Oracle.

See Also:

For more information about SQL*Loader, see Oracle8 Server Utilities.

ctxload Utility

For batch loading plain or formatted text, you can use the ctxload command-line utility provided by ConText.

The ctxload utility loads text from a load file into a specified database table. The load file can contain multiple documents, but must use a defined structure and syntax. In addition, the load file can contain plain (ASCII) text or it can contain pointers to separate files containing either plain or formatted text.

Note:

ctxload is best suited for loading text into columns that use direct data store. If you use external data store to store file pointers in the database, it is possible to use ctxload; however, you should use another loading method, such as SQL*Loader.

See Also:

For an example of loading text using ctxload, see "Using ctxload" in Chapter 9, "Setting Up and Managing Text".

Automated Batch Loading

If you set up sources for your columns, you can use ConText servers running with the Loader personality to automate batch loading of text from load files.

If a ConText server is running with the Loader personality, it regularly checks all the sources that have been defined for columns in the database, then scans specified directories for new files. When a new file appears, it calls ctxload to load the contents of the file into the appropriate column.

When loading of the file contents is successful, the server deletes the file to prevent the contents from being loaded again.

See Also:

For an example of automated text loading, see "Using ConText Servers for Automated Text Loading" in Chapter 9, "Setting Up and Managing Text".

User-Defined Translators

If the contents of the file to be loaded are not in the load file format required by ctxload, the file needs to be formatted before loading.

To ensure that the files are in the correct format, a user-defined translator can be specified as one of the preferences in the source for the column.

A user-defined translator is any program that accepts a plain text file as input and generates a load file formatted for ctxload as its output. The user-defined translator could also be used to perform pre-loading cleanup and spell-checking of your text.

After the contents of the load file have been successfully loaded into the column, the load file generated by the translator is deleted along with the original input file to prevent the contents from being loaded again.

See Also:

For more information about translators for text loading, see "Translator Tiles" in Chapter 8, "Understanding the ConText Data Dictionary: Text Loading".

Error Handling

If an error occurs while loading, the error is written to the error log, which can be viewed using CTX_INDEX_ERRORS. In addition, the original file is not deleted.

Text Storage

ConText supports three methods of storing text in a column:

Direct Storage
External Storage

Master-Detail Storage

Note:

The tables illustrated in the following sections are examples only. The column names and definitions for actual tables used to store text will vary depending on the needs of your application.

Direct Storage

With direct storage, text for documents is stored directly in a database column.

The following table description illustrates a table in which text is stored directly in a column:

Table Name Column Name Datatype Description

DIR_TEXT

TEXTKEY

NUMBER

Primary or unique key for table

TEXTDATE

DATE

Document publication date

AUTHOR

VARCHAR2(50)

Document author

NOTES

VARCHAR2(2000)

Text column with direct storage

TEXT

LONG

Text column with direct storage

Table Name	Column Name	Datatype	Description
DIR_TEXT	TEXTKEY	NUMBER	Primary or unique key for table
	TEXTDATE	DATE	Document publication date
	AUTHOR	VARCHAR2(50)	Document author
	NOTES	VARCHAR2(2000)	Text column with direct storage
	TEXT	LONG	Text column with direct storage

The requirements for storing text directly in a column are relatively straightforward. The text is physically stored in a text column and the policy for the text column contains a Data Store preference that utilizes the DIRECT Tile.

External Storage

With external storage, the text column does not contain the actual text of the document, but rather stores a pointer to the file that contains the text of the document.

Suggestion:

If text is stored as external text in a column, the column should be either a CHAR or VARCHAR2 column. LONG and LONG RAW columns are best suited for documents stored internally in the database.

The pointer can be either:

a file name for accessing text stored in local operating system files (OSFILE Tile)
a Uniform Resource Locator (URL) for accessing text stored in Web files on the World Wide Web or locally (URL Tile)

The following table description illustrates a table that uses external data storage:

Table Name Column Name Datatype Description

EXT_TEXT

TEXTKEY

NUMBER

Primary or unique key for the table

TEXTDATE

DATE

Document publication date

AUTHOR

VARCHAR2(50)

Document author

NOTES

VARCHAR2(2000)

Text column with direct text storage

TEXT

VARCHAR2(100)

Text column with names of operating system files that contain the document text

Table Name	Column Name	Datatype	Description
EXT_TEXT	TEXTKEY	NUMBER	Primary or unique key for the table
	TEXTDATE	DATE	Document publication date
	AUTHOR	VARCHAR2(50)	Document author
	NOTES	VARCHAR2(2000)	Text column with direct text storage
	TEXT	VARCHAR2(100)	Text column with names of operating system files that contain the document text

In this example, the only difference between a table used to store text internally and externally is the datatype of the text column. In an external table, the text column would typically be assigned a datatype of VARCHAR2, rather than LONG, because the column contains a pointer to a file rather than the contents of the file (which requires more space to store).

However, there are additional requirements for storing text externally due to the different methods (file names and URLs) of accessing text stored in flat files.

See Also::

For more information about the requirements for storing text externally, see "External Text" in this chapter.

Master-Detail Storage

Master-detail storage is for documents stored directly in a text column, similar to direct storage; however, each document consists of one or more rows which are indexed as a single row.

In a master-detail relationship, the master table contains the textkey column and the detail table contains the text column, the line number column, and a foreign key to a primary or unique key column in the master table.

The foreign key and the line number columns comprise the primary key for the detail table, which is used to store the text.

The following table description illustrates two tables with a master-detail relationship:

Table Name Column Name Datatype Description

MASTER

PK

NUMBER

Primary key for table

AUTHOR

VARCHAR2

Document author

TITLE

VARCHAR2

Document title

DETAIL

FK

NUMBER

Foreign key to master.pk

LINENO

NUMBER

Detail information for document

TEXT

VARCHAR2

Text column

The following query illustrates the relationship between the two tables:

select DETAIL.TEXT
from DETAIL
where DETAIL.FK = MASTER.PK
order by DETAIL.LINENO

ConText supports two methods of creating policies for text columns in master-detail tables:

Policies on Columns in Master Table

With this method, the MASTER DETAIL NEW Tile is used to create Data Store preferences, which are used in the policy assigned to one of the columns in the master table. The column to which the policy is assigned (i.e. the text column) can be any column in the master table, except the column that serves as the textkey column for the policy.

Note:

The contents of the text column are not actually indexed. The text column only serves as a place-holder for the policy.

The detail table name and attributes, including the name of the column that contains the text to be indexed, are specified in the Data Store preference.

Using the tables described above, the textkey for the policy would be pk in master. The text column for the policy could be either author or title.

The Data Store preference for the policy would identify detail as the detail table, lineno as the line number column, and text as the column containing the text to be indexed.

See Also:

For an example of creating a policy on a master table column, see"Creating a Data Store Preference for a Master Table" in Chapter 9, "Setting Up and Managing Text"

Advantages

This method has the following advantages:

DML is handled with one insert to the DML Queue, resulting in a smaller queue and quicker processing
Structured data queries in text/theme queries can be applied to the master table
For example:

exec ctx_query.contains('MY_POL','Oracle','ctx_temp', struct_query=>'author=''SMITH''');

Limitations

This method has the following limitations:

The column storing text in the detail table is limited to CHAR, VARCHAR2, and LONG datatypes.
Updates to individual rows in the detail table are no longer automatically detected, since the DML trigger is on the master table. Updates to the text in the detail table must be manually reindexed using CTX_DML.REINDEX or by creating a trigger on the detail table that calls CTX_DML.REINDEX.

Policies on Columns in Detail Table

With this method, the policy is created on the detail table, rather than on the master table, and the MASTER DETAIL Tile is used instead of the MASTER DETAIL Tile, to create Data Store preferences.

The textkey column and text column for the detail table, along with the line number column, are specified in the policy. The textkey column and the line number column together uniquely identify rows in the detail table.

Using the tables described above, the textkey for the policy would be fk in detail. The text column for the policy would be text.

Disadvantages

This method has the following disadvantages:

Structured data queries in text/theme queries may be slow. The relevant relational criteria is often stored in a different table, resulting in sub-selects to return structured data.
DML may be slow, because the DML trigger is created on the detail table. When a new row is created in the master table and its corresponding rows are created in the detail table, one request is sent to the DML queue for each new detail row, thereby slowing down the queue.

The syntax for one-step queries is non-intuitive. Since the policy is created on the detail table, the one-step query is on the detail table, which may result in multiple rows per document returned by a query.

Note:

This method is provided primarily to maintain backward compatibility with previous versions of ConText.

If you want to index text stored in master-detail tables, Oracle Corporation suggests that you create policies on the master table.

External Text

The requirements for storing text externally are more complicated than storing text directly in a column due to the different methods of accessing text stored in external files. This section provides detailed information about the two different external file methods supported by ConText:

Text Stored as File Names

For text stored as file names pointing to external files (OSFILE Tile), the name and location of the file must be stored.

File Names

The names of the external text files are stored in the text column.

Directory Path Names

The directory path(s) where the external text files are located can be stored in the text column as part of the file name or in the Data Store preference that you create for the OSFILE Tile.

Note:

If the preference does not contain the directory path for the files, ConText requires the directory path to be included as part of the file name stored in the text column.

File Access

All the external files referenced in the text column must be accessible from the server machine on which the ConText server is running. This can be accomplished by storing the files locally in the file system for the server machine or by mounting the remote file system to the server machine.

File Permissions

File permissions for external files in which text is stored must be set accordingly to allow ConText to access the files. If the file permissions are not set properly for a file and ConText cannot access the file, the file cannot be indexed or retrieved by ConText.

Text Stored as URLs

For text stored in external Web files, the complete address for each file must be stored as a URL in the text column and the URL Tile utilized in the policy for the column.

Note:

Text that contains HTML tags and is stored directly in a text column is considered internal, rather than external, text. As such, the Data Store preference for the text column policy would use the Data Store Tiles which support direct text storage.

In addition, Web files can be any format supported by the World Wide Web, including HTML files, plain (ASCII) text files, and proprietary formats, such as PDF and Word. The filter for the column must be able to recognize and process any of the possible documents formats that may be encountered on the Web.

A URL consists of the access scheme for the Web file and the address of the file, in the following format:

access_scheme://file_address

The ConText URL Tile supports three access scheme protocols in URLs:

Hypertext Transfer Protocol (HTTP)

If a URL uses HTTP, the file address contains the host name of the Web server where the file is located and, optionally, the URL path for the file on the Web server.

For example:

http://my_server.com/welcome.html

http://www.oracle.com

Note:

The file address may also (optionally) contain the port on which the Web server is listening.

In this context, a Web server is any host machine that is running an HTTP daemon, which accepts requests for files and transfers the files to the requestor.

File Transfer Protocol (FTP)

If a URL uses FTP, the file address contains the host name of the Web server where the file is located and, optionally, the directory path for the file on the Web server.

For example:

ftp://my_server.com/code/samples/sample1.tar.Z

Note:

The file address may also (optionally) contain a username/password for accessing the host machine.

In this context, a Web server is any host machine that is running an FTP daemon, which accepts requests for files and transfers the files to the requestor.

File Protocol

If a URL uses the file protocol, the address for the file contains the absolute directory path for the location of the file on the local file system.

For example:

file://private/docs/html/intro.html

The file referenced by a URL using the file protocol must reside locally on a file system that is accessible to the machine running ConText.

Because the file is accessed through the operating system, the machine on which the file is located does not need to be configured as a Web server. However, the same requirements that apply to text stored as file names apply to text stored as URLs which use the file protocol.

If the requirements are not met, ConText returns one or more error messages.

See Also:

For more information, see "Text Stored as File Names" in this chapter.

For the error messages returned by the URL data store, see Oracle8 Error Messages.

Intranet Support

Through HTTP and FTP, the URL Tile can be used to index files in an intranet, as well as files on any publicly-accessible Web servers on the World Wide Web.

Intranets are private networks that use the Internet to link machines in the network, but are protected from public access on the Internet via a gateway proxy server which acts as a firewall.

Outside a firewall, a URL request for a Web file is processed directly by the host machine identified in the URL. Within a firewall, requests are processed by the proxy server, which passes the request to the appropriate host machine and transfers the response back to the requestor.

For security reasons, access to an intranet is generally restricted to machines within the firewall; however, machines in an intranet can access the World Wide Web through the gateway proxy server if they have the appropriate permission and security clearance.

Document Access Using HTTP or FTP

When HTTP or FTP is used in a URL stored in the database, ConText acts as a client, submitting a request to a Web server for the file (document) referenced by the URL. If the request is successful, the Web server returns the file to ConText where it can be indexed for querying or highlighted for viewing.

Proxy Servers

If the document to be accessed is located on the World Wide Web outside a firewall and the machine on which ConText is installed is inside the firewall, a host machine that serves as the proxy (gateway) for the firewall must be specified as an attribute for the URL Tile.

A single machine can be specified as the proxy for handling HTTP and FTP requests or two separate machines can be specified, one for each protocol. If network traffic is expected to be heavy or a large number of FTP requests are expected, separate proxies should be specified for HTTP and FTP, since FTP is generally used for accessing large, binary files which may affect performance on the proxy server.

In addition to specifying proxy servers, a sub-string of host or domain names, which identify all or most of the machines internal to the firewall, should be specified. Access to these machines does not require going through the proxy server, which helps reduce the request load that your proxy server(s) have to process.

Multi-threading

In a single-threaded environment, a request for a URL blocks all other requests until a response to the request is returned. Because a response may not be returned for a long time, a single-threaded environment in any text system using HTTP or FTP to access files could create a bottleneck.

To prevent this type of bottleneck, the URL Tile supports multi-threading. With multi-threading, while one thread is blocked, waiting to communicate with a Web server, another thread can retrieve a document from another Web server.

Redirection

The response to a request to retrieve a URL may be a new (redirected) document to retrieve. The URL Tile supports this type of redirection by automatically processing the redirection to retrieve the new document. However, to avoid infinite loops, the URL Tile limits the number of redirections that it attempts to process to three (3).

Timeouts

The time necessary to retrieve a URL using HTTP may vary widely, depending on where the Web server is geographically located. The Web server may even be temporarily unreachable.

To allow control over the length of time an application waits for a response to an HTTP request for a URL, the URL data store supports specifying a maximum timeout.

Exception Handling

When using URLs as your data store, a number of exceptions can occur when a file is accessed. These exceptions are written as errors to the CTX_INDEX_ERRORS view.

The URL data store returns error messages for the following exceptions:

the document referenced in the URL has been permanently moved or cannot be found
access to the document referenced in the URL requires authentication which the user does not have or requires payment which the user must provide
access to the document referenced in the URL is denied by the Web server
the Web server referenced in the URL does not comply with HTTP standards
the specified URL is incorrectly formatted
connection to the Web server is denied (this may occur when the incorrect port is referenced in the URL or the Web server is outside the firewall of an intranet)
the wait for a response to a request to retrieve a URL from a Web server exceeds the maximum timeout specified for the URL preference in the text column policy
the maximum number of supported redirections were encountered in attempting to retrieve the document referenced in the URL
the length of the URL exceeds the maximum specified for the URL preference in the text column policy
the size of the document referenced in the URL exceeds the maximum specified for the URL preference in the text column policy

See Also::

For the error messages returned by the URL data store, see Oracle8 Error Messages.

Text Filtering

ConText supports both plain text and formatted text (i.e. Microsoft Word, WordPerfect). In addition, ConText supports text that contains hypertext markup language (HTML) tags.

Regardless of the format, ConText requires text to be filtered for the purposes of indexing the text or processing the text through the Linguistics, as well as highlighting the text for viewing.

This section discusses the following topics relevant to text filtering:

Internal Filters

ConText provides internal filters for:

Plain Text Filtering
HTML Filtering (plaint text containing HTML tags)
Formatted Text Filtering

Plain Text Filtering

Plain text requires little or no filtering because the text is already in the format that ConText requires for identifying tokens.

HTML Filtering

ConText provides an internal filter that supports English and Japanese text with HTML tags for versions 1, 2, and 3.

Note:

For non-English and non-Japanese documents that contain HTML tags, an external filter must be used.

The HTML filter processes all text that is delimited by the standard HTML tag characters (angle brackets).

All HTML tags are either ignored or converted to their representative characters in the ASCII character set. This ensures that only the text of the document is processed during indexing or by the Linguistics.

Formatted Text Filtering

ConText provides internal filters for filtering English and Western European text in a number of proprietary word processing formats.

Note:

For Japanese, Korean, and Chinese formatted text, external filters must be used.

The filters extract plain, ASCII text from a document, then pass the text to ConText, where the text is indexed or processed through the Linguistics. The following document formats are supported by the internal filters:

Format Version

AmiPro for Windows

1, 2, 3

Lotus 123 for DOS

4, 5

Lotus 123 for Windows

2, 3, 4, 5

Microsoft Word for DOS

5.0, 5.5

Microsoft Word for Macintosh

3, 4, 5.x

Microsoft Word for Windows

2, 6.x, 7.0

WordPerfect for DOS

5.0, 5.1, 6.0

WordPerfect for Windows

5.x, 6.x

Xerox XIF for UNIX

5, 6

Note:

Only the following formats support WYSIWYG viewing in the ConText viewer:

Microsoft Word for Windows 2 and 6.x
Word Perfect for DOS 5.0, 5.1, 6.0
Word Perfect for Windows 5.x, 6.x

For more information about the ConText viewer, see Oracle8 ConText Cartridge Workbench User's Guide.

For those formats not supported by the internal filters, user can define/create their own external filters.

External Filters

ConText provides a framework for users to plug-in third-party filters to extract pertinent text information from documents. These external filters can be used for a number of purposes, including:

indexing text stored in a format, such as PDF, for which an internal filter does not exist
removing unnecessary text or markup in a document prior to indexing or processing through the ConText Linguistics

For example, the Linguistics rely on text that is grouped into logical paragraphs. If the text stored in the database does not contain clearly-identified paragraphs, the quality of the output generated by the Linguistics may be poor.

An external filter that outlines the paragraph boundaries according to ConText standards could be created to ensure that the Linguistics are provided with an ordered, logical text feed.

Note:

External filters do not support WYSIWYG viewing in the ConText Workbench Viewer Control (Windows 32-bit).

For more information about the Viewer Control, see Oracle8 ConText Cartridge Workbench User's Guide.

External Filter Requirements

An external filter can be any executable (e.g. shell script, C program, perl script) that processes an input file and produces a plain text output file. The text in the output file then can be indexed or processed through the Linguistics.

If the document is in a proprietary format, the executable must recognize the format tags for the document and be able to convert the formatted text into plain (ASCII) text.

In addition, the executable must be able to run from the operating system command-line and accept two arguments:

name of an input file, which stores the document to be filtered
name of an output file, which stores the filtered, ASCII text of the document

The external filter does not need to provide the values for these arguments; Context provides the values as part of its external filter processing.

Note:

The name of the executable cannot be larger than 64 bytes. In addition, the name cannot contain blank spaces or any of the following illegal characters:

! @ # $ % ^ & * ( ) ~ \ Q ' , ^ : " ; ,

Performance Issues

Performance is dependent on the external filter; ConText cannot begin processing a document until the entire document has been filtered. The external filter that performs the filtering should be tuned/optimized accordingly.

Using External Filters

The process model for using external filters is:

Create a filter in the form of a command-line executable.

Store the executable on the server machine where ConText is installed.

Note:

The filter executable must be located in the appropriate directory for your environment.

For example, in a UNIX-based environment, the filter executables must be stored in $ORACLE_HOME/ctx/bin.

In a Windows NT environment, the executables must be stored in \BIN in the Oracle home directory.

For more information about the required location for the external filter executables, see the Oracle8 installation documentation specific to your operating system.

Create a Filter preference that calls the filter executable.
The Tile you use to create the preference depends on whether you use the column to store documents in a single format or mixed formats.
Create a policy that includes the Filter preference for the external filter.

See Also::

For examples of creating Filter preferences for external filters, see "Creating Filter Preferences" in Chapter 9, "Setting Up and Managing Text".

Filters for Single-Format Columns

For columns that store documents in only one format, a single filter is specified in the Filter preference for the column policy. The filtering method for the column is determined by whether the format is supported by the internal or external filters:

if the format is supported by the internal filters, the appropriate internal filter can be used
if the format is not one of the supported internal filter formats, an external filter must be used

See Also::

For examples of creating Filter preferences for single-format columns, see "Creating Filter Preferences" in Chapter 9, "Setting Up and Managing Text".

Filters for Mixed-Format Columns

For columns that store documents in mixed formats, the filtering method is determined by whether the formats are supported by the internal filters, external filters, or both:

if the formats are all supported by the internal filters, the internal Autorecognize filter can be used in the Filter preference
if none of the formats are supported by the internal filter formats, an external filter must be specified for each of the formats in the column

if some, but not all of the formats are supported by the internal filter formats, an external filter must be specified for each of the unsupported formats

Note:

In columns that use external filters, only those external filter formats supported by ConText for mixed-format columns can be used.

For a complete list, see "Supported External Filter Formats for Mixed-Format Columns" in this chapter.

See Also::

For examples of creating Filter preferences for mixed-format columns, see "Creating Filter Preferences" in Chapter 9, "Setting Up and Managing Text".

Autorecognize Filter (Internal)

Autorecognize is an internal filter that automatically recognizes the document formats of all the supported internal filters, as well as plain text (ASCII) and HTML formats, and extracts the text from the document using the appropriate filters.

Note:

Microsoft Word for Windows 7.0 documents are not recognized by Autorecognize. As a result, ConText does not support storing Microsoft Word for Windows 7.0 documents in mixed-format columns.

See Also:

For a complete list of supported internal filters, see "Internal Filters" in this chapter.

External-Only Filters

For mixed-format columns that use only external filters, each filter executable for the formats in the column must be explicitly named in the Filter preference for the column policy. In addition, a format ID must be specified for each filter executable.

The format ID is used by ConText to identify document formats in text columns that store multiple formats.

Internal and External Filters

If the column uses both internal and external filters, each external filter executable must be explicitly named in the Filter preference for the column policy. In addition, a format ID must be specified for each filter executable.

The internal filters do not have to be specified.

During filtering, ConText recognizes whether a format uses the internal or external filters and calls the appropriate filter.

Note:

If required, internal filters can be overridden in a Filter preference by explicitly calling an external filter for the format. This can be useful if you have an external filter that provides additional filtering not provided by the internal filters.

For example, you may have MS Word documents that you want spellchecked before indexing. You could create an external MS Word filter that performs the spellchecking and specify the external filter in the Filter preference for the column policy.

Supported External Filter Formats for Mixed-Format Columns

The following table lists all of the document formats that ConText supports for columns that use external filters and store documents in more than one format.

For each format, the format ID is also listed. This is the value that must be specified when creating a Filter preference using the BLASTER FILTER Tile with the executable attribute.

Note:

This list does not represent the complete list of formats that ConText is able to process. The external filter framework enables ConText to process any document format, provided the documents are stored in a single format and an external filter exists which can filter the format to plain text.

It also does not represent the list of formats for which Oracle provides external filters.

For the complete list of external filters supplied by Oracle, see "Supplied External Filters" in this chapter.

See Also:

For an example of using format IDs in Filter preferences, see "Creating Filter Preferences" in Chapter 9, "Setting Up and Managing Text".

Document Format	ID
AmiPro 1.x - 3.1	19	Enable 1.1, 2.0, 2.15	11	Lotus 123 4.x; Lotus 123 3.0; Lotus 123 1A, 2.0, 2.1	20	MS RTF; MS RTF (ANSI Char Set)	17
AmiPro Graphics SDW Samna Draw	62	Encapsulated PostScript Preview; Encapsulated PostScript Bitmap	66	Lotus Freelance	85	MS Word for DOS 6.0; MS Word for DOS 5.0, 5.5; MS Word for DOS 4.0; MS Word for DOS 3.0, 3.1	8
ASCII	90	First Choice 3.0 Data Base	13	Lotus Manuscript 2.0, 2.1	26	MS Word for Mac 5.0, 5.1; MS Word for Mac 4.0; MS Word for Mac 3.0	28
AT&T Crystal Writer	46	FrameMaker (MIF) 3.0; FrameMaker (MIF) 3.0 Win	42	Lotus PIC	67	MS Word for Windows 2.0; MS Word for Windows 1.x	18
AutoCAD (DXF, DXB)	53	Framework III, 1.0, 1.1	22	Macintosh Paint	88	MS Word for Windows 6.0; MS Word for Mac 6.0	68
CEOwrite 3.0	78	FullWrite Professional 1.0x	31	Microsoft Windows Paint 2.x	70	MS Works for Windows 3.0	69
Computer Graphics Metafile (CGM)	79	GIF (Graphical Interchange Format)	51	Macintosh QuickDraw (PICT)	64	MS Write for Windows 3.x	7
CorelDraw 2.x and 3.x	59	Harvard Graphics	87	MacWrite 4.5 - 5.0	29	MultiMate 4; MultiMate Advantage II; MultiMate Advantage; MultiMate 3.3	6
CTOS DEF	75	HP Graphics Language (HPGL)	83	MacWrite II 1.0 - 1.1	30	Navy DIF (GSA)	35
DBase IV 1.0; DBase III, III +	37	HTML Level 1, 2, 3	91	Mass 11, Version 8.0 -8.33	36	OfficePower 7; OfficePower 6	44
DCA/FFT - Final Form Text	27	IBM Writing Assistant 1.0	16	MastSoft Graphics (MSG)	49	OfficeWriter 6.0 - 6.2; OfficeWriter 5.0; OfficeWriter 4.0	9
DCA/RFT - Revisable Form Text	0	IGES	52	Micrografx Designer (DRW)	60	OS/2 Bitmap; Windows Bitmap (BMP); Windows RLE	63
Digital DX	15	Interleaf 5.2; Interleaf 5.2 - 6.0	32	MS Access 2.0	39	Paradox 3.5, 4.0	38
Digital WPS-PLUS	47	JPEG (Joint Photographic Experts Group)	58	MS Excel 5.0 - 6.0; MS Excel 4.0; MS Excel 3.0; MS Excel 2.1	21	PC Paintbrush (PCX)	71
EBCDIC	89	Legacy 1.x, 2.0	41	MS Powerpoint for Windows 2, 3, 4	84	PDF (Adobe Acrobat)	57

Document Format	ID
PeachText 5000 2.1.2	82	TIFF (Tagged Image File Format)	50	WiziDraw	86	WordPerfect 4.2; WordPerfect 4.1	80
PFS:First Choice 3.0; PFS:First Choice 2.0; PFS:First Choice 1.0; PFS:WRITE Ver C; Professional Write 2.0 - 2.2; Professional Write 1.0	12	Uniplex V7 - V8	77	WiziWord	56	WordPerfect Mac 1.0	81
Quattro Pro DOS; Quattro Pro Windows	45	Vokswriter 3, 4	74	Word For Word Intermediate Communications format (COM)	34	WordPerfect Mac 3.0; WordPerfect Mac 2.1; WordPerfect Mac 2.0	33
Q&A 4.0; Q&A Write 1.x, Q&A 3.0	10	Wang PC, Version 3	24	WordPerfect for Windows 6.1; WordPerfect for Windows 6.0; WordPerfect 6.0	1	WordStar 5.0, 5.5, 6.0, 7.0	40
Rapid File 1.0	23	Wang WITA	55	WordPerfect 5.1 (Mail Merge)	2	WordStar 2000, Rel 3.0	14
RGIP	61	Windows Clipboard	72	WordPerfect for Windows 5.x; WordPerfect 5.1; WordPerfect 5.0	3	WriteNow 3.0	54
Samna Word IV & IV + 1.0, 2.0	25	Windows ICON	73	WordPerfect Graphics 1 (WPG)	4	Xerox - XIF 5.0, 6.0	43
Sun Raster Graphics	65	Windows Metafile (WMF)	48	WordPerfect Graphics 2 (WPG)	5	XYWrite IV; XyWrite III Plus	76

Supplied External Filters

ConText provides a number of external filters, licensed from MasterSoft (Inso Corporation), on a number of platforms.

These filters can be used for filtering documents in many of the popular desktop publishing and word processor formats; however, the executables for the filters do not provide ConText with the required arguments, so ConText also provides scripts which act as wrappers for the executables:

Document Format	Version	Format ID	Wrapper Name	DOS (Windows NT) Executable	Sun Solaris 2.x Executable	Other UNIX Platforms Executable
AmiPro for Windows	1, 2, 3	19	amipro	w033b16d.exe	w4w33b	w4w33b
Lotus Freelance for Windows	2	85	lotusfre	w114b16d.exe	w4w114b	w4w114b
Lotus 123	2, 3, 4	20	lotus123	w020b16d.exe	w4w20b	w4w20b
Lotus 123	5	N/A	lotus123	w020b16d.exe	w4w20b	w4w20b
MS Excel	5	21	msexcel	w021b16d.exe	w4w21b	w4w21b
MS Excel	7	N/A	msexcel	w021b16d.exe	w4w21b	w4w21b
MS PowerPoint for Windows	2, 3, 4	84	power234	w109b16d.exe	w4w109b	w4w109b
MS PowerPoint for Windows	7	N/A	power7	w116b16d.exe	w4w116b	w4w116b
MS Word for DOS	5.0, 5.5	8	worddos	w005b16d.exe	w4w05b	w4w05b
MS Word for Macintosh	3, 4, 5	28	wordmac	w054b16d.exe	w4w54b	w4w54b
MS Word for Windows	2	18	wordwn2	w044b16d.exe	w4w44b	w4w44b
MS Word for Windows	6	68	wordwn67	w049b16d.exe	w4w49b	w4w49b
MS Word for Windows	7	N/A	wordwn67	w049b16d.exe	w4w49b	w4w49b
PDF/Adobe Acrobat	N/A	57	acropdf	acront.exe	acrosol	w4w107b (BETA)
WordPerfect for DOS; WordPerfect for Windows	5.0, 5.1; 5.x	3	wp5	w007b16d.exe	w4w07b	w4w07b
WordPerfect for DOS; WordPerfect for Windows	6.0; 6.x	1	wp67	w048b16d.exe	w4w48b	w4w48b
WordPerfect for Windows	7.0	N/A	wp67	w048b16d.exe	w4w48b	w4w48b
Xerox XIF	5, 6	43	xeroxxif	w103b16d.exe	w4w103b	w4w103b

Supplied External Filters Installation

The supplied external filter executables and their wrappers are installed automatically during installation of ConText. The location and format of the executable and wrapper files are operating system dependent.

Note:

If you have upgraded from a release prior to release 2.3 of ConText, you may have existing external filters supplied by ConText. These external filters are no longer up-to-date and should be replaced by the external filters provided in this release.

You may also need to change your wrappers accordingly or use the wrappers provided by ConText in this release.

See Also:

For more information about the location of the supplied external filters, see the Oracle8 installation documentation specific to your operating system.

Supplied External Filter Setup

The supplied external filters do not require any setup, aside from creating preferences that call the wrappers for the filters; however, if you have upgraded from a previous release and already have wrappers for the external filters provided in the previous release, as well as preferences that call the wrappers, if you want to use the new wrappers as provided, you must drop your indexes, policies, and preferences, then create new preferences, policies, and indexes.

To avoid this situation, you can choose one of the following actions:

modify the names of the new wrappers to match the names of the existing wrappers
modify the names of the new external filters to match the names of the previous external filters

modify your existing wrappers to call the new external filter executables

Note:

If you modify your existing wrappers, in general, the only information that needs to change in the wrappers is the name of the filter executable.

If you are using the BETA PDF filter provided for Windows NT and Sun Solaris in the previous release and wish to use the new production PDF filter for these two platforms, you should drop any ConText indexes that you created with the BETA PDF filter, as well as the policies and preferences that called the filter. Then, create new preferences/policies that use the new filter and create new ConText indexes using the policies.

Supplied External Filter Usage

The supplied external filters have the following three usage issues:

The wrapper name (e.g. 'amipro'), not the executable (e.g. 'w033b16d.exe' or 'w4w33b'), must be specified in the Filter preferences that you create for the supplied external filters.
Because the wrappers, and not the executables, are called in the Filter preferences that you create for the supplied external filters, you generally do not need to know the name of the filter executables; however, if you find it necessary to modify the wrappers, it may be useful to know the names of the filter executables.
If a format does not have a format ID (e.g. MS Word for Windows, version 7), the external filter cannot be used in text columns that store multiple formats. It can only be used in text columns that store a single format.
Wrapper names are operating system dependent and may be different than the names listed. In particular, the wrapper names may have suffixes (e.g. '.sh' or '.bat') that your operating system uses to identify scripts.
If your operating system requires specifying the complete name of a script in order to run the script, you must include any suffixes in the Filter preferences that you create using the supplied external filters.

See Also:

For examples of using the supplied external filters, see "Creating Filter Preferences" in Chapter 9, "Setting Up and Managing Text".

Limitations

The PDF filter provided for the UNIX-based platforms has a status of BETA. It does not support filtering multi-column documents or documents over approximately 1 Megabyte in size.

ConText Indexes

A ConText index is the construct that allows ConText servers to process queries and return information based on the content or themes of the text stored in a text column of an Oracle database. A ConText index is an inverted index consisting of all the tokens (words or themes) that occur in a text column and the documents (i.e. rows) in which the tokens are found.

This information is stored in database tables that are associated with the text column through a policy. A ConText index is created by calling CTX_DDL.CREATE_INDEX for the policy.

When an query is issued against a text column, rather than scan the actual text to find documents that satisfy the search criteria of the query, ConText searches the ConText index tables for the column to determine whether a document should be returned in the results of the query.

ConText supports two types of indexes, text and theme. This section discusses the following concepts relevant to both text and theme indexes:

ConText Index Tables

The ConText index for a text column consists of the following internal tables:

DR_nnnnn_I1Tn (token table)
DR_nnnnn_KTB (textkey mapping table)
DR_nnnnn_LST (control table)
DR_nnnnn_NLT (DOCID resolution table)
DR_nnnnn_I1W (Soundex wordlist table -- created only if Soundex is enabled)
DR_nnnnn_SQR (stored query expression result table)

The nnnnn string is an identifier (from 1000-99999) which indicates the policy of the text column for which the ConText index is created.

In addition, ConText automatically creates one or more Oracle indexes for each ConText index table.

The tablespaces, storage clauses, and other parameters used to create the ConText index tables and Oracle indexes are specified by the attributes set for the Engine preference (GENERIC ENGINE Tile) in the policy for the text column.

See Also:

For a description of the ConText index tables, see Appendix C, "ConText Index Tables and Indexes".

For more information about stored query expressions (SQEs), see Oracle8 ConText Cartridge Application Developer's Guide.

Creating Empty ConText Indexes

If you want to create the ConText index tables without populating the tables, ConText provides a parameter, pop_index, for CTX_DDL.CREATE_INDEX, which specifies whether the ConText index tables are populated during indexing.

Stages of ConText Indexing

ConText indexing takes place in three stages:

Index Initialization

During index initialization, the tables used to store the ConText index are created.

See Also:

For more information about the tables used to store the ConText index, see "ConText Index Tables" in this chapter.

Index Population

During index population, the ConText index entries for the documents in the text column are created in memory, then transferred to the index tables.

If the memory buffer fills up before all of the documents in the column have been processed, ConText writes the index entries from the buffer to the index tables and retrieves the next document from the text column to continue ConText indexing.

The amount of memory allocated for ConText indexing for a text column determines the size of the memory buffer and, consequently, how often the index entries are written to the index tables.

See Also:

For more information about the effects of frequent writes to the index tables, see "Index Fragmentation" and "Memory Allocation" in this chapter.

Index Termination

During index termination, the Oracle indexes are created for the ConText index tables. Each ConText index table has one or more Oracle indexes that are created automatically by ConText.

Note:

The termination stage only starts when the population stage has completed for all of the documents in the text column.

Columns with Multiple Indexes

A column can have more than one index by simply creating more than one policy for the column and creating a ConText index for each policy. This is useful if you want to specify different indexing options for the same column. In particular, this is useful if you want to create a text and theme index on a column.

When two indexes exist for the same column, one-step queries (theme or text) require the policy name, as well as the column name, to be specified for the CONTAINS function in the query. In this way, the correct index is accessed for the query.

This requirement is not enforced for two-step and in-memory queries, because they use policy name, rather than column name, to identify the column to be queried.

See Also:

For more information about one-step queries and the CONTAINS function, see Oracle8 ConText Cartridge Application Developer's Guide.

Index Fragmentation

As ConText builds an index entry for each token (word or theme) in the documents in a column, it caches the index entries in memory. When the memory buffer is full, the index entries are written to the ConText index tables as individual rows.

If all the documents (rows) in a text column have not been indexed when the index entries are written to the index tables, the index entry for a token may not include all of the documents in the column. If the same token is encountered again as ConText indexing continues, a new index entry for the token is stored in memory and written to the index table when the buffer is full.

As a result, a token may have multiple rows in the index table, with each row representing a index fragment. The aggregate of all the rows for a word/theme represents the complete index entry for the word/theme.

See Also:

For more information about resolving index fragmentation, see "Index Optimization" in this chapter.

Memory Allocation

A machine performing ConText indexing should have enough memory allocated for indexing to prevent excessive index fragmentation. The amount of memory allocated depends on the capacity of the host machine doing the indexing and the amount of text being indexed.

If a large amount of text is being indexed, the index can be very large, resulting in more frequent inserts of the index text strings to the tables. By allocating more memory, fewer inserts of index strings to the tables are required, resulting in faster indexing and fewer index fragments.

See Also:

For an example of allocating memory for ConText indexing, see "Creating an Engine Preference" in Chapter 9, "Setting Up and Managing Text".

Parallel Indexing

Parallel indexing is the process of dividing ConText indexing between two or more ConText servers. Dividing indexing between servers can help reduce the time it takes to index large amounts of text.

To perform indexing in parallel, you must start two or more ConText servers (each with the DDL personality) and you must correctly allocate indexing memory.

The amount of allocated index memory should not exceed the total memory available on the host machine(s) divided by the number of ConText servers performing the parallel indexing.

For example, you allocate 10 Mb of memory in the policy for the text column for which you want to create a ConText index. If you want to use two servers to perform parallel indexing on your machine, you should have at least 20 Mb of memory available during indexing.

Note:

When using multiple ConText servers to perform parallel indexing, the servers can run on different host machines if the machines are able to connect via SQL*Net to the database where the index is stored.

Index Updates

When an existing document in a text column is deleted or modified such that the ConText index is no longer up-to-date, the index must be updated.

However, updating the index for modified/deleted documents affects every row that contains references to the document in the index. Because this can take considerable time, ConText utilizes a deferred delete mechanism for updating the index for modified/deleted documents.

In a deferred delete, the document references in the ConText index token table (DR_nnnnn_I1Tn) for the modified/deleted document are not actually removed. Instead, the status of the document is recorded in the ConText index control table (DR_nnnnn_LST), so that the textkey for the document is not returned in subsequent text queries that would normally return the document.

Actual deletion of the document references from the token table (I1Tn) takes place only during optimization of a index.

Index Optimization

Optimization performs two functions for an index:

Compaction of Index Fragments
Removal of Document References (also known as actual deletion or garbage collection)

Compaction of index fragments results in fewer rows in the ConText index tables, which results in faster and more efficient queries. It also allows for more efficient use of tablespace.

Garbage collection updates the index strings to accurately reflect the status of deleted and modified documents.

Compaction of Index Fragments

Compaction combines the index fragments for a token into longer, more complete strings, up to a maximum of 64 Kb for any individual string.

ConText provides two methods of index compaction:

in-place compaction
two-table compaction (default)

In-place compaction uses available memory to compact index fragments, then writes the compacted strings back into the original (existing) token table in the ConText index.

Two-table compaction creates a second token table into which the compacted index fragments are written. When compaction is complete, the original token table is deleted.

Two-table compaction is faster than in-place compaction; however, it requires enough tablespace to be available during compaction to accommodate the creation and population of the second token table.

Removal of Document References

ConText provides optimization methods which can be used to perform the actual deletion of all references to modified/deleted documents in an index.

During an actual delete, the index references for all modified/deleted documents are removed from the ConText index token table (DR_nnnnn_I1Tn), leaving only references to existing, unchanged documents. In addition, in an actual delete, the ConText index control table (DR_nnnnn_LST) is cleared of the information which records the status of documents.

Similar to compaction, ConText supports in-place or two-table actual deletion.

When to Optimize

Index optimization should be performed regularly, as the indexing process can create many rows in the database depending on the amount of memory allocated for indexing and the amount of text being indexed.

In general, optimize an index after:

large amounts of text are indexed
parallel indexing has been utilized
large numbers of documents in a table have been modified or deleted

Index Log

The ConText index log records all the indexing operations performed on a policy for a text column. Each time an index is created, optimized, or deleted for a text column, an entry is created in the index log.

Log Details

Each entry in the log provides detailed information about the specified indexing operation, including:

the policy for the text column on which the indexing operation was performed
the indexing operation that was performed (creation, optimization, deletion)
if the indexing operation was performed in parallel, the ID of the server that processed the operation
whether the operation failed and, if it did, the stage at which it failed
the number of documents selected for processing and the number of documents actually processed during the indexing operation
the textkeys of the first and last documents processed

Accessing the Log

The index log is stored in an internal table and can be viewed using the CTX_INDEX_LOG or CTX_USER_INDEX_LOG views. The index log can also be viewed in the GUI administration tools (System Administration or Configuration Manager).

Text Indexes

A text index is generated by one of the text lexers provided by ConText and consists of:

a list of every unique token (word) in the collection of documents in a text column
for each word, a string that identifies each document in which the word occurs and the location of each occurrence within each document

There is a one-to-one relationship between a text index and the text indexing policy for which it was created.

See Also:

For more information about text indexing policies, see "Text Indexing Policies" in Chapter 7, "Understanding the ConText Data Dictionary: Indexing".

Text Lexers

The text lexer identifies tokens for creating text indexes. During text indexing, each document in the text column is retrieved and filtered by ConText. Then, the lexer identifies the tokens and extracts them from the filtered text and stores the tokens in memory, along with the document ID and locations for each word, until all of the documents in the column have been processed or the memory buffer is full.

The index entries, consisting of each token and its location string, are then written as rows to the token table for the ConText index and the buffer is flushed.

ConText provides a number of Lexer Tiles that can be used to create text indexes. For non-pictorial languages, such as English and the other Western European languages, ConText provides a single Tile named BASIC LEXER.

For pictorial languages, ConText provides a separate Tile for each of the languages supported by ConText (Japanese, Chinese, and Korean).

See Also:

For more information about the text indexing lexers, see "Lexer Tiles" in Chapter 7, "Understanding the ConText Data Dictionary: Indexing".

What's in a Text Index?

Text index entries consist of each unique token in the text column and a location string for each token.The text index may be case-sensitive or case-insensitive and contain stop words.

Tokens in Text Indexes

A token is the smallest unit of text that can be indexed.

In non-pictorial languages, tokens are generally identified as alphanumeric characters surrounded by white space and/or punctuation marks. As a result, tokens can be single words, strings of numbers, and even single characters.

In pictorial languages, tokens may consist of single characters or combinations of characters, which is why separate lexers are required for each pictorial language. The lexers search for character patterns to determine token boundaries.

See Also:

For more information about token recognition, see "Lexer Tiles" in Chapter 7, "Understanding the ConText Data Dictionary: Indexing".

Token Location Information

The location information for a token is bit string that contains the location (offsets in ASCII) of each occurrence of the token in each document in the column. The location information also contains any stop words that precede and follow the token.

Case-sensitivity

For non-pictorial languages, the BASIC LEXER Tile, by default, creates case-insensitive text indexes. In a case-insensitive index, tokens are converted to all uppercase in the index entries.

The Tile also provides an attribute, mixed_case, for enabling case-sensitive text indexes. In a case-sensitive index, entries are created using the tokens exactly as they appear in the text, including those tokens that appear at the beginning of sentences.

For example, in a case-insensitive text index, the tokens oracle and Oracle are recorded as a single entry, ORACLE. In a case-sensitive text index, two entries, oracle and Oracle, are created.

As a result, case-sensitive indexes may be much larger than case-insensitive indexes and may have some effect on text queries; however, case-sensitive indexes allow for greater precision in text queries.

Note:

The case-sensitivity of a text index affects the text queries performed against the index. If the text index is case-sensitive, text queries are also case-sensitive.

See Also:

For more information about case-sensitivity in text queries, see Oracle8 ConText Cartridge Application Developer's Guide.

Stop Words

A stop word is any combination of alphanumeric characters (generally a word or single character) for which ConText does not create an entry in the index. Stop words are specified in the Stoplist preference for a text indexing policy.

While stop words do not have entries in the text index, stop word information for tokens is stored as numbers in the token bit strings. The number corresponds to the sequence defined for the stop word. The token bit string stores up to eight of the contiguous stop words immediately preceding and following the token. Because the stop words are stored in the text index, stop words can be included phrases in text queries.

Note:

Stoplists for case-sensitive indexes are automatically case-sensitive, meaning stop words in the text are only indexed as stopwords if they exactly match the case of the stop words in the stoplist.

As a result, when creating a Stoplist preference for a column on which you want create a case-sensitive text index, you should specify a stoplist entry for each variation (i.e. lowercase, initial uppercase, uppercase) that may occur for a stop word.

See Also:

For an example of creating a Stoplist preference, see "Creating a Stoplist Preference" in Chapter 9, "Setting Up and Managing Text".

For more information about stoplists, see "Stoplist Tiles" in Chapter 7, "Understanding the ConText Data Dictionary: Indexing".

For more information about stop words in text queries, see Oracle8 ConText Cartridge Application Developer's Guide.

DDL and DML

Text indexes are processed using ConText servers with the DDL and DML personalities. All requests for index creation and optimization are processed by any currently available DDL servers.

Text index updates are processed by the DML or DDL servers that are running at the time, depending on the DML index update method (immediate or batch) you are using.

See Also:

For more information about DDL and DML operations, see "DDL" and "DML" in this chapter.

For more information about ConText server personalities, see "Personality Masks" in Chapter 2, "Administration Concepts".

Theme Indexes

Theme indexes are functionally identical to text indexes and are created in the same manner:

a policy is created for a column
a DDL request for index creation is submitted for the column
once the theme index has been generated, the column is enabled for all three query methods

The key to generating a theme index is the lexer that you specify for the column policy. Instead of specifying the basic (default) lexer, the theme lexer is specified.

Note:

Theme indexing is only supported for English-language text.

See Also:

For more information about text indexes, see "Text Indexes" in this chapter.

For more information about theme queries and query methods, see Oracle8 ConText Cartridge Application Developer's Guide.

Theme Lexer

For theme indexing, ConText provides a Tile, THEME LEXER Tile, that bypasses the standard text parsing routines and, instead, accesses the linguistic core in ConText to generate themes for documents.

The theme lexer analyzes text at the sentence, paragraph, and document level to create a context in which the document can be understood. It uses a mixture of statistical methods and heuristics to determine the main topics that are developed throughout the course of the document.

It also uses the ConText Knowledge Catalog, a collection of over 200,000 words and phrases, organized into a conceptual hierarchy with over 2,000 categories, to generate its theme information.

See Also:

For more information about the ConText Knowledge Catalog, see Oracle8 ConText Cartridge Application Developer's Guide.

What's in a Theme Index?

A theme index contains a list of all the tokens (themes) for the documents in a column and the documents in which each theme is found. Each document can have up to sixteen themes.

Note:

Offset and frequency information are not relevant in a theme index, so this type of information is not stored.

Tokens in Theme Indexes

Unlike the single tokens that constitute the entries in a text index, the tokens in a theme index often consist of phrases.

In addition, these phrases may be common terms or they may be the names of companies, products, and fields of study as defined in the Knowledge Catalog.

For example, a document about Oracle contains the phrase Oracle Corp. In a (case-sensitive) text index for the document, this phrase would have two entries, ORACLE and CORP, all in uppercase. In a theme index for the document, the entry would be Oracle Corporation, which is the canonical form of Oracle Corp., as stored in the Knowledge Catalog.

Theme Weights

Each document theme has a weight associated with it. The theme weight measures the strength of the theme relative to the other themes in the document. Theme weights are stored as part of the theme signature for a document and are used by ConText to calculate scores for ranking the results of theme queries.

Case-sensitivity

Theme indexes are always case-sensitive. Tokens (themes) are recorded in uppercase, lowercase, and mixed-case in a theme index. The case for the entry is determined by how the theme is represented in the Knowledge Catalog. If the theme is not in the Knowledge Catalog, the case for the entry is identical to the theme as it appears in the text of the document.

Linguistic Settings

ConText uses linguistic settings, specified as setting configurations, to perform special processing for text that is in all-uppercase or all-lowercase. The Linguistics also use the settings to determine the size of theme summaries and the size and generation method for Gists.

ConText provides two predefined setting configurations:

GENERIC (mixed-case text)
SA (all-uppercase or all-lowercase text)

GENERIC is the default predefined setting configuration and is automatically enabled for each ConText server at start up.

In addition, you can create your own custom setting configurations in either of the GUI administration tools provided in the ConText Workbench.

Note:

Theme indexing is not currently supported for all-uppercase or all-lowercase text. In addition, the other linguistic settings only affect Gist and theme summary generation for the ConText Linguistics.

As such, you do not need to create custom setting configurations and should always use the default setting configuration, GENERIC, for theme indexing.

See Also:

For more information about Linguistics, Gists, and theme summaries, as well as the linguistic settings, see Oracle8 ConText Cartridge Application Developer's Guide.

Index Fragmentation

Because the number of distinct themes in a collection of documents is usually fewer than the number of distinct tokens, theme indexes generally contain fewer entries than text index. As a result, index fragmentation is not as much of an issue in theme indexes as in text indexes; however, some fragmentation may occur during theme indexing.

Similar to text indexes, index fragments in theme indexes can be consolidated through the CTX_DDL.OPTIMIZE_INDEX procedure.

DDL and DML

Theme indexes are processed identically to text indexes, meaning that requests for index creation and optimization are processed by any currently available DDL servers.

Similarly, theme index updates are processed by either the DML or DDL servers that are running at the time, depending on the DML index update method (immediate or batch) you are using.

In contrast, Linguistics requests, such as theme and Gist/theme summary generation, use ConText servers with the Linguistic personality.

See Also:

For more information about DDL and DML operations, see "DDL" and "DML" in this chapter.

Base-letter Conversion

For each text column in a table, you can specify whether the characters used in single-byte (8-bit), non-English languages are to be converted to their base-letter representation. This means that words with diacritical marks (accents, umlauts, etc.) are converted to their base form before their tokens are stored in the text index for the column.

Text Indexing

Base-letter conversion is an attribute that you can set when creating a Lexer preference.

If base-letter conversion is enabled for the Lexer preference in a policy, during text indexing of the column for the policy, all characters containing diacritical marks are converted to their base form in the text index. The original text is not affected.

Base-letter conversion requires that the database character set is a subset of the NLS_LANG character set.

For example, suppose the NLS_LANG environment variable is set to French_France.WE8ISO8859P1 and the following piece of text is to be converted to its base-letter representation:

La référence de session doit être égale à 'name'

The sentence is indexed as:

la reference de session doit etre egale a name

Note:

Base-letter conversion requires that the language component for NLS_LANG is set to a single-byte language (e.g. French, German) that supports an extended (8-bit) character set. In addition, the charset component must be set to one of the 8-bit character sets (e.g. WE8ISO8859P1).

See Also:

For more information about enabling base-letter conversion for a text column, see "BASIC LEXER Tile" in Chapter 7, "Understanding the ConText Data Dictionary: Indexing".

For more information about National Language Support and the NLS_LANG environment variable, see Oracle8 Server Reference Manual.

Text Queries

In a text query on a column with base-letter conversion enabled, the query terms are automatically converted to match the base-letter conversion that was performed during text indexing.

Note:

Base-letter conversion works with all of the query operators (logical, control, expansion, thesaurus, etc.), except the STEM expansion operator.

See Also:

For more information about text queries and the query operators, see Oracle8 ConText Cartridge Application Developer's Guide.

Thesauri

Users looking for information on a given topic may not know which words have been used in documents that refer to that topic.

ConText enables users to create case-sensitive or case-insensitive thesauri which define relationships between lexically equivalent words and phrases. Users can then retrieve documents that contain relevant text by expanding queries to include similar or related terms as defined in a thesaurus.

The ConText thesauri formats and functionality are compliant with both ISO-2788 and ANSI Z39.19 (1993).

Thesaural Maintenance

Thesauri are stored in internal tables owned by CTXSYS. Each thesaurus is uniquely identified by a name that is specified when the thesaurus is created.

Thesaurus Creation and Modification

Thesauri can be created and modified by all ConText users with the CTXAPP role.

ConText supports thesaural maintenance through PL/SQL (CTX_THES package) and the System Administration tool.

Note:

Thesauri can be created, updated, and deleted by all users with the CTXAPP role.

In addition, the ctxload utility can be used for loading (creating) thesauri from a load file into the thesaurus tables, as well as dumping thesauri from the tables into output (dump) files.

The thesaurus dump files created by ctxload can be printed out or used as input for other applications. The dump files can also be used to load a thesaurus into the thesaurus tables. This can be useful for using an existing thesaurus as the basis for creating a new thesaurus.

See Also:

For more information, see "CTX_THES: Thesaurus Management" in Chapter 11, "PL/SQL Packages - Text Management".

For more information about ctxload, see Chapter 10, "Text Loading Utility".

Default Thesaurus

Before the query operators can be used in a query expression, a thesaurus named 'DEFAULT' must be created either through the GUI administration tools, CTX_THES.CREATE_INDEX or through ctxload.

The reason for this is because the thesaurus that is automatically used by the thesaurus operators is named DEFAULT, unless a different thesaurus is explicitly called by name in the query expression.

Case-sensitivity

Thesauri support case-sensitivity. In other words, terms are stored in a case-sensitive thesaurus exactly as entered. In addition, for terms that are expanded using thesaurus operators in a text query, the case of the terms is retained and used for thesaural look-up.

To support case-sensitive thesauri, the case_sensitive parameter is provided for the CTX_THES.CREATE_THESAURUS procedure. In addition, ctxload provides an argument, -thescase, to support importing case-sensitive thesauri.

Query Expansion

The expansions returned by the thesaurus operators are combined using the ACCUMULATE operator ( , ) in the query expression.

Note:

ConText supports creating multiple thesauri; however, only one thesaurus can be used at a time in a query.

See Also:

For more information about using thesauri and the thesaurus operators to expand queries, see Oracle8 ConText Cartridge Application Developer's Guide.

Text and Theme Queries

Thesauri are primarily used for expanding text queries, but can be used for expanding theme queries, provided a thesaurus has been created for the themes that can be generated by ConText.

Limitations

In a query, the expansions generated by the thesaurus operators don't follow nested thesaural relationships. In other words, only one thesaural relationship at a time is used to expand a query.

For example, B is a narrower term for A. B is also in a synonym ring with terms C and D, and has two related terms, E and F. In a narrower term query for A, the following expansion occurs:

NT(A) query is expanded to {A}, {B}

Note:

The query expression is not expanded to include C and D (as synonyms of B) or E and F (as related terms for B).

Types of Thesaural Relationships

Three types of relationships can be defined for terms (words and phrases) in a thesaurus:

In addition, each entry in a thesaurus can have Scope Notes associated with it.

Note:

ConText supports creating multiple thesauri; however, only one thesaurus can be used at a time in a query.

See Also:

For more information about using thesauri to expand queries, see Oracle8 ConText Cartridge Application Developer's Guide.

Synonyms

Support for synonyms is implemented through synonym entries in a thesaurus. The collection of all of the synonym entries for a term and its associated terms is known as a synonym ring.

Synonyms support the following entries:

Synonym Rings

Synonym rings are transitive. If term A is synonymous with term B and term B is synonymous with term C, term A and term C are synonymous. Similarly, if term A is synonymous with both terms B and C, terms B and C are synonymous. In either case, the three terms together form a synonym ring.

For example, in the synonym rings shown in this example, the terms car, auto, and automobile are all synonymous. Similarly, the terms main, principal, major, and predominant are all synonymous.

Note:

A thesaurus can contain multiple synonym rings; however, synonym rings are not named. A synonym ring is created implicitly by the transitive association of the terms in the ring.

As such, a term cannot exist twice within the same synonym ring or within more than one synonym ring in a thesaurus.

Synonym rings are not named, but they have an ID associated with them. The ID is assigned when the synonym entry is first created.

Preferred Terms

Each synonym ring can have one, and only one, term that is designated as the preferred term. A preferred term is used in place of the other terms in a synonym ring when one of the terms in the ring is specified with the PT operator in a query.

Note:

A term in a preferred term (PT) query is replaced by, rather than expanded to include, the preferred term in the synonym ring.

Hierarchical Relationships

Hierarchical relationships consist of broader and narrower terms represented as an inverted tree. Each entry in the hierarchy is a narrower term for the entry immediately above it and to which it is linked. The term at the root of each tree is known as the top term.

For example, in the tree structure shown in the following example, the term elephant is a narrower term for the term mammal. Conversely, mammal is a broader term for elephant. The top term is animal.

ConText also supports the following hierarchical relationships in thesauri:

Each of the three hierarchical relationships supported by ConText represents a separate branch of the hierarchy and are accessed in a query using different thesaurus operators.

Note:

The three types of hierarchical relationships are optional. Any of the three hierarchical relationships can be specified for a term.

Generic Hierarchy

The generic hierarchy represents relationships between terms in which one term is a generic name for the other.

For example, the terms rat and rabbit could be specified as generic narrower terms for rodent.

Partitive Hierarchy

The partitive hierarchy represents relationships between terms in which one term is part of another.

For example, the provinces of British Columbia and Quebec could be specified as partitive narrower terms for Canada.

Instance Hierarchy

The instance hierarchy represents relationships between terms in which one term is an instance of another.

For example, the terms Cinderella and Snow White could be specified as instance narrower terms for fairy tales.

Multiple Occurrences of the Same Term

Because the four hierarchies are treated as separate structures, the same term can exist in more than hierarchy. In addition, a term can exist more than once in a single hierarchy; however, each occurrence of the term must be accompanied by a qualifier.

If a term exists more than once as a narrower term in one of the hierarchies, broader term queries for the term are expanded to include all of the broader terms for the term.

If a term exists more than once as a broader term in one of the hierarchies, narrower term queries for the term are expanded to include the narrower terms for each occurrence of the broader term.

For example, C is a generic narrower term for both A and B. D and E are generic narrower terms for C. In queries for terms A, B, or C, the following expansions take place:

NTG(A) expands to {C}, {A}
NTG(B) expands to {C}, {B}
NTG(C) expands to {C}, {D}, {E}
BTG(C) expands to {C}, {A}, {B}

Note:

The same expansions hold true for standard, partitive, and instance hierarchical relationships.

Qualifiers

For homographs (terms that are spelled the same way, but have different meanings) in a hierarchy, a qualifier must be specified as part of the entry for the word. When homographs that have a qualifier for each occurrence appear in a hierarchy, each term is treated as a separate entry in the hierarchy.

For example, the term spring has different meanings relating to seasons of the year and mechanisms/machines. The term could be qualified in the hierarchy using the terms season and machinery.

To differentiate between the terms during a query, the qualifier must be specified. Then, only the terms that are broader terms, narrower terms, or related terms for the term and its qualifier are returned. If no qualifier is specified, all of the related, narrower, and broader terms for the terms are returned.

Note:

In thesaural queries that include a term and its qualifier, the qualifier must be escaped, because the parentheses required to identify the qualifier for a term will cause the query to fail.

Related Terms

Each entry in a thesaurus can have one or more related terms associated with it. Related terms are terms that are close in meaning to, but not synonymous with, their related term. Similar to synonyms, related terms are reflexive; however, related terms are not transitive.

If a term that has one or more related terms defined for it is specified in a related term query, the query is expanded to include all of the related terms.

For example, B and C are related terms for A. In queries for A, B, and C, the following expansions take place:

RT(A) expands to {A}, {B}, {C}
RT(B) expands to {A}, {B}
RT(C) expands to {C}, {A}

Note:

Terms B and C are not related terms and, as such, are not returned in the expansions performed by ConText.

Scope Notes

Each entry in the hierarchy, whether it is a main entry or one of the synonymous, hierarchical, or related entries for a main entry, can have scope notes associated with it.

Scope notes can be used to provide descriptions or comments for the entry.

Document Sections

Text queries tend toward low precision. Structured or meta-data can be used to create more precise text queries, thus increasing query precision.

Often, this data is embedded in the document itself. For example, HTML documents have title information, paragraph offsets, and other document attributes as part of the content. E-mail messages are another example in which the text of the message may contain fields with consistent, regularly-occurring headers such as subject: and date:.

With document sections, ConText allows users to leverage the structure of documents to increase text query precision. Users define rules for dividing documents into sections. ConText includes the section information in the ConText text index for a column so that text queries on the column can be restricted to a specified section.

Note:

Section searching does not apply to theme queries. As such, defining document sections for theme indexes is not supported.

In addition, because section information is stored in the text index during indexing, if you want to use section searching sections for columns with existing text indexes, you must drop the indexes, create the required sections, section groups, and preferences, then reindex the columns.

Sections

A section is a user-defined body of text, delimited by tags, within a document. Sections are named and grouped into section groups.

Note:

A section is not created as an individual object. Instead, a section is created by adding the section to an existing section group.

For examples of adding, as well as removing, sections in section groups, see "Managing Document Sections" in Chapter 9, "Setting Up and Managing Text".

Start and End Tags

The beginning of a section is explicitly identified by a start tag, which can be any token in the text, provided the token can be recognized by the lexer for the text column. Each section must have a start tag.

The end of a section can be identified explicitly by an end tag or implicitly by the occurrence of the next occurring start tag, depending on whether the section is defined as a top-level or self-enclosing section. As a result, end tags can be optional. Similar to start tags, end tags can be any token in the text, provided the token can be recognized by the lexer.

Note:

Start and end tags are not case-sensitive. The tag '<head>' is identical to the tag '<HEAD>'.

For documentation purposes, all references to start and end tags in this section are presented in UPPERCASE.

Start and end tags are stored as part of the ConText index, but do not take up space in the index. For example, a document contains the following string, where <TITLE> and </TITLE> are defined as start and end tags:

<TITLE>cats</TITLE> make good pets

The string is indexed by ConText as:

cats make good pets

which enables searching on phrases such as cats make.

In addition, start and end tags do not produce hits if searched upon.

Suggestion:

Because each occurrence of a token specified as a start/end tag indicates the beginning/end of a section, specify tokens for start and end tags that are as distinctive as possible. Include any non-alphanumeric characters such as colons ': ' or angle brackets '<>' which help to uniquely identify the tokens.

For example, the term TITLE by itself does not make a good start tag, because it is a common word and ConText would record the start of a new section each time the term was encountered in the text. A better start tag would be the string <TITLE> or TITLE:.

Top-level Sections

A top-level section is only closed (implicitly) by the next occurring top-level section or (explicitly) by the occurrence of the end tag for the section. End tags are not required for top-level sections. In addition, a top-level section implicitly closes all sections that are not defined as top-level.

Top-level sections cannot enclose themselves or each other. As a result, if a section is defined as top-level, it cannot also be defined as self-enclosing.

Self-Enclosing Sections

A self-enclosing section is only closed (explicitly) when the end tag for the section is encountered or (implicitly) when a top-level section is encountered. As a result, end tags are required for sections that are defined as self-enclosing.

Self-enclosing sections support defining tags such as the table tag <TD> in HTML as a start tag. Table data in HTML is always explicitly ended with the </TD> tag. In addition, tables in HTML can have embedded or nested tables.

If a section is not defined as self-enclosing, the section is implicitly closed when another start tag is encountered. For example, the paragraph tag <P> in HTML can be defined as a start tag for a section that is not self-enclosing, because paragraphs in HTML are sometimes explicitly ended with the </P> tag, but are often ended implicitly with the start of another tag.

Limitations

Sections have the following limitations:

Implicit Start of Body Sections

ConText does not recognize the start of a body section after the implicit end of a header section.

For example, consider the following e-mail message in which FROM:, SUBJECT:, and NEWSGROUPS: are defined as start tags for three different sections:

From: jsmith@ABC.com
Subject: New teams
Newsgroups: arts.recreation, alt.sports

New teams have been added to the league.

All of the text following the NEWSGROUPS: header tag is included in the header section, including the body of the message.

Multi-word Start and End Tags

ConText does not support start and end tags consisting of more than one word. Each start and end tag for a section can contain only a single word and the word must be unique for each tag within the section group.

For example:

problem description: Insufficent privileges
problem solution: Grant required privileges to file

The strings PROBLEM DESCRIPTION: and PROBLEM SOLUTION: cannot be specified as start tags.

Identical Start and End Tags

ConText does not recognize sections in which the start and end tags are the same.

For example:

:Author:
Joseph Smith
:Author:
:Title:
Guide to Oracle
:Title:

The strings :AUTHOR: and :TITLE: cannot be specified as both start and end tags.

Section Groups

A section group is the collection of all the sections for a text column. Section groups are assigned by name to a text column through the Wordlist preference in the policy for the column.

Sections in Section Groups

The start and end tags for a particular section must be unique within the section group to which the section belongs. In addition, within a section group, no start tag can also be an end tag.

Section names do not have to be unique within a section group. This allows defining multiple start and end tags for the same logical section, while making the section details transparent to queries.

Section Group Management

Section groups can be created and deleted by ConText users with the CTXADMIN or CTXAPP roles. In addition, users with CTXADMIN or CTXAPP can add and remove sections from section groups. Section group names must be unique for the user who creates the section group.

See Also:

For examples of creating and deleting section groups, as well as adding and removing sections in section groups, see "Managing Document Sections" in Chapter 9, "Setting Up and Managing Text".

Startjoin and Endjoin Characters

To enable defining document sections, ConText supports specifying non-alphanumeric characters (e.g. hyphens, colons, periods, brackets) using the startjoins and endjoins attribute for the BASIC LEXER Tile.

When a character defined as a startjoins appears at the beginning of a word, it explicitly identifies the word as a new token and end the previous token. When an character specified as an endjoins appears at the end of a word, it explicitly identifies the end of the token.

Note:

Characters that are defined as startjoins and endjoins are included as part of the entry for the token in the ConText index.

Text Filtering

Section searching requires the start and end tags for the document sections to be included in the ConText index. This is accomplished through the use of ConText filters and the (optional) definition of startjoins and printjoins for the BASIC LEXER Tile.

For HTML text that uses the internal HTML filter, document sections have an additional requirement. Because the internal HTML filter removes all HTML markup during filtering, you must explicitly specify the HTML tags that serve as section start and end tags and, consequently, must not be removed by the filter.

This is accomplished through the keep_tag attribute for the HTML FILTER Tile. The keep_tag attribute is a multi-value attribute that lets users specify the HTML tags to keep during filtering with the internal HTML filter.

For HTML filter that is filtered using an external HTML filter, the filter must provide some mechanism for retaining HTML tags used as section start and end tags.

Document Section Setup

The process model for creating sections and enabling section searching is as follows:

Use CTX_DDL.CREATE_SECTION_GROUP to create a section group for the sections.
When you call CREATE_SECTION_GROUP, you specify the name of the section group to create.
Call CTX_DDL.ADD_SECTION for each section that you want to create in your section group.
When you call ADD_SECTION, you specify the name of the section, the start and end tags for the section, and whether the section is top-level or self-enclosing.
If you are creating sections for HTML documents and you use the internal HTML filter, set the keep_tag attribute (HTML FILTER Tile) once for each of the HTML tags that the filter retains for use as section start and end tags.
Then create a Filter preference for the Tile.
If necessary, specify values for the startjoins and endjoins attributes (BASIC LEXER Tile) and create a Lexer preference for the Tile.
Use the section_group attribute (GENERIC WORD LIST Tile) to specify the name of the your section group and create a Wordlist preference for the Tile.

Create a policy that includes the section-enabled preferences (Filter, Lexer, and Wordlist) that you created.

When you create the policy, you specify the text column where your HTML text is stored.

See Also:

For examples of creating section groups and sections, as well as creating a section-enabled Wordlist preference, see "Managing Document Sections" in Chapter 9, "Setting Up and Managing Text".

For examples of specifying attributes for the HTML FILTER and BASIC LEXER Tiles, see "Filter Examples" and "Lexer Examples" in Chapter 7, "Understanding the ConText Data Dictionary: Indexing".

Predefined HTML Section Group and Sections

ConText provides a predefined section group, BASIC_HTML_SECTION, which enables section searching in basic HTML documents.

BASIC_HTML_SECTION contains the following section definitions:

Section Name Start Tag End Tag Top Level Self-Enclosing

HEAD

<HEAD>

</HEAD>

Yes

No

TITLE

<TITLE>

</TITLE>

No

No

BODY

<BODY>

</BODY>

Yes

No

PARA

<P>

</P>

No

No

HEADING

<H1>

</H1>

No

No

<H2>

</H2>

No

No

<H3>

</H3>

No

No

<H4>

</H4>

No

No

<H5>

</H5>

No

No

<H6>

</H6>

No

No

In addition, the following predefined preferences have been created to support ready-to-use basic HTML section searching:

Filter preference - BASIC_HTML_FILTER
Lexer preference - BASIC_HTML_LEXER
Wordlist preference - BASIC_HTML_WORDLIST

Section Searching

A query expression operator, WITHIN, is provided for restricting a text query to a particular section.

See Also:

For more information about the WITHIN operator and performing text queries using document sections, see Oracle8 ConText Cartridge Application Developer's Guide.

Table Name	Column Name	Datatype	Description
MASTER	PK	NUMBER	Primary key for table
	AUTHOR	VARCHAR2	Document author
	TITLE	VARCHAR2	Document title
DETAIL	FK	NUMBER	Foreign key to master.pk
	LINENO	NUMBER	Detail information for document
	TEXT	VARCHAR2	Text column

Format	Version
AmiPro for Windows	1, 2, 3
Lotus 123 for DOS	4, 5
Lotus 123 for Windows	2, 3, 4, 5
Microsoft Word for DOS	5.0, 5.5
Microsoft Word for Macintosh	3, 4, 5.x
Microsoft Word for Windows	2, 6.x, 7.0
WordPerfect for DOS	5.0, 5.1, 6.0
WordPerfect for Windows	5.x, 6.x
Xerox XIF for UNIX	5, 6

Section Name	Start Tag	End Tag	Top Level	Self-Enclosing
HEAD	<HEAD>	</HEAD>	Yes	No
TITLE	<TITLE>	</TITLE>	No	No
BODY	<BODY>	</BODY>	Yes	No
PARA	<P>	</P>	No	No
HEADING	<H1>	</H1>	No	No
	<H2>	</H2>	No	No
	<H3>	</H3>	No	No
	<H4>	</H4>	No	No
	<H5>	</H5>	No	No
	<H6>	</H6>	No	No

6 Text Concepts

The ConText Data Dictionary

Text Operations

Text Loading

DDL

DML

Automatic DML Queue Notification

Manual DML Queue Notification

Immediate DML Processing

Batch DML Processing

Concurrent Index Creation

Text/Theme Queries

Two-step Queries

One-step Queries

In-memory Queries

Stored Query Expressions

Linguistics Requests

Text Columns

Supported Datatypes

Textkeys

Composite Textkeys

Column Name Limitations

Column Length Limitations

Text Loading

Plain Text Loading

Batch Loading

SQL*Loader

ctxload Utility

Automated Batch Loading

User-Defined Translators

Error Handling

Text Storage

Direct Storage

External Storage

Master-Detail Storage

Policies on Columns in Master Table

Advantages

Limitations

Policies on Columns in Detail Table

Disadvantages

External Text

Text Stored as File Names

File Names

Directory Path Names

File Access

File Permissions

Text Stored as URLs

Hypertext Transfer Protocol (HTTP)

File Transfer Protocol (FTP)

File Protocol

Intranet Support

Document Access Using HTTP or FTP

Proxy Servers

Multi-threading

Redirection

Timeouts

Exception Handling

Text Filtering

Internal Filters

Plain Text Filtering

HTML Filtering

Formatted Text Filtering

External Filters

External Filter Requirements

Performance Issues

Using External Filters

Filters for Single-Format Columns

Filters for Mixed-Format Columns

Autorecognize Filter (Internal)

External-Only Filters

Internal and External Filters

Supported External Filter Formats for Mixed-Format Columns

Supplied External Filters

Supplied External Filters Installation

Supplied External Filter Setup

Supplied External Filter Usage

Limitations

ConText Indexes

ConText Index Tables

Creating Empty ConText Indexes

6
Text Concepts