GMD - German National Research Center for Information Technology
Schloss Birlinghoven, Sankt-Augustin, D-53754 Germany
http://allanon.gmd.de/and/and.html
E-mail: gennady.andrienko@gmd.de
Tel: +49-2241-142329
Fax: +49-2241-142072
2. Related research
A pioneering work in automated knowledge-based visualization design was
done by J.Mackinlay. His software system APT can encode data variables,
according to their types and cardinality, by J.Bertin's visual
variables and construct graphical displays combining these visual
variables. This approach was adapted by F.Zhan and B.Buttenfield for
selection of an appropriate cartographic presentation method for one
spatially referenced data variable. Later V.Jung developed the system
Vizard capable of automated mapping of several independent variables.
Vizard accounts not only for data characteristics but also for the user's
objectives, though the latter are expressed as predefined generic tasks
that are either rather primitive or rather abstract: lookup, locate,
compare, see distribution.
Descartes takes into account data characteristics and conceptual relationships among data variables. For example, the system can "understand" that the database fields with numbers of female population from 0 to 14 years, female population from 15 to 64 years, male population from 0 to 14, and so on, essentially refer to one and the same variable "population number" measured for different age and sex groups, and that these groups are parts of the whole population. This kind of knowledge allows grounded selection of particular presentation techniques such as maps with pie charts or segmented bars. The same knowledge can be effectively used to guide the user in data analysis through communicating with her/him about her/his objectives in domain-specific terms rather than on an abstract level. On this potential we intend to base the further advancement of Descartes.
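The kind of knowledge described above can be illustrated by a minimal sketch. It is not the representation used in Descartes; the class FieldInfo, the functions parts_of_whole and candidate_presentations, and the field names are hypothetical and serve only to show how a system could check that several database fields measure one variable for groups that together make up a whole, and then consider pie charts or segmented bars.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class FieldInfo:
    """Hypothetical description of one database field."""
    name: str      # database field name (invented for this sketch)
    variable: str  # domain notion the field measures
    sex: str       # sex group the value refers to
    age: str       # age group the value refers to


def parts_of_whole(fields: Sequence[FieldInfo],
                   sexes: Sequence[str],
                   ages: Sequence[str]) -> bool:
    """True if the fields measure one variable and cover every
    sex-age combination exactly once."""
    if len({f.variable for f in fields}) != 1:
        return False
    covered = [(f.sex, f.age) for f in fields]
    expected = {(s, a) for s in sexes for a in ages}
    return len(covered) == len(expected) and set(covered) == expected


def candidate_presentations(fields: Sequence[FieldInfo],
                            sexes: Sequence[str],
                            ages: Sequence[str]) -> List[str]:
    """Propose part-to-whole presentations only when they are justified."""
    if parts_of_whole(fields, sexes, ages):
        return ["pie charts", "segmented bars"]
    return []


SEXES = ["female", "male"]
AGES = ["0-14", "15-64", "65+"]
FIELDS = [FieldInfo(f"{s[0].upper()}{a}", "population number", s, a)
          for s in SEXES for a in AGES]

print(candidate_presentations(FIELDS, SEXES, AGES))
```

If one of the groups is missing from the field list, the sketch proposes nothing, since the remaining fields no longer add up to the whole population.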
The use of direct manipulation techniques for visual data exploration was originally proposed in statistics by W.Cleveland. The most widely known is his idea of visually linking several graphical displays by means of brushing. M.Monmonier suggested applying this technique to maps linked with non-cartographic displays. Later the idea of linking different maps and other graphics was implemented by J.Dykes in his CDV system. CDV also offers facilities for interactive change of map symbolism, investigation of contiguity relationships, and some others. It is worth noting that interactive tools for changing presentation parameters with the aim of making maps more expressive were proposed by T.Yamahira et al. much earlier than the notion of dynamic displays emerged. These researchers developed a histogram interface for selecting intervals for a classed choropleth map. Later S.Egbert and T.Slocum considered interactive classification as an exploratory task.
A well-known group of dynamic manipulation techniques is devoted to database querying: the user is given convenient graphical widgets to alter query conditions and can immediately observe corresponding changes in graphical presentation of search results ("Dynamic Query" proposed by B.Shneiderman and S.Ahlberg, "Attribute Explorer" and "Influence Explorer" by H.Dawkes et al.).
Descartes offers a number of interactive exploratory techniques:
In connection with the user guidance we intend to implement, the Vizard system mentioned earlier is relevant. This system not only designs maps but also explains why a particular solution is proposed and which opportunities for analysis it offers. However, the parts concerning analytical opportunities are merely general descriptions of cartographic visualization methods, with no regard to the user's specific data and goals. Our plan is to guide the user by proposing to her/him a number of analysis scenarios specifically allowed by the data at hand. Such scenarios are automatically constructed on the basis of the system's knowledge about the data and the underlying problem domain, about the potential capabilities of different presentation methods, and about the available dynamic manipulation techniques and other system functions.
3. User guidance: why and how.
Comprehensive data analysis usually requires quite a number of operations
with data and their display. Accordingly, the functions and facilities
available in Descartes are numerous. This means that the user has to learn
them and always keep them in mind. Further, a rather long sequence of
operations is often needed to proceed from source data to a useful
presentation. For example, it may be necessary to transform absolute
values to percentages, calculate differences or ratios, filter database
records, etc. We intend to "wrap" such operation sequences into analysis
scripts presented to the user as various analytical tasks formulated in
terms of the analyzed data and domain notions. These scripts will, first,
make it easier to get acquainted with the system and relieve users of
memorizing its capabilities and, second, save the time and effort of even
experienced users.
The following example explains our idea. Suppose that a dataset under analysis contains the fields cited earlier, with absolute population numbers in sex-age population groups for different countries. The system can foresee several analytical tasks that can be done with these data: "study how sex structure varies depending on age", "study how age structure varies depending on sex", "study sex (or age) structure across countries irrespective of age (or sex)", "examine a particular age group", etc. These or similar formulations are proposed to the user as alternatives to select from. Behind each task stands a sequence of operations resulting in one or several potentially useful presentations and, possibly, some recommendations on how to use them and how to proceed further.
Suppose that the user has selected the first "task", study of dependency of sex structure on age. In response the system automatically calculates percentages of male and female in all age groups and creates a map with segmented bars: bars correspond to age division, and segments show proportions of male and female. Note that automation of calculating percentages and selection of this type of presentation really requires knowledge of conceptual relationships among fields.
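The calculation step of this scenario might look as follows. This is only a sketch: the function sex_shares_by_age, the record layout and the population figures are invented for illustration and do not come from Descartes or from any real dataset.

```python
from typing import Dict, Sequence, Tuple


def sex_shares_by_age(record: Dict[Tuple[str, str], float],
                      age_groups: Sequence[str]
                      ) -> Dict[str, Tuple[float, float]]:
    """For one country, return {age group: (male %, female %)}.

    Each pair describes one bar of the segmented-bar symbol:
    the bar stands for an age group, its two segments for the sexes.
    """
    shares = {}
    for age in age_groups:
        male = record[("male", age)]
        female = record[("female", age)]
        total = male + female
        if total == 0:
            # No population in this group: leave both segments empty.
            shares[age] = (0.0, 0.0)
        else:
            shares[age] = (100.0 * male / total, 100.0 * female / total)
    return shares


AGE_GROUPS = ["0-14", "15-64", "65+"]

# Made-up absolute numbers (in thousands) for a fictitious country.
country = {
    ("male", "0-14"): 510, ("female", "0-14"): 490,
    ("male", "15-64"): 2000, ("female", "15-64"): 2000,
    ("male", "65+"): 300, ("female", "65+"): 500,
}

for age, (male_pct, female_pct) in sex_shares_by_age(country, AGE_GROUPS).items():
    print(f"{age}: male {male_pct:.1f}%, female {female_pct:.1f}%")
```

The percentages are computed within each age group, not over the whole population, which is exactly what requires knowing that the male and female fields of one age group are parts of the same whole.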
When displaying the map to the user, the system supplies it with a brief comment explaining that this map is suitable for seeing local differences in sex structure depending on age in each country, or for pairwise comparison of countries, but does not help in looking for spatial patterns and trends. Thus, as a direction for further investigation, the system suggests taking male or female percentages separately and considering their spatial distributions for different ages. Alternatively, the system may propose that the user concentrate on studying differences between the percentages of male and female population depending on age. For the first task a series of choropleth maps would be suitable. In the second case the system would automatically calculate the differences and represent them by a bar chart map. At the next step the system may propose that the user study the spatial distributions of the differences for the age groups.
User guidance also applies to the use of dynamic manipulation facilities for data analysis. Again, the system can help the user not only with a general description of a given tool ("static" on-line help) but also with recommendations specific to the data and the analysis context. For instance, if in the course of analysis a ratio of two numeric fields was calculated and presented, the system can propose applying visual comparison with the value 1; for a difference of two fields, visual comparison with 0 is reasonable. In both cases the map will change so that the geographical objects are visually classified into 3 groups: 1) field1<field2; 2) field1=field2; 3) field1>field2. The system can also automatically detect cases when dynamic outlier removal is necessary and suggest that the user do this.
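The heuristic about reference values can be sketched as below. The names suggested_reference and compare_with_reference, the tolerance parameter and the sample values are assumptions made for this illustration; the actual visual comparison tool of Descartes works on the map display rather than on lists of names.

```python
from typing import Dict, List


def suggested_reference(derivation: str) -> float:
    """Reference value to propose for a derived attribute:
    1 for a ratio of two fields, 0 for a difference of two fields."""
    return {"ratio": 1.0, "difference": 0.0}[derivation]


def compare_with_reference(values: Dict[str, float],
                           reference: float,
                           tolerance: float = 0.0) -> Dict[str, List[str]]:
    """Split geographical objects into three groups relative to the reference.

    For ratio = field1 / field2 or difference = field1 - field2:
    'below' means field1 < field2, 'equal' means field1 = field2,
    'above' means field1 > field2.
    """
    groups: Dict[str, List[str]] = {"below": [], "equal": [], "above": []}
    for name, value in values.items():
        if abs(value - reference) <= tolerance:
            groups["equal"].append(name)
        elif value < reference:
            groups["below"].append(name)
        else:
            groups["above"].append(name)
    return groups


# Invented ratios field1 / field2 for three fictitious regions.
ratios = {"Region A": 0.8, "Region B": 1.0, "Region C": 1.3}
print(compare_with_reference(ratios, suggested_reference("ratio")))
```

A small non-zero tolerance may be preferable in practice, since calculated ratios and differences rarely hit the reference value exactly.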
It should be noted that the use of guidance is optional: the user does not have to analyze data according to the proposed scenarios. S/he always has the possibility to apply any of the available functions in any order. This is important, as we cannot guarantee that it is possible to foresee all imaginable analysis tasks. Yet, since the guidance is proposed stepwise, the scripts may prove useful for partial automation of rather sophisticated investigations.
In guiding the user the system utilizes the following kinds of knowledge:
A) Generic analysis tasks such as "Local comparisons of values of attributes", "Looking at spatial distribution of values of an attribute", "Local consideration of proportions", etc. The tasks may have applicability conditions. For example, the last of these tasks is meaningful only for a set of data fields that together constitute a meaningful whole. Unlike the generic tasks in the Vizard system, our tasks are patterns rather than simply abstract statements. The patterns have slots that are filled with appropriate domain notions when the system proposes analysis scenarios to the user.
B) Knowledge about the methods of cartographical and graphical presentation available in the system: which generic analysis tasks are enabled by each of the methods. For example, "Parallel bars" -> "Local comparisons of values of attributes"; "Choropleth map" -> "Study spatial distribution of values of an attribute"; "Scatter plot" -> "Look for relationships between two attributes". Some presentation methods offer different opportunities depending on the data they are applied to. For example, "Pie charts"/absolute quantities -> "Local consideration of proportions", "Comparison of totals"; "Pie charts"/percentages -> "Local consideration of proportions", "Comparison of proportions for pairs of geographical objects".
C) Knowledge about potentially useful operations with data: for what generic tasks they can be applied and how to perform each operation with the use of available functions. An example of such an operation is proceeding from absolute values to percentages. This operation is helpful, in particular, in the task of studying proportions (other variants of application are also possible). It is performed with the use of the calculation function of the system.
D) Knowledge about the dynamic manipulation facilities available in the system: possible ways of use depending on the analysis context. Here belong the heuristics mentioned earlier about visual comparison with 1 for calculated ratios and with 0 for calculated differences. Another example concerns applying the dynamic classification tool to investigate relationships between one attribute selected as the base of classification and some other attributes for which class statistics are calculated and displayed. A reasonable strategy is to try increasing the number of classes and moving class boundaries to probe the robustness of the demonstrated relationship, if any.
E) Knowledge about the data and the underlying problem domain. Besides supporting the selection of proper visualization methods, this knowledge makes it possible to formulate analysis tasks in a way easily understandable by the user. Thus, the generic task "Local consideration of proportions" may have the formulation "Consider proportions of age groups 0-14 years, 15-64 years, 65 and more years in population of each country of Europe" or "Consider proportions of classes of industry X, Y, ..., Z in overall industrial product of main cities of Germany", depending on the application domain. The knowledge about data is also used in the automatic application of such system functions as calculations, querying, and classification according to the pursued analysis scenario.
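Two of the kinds of knowledge listed above, task patterns with slots (A, E) and the correspondence between presentation methods and generic tasks (B), can be sketched as simple lookup tables. The dictionary layout, the slot names parts, whole and objects, and the functions formulate and methods_for are hypothetical; only the task names, method names and the example formulation are taken from the text.

```python
from typing import Dict, List, Tuple

# A) Generic tasks as patterns with slots (slot names are assumptions).
TASK_PATTERNS: Dict[str, str] = {
    "Local consideration of proportions":
        "Consider proportions of {parts} in {whole} of {objects}",
}

# B) Generic tasks enabled by (presentation method, kind of data).
# "any" marks methods whose opportunities do not depend on the data kind.
ENABLED_TASKS: Dict[Tuple[str, str], List[str]] = {
    ("Parallel bars", "any"): ["Local comparisons of values of attributes"],
    ("Choropleth map", "any"): [
        "Study spatial distribution of values of an attribute"],
    ("Scatter plot", "any"): ["Look for relationships between two attributes"],
    ("Pie charts", "absolute quantities"): [
        "Local consideration of proportions", "Comparison of totals"],
    ("Pie charts", "percentages"): [
        "Local consideration of proportions",
        "Comparison of proportions for pairs of geographical objects"],
}


def formulate(task: str, **slots: str) -> str:
    """Fill the slots of a generic task pattern with domain notions."""
    return TASK_PATTERNS[task].format(**slots)


def methods_for(task: str, data_kind: str) -> List[str]:
    """Presentation methods that enable the task for the given kind of data."""
    return [method
            for (method, kind), tasks in ENABLED_TASKS.items()
            if task in tasks and kind in ("any", data_kind)]


print(formulate(
    "Local consideration of proportions",
    parts="age groups 0-14 years, 15-64 years, 65 and more years",
    whole="population",
    objects="each country of Europe",
))
print(methods_for("Comparison of totals", "absolute quantities"))
```

Keeping the domain notions out of the patterns is what lets the same generic task be phrased for demographic data in one application and for industrial data in another.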
The utilization of these kinds of knowledge for generating guidance
proposals at different steps of the user's work may be governed by rules
with the following structure:
IF [applicability conditions] THEN
[recommendation],
where
[applicability conditions] may include one or more of the following:
a) required data characteristics and relationships;
b) characteristics of currently considered presentation;
c) currently pursued generic task;
[recommendation] may be either one or more generic tasks to proceed
to or a hint concerning the use of dynamic map manipulation facilities.
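A rule of this form could be represented, for instance, as a list of condition predicates plus a recommendation. The sketch below is one possible encoding, not the rule format of Descartes; the classes Context and Rule, the function propose and the trait labels are invented for illustration, while the sample rules restate heuristics described in Section 3.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List, Optional


@dataclass(frozen=True)
class Context:
    """Current state of the analysis (hypothetical structure)."""
    data_traits: FrozenSet[str]         # a) data characteristics and relationships
    presentation: Optional[str] = None  # b) currently considered presentation
    task: Optional[str] = None          # c) currently pursued generic task


Condition = Callable[[Context], bool]


@dataclass
class Rule:
    conditions: List[Condition]  # all must hold for the rule to fire
    recommendation: str          # generic task(s) to proceed to, or a hint

    def applies(self, ctx: Context) -> bool:
        return all(cond(ctx) for cond in self.conditions)


def propose(rules: List[Rule], ctx: Context) -> List[str]:
    """Collect the recommendations of all rules applicable in the context."""
    return [rule.recommendation for rule in rules if rule.applies(ctx)]


RULES = [
    Rule([lambda c: "ratio" in c.data_traits],
         "Hint: apply visual comparison with the value 1"),
    Rule([lambda c: "difference" in c.data_traits],
         "Hint: apply visual comparison with the value 0"),
    Rule([lambda c: "parts of a whole" in c.data_traits,
          lambda c: c.presentation == "Segmented bars",
          lambda c: c.task == "Local consideration of proportions"],
         "Task: Looking at spatial distribution of values of an attribute"),
]

ctx = Context(frozenset({"parts of a whole"}),
              presentation="Segmented bars",
              task="Local consideration of proportions")
print(propose(RULES, ctx))
```

Because each rule only recommends and never acts on its own, the user remains free to ignore the proposals, in line with the optional character of the guidance.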
In map design the system relies upon conceptual knowledge about the data under analysis. Such knowledge need not be very extensive, but for each application of Descartes a formalized description of the application domain (the relevant notions and the IS-A and PART-OF relationships among them) and of the database structure (the correspondence of database fields to domain notions) should be provided. The utilization of domain knowledge can be substantially extended. We have shown that on the basis of this knowledge the system can offer intelligent guidance to the user in the course of data analysis.
The dynamic map manipulation facilities available in the system are rather innovative, and therefore it is likely that even people experienced in the use of maps (or GIS) for data analysis will not actively use them on their own initiative. We therefore consider it necessary also to give the user apt hints concerning the use of the dynamic facilities in analysis.
Though it is impossible to guarantee interesting findings in any data, we believe that further development of the intelligent capabilities of the system will make it more helpful as an environment for visual data exploration.