ASSESSING THE IMPACT OF MINIMALLY INTELLIGENT,
COMPUTER-GENERATED, IMMEDIATE FEEDBACK
ON AN ILL-DEFINED PROBLEM SOLVING TASK



by

John William Connelly III

A.B., Princeton University, 1989

M.S., University of Pittsburgh, 1994




Submitted to the Graduate Faculty of

Arts and Sciences in partial fulfillment

of the requirements for the degree of

Doctor of Philosophy




University of Pittsburgh

2001



UNIVERSITY OF PITTSBURGH
____________

FACULTY OF ARTS AND SCIENCES



This dissertation was presented

by

John William Connelly III
________________________________



It was defended on

August 14, 2001
________________________________



and approved by


Robert Glaser, Ph.D.        
________________________________


Jonathan W. Schooler, Ph.D. 
________________________________


Daniel D. Suthers, Ph.D.    
________________________________





Alan M. Lesgold, Ph.D.      
________________________________
Committee chairperson           



Copyright by John William Connelly III
2001



Alan M. Lesgold




ASSESSING THE IMPACT OF MINIMALLY INTELLIGENT,
COMPUTER-GENERATED, IMMEDIATE FEEDBACK
ON AN ILL-DEFINED PROBLEM SOLVING TASK

John William Connelly III, Ph.D.

University of Pittsburgh, 2001



Computerized learning environments successfully support problem solving in many well-defined domains. Many such environments provide intelligent, immediate feedback whenever problem solvers commit errors that impede their progress toward a well-defined goal. The present research seeks to apply immediate feedback principles to ill-defined problem solving with Belvedere, a graphical environment to support scientific inquiry. Belvedere's normally on-demand help facility (the Coach) was modified to present immediate, intrusive feedback on basic inquiry principles while users created diagrams showing evidential relations among hypotheses and data. I predicted positive effects of immediate coaching on performance measures such as diagram completeness, diagramming errors, and information search measures, but negative effects on student attitudes toward Belvedere and its Coach. Undergraduate students in Experiment 1 used Belvedere with either intrusive coaching or no coaching. Performance measures showed mainly nonsignificant trends, most of which were in the predicted direction. Attitude measures showed more significant predicted differences. Latency measures suggested that students may not have attended to all of the immediate coaching feedback. Students in Experiment 2 used Belvedere with either intrusive coaching or standard on-demand coaching plus reminder prompts. Results were similar to Experiment 1 on both performance and attitude measures, with more marked group differences on the latter. Latency analyses suggested that students spent significantly more time processing requested coaching than intrusive coaching. Limitations of the Coach and of the problem solving task may explain the relatively few significant differences in performance measures. Results suggest that the affective costs of immediate feedback may outweigh its performance benefits for ill-defined problem solving, although this suggestion may be specific to the experimental task or to the relatively unsophisticated feedback currently provided by Belvedere's Coach.



FOREWORD


As I reflect upon my graduate schooling, it occurs to me that I could probably write a second dissertation in thanking all of those who made it possible for me to finish this one. I will, of course, refrain from doing so; however, I would be remiss if I made only a blanket statement of gratitude.

First and foremost, I wish to acknowledge the members of my Doctoral Committee, not only collectively but also individually. I owe thanks to Bob Glaser, for his significant contributions to the literature that motivated my research and for serving also on the committee overseeing my doctoral comprehensive exam, from which my research ideas were drawn; to Dan Suthers, for his tireless work on Belvedere while a colleague at LRDC, for his assistance in conceptualizing and implementing my pilot experiments, and for agreeing to come back here from afar to serve on my committee; to Jonathan Schooler, not only for his efforts in enabling me to finish my degree but also for serving on my master's committee as well, and for being the quintessential role model of an active and inquisitive cognitive researcher; and last but by no means least, to Alan Lesgold, for remaining my primary advisor throughout my entire graduate-school tenure even after he assumed new responsibilities, and for not giving up on me even after I felt I had overstayed my welcome.

In addition to Alan and Dan, I wish to thank the other former members of the Argumentation Group at LRDC throughout its various incarnations. Special thanks go to Arlene Weiner and Eva Toth for compiling our web-based materials and for sharing earlier experiences with Belvedere "in the trenches"; to Sandy Katz for her ideas on coaching and for our discussions about earlier conceptions of intrusive advice-giving; to Violetta Cavalli-Sforza, whose dissertation research with Belvedere informed my own; and to the many others responsible for developing Belvedere and its Coach, including Massimo Paolucci, Kim Harrigal, Joe Toth, and especially Dan Jones, who worked well beyond the call of duty to help me resurrect and maintain the hardware and software needed to conduct my research.

Special thanks are due also to Squeak, for making me try to keep things in perspective; to Jenn Gross, for her 11th-hour proofreading assistance and for just being there; to R.B., for literally saving my life as I was preparing to enter the home stretch, and to R.B. Junior for getting me home; to my unlicensed "therapists" along the way (Babe, Rosie, & Mary; Dave & Joey; and Matt & John); to Carlton Hicks for providing the mantras that helped me get through this; to the other few, proud comrades I have had the privilege of calling sum potase; and finally, to the mighty Excalibur, for serenading me through yet a few more all-nighters.




Introduction

The advent of increasingly powerful and inexpensive computer hardware in the late 20th century has enabled the development (and, in some cases, deployment) of advanced instructional software tools based on the principles of artificial intelligence (AI). Such tools have appeared over the last few decades under various labels, including: intelligent computer-assisted instruction (ICAI) systems; intelligent tutoring systems (ITSs); microworlds or discovery worlds; coached apprenticeship systems; reactive learning environments; and, more broadly, intelligent learning environments (ILEs). Generally speaking, each software system was designed to foster or support some form of human problem solving in a particular domain by using particular support strategies of varying levels of domain-specificity, although the pedagogical approaches used by these tools are as diverse as their labels may suggest (Connelly & Lesgold, 1999).

Different theories of learning or instruction underlie the various software systems, from production-system models of individual instruction (e.g., Anderson, Boyle, & Reiser, 1985) to theories of cognitive apprenticeship and situated cognition, often involving groups (e.g., Brown, Collins, & Duguid, 1989). The different theories have led to variations in many aspects of system design, from expert and student modeling components to user interfaces (Clancey, 1986; De Corte, 1996; Polson & Richardson, 1988; Reusser, 1996; Wenger, 1987). Among the most salient differences between the various pedagogical approaches are the content, amount, timing, and control (i.e., user- vs. system-initiated) of the feedback delivered by the system's intelligent agent(s) to users engaged in problem solving activities with the system.

The research described herein was motivated by a literature review comparing various approaches to automated feedback delivery across several different problem solving domains (Connelly, 1997; see also Connelly & Lesgold, 1999). That review, which was drawn from a cross-sectional survey of empirically evaluated ILEs for which evaluation results were readily available, indicated that the majority of existing systems supported well-defined problem solving tasks and domains (see also Seidel & Park, 1994). The present research explores whether certain feedback-delivery strategies that have proven successful in well-defined problem domains (i.e., domains in which problems have a single or a finite set of correct answers, such as mathematics, physics, or computer programming) might be extended into a more ill-defined problem solving context where there are no generally accepted "right answers" (e.g., Voss & Post, 1988).

Specifically, I chose to manipulate and evaluate the impact of the feedback provided by Belvedere, a software system designed to foster argumentation and inquiry skills in users trying to solve ill-defined scientific problems (Suthers et al., in press; Suthers, Weiner, Connelly, & Paolucci, 1995). Belvedere enables users to construct on-screen diagrams representing the relationships between hypotheses and evidence for any number of open-ended scientific debates. The Belvedere system includes an online Coach that continually analyzes the evolving argument diagrams in terms of general principles of scientific inquiry. The Coach is capable of providing feedback to users on demand, in the form of hints or suggestions, to help guide their ongoing inquiry and diagram construction activities. My research in this dissertation focuses on the extent to which feedback from the Coach appears to help Belvedere's users during problem solving, and on whether certain feedback variations used in well-defined domains may also work for the types of domains and ill-defined problem solving skills Belvedere was designed to support.

I begin this dissertation with an overview of some customary approaches to providing system-generated feedback. I then focus and elaborate on one of them, the immediate feedback approach, which is used by many successful systems that foster problem solving in well-defined domains. After describing how one might incorporate immediate feedback into a design approach that is more conducive to fostering problem solving in ill-defined domains, I describe Belvedere and its feedback characteristics in more detail, including findings from some formative evaluations of the system. I then outline my experiments, addressing general issues pertaining to system evaluation where relevant (Legree, Gillis, & Orey, 1993; Mark & Greer, 1993; Shute & Regian, 1993; Twidale, 1993), and discuss my experimental measures and predictions. After some brief technical details, I report each experiment in turn, followed by a general discussion of findings.

Approaches to Automated Feedback

Many intelligent instructional software tools are complex systems consisting of several different components or modules. Most ITSs, for example, comprise four main components: (a) a domain knowledge or expert module, which contains the target knowledge of the domain that the system is designed to teach; (b) a student model, which assesses a student's emerging knowledge or competence in the target domain by using diagnostic techniques such as model tracing (matching a student's solution steps to those of an expert problem solving model; Merrill, Reiser, Ranney, & Trafton, 1992); (c) a tutoring or pedagogical module, which structures the interaction between the system and the user, deciding at various points in the interaction which task material to present and what kind of feedback to provide, if any; and (d) a user interface, which serves as the means by which the user and the system communicate (Polson & Richardson, 1988; Reusser, 1996). Although the various system components are usually interrelated in function and often in features (Katz & Lesgold, 1993), feedback-delivery decisions are generally coordinated by the pedagogical component, the relative effects of which tend to vary with the specifications of its underlying instructional approach (Connelly & Lesgold, 1999).

Pedagogical styles can differ along such non-orthogonal dimensions as guided versus unguided (e.g., model tracing vs. discovery learning), tutoring versus coaching, and student-directed versus system-directed (De Corte, 1996; Reusser, 1996). Put another way, some pedagogical approaches are more directive (e.g., Anderson et al., 1985), some are noninterventionist (e.g., De Corte, 1996), and some are in between, such as the cognitive apprenticeship teaching methods of scaffolding and fading (Brown et al., 1989; Collins, 1996). Thus, feedback from a system can play any number of roles, from corrective to regulative to informative (Fischer & Mandl, 1988; Wenger, 1987), and it may differ in relative amount and timing (e.g., immediate vs. delayed vs. on demand; Collins, 1996; Kulik & Kulik, 1988; Merrill et al., 1992; Schooler & Anderson, 1990). At one extreme are systems that provide detailed feedback at many points during interactive sessions with their users, while systems at the other extreme may provide no explicit feedback at all, in some cases providing implicit pedagogical support through various interface features instead (Merrill et al., 1992; Twidale, 1993). Some systems may even fade or otherwise vary their default feedback delivery strategies or methods during the course of an interactive session with a user (e.g., Chu, Mitchell, & Jones, 1995; Shute & Glaser, 1990; VanLehn, 1996; VanLehn et al., 2000).
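The timing dimension described above (immediate vs. delayed vs. on demand) can be made concrete with a minimal Python sketch. It is an illustration only, not a description of any cited system: the `Timing` enum, the `deliver` function, and the event format are hypothetical names introduced here, and the sketch assumes that "delayed" means "held until the problem is finished."

```python
from enum import Enum


class Timing(Enum):
    IMMEDIATE = "immediate"   # shown as soon as an issue is detected
    DELAYED = "delayed"       # held until the problem is finished (assumption)
    ON_DEMAND = "on_demand"   # held until the user asks for help


def deliver(events, timing):
    """Return (event_index, message) pairs showing when each message
    would be displayed under the given timing policy.

    events: list of ("issue", text), ("help", None), or ("done", None).
    """
    shown, pending = [], []
    for i, (kind, text) in enumerate(events):
        if kind == "issue":
            if timing is Timing.IMMEDIATE:
                shown.append((i, text))
            else:
                pending.append(text)  # hold the message for later
        elif (kind == "help" and timing is Timing.ON_DEMAND) or (
            kind == "done" and timing is Timing.DELAYED
        ):
            # Release everything held so far, then empty the queue.
            shown.extend((i, msg) for msg in pending)
            pending.clear()
    return shown
```

Under the on-demand policy in this sketch, any issue detected after the user's last help request is never shown, which is one way a system at the low-feedback extreme can end up providing little explicit feedback.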

Given the range of disciplines and tasks for which ILEs have been built, it is difficult to compare approaches to ILE design and delivery of system feedback without accounting for the types of skills they support. Clancey (1986) describes how problem solving operators and inference procedures differ between formal, closed domains such as mathematics and natural, open domains such as medical diagnosis. McKendree (1990) suggests that more complex or ambiguous tasks may require a greater degree of informative feedback than more constrained ones, for which more directive feedback often suffices. Others in the field have described the process of learning from an ILE as a four-way interaction of learner style, desired knowledge outcome, type of instructional environment, and subject matter (Shute & Glaser, 1990). After reviewing several different systems that were designed to support problem solving in a variety of domains, we noted that "it is difficult to identify any domain-specific effects of, or any clear preferences between, the various approaches to providing feedback" (Connelly & Lesgold, 1999, p. 539). However, we believed that to be due partly to the overrepresentation of well-defined domains in the literature. Problems in such domains have a constrained set of correct answers, making them amenable to expert and student modeling. By contrast, problems in more ill-defined domains are usually not as clear-cut, with multiple solutions (as well as multiple paths leading to those solutions) that can be reached only by using rough heuristics rather than algorithms (Voss & Post, 1988). For these reasons expert and student modeling are often intractable, giving the pedagogical component of an ILE for such a domain less to work with in deciding what feedback to present to users. A question posed by the present research is to what extent feedback approaches that have proven beneficial in many well-defined domains can also be of help in a more ill-defined domain. I turn now to one such approach.

Immediate Feedback

A major issue in the design of interactive learning environments is that of deciding when a system should provide feedback to its user(s). Many successful systems provide corrective feedback immediately after their users make any mistakes. For example, most model tracing tutors work by generating feedback any time a student's solution path deviates from a path that will lead to a correct answer (Merrill et al., 1992). One reason for preferring this approach is to ensure that feedback is delivered in the context in which it is needed: that of the student's current goal and working memory states (Anderson, Corbett, Koedinger, & Pelletier, 1995; Corbett & Anderson, 1992). Another reason to provide corrective feedback immediately is to prevent students from floundering while trying to recover from lengthy incorrect solution paths (Anderson, Boyle, Farrell, & Reiser, 1984; Corbett & Anderson, 1992; Gertner & VanLehn, 2000; McKendree, 1990).
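The model-tracing behavior described above can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that the expert model is an enumerated set of acceptable step sequences for solving 2x + 3 = 7; the names `EXPERT_PATHS`, `trace_step`, and `run_session` are hypothetical and are not drawn from any of the cited tutors, which rely on production-rule models rather than enumerated paths.

```python
# Hypothetical expert model: each tuple is one acceptable solution path.
EXPERT_PATHS = [
    ("subtract 3 from both sides", "divide both sides by 2"),
    ("divide both sides by 2", "subtract 1.5 from both sides"),
]


def trace_step(steps_so_far, new_step):
    """Return None if new_step keeps the student on some expert path;
    otherwise return an immediate corrective message."""
    attempted = tuple(steps_so_far) + (new_step,)
    for path in EXPERT_PATHS:
        if path[:len(attempted)] == attempted:
            return None  # still on a path that leads to a correct answer
    return f"'{new_step}' does not lead toward a solution; try another step."


def run_session(student_steps):
    """Feed steps one at a time; off-path steps are flagged immediately
    and are not added to the accepted solution path."""
    accepted, messages = [], []
    for step in student_steps:
        feedback = trace_step(accepted, step)
        if feedback is None:
            accepted.append(step)
        else:
            messages.append(feedback)  # delivered at the moment of the error
    return accepted, messages
```

Because each off-path step is rejected as soon as it is entered, the student in this sketch can never travel far down an incorrect path, which is the floundering-prevention rationale cited above. The sketch also shows why the approach depends on a domain in which the acceptable paths can be specified in advance.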

Although used to varying extents in some operative skill tutors (Chu et al., 1995; Legree et al., 1993) and in limited ways in an economics microworld (Shute & Glaser, 1990), immediate feedback approaches dominate some physics tutors (e.g., Gertner & VanLehn, 2000; VanLehn, 1996) and many of the tutors for programming, geometry, and mathematics, including all of the tutors based upon the ACT* theory of cognitive skill acquisition (e.g., Anderson & Reiser, 1985; Anderson et al., 1985; Koedinger & Anderson, 1993b; Koedinger, Anderson, Hadley, & Mark, 1995; McKendree, 1990). Indeed, the ACT* commitment to providing immediate feedback in its tutors is one of the theory's most controversial features (Anderson et al., 1995; Corbett & Anderson, 1992). Although the revised ACT-R theory and its newer tutorial instantiations permit off-path problem solving (Anderson et al., 1995), they still focus students toward correct solution paths, and immediate feedback still plays a major role in the interaction.[1]

However, research has shown immediate feedback to be disadvantageous in certain situations and with particular tasks (Kulik & Kulik, 1988; VanLehn et al., 2000). In one experiment using a modified version of the ACT* group's famous LISP Tutor for programming (Anderson & Reiser, 1985), students who received immediate feedback solved training problems faster than students who received delayed feedback but, when solving test problems, took more time and made more errors than the delayed-feedback students (Schooler & Anderson, 1990). In addition, delayed-feedback students seemed to be better at planning problem solutions than were immediate-feedback students. The authors argued that the absence of immediate feedback in the delayed condition allowed students to redeploy their working memory resources toward developing secondary skills such as error detection and correction. A study comparing two versions of the GIL tutor (Graphical Instruction in LISP; Reiser et al., 1988) provides further evidence of this: Students who did not receive GIL's immediate model-tracing feedback scored better on a transfer test of program debugging skills than those who did (cited in Merrill et al., 1992).

With even more complex tasks, feedback may be best left for post-problem reflection, when working memory resources are no longer being taxed by immediate problem-solving demands (Lesgold, 1994a; Sweller, 1988). Sherlock II, an ILE for training a complex avionics troubleshooting task, has facilities to support reflective follow-up after problem solving, including goal-related presentations such as intelligent replays of problem solving steps, critiques of those steps, and information about what an expert might have done (Lesgold, 1994b). These capabilities were added to help compensate for the learning opportunities that are precluded by the high cognitive effort expended during problem solving (Lesgold, 1994a; Lesgold, Katz, Greenberg, Hughes, & Eggan, 1992), as well as to coach situations in which students were able to solve the problems but did so in a non-optimal way (Gott, Lesgold, & Kane, 1997).

In short, the value of immediate feedback seems to vary with not only the task but also the desired learning outcomes of the intervention. Nevertheless, for many systems that support well-defined problem solving, the immediate delivery of feedback has proven beneficial in fostering user attainment of the primary skills that the system was designed to teach, based on various achievement measures. Such measures from within the laboratory include shorter time to problem completion, fewer errors committed, less time needed to correct errors, and in some cases higher post-test scores, than appropriate controls (Anderson et al., 1985; Connelly, 1989; Corbett & Anderson, 1992; McKendree, 1990). In some studies laboratory measures were supplemented by classroom achievement measures, ranging from better exam scores and course grades to higher levels of participation than controls (Koedinger et al., 1995; Schofield, Evans-Rhodes, & Huber, 1990; Wertheimer, 1990). Based on system evaluations using these measures, it is commonly accepted that many of these systems "have an enviable track record" (VanLehn et al., 2000, p. 475).

Feedback at What Cost?

Costs to the User

Even when achievement measures indicate clear benefits of providing immediate feedback, we must also consider the potential costs to the user of providing that feedback. Do users of intelligent learning environments desire immediate feedback, even if it is ultimately helpful? In a series of experiments manipulating feedback delivery by the LISP Tutor (Corbett & Anderson, 1990, 1992), students rarely requested immediate feedback from the ITS; most of them wanted feedback only when they were finished coding a problem (see also Anderson et al., 1995). Students using the PACT Geometry Tutor have shown a similar resistance to immediate online help (Aleven & Koedinger, 2000). Although there are many possible reasons for this, it is clear that "developers must assess not only the effectiveness of a system but also the likelihood that it will be fully accepted into the culture or domain of the target audience" (Connelly & Lesgold, 1999, p. 531), in order to avoid the danger of having the costs of interacting with an ILE outweigh its benefits for the users (see also Mark & Greer, 1993).

A list of ILE evaluation criteria by Mark and Greer (1993) focuses on measures of both achievement and affect. Affective measures include student motivation, self-esteem measures, attitude measures, and time on task. While most achievement measures are objective, many affective ones are subjective and usually gathered by questionnaire. Thus, affective measures may be open to interpretation and are best used as supplements to the more tangible achievement measures, especially because the two may not always correlate (Corbett & Anderson, 1990; Mark & Greer, 1993). For example, in an evaluation study of the RAND Algebra Tutor (Stasz, Ormseth, McArthur, & Robyn, 1989), despite some modest increases in achievement scores many students felt that the tutor did not help them learn algebra. Moreover, the few students who reported thinking that the tutor did help them learn algebra actually received lower course grades. Therefore, in an effort to obtain a broader sense of the relative costs and benefits to the user of interacting with Belvedere, the research I describe herein is motivated by ILE evaluation studies in which attitude measures supplement more objective performance measures (e.g., Burton & Brown, 1982; Chu et al., 1995; Corbett & Anderson, 1990; Fix & Wiedenbeck, 1996; Reusser, 1996; Stasz et al., 1989; Wan & Johnson, 1994).

Costs to the Developer

While weighing the potential costs and benefits to the user of providing automated feedback, often system developers must also assess the costs to themselves relative to both potential and actual benefits for the user (Suthers et al., in press). The engineering of sufficient domain knowledge for a computerized tutor to solve problems and diagnose student errors can be quite an expensive proposition for system developers, although in some domains the benefits can justify the costs. As stated earlier, most systems that employ immediate feedback do so in the context of problems with a finite set of correct answers (Connelly, 1997; Connelly & Lesgold, 1999), and among these systems, those that do model tracing generate feedback based on deviations of a student's solution path from a path that will lead to a correct answer (Merrill et al., 1992). Often the amount of knowledge needed to solve such problems is constrained enough that all of it can be represented by the system, allowing it to infer the genesis of a student's incorrect solution path. However, as problems become less constrained and more ill-defined, and both the knowledge and the cognitive skills needed to solve the problems become more complex, the prospects for knowledge engineering and student modeling become more unwieldy (Aleven & Ashley, 1995). When the knowledge bases required to tackle such tasks are too large for model tracing to be a viable option, on what should immediate feedback be based?

A Solution: Balancing Costs and Benefits

The main issue in trying to realize the benefits of immediate feedback in an ill-defined problem solving domain is: For ill-defined problems such as scientific inquiry tasks, for which there are usually no right answers and in which the amount of knowledge that must be brought to bear may be too large to be represented in the system, on what basis and to what extent can immediate feedback be of any use? Obviously, feedback on open-ended scientific problems that lack any "correct" solutions cannot be based on student deviations from an ideal solution path. Additionally, an ILE that lacks complete knowledge of a problem domain would obviously be limited in its ability to provide domain-specific feedback. However, to the extent that particular solution processes or components of solutions applicable across different domains can be identified as ideal or correct, and to the extent that those general processes can be represented by the system, feedback could be delivered on that basis (Aleven & Ashley, 1995; Conati & VanLehn, 1999). Naturally, the representation of detailed domain knowledge in the system would enable it to tailor such feedback to the specific problem at hand, when appropriate. However, for a system user struggling with a complex argumentation task, timely feedback about even domain-general solution processes or components, based on general principles of scientific inquiry and evidential reasoning, could still be helpful.

We describe elsewhere (Suthers et al., in press) an approach to ILE design that we characterize as "minimalist" AI and education, a way of applying basic AI principles to ILEs while circumventing the aforementioned knowledge representation and student modeling problems. Instead of attempting to build relatively complete knowledge representations, reasoning capabilities, or pedagogical agent functionality characteristic of model tracing tutors, this alternative approach provides ILEs with minimal abilities to respond (in a manner believed to be pedagogically relevant) to selected components of student activities and constructions, such as their basic syntax or some other categorical, easily discernible features. The feedback provided by a minimalist approach may be characterized as "state-based" rather than "knowledge-based" (Nathan, 1988): The software helps students recognize important features of their problem solving state, leaving most of the burden of knowledge representation and management in the hands of the students. This is the approach taken by the developers of Belvedere, which delivers feedback via an online Coach "that can provide reasonable advice with no domain specific knowledge engineering" (Suthers et al., in press). More specifically, Belvedere represents an incremental design approach that seeks to determine the value of low-cost, domain-general feedback alternatives before trying to assess the potential value added by more expensive domain-specific knowledge[2] to the system, an important consideration when working with broad, ill-defined problem domains that are not conducive to model tracing. I turn now to a brief overview of Belvedere.

Belvedere

Overview

Belvedere (Suthers & Jones, 1997; Suthers, Toth, & Weiner, 1997; Suthers & Weiner, 1995; Suthers et al., 1995) is a networked graphical environment designed to foster scientific argumentation skills in students of middle-school age and older. Students use Belvedere's on-screen node and link primitives (e.g., Hypothesis, Data, For, Against) to construct graphical argument representations of ill-defined problems in scientific domains, either individually or collaboratively over a network. These problems are presented as inquiry exercises in which students are asked to seek out and map the relationships between relevant hypotheses and evidence, using actual unsolved scientific "mysteries" as domain content. Problems can come from any source, although Belvedere's developers have created specialized, self-contained hypertext databases about several scientific debates, which are accessible via standard web browsers[3] (see Appendix A for an illustrated excerpt of a typical problem-solving session using Belvedere and Netscape). Belvedere's graphical interface was designed to resemble that of familiar computer drawing programs, so that students can learn to create argument diagrams with only minimal training. With a web browser and Belvedere's diagram and minimal chat[4] facilities running concurrently, Belvedere enables students to synchronously discuss and reflect upon their argumentation processes and products while exploring alternative answers to the given problems.

Coaching

In the standard Belvedere system, a computerized Coach is available on demand to provide guidance for developing argument diagrams (Paolucci, Suthers, & Weiner, 1996). When asked for advice, Belvedere's Coach presents feedback, in the form of suggestions or questions in a dialog box, about the current state of the evolving diagram. Primarily the Coach looks for possible deviations of a user's diagram constructs from those that represent the good argumentation and inquiry practices embodied in its pattern-matching rules. For example, if a user appears to be succumbing to confirmation bias (i.e., inclusion of evidence in favor of a hypothesis but no evidence against it), the Coach will suggest that disconfirming evidence be considered as well. Belvedere does no student modeling or diagnosis (e.g., Clancey, 1986; VanLehn, 1988a); it bases its coaching advice on the structural features of diagrams alone. Belvedere's Coach generates advice by applying its 20 syntactic rules to features of the current diagram after each incremental change to it. Appendix B shows the textual message contents of the advice associated with each coaching rule. The Coach provides advice about abstracted patterns of relationships among statements, but it does not address the specific contents of these statements. Its strengths are in its potential for pointing out principles of scientific inquiry in the context of students' own evidential reasoning, and its generality and applicability to new topics or domains with no additional knowledge engineering. These are the qualities that make Belvedere's feedback state-based as opposed to knowledge-based (Nathan, 1988).
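The kind of structural pattern matching described above can be sketched in Python. This is a hedged illustration under stated assumptions, not Belvedere's actual implementation: the `Diagram` class, the `confirmation_bias_rule` and `coach` functions, the advice wording, and the choice to store each link as an (evidence, link type, statement) triple are all hypothetical. Only the statement and link type names (Hypothesis, Data, For, Against) come from the text.

```python
from dataclasses import dataclass, field


@dataclass
class Diagram:
    # node id -> statement type, e.g. "Hypothesis" or "Data"
    nodes: dict = field(default_factory=dict)
    # (source id, link type, target id); link type is "For" or "Against"
    links: list = field(default_factory=list)


def confirmation_bias_rule(diagram):
    """Flag hypotheses that have supporting links but no opposing ones.
    Only diagram structure is inspected; statement text is never read."""
    advice = []
    for node_id, kind in diagram.nodes.items():
        if kind != "Hypothesis":
            continue
        incoming = {link for (_, link, target) in diagram.links
                    if target == node_id}
        if "For" in incoming and "Against" not in incoming:
            advice.append(
                f"Hypothesis {node_id} has only supporting evidence; "
                "consider looking for evidence against it."
            )
    return advice


# Illustrative rule list; the text describes a Coach with 20 such rules.
RULES = [confirmation_bias_rule]


def coach(diagram):
    """Apply every rule to the current diagram state and collect advice."""
    return [message for rule in RULES for message in rule(diagram)]
```

Because the rule reads only node and link types, the same check would apply unchanged to a diagram about any scientific topic, which is the sense in which such feedback is state-based rather than knowledge-based. In this sketch, calling `coach` only when the user asks corresponds to on-demand coaching, whereas calling it after every diagram change and displaying the result would correspond to an immediate, intrusive variant.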

Although the knowledge-blind characteristic of the evidence pattern Coach allows it to work effectively for problems in virtually any scientific domain, it does have one potential drawback: The advice it presents may be irrelevant if the pertinent node primitives are used incorrectly (see also Wan & Johnson, 1994). Examples of such incorrect usage from sessions with an older version of Belvedere include typing a hypothesis into a Theory statement box or drawing an Explains link in one direction when a Supports link in the opposite direction would be more appropriate. We have considered redesigning Belvedere to enforce "correct" usage of primitives via immediate feedback and of coherent argument patterns via delayed feedback (Suthers et al., 1995). However, given the Belvedere project's overall focus on supporting collaborative discussion, such interventionist measures remain unimplemented in the standard Belvedere system (Suthers & Weiner, 1995). Instead, we minimized potential occurrences of this problem by significantly reducing the number of primitives in Belvedere's diagramming palette (cf. Cavalli-Sforza, 1998), including the removal of directional links. With only three types of statements and three types of links from which to choose in the version I used, there are no errors of subtle distinction that would affect coaching relevance. Although coaching-relevant usage errors are still possible (e.g., labeling evidence as a Hypothesis or using a For link instead of Against to show a negative relationship between statements), now they are far less likely to occur.[5]

Early Formative Evaluations

Several formative evaluation studies of an earlier incarnation of Belvedere were conducted with middle- and high-school students (Suthers et al., 1995). The first was a laboratory study in which single students worked on a scientific problem. In the second study, some of the same students came back to work on a different problem, in pairs, at a single computer. Without prompting from the experimenters, students would divide the labor between themselves; one would control the mouse, while the other would use the keyboard. This often led to censorship, favoring the student who controlled the keyboard (cf. Wertheimer, 1990). A third laboratory study and a subsequent school study had dyads work together on a problem from separate computers. These students had their own input devices, alleviating (but not completely eliminating) the censorship problem (Suthers & Weiner, 1995), and their monitors were situated such that students could point to each other's screens while discussing their shared diagrams.

Most students required little or no assistance from the experimenters to begin using Belvedere. They varied in their willingness to add information by typing it themselves, many preferring to copy text from the online databases provided by the developers. Students used the older Belvedere's many node and link primitives in ways that were inconsistent both with their intended usage and with their own and other students' usage. Although Belvedere's developers concluded that such unintended usage actually served to stimulate collaborative discussions (Suthers & Weiner, 1995), such usage caused some problems for the older Belvedere's automated Coach, motivating the aforementioned reduction in number of primitives.

The redesigned Belvedere was made available in Department of Defense dependent school (DoDDS) classrooms in Germany and Italy, in part to empirically evaluate its Coach (Suthers et al., in press). Data available to us from DoDDS in the form of limited personal observations, third party observations, videotapes, and computer logs indicate that (a) the on-demand Coach was almost never invoked; (b) there were situations where students did not know what to do next in which the Coach would have helped if it had been invoked; and (c) the Coach's advice and its relevance to the students' activities were sometimes ignored as if not understood. Items (a) and (b) indicate that, in spite of the developers' initial reluctance to interfere with students' deliberations, unsolicited advice is sometimes needed (see also Aleven & Koedinger, 2000; VanLehn et al., 2000).

Intrusive Coaching

Early attempts to address the issue of unsolicited advice involved the use of a "minimally intrusive" Coach, such that the Belvedere menu icon that is used to invoke the Coach (a light bulb) would slowly blink on and off when the Coach had something important to say (Suthers & Jones, 1997). Some of the coaching rules (6 of 20) were deemed important enough to warrant such an immediate interruption. Examples include the rules that look for confirmation bias and the need for discriminating evidence between two hypotheses (see Appendix B). However, in early usability studies of the so-modified Belvedere, very few students reported ever noticing the blinking icon. Therefore, for the current research the Coach was further modified to immediately interject advice (i.e., without waiting for the user to ask for it). Furthermore, rather than restrict intrusive advice to the subset of rules for which the minimally intrusive Coach would have blinked, the new intrusive Coach presents advice whenever it has anything appropriate to say based on any of its rules.

Because each coaching pattern-matching rule responds to different aspects of the diagram state, only a subset of the 20 rules will apply to a user's diagram at any given time. In order to ensure that the Coach does not present newly applicable advice that could be rendered irrelevant by the user's next diagramming action, each rule has an associated delay factor. This delay factor, defined as the number of subsequent diagramming actions through which the rule must apply (ranging from 0 to 4), governs how long the Coach will wait before it will consider presenting the applicable advice. Furthermore, many rules have assigned priority levels that reflect the relative importance of their associated advice. Advice selection by the Coach is performed by a preference-based quick-sort algorithm, following a mechanism used by Suthers (1993) for selecting between alternate explanations. Preferences take into account factors such as prior advice already given, recency of the relevant diagram changes, and various categorical attributes of the applicable advice (Suthers et al., in press). The result of the algorithm is a sorted list of advice rules that apply (after any applicable delay factors) to the current diagram state. When such a list exists, the modified intrusive Coach will provide immediate feedback to the user by automatically presenting the advice at the top of the list.
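The delay and preference mechanism described above can be illustrated with a minimal sketch. It is not the Coach's actual LISP implementation: the class and function names, the numeric priority values, and the reduction of the preference ordering to two factors (advice already given, then priority) are all assumptions made for illustration, and Python's built-in sort stands in for the preference-based quick-sort.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """Hypothetical stand-in for one coaching rule."""
    name: str
    delay: int         # subsequent actions through which the rule must keep applying (0 to 4)
    priority: int = 0  # assumed convention: larger means more important


def update_streaks(streaks, matching_names):
    """Return new streak counts after one diagramming action.

    A rule that has just started to match gets 0, a rule that keeps
    matching is incremented, and a rule that no longer matches is dropped.
    """
    return {
        name: (streaks[name] + 1) if name in streaks else 0
        for name in matching_names
    }


def rank_advice(rules, streaks, already_given):
    """Sort the rules whose delay has elapsed, most preferred first."""
    eligible = [
        r for r in rules
        if r.name in streaks and streaks[r.name] >= r.delay
    ]
    # Assumed preferences: advice not yet given comes first, then higher priority.
    # The rule name is only a tie-breaker that keeps the order deterministic.
    return sorted(
        eligible,
        key=lambda r: (r.name in already_given, -r.priority, r.name),
    )


def intrusive_advice(rules, streaks, already_given):
    """Return the rule at the top of the sorted list, or None if the list is empty."""
    ranked = rank_advice(rules, streaks, already_given)
    return ranked[0] if ranked else None
```

In this sketch, a rule with a delay factor of 0 becomes eligible on the action that first triggers it, whereas a rule with a delay factor of 2 is held back until it has continued to apply through two further diagramming actions.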

Assessing the Effects of Belvedere's Coaching

Because Belvedere's standard Coach delivers advice only on demand, and because the data we have on hand (previously discussed in the Early Formative Evaluations section) show that the on-demand Coach is rarely invoked, it has been difficult for us to obtain an objective notion of its effectiveness during sessions with target users. Coaching has been a focus of at least two studies conducted during Belvedere's development: one involving the coaching of students by actual human domain experts using Belvedere's chat facility (Katz & Suthers, 1998), and another involving offline comparisons of consistency relations in student diagrams to those of an expert, using a prototype extension of the automated Coach (Paolucci et al., 1996). However, neither study involved online user interactions with the automated Coach. The Coach was apparently used to some extent in a set of external evaluation studies conducted in Europe (Veerman, 2000), but the focus of these studies was on collaboration of dyads and small groups using Belvedere and on the nature of chat discussions between the collaborators, with only passing mention of the automated coaching. Another study conducted in the overseas DoDDS classrooms (Toth, Suthers, & Lesgold, in press) examined the effects of different representations (Belvedere vs. text) and of the users' reflective assessments of these representations. Assessments were based on rubrics that codified evaluation criteria, much as such criteria are embodied within Belvedere's coaching rules. However, for this study Belvedere's automated Coach was disabled because there was no counterpart available for the text-based conditions. In short, we still lack a clear empirical picture of the Coach's effectiveness within the overall Belvedere framework, and this deficit was a motivation for the present research.

Two Experiments

A driving purpose of my dissertation research was to investigate the effects of automated coaching on user performance with Belvedere. Specifically, in an effort to extend immediate feedback principles into the ill-defined problem solving supported by Belvedere, I chose to manipulate two aspects of Belvedere's feedback delivery component (its Coach), and to measure the impact of its feedback on the problem solving behaviors of individual students working with Belvedere. I conceived of two experiments that hold constant every other aspect of the Belvedere software system, manipulating only the presence and the timing/control of coaching feedback, respectively.

Experiment 1. In an attempt to isolate the effects of coaching in the face of infrequent user requests for it, my first experiment compares students using Belvedere with immediate, intrusive coaching (as described earlier in the Intrusive Coaching section) to students using Belvedere without any coaching at all. By enforcing frequent feedback delivery to one group and denying it to the other, and then comparing users' problem-solving activities between groups, my goal was to isolate the gross, overall effects of immediate coaching feedback on user interaction with Belvedere.

In addition to investigating the role of immediate feedback, this first experiment also serves as an additive design manipulation (Legree et al., 1993), to help determine the value added by automated coaching to the overall Belvedere environment. Other researchers have performed the same manipulation by investigating the effects of removing tutorial feedback entirely from their systems (e.g., comparing versions of an ILE with and without feedback). An informal evaluation of the WEST system, which is a "guided discovery learning" environment (Burton & Brown, 1982, p. 80) for mastering the arithmetic strategies needed to play the simple board game How the West was Won, showed that students who used the system with coaching gained a broader understanding of the different moves in the game, as well as more favorable attitudes toward the game, than did those who used a version without coaching (Burton & Brown, 1982). The aforementioned study using the GIL programming tutor (cited in Merrill et al., 1992) compared students using the standard version of GIL to those using an exploratory version without model tracing feedback. Although the exploratory students scored as well as the model-tracing students on post-tests, they took twice as much time as the standard GIL students to complete the training problems. Thus, the value added by tutorial feedback was to cut the user's learning time in half. In comparison to these and many other systems, the coaching currently provided by Belvedere is relatively unsophisticated. Consequently, it is important to test its educational value and, consistent with the incremental design approach outlined earlier, to add more complex coaching functionality as needed to address deficiencies in the utility of the Belvedere system.

Experiment 2. With the expectation of having established some effects of coaching in the first experiment, I designed a follow-up experiment that compares Belvedere with immediate, intrusive coaching to the standard Belvedere system with on-demand coaching, with an added provision to encourage more frequent advice-seeking by users of the latter. This second experiment investigates the relative effectiveness of feedback timing and control (system- vs. user-initiated); that is, whether the type of feedback Belvedere provides is more useful when provided immediately and automatically, or whether it is best provided only upon users' requests, when they feel help is needed. This experiment permits me not only to compare the relative costs and benefits to Belvedere's users of both feedback approaches, but also to investigate under what circumstances Belvedere's users ask for feedback from the Coach.

Rationale for Single-User Sessions

Although the Belvedere system was designed partly to support collaboration among users, I chose to conduct each session in both experiments with single users, for several reasons. Firstly, Belvedere's online Coach does not foster collaboration explicitly; that is, all of its feedback is worded generically with respect to cardinality of users, applying equally well to both single and multiple users. Therefore, for the sole purposes of assessing the general effects of Belvedere's feedback, there was no principled reason to prefer collaborative sessions to single-user sessions. Secondly, single-user data reveal feedback effects with greater sensitivity than would be possible in multiple-user sessions, because more of each individual user's cognitive effort and time on task are devoted to explicit problem-solving actions than to verbalization, input censorship, and extraneous off-task discussion (cf. Suthers & Weiner, 1995). Therefore, not only does the Coach have more total input to which it can respond, but the participants also are not distracted by interactions with other users. Thirdly, because the standard Belvedere delivers coaching only on demand, even during collaborative sessions only the individual user who asks for it sees the advice on her screen. Therefore, because the argument diagram is a shared enterprise between the collaborating users, it would have been difficult to assess the effects of coaching on either individual or collective user activities. Fourthly, in some of our prior sessions with multiple users (Suthers et al., 1995) we observed students coaching each other, often just before entering information into their shared diagram (e.g., discussing which Belvedere box primitive to use for a given statement). Such peer coaching would limit the opportunities for an intrusive Coach to interject advice.
Finally, data from collaborative sessions with Belvedere are inherently more difficult and expensive to collect and interpret than are data from single-user sessions. Except for some limited communications using the somewhat constraining Chat facility, none of the collaborative activities in which our Belvedere users have engaged were traceable by a computer. Our prior collaborative sessions required the use of at least one video camera, with an additional experimenter manning each camera, to capture user dialogue and gestures. I obviated the need to collect and analyze video protocols by limiting data collection to single-user sessions and by using computerized event logs, which captured all user browsing and diagramming actions in Belvedere, as the primary sources of data in my experiments.

Overview of the Task

The problem domain. The problem assigned to each participant was a specific scientific mystery: What caused the crash of TWA Flight 800? Information about the crash and its possible causes was presented to students in a self-contained web database. The database is an adaptation of one of several topic databases initially constructed by former members of the Belvedere research group (see Footnote 3). I chose this particular database for my experiments because it is smaller and more manageable than the other databases, many of which have required multiple sessions to explore thoroughly in previous investigations of students using Belvedere. Pilot testing also showed this database to be one of the more accessible ones to students, making it ideal for sessions of relatively short duration that focus not on learning domain knowledge but on applying principles of scientific inquiry.

The original database was constructed during the months following the July 1996 crash, while the investigation was still ongoing. Although the National Transportation Safety Board (NTSB) has since released at least two "final" reports[6] on their investigation, each of which names a most likely cause of the crash, I felt that our web database of several possible causes would still be a viable vehicle by which to test the coaching manipulation in my experiments. My decision was guided by the fact that, in my pilot studies, only 1 student out of 56 reported having heard of the NTSB's first final report when queried during debriefing. However, the NTSB released its most recent report in late August of 2000, shortly before I was to begin data collection for this research. Therefore, as detailed in the Method section of Experiment 1, students were queried at the end of their sessions to ascertain whether they knew about that report and whether it had any influence on their reasoning.

I adapted the original TWA problem database by reducing the grain size of information on each page and by introducing indexing conventions to make it easier to track user browsing. My version of the database consists of 38 individual web pages, accessible from a home page that divides the web browser window into panes with a menu in the narrow left pane and the main page content in the wider right pane (see Figure A1 in Appendix A). The menu pane includes six links to other pages, one of which (labeled "Consider Possible Causes") is a link to an index of four hypothetical causes of the crash (see Figure A6). Each hypothesis listed in this index is a link to a separate web page about a possible cause, and each such page includes two hyperlinks -- one leading to an index of evidence for the hypothetical cause, and one leading to an index of evidence against it (see Figure A1). Each such evidence index consists of hyperlinks to textual bits of evidence, each on its own web page. This level of indexing allows tracking of which hypothesis and which type of evidence a participant is considering at any given time. The hierarchical link structure of the database is represented in Appendix C.

Solving the problem. The problem solving task presented to participants was to try to "make sense" of the information in the web database (a la Toth et al., in press) and to try to determine the most likely cause of the crash based on the information available. As students worked their way through the database using a web browser (Netscape), they were asked to record their thoughts in a Belvedere diagram. That is, each time students came across a hypothesis, a piece of evidence, or any other type of information they deemed relevant to the problem, they were to insert the information into Belvedere using the appropriate box primitive (e.g., using a Data box to contain evidence). They were also asked to indicate the relationships between the statements they entered by interconnecting them using Belvedere's link primitives (e.g., using an Against link to show a contradiction). Students were told that they were not expected to come up with a definitive answer to the question of what caused the crash; rather, they simply had to try to sort through the information and determine what they thought was the most likely cause. No other endpoint was specified, so students proceeded with the diagramming task until they felt they had satisfied the goal.

Performance Measures

Of primary interest are the apparent effects of coaching feedback on student activity during the sessions. Direct effects of coaching were inferred in a number of ways: (a) by analysis of student activities following the presentation of coaching feedback; (b) by comparison of final student diagrams to an expert diagram; and, to the extent deemed necessary after analysis of diagramming session events, (c) by comparison of verbal argument summaries between students in the different feedback conditions. Therefore, I used a battery of dependent measures, drawn from various sources including: chronological records of relevant session events, culled from time-stamped log files of all user diagramming actions, browsing actions, and coaching feedback received; users' final Belvedere diagrams; and written notes and tape recordings of users' end-of-session verbal argument summaries. These multiple avenues allowed for analysis of possible coaching effects on both the processes and the products of the students' scientific inquiry activities.

Diagram-creation log files were used to determine user actions in a diagram before and after the delivery of selected coaching advice (e.g., to see whether students chose to implement actions recommended by the Coach). Similarly, web browsing log files were used to determine possible coaching effects on user navigation within the hyperlinked problem database. Final student diagrams were coded for numbers and types of elements present and were compared against an expert diagram (described in Experiment 1), using overlay conventions similar to those of Cavalli-Sforza (1998). Such conventions include noting which diagram elements are present in both student and expert diagrams, which expert elements are missing from student diagrams, and which additional student elements are extraneous to the expert diagram.
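The overlay conventions just described amount to three set comparisons, which a short sketch can make concrete. The representation of diagram elements as tuples and the labels used in the usage note are hypothetical; they are not the coding scheme actually applied to the student diagrams.

```python
def overlay(student_elements, expert_elements):
    """Compare a coded student diagram against a coded expert diagram.

    Each element is assumed to be a hashable label, for example
    ("Hypothesis", "H1") for a statement or ("For", "D1", "H1") for a link.
    """
    student = set(student_elements)
    expert = set(expert_elements)
    return {
        "shared": student & expert,      # present in both diagrams
        "missing": expert - student,     # expert elements absent from the student diagram
        "extraneous": student - expert,  # student elements absent from the expert diagram
    }
```

For example, calling overlay on a student diagram containing a Data box labeled "D9" that the expert diagram lacks would place ("Data", "D9") in the extraneous set, while any expert hypothesis the student never entered would appear in the missing set.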

The verbal summaries were intended as a secondary data source because they can reveal nothing about the direct effects of coaching during a problem solving session with Belvedere. However, they were viewed as a complement to the diagrams and time-stamped event records so that, in the absence of clear and direct coaching effects, they might provide a rough gauge of participants' overall understanding of their inquiry diagram products, which could indirectly reflect coaching effectiveness. At the end of each diagramming session, verbal summary information was collected from each participant in three phases: (a) a free-form summary, without structured prompts and without the Belvedere diagram visible; (b) questions involving structured prompts about possible causes for the crash, again without the diagram visible; and (c) any additions or changes to the summary after redisplay of the diagram.

Affective Measures

As indicated in my cost/benefit discussion above, also of interest are the effects of coaching on users' attitudes about using Belvedere. Therefore, a brief battery of end-of-session attitude ratings[7] was collected from each participant, tailored to the condition to which she was assigned (i.e., only users who received coaching were asked to rate the Coach). As detailed in the Method section for Experiment 1, rating items were accompanied by standard nine-point Likert scales. The indirect effects of coaching on student attitudes toward using the Belvedere environment were inferred by comparison of attitude ratings between students in the different feedback conditions. In the first experiment, attitude ratings about the Coach from students in the coaching condition were compared to their overall ratings of Belvedere. Analyses of these measures are outlined in the following experiment sections.

Covariate Measures

As noted in similar studies, students' problem solving performance in ill-defined domains can depend on their reasoning ability (Means & Voss, 1996; Toth et al., in press). Therefore, I sought to include some kinds of ability measures for possible use as covariates in data analyses. I had considered several ability assessment measures with varying degrees of directness and relevance to my experimental task, such as: (a) having participants provide definitions of common argumentation terms (cf. Cavalli-Sforza, 1998); (b) presenting a textual debate and asking participants to identify relevant claims and evidence; (c) asking participants to give an open-ended analysis of a short, accessible article on a scientific topic; (d) presenting a partial Belvedere diagram and (after explaining the diagramming conventions in it) asking participants to identify its strong or weak points; and even (e) administering a standardized test (e.g., the California Critical Thinking Skills Test). However, regardless of their directness or relevance, each of these measures posed the dual danger for the participants of (a) "priming" them to interact with Belvedere in ways that would have reduced the impact of my coaching manipulations, and (b) diverting their cognitive effort away from their actual problem-solving session with Belvedere. Therefore, I felt it justified to settle for less powerful covariate measures that neither contaminated nor fatigued my participants. The measures I chose to collect from students were their current grade point average (GPA), their scores on the Scholastic Assessment Test (SAT) or American College Testing (ACT) examination, and their scores on the short form of the Need for Cognition (NFC) scale (Cacioppo, Petty, & Kao, 1984). The latter scale, which entails assignment of agreement ratings to a number of propositions, was presented to participants at the end of their sessions so as not to fatigue them before their interactions with Belvedere. 
It was hoped that some combination of these covariate measures would serve as surrogates for a more direct assessment of prior inquiry skill.

General Hypotheses

As described earlier, Belvedere's online Coach can recognize and provide feedback about abstracted patterns of relationships among the statements in a user's inquiry diagram. After every change to the diagram by the user, the Coach examines the types and configurations of boxes and links in the diagram, looking for indications that the user may not be employing good argumentation or inquiry practices. If it finds any such indications, it may want to suggest remediation to the user, depending on the severity of the deviations. Feedback messages on the inquiry patterns monitored by the Coach are presented in Appendix B.

As indicated earlier, in prior work with Belvedere there were many situations in which users could have been helped if they had sought coaching (Suthers et al., in press). Therefore, to the extent that my experiment participants lacked well-developed inquiry skills or familiarity with at least some of the rules of good scientific inquiry that are embodied within the Coach, I expected to find positive effects of coaching in my primary performance measures. My predictions were as follows:

On the other hand, based on the aforementioned findings from prior work with the standard Coach (Suthers et al., in press) as well as findings from some pilot work with my intrusive Coach, I had reason to predict some variability in my affective measures based on the type and amount of coaching received. As stated earlier, prior work with the standard Coach showed that its advice was often ignored by students as if not understood. In several rounds of piloting with both standard and intrusive coaching, I asked students to inform me when they did not understand the coaching feedback during their Belvedere sessions, and I reiterated this request at the end of each session while asking them for reflective follow-up comments about the Coach. Although most of my pilot-study participants reported that they understood the Coach, a recurring theme among many of their post-session verbal impressions of the intrusive Coach was that it was "annoying". More specifically, many students reported that it offered advice too often, in some cases even before they had had the opportunity to follow up on its earlier advice. Some pilot participants also likened the Coach to similar advice-giving features of some commercial software packages.[8] Therefore, to the extent that students in my intrusive coaching conditions found the advice to be unwanted, I predicted the following:

These predictions suggest a potential cost/benefit tradeoff in my experiments: The automatic delivery of coaching feedback may better enable users to master the target inquiry skills, but it may also leave them with less favorable impressions of the Belvedere system or of the task at hand. It could be argued that such a tradeoff may be specific to user skill level, with lower-ability students finding the frequent coaching to be more helpful (and, therefore, less annoying) than the higher-ability students find it to be. Similar research with other ILEs on the motivational consequences of feedback on lower-ability students (cited in Merrill et al., 1992) supports this notion. However, based on my pilot work with the intrusive Coach, I predicted that apparently repetitive feedback could be delivered often enough for even very low-ability students to find that the costs of processing the feedback outweigh the benefits gained from the initial feedback presentations. I therefore expected to find an attitude-performance tradeoff across the board in the intrusive conditions. Only after comparison of the relative costs and benefits of my feedback-delivery approaches can I assess the full impact of such a tradeoff on the utility of the Belvedere system.

Technical Details

Belvedere. Belvedere version 2.0.1 was used for both experiments, on two different platforms: the Windows client for my students and the less robust Solaris client, which was more susceptible to crashing, for myself. Only the Inquiry Diagram component of the Belvedere system was used; the Chat facility was not. In this version of Belvedere, nondirectional links replaced the directional links of its immediate predecessor, further simplifying the graphical language of the diagrams. Although a more recent Belvedere version (2.1) was available at the time I began this research, I chose not to use it for practical reasons. In the newer version, the Coach is integrated into the Java-based Belvedere client itself, whereas the older version uses a Coach that runs as a LISP process on a separate computer. The behavior of the Coach was much easier to modify in the LISP environment without affecting the rest of the Belvedere system. The Coach ran in a Common LISP environment (via Harlequin LispWorks v3.2.2), also using LOOM version 3.0, on a Sun SPARC workstation running Solaris UNIX. This workstation was the same one on which I ran the Solaris Belvedere client to monitor student diagram construction.

Yet a third networked computer was required to run my experiments. This third computer, also a Sun SPARC, ran a Java-based "Connection Manager" that allowed diagram updates in the students' Belvedere client to automatically display in my client. Belvedere diagram information was stored on this third computer using the Postgres95 database management system, which maintains information about each element added to or changed in each Belvedere diagram. Finally, this SPARC also ran the web server for the TWA problem database. All browser page accesses by the students were logged to this machine, as were all of their major Belvedere actions.

Netscape. Students used Netscape Communicator for Windows (version 4.74) to browse the self-contained database about the TWA crash. To ensure that I would be able to track participants' web browsing activity from the server side, I reduced the user's Netscape memory and disk caches to zero and configured Netscape to load from the network every time a web page was visited. I also removed from view all of Netscape's toolbars except the navigation toolbar with the directional buttons, to facilitate web navigation for the students.

Internet Explorer. The end-of-session survey presented to each participant was contained within a web form with radio buttons, configured such that survey responses would be sent to me electronically only after responses were indicated for all items. That is, any attempt to submit an incomplete form would result in a browser error message. To make it easier for participants to correct such oversights, I needed to use a separate browser with caching enabled. Therefore the survey was presented using Microsoft's Internet Explorer browser (version 5.00) with standard cache settings.


Experiment 1

Method

Participants

Participants were 37 undergraduate students from the University of Pittsburgh (17 males and 20 females). All students were fluent in English and had normal or corrected vision. Each student participated in a single individual session conducted by me. As compensation, each student received two research participation credit hours for Introductory Psychology; therefore, time was closely monitored to ensure that overall session duration did not exceed two hours.

Design

The experiment employed a single-factor between-subjects design. Participants were block-randomly assigned to either Condition IC (intrusive coaching) or Condition NC (no coaching). In Condition IC, participants received intrusive coaching feedback from Belvedere while constructing their inquiry diagrams. Coaching feedback was also available on demand in this condition. In Condition NC, Belvedere's automated coach was disabled, and the button from which it can normally be invoked on demand was removed from the Belvedere diagramming interface.
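Block-random assignment of this kind can be sketched as follows. The block size (one participant per condition in each block), the function name, and the use of a seeded generator are assumptions for illustration; the text does not specify how the blocks were actually constructed.

```python
import random


def block_randomize(n_participants, conditions=("IC", "NC"), seed=None):
    """Assign participants to conditions in shuffled blocks.

    Each block contains every condition exactly once (an assumed block
    size), so group sizes never differ by more than one participant.
    """
    rng = random.Random(seed)
    assignments = []
    while len(assignments) < n_participants:
        block = list(conditions)
        rng.shuffle(block)  # randomize the order within this block
        assignments.extend(block)
    # The last block is truncated if n_participants is not a multiple of the block size.
    return assignments[:n_participants]
```

With 37 participants and two conditions, a scheme like this yields groups of 19 and 18, with the larger group determined by the shuffled order of the final, truncated block.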

Apparatus

I conducted each session in a single room, with one computer and desk for the participant and another for myself. Throughout the session the participant used a Pentium® tower with a 15" color monitor, standard keyboard, and a two-button mouse; from the other computer (a SPARC workstation) I monitored the data and coaching logs as well as the participant's diagram construction using Belvedere's networking capabilities. The room was configured such that I also was able to monitor the participant's screen surreptitiously from my desk across the room. The participant's screen layout was configured such that the Belvedere Inquiry Diagram window filled one vertical half of the screen and the Netscape Navigator window filled the other half (see Figure A1 in Appendix A for an example setup; the two applications were reversed in my sessions). This configuration allowed the user to browse information in Netscape and insert it easily into Belvedere, if desired, without having to switch between applications. Toward the end of each session, I used a standard cassette recorder and written notes to capture participants' verbal argument summaries.

Materials

As discussed in my Introduction, participants were asked to construct their Belvedere diagrams based on the contents of a self-contained web database about a scientific mystery: the possible causes for the crash of TWA Flight 800. Appendix D is a generalized form of the actual script I used during verbal interactions with participants, including condition-dependent instructions to the participants and requests for information from them. For brevity, I modified the script as it appears in Appendix D to apply to both experiments. An end-of-session survey was given to each participant. The survey consisted of: (a) the screening questions (mentioned in the Introduction) regarding the user's familiarity with the NTSB's final report on the crash; (b) a short battery of attitude rating items regarding the user's interaction with Belvedere and, for users in Condition IC, its Coach; and (c) the 18-item short form of the NFC rating scale (Cacioppo et al., 1984). The attitude and NFC items used a common, nine-point Likert rating scale, with response anchors modeled after those used by Cacioppo and Petty (1982) with their longer, 34-item form of the NFC scale. My response scale ranged from 1 (very strongly disagree) to 9 (very strongly agree), with 5 being neutral and with intervening anchors qualified with strongly, moderately, and slightly. Two versions of the survey were constructed, one with and one without the Coach-related attitude statements. The rating scale and an abbreviated form of the survey (with the Coach items but without the NFC items) may be found in Appendix E.



Figure 1. Generalized outline of the experimental procedure. Steps that apply to only the intrusive coaching (IC) condition in either experiment or to the on-demand coaching (DC) condition in Experiment 2 are so marked.

Procedure

My experimental procedure is summarized in Figure 1. Prior to the participant's arrival, I launched the version of Belvedere corresponding to her assigned condition (i.e., either with or without the Coach icon). In addition, I launched Netscape and cleared its location bar and browsing history so that all links in the database would appear as not yet having been accessed. Upon the participant's arrival I seated her at her desk in the room. I then recited a very brief introduction about the nature of the scientific inquiry task to be performed, pausing to deliver a brief overview of how to navigate through web pages (e.g., clicking on hyperlinks, Back and Forward buttons) if the participant had not used a web browser before (see my run script in Appendix D). I then asked for and recorded her self-reported current GPA and scores on the SAT (or ACT) for later use in covariate data analyses. I then presented a brief verbal overview of the particular scientific problem to be tackled (the crash of TWA 800) and I instructed the participant to record her thoughts in Belvedere as she worked through the problem by considering possible causes and the evidence surrounding them. I presented a brief introduction to Belvedere, in which I explained what the software is for (in short, drawing argument diagrams) and discussed the meaning and use of each icon in Belvedere's palette. I explained in some detail how to draw links, because Belvedere's somewhat non-intuitive link-drawing conventions have troubled users in the past. I then presented a very brief sample text about an unrelated scientific problem (global warming) and showed its corresponding graphical representation as a small Belvedere diagram[10] (see Appendix F).

I informed each participant in Condition IC that a computerized Coach would be monitoring her diagram construction, and that periodically it may want to suggest possible ways to improve her diagram. I then told her that the Coach is available on demand by clicking the light bulb icon, and that it may also "speak up" on its own even when she does not click on the light bulb. I explained the appearance of the Coach's feedback (i.e., it appears in a pop-up dialog box and may highlight some diagram elements in yellow) so that the participant would be able to recognize it as such. Participants in Condition NC did not receive any information about coaching, and they used a modified version of the interface with the light bulb icon removed.

After asking the participant if she had any questions, I brought up the TWA problem's home page in Netscape to begin the diagramming session and then retired to my own desk to monitor the session. The participant worked through the problem as she saw fit, with me intervening only to help her with any problems she may have had with her computer or with the Netscape or Belvedere software.[11] I also remained available to answer any questions the participant may have had during the session. The session proceeded with the participant creating and incrementally refining an argument diagram of the problem, with participants in Condition IC receiving periodic, usually intrusive feedback from the Coach. The diagramming session ended when the participant verbally indicated to me that she thought she was done working on the problem, or when the allotted session time neared its end.

At the conclusion of the diagramming session I cleared the participant's screen, removing her Belvedere and Netscape windows from view, and I turned on the tape recorder. I then asked the participant to provide a free-form verbal summary of the argument she had constructed during her diagramming session. When the participant indicated that her summary was complete, I used structured verbal prompts as needed to encourage the participant to evaluate each of the possible causes in turn and to select a "winning" causal hypothesis, if she had not already done so in her free-form summary. I then restored the Belvedere window containing her final argument diagram and asked if she wished to change or add anything to her summary. I followed each significant pause between the participant's utterances with the content-free prompt "anything else?" (a la Means & Voss, 1996), until she indicated that she was finished. I then turned off the tape recorder.

Finally, I launched the Internet Explorer web browser (with standard cache settings) and presented a web form to the participant containing the end-of-session survey[12] (see the Technical Details in the Introduction). After the participant successfully submitted her survey responses, I presented a consent form and asked her for written permission to access her official GPA and SAT scores from the University. I then debriefed the participant, asking about her knowledge of the final NTSB report and about her reactions to the Coach, and I gave her a credit slip for her participation.

Results and Discussion

Data Limitations

Data omission. An unusual client-server network communication error occurred during one of the 20 coached sessions, resulting in the Coach's failure to recognize one of the participant's Data statements. I was alerted to the error by a series of "bogus edge" warning messages in the coaching log for that session, which began to appear after the participant drew a For link between that Data statement and one of her Hypothesis statements (the only other statement to which it was ever linked). The warning message recurred each time Belvedere redrew that link to the phantom Data statement, which apparently existed in the participant's diagram but not in the Postgres95 relational database on the server. Therefore, when generating advice messages the Coach was unable to account for either that Data statement or the For link between it and the Hypothesis. Because the error occurred early in the diagramming phase of her session (after only 5 of 47 diagram actions and only 1 of 25 total coaching messages), and because a later recreation of her diagram without the error produced a different series of messages from the Coach, her data were excluded from all analyses. After her omission, data from 36 participants (19 coached students and 17 uncoached students) remained.

Covariate data. I was granted access to official student records for all but one participant; therefore, I disregarded student self-reports of GPA and SAT scores in favor of the official figures. One participant denied me access to both his SAT scores and his GPA (he also did not self-report them). Two other participants lacked any SAT scores, but their ACT composite scores were available. Using a recent concordance table published by the College Board (Table 3 of Schneider & Dorans, 1999), I determined the equivalent SAT total scores for their ACT composites and substituted them in covariate analyses where feasible. However, I was unable to determine any equivalent SAT Math or Verbal subscores for these students. Remaining after these limitations were 35 viable GPAs and SAT total scores, as well as 33 viable SAT Math and Verbal subscores. Data from the NFC scale were available for all 36 viable participants.

Downtimes due to software failures. Software crashes occurred during the diagramming phases of two uncoached sessions and one coached session. The respective crash downtimes were 0.95, 6.37, and 4.88 min. In the first case only Netscape was affected, but the other two cases required me to reboot the student's computer, thereby affecting both Belvedere and Netscape. Fortunately, in all cases I was able to restore both the student's most recently browsed web page in Netscape and the student's Belvedere diagram in its entirety. Therefore, as detailed below, the impact of these failures seems to have been limited to inflated session times.

Statistical Notes

An alpha level of .05 was used for all statistical tests. Unless otherwise indicated, each of my between-group statistical tests involved a separate test for equality of group variances (e.g., Bartlett's F-test). For cases in which the equality test indicated a violation of the equal-variance assumption, I report the more conservative statistics and probability values assuming unequal variances. In most such cases, adjusted degrees of freedom using Satterthwaite's approximation (as shown in Snedecor & Cochran, 1980) were computed as decimal numbers. Therefore, any degrees of freedom that I express as a decimal number imply unequal variances.

Covariates: Descriptive Statistics

NFC scale. Each participant's ratings of the 18 items on the NFC short form (Cacioppo et al., 1984) were averaged into a single composite measure, equal to the arithmetic mean of the 18 individual ratings, after reverse scoring of the 9 negatively worded items. The 36 NFC composite scores ranged from 4.11 to 8.22 with a mean of 6.19 (SD = 1.01, Mdn = 5.97), indicating on average a modest need for cognition among the students. The mean NFC score was slightly but not significantly higher for uncoached students (M = 6.28) than for coached students (M = 6.11).
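The composite scoring just described can be sketched in Python. This is only a minimal illustration, assuming the nine-point response scale from the Materials section (so that a reverse-scored rating equals 10 minus the original rating). The function name and the positions of the nine reverse-scored items are hypothetical placeholders, not the published NFC scoring key.

```python
from statistics import mean

SCALE_MIN, SCALE_MAX = 1, 9  # nine-point rating scale

# ASSUMPTION: placeholder 0-based positions of the 9 negatively worded items.
# These indices are illustrative only and are not the actual NFC key.
REVERSED_ITEMS = frozenset({2, 3, 4, 6, 7, 8, 11, 15, 16})


def nfc_composite(ratings, reversed_items=REVERSED_ITEMS):
    """Return the mean of 18 ratings after reverse-scoring the flagged items."""
    if len(ratings) != 18:
        raise ValueError("expected 18 item ratings")
    if any(not SCALE_MIN <= r <= SCALE_MAX for r in ratings):
        raise ValueError("ratings must lie between 1 and 9")
    scored = [
        # Reverse scoring maps 1 -> 9, 2 -> 8, ..., 9 -> 1.
        (SCALE_MIN + SCALE_MAX - r) if i in reversed_items else r
        for i, r in enumerate(ratings)
    ]
    return mean(scored)
```

Under this scheme a participant who marks the neutral point (5) on every item receives a composite of 5, regardless of which items are reverse scored.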

GPA. Student GPAs (at the end of the term during which the experiment was conducted) ranged from 1.64 to 3.95, with a mean of 2.85 (SD = 0.58, Mdn = 2.88). There was no significant difference between mean GPAs of coached (2.82) and uncoached (2.89) students.

SAT scores. I noted the students' most recent SAT subscores and, if different, their highest subscores from any prior test administrations. Math subscores ranged from 380 to 740 with a mean of 548 (SD = 87, Mdn = 550), and Verbal subscores ranged from 340 to 750 with a mean of 573 (SD = 92, Mdn = 570). The mean total SAT score including the two equated ACT scores was 1120 (n = 35), one point less than the mean sum of subscores (n = 33). The means of students' highest Math and Verbal subscores were 559 and 586, respectively. Means for uncoached students were slightly but nonsignificantly higher than those for coached students on all SAT measures (see Table 1).

Table 1
Mean SAT Scores by Condition in Experiment 1


Condition          SAT-M   SAT-V   SATtotal(a)   HiSATM   HiSATV
IC          M      533     559     1092          547      573
            SD     102      93      168           90       76
NC          M      565     589     1153          573      602
            SD      65      92      130           68       83
Note. Means did not differ significantly between groups. IC = intrusive coaching; NC = no coaching. (a) n = 35 after converting ACT scores of two students.

Correlations among covariates. Using a two-tailed rejection criterion, analyses of the 33 students for whom all subscores were available showed positive correlations of GPA with SAT Verbal subscores (.52, p < .005) and with SAT total scores (.46, p < .01) but not with SAT Math subscores (.28, p = .11). SAT Math and Verbal subscores were positively correlated with each other (r = .54, p < .005). NFC composite scores were uncorrelated with the other covariate measures.

Median splits. In addition to covariate analyses, I performed median splits using each of the three covariate measures, for use in two-way (with coaching condition) analyses of variance (ANOVAs) on my dependent measures. Any significant interactions revealed by these analyses are reported throughout. Note that median-split sample sizes were 18 per cell for NFC and SAT totals, but there were 19 students in the low-GPA group and 17 in the high-GPA group. F-ratios for these analyses were approximate due to unequal sample sizes, so any interactions significant at or near the .05 level should be interpreted with caution.

Session Durations

I recorded the approximate total duration of each participant's experiment session, rounded to the nearest 5-minute increment. Session duration (including the three crash downtimes) ranged from 40 to 95 min, with a mean of 64.72 min (SD = 13.57, Mdn = 65). Subtracting the downtimes slightly reduced the mean session duration to 64.38 min. Sessions involving the Coach tended to last longer (M = 67.11, SD = 13.78) than uncoached sessions (M = 61.33, SD = 12.67), although the difference was not statistically significant, t(34) = 1.30, p = .20. Possible reasons for the tendency include (a) addition of the coaching overview to the verbal introductions in the IC sessions, (b) addition of Coach-related items to the IC end-of-session surveys, and (c) interaction with the Coach itself.
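As a check on the figures in this paragraph, the following Python sketch recomputes the downtime-adjusted mean and the equal-variance t statistic from the reported summary values alone. It is an illustration under stated assumptions rather than the original analysis: the function name is hypothetical, and the raw session data are not used.

```python
import math


def pooled_t(m1, sd1, n1, m2, sd2, n2):
    """Student's t for two independent groups, assuming equal variances."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / df
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (m1 - m2) / se, df


# Spreading the three crash downtimes (in minutes) over all 36 sessions.
downtimes = [0.95, 6.37, 4.88]
adjusted_mean = 64.72 - sum(downtimes) / 36
print(f"adjusted mean = {adjusted_mean:.2f} min")  # adjusted mean = 64.38 min

# Reported group statistics: coached (n = 19) versus uncoached (n = 17).
t_stat, df = pooled_t(67.11, 13.78, 19, 61.33, 12.67, 17)
print(f"t({df}) = {t_stat:.2f}")  # t(34) = 1.30
```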

I used the time-stamped log files to more precisely determine the duration of the diagramming phase of each session. I measured the time interval between the initial browse of the TWA problem home page and the student's final action, either in Belvedere or in Netscape. Diagramming durations ranged from 19.50 to 64.57 min, with a mean of 40.24 min (SD = 12.62, Mdn = 38.07). Diagram sessions involving the Coach tended to last longer (M = 42.06, SD = 13.97) than the uncoached sessions (M = 38.20, SD = 10.98), possibly due to interaction with the Coach. However, this difference also was not significant (t < 1).

Amount and Frequency of Coaching

The 19 viable coached students received a total of 471 messages from the Coach, of which 434 (92%) were presented intrusively and 37 (8%) were presented upon request. Of the 471 messages, 462 were substantive and 9 were null advice messages (i.e., instances in which the message was "The coach doesn't have anything to suggest."). A null message is the result of a user request for advice when the Coach has no list of rules that currently apply to the diagram. By design all intrusive messages from the Coach were substantive; therefore, only 28 of the 37 requested messages (76%) were substantive. The total number of substantive coaching messages displayed to each coached student ranged from 6 to 43, with a mean of 24.32 (SD = 9.78) and a median of 24.

To determine how often advice was presented during the coached sessions, I defined two versions of an inter-coaching interval (ICI) measure between successive advice presentations: one for elapsed time (ICIt) in seconds[13] and one for the number of diagramming events (ICIe). More specifically, for each substantive coaching instance I measured the interval between it and the previous substantive coaching instance (or, in the case of the first instance, the interval between it and the first diagramming action). Averaged across students, the per-student mean ICIt and ICIe were 91.95 s and 1.87 diagram events, respectively. When I restricted calculations to intrusive messages only, the per-student mean ICIe was 1.97 (i.e., on average the Coach presented an intrusive advice message after every other diagramming action). Both ICI measures show that, as expected, the frequency of advice presentation by the intrusive Coach was relatively high on average.
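The two ICI measures can be sketched in Python as follows. The log layout (time-ordered pairs of seconds and an event kind) and the function name are hypothetical. The sketch also assumes that the first diagramming action counts as one event toward the first ICIe, a detail the definition above leaves open.

```python
from statistics import mean


def inter_coaching_intervals(events):
    """Return (ICIt list, ICIe list) for one session.

    `events` is a time-ordered list of (seconds, kind) pairs, where kind is
    "diagram" for a diagramming action or "coach" for a substantive coaching
    message. ASSUMPTION: this log layout is hypothetical.
    """
    ici_t, ici_e = [], []
    anchor_time = None   # time of the previous coaching (or first diagram action)
    events_since = 0     # diagramming events counted since the anchor
    for seconds, kind in events:
        if kind == "diagram":
            if anchor_time is None:
                anchor_time = seconds  # the first diagramming action starts the clock
            events_since += 1
        elif kind == "coach" and anchor_time is not None:
            ici_t.append(seconds - anchor_time)
            ici_e.append(events_since)
            anchor_time, events_since = seconds, 0
    return ici_t, ici_e


# Toy session: two diagram actions, a coaching message, one more action,
# and a second coaching message.
log = [(0, "diagram"), (30, "diagram"), (45, "coach"),
       (100, "diagram"), (130, "coach")]
ici_t, ici_e = inter_coaching_intervals(log)
print(ici_t, ici_e)              # [45, 85] [2, 1]
print(mean(ici_t), mean(ici_e))  # 65 1.5
```

Per-student means computed this way would then be averaged across students to obtain summary figures of the kind reported above.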

Completeness of Final Diagrams

Total element count. As a first step in trying to determine the effects of coaching on diagram completeness, I computed some gross, overall "body count" measures by totalling the numbers of boxes and links in students' final diagrams (a la Toth et al., in press). The total number of boxes in each final diagram ranged from 7 to 26 with a mean of 16.31 (SD = 5.26, Mdn = 17.50). Total number of links ranged from 4 to 36 with a mean of 19.11 (SD = 7.79, Mdn = 17.50). Contrary to my expectations, the final diagrams of uncoached students tended to have more boxes and more links (Ms = 17.12 and 19.47, respectively) than did those of coached students (Ms = 15.58 and 18.79, respectively); however, neither difference in means was statistically significant.

Box types. I further analyzed counts of diagram elements by the specific type of box. The number of Hypothesis boxes per final diagram ranged from 1 to 7, with a mean and median of 4.00 (SD = 1.41). There was no significant difference between coached and uncoached means (3.95 and 4.06, respectively). The number of Data boxes per final diagram ranged from 5 to 22, with a mean of 11.56 (SD = 4.46, Mdn = 12). Respective means of coached and uncoached students (10.89 and 12.29) did not differ significantly. The number of Unspecified boxes per final diagram ranged from 0 to 5, with a mean of 0.75 (SD = 1.36, Mdn = 0). There was no significant difference between coached and uncoached means (0.74 and 0.76, respectively). I also counted the total number of unlinked boxes (i.e., boxes with no relational link of any kind) in each final diagram, regardless of box type. The number of unlinked boxes per final diagram ranged from 0 to 2, with a mean of 0.31 (SD = 0.58, Mdn = 0). Respective means of coached and uncoached students (0.32 and 0.29) did not differ significantly. Of the 11 total boxes left unlinked by the 36 viable students, 3 were Unspecified boxes, 5 were unique Data boxes, 2 were unique Hypothesis boxes, and 1 was a duplicate of a Data box appearing elsewhere in a very large diagram (scrolling was required to see both copies of the box). Except for the slightly lower coached mean number of Unspecified boxes, each nonsignificant trend among box types ran counter to my predictions. I revisit the issue of unlinked boxes in my discussion of diagram errors below.

Link types. I also coded diagram relations by type of link. The number of For links per final diagram ranged from 1 to 22, with a mean of 10.08 (SD = 5.36, Mdn = 9.50). There was no significant difference between coached and uncoached means (10.37 and 9.77, respectively), but the trend was in line with my predictions. The number of Against links per final diagram ranged from 1 to 17, with a mean and median of 7.50 (SD = 3.51). There was no significant difference between coached and uncoached means (7.74 and 7.24, respectively), but this trend was predicted as well. The number of And links per final diagram ranged from 0 to 7, with a mean of 1.53 (SD = 1.96, Mdn = 1). Coached final diagrams had significantly fewer And links (M = 0.68, SD = 0.89) than did uncoached final diagrams (M = 2.47, SD = 2.40), t(19.9) = 2.90, p < .01. This difference, which accounted for the unexpected trend in total link counts, could be due to specific coaching on And links (or, more to the point, to the lack thereof for uncoached students). Of the 36 viable students, 25 (12 coached and 13 uncoached) used at least one And link during their diagramming sessions. However, 1 of the 20 coaching rules (conjunct-for-hypothesis?) specifically targets the resolution of ambiguous support relationships involving And links (see Figure 2), and this rule was triggered at least once during 9 of the 19 viable coached sessions. One of the simplest ways for a user to address the advice is to delete the And link in question. Indeed, five of the nine students who were so coached on their And link(s) deleted at least half of them before the end of their diagramming sessions, whereas the uncoached students had no such impetus to do so.
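The decimal degrees of freedom in the And-link comparison reflect the unequal-variance procedure described in the Statistical Notes. The Python sketch below applies the standard Welch statistic with the Welch-Satterthwaite degrees of freedom to the reported summary values. It is assumed, not confirmed, that this matches the exact computation in the original analysis, and the function name is hypothetical.

```python
import math


def welch_t(m1, sd1, n1, m2, sd2, n2):
    """Unequal-variance t statistic with Welch-Satterthwaite degrees of freedom."""
    v1, v2 = sd1**2 / n1, sd2**2 / n2  # squared standard errors of each mean
    t = (m1 - m2) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    return t, df


# And links: uncoached (n = 17) versus coached (n = 19) final diagrams.
t_stat, df = welch_t(2.47, 2.40, 17, 0.68, 0.89, 19)
print(f"t({df:.1f}) = {t_stat:.2f}")  # t(19.9) = 2.90
```

Because the uncoached variance is several times the coached variance, the adjusted degrees of freedom fall well below the 34 that an equal-variance test would use.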

Correlations with covariates. I noted significant or near significant correlations between several of my gross diagram completeness measures and my covariate measures. NFC (n = 36) was positively correlated with number of Data boxes (.36) and number of Hypothesis boxes (.34), ps < .05, and it showed marginal positive correlations with total numbers of boxes (.32, p = .06) and links (.28, p = .10) and a marginal negative correlation with number of Unspecified boxes (-.32, p = .06). GPA (n = 35) showed marginal positive correlations with number of Data boxes and with total number of boxes, rs = .30, ps = .08. SAT total score (n = 35) was positively correlated with both number of Data boxes and total number of boxes (rs = .44, p < .01), and with numbers of For links (.39) and Against links (.42) as well as total number of links (.42), ps < .05.

Figure 2. Coaching on an ambiguous support relation involving an And link.
SAT Math subscore (n = 33) was correlated with the same measures as SAT total score at the .05 level or better. SAT Verbal subscore, however, had only marginal correlations (p = .07) with all but one of those measures, the total number of boxes (r = .35, p < .05). It also showed a significant negative correlation with total number of unlinked boxes, r(31) = -.35, p < .05. Highest SAT Verbal subscore showed an even stronger negative correlation with this measure, r(31) = -.42, p < .05. The students' highest Math and Verbal subscores correlated less strongly with all other completeness measures than did their most recent counterpart subscores. For these reasons, as well as to maximize sample size, SAT total score was the preferred SAT covariate measure used for analyses. Having noted the above correlations, I decided to perform analyses of covariance (ANCOVAs) on each completeness measure using NFC, GPA, and SAT totals as covariates, dropping least significant covariate terms until the overall F-test for covariates reached significance at or beyond the .05 level.

ANCOVAs on box counts. An ANCOVA on total number of boxes using NFC and SAT totals showed a significant overall effect of covariates (F(2, 32) = 5.76, p < .01), with significant effects of both NFC (t = 2.05) and SAT total (t = 2.72), ps < .05. Adjusted means were 16.15 boxes for coached students and 16.54 boxes for uncoached students, reducing the still nonsignificant (F < 1) unexpected trend favoring uncoached students. An ANCOVA on number of Hypothesis boxes showed a significant effect of NFC (F(1, 33) = 4.20, p < .05), deflating the difference between coached and uncoached means (3.99 and 4.02, respectively). An ANCOVA on number of Data boxes showed a significant overall effect of the covariates (F(3, 31) = 4.61, p < .01), with significant effects of SAT total (t = 2.28) and NFC (t = 2.33), ps < .05. Adjusted coached and uncoached means were 11.39 and 11.80, respectively, also reducing the unexpected trend favoring uncoached students. An ANCOVA on number of Unspecified boxes showed an almost significant effect of NFC with a negative regression coefficient, F(1, 33) = 3.78, p = .06. There was still no significant effect of coaching condition (F < 1), but the predicted difference between adjusted coached and uncoached means (0.70 and 0.80, respectively) was larger than that of the unadjusted means. ANCOVAs on the number of unlinked boxes did not show a significant effect of covariates using any regression model. However, a two-way ANOVA revealed a significant interaction of coaching condition and median-split SAT total, F(1, 32) = 7.42, p = .01. Uncoached students left more boxes unlinked if they had high SAT totals (M = 0.56) than if they had low SAT totals (M = 0), whereas coached students left more boxes unlinked if they had low SAT totals (M = 0.50) than if they had high SAT totals (M = 0.11). Please note, however, that across conditions only 11 total boxes were left unlinked, so the subsample sizes here are small.

ANCOVAs on link counts. An ANCOVA on total number of links showed a significant overall effect of the covariates (F(3, 31) = 3.45, p < .05), with a significant effect of SAT total (t = 2.45, p < .05) and a marginal effect of NFC (t = 1.80, p = .08). Although there was still no significant effect of coaching condition (F < 1), the adjusted means (19.64 for coached and 18.62 for uncoached students) were in line with my predictions, unlike the unadjusted mean link totals. An ANCOVA on number of For links using NFC and SAT totals showed a significant overall effect of covariates (F(2, 32) = 3.36, p < .05), with a significant effect of SAT total (t = 2.52, p < .05). Although there was still no significant effect of coaching condition (F < 1), adjusted means showed inflated differences in the predicted direction (10.88 for coached and 9.26 for uncoached students). An ANCOVA on number of Against links showed a significant overall effect of the covariates (F(3, 31) = 3.73, p < .05), with a significant effect of SAT total (t = 2.71, p < .05) and a marginal effect of NFC (t = 1.70, p = .10). Although there was still no significant effect of coaching condition (F(1, 32) = 1.58, p = .22), adjusted means showed an even stronger trend in the predicted direction (8.14 for coached and 6.84 for uncoached students). ANCOVAs on the number of And links did not show a significant effect of covariates using any regression model.

Expert Diagram Comparisons

Figure 3 shows an expert diagram for the TWA 800 problem. This diagram is an extension of an earlier expert representation that was compiled by former members of the Belvedere research group, prior to my reindexing of the evidence in the TWA 800 problem database. The four hypothetical causes for the crash appear near the center of the diagram, with supporting and contradicting evidence surrounding them along the periphery. The thicker For and Against links in the diagram represent relationships made explicit in the reindexed hyperlink structure of the database, while the thinner ones denote relationships only implicitly represented in the database. The diagram contains a total of 34 boxes and 40 links, with the following type breakdown: 4 Hypotheses (one for each possible cause), 29 Data, 1 Unspecified box,[14] 20 For links, 13 Against links, and 7 And links. The expert diagram includes And links solely in cases where conjunctions of two Data statements have a positive or negative relationship with one or more of the hypotheses.

Figure 3. Expert Belvedere diagram for the TWA 800 crash problem.
For the expert diagram I considered as hypotheses only the four possible causes listed in the hypothesis index of the database: a bomb (B), a missile (M), mechanical failure (MF), or human error (HE). Some of the students who included more than four hypotheses in their diagrams included the "Sabotage" and "Accident" group headings from the hypothesis index as well (see Figure A6 in Appendix A). Some of the other extraneous hypotheses included by students were statements by witnesses or investigators, which appear in the expert diagram as Data boxes with source attributions. Expert diagram Data statements that did not appear in most student diagrams included the regularly scheduled flight to Paris (n = 0), the flight bound from JFK airport to France (n = 1), the 1997 NTSB statement about remaining possible causes (n = 1), the service history of the plane (n = 2), the crash in Colombia due to pilot error (n = 6), the bomb-related statements about altitude (n = 6), and the statement regarding the split-second noise on the flight recorder (n = 6).

Errors in Final Diagrams

I counted instances of uncorrected diagramming errors in final student diagrams, relative to the expert diagram where applicable. That is, I disregarded any extraneous hypothesis boxes in student diagrams and considered errors relative to the four key hypotheses only. I coded final diagrams for the following errors: (a) the number of missing hypotheses, (b) the number of hypotheses subject to confirmation bias (no Against links), (c) the number of unsupported hypotheses (no For links), (d) the number of unique hypotheses without any links at all, and (e) the number of unique data without any links. Error counts per condition are shown in Table 2 along with total errors per condition, with student subsample sizes shown in parentheses. Note from the Total Errors column that slightly fewer uncoached students (8) than coached students (9) left uncorrected errors in their final diagrams, but the uncoached students left more such errors overall (20 vs. 16).

Table 2
Final Diagram Error Counts by Condition in Experiment 1


Condition   Hyps      Hyps     Hyps      Hyps      Data      Total
            Missing   C.Bias   Unsupp.   NoLinks   NoLinks   Errors
IC          7 (4)     2 (1)    5 (4)     0 (0)     2 (2)     16 (9)
NC          9 (5)     6 (4)    2 (2)     0 (0)     3 (2)     20 (8)
Note. Condition subsample sizes appear in parentheses. Proportional means did not differ significantly between groups. IC = intrusive coaching; NC = no coaching.
There were no significant differences between groups on proportional error counts. There were significant or near significant negative correlations between the covariates and some of the error counts: SAT total scores with total errors (-.42, p = .01), number of unsupported hypotheses (-.33, p = .05), and number of missing hypotheses (-.31, p = .07); GPA with number of unlinked data (-.34, p = .04) and total errors (-.25, p = .15); and NFC with number of missing hypotheses (-.29, p = .09). However, although ANCOVAs did show significant effects of covariates on some of the error measures, none of the adjusted proportional means showed significant between-group differences. The ANCOVA results that came closest to reaching significance were confirmation bias count per student (F(1, 30) = 1.56, p = .22), with adjusted proportional means of 0.11 coached and 0.36 uncoached, and total error count per student (F(1, 31) = 1.40, p = .25), with adjusted proportional means of 0.80 coached and 1.27 uncoached.

Although the data in Table 2 reflect nonsignificant trends, all trends were in the predicted direction (i.e., more errors committed by uncoached students than by coached students) except for one: Coached students tended to have more unsupported hypotheses than did uncoached students. However, this measure appears to be linked to the count of missing hypotheses. Of the nine students who failed to include one or more of the key hypotheses, the most common omission was the HE hypothesis (n = 7), which was the most underrepresented hypothesis in the database with regard to supporting and disconfirming evidence. This hypothesis also accounted for five of the seven instances of unsupported hypotheses present in student diagrams. Thus, it appears many students either omitted the HE hypothesis from their diagrams or included it but left it unsupported.

Diagram quality. Based on the diagramming errors noted above, I defined a general categorical measure of final diagram quality: A student's final diagram was classified as adequate if it included (a) all four key hypotheses, (b) at least one piece of supporting evidence linked to each key hypothesis, and (c) at least one piece of contradictory evidence linked to each key hypothesis. Any diagram not meeting all three criteria was classified as inadequate. Of the 36 viable students, 20 created adequate diagrams and 16 did not. The 20 adequate diagrams were evenly divided among conditions (10 coached and 10 uncoached), and the 16 inadequate diagrams were nearly so (9 coached and 7 uncoached). Therefore, there was no obvious effect of coaching on overall diagram quality per my general definition of adequacy.
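The three adequacy criteria amount to a simple decision rule, sketched below in Python. The coding format (a mapping from each key hypothesis present in a diagram to its counts of For and Against links) and the function name are hypothetical conveniences, not the scheme actually used to code the diagrams.

```python
# The four key hypotheses: bomb, missile, mechanical failure, human error.
KEY_HYPOTHESES = ("B", "M", "MF", "HE")


def classify_diagram(link_counts):
    """Classify a final diagram as "adequate" or "inadequate".

    `link_counts` maps each key hypothesis present in the diagram to a
    (for_links, against_links) pair. ASSUMPTION: hypothetical coding format.
    """
    for hyp in KEY_HYPOTHESES:
        if hyp not in link_counts:
            return "inadequate"  # criterion (a): a key hypothesis is missing
        n_for, n_against = link_counts[hyp]
        if n_for < 1:
            return "inadequate"  # criterion (b): no supporting evidence
        if n_against < 1:
            return "inadequate"  # criterion (c): no contradictory evidence
    return "adequate"


complete = {"B": (3, 2), "M": (4, 3), "MF": (5, 1), "HE": (1, 1)}
no_he = {"B": (3, 2), "M": (4, 3), "MF": (5, 1)}
print(classify_diagram(complete))  # adequate
print(classify_diagram(no_he))     # inadequate
```

Extraneous hypotheses are simply ignored by this rule, consistent with the error coding described above.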

Distinct Coaching Effects on Diagramming

The general lack of significant between-group differences on final diagramming measures could be due to the fact that uncoached students can self-correct many of the diagramming errors flagged by the Coach (e.g., confirmation bias and unsupported hypotheses) simply by browsing the entire database and by entering and linking information as it is encountered. Indeed, the reindexed link structure of the database makes it apparent that evidence exists both for and against each hypothesis. Therefore, I sought to isolate more distinct, local reactions to coaching that might better discriminate between coached and uncoached student performance. One particular coaching rule seemed like a good candidate for this purpose, because its associated advice recommends a diagram action that average users probably would not think to do on their own. The coaching rule, attend-to-discrepant-evidence (see Figure 4), advises users to weigh the relative strength of evidence for and against a hypothesis and to modify the default neutral belief strength assigned to each linked Data box (see also Appendix B). The coaching feedback also advises users to toggle a diagram display filter ("Show Strength"), which is off by default, to show the belief levels of all constructs in the diagram. The stronger the assigned belief level, the thicker the outline of the box or link will appear with the Show Strength filter turned on, as in Figure 4. Although the option to assign a non-default belief strength is presented any time a user creates a new statement box (see Figure A3 in Appendix A) and, therefore, could become salient to observant users even without coaching, the option to activate the display filter is "hidden" within the Filters menu at the top of the Belvedere window. Therefore, unless a user were curious enough to explore the Belvedere menu options (which were not discussed in the verbal introduction), coaching on this rule would be the only way the user could find out about it.

Figure 4. Coaching that recommends a non-obvious diagramming action.
Of the 19 viable coached students, 18 received coaching on this rule at least once, and 5 of them changed the default belief level of at least one of their boxes after delivery of the advice. The number of boxes so modified by these five students ranged from 1 to 21. However, this evidence alone is hardly conclusive of positive responses to the advice, because 5 of the 17 uncoached students altered some belief levels as well, ranging from 1 to 10 belief updates per student. Nevertheless, of the 36 viable participants, the only 4 to ever activate the display filter were among the 18 who were coached on the rule. Thus, I can probably conclude that at least some of the students who were coached on the rule read the advice completely, understood it, and decided to act upon it. Interestingly, only two of the four students both activated the filter and altered belief levels; the other two only turned on the filter without updating any belief levels. It is conceivable that the latter two students chose to enact only the easier, less time-consuming part of the advice.

Coaching Effects on Web-Browsing

Of the 36 viable participants, only 17 (9 coached and 8 uncoached) browsed all 38 pages of the TWA database at least once. The other 19 students (10 coached and 9 uncoached) skipped from 1 to 16 pages each, with a mean of 4.26 skipped pages (SD = 4.01, Mdn = 3). Interestingly, within just the reduced subsample of page-skippers, coached students skipped significantly more pages (M = 6.20, SD = 4.57) than did uncoached students (M = 2.11, SD = 1.69), t(11.6) = 2.64, p < .05. Within the complete sample of 36 students, the difference between coached (M = 3.26, SD = 4.53) and uncoached (M = 1.12, SD = 1.62) page skips was almost significant, t(23.0) = 1.93, p = .07. However, I noted a strong negative correlation between number of skipped pages and SAT total score, r(33) = -.49, p < .005. Page skips also had slight negative correlations with the other two covariates, NFC (-.12) and QPA (-.05). An ANCOVA using all three covariate measures showed a significant overall effect of them, F(3, 31) = 3.67, p < .05. The covariate effects reduced the between-group difference in skipped pages for the complete sample, with adjusted group means of 2.89 for coached and 1.49 for uncoached students (F(1, 32) = 1.89, p = .18). The covariates had no significant effect for the reduced subsample.

Of the 38 total pages in the database, 26 were skipped by at least one student. The most commonly skipped page (n = 9) was a parenthetically referenced government meeting that appeared below the four possible causes in the hypothesis index (see Figure A6 in Appendix A). One coached student skipped both the MF and HE hypothesis pages, and another coached student skipped the B hypothesis page; they therefore also skipped the respective evidence sub-pages for and against these hypotheses, possibly contributing to the higher number of page skips among coached students. None of the uncoached students skipped any hypothesis pages, raising the possibility that the frequent advice may have annoyed the coached students into opting out of the problem early. Among all students who browsed each of the four hypothesis pages (17 in each condition), more skipped the indexes of evidence against them than skipped the indexes of evidence for them, for all but the HE hypothesis (respective ns were 6 vs. 1 for B, 4 vs. 0 for M, 1 vs. 0 for MF, and 1 vs. 1 for HE), illustrating the possible tendency toward confirmation bias noted in the diagram errors. However, unlike in the diagrams, this tendency did not appear to be stronger among uncoached students; of the 12 students who skipped an evidence index against at least one hypothesis, 7 were coached and 5 were uncoached.

Attitude Ratings

Ratings of Belvedere. On the end-of-session survey all participants rated six statements about their experiences with Belvedere, on the same nine-point Likert scale used for the NFC items at the end of the survey (see Appendix E). Two of the items (B2 and B5) were negatively worded to attenuate response bias, much as Cacioppo and Petty (1982) did for their NFC scale. After reverse-scoring of those items, the respective mean ratings for the six statements (B1 through B6) were 6.50, 7.41, 7.11, 6.75, 6.39, and 6.69. Median ratings ranged from 6.50 to 8.00, all above the neutral rating of 5 and all in the moderately favorable range of the scale. The mean composite Belvedere rating, defined as the arithmetic mean of the six individual statement ratings, was 6.81 (SD = 1.26, Mdn = 7).
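The composite computation described above can be sketched as follows. The example responses are hypothetical. The scale endpoints of 1 and 9 are an assumption, though one consistent with the neutral rating of 5 and with a rating of 9 reverse-scoring to 1.

```python
# Assumed endpoints of the nine-point Likert scale.
SCALE_MIN, SCALE_MAX = 1, 9
# Negatively worded Belvedere items, per the text.
REVERSED_ITEMS = {"B2", "B5"}


def reverse_score(rating: int) -> int:
    """Mirror a rating around the scale midpoint (9 -> 1, 5 -> 5, 1 -> 9)."""
    return SCALE_MIN + SCALE_MAX - rating


def composite_rating(ratings: dict) -> float:
    """Arithmetic mean of the item ratings after reverse-scoring."""
    scored = [
        reverse_score(value) if item in REVERSED_ITEMS else value
        for item, value in ratings.items()
    ]
    return sum(scored) / len(scored)


# Hypothetical responses from one student (raw, before reverse-scoring).
example = {"B1": 7, "B2": 3, "B3": 8, "B4": 6, "B5": 2, "B6": 7}
print(f"{composite_rating(example):.2f}")  # 7.17
```

After reverse-scoring, a higher value is more favorable toward Belvedere on every item, which is what makes a simple average interpretable as a composite attitude score.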

I predicted that my uncoached students would report more positive attitudes toward Belvedere than would my coached students. Although the composite Belvedere ratings of the 17 uncoached students (M = 7.07, SD = 0.74) were nearly a half point higher than those of the 19 viable coached students (M = 6.58, SD = 1.58), the difference was not significant, t(26.1) = 1.21, p = .24. However, on individual Belvedere item B3 ("Belvedere helped me keep track of the various pieces of information relevant to the problem"), uncoached students did report significantly higher ratings (M = 7.77, SD = 1.09) than did coached students (M = 6.53, SD = 2.34), t(26.1) = 2.07, p < .05. Between-group differences in mean ratings for four of the other five items were in the predicted direction (ranging from 0.13 to 0.61 in favor of uncoached students) but did not approach statistical significance. The only exception to the trend was statement B2 ("I found Belvedere to be difficult to use"), which received a very slightly less favorable rating (after reverse-scoring) from uncoached students (M = 7.41) than from coached students (M = 7.42).

Student ratings of Belvedere did not differ significantly on the basis of diagram quality (as defined earlier under Errors in Final Diagrams). Composite ratings of students with adequate diagrams (M = 7.13, SD = 0.96) tended to be higher than those of students with inadequate diagrams (M = 6.42, SD = 1.51), t(24.3) = 1.63, p = .11. Statement B2 also tended to receive higher ratings from students with adequate diagrams (M = 7.80, SD = 1.15) than from those with inadequate diagrams (M = 6.94, SD = 1.69), t(34) = 1.82, p = .08. No other differences approached statistical significance, and there were no significant interactions between diagram quality and coaching condition.

Ratings of the Coach. In addition to the six Belvedere items, students in Condition IC rated six items about the Coach, three of which were reverse-scored (see Appendix E). The 19 viable coached students' respective mean ratings for the six statements (C1 through C6) are shown in Table 3. Note the higher variability in mean and median Coach ratings in comparison to the Belvedere ratings. Note also that only two statements, C4 ("The feedback I received from the Coach was easy to understand") and the reverse-scored C5 ("The Belvedere system would be better off without the Coach"), received favorable ratings, and only slightly favorable at that. Also note the unfavorable mean composite rating (the arithmetic mean of the six Coach-related statements). None of the Coach-related mean ratings differed on the basis of diagram quality.

Table 3
Attitude Ratings of Coach-Related Statements in Experiment 1


Statistic   C1     C2     C3*    C4     C5*    C6*    Composite
M           4.26   4.47   2.26   5.90   5.47   4.11   4.41
SD          2.71   2.61   1.76   2.13   2.53   2.98   2.03
Mdn         4.00   4.00   1.00   6.00   6.00   3.00   4.00

Note. * Reverse scoring was used on this item.

Ratings correlations. As predicted, the coached students' composite Belvedere ratings were highly positively correlated with their composite Coach ratings, r(17) = .74, p < .0005 (one-tailed). Using one-tailed rejection criteria, composite ratings of the Coach were correlated with each individual Belvedere item rating at the .05 level or better, and composite Belvedere ratings were correlated at the .01 level or better with ratings of each individual Coach item except for C3 ("Often I found the feedback from the Coach to be repetitive"), r(17) = .18, p = .23. As shown in Table 3, this statement had by far the most unfavorable mean rating of the six Coach-related statements. Most of the coached students (10 out of 19) gave this statement the strongest possible agreement rating of 9, which reverse-scored to 1 as indicated by the median rating in Table 3. Given the disparity in mean ratings between C3 and the Belvedere composite, the lack of a significant correlation for this statement is not surprising. Although the second least favorable mean rating among the Coach-related statements went to C6 ("I found the Coach to be annoying"), it is reassuring to see that not all students were annoyed by the repetitive feedback of the intrusive Coach.

Verbal Summaries

Having not found as many significant between-group differences as expected on my primary dependent measures, I turned to my secondary data source, the verbal end-of-session summaries. Because the groups did not differ significantly on most diagram completeness measures, I focused on the phase of the verbal summaries that I surmised could have the highest payoff: additions to summaries following diagram redisplays. I predicted that coached students might be less likely to need to add to their summaries, because they would remember more of their diagram content from having paid closer attention to the specific referents of the frequent coaching they received. I coded the third phase of each verbal summary (the part after redisplay of the final Belvedere diagram) for the number of statements uttered and for the number of relations (For, Against, or even And) that were either stated or strongly implied. Students added 0 to 7 statements to their summaries, with a mean of 1.47 statements (SD = 1.83, Mdn = 1). As predicted, uncoached students tended to add more statements (M = 1.88, SD = 2.29) than did coached students (M = 1.11, SD = 1.24); however, this difference was not significant, t(24.1) = 1.25, p = .22. Students mentioned 0 to 6 relations in their additions to their summaries, with a mean of 1.19 relations (SD = 1.65, Mdn = 0). Uncoached students tended to add more relations (M = 1.41) than did coached students (M = 1.00), but this difference also was not significant (t < 1). Not surprisingly, both addition measures were highly intercorrelated, r(34) = .74, p < .0001.

There were significant or near significant positive correlations between the covariates and both addition measures, ranging from .24 (p = .16) to .38 (p = .02). However, ANCOVAs for both measures showed no significant differences between groups (Fs < 1), with adjusted means slightly less divergent than unadjusted means. Adjusted means for number of statements added were 1.28 for coached and 1.71 for uncoached students. Adjusted means for number of relations added were 1.13 for coached and 1.29 for uncoached students. There was a barely significant interaction of coaching condition and median-split NFC in a two-way ANOVA on number of added statements, F(1, 32) = 4.39, p = .044. Coached students added more statements to their summaries if they had a high NFC (M = 2.00) than if they had a low NFC (M = 0.27), whereas uncoached students added more statements if they had a low NFC (M = 1.57) than if they had a high NFC (M = 1.30). This evidence, although weak, may suggest a role for NFC in attention to detail in the diagramming task.

There were tendencies for students with adequate diagrams to include more statements (M = 1.80, SD = 1.99) and relations (M = 1.60, SD = 1.67) than students with inadequate diagrams (respective Ms = 1.06 and 0.69, SDs = 1.57 and 1.54). This is not surprising, given that those with inadequate diagrams had less to report in their summaries than did those with more fully developed diagrams. However, neither difference in means was statistically significant (respective ts = 1.21 and 1.69, ps = .23 and .10). There were no significant interactions between diagram quality and coaching condition.

Lag Times Following Advice Delivery

The dearth of significant between-group differences on many of my performance measures has another possible explanation: Coached students may not have been reading the advice presented to them. Indeed, many coached students admitted during debriefing that they read only the first few advice presentations and thereafter simply clicked the "Close" button any time a coaching dialog appeared. However, I was not systematic about asking coached students how often they actually read the advice. Unfortunately, there is no way for me to determine exactly how long a coaching dialog box was even displayed on the screen, much less how long the student may have been attending to it. However, I thought I could glean a rough idea of how long each coaching message was processed by measuring the elapsed log time between the action that triggered the coaching and the following logged action, be it browsing or diagramming.

This post-coaching lag time measure is not perfect, for several reasons. Firstly, due to the delay between diagram updates and advice presentations, a facile student's subsequent action may have actually preceded the coaching associated with the previous, triggering action, making it appear as if the coaching had been ignored. Secondly, if the action immediately following delivery of coaching is the creation (or textual update) of a statement box, its time-stamp corresponds to the time when the new (or updated) box was placed in the diagram, not to the time when the Add (or Edit) box dialog was opened. Therefore, lag times for such actions would be artificially inflated, especially for slow typists, and would not reflect time spent processing advice. Thirdly, students sometimes left coaching dialogs open while they moved boxes around in their diagrams or browsed the database, either for later review or to simply get the dialog "out of the way" until they finished what they were doing. In either case, a short or long lag time may not reflect time actually spent processing the coaching feedback. Finally, lag times may not properly account for downtimes during software crashes, although I scoured my notes and logs for the crashes of which I was aware and found only one instance of a crash immediately following a coaching delivery. However, these imperfections aside, I felt the lag time measure could provide at least a general idea of how much time coached students spent reading the advice.
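The lag measure itself is simple to state in code. The sketch below assumes a hypothetical log format, a time-ordered list of (timestamp in seconds, action kind, whether the action triggered coaching). The actual Belvedere and Coach log formats are not described here and may differ.

```python
def post_coaching_lags(events):
    """Return the lag in seconds from each coaching-triggering action
    to the next logged action of any kind (browsing or diagramming).

    `events` is a time-ordered list of tuples:
        (timestamp_seconds, kind, triggered_coaching)
    A triggering action with no later logged action is skipped here,
    because no lag can be computed for it.
    """
    lags = []
    for current, following in zip(events, events[1:]):
        timestamp, _kind, triggered_coaching = current
        if triggered_coaching:
            lags.append(following[0] - timestamp)
    return lags


# Hypothetical log excerpt.
log = [
    (0, "diagram", False),
    (12, "diagram", True),   # coaching delivered after this action
    (16, "browse", False),   # next action 4 s later
    (40, "diagram", True),
    (75, "diagram", True),   # final action: no following event
]
print(post_coaching_lags(log))  # [4, 35]
```

Because the measure depends only on time stamps of logged actions, it inherits every limitation listed above: it cannot distinguish time spent reading from time spent typing, rearranging boxes, or waiting for delayed advice to appear.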

For each of the 19 viable coached sessions, I measured the lag time in seconds following each advice delivery. In cases where the advice-triggering action was the final action taken by the user during the diagramming session (n = 3), the subsequent logged action was my closing of the Belvedere software after debriefing; therefore the extreme lag times for these cases (1079 to 3715 s) were omitted from analyses. The remaining 468 lag times ranged from 1 to 86 s, with a mean of 20.04 s (SD = 16.20) and a median of 15 s. The distribution of lags was positively skewed (coefficient of 1.50), with a kurtosis measure of 2.31. An alarming 14% of the lag times (67 of 468) were of 5 or fewer seconds, arguably too short a time span for most students to have read even the briefest of the Coach's feedback messages. Fully one third of the lag times (154) were 10 s or less. Visual inspection of the distribution revealed a possible bimodal characteristic to it, with modes of approximately 4 or 5 s and 14 or 15 s. This suggests the possibility that the lag times may represent two different distributions: one for advice messages that are ignored, and one for those that are read. I computed a bimodality coefficient of 0.610 using a formula from the manual of a popular statistical analysis software package (SAS Institute, 1999). Because this coefficient was somewhat higher than the criterion of 0.555 listed in the manual (the maximum value is 1.0), there is some evidence of possible bimodality in the lag time distribution. However, despite a search of several statistical references I was unable to locate an appropriate significance test.
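The reported bimodality coefficient can be reproduced from the summary statistics above. The sketch assumes the sample-size-corrected formula as documented for SAS, b = (skewness^2 + 1) / (kurtosis + 3(n - 1)^2 / ((n - 2)(n - 3))), and assumes that the reported kurtosis of 2.31 is excess kurtosis. Both assumptions are consistent with the published value of 0.610.

```python
def bimodality_coefficient(skewness: float, excess_kurtosis: float, n: int) -> float:
    """Bimodality coefficient with a finite-sample correction.

    Assumes `excess_kurtosis` is kurtosis minus 3 (0 for a normal
    distribution) and `n` is the number of observations (n > 3).
    """
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return (skewness ** 2 + 1) / (excess_kurtosis + correction)


# Summary statistics for the 468 post-coaching lag times.
b = bimodality_coefficient(skewness=1.50, excess_kurtosis=2.31, n=468)
print(f"{b:.3f}")   # 0.610
print(b > 0.555)    # True: exceeds the criterion cited in the text
```

The coefficient rises with skewness and falls with kurtosis, so a strongly skewed unimodal distribution can also exceed the criterion. This is one reason to treat the result as suggestive rather than conclusive.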

Recall that a small proportion of the advice messages (8%) were in response to user requests. While the mean lag for the 431 intrusive messages was 19.02 s (SD = 15.23), the mean lag for the 37 requested messages was a much higher 31.89 s (SD = 21.81), and despite the marked disparity in sample sizes and variances the difference was highly significant, t(39.1) = 3.52, p < .005. After factoring out lag times for the 9 null advice requests, the mean for the requested messages rose to 34.93 s (SD = 21.74) and differed even more significantly from the intrusive mean (t(28.7) = 3.81, p < .001), despite an even smaller sample size. These findings suggest that students may have spent significantly more time processing requested advice than the intrusive advice that was thrust upon them.
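The two comparisons above can be checked from the reported means, standard deviations, and sample sizes alone. The sketch below is a minimal unequal-variance t test with Satterthwaite's approximation for the degrees of freedom. It is an illustrative recomputation, not the analysis script used for the study.

```python
import math


def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """Unequal-variance t statistic and Satterthwaite degrees of freedom,
    computed from summary statistics of two independent groups."""
    v1 = sd1 ** 2 / n1   # squared standard error of group 1 mean
    v2 = sd2 ** 2 / n2   # squared standard error of group 2 mean
    t = (mean1 - mean2) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


# All 37 requested messages vs. the 431 intrusive messages.
t, df = welch_t(31.89, 21.81, 37, 19.02, 15.23, 431)
print(f"t({df:.1f}) = {t:.2f}")   # t(39.1) = 3.52

# After removing the 9 null advice requests (28 requested messages remain).
t2, df2 = welch_t(34.93, 21.74, 28, 19.02, 15.23, 431)
print(f"t({df2:.1f}) = {t2:.2f}")  # t(28.7) = 3.81
```

Because the requested-message group is both much smaller and more variable, its standard error dominates the denominator, and the adjusted degrees of freedom fall close to that group's own n - 1.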

Summary

The intrusive Coach presented advice frequently in this experiment, with the average student in Condition IC receiving feedback from the Coach every minute and a half or after every other diagramming action. On the error count measures, intrusive coaching appeared to have many effects in the predicted direction, although many of them were not statistically significant. Some unexpected trends were noted as well, but they were also nonsignificant. Unpredicted significant findings were that coached students had fewer And links and skipped more pages of the web database. Local reactions to a unique advice message were noted, showing that at least some students responded positively to coaching. However, lag time analyses raised the possibility that students may not have been attending to much of the coaching feedback, especially the feedback that was presented intrusively (which accounted for 92% of all coaching presented). I revisit this issue in Experiment 2.


Experiment 2

Experiment 1 answered some questions about immediate, intrusive coaching in Belvedere, but it raised several others. Many of the findings were in line with my predictions, but most of them were limited to nonsignificant trends, with coaching effects possibly attenuated by the issue of whether coached students were actually reading the advice. Nevertheless, having found hints of at least some of the predicted effects of coaching in Experiment 1, my goal for Experiment 2 was to compare (with somewhat larger sample sizes) the relative costs and benefits of two different approaches to providing that feedback: immediately and under the system's control, versus delayed and under the user's control. I expected the affective costs to be lower in the user-controlled condition, a prediction strengthened by the attitude ratings of intrusively coached students in Experiment 1. However, any benefits to be gained from more frequent coaching would be predicated on the users actually processing the advice. Therefore, to the extent that intrusively coached students read more of the advice presented to them than was requested by those in the on-demand condition, I continued to predict greater performance benefits in the intrusive condition.

The latest existing version of the on-demand Coach was too dissimilar from the intrusive version I used in Experiment 1; for example, it lacked the delay factors that were added during early piloting with the intrusive Coach. In order to ensure that both Coaches for Experiment 2 would be as similar as possible, I modified the LISP source code for the intrusive Coach used in Experiment 1 and created a separate, nonintrusive version for the on-demand condition. The new on-demand Coach was identical to the intrusive Coach except for the following: (a) it presented advice only when the user requested it by clicking on the light bulb icon (or on the Next Idea button after an initial advice request); and (b) the light bulb would blink when the Coach had pending advice that was deemed important enough to warrant a minimal intrusion, as described in my Introduction under the section on Intrusive Coaching.

To help counteract the old problem of on-demand Coach users never asking for coaching, I used a periodic reminder prompt in the on-demand condition of this experiment. The prompt, a series of audible beeps, was chosen so as to be as minimally intrusive as possible. To that end, I controlled the prompt signals myself, requiring only brief verbal acknowledgments from the participants. As for when and how often to issue the prompts, I decided to use a time-based rather than an event-based criterion, for two reasons. First, because users vary widely in the speed with which they perform diagramming actions in Belvedere (e.g., some are more facile with the keyboard and mouse than others), during diagramming sessions of equal length faster users would log many more diagram events than would slower users, thereby inflating the relative frequency with which they would receive the reminder prompts under an event-based criterion. A uniform time-based criterion seemed more consistent with the goal of my on-demand condition, which was to see if coaching could be helpful without being as annoying as it was in the intrusive condition. Second, while running Experiment 1 (as well as earlier pilot studies) I noted that users often pause from diagramming activity for up to several minutes, while engaged in browsing or in reviewing the current state of their diagrams. These pauses often occur later in the sessions, after users have explored the database and have generated a partial diagram, at points when (based on their comments) users are unsure of what to do next -- points at which coaching could be helpful to them. Under a time-based criterion, reminder prompts could sound during such pauses, whereas they would never sound during such "idle" times under an event-based criterion. I set the length of time between prompts to be 3 min, approximately twice the average inter-coaching time interval in Condition IC of Experiment 1 (91.95 s).
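The first argument for a time-based criterion can be illustrated with a small sketch. Only the 3-min interval comes from the text. The session length, the event counts, and the `events_per_prompt` threshold are hypothetical values chosen for illustration.

```python
PROMPT_INTERVAL_S = 180  # 3 min between reminder prompts, as in the text


def time_based_prompts(session_seconds: int) -> int:
    """Number of prompts issued when one sounds every PROMPT_INTERVAL_S."""
    return session_seconds // PROMPT_INTERVAL_S


def event_based_prompts(n_events: int, events_per_prompt: int) -> int:
    """Number of prompts issued when one sounds every `events_per_prompt`
    diagramming events (a hypothetical alternative criterion)."""
    return n_events // events_per_prompt


session = 45 * 60                      # hypothetical 45-min diagramming session
fast_events, slow_events = 120, 40     # hypothetical event counts for two users

# Time-based: both users hear the same number of prompts.
print(time_based_prompts(session), time_based_prompts(session))            # 15 15
# Event-based: the faster user hears three times as many prompts.
print(event_based_prompts(fast_events, 4), event_based_prompts(slow_events, 4))  # 30 10
```

An event-based rule also never fires while the event count is unchanged, which is the second argument above: no prompt would sound during long pauses in diagramming.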

Method

Experiment 2 used the same method as Experiment 1, with the following modifications:

Participants

Participants were 46 undergraduate students from the University of Pittsburgh (32 males and 14 females), again recruited from the Introductory Psychology research participation pool. All participants except one were fluent in English, and all had normal or corrected vision.

Design

Participants were block-randomly assigned to either Condition IC (intrusive coaching) or Condition DC (on-demand coaching). Condition IC was identical to that of Experiment 1. In Condition DC, Belvedere's automated Coach was available on demand but it never intervened on its own; however, its light bulb icon would blink when it had important advice to deliver. The Belvedere diagramming interface was otherwise identical for both conditions.

Apparatus

I used a digital watch with a repeating 3-minute countdown timer to issue the beeping reminder prompts to participants in Condition DC.

Procedure

Sufficiently in advance of each scheduled session, I ensured that the Coach LISP process corresponding to the participant's assigned condition was running on my workstation. My verbal instructions to participants in Condition IC were the same as in Experiment 1. Instructions for Condition DC were the same as those for IC with the following exception: After telling the participant about the online Coach, I told her that the Coach is available only on demand by clicking the light bulb icon. I then told her that periodically she would hear some beeping sounds coming from my desk, that these beeps were simply to remind her that the Coach was available whenever she wanted it (i.e., she was not compelled to ask for coaching when she heard them), and that she should simply acknowledge hearing them. In both conditions, I then explained the appearance of the Coach's feedback as in Experiment 1 (see Appendix D for my run script).

In addition, whereas I intervened only to help the Condition IC participants with hardware or software problems (as in Experiment 1), in Condition DC I also intervened as needed to ask whether the participants heard the reminder beeps (i.e., if they did not acknowledge hearing them on their own).

Also, whereas in Experiment 1 the end-of-session survey included attitude rating items about the Coach only for participants in Condition IC, in this experiment participants in both conditions received identical surveys with the complete ratings battery.

Results and Discussion

Data Limitations

Data omissions. Data from one participant were omitted because (a) by mistake the feedback he received during his session was from an older, sufficiently different version of the on-demand coach (without the delay factor on rule activations); and (b) his coaching log file was accidentally overwritten. Data from another participant were omitted because the Coach software failed during his session (after 25 of 43 diagramming actions). The participant, who was in Condition IC, created an unusual And-link construction in his diagram that the Coach was not equipped to parse, causing the Coach LISP process to abort even after multiple restart attempts. Therefore, the participant received no coaching feedback on any of his final 18 diagramming actions. Data from a third participant were omitted for multiple reasons: (a) At the end of his diagramming session he admitted to knowing much more about the TWA 800 crash than was available in the online database, from having watched CNN reports and even a Discovery Channel special about the crash; (b) his prior knowledge strongly influenced his problem solving during the session, as indicated not only by his survey responses but also by the fact that his diagram and his browsing history considered only the most likely cause named in the NTSB's final report (see Footnote 6); (c) a screen display problem prevented me from conducting the third stage of his verbal summary; and (d) it is likely he did not meet the screening criterion of English fluency.[15] After these three omissions, data from 43 participants (22 in Condition IC and 21 in Condition DC) remained.

Covariate data. I was granted access to official SAT scores of all participants; however, three students had neither SAT nor ACT scores on record. Of these three students, two reported that they did not remember their SAT scores and the third reported imprecise estimates. There were three other students who had composite ACT scores in lieu of SAT subscores; I determined their equivalent SAT total scores as in Experiment 1. Therefore, for the 43 viable participants I was able to record only 37 SAT Math and Verbal subscores and 40 SAT total scores. One student denied me access to his GPA, leaving me with 42 accessible GPA figures. All 43 viable participants had usable NFC scale data.

Attitude ratings. There were two students in Condition DC who never asked for any coaching. During debriefing both students reported having given a neutral (5) rating to each of the six Coach-related statements on the end-of-session survey, for lack of a better option (e.g., a not applicable response). To best reflect actual student attitudes toward the Coach, I omitted the Coach-related statement ratings of these two students from the attitude ratings analyses below.

Downtimes due to software failures. For reasons unknown, software crashes on the student's computer were much more prevalent in Experiment 2 than in Experiment 1, affecting 13 of the 43 viable sessions. Three of the affected sessions were plagued by multiple crashes. In all but one case (an Internet Explorer crash when I tried to launch it for the survey), crashes occurred during the diagramming phases of the sessions and involved Belvedere, Netscape, or both. Many of the software failures required me to reboot the student's computer. Fortunately, in all cases I was able to restore both the student's most recently browsed web page in Netscape and the student's Belvedere diagram in its entirety (or, in the worst case, the diagram state immediately preceding the software crash). Therefore, as detailed below, the impact of these failures seems to have been limited to inflated session times as in Experiment 1. Total downtimes (for one or more crashes) for the 13 affected sessions ranged from 47 s to 13.15 min, with a mean downtime of 4.40 min (SD = 3.47, Mdn = 3.90). Software crashes occurred almost equally often in both conditions (during seven IC and six DC sessions). Mean downtime did not differ significantly between IC sessions (M = 4.48, SD = 2.72) and DC sessions (M = 4.30, SD = 4.48).

Statistical Notes

As in Experiment 1, an alpha level of .05 was used for all statistical tests, and any degrees of freedom expressed as a decimal number represent an adjustment for unequal variances using Satterthwaite's approximation (as shown in Snedecor & Cochran, 1980).

Covariates: Descriptive Statistics

NFC scale. As in Experiment 1, each participant's NFC item ratings were averaged into a single composite measure after reverse scoring of the nine applicable items (Cacioppo et al., 1984). The 43 NFC composite scores ranged from 4.11 to 8.17 with a mean of 6.48 (SD = 0.93, Mdn = 6.56), indicating a slightly higher average need for cognition than the students in Experiment 1. The mean NFC score was slightly but not significantly higher for IC students (M = 6.53) than for DC students (M = 6.43).

GPA. Student GPAs (at the end of the term during which the experiment was conducted) ranged from 1.29 to 3.63, with a mean of 2.77 (SD = 0.54, Mdn = 2.86), somewhat lower than in Experiment 1. There was no significant difference between mean GPAs of IC (2.73) and DC (2.82) students.

SAT scores. As in Experiment 1, for each student I recorded both most recent and highest SAT subscores, if different. Math subscores ranged from 400 to 800 with a mean of 586 (SD = 94, Mdn = 590), and Verbal subscores ranged from 350 to 770 with a mean of 582 (SD = 81, Mdn = 600). The mean total SAT score including the three equated ACT scores was 1163 (n = 40), 5 points less than the mean sum of subscores. The means of the students' highest Math and Verbal subscores were 595 and 594, respectively. Means for IC students were slightly but nonsignificantly higher than those for DC students on all SAT measures (see Table 4). These scores were somewhat higher on average than those in Experiment 1.

Table 4
Mean SAT Scores by Condition in Experiment 2

Condition        SAT-M   SAT-V   SATtotal^a   HiSATM   HiSATV
IC        M       593     583      1176         603      596
          SD      113      93       191         108       75
DC        M       577     581      1148         584      593
          SD       62      64       113          62       61

Note. Means did not differ significantly between groups. IC = intrusive coaching; DC = on-demand coaching.
^a n = 40 after converting ACT scores of three students.

Correlations among covariates. Using a two-tailed rejection criterion, analyses of the 36 students for whom all subscores were available showed a marginal positive correlation of NFC with SAT Verbal subscores (r = .26, p = .12). SAT Math and Verbal subscores were positively correlated with each other (r = .66, p < .0001). No other correlations among the covariate measures approached statistical significance.

Median splits. I again performed median splits using each of the three covariate measures, for use in two-way ANOVAs on my dependent measures. Any significant interactions revealed by these analyses are reported throughout. Note that respective median-split sample sizes for low and high NFC were 21 and 22, but for low and high GPA and SAT totals they were 22 and 21. F-ratios for these analyses were approximate due to unequal sample sizes, so again, any interactions significant at or near the .05 level should be interpreted with caution.

Session Durations

Approximate total session duration (rounded to the nearest 5-minute increment and including all crash downtimes) ranged from 45 to 120 min, with a mean of 80.00 min (SD = 20.41, Mdn = 80). Subtraction of all downtimes reduced the overall mean session duration to 78.67 min. The IC sessions lasted significantly longer (M = 85.39, SD = 21.67) than did DC sessions (M = 71.63, SD = 15.91), t(41) = 2.36, p < .05. Diagramming durations (discounting the downtimes), using the same definition as in Experiment 1, ranged from 19.93 to 92.02 min, with a mean of 49.49 min (SD = 18.58, Mdn = 46.45). Diagram sessions of IC students were significantly longer in duration (M = 56.14, SD = 20.23) than were those of DC students (M = 42.53, SD = 14.00), t(41) = 2.55, p < .05. Having ruled out crash downtimes as a possible cause for these duration differences, I investigated the relative frequencies of coaching in the two conditions.
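The two duration comparisons above can be rechecked from the reported summary statistics alone. The following is a minimal sketch, assuming an equal-variance (pooled) two-sample t test and group sizes of 22 (IC) and 21 (DC); the function name `pooled_t` is illustrative and is not part of the original analysis, which was presumably run in a statistics package.

```python
import math

def pooled_t(m1, sd1, n1, m2, sd2, n2):
    """Equal-variance two-sample t statistic and df from summary statistics."""
    df = n1 + n2 - 2
    # Pooled variance: df-weighted average of the two group variances.
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / df
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (m1 - m2) / se, df

# Assumed group sizes: 22 IC and 21 DC viable sessions.
t_session, df_session = pooled_t(85.39, 21.67, 22, 71.63, 15.91, 21)
t_diagram, df_diagram = pooled_t(56.14, 20.23, 22, 42.53, 14.00, 21)

print(f"session duration:     t({df_session}) = {t_session:.2f}")
print(f"diagramming duration: t({df_diagram}) = {t_diagram:.2f}")
```

Under these assumptions, both statistics agree with the reported values to two decimal places.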

Amount and Frequency of Coaching

The 43 viable students received a total of 812 messages from the Coach, 678 in Condition IC and 134 in Condition DC. Of the 678 IC messages, 623 (92%) were presented intrusively and 55 (8%) were presented upon request, in the same relative proportions as in Experiment 1. Of the 55 requested IC messages, 24 (44%) were null advice messages and 31 were substantive (all 623 intrusive messages were substantive). Of the 134 DC messages, all of which were requested, 96 (72%) were substantive and 38 (28%) were null advice messages.

The total number of substantive coaching messages displayed to each IC student ranged from 10 to 68, with a mean of 29.73 (SD = 14.31, Mdn = 30.5). Each DC student received 0 to 15 substantive messages, with a mean of 4.57 (SD = 3.88, Mdn = 4). The difference in means was significant, t(24.2) = 7.94, p < .0001. Therefore, not surprisingly, students requested coaching much less frequently than it was presented in the intrusive condition. Each IC student received 0 to 10 null coaching messages (upon request) with a mean of 1.09 per student (SD = 2.39, Mdn = 0), while each DC student received 0 to 6 null messages with a mean of 1.81 (SD = 1.83, Mdn = 1). These means did not differ significantly, t(39.2) = 1.11, p = .27.
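The decimal degrees of freedom in this paragraph come from the unequal-variance adjustment described in the Statistical Notes. As a hedged illustration, the sketch below computes the unequal-variance t statistic with Satterthwaite's approximate df from the reported means and standard deviations. The function name `welch_t` and the group sizes (22 IC, 21 DC) are assumptions for the example, not a record of the original procedure.

```python
import math

def welch_t(m1, sd1, n1, m2, sd2, n2):
    """Unequal-variance t statistic with Satterthwaite's approximate df."""
    v1, v2 = sd1**2 / n1, sd2**2 / n2      # squared standard errors per group
    t = (m1 - m2) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    return t, df

# Substantive messages per student: IC vs. DC.
t_subst, df_subst = welch_t(29.73, 14.31, 22, 4.57, 3.88, 21)
# Null messages per student: IC vs. DC (the sign only reflects group order).
t_null, df_null = welch_t(1.09, 2.39, 22, 1.81, 1.83, 21)

print(f"substantive: t({df_subst:.1f}) = {t_subst:.2f}")
print(f"null:        t({df_null:.1f}) = {abs(t_null):.2f}")
```

Because the inputs are rounded summary values, the results can differ from the reported statistics in the last decimal place.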

Completeness of Final Diagrams

Total element count. As in Experiment 1, I computed some gross overall measures of diagram completeness. The total number of boxes in each final diagram ranged from 10 to 33 with a mean of 17.19 (SD = 5.63, Mdn = 16). Total number of links ranged from 7 to 45 with a mean of 21.21 (SD = 10.59, Mdn = 19). Although IC diagrams tended to have slightly fewer boxes than DC diagrams (Ms = 17.09 and 17.29, respectively; t < 1), they tended to have more total links than DC diagrams (Ms = 23.09 and 19.24, respectively; t(41) = 1.20, p = .24), consistent with my expectations.

Box types. The number of Hypothesis boxes per final diagram ranged from 2 to 10, with a mean of 4.47 and median of 4 (SD = 1.84). There was no significant difference between IC and DC means (4.27 and 4.67, respectively). The number of Data boxes per final diagram ranged from 6 to 23, with a mean of 11.79 (SD = 4.34, Mdn = 11). Respective means of IC and DC students (12.09 and 11.48) did not differ significantly. The number of Unspecified boxes per final diagram ranged from 0 to 7, with a mean of 0.93 (SD = 1.52, Mdn = 0). Although IC diagrams tended to have fewer Unspecified boxes than did DC diagrams (Ms = 0.73 and 1.14, respectively), consistent with my expectations, the difference was not significant (t < 1). The total number of unlinked boxes per final diagram, regardless of box type, ranged from 0 to 3, with a mean of 0.28 (SD = 0.70, Mdn = 0). Respective means of IC and DC students (0.27 and 0.29) did not differ significantly. Of the 12 total boxes left unlinked by the 43 viable students, 3 were Unspecified boxes, 3 were unique Data boxes, 4 were unique Hypothesis boxes, and the other 2 were duplicate boxes (1 Data and 1 Hypothesis) entered by one student in a large diagram (scrolling was required to see both copies of each box). I revisit the issue of unlinked boxes in my discussion of diagram errors below.

Link types. The number of For links per final diagram ranged from 1 to 30, with a mean of 11.81 (SD = 6.97, Mdn = 12). There was no significant difference between IC and DC means (13.23 and 10.33, respectively; t(41) = 1.38, p = .18), but the trend was in line with my predictions. The number of Against links per final diagram ranged from 1 to 22, with a mean of 8.16 (SD = 4.57) and a median of 7. There was no significant difference between IC and DC means (8.46 and 7.86, respectively), but this trend was predicted as well. The number of And links per final diagram ranged from 0 to 7, with a mean of 1.23 (SD = 1.67, Mdn = 1). Unlike in Experiment 1 there was no significant difference between groups, and the IC mean (1.41) was slightly higher than the DC mean (1.05). However, feedback on the coaching rule that targets And links (conjunct-for-hypothesis?) was presented to users in both conditions (10 in IC and 9 in DC), so the nonsignificant difference is not surprising.

Correlations with covariates. The only significant two-tailed correlation between my gross diagram completeness measures and my covariate measures was between number of And links and SAT Math subscore (r(35) = .36, p < .05). SAT total scores, with the larger sample size (n = 40), showed only marginal positive correlations with number of And links (r = .28, p = .07) and number of Data boxes (r = .26, p = .10). An ANCOVA on number of And links showed only a marginal effect of SAT totals (t = 1.78, p = .08), with no significant difference between adjusted means (1.37 for IC and 1.09 for DC). An ANOVA showed significant interactions of coaching condition and median-split GPA on number of Data boxes (F(1, 39) = 4.91, p = .033) and on number of Against links (F(1, 39) = 5.20, p = .028). DC students with low GPAs tended to have fewer Data boxes (M = 11.11) than those with high GPAs (M = 11.75), whereas IC students with low GPAs had more Data boxes (M = 13.92) than those with high GPAs (M = 9.44). Similarly, DC students with low GPAs had fewer Against links (M = 6.56) than those with high GPAs (M = 8.83), whereas IC students with low GPAs had more Against links (M = 9.85) than those with high GPAs (M = 6.44). There was also a significant interaction of coaching condition and median-split SAT total on the number of And links, F(1, 39) = 8.76, p = .005. IC students with high SATs had more And links (M = 2.27) than those with low SATs (M = 0.55), but DC students with high SATs had fewer And links (M = 0.50) than those with low SATs (M = 1.55).

Expert Diagram Comparisons

I compared student diagrams to the same expert diagram used for Experiment 1 (refer back to Figure 3). Once again, some of the students who included more than the four key hypotheses in their diagrams included the "Sabotage" and "Accident" group headings from the hypothesis index as well (see Figure A6 in Appendix A). As in Experiment 1, some of the other extraneous hypotheses included by students were statements by witnesses or investigators, and some students entered some specific conjectures of their own as well. Expert diagram Data statements that did not appear in most student diagrams included the regularly scheduled flight to Paris (n = 1), the flight bound from JFK airport to France (n = 3), the 1997 NTSB statement about remaining possible causes (n = 2), the service history of the plane (n = 7), and the bomb-related statements about altitude (n = 6). The other two low-frequency Data statements from Experiment 1 were included with greater frequency in Experiment 2: the crash in Colombia due to pilot error (n = 11) and the statement regarding the split-second noise on the flight recorder (n = 11).

Errors in Final Diagrams

As in Experiment 1, I counted instances of uncorrected diagramming errors in final student diagrams, disregarding any extraneous hypothesis boxes not present in the expert diagram. I coded final diagrams for the following errors: (a) the number of missing hypotheses, (b) the number of hypotheses subject to confirmation bias (no Against links), (c) the number of unsupported hypotheses (no For links), (d) the number of unique hypotheses without any links at all, and (e) the number of unique data without any links. Error counts per condition are shown in Table 5 along with total errors per condition, with student subsample sizes shown in parentheses.

Table 5
Final Diagram Error Counts by Condition in Experiment 2

             Hyps      Hyps     Hyps^a     Hyps      Data      Total
Condition    Missing   C.Bias   Unsupp.    NoLinks   NoLinks   Errors
IC           10 (7)    8 (8)    4 (4)      1 (1)     0 (0)     23 (11)
DC           7 (6)     5 (4)    15 (9)     1 (1)     3 (2)     31 (17)

Note. Condition subsample sizes appear in parentheses. IC = intrusive coaching; DC = on-demand coaching.
^a Proportional means are significantly different (p < .05).

The mean proportion of unsupported hypotheses in final IC diagrams (0.18) was significantly lower than the corresponding proportion in DC diagrams (0.71), t(26.4) = 2.37, p < .05. There were no significant differences between groups on any other proportional error counts. The only correlations with covariates that approached statistical significance were negative correlations of SAT total scores (n = 40) with number of missing hypotheses and with total number of errors (rs = -.29, ps = .07). However, ANCOVAs showed only marginal effects of SAT totals on each proportional error count (respective ps = .06 and .10), with no significant differences in adjusted means (Fs < 1). Two-way ANOVAs showed significant interactions of coaching condition with median-split NFC on number of unsupported hypotheses (F(1, 39) = 6.41, p = .016) and with median-split GPA on total number of errors (F(1, 39) = 5.79, p = .021). DC students with low NFC left more hypotheses unsupported (M = 1.00) than those with high NFC (M = 0.33), whereas IC students left hypotheses unsupported only in the high-NFC group (M = 0.31); however, note the small sample sizes. As for total errors, DC students had more if they had low GPAs (M = 2.00) than if they had high GPAs (M = 1.08), whereas the pattern was reversed for IC students (with respective means of 0.77 and 1.44).

As in Experiment 1, the most popular omitted hypothesis was HE (n = 9). Nine other students included it but left it unsupported; however, eight of these nine students were in Condition DC. The next most popular hypothesis to leave unsupported was MF (n = 6, 4 in DC and 2 in IC). These patterns help to explain the significant difference between groups on proportion of unsupported hypotheses. The most popular hypotheses subjected to confirmation bias were M and B (respective ns = 7 and 5), with only one student showing the bias on HE. No students showed confirmation bias on MF.

Diagram quality. Using the same definition of diagram quality as in Experiment 1, I found more inadequate final diagrams (27) than adequate ones (16) among the 43 viable students. The 22 IC student diagrams were evenly divided (11 adequate and 11 inadequate), but in the DC condition inadequate diagrams far outnumbered adequate ones (16 vs. 5, respectively). A chi-square analysis showed this apparent non-homogeneity between conditions to be almost statistically significant (chi-square(1, N = 43) = 3.15, p = .08), suggesting a possible advantage of more frequent coaching for overall diagram quality.
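The chi-square statistic for the 2 x 2 table of condition by diagram quality can be recomputed from the four cell counts given above. This is a sketch under two stated assumptions: that the test was a Pearson chi-square without a continuity correction, and that the p value is taken from a chi-square distribution with 1 df. The function name `chi_square_2x2` is illustrative.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # With 1 df, the upper-tail probability equals erfc(sqrt(chi2 / 2)).
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p

# Rows: IC, DC. Columns: adequate, inadequate.
chi2, p = chi_square_2x2(11, 11, 5, 16)
print(f"chi-square(1, N = 43) = {chi2:.2f}, p = {p:.2f}")
```

An uncorrected test is assumed because it matches the reported 3.15; a continuity-corrected test would give a smaller statistic.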

Distinct Coaching Effects on Diagramming

I once again examined student reactions (or lack thereof) to the coaching rule that elicits the most distinct diagramming response, attend-to-discrepant-evidence (refer back to Figure 4). Recall that coaching on this rule advises users to modify the default belief strengths of selected diagram constructs and to activate a display filter that is hidden within one of Belvedere's pull-down menus. Of the 43 viable students, 30 received coaching on this rule at least once (20 in IC and 10 in DC). Of these 30 students, 16 changed the default belief level of at least one of their boxes after delivery of the advice (10 in IC and 6 in DC), but only 2 (both in IC) turned on the display filter as well.[16] The number of boxes modified by IC students ranged from 1 to 16, while the number modified by DC students ranged from 1 to 8, probably because the rule was triggered more frequently in Condition IC (up to 10 times per session) than in Condition DC (only once per session).

Coaching Effects on Web-Browsing

Of the 43 viable participants, only 14 (7 in IC and 7 in DC) browsed all 38 pages of the TWA database at least once. The other 29 students (15 in IC and 14 in DC) skipped from 1 to 10 pages each, with a mean of 3.38 (SD = 2.47, Mdn = 3). In this experiment, respective means of the more coached (IC) and less coached (DC) groups did not differ significantly in either the reduced sample of page-skippers (3.53 vs. 3.21) or the complete sample (2.41 vs. 2.14), ts < 1.

Of the 38 total pages in the database, 27 were skipped by at least one student. The most commonly skipped page (n = 12) was the same as in Experiment 1, the government meeting mentioned almost in passing at the bottom of the hypothesis index (see Figure A6 in Appendix A). Three IC students skipped the HE hypothesis page and one DC student skipped the M hypothesis page (and thus their respective data indexes as well). Among all students who browsed each of the four hypothesis pages (19 in IC and 20 in DC), more skipped the indexes of evidence against them than skipped the indexes of evidence for them, for all but the MF hypothesis (respective ns were 8 vs. 1 for B, 7 vs. 2 for M, 0 vs. 0 for MF, and 3 vs. 0 for HE). As in Experiment 1, this browsing pattern helps to explain the tendency toward confirmation bias noted in the diagram errors, especially for the B and M hypotheses. However, this tendency appeared to be more equally distributed in browsing than in diagramming; of the 17 viable students who skipped an evidence index against at least one hypothesis, 9 were IC students and 8 were DC students.

Timing of Coaching Requests

Coaching reminder prompts. During the 21 viable DC sessions, the number of reminder prompts (discounting those that sounded during crash downtimes) ranged from 6 to 26 with a mean of 12.76 prompts per student (SD = 4.63, Mdn = 13). The number of prompts that were followed by a student request for coaching (within 1 min of the prompt) ranged from 0 to 11 with a mean of 3.33 prompts per student (SD = 2.96, Mdn = 3). In other words, within a minute after a reminder prompt, the average student requested advice 26% of the time. Only three DC students never followed up on a reminder prompt; two never requested coaching at all, and the other requested coaching only once, midway between two prompts. Prompted coaching requests resulted in null advice 0 to 4 times per student, with a mean of 0.90 (SD = 1.04), a mode of 0 (n = 9), and a median of 1 (n = 7). Advice requests following reminder prompts accounted for 70 of the 134 total (52%) DC advice requests. Half of the null advice messages received by DC students (19 of 38) followed a reminder prompt.

Not surprisingly, the number of prompted requests for advice was positively correlated with the number of reminder prompts, r(19) = .44, p < .05 (two-tailed). However, I also noted an even stronger positive correlation between the number of prompted requests and the number of null advice results (r = .77, p < .00005). Although the frequency of null results would naturally be lower for students who requested advice less often, I wondered whether the occurrence of null results would make students less willing to follow up on the reminder prompts. I regressed the number of prompted requests simultaneously on both the number of reminder prompts and the number of null advice results. The overall test was significant (F(2, 18) = 18.33, p < .0001, R2 = .67). Number of null results had a significant effect (B = 2.01, t = 5.12, p = .0001) and number of prompts had a nearly significant effect (B = 0.18, t = 2.01, p = .06). However, I have insufficient basis to claim that null results caused DC students to disregard their reminder prompts. In fact, the student who received the most null results (4) also made the most prompted coaching requests (11), both in absolute terms and proportionally to his number of reminder prompts (i.e., he followed up on 79% of 14 prompts).
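The correlation and regression statistics in this paragraph are linked by standard identities, which allows a rough consistency check from the reported values. The sketch below assumes n = 21 DC students and two predictors. The helper names `t_from_r` and `f_from_r2` are hypothetical and are not taken from the original analysis.

```python
import math

def t_from_r(r, df):
    """t statistic for testing a Pearson correlation against zero (df = n - 2)."""
    return r * math.sqrt(df / (1 - r**2))

def f_from_r2(r2, n, k):
    """Overall F for a regression with k predictors, n cases, and R-squared r2."""
    df1, df2 = k, n - k - 1
    return (r2 / df1) / ((1 - r2) / df2), df1, df2

# Prompted requests vs. number of reminder prompts, r(19) = .44.
t_r = t_from_r(0.44, 19)
# Two-predictor regression on 21 DC students with the reported fit of .67.
f_val, df1, df2 = f_from_r2(0.67, 21, 2)

print(f"t(19) = {t_r:.2f}")                # compare with the critical value 2.093
print(f"F({df1}, {df2}) = {f_val:.2f}")    # reported: 18.33
```

The recomputed F is slightly below the reported 18.33 because the squared multiple correlation is available only to two decimal places.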

Light bulb blinks. As mentioned in my Introduction, a minimally intrusive coaching feature had been added to Belvedere (beginning with version 2.0), by which the light bulb icon in the palette would slowly flash on and off (four times) when the Coach had some particularly important new advice to offer the user. Even though most pilot users never noticed the flashing, I decided to leave it enabled for the DC condition[17] in my experiment for its potential as an additional prompt to seek advice. Each such series of four flashes is hereafter counted as a single blink. The number of times such blinks occurred during each of the 21 viable DC sessions ranged from 0 to 25, with a mean of 8.95 blinks per session (SD = 6.51, Mdn = 9). During two sessions, the bulb never flashed because none of the crucial advice rules (see Appendix B) ever applied. For each session, the number of blinks was uncorrelated with the number of coaching requests, r(19) = -.06, p = .79. Over all DC sessions there were 188 blinks, but only 27 of them (14%) were followed immediately by a coaching request. Of the 19 students for whom the bulb ever blinked at all, 12 requested coaching immediately afterwards at least once. However, most of them (n = 8) responded to the blinks only once or twice. Across their 12 sessions there was a mean of 11.50 blinks per session (SD = 3.83, Mdn = 10.5), but only 1 to 5 immediate coaching requests, with a mean of 2.25 requests per session (SD = 1.14, Mdn = 2).

One question to ask of these data is how many of the students actually noticed the blinking. Unfortunately, during session debriefings I was not as systematic as I should have been in asking whether students ever noticed the bulb blinking. As a result, I lack answers to this question for 7 of the 19 viable DC students for whom the bulb blinked at least once. However, of the other 12 students, 7[18] reported having noticed them and 5 reported not noticing them. Of the 7 students for whom I did not record a response, only 2 never requested coaching immediately after a bulb blink, so it is possible that the other 5 did notice them. It is also possible that their advice requests were coincidental to the blinks, or that they followed reminder prompts instead of blinks. However, because the clock used to time-stamp the bulb blinks in the log was not perfectly synchronized to the watch used to deliver the reminder prompts, and because the bulb blinks often began after a delay similar to that of the advice presentations, any attempt to synchronize the reminder prompts with the bulb blinks would be difficult at best.

Attitude Ratings

Ratings of Belvedere. On the end-of-session survey (see Appendix E), respective mean ratings of the six Belvedere statements B1 through B6 (after reverse-scoring of items B2 and B5) were 5.81, 6.95, 6.93, 6.98, 6.09, and 6.23, each with a median rating of either 6 or 7 on the nine-point scale. The mean composite rating was 6.50 (SD = 1.22, Mdn = 6.67), somewhat less favorable than in Experiment 1. Once again I predicted that less frequently coached (DC) students would report more positive attitudes toward Belvedere than would IC students. I did find a marginally significant difference in mean composite ratings between the two conditions, with the DC students (M = 6.85, SD = 0.94) giving more favorable ratings on average than IC students (M = 6.17, SD = 1.37), t(41) = 1.90, p = .065. Group mean ratings for the individual statements are shown in Table 6. DC students gave a significantly more favorable rating than IC students (after reverse-scoring) to statement B5 ("It would have been easier for me to work on the assigned problem without using Belvedere"), t(34.6) = 3.84, p = .0005. The DC-favored difference in means for statement B4 ("Belvedere would be helpful in collecting and organizing information for a paper or report") was marginal, t(41) = 1.67, p = .10. All other differences were also in the predicted direction, but none of them approached statistical significance.

Table 6
Belvedere Attitude Ratings by Condition in Experiment 2

Cond.        B1     *B2    B3     B4^a   *B5^b  B6     Composite^c
IC    M      5.64   6.91   6.68   6.59   5.18   6.00   6.18
      SD     1.65   1.80   1.84   1.79   1.94   1.66   1.37
DC    M      6.00   7.00   7.19   7.38   7.05   6.48   6.85
      SD     1.18   1.87   1.08   1.24   1.16   1.37   0.94

Note. * Reverse scoring was used on this item. IC = intrusive coaching; DC = on-demand coaching.
^a Means are marginally different (p = .10).
^b Means are significantly different (p = .0005).
^c Means are marginally different (p = .065).

Counter to the trend in Experiment 1 with respect to diagram quality, composite Belvedere ratings of students with adequate diagrams (M = 6.01, SD = 1.17) were significantly lower than those of students with inadequate diagrams (M = 6.79, SD = 1.17), t(41) = 2.11, p < .05. Statement B6 ("Overall I found my session with Belvedere to be enjoyable") also tended to receive lower ratings from students with adequate diagrams (M = 5.69, SD = 1.40) than from those with inadequate diagrams (M = 6.56, SD = 1.53), t(41) = 1.86, p = .07. These differences could be due to the disproportionate number of IC students in the adequate-diagram group relative to the inadequate-diagram group. No other differences approached statistical significance, and there were no significant interactions between diagram quality and coaching condition.

Ratings of the Coach. As noted in the Data Limitations section above, the neutral ratings of the two uncoached DC students were omitted from the following analyses. Table 7 shows the mean Coach-related ratings by condition for the remaining 41 students. Respective overall means for the six statements were 4.71, 4.90, 3.90, 6.61, 5.42, and 5.24. Median ratings ranged from 3 to 7, with an overall mean composite rating of 5.13 (SD = 1.82, Mdn = 5.33). The only significant difference in ratings between groups was for statement C5 ("The Belvedere system would be better off without the Coach"), to which the DC students gave a more favorable rating after reverse scoring (i.e., DC students disagreed and IC students agreed), t(39) = 2.53, p < .05. However, most other means fit the predicted pattern. Note from Table 7 that the mean composites of IC and DC students were on opposite sides of the neutral mark. In fact, the only statement to receive a favorable mean rating from IC students was C4 ("The feedback I received from the Coach was easy to understand"), which was also the only statement to receive an appreciably (but nonsignificantly) higher mean rating from IC than from DC students. Statement C3 ("Often I found the feedback from the Coach to be repetitive") received a very slightly more favorable rating from IC than from DC students after reverse scoring, and it was also the only statement to receive an unfavorable mean rating from DC students. That is, while mean DC ratings were favorable in most cases, students in both conditions slightly agreed on average that the Coach was repetitive. However, neither group agreed with this statement as strongly as did the coached students in Experiment 1 (refer back to Table 3).

Table 7
Coach Attitude Ratings by Condition in Experiment 2

Cond.        C1     C2     *C3    C4     *C5^a  *C6    Composite
IC    M      4.18   4.41   3.91   7.05   4.59   4.73   4.81
      SD     2.84   2.89   2.98   1.70   2.52   2.98   2.02
DC    M      5.32   5.47   3.90   6.11   6.37   5.84   5.50
      SD     2.21   2.09   1.66   2.48   1.86   2.14   1.52

Note. * Reverse scoring was used on this item. IC = intrusive coaching; DC = on-demand coaching.
^a Means are significantly different (p < .05).

As with the composite Belvedere ratings, composite Coach ratings of students with adequate diagrams (M = 4.26, SD = 1.84) were significantly lower than those of students with inadequate diagrams (M = 5.69, SD = 1.60), t(39) = 2.63, p = .01. Adequate-diagram means were significantly lower than inadequate-diagram means for statement C1 ("I found Belvedere's online Coach to be helpful"; Ms = 3.13 vs. 6.56, SDs = 2.28 and 2.30, t = 3.54, p = .001), statement C2 ("I appreciated the feedback I received from the Coach"; Ms = 3.31 vs. 5.92, SDs = 2.21 and 2.29, t = 3.60, p < .001), and statement C5 (Ms = 4.38 vs. 6.08, SDs = 2.68 and 1.96, t = 2.35, p < .05). There were no significant interactions between diagram quality and coaching condition. There were, however, barely significant interactions of coaching condition with both median-split NFC and median-split SAT totals on composite Coach ratings (respective Fs(1, 37) = 4.33 and 4.42, ps = .044 and .042). DC students with low NFC had lower composite ratings than those with high NFC (respective Ms = 4.93 and 6.13), but IC students with low NFC had higher composite ratings than those with high NFC (respective Ms = 5.31 and 4.46). Students with low SATs gave higher ratings than those with high SATs in both conditions, but the difference was greater among IC students (respective Ms = 6.03 and 3.59) than among DC students (respective Ms = 5.70 and 5.28).

Ratings correlations. One-tailed rejection criteria were used for all analyses reported in this paragraph. As predicted, the students' composite Belvedere ratings were positively correlated with their composite Coach ratings, r(39) = .39, p = .005. Interestingly, the correlation was much higher for DC students (r(17) = .63, p < .005) than for IC students, among whom the correlation did not even reach statistical significance (r(20) = .24, p = .15). Among the IC students (n = 22), composite Coach rating was significantly correlated only with Belvedere item B6 (r = .44, p = .02), and composite Belvedere rating only with Coach-related items C1 and C2, respective rs = .38 and .41, ps < .05. Among the DC students (n = 19 after the uncoached omissions), composite Coach rating was significantly correlated at the .05 level or better with all Belvedere item ratings except for B4 (r = .24, p = .16) and B3 ("Belvedere helped me keep track of the various pieces of information relevant to the problem"; r = .32, p = .09), and composite Belvedere rating with all Coach-related items except C3 (r = .27, p = .14), C4 (r = .34, p = .08), and C5 (r = .37, p = .06).

Verbal Summaries

Having found mainly nonsignificant trends in my primary dependent measures, as in Experiment 1, I again turned to my secondary data source, the verbal end-of-session summaries. As before, I focused on the third phase of the verbal summaries (additions to summaries following diagram redisplays), coding each for the number of statements uttered and for the number of relations either stated or strongly implied. Students added 0 to 5 statements to their summaries, with a mean of 1.14 statements (SD = 1.57, Mdn = 0). As predicted, DC students tended to add more statements (M = 1.29, SD = 1.65) than did IC students (M = 1.00, SD = 1.51); however, this difference was not significant (t < 1). Students mentioned 0 to 6 relations in their additions to their summaries, with a mean of 0.95 relations (SD = 1.56, Mdn = 0). DC students tended to add more relations (M = 1.19, SD = 1.83) than did IC students (M = 0.73, SD = 1.24), but this difference also was not significant (t < 1). As in Experiment 1, the two addition measures were highly intercorrelated, r(41) = .86, p < .0001.

Number of added statements had a significant negative correlation with GPA (r = -.39, p < .01) and with NFC (r = -.31, p < .05), and NFC also had a marginal negative correlation with the number of relations added (r = -.27, p = .08). An ANCOVA using all three covariates showed a significant overall covariate effect on the number of statements added, F(3, 38) = 3.96, p < .05. Although the adjusted means showed inflated differences in the predicted direction, there remained no significant difference between group means, which were 0.97 statements for IC students and 1.31 statements for DC students (F < 1). ANCOVAs on the number of added relations did not show a significant effect of covariates using any regression model.

As in Experiment 1, there were tendencies for students with adequate diagrams to include more statements (M = 1.38, SD = 1.71) and relations (M = 1.06, SD = 1.57) than students with inadequate diagrams (respective Ms = 1.00 and 0.89, SDs = 1.49 and 1.58), although neither difference in means was significant (ts < 1). However, a two-way ANOVA showed a significant interaction between diagram quality and coaching condition on the number of statements added, F(1, 39) = 9.88, p < .005. The IC students added more statements if they had adequate diagrams (M = 1.73) than if they had inadequate diagrams (M = 0.27), but DC students showed the opposite pattern, adding more statements if they had inadequate diagrams (M = 1.50) than if they had adequate diagrams (M = 0.60).

Lag Times Following Advice Delivery

As in Experiment 1, for each viable session I measured the lag time in seconds between delivery of each advice message and the action immediately following it. I again omitted any lag times for which the advice-triggering action (i.e., a diagram action for intrusive messages, or a coaching request for on-demand messages) was the user's final action; such lag times (n = 7) ranged from 942 to 1426 s. I omitted 14 additional lag-time outliers ranging from 106 to 238 s; these lag times followed documented cases of crash downtimes, user scrolling of either the Netscape or Belvedere window (neither of which is a logged action), verbal interactions between the user and me, slow typing or updating of text in a dialog box (as discussed at the end of Experiment 1), or user inactivity. The remaining 791 lag times ranged from 1 to 98 s, with a mean of 24.23 s (SD = 20.25) and a median of 17 s. For the 664 lag times in Condition IC, the mean was 22.81 s (SD = 19.63) and the median was 16 s. As in Experiment 1, the distribution of IC lags was positively skewed (coefficient of 1.43), with a kurtosis measure of 1.54 and a slight hint of bimodality. The bimodality coefficient (computed from the same formula used in Experiment 1) was 0.669, even higher than the coefficient for the IC lags in Experiment 1. The 127 lag times for Condition DC, on the other hand, had a mean of 31.64 s (SD = 21.85) and a median of 28 s, and the lag distribution exhibited much less skewness (coefficient of 0.86) and kurtosis (0.008) with fewer signs of bimodality. The bimodality coefficient for the DC lags was only 0.565, not much higher than the criterion value representing a uniform distribution (SAS Institute, 1999). The difference in mean lag times between groups was highly significant, t(789) = 4.56, p < .0001. Bartlett's test gave an almost significant result for inequality of group variances (F(126, 663) = 1.24, p = .0515); however, the difference in means was significant even assuming unequal variances (t(167.1) = 4.24, p = .0001).
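
The bimodality coefficients and t statistics reported above can be approximately reproduced from the rounded summary statistics alone. The following Python sketch is illustrative only and was not part of the original analyses. It assumes the sample bimodality formula documented by SAS, b = (g1^2 + 1) / (g2 + 3(n - 1)^2 / ((n - 2)(n - 3))), where g1 is the skewness and g2 the excess kurtosis; the formula is not restated in this section, so its use here is an assumption. All function names are hypothetical.

```python
import math

def bimodality_coefficient(skewness, excess_kurtosis, n):
    """Sample bimodality coefficient (assumed SAS formulation)."""
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return (skewness ** 2 + 1) / (excess_kurtosis + correction)

def pooled_t(m1, s1, n1, m2, s2, n2):
    """Student's t from summary statistics, assuming equal variances."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / df
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (m1 - m2) / se, df

def welch_t(m1, s1, n1, m2, s2, n2):
    """Welch's t from summary statistics, with Satterthwaite degrees of freedom."""
    v1, v2 = s1 ** 2 / n1, s2 ** 2 / n2
    t = (m1 - m2) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

# Condition IC: skewness 1.43, kurtosis 1.54, 664 lags
print(f"{bimodality_coefficient(1.43, 1.54, 664):.3f}")   # 0.669
# Condition DC: skewness 0.86, kurtosis 0.008, 127 lags
print(f"{bimodality_coefficient(0.86, 0.008, 127):.3f}")  # 0.565

# DC (M = 31.64, SD = 21.85, n = 127) versus IC (M = 22.81, SD = 19.63, n = 664)
t_eq, df_eq = pooled_t(31.64, 21.85, 127, 22.81, 19.63, 664)
t_w, df_w = welch_t(31.64, 21.85, 127, 22.81, 19.63, 664)
print(f"pooled: t({df_eq}) = {t_eq:.2f}")       # pooled: t(789) = 4.56
print(f"Welch:  t({df_w:.1f}) = {t_w:.2f}")     # Welch:  t(167.1) = 4.24
```

For a uniform distribution the coefficient is 5/9 (about 0.555), which is the criterion value mentioned above. Because the inputs are rounded, small discrepancies from values computed on the raw lag times are to be expected.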

Given the higher frequency of null advice messages in comparison to Experiment 1 (mostly in Condition DC), I then restricted analyses to substantive advice messages only, for which the mean lags were 22.35 s (SD = 19.01, N = 642) for IC students and 34.71 s (SD = 22.25, N = 93) for DC students. These means also differed significantly, t(112.3) = 5.09, p < .0001. Among the DC students, respective mean lag times for null and substantive messages were 23.24 s (SD = 18.54, N = 34) and 34.71 s (SD = 22.25, N = 93). The difference in means was significant, t(125) = 2.68, p < .01, confirming my suspicion that the null advice messages would have shorter lag times. Indeed, the mean lag time for null messages was not much longer than the mean IC lag time for substantive messages. In order to investigate possible differences in lag time between intrusive and requested advice messages, I pooled the lag time data from the IC conditions in both experiments, resulting in a set of 1043 intrusive messages and 58 requested messages. Despite the inordinately unequal sample sizes, the mean lag time of 20.96 s for intrusive messages (SD = 17.60) was significantly shorter than the 28.72 s mean for requested messages (SD = 21.40), t(61.4) = 2.71, p < .01. When I also pooled the substantive lag times from the DC condition, all of which followed requests for advice, the mean for requested messages increased to 32.41 s (SD = 22.05, N = 151), which differed even more significantly from the intrusive mean, t(178.7) = 6.11, p < .0001. Thus, lag time data from both experiments suggest that students spend much less time processing intrusive advice than they do requested advice.
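
The pooled mean and standard deviation for requested messages can be checked from the group summaries. The sketch below is a hypothetical helper, not the original analysis code. It assumes that the pooled set consists of the 58 requested messages from the IC conditions and the 93 substantive DC messages (an inference from N = 151), and that the reported SDs are sample standard deviations with an n - 1 denominator.

```python
import math

def combine_groups(groups):
    """Combine (n, mean, sd) summaries into one overall n, mean, and sample SD."""
    groups = list(groups)
    total_n = sum(n for n, _, _ in groups)
    grand_mean = sum(n * m for n, m, _ in groups) / total_n
    # Total sum of squares = within-group part + between-group part.
    within_ss = sum((n - 1) * sd ** 2 for n, _, sd in groups)
    between_ss = sum(n * (m - grand_mean) ** 2 for n, m, _ in groups)
    combined_sd = math.sqrt((within_ss + between_ss) / (total_n - 1))
    return total_n, grand_mean, combined_sd

# Requested messages: IC requests (n = 58) plus substantive DC messages (n = 93)
n, mean, sd = combine_groups([(58, 28.72, 21.40), (93, 34.71, 22.25)])
print(n, f"{mean:.2f}", f"{sd:.2f}")  # 151 32.41 22.05
```

The same helper applied to the IC and DC lag summaries from Experiment 2 (664 and 127 lags) gives an overall mean of about 24.23 s, in agreement with the value reported for all 791 lag times.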

Summary

Students in Condition IC received five times the number of advice messages requested by DC students. However, most performance measures continued to show only trends in the predicted direction, with fewer unpredicted trends than in Experiment 1. Predicted trends in attitude ratings were stronger than in Experiment 1, although ratings correlations were weaker among IC students in this experiment. Once again, local reactions to a unique advice message were noted, showing that students in both conditions responded positively to at least some of the coaching. However, the lag time analyses show that the time between coaching and subsequent actions is significantly longer in the on-demand condition than in the intrusive conditions of both experiments. Collapsed across groups, lag time is significantly shorter for intrusive messages than for requested messages, despite the huge sample-size advantage for intrusive messages. Therefore, even though the lag-time measure is imperfect as described above, it seems to indicate that Belvedere users paid more attention to requested coaching than to intrusive coaching.


General Discussion

Costs versus Benefits

Generally speaking, my experiments showed a greater number of significant negative effects than positive effects of coaching. Although the news on the attitude measures was not all bad -- students generally found the Coach's feedback to be easy to understand -- they also found it to be repetitive, even in the on-demand condition. Therefore, although several trends hinted at the positive effects I predicted, the predicted negative affective impact of coaching was more readily apparent, suggesting that my hypothetical attitude-performance tradeoff may have been somewhat top-heavy. However, there are several reasons why the performance effects were not as readily apparent. I discuss several of them in turn.

Ceiling Effects

It is conceivable that the problem solving task used in my experiments was too easy. Although the TWA crash problem was selected for its accessibility to students, it may have been too accessible. The proliferation of media coverage of airplane crashes may have "lowered the bar" on the task of evaluating and analyzing the evidence and hypotheses in my online database. In fact, my reindexing of the original TWA database may have made the task easier still. Although the reindexing allowed me to determine the relative frequency with which users considered evidence for and against the various hypotheses, the link structure made the evidential relationships obvious to the users, providing more inquiry scaffolding than they would otherwise have had with a database of unconnected hypotheses and data.

Self-Correction of Errors

As discussed in my results for Experiment 1, students who make a conscientious effort to wade through a problem database with explicit evidential indexing, like the one I used, will likely either avoid or self-correct any errors they might otherwise have left in their final diagrams. The similarity in error counts between conditions should not be surprising, given that half of the students in Experiment 1 and a third of those in Experiment 2 browsed the entire database, while the remaining students browsed the majority of it. It is possible that, had I not made the evidential relations explicit in the database, more of the between-group differences in final diagram error counts might have been significant.

Student Ability Level

It is also possible that the task was too easy for college-aged students in general, as opposed to younger students. Belvedere was originally designed with middle-school students in mind, to address curricular deficiencies in the teaching of scientific inquiry skills. Although one could claim that many college freshmen (the typical population enrolled in undergraduate courses in introductory psychology) are deficient in these skills as well, it is likely that they would perform at a higher level than their younger counterparts, by virtue of their greater knowledge if not of their age or grade in school (Means & Voss, 1996). Although some of my covariate measures suggested differences between groups on some of my performance measures, they generally had little effect (possibly due to the potential ceiling effect).

The Apparent Advantage of Requested Advice

As noted in my results, students appeared to spend significantly more time processing requested advice than they did intrusive advice. Although this conclusion is based solely on analyses of post-coaching lag times, a measure that suffers from many drawbacks as discussed in Experiment 1, the finding was replicated with a larger sample size in Experiment 2. Therefore, it seems my students were more receptive to advice when they asked for it themselves, even if it was not provided when it became immediately applicable. However, it is difficult to tease apart the issues of feedback timing and control in my second experiment. Given the nature of on-demand coaching, in which feedback delivery is by definition under the control of the user and may not occur at a time when the feedback may first be helpful, the two aspects of feedback timing (immediate vs. delayed) and feedback control (user- vs. system-initiated) are inextricably linked in this study. The question remains as to whether immediacy or locus of control is the more important aspect of feedback with Belvedere.

When Was Advice Requested?

One exploratory question to ask of the data from participants in Condition DC is when (i.e., under what circumstances) they requested advice from the on-demand Coach. Although my introduction of a reminder prompt was meant to increase their interactions with the Coach, it also made it difficult to determine any patterns indicating when users felt they needed advice. Some earlier speculative notions that student requests for advice might be impasse-driven (e.g., VanLehn, 1988b) do not seem to apply because working with Belvedere represents problem solving in the absence of correct answers (D. Suthers, personal communication, May 23, 1998; A. Lesgold, personal communications, October 23 & November 16, 1998, May 30, 2000). Not only are there no true impasses in the sense of becoming blocked on the path toward a correct answer, but also the ill-defined nature of the problem may make it difficult for a user to even recognize an impasse in any other sense (e.g., a knowledge deficit or a procedural stumbling block). Even if users in my experiments did recognize any deficits in their problem-relevant knowledge, the web database provided enough scaffolding for them to correct such deficits without having to rely on coaching.

Of course, there was the possibility that DC students would simply request coaching only when reminded of its availability by the experimenter. Indeed, based on my definition of what constituted a prompted request for advice, more than half of all DC advice requests followed a reminder prompt. However, regardless of the criterion chosen for the timing of reminder prompts, I expected that students probably would not request advice after every reminder, and indeed they did not. Unfortunately, it is difficult to partial out the relative effects of reminder prompts and minimally intrusive light bulb blinks on DC student requests for coaching. That is, it is difficult to determine any emerging patterns from their advice requests beyond these coaching prompts.

When to Present Advice Intrusively?

Although no patterns are immediately apparent from the DC student advice requests, one thing is certain: They did not request advice nearly as often as it was provided by my intrusive Coach. The modifications that created this intrusive Coach reflected an admittedly brute-force approach, adopted to ensure the delivery of sufficient coaching feedback to examine its effects within the Belvedere framework. However, given the dearth of positive performance effects and the apparent negative affective impact of its advice, it is probably safe to conclude that any future versions of an intrusive Coach for Belvedere should scale back its frequency of advice presentation. Instead of providing feedback every time a critical rule applies, as in my experiments, an enhanced intrusive Coach might delay its presentation of advice until one of several possible "key points" in an interaction, such as a period of inactivity on the part of the user. Although such an enhancement would require sensitivity to information not currently available to the Coach (e.g., time spent typing into dialog boxes or browsing web pages), adding such time-based sensitivities might be worth the investment.
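
The idea of deferring advice until a period of user inactivity can be sketched as follows. This is a hypothetical illustration only: it is not how Belvedere's Coach is implemented, and the class name, method names, and the 15-second threshold are all assumptions introduced for the example. Timestamps are passed in explicitly so that the logic does not depend on any particular event or logging mechanism.

```python
from collections import deque

class IdleGatedCoach:
    """Hypothetical sketch: hold advice until the user has been idle long enough."""

    def __init__(self, idle_threshold_s=15.0):
        self.idle_threshold_s = idle_threshold_s  # assumed value, not from the study
        self.pending = deque()                    # advice waiting to be shown
        self.last_activity = 0.0                  # time of last user action or advice

    def record_user_action(self, now):
        """Call on any user activity (diagramming, typing, browsing)."""
        self.last_activity = now

    def queue_advice(self, message):
        """Call when an advice rule applies; duplicates are not queued twice."""
        if message not in self.pending:
            self.pending.append(message)

    def poll(self, now):
        """Return one queued message if the user has been idle, else None."""
        idle_for = now - self.last_activity
        if self.pending and idle_for >= self.idle_threshold_s:
            # Treat delivery as activity so queued messages are not shown back to back.
            self.last_activity = now
            return self.pending.popleft()
        return None

coach = IdleGatedCoach(idle_threshold_s=15.0)
coach.record_user_action(now=100.0)
coach.queue_advice("Is there any evidence against this hypothesis?")
print(coach.poll(now=105.0))  # None: the user acted only 5 s ago
print(coach.poll(now=120.0))  # the queued message, after 20 s of inactivity
```

A design along these lines trades immediacy for reduced intrusiveness; whether advice that has become stale while queued should be discarded is left open in this sketch.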

Limitations of the Coach

In addition to the drawbacks inherent in its domain-general nature, the argument pattern Coach suffers from several other limitations as well, any number of which could negatively skew the attitudes of frequently coached users like my Condition IC participants. Below I discuss some of the limitations mentioned informally by several of my participants.

"Jumping the Gun"

Despite the introduction of a delay factor to the advice rules, which caused the Coach to wait until a sufficient number of diagramming actions had passed before offering applicable advice (see my Introduction section on Intrusive Coaching), many of my students complained that the Coach would sometimes advise them to take actions they were about to take anyway. In some cases, students claimed that the Coach interjected such advice while they were actually in the process of carrying out the action it recommended. One way to address this problem would be to adjust the delay factors of the coaching rules, so that the Coach would wait even longer than it currently does before presenting advice. However, in order to thoroughly determine which specific rules to adjust, not to mention how to adjust them, one would have to solicit reactions from users at every advice delivery about the appropriateness of its timing. While asking users to verbalize their thoughts when advice is presented during problem solving would have several advantages, at least from the standpoint of improving the effectiveness of the Coach, it would also interfere with the very task the Coach aims to support. Soliciting such frequent reactions to coaching would run the risk of placing users' comments outside the context of the inquiry task.

Imposing Order on the Inquiry Process

Another possible drawback to the current Coach is the way it tries to structure the process of creating an inquiry diagram. The Coach guides Belvedere users to consider hypotheses and evidence together, such that if they enter too many boxes of one type (Data or Hypotheses), the Coach will ask the users to link them to some boxes of the other type. Some of my students complained that they preferred to collect all of the relevant evidence surrounding the plane crash before considering any of the possible causes. Other students preferred to enter both hypotheses and evidence as they encountered them, but to add relational links only after they had collected all of their "thoughts" in the diagram. This inflexibility of the Coach, while helping to ensure that hypotheses are supported and that data are explained, does not allow for the possibility of multiple solution paths as indicated by the preferences of these students.

Findings of a Similar Study

An aforementioned study by Toth et al. (in press) also investigated the problem solving products of groups of students using Belvedere. Like mine, this study focused not on domain knowledge but on the scientific inquiry skills acquired while problem solving via "evidence mapping" in an ill-defined domain. Although the study did not involve the automated Coach at all, it did involve the use of reflective assessment rubrics, which made explicit the criteria for evaluating reasoning representations such as a Belvedere diagram or a prose summary. These rubrics were analogous to many of the evaluation criteria encoded into the Coach's pattern-matching rules, although the rubrics were introduced before and after problem solving rather than online. The rubrics were seen as complementary to the representational scaffolding provided by Belvedere, with the authors concluding that "rubrics seem to encourage students to look for and record disconfirming as well as confirming information more than mapping alone."

Although Toth et al. investigated graphical (Belvedere) versus textual (word processor) representations, I focus here only on the former aspect of their work, particularly on comparisons of the two groups who used Belvedere with and without the rubrics. They measured students' information searching behaviors by counting the number of relevant Hypotheses and Data entered in diagrams, as well as the number of For, Against, and And links. Their student participants were asked to write prose conclusions after problem solving with Belvedere, during which their diagrams were not visible (cf. my students' verbal summaries). The authors scored these prose conclusions based partly on whether they included evidence for a main hypothesis as well as evidence against it, similar to how I evaluated my students' diagrams. As in my experiments, they found no significant differences between groups on their basic information search measures (akin to my body count measures), with only a trend favoring those who used the rubrics. They found that rubric users had significantly more links than non-users, supporting the nonsignificant trend favoring the IC condition in my Experiment 2. However, they found no significant differences between rubric groups on the specific numbers of For, Against, or And links, which paralleled my own findings (with the exception of the And links in Experiment 1). They also found no differences in reasoning scores on their students' prose conclusions, consistent with the lack of differences in my selective verbal summary comparisons.

Toth et al. concluded that further evidence mapping studies with Belvedere or other representation formats should seek to account for differences in student reasoning ability, which is something I tried to do with my surrogate covariate measures. They also stressed the need to evaluate the impact of such representations on the problem solving activities of individual users, something else my research has done. Finally, they discussed the potential need for different representational supports for "inductive" reasoning (i.e., entering Data first, before Hypotheses) versus "deductive" reasoning (i.e., starting with Hypotheses before entering and linking Data), touching upon one of the major limitations of Belvedere's Coach as described above.

Conclusions

Returning now to assessing the impact of Belvedere's minimally intelligent automated coaching, I am forced to conclude that immediate feedback seems to have done more harm than good, at least in the present experiments. Although there were several trends in the data to suggest performance benefits of such feedback, there was stronger evidence of its negative affective consequences on user attitudes. However, it is difficult to assess whether the harm done by immediate feedback in these studies was due to its timing or to locus of control. Many would agree that the intrusive Coach intervened too frequently (and, perhaps, too immediately). However, its feedback may have been too limited to warrant such frequent delivery. It is possible that better, more useful feedback than currently provided may not be judged as annoying even if delivered with the same frequency. Alternatively, if the current intrusive Coach could be made to scale back its frequency of advice delivery by intervening more selectively, control could become less of an issue. Further research on the timing of advice requests could inform the design of a more selective intrusive Coach, such that perhaps it could be made to intervene at points when users tend to ask for advice anyway. Also, as mentioned earlier in this discussion, if the Coach could be made aware of user idle times perhaps it could be made to intervene at those times, instead of popping up while the user is engaged in a flurry of browsing and diagramming activity. Such modest delays of otherwise immediate coaching feedback could help tip the cost/benefit scales back toward optimality.

Future Directions

Extensions to the Coach

Expert advice. One extension to Belvedere's Coach that has already been implemented (see Footnote 2) is the addition of expert domain knowledge, from which the Coach can draw in generating more domain-specific advice to users. Although such a capability requires additional knowledge engineering, the payoff could be high for reusable problem domains (e.g., scientific debates relevant to a school curriculum in which many students would need to tackle the same issues). One shortcoming of the current argument pattern Coach is that it cannot recognize the contents of statements entered in a diagram. Therefore, some of its advice rules (e.g., explain-all-the-data) are prone to suggesting possible relationships between hypotheses and data where simply none exist. As described elsewhere (Suthers et al., in press), adding minimal semantic annotations to a self-contained web database (using a knowledge representation language accessible to the Coach) can enable the Coach to recognize the basic content of information chunks transferred into Belvedere from the database. The Coach can then provide more specific feedback relevant to the actual information entered, using the domain-general argument pattern Coach as a fall-back when any non-annotated information (e.g., from outside the self-contained database) is entered. The benefits relative to the costs of the additional knowledge engineering still remain to be seen, but it is a logical next step from the present research.

Graphical advice. Another possible limitation of the current advice delivered by Belvedere's Coach is the modality in which it is presented. Belvedere is a graphical environment in which the units of representation are visual objects (boxes and links). However, although many of the Coach's advice messages are accompanied by graphical highlighting of diagram objects, the advice itself is largely text-based. A possible extension to the current argument pattern Coach would be for the Coach to present temporary placeholder boxes or links in the diagram when it provides advice (D. Suthers, personal communication, September 23, 2000). For example, on the confirmation bias rule, the Coach could present a temporary Data box with a temporary Against link between it and the target Hypothesis box. The temporary Data box could be empty or, if combined with the expert coaching functionality, could contain an actual piece of disconfirming evidence from an annotated database. These temporary constructs would disappear from the diagram once the user dismissed the advice (using an analog to the "Close" button on the current advice dialog boxes). Such an extension might make the advice more salient to users, and embedding the advice in the same modality as the diagramming task may make it less of a distraction. However, one argument against this scheme is that providing such placeholders, especially ones containing actual domain information, would be akin to providing users a "bottom out hint" (Aleven & Koedinger, 2000, p. 294) in a hierarchical sequence of help messages. Many tutorial systems prefer to begin with general advice on principles or solution processes that will help users to arrive at a correct answer on their own, reserving an actual correct answer itself for the final (bottom) hint in the sequence. Providing Belvedere users with an actual piece of disconfirming evidence for a confirmation-biased hypothesis would relieve them of the burden of finding and analyzing the status of such evidence on their own. If a task goal is to learn scientific inquiry skills, then providing help on that level would seem to run counter to the goal. However, the less intelligent version of the idea, in which empty boxes are temporarily presented by the Coach, would seem to be a viable compromise extension to the current Coach.

Same Coach, Different Scenarios

Finally, an obvious extension to the current research would be to use the same types of coaching with (a) more difficult ill-defined problems, (b) less able experimental participants, or (c) both. As mentioned earlier in this paper (see Footnote 3), other self-contained web databases already exist for more complex ill-defined problems (databases with much less scaffolding of evidential relations than the one I used), and the domain-general nature of coaching in Belvedere opens it to countless other domains as well. To the extent that the relative lack of positive effects in my experiments was due to the simplicity of the chosen ill-defined problem or to the structure of its associated database, solvers of more complex problems could conceivably get some more mileage out of the current Coach than did my students. Alternatively, younger students or others more deficient than college freshmen in the skills of scientific inquiry might find the Coach's current feedback to be more helpful. Even the intrusive Coach in its current incarnation might be less likely to jump the gun on users who did not find its feedback to be superfluous to their own trains of thought. This could hold true even in problem domains that provide as much inquiry scaffolding as did my self-contained web database, depending on user reasoning ability. Only after exploration of at least some of these avenues could one draw any general conclusions about the impact of immediate feedback, even in its current form, on ill-defined problem solving with Belvedere.


Appendix A

An Example Interaction with Belvedere

Below is an illustrated excerpt of a typical user's interaction with Belvedere and Netscape.




Figure A1. Example screen layout for the experimental sessions.


Figure A1 shows an example screen layout, with a Netscape web browser on the left and a Belvedere Inquiry Diagram on the right. The web browser shows a page from the TWA 800 problem database for one of the possible causes of the crash (a missile).




Figure A2. Screen after clicking on "Evidence for this hypothesis".


When the user clicks on the hyperlink Evidence for this hypothesis, the screen appears as shown in Figure A2, with a new web page including hyperlinks to various evidence statements supporting the missile hypothesis.




Figure A3. Screen after clicking on "Plane's proximity to dry land", clicking on Data icon, and copying statement.


The user clicks on the first evidence link, Plane's proximity to dry land. Netscape then displays a short page with details about that bit of evidence. The user decides she wants to include part of this evidence statement in her diagram, so she clicks on the Data icon in the Belvedere palette, at which point an Add Data dialog box appears on the screen. She then uses the mouse to sweep out part of the statement from the web page, copies it, and pastes it into the dialog box (Figure A3).




Figure A4. Screen after clicking on "Plane's proximity to dry land".


After pasting the statement using the mouse (and optionally changing the box Type or the default neutral Strength of her belief in its contents), the user clicks on the Add this to Diagram button at the bottom left of the dialog box (see Figure A3). The Data statement then appears as a floating square box inside the Belvedere diagram window, until the user decides where in the diagram she wants it to appear and clicks to place it in the diagram (Figure A4).




Figure A5. Screen after drawing For link from new Data box to an existing Hypothesis box.


The user then wishes to indicate her new data's support of the missile hypothesis. She does so by clicking on the For icon in Belvedere's palette, clicking on her newly-created Data box to begin drawing the For link, then dragging the green arrow to her existing box representing the missile hypothesis, completing the link (Figure A5).




Figure A6. Screen after clicking on "Consider Possible Causes".


The user then moves the mouse back to the Netscape window and clicks on the Consider Possible Causes link in the far left frame. This brings up a web page listing the possible causes of the crash, with hyperlinks to each of them. As shown by the grey link shades in Figure A6, the user has already looked at information concerning the first two possible causes (bomb and missile), but she has not yet considered the last two (mechanical failure and human error).


Appendix B

Textual Contents of the 20 Coaching Advice Rules

Notes:


2-statement-circular-argument *

  1. This seems to be a circular argument. (There are two "for" links in opposite directions.)
    Circular arguments are not very strong: you need independent support for these statements.
  2. This seems to be a circular argument. (There are two "for" links in opposite directions.)
3-statement-circular-argument * and
4-statement-circular-argument *
  1. This seems to be a circular argument.
    Circular arguments are not very strong: you need independent support for these statements.
  2. This seems to be a circular argument.
alternate-hypothesis
  1. Scientists consider many hypotheses to get the best explanation of the data they are interested in. If they don't compare their favorite idea to other ideas, somebody else will!
    Is there another hypothesis that you could consider?
  2. Is there another hypothesis that you could consider?
attend-to-discrepant-evidence
  1. The diagram shows a hypothesis that has some evidence for it and some evidence against it.
    The hypothesis is in trouble if the evidence against it is reliable and *really* is against it.
    What do you think? Which is stronger here, the evidence for the hypothesis, or the evidence against it?
    You can edit the hypotheses and data to indicate which you think have stronger support. Use "Show Strength" on the "Filter" menu to display strength.
  2. Which is stronger here, the evidence for the hypothesis, or the evidence against it?
    Or are they equally reliable?
    Use the "Show Strength" filter and edit the statements to show which are stronger.
confirmation-bias *
  1. You've done a nice job of finding data that are consistent with this hypothesis.
    However, in science we must consider whether there is any evidence *against* our hypothesis as well as evidence for it. Otherwise we risk fooling ourselves into believing a false hypothesis.
    Is there any evidence against this hypothesis?
  2. Don't forget to look for evidence against this hypothesis!
conjunct-for-hypothesis? *
contradicting-links
  1. There is both a "for" and an "against" link between these two statements. Is this a contradiction, or do you have reasons for both relationships? Is one relationship stronger than the other?
  2. Can these statements really be for and against each other at the same time?
data-supports-conflicting-hypotheses
  1. The diagram shows this data supporting conflicting hypotheses.
    Is this possible? If so, can you find more data that helps decide between the two hypotheses? Look for evidence that is for one hypothesis but against the other.
  2. The same data supports conflicting hypotheses.
    Can you find other evidence that is for one hypothesis but against the other?
discriminating-evidence-needed *
  1. These hypotheses are supported by the same data. When this happens, scientists look for more data as a "tie breaker" -- especially data *against* one hypothesis.
    Can you produce some data that would "rule out" one of the hypotheses?
  2. Can you produce some data that might support just one of the hypotheses?
explain-all-the-data
  1. A good hypothesis is one that explains most of or all the data.
    Could this hypothesis explain this other data as well?
    Or is this a weakness of the hypothesis?
  2. Could this hypothesis explain this data, or is this a weakness of the hypothesis?
hypotheses-lack-empirical-evidence
  1. Hypotheses are just conjectures -- scientific guesses -- until they explain or predict observed data.
    Can you find data that are for or against these hypotheses?
  2. Can you find data that are for or against these hypotheses?
hypothesis-lacks-empirical-evidence
  1. Can you find data that are for or against this hypothesis?
    A scientific hypothesis is put forward to explain observed data. Data that a hypothesis explains or predicts count *for* it. Data that are inconsistent with the hypothesis count *against* it.
  2. Can you find some data for or against this hypothesis?
many-objects-and-no-links *
  1. I see you are collecting ideas now, but eventually you need to record the relationships between your statements.
    You can express relationships by drawing "for" or "against" links. To do this, use the mouse to select a link from the panel at the top.
  2. Are any of these statements related to other statements here? Is one statement "for" another statement? Is one statement "against" another?
no-links
  1. You need to record the relationships between your statements.
    Are any of these statements related to other statements here? If so, you can express a relationship by drawing a "for" or "against" link. To do this, use the mouse to select a link from the panel at the top. If not, perhaps you need to add other statements, so that you can indicate which data support which hypotheses.
  2. Are any of these statements related to other statements here? Is one statement "for" another statement? Is one statement "against" another?
nothing-in-argument
ready-to-decide?
statements-unconnected
  1. Are any of these statements related to other statements here by a "for" or "against" relation?
  2. Are any of these statements related to other statements here?
swallow-does-not-a-summer-make
  1. Strong hypotheses and theories usually have a lot of data to support them.
    However, this hypothesis has only one consistent data item. It looks rather weak.
    Can you find more data for this hypothesis? Can you find data against it?
  2. This hypothesis has only one consistent data item.
    Could you find more data for (or against) this hypothesis?
unexplained-data
  1. In science, we try to explain observed data with hypotheses that say why the data are so.
    Can you state a hypothesis that could explain the data here?
  2. Can you state a hypothesis that might explain the data here?
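
As an illustration only, the structural conditions behind a few of the advice patterns above can be sketched in Python. The Coach itself is LISP-based. The `Diagram` structure, the helper names, and the exact trigger conditions below are assumptions inferred from the pattern names and message texts, not the Coach's actual rules.

```python
from dataclasses import dataclass, field


@dataclass
class Diagram:
    # Hypothetical representation: statement id -> "data", "hypothesis", or "unspecified".
    statements: dict = field(default_factory=dict)
    # Links as (source id, target id, relation), with relation "for" or "against".
    links: list = field(default_factory=list)


def neighbors(diagram, sid, relation=None):
    """Ids of statements linked to sid, optionally restricted to one relation."""
    out = set()
    for src, dst, rel in diagram.links:
        if relation is not None and rel != relation:
            continue
        if src == sid:
            out.add(dst)
        elif dst == sid:
            out.add(src)
    return out


def data_neighbors(diagram, sid, relation=None):
    """Linked statements of kind "data" only."""
    return {n for n in neighbors(diagram, sid, relation)
            if diagram.statements.get(n) == "data"}


def matching_patterns(diagram):
    """Return (pattern name, statement id or None) pairs under the assumed conditions."""
    matches = []
    if diagram.statements and not diagram.links:
        matches.append(("no-links", None))
    for sid, kind in diagram.statements.items():
        if kind == "hypothesis":
            if not data_neighbors(diagram, sid):
                # No data linked for or against this hypothesis.
                matches.append(("hypothesis-lacks-empirical-evidence", sid))
            elif len(data_neighbors(diagram, sid, "for")) == 1:
                # Exactly one consistent ("for") data item.
                matches.append(("swallow-does-not-a-summer-make", sid))
        elif kind == "data":
            linked_hyps = {n for n in neighbors(diagram, sid)
                           if diagram.statements.get(n) == "hypothesis"}
            if not linked_hyps:
                matches.append(("unexplained-data", sid))
    return matches


# Hypothetical diagram: one supported hypothesis, one bare hypothesis, one stray datum.
example = Diagram(
    statements={"h1": "hypothesis", "h2": "hypothesis", "d1": "data", "d2": "data"},
    links=[("d1", "h1", "for")],
)
print(matching_patterns(example))
```

Each match pairs a pattern name with the statement it concerns, which is the element the Coach would presumably highlight in yellow when presenting the corresponding message.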

Appendix C

TWA Web Database Link Structure

Notes:


Home page: http://advlearn.lrdc.pitt.edu/experiments/materials/JC/TWA/

[displays menu.htm in left pane and crshprob.htm in right pane]
menu.htm:    [six unnumbered links, numbered here for clarity]
  1. crshprob.htm + 747front.gif     ("Mission Statement")
  2. crshdetl.htm    ("Get More Detail about the Crash")
  3. hypoindx.htm    ("Consider Possible Causes")
  4. dataindx.htm + cokpt.jpg     ("Study the Wreckage")
  5. parallel.htm    ("Find out about Similar Events")
  6. 747outer.jpg    ("Diagram of a Boeing 747-121")

Appendix D

Generalized Form of Experimenter Script


What I'm going to ask you to do today is to work through a scientific problem that we've collected some information about and put into an online web database. Have you ever used Netscape before? (If not, explain underlined hyperlinks, Back and Forward buttons.) That's what you'll be using to browse our database today.

The scientific problem we'll be dealing with today is an actual, real-life problem that remains unsolved to this day. Therefore, there is no correct answer to the problem, and you will not be expected to come up with a definitive solution to it yourself.

Before we get started on the problem, I'd like to request some information from you. We would like to compare your problem solving performance in this experiment today with some other general measures of academic and reasoning ability. Do you happen to remember your SAT scores offhand (Math? Verbal?)? Do you happen to know your current GPA (as of this term)?

The specific scientific problem we'll be dealing with today is: What caused the crash of TWA flight 800? Like I said, it's an actual problem that remains unsolved to this day; scientists still haven't figured out a definite cause for the crash. So again, there is no correct answer to this problem. Your task today is to try to "make sense" of the information we have in the database and to try to come up with an account of what you think may have caused the crash.

While you're working on the problem, I'm going to ask you to record your thoughts in Belvedere (POINT). Belvedere is a piece of software we've developed here at LRDC that allows you to graphically map out the relationships between hypotheses and evidence. Basically, it lets you draw argument diagrams on the screen.

If while browsing through the database you find a piece of evidence that you think is relevant, you can enter it into Belvedere using a Data box (POINT). When you click on the little Data icon (DEMO), a dialog box will appear here (POINT) and you can type in a summary of the evidence here (POINT), and then click here (POINT: "Add this to diagram") to place your little Data box somewhere in the diagram. In the same manner, if you find a hypothesis that you think is relevant, you can put it in a Hypothesis box (POINT). It works the same way as a Data box, but it will appear as a box with rounded corners, as opposed to square corners like a Data box. Also, if you find a statement that you think is relevant, but you're not sure whether it's Data or a Hypothesis, then you can put it in an Unspecified box, which has a kind of cloud-like shape. So those are the three types of boxes you can use for different types of statements.

To show the relationships between the statements in your boxes, you would connect them using these links (POINT). You would use a For link to show that one statement supports or explains another, or is somehow for another. This type of link would be colored green in your diagram. You would use an Against link to show that one statement contradicts another or is somehow against another. This link would show up as red in your diagram, with a red "X" drawn over the middle of it. And you could use the And link to show that two statements together are for or against another statement. This link would appear in black, with a little "ball" in the middle of it. In just a minute I'll show you an example of what these boxes and links look like in a Belvedere diagram. One thing about drawing links: Belvedere doesn't work like many other drawing programs, and link-drawing is a little counterintuitive. To draw a link, click on its icon, move the cursor inside the box you want to draw it from, then click and hold the mouse while dragging the cursor inside the box you want to draw the link to... Do you have any questions at this point?

Sample Text & Diagram (NS)

I'm going to show you a short excerpt of some text about an unrelated scientific problem, along with a corresponding representation of the text as a Belvedere diagram. I'm going to ask you to read the text and compare it to the diagram, so that you understand how the information in the diagram matches up to what's in the text. Let me know when you're done... (after reading) Do you have any questions about how the text and the diagram correspond?

If in Condition NC, skip to *** below

Belvedere has a computerized Coach that will keep track of what you're doing in the diagram and may want to make suggestions about how to improve your diagram or about what you may want to look for next in the database. You can get advice from the Coach whenever you want it by clicking on the lightbulb icon (POINT). The advice will appear in a little box in the upper left corner of the screen, with "Here's an Idea" across the top of it.

Condition I (IC):
I'm telling you this because sometimes the Coach may also speak up on its own, even if you don't click on it -- so if a box appears there you'll know it's a message from the Coach.

Condition D (DC):
Periodically throughout the session, you will hear beeping sounds from behind you. These beeps are simply to remind you that advice is available from Belvedere whenever you want it (you don't have to click the bulb when you hear the beeps). Please let me know each time you hear them...

If a Coach box pops up such that it covers up part of your diagram, you can move the Coach box around on the screen. I'm telling you this because sometimes the Coach will highlight parts of your diagram in yellow, when its advice refers to specific parts of it; so you may need to move the Coach box around on the screen to see what parts turned yellow.

***
That's about it. I've told you the basics about how to use Belvedere, but I may have left out some of the details. But I'm not trying to put you through the wringer here; if I see you having trouble with something, I'll jump in and help you. I'll be sitting back here doing other work, so I may not be paying close attention to what you're doing the entire session; so if you are having trouble with something, please feel free to ask me for help at any point. Also, feel free to ask me questions at any time during the session.

Remember -- Your final goal is not to come up with a definitive answer to the question (not even the best experts have been able to do that!), but rather to try to "make sense" of the information and try to figure out the most likely cause of the crash...

Do you have any questions at this point? Then I'll ask you to start by reading the Problem Statement here (POINT), and then you can proceed through the database however you see fit.

(prepare SAT perm form and feedback sheet)

Verbal Summaries     (stop timer if Cond. DC)

OK, for the next part of the session, I'm going to ask you some questions and ask you to give verbal responses to them. So I want to tape record just this part of the session... (click Desktop icon to hide all)

  1. (free-form) ... Can you provide a summary of the argument that you developed using Belvedere?
  2. What do you conclude about the bomb / missile / mechanical-failure / human-error hypotheses? Which of the four hypotheses do you think gives the best explanation of what happened?
  3. (show diagram) ... Looking at your diagram, is there anything you would like to change or add to your verbal summary? ... (until done) Anything else?
Survey (MSIE): (pick one based on condition)

For the last major part of the session today, I'd like you to fill out a brief survey about the experiment. It has some questions about the crash, some items about Belvedere, and some more general items...

SAT/GPA permission form

OK, I have one last thing to ask of you. I mentioned earlier that we'd like to compare your performance today with other more general measures. So I'm asking each participant in this experiment for permission to access their official SAT scores and GPA at the end of this term, on the University's ISIS system. If you do give me permission to do so, rest assured that your data would be held completely confidential (only I will see it), that it would be used only for statistical averaging purposes, and that it would not be tied to your name in any way -- it would only be stored with your arbitrary ID number for this experiment...

Debrief: ask whether S has heard of the NTSB's final report? Ever see bulb blink (DC only)? Find the Coach or beeps annoying?

Ask S not to discuss problem or experiment with classmates


Appendix E

End-of-Session Survey Items

Notes:


[The following scale remained visible in a separate frame at the bottom of the browser window while the participant scrolled through the survey items in the upper window frame]

1    Very Strongly Disagree
2    Strongly Disagree
3    Moderately Disagree
4    Slightly Disagree
5    Neutral
6    Slightly Agree
7    Moderately Agree
8    Strongly Agree
9    Very Strongly Agree


[The remaining content of this Appendix appeared in the scrollable upper window frame]

ID:      [Participant number, entered by experimenter at beginning of survey]
  1. In the late summer of 2000, the National Transportation Safety Board (NTSB) released its final report on the crash of TWA Flight 800. According to that report, which of the four causes in Belvedere's database was named as the most likely cause of the crash?
    A bomb     A missile     Mechanical failure     Human error
    I am not familiar with the NTSB's final report on the crash

  2. Do you believe the NTSB named the correct most likely cause in its report?
    Yes     Maybe     No         Don't know what the NTSB reported
  3. How much influence did the NTSB's final report (or the media coverage of it) have on your reasoning or problem solving activities during your session today?

    None     A little     Moderate     A lot     Extremely high

Below are two series of statements. For each statement, please read it carefully and then indicate your level of agreement with it by choosing the proper rating on the right, according to the scale at the bottom of the page. Note that the rating levels range from 1 (very strongly disagree) to 9 (very strongly agree), with 5 being neutral.

Please note that there are no correct responses to any of these statements; we merely seek your honest opinions about them. Therefore, we ask that you please read carefully and rate each statement as honestly and accurately as you can.

EXPERIMENT-SPECIFIC RATINGS

[#]   1 2 3 4 5 6 7 8 9
B1 I enjoyed using the Belvedere software system.
B2 * I found Belvedere to be difficult to use.
B3 Belvedere helped me keep track of the various pieces of information relevant to the problem.
      1 2 3 4 5 6 7 8 9
B4 Belvedere would be helpful in collecting and organizing information for a paper or report.
B5 * It would have been easier for me to work on the assigned problem without using Belvedere.
B6 Overall I found my session with Belvedere to be enjoyable.
      1 2 3 4 5 6 7 8 9
C1 I found Belvedere's online Coach to be helpful.
C2 I appreciated the feedback I received from the Coach.
C3 * Often I found the feedback from the Coach to be repetitive.
      1 2 3 4 5 6 7 8 9
C4 The feedback I received from the Coach was easy to understand.
C5 * The Belvedere system would be better off without the Coach.
C6 * I found the Coach to be annoying.
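
A minimal sketch of how ratings on these items might be scored follows. It assumes that the asterisk marks negatively worded items to be reverse-keyed on the 9-point scale, and that the B and C items are averaged into separate Belvedere and Coach attitude scores. Neither assumption is stated in this appendix, and the function names and sample ratings are hypothetical.

```python
# Items flagged with an asterisk in the listing above (assumed to be reverse-keyed).
STARRED_ITEMS = {"B2", "B5", "C3", "C5", "C6"}
SCALE_MIN, SCALE_MAX = 1, 9


def check_rating(rating):
    """Reject ratings outside the 1-9 scale."""
    if not SCALE_MIN <= rating <= SCALE_MAX:
        raise ValueError(f"rating {rating} is outside {SCALE_MIN}-{SCALE_MAX}")
    return rating


def reverse_key(rating):
    """Mirror a rating around the neutral midpoint: 1 <-> 9, 2 <-> 8, 5 stays 5."""
    return SCALE_MIN + SCALE_MAX - check_rating(rating)


def subscale_mean(responses, prefix):
    """Mean rating over items whose label starts with prefix, after reverse-keying."""
    scores = [
        reverse_key(rating) if item in STARRED_ITEMS else check_rating(rating)
        for item, rating in responses.items()
        if item.startswith(prefix)
    ]
    if not scores:
        raise ValueError(f"no items with prefix {prefix!r}")
    return sum(scores) / len(scores)


# Hypothetical ratings from one participant.
ratings = {"B1": 7, "B2": 3, "B3": 8, "B4": 6, "B5": 2, "B6": 7,
           "C1": 4, "C2": 5, "C3": 8, "C4": 7, "C5": 6, "C6": 7}
print(subscale_mean(ratings, "B"), subscale_mean(ratings, "C"))
```

Under this scoring, a higher mean always indicates a more favorable attitude, so the two groups of items can be compared in the same direction.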

GENERAL RATINGS

[The 18-item Need for Cognition scale appeared here, in a table with the same physical layout as above (e.g., with the row of rating numbers repeated every three items)]

Please make sure you have chosen a response for EVERY item on the survey before you click on the "Submit survey" button below.

If you wish to erase all of your responses and start over, click on "Redo survey".

          

Appendix F

Sample Text and Diagram for Unrelated Scientific Problem


Nowadays many folks, both scientists and lay people alike, believe that our planet is experiencing global warming. This belief is partly based on recent observations that our polar ice caps are melting. Also, some climatological evidence compiled by meteorologists across the country shows that droughts are becoming more frequent and widespread than in the past. However, there are other scientists who claim that these changes are due not to global warming but instead to periodic fluctuations in the Earth's climate, resulting in changing weather patterns. For example, local meteorologists recorded fewer extremely hot days in western Pennsylvania during the past few summers than in earlier summers. In addition, certain widespread fluctuations in temperature, rainfall, and other weather patterns are known to occur in conjunction with El Niño and La Niña. These two weather phenomena, which recur every few years off the west coast of South America, are known to have affected our own climate in different ways over the past few years.



References

Aleven, V., & Ashley, K. D. (1995). Using a well-structured model to teach in an ill-structured domain. In Proceedings of the Seventeenth Annual Conference of the Cognitive Science Society (pp. 419-424). Mahwah, NJ: Erlbaum.

Aleven, V., & Koedinger, K. R. (2000). Limitations of student control: Do students know when they need help? In G. Gauthier, C. Frasson, & K. VanLehn (Eds.), ITS 2000: Proceedings of the 5th International Conference on Intelligent Tutoring Systems (pp. 292-303). Berlin: Springer-Verlag.

Anderson, J. R., Boyle, C. F., Farrell, R., & Reiser, B. (1984). Cognitive principles in the design of computer tutors. In Proceedings of the Sixth Annual Conference of the Cognitive Science Society (pp. 2-9). Boulder: University of Colorado, Institute of Cognitive Science.

Anderson, J. R., Boyle, C. F., & Reiser, B. J. (1985). Intelligent tutoring systems. Science, 228, 456-462.

Anderson, J. R., Corbett, A. T., Koedinger, K. R., & Pelletier, R. (1995). Cognitive tutors: Lessons learned. The Journal of the Learning Sciences, 4(2), 167-207.

Anderson, J. R., & Reiser, B. J. (1985). The LISP tutor. Byte, 10, 159-175.

Brown, J. S., Collins, A., & Duguid, P. (1989). Situated cognition and the culture of learning. Educational Researcher, 18(1), 32-41.

Burton, R. R., & Brown, J. S. (1982). An investigation of computer coaching for informal learning activities. In D. Sleeman & J. S. Brown (Eds.), Intelligent tutoring systems (pp. 79-98). New York: Academic Press.

Cacioppo, J. T., & Petty, R. E. (1982). The need for cognition. Journal of Personality and Social Psychology, 42, 116-131.

Cacioppo, J. T., Petty, R. E., & Kao, C. F. (1984). The efficient assessment of need for cognition. Journal of Personality Assessment, 48, 306-307.

Cavalli-Sforza, V. (1998). Constructed vs. received graphical representations for learning about scientific controversy: Implications for learning and coaching. Unpublished doctoral dissertation, Intelligent Systems Program, University of Pittsburgh, PA.

Chu, R. W., Mitchell, C. M., & Jones, P. M. (1995). Using the operator function model and OFMspert as the basis for an intelligent tutoring system: Towards a tutor/aid paradigm for operators of supervisory control systems. IEEE Transactions on Systems, Man, and Cybernetics, 25(7), 1054-1075.

Clancey, W. J. (1986). Qualitative student models. Annual Review of Computer Science, 1, 381-450.

Collins, A. (1996). Design issues for learning environments. In S. Vosniadou, E. De Corte, R. Glaser, & H. Mandl (Eds.), International perspectives on the design of technology-supported learning environments (pp. 347-361). Mahwah, NJ: Erlbaum.

Conati, C., & VanLehn, K. (1999). Teaching meta-cognitive skills: Implementation and evaluation of a tutoring system to guide self-explanation while learning from examples. In AIED '99: Proceedings of the 9th World Conference of Artificial Intelligence and Education. Amsterdam: IOS Press.

Connelly, J. W. (1989). An empirical investigation of the effective degrees of feedback content in GIL, an intelligent tutor for programming. Unpublished manuscript, Princeton University, Princeton, NJ.

Connelly, J. (1997). Specialty exam. Cognitive psychology program, University of Pittsburgh. Available: http://www.pitt.edu/~connelly/comps.html

Connelly, J., & Lesgold, A. (1999). Intelligent tutoring systems. In J. G. Webster (Ed.), Encyclopedia of electrical and electronics engineering (Vol. 10, pp. 529-541). New York: Wiley.

Corbett, A. T., & Anderson, J. R. (1990). The effect of feedback control on learning to program with the Lisp tutor. In Proceedings of the Twelfth Annual Conference of the Cognitive Science Society (pp. 796-803). Hillsdale, NJ: Erlbaum.

Corbett, A. T., & Anderson, J. R. (1992). LISP Intelligent Tutoring System: Research in skill acquisition. In J. H. Larkin & R. W. Chabay (Eds.), Computer-assisted instruction and intelligent tutoring systems: Shared goals and complementary approaches (pp. 73-109). Hillsdale, NJ: Erlbaum.

De Corte, E. (1996). Changing views of computer-supported learning environments for the acquisition of knowledge and thinking skills. In S. Vosniadou, E. De Corte, R. Glaser, & H. Mandl (Eds.), International perspectives on the design of technology-supported learning environments (pp. 129-145). Mahwah, NJ: Erlbaum.

Fischer, P. M., & Mandl, H. (1988). Improvement of the acquisition of knowledge by informing feedback. In H. Mandl & A. Lesgold (Eds.), Learning issues for intelligent tutoring systems (pp. 187-241). New York: Springer-Verlag.

Fix, V., & Wiedenbeck, S. (1996). An intelligent tool to aid students in learning second and subsequent programming languages. Computers and Education, 27(2), 71-83.

Gertner, A. S., & VanLehn, K. (2000). Andes: A coached problem solving environment for physics. In G. Gauthier, C. Frasson, & K. VanLehn (Eds.), ITS 2000: Proceedings of the 5th International Conference on Intelligent Tutoring Systems (pp. 133-142). Berlin: Springer-Verlag.

Gott, S. P., Lesgold, A., & Kane, R. S. (1997). Tutoring for transfer of technical competence. In S. Dijkstra, F. Schott, N. Seel, & R. D. Tennyson (Eds.), Instructional design: Vol II: Solving instructional design problems (pp. 221-250). Mahwah, NJ: Erlbaum.

Katz, S., & Lesgold, A. (1993). The role of the tutor in computer-based collaborative learning situations. In S. P. Lajoie & S. J. Derry (Eds.), Computers as cognitive tools (pp. 289-317). Hillsdale, NJ: Erlbaum.

Katz, S., & Suthers, D. (1998). Guiding the development of critical inquiry skills: Lessons learned by observing students interacting with subject-matter experts and a simulated inquiry coach. Paper presented at the American Educational Research Association 1998 Annual Meeting, April 13-17 1998, San Diego, CA.

Koedinger, K. R., & Anderson, J. R. (1993b). Reifying implicit planning in geometry: Guidelines for model-based intelligent tutoring system design. In S. P. Lajoie & S. J. Derry (Eds.), Computers as cognitive tools (pp. 15-45). Hillsdale, NJ: Erlbaum.

Koedinger, K. R., Anderson, J. R., Hadley, W. H., & Mark, M. A. (1995). Intelligent tutoring goes to school in the big city. In AI-ED 95: Proceedings of the 7th World Conference on Artificial Intelligence in Education (pp. 421-428). Washington, DC: Association for the Advancement of Computing in Education.

Kulik, J. A., & Kulik, C.-L. C. (1988). Timing of feedback and verbal learning. Review of Educational Research, 58(1), 79-97.

Legree, P. J., Gillis, P. D., & Orey, M. A. (1993). The quantitative evaluation of intelligent tutoring system applications: Product and process criteria. Journal of Artificial Intelligence in Education, 4(2/3), 209-226.

Lesgold, A. (1994a). Assessment of intelligent training technology. In E. L. Baker & H. F. O'Neil Jr. (Eds.), Technology assessment in education and training (pp. 97-116). Hillsdale, NJ: Erlbaum.

Lesgold, A. (1994b). Ideas about feedback and their implications for intelligent coached apprenticeship. Machine-Mediated Learning, 4, 67-80.

Lesgold, A., Katz, S., Greenberg, L., Hughes, E., & Eggan, G. (1992). Extensions of intelligent tutoring paradigms to support collaborative learning. In S. Dijkstra, H. P. M. Krammer, & J. J. G. van Merriënboer (Eds.), Instructional models in computer-based learning environments (pp. 291-311). Berlin: Springer-Verlag.

Mark, M. A., & Greer, J. E. (1993). Evaluation methodologies for intelligent tutoring systems. Journal of Artificial Intelligence in Education, 4(2/3), 129-153.

McKendree, J. (1990). Effective feedback content for tutoring complex skills. Human-Computer Interaction, 5(4), 381-413.

Means, M. L., & Voss, J. F. (1996). Who reasons well? Two studies of informal reasoning among children of different grade, ability, and knowledge levels. Cognition and Instruction, 14(2), 139-178.

Merrill, D. C., Reiser, B. J., Ranney, M., & Trafton, J. G. (1992). Effective tutoring techniques: A comparison of human tutors and intelligent tutoring systems. The Journal of the Learning Sciences, 2(3), 277-306.

Nathan, M. J. (1998). Knowledge and situational feedback in a learning environment for algebra story problem solving. Interactive Learning Environments, 5, 135-159.

Paolucci, M., Suthers, D., & Weiner, A. (1996). Automated advice-giving strategies for scientific inquiry. In C. Frasson, G. Gauthier, & A. Lesgold (Eds.), ITS96: Proceedings of the Third International Conference on Intelligent Tutoring Systems (pp. 372-381). New York: Springer-Verlag.

Polson, M. C., & Richardson, J. J. (1988). Foundations of intelligent tutoring systems. Hillsdale, NJ: Erlbaum.

Reiser, B. J., Friedmann, P., Gevins, J., Kimberg, D. Y., Ranney, M., & Romero, A. (1988). A graphical programming language interface for an intelligent LISP tutor. In Proceedings of CHI'88, Conference on Human Factors in Computing Systems (pp. 39-44). New York: ACM.

Reusser, K. (1996). From cognitive modeling to the design of pedagogical tools. In S. Vosniadou, E. De Corte, R. Glaser, & H. Mandl (Eds.), International perspectives on the design of technology-supported learning environments (pp. 81-103). Mahwah, NJ: Erlbaum.

SAS Institute Inc. (1999). SAS OnlineDoc®, Version 8. Cary, NC: Author. Available: http://v8doc.sas.com/sashtml/stat/chap23/sect13.htm#idxclu0263

Schneider, D., & Dorans, N. (1999, June). Concordance Between SAT® I and ACT™ Scores for Individual Students. Research Notes (RN-07). New York, NY: The College Board. Available: http://www.collegeboard.org/research/html/rn07.pdf

Schofield, J. W., Evans-Rhodes, D., & Huber, B. R. (1990). Artificial intelligence in the classroom: The impact of a computer-based tutor on teachers and students. Social Science Computer Review, 8(1), 24-41.

Schooler, L. J., & Anderson, J. R. (1990). The disruptive potential of immediate feedback. In Proceedings of the Twelfth Annual Conference of the Cognitive Science Society (pp. 702-708). Hillsdale, NJ: Erlbaum.

Seidel, R. J., & Park, O. C. (1994). An historical perspective and a model for evaluation of intelligent tutoring systems. Journal of Educational Computing Research, 10(2), 103-128.

Shute, V., & Glaser, R. (1990). A large scale evaluation of an intelligent discovery world: Smithtown. Interactive Learning Environments, 1, 51-77.

Shute, V. J., & Regian, J. W. (1993). Principles for evaluating intelligent tutoring systems. Journal of Artificial Intelligence in Education, 4(2/3), 245-271.

Snedecor, G. W., & Cochran, W. G. (1980). Statistical methods (7th Ed.). Ames, IA: Iowa State University Press.

Stasz, C., Ormseth, T., McArthur, D., & Robyn, A. (1989, March). An intelligent tutor for basic algebra: Perspectives on evaluation. In Instructional views of intelligent computer-assisted instruction: Data and issues. Symposium conducted at the annual meeting of the American Educational Research Association, San Francisco, CA.

Suthers, D. (1993). Preferences for Model Selection in Explanation. Paper presented at the Proceedings of the 13th International Joint Conference on Artificial Intelligence (IJCAI-93), Chambery, France.

Suthers, D., Connelly, J., Lesgold, A., Paolucci, M., Toth, E. E., Toth, J., & Weiner, A. (in press). Representational and advisory guidance for students learning scientific inquiry. To appear in K. Forbus & P. J. Feltovich (Eds.), Smart machines in education. Menlo Park, CA: AAAI Press.

Suthers, D., & Jones, D. (1997). An architecture for intelligent collaborative educational systems. In B. du Boulay & R. Mizoguchi (Eds.), Proceedings of AI-ED 97 World Conference on Artificial Intelligence in Education (pp. 87-94). Tokyo, Japan: IOS Press.

Suthers, D. D., Toth, E. E., & Weiner, A. (1997). An integrated approach to implementing collaborative inquiry in the classroom. In Proceedings of the Second International Conference on Computer Supported Collaborative Learning (CSCL'97), Toronto, December 10-14, 1997 (pp. 272-279).

Suthers, D., & Weiner, A. (1995). Groupware for developing critical discussion skills. In J. L. Schnase & E. L. Cunnius (Eds.), Proceedings of CSCL '95: The First International Conference on Computer Support for Collaborative Learning (pp. 341-348). Mahwah, NJ: Erlbaum.

Suthers, D., Weiner, A., Connelly, J., & Paolucci, M. (1995). Belvedere: Engaging students in critical discussion of science and public policy issues. In AI-ED 95: Proceedings of the 7th World Conference on Artificial Intelligence in Education (pp. 266-273). Washington, DC: Association for the Advancement of Computing in Education.

Sweller, J. (1988). Cognitive load during problem solving: Effects on learning. Cognitive Science, 12, 257-285.

Toth, E. E., Suthers, D. D., & Lesgold, A. M. (in press). Mapping to know: The effects of evidence maps and reflective assessments on scientific inquiry skills. Science Education.

Toth, J. A., Suthers, D., & Weiner, A. (1997). Providing expert advice in the domain of collaborative scientific inquiry. In B. du Boulay & R. Mizoguchi (Eds.), Proceedings of AI-ED 97 World Conference on Artificial Intelligence in Education. Tokyo, Japan: IOS Press.

Twidale, M. (1993). Redressing the balance: The advantages of informal evaluation techniques for intelligent learning environments. Journal of Artificial Intelligence in Education, 4(2/3), 155-178.

VanLehn, K. (1988a). Student modeling. In M. C. Polson & J. J. Richardson (Eds.), Foundations of intelligent tutoring systems (pp. 55-78). Hillsdale, NJ: Erlbaum.

VanLehn, K. (1988b). Toward a theory of impasse-driven learning. In H. Mandl & A. Lesgold (Eds.), Learning issues for intelligent tutoring systems (pp. 19-41). New York: Springer-Verlag.

VanLehn, K. (1996). Conceptual and meta learning during coached problem solving. In C. Frasson, G. Gauthier, & A. Lesgold (Eds.), ITS96: Proceedings of the Third International Conference on Intelligent Tutoring Systems (pp. 29-47). New York: Springer-Verlag.

VanLehn, K., Freedman, R., Jordan, P., Murray, C., Osan, R., Ringenberg, M., Rose, C., Schulze, K., Shelby, R., Treacy, D., Weinstein, A., & Wintersgill, M. (2000). Fading and deepening: The next steps for Andes and other model-tracing tutors. In G. Gauthier, C. Frasson, & K. VanLehn (Eds.), ITS 2000: Proceedings of the 5th International Conference on Intelligent Tutoring Systems (pp. 474-483). Berlin: Springer-Verlag.

Veerman, A. L. (2000). Computer-supported collaborative learning through argumentation. Unpublished doctoral dissertation, University of Utrecht, the Netherlands. Available: http://eduweb.fss.uu.nl/arja/Veerman-thesis-pdf.zip

Voss, J. F., & Post, T. A. (1988). On the solving of ill-structured problems. In M. T. H. Chi, R. Glaser, & M. J. Farr (Eds.), The nature of expertise (pp. 261-285). Hillsdale, NJ: Erlbaum.

Wan, D., & Johnson, P. M. (1994). Experiences with CLARE: A computer-supported collaborative learning environment. International Journal of Human-Computer Studies, 41(6), 851-879.

Wenger, E. (1987). Artificial intelligence and tutoring systems: Computational and cognitive approaches to the communication of knowledge. Los Altos, CA: Morgan Kaufmann.

Wertheimer, R. (1990). The geometry proof tutor: An "intelligent" computer-based tutor in the classroom. Mathematics Teacher, 84(4), 308-317.


Footnotes

  1. According to Anderson et al. (1995), "The current ACT-R theory claims that one learns from problem-solving products.... Thus, it does not matter whether all the critical steps occur together in time or not--only that they be represented in the final solution.... Still, we will see that immediate feedback can be beneficial in cutting down on time spent in error states and making it easier to interpret the student's problem solving" (p. 181).

  2. All references to the "Coach" in this paper are to what Suthers and Jones (1997) describe as Belvedere's basic argument pattern coach, which is domain-general. Newer implementations of Belvedere include an additional, "expert-path" coach that can provide advice tailored to specific domains or problems for which additional knowledge has been encoded by hand, in a language that the Coach can recognize (Suthers et al., in press; Toth, Suthers, & Weiner, 1997).

  3. Example databases (available online at http://advlearn.lrdc.pitt.edu/belvedere/materials/) include possible causes for the periodic mass extinctions on Earth, evolutionary questions regarding the iguanas living on the Galapagos Islands, and possible causes for neurological diseases on the island of Guam.

  4. Belvedere includes a primitive, textual "chat" facility that permits synchronous distal communication among multiple users working on the same diagram. This functionality was added to facilitate our early "Wizard of Oz" (Twidale, 1993, p. 162) coaching studies (e.g., Katz & Suthers, 1998), as well as to give collaborating users somewhere to record their correspondence, other than in the diagrams themselves.

  5. In my piloting studies I used a separate computer screen to monitor students' diagramming activity in Belvedere, periodically checking to ensure that box and link primitives were being used appropriately. In sessions with 56 students I noted only a few clear-cut instances in which they were not.

  6. Press releases of these reports are available at http://www.ntsb.gov/Pressrel/1998/980716.htm and http://www.ntsb.gov/Pressrel/2000/000823a.htm

  7. These sorts of measures have been used in several evaluations of other intelligent systems (see Connelly, 1997; Connelly & Lesgold, 1999). While not directly related to the effectiveness of coaching on problem solving, they may give some insight into a user's willingness both to seek coaching on her own and to respond positively to coaching provided automatically.

  8. An example given by one pilot participant was the pop-up "paper clip" character in Microsoft Word.

  9. Some of the advice patterns present the same textual feedback in reference to different diagram elements. In other words, variants of previously presented messages may appear more than once, but subsequent presentations may apply to different diagram elements (highlighted in yellow in the diagram). In cases where users do not attend to the highlighted diagram elements, the feedback will appear to be simply repetitive.

  10. Originally I had planned to construct a Belvedere representation of the brief text from scratch while the participant watched, or to simulate the same using acetate overlays (e.g., show a box with a hypothesis, followed by a box with evidence, followed by a link between them). However, I feared that doing so would bias the participants into following the same order of entering diagram elements during their sessions, thereby affecting my coaching manipulation.

  11. Despite my specific prior instructions on link-drawing, several participants required some degree of help when drawing their initial links. Some also required help deleting boxes or links, which they had not previously been shown how to do. In addition, because my coaching manipulation required me to use an older version of Belvedere, minor software glitches (mostly confined to improper display of And links) periodically required my attention. There were also infrequent occasions when Belvedere, Netscape, or the computer itself would crash, although these were far more frequent in Experiment 2.

  12. A computer crash after one student's diagramming session forced me to present him a printout of the survey instead.

  13. More precisely, this interval is measured from the time stamp of the advice-triggering action in the log file; there are no precise time stamps for the actual appearance of advice messages. With the LISP-based Coach, there is typically a delay of up to a few seconds between the triggering action and the actual presentation of the advice.

  14. This lone Unspecified box contains a statement listed on the web page for the bomb hypothesis, in which a hypothetical research team elaborated on the possibility of a bomb causing the crash. Because the statement was neither a free-standing hypothesis distinct from the bomb hypothesis nor a statement from an eyewitness, I labeled it as Unspecified with a weak For link to the bomb hypothesis.

  15. During my verbal introduction, the student felt compelled to justify his self-reported low SAT Verbal subscore (370) by telling me he was an international student, and during his session he had to ask me the definitions of the words hypothesis, theory, proximity, and deliberating.

  16. Two of the three omitted students (one in each condition) also modified belief strengths and activated the display filter. Log files show that the one in Condition IC (the one for whom the Coach crashed) received feedback on the rule prior to his actions, and although the DC student's coaching log was lost, my handwritten session notes indicate that he did as well.

  17. This feature was disabled for the IC condition in each experiment because it seemed superfluous with the more intrusive advice.

  18. The omitted DC student with the lost coaching log reported having seen it as well, although without the coaching log there is no way to verify that it blinked. The bulb never blinked for the other omitted DC student.