John William Connelly III
August 14, 2001
Robert Glaser, Ph.D.
Jonathan W. Schooler, Ph.D.
Daniel D. Suthers, Ph.D.
Alan M. Lesgold, Ph.D., Committee chairperson
First and foremost, I wish to acknowledge the members of my Doctoral Committee, not only collectively but also individually. I owe thanks to Bob Glaser, for his significant contributions to the literature that motivated my research and for serving also on the committee overseeing my doctoral comprehensive exam, from which my research ideas were drawn; to Dan Suthers, for his tireless work on Belvedere while a colleague at LRDC, for his assistance in conceptualizing and implementing my pilot experiments, and for agreeing to come back here from afar to serve on my committee; to Jonathan Schooler, not only for his efforts in enabling me to finish my degree but also for serving on my master's committee as well, and for being the quintessential role model of an active and inquisitive cognitive researcher; and last but by no means least, to Alan Lesgold, for remaining my primary advisor throughout my entire graduate-school tenure even after he assumed new responsibilities, and for not giving up on me even after I felt I had overstayed my welcome.
In addition to Alan and Dan, I wish to thank the other former members of the Argumentation Group at LRDC throughout its various incarnations. Special thanks go to Arlene Weiner and Eva Toth for compiling our web-based materials and for sharing earlier experiences with Belvedere "in the trenches"; to Sandy Katz for her ideas on coaching and for our discussions about earlier conceptions of intrusive advice-giving; to Violetta Cavalli-Sforza, whose dissertation research with Belvedere informed my own; to the many others responsible for developing Belvedere and its Coach, including Massimo Paolucci, Kim Harrigal, Joe Toth, and especially Dan Jones, who worked well beyond the call of duty to help me resurrect and maintain the hardware and software needed to conduct my research.
Special thanks are due also to Squeak, for making me try to keep things in perspective; to Jenn Gross, for her 11th-hour proofreading assistance and for just being there; to R.B., for literally saving my life as I was preparing to enter the home stretch, and to R.B. Junior for getting me home; to my unlicensed "therapists" along the way (Babe, Rosie, & Mary; Dave & Joey; and Matt & John); to Carlton Hicks for providing the mantras that helped me get through this; to the other few, proud comrades I have had the privilege of calling sum potase; and finally, to the mighty Excalibur, for serenading me through yet a few more all-nighters.
Different theories of learning or instruction underlie the various software systems, from production-system models of individual instruction (e.g., Anderson, Boyle, & Reiser, 1985) to theories of cognitive apprenticeship and situated cognition, often involving groups (e.g., Brown, Collins, & Duguid, 1989). The different theories have led to variations in many aspects of system design, from expert and student modeling components to user interfaces (Clancey, 1986; De Corte, 1996; Polson & Richardson, 1988; Reusser, 1996; Wenger, 1987). Among the most salient differences between the various pedagogical approaches are the content, amount, timing, and control (i.e., user- vs. system-initiated) of the feedback delivered by the system's intelligent agent(s) to users engaged in problem solving activities with the system.
The research described herein was motivated by a literature review comparing various approaches to automated feedback delivery across several different problem solving domains (Connelly, 1997; see also Connelly & Lesgold, 1999). That review, which was drawn from a cross-sectional survey of empirically evaluated ILEs for which evaluation results were readily available, indicated that the majority of existing systems supported well-defined problem solving tasks and domains (see also Seidel & Park, 1994). The present research explores whether certain feedback-delivery strategies that have proven successful in well-defined problem domains (i.e., domains in which problems have a single or a finite set of correct answers, such as mathematics, physics, or computer programming) might be extended into a more ill-defined problem solving context where there are no generally accepted "right answers" (e.g., Voss & Post, 1988).
Specifically, I chose to manipulate and evaluate the impact of the feedback provided by Belvedere, a software system designed to foster argumentation and inquiry skills in users trying to solve ill-defined scientific problems (Suthers et al., in press; Suthers, Weiner, Connelly, & Paolucci, 1995). Belvedere enables users to construct on-screen diagrams representing the relationships between hypotheses and evidence for any number of open-ended scientific debates. The Belvedere system includes an online Coach that continually analyzes the evolving argument diagrams in terms of general principles of scientific inquiry. The Coach is capable of providing feedback to users on demand, in the form of hints or suggestions, to help guide their ongoing inquiry and diagram construction activities. My research in this dissertation focuses on the extent to which feedback from the Coach appears to help Belvedere's users during problem solving, and on whether certain feedback variations used in well-defined domains may also work for the types of domains and ill-defined problem solving skills Belvedere was designed to support.
I begin this dissertation with a brief overview of some customary approaches to providing system-generated feedback. I then focus and elaborate on one of them, the immediate feedback approach, which is used by many successful systems that foster problem solving in well-defined domains. After briefly describing how one might incorporate immediate feedback into a design approach that is more conducive to fostering problem solving in ill-defined domains, I then describe Belvedere and its feedback characteristics in more detail, including findings from some formative evaluations of the system. I then present a brief outline of my experiments, addressing some general issues pertaining to system evaluation where relevant and appropriate (Legree, Gillis, & Orey, 1993; Mark & Greer, 1993; Shute & Regian, 1993; Twidale, 1993), followed by a brief discussion of my experimental measures and predictions. After some brief technical details, I report each experiment in turn, followed by my general discussion of findings.
Approaches to Automated Feedback
Many intelligent instructional software tools are complex systems consisting
of several different components or modules. Most intelligent tutoring systems
(ITSs), for example, are composed of four main components: (a) a domain knowledge or expert module,
which contains the target knowledge of the domain that the system is designed
to teach; (b) a student model, which assesses a student's emerging knowledge
or competence in the target domain by using diagnostic techniques such as
model tracing (matching a student's solution steps to those of an expert
problem solving model; Merrill, Reiser, Ranney, & Trafton, 1992); (c) a
tutoring or pedagogical module, which structures the interaction between the
system and the user, deciding at various points in the interaction which task
material to present and what kind of feedback to provide, if any; and (d) a
user interface, which serves as the means by which the user and the system
communicate (Polson & Richardson, 1988; Reusser, 1996). Although the various
system components are usually interrelated in function and often in features
(Katz & Lesgold, 1993), generally feedback-delivery decisions are coordinated
by the pedagogical component, the relative effects of which tend to vary with
the specifications of its underlying instructional approach (Connelly &
Lesgold, 1999).
Pedagogical styles can differ along such non-orthogonal dimensions as guided versus unguided (e.g., model tracing vs. discovery learning), tutoring versus coaching, and student-directed versus system-directed (De Corte, 1996; Reusser, 1996). Put another way, some pedagogical approaches are more directive (e.g., Anderson et al., 1985), some are noninterventionist (e.g., De Corte, 1996), and some are in between, such as the cognitive apprenticeship teaching methods of scaffolding and fading (Brown et al., 1989; Collins, 1996). Thus, feedback from a system can play any number of roles, from corrective to regulative to informative (Fischer & Mandl, 1988; Wenger, 1987), and it may differ in relative amount and timing (e.g., immediate vs. delayed vs. on demand; Collins, 1996; Kulik & Kulik, 1988; Merrill et al., 1992; Schooler & Anderson, 1990). At one extreme are systems that provide detailed feedback at many points during interactive sessions with their users, while systems at the other extreme may provide no explicit feedback at all, in some cases providing implicit pedagogical support through various interface features instead (Merrill et al., 1992; Twidale, 1993). Some systems may even fade or otherwise vary their default feedback delivery strategies or methods during the course of an interactive session with a user (e.g., Chu, Mitchell, & Jones, 1995; Shute & Glaser, 1990; VanLehn, 1996; VanLehn et al., 2000).
Given the range of disciplines and tasks for which ILEs have been built, it is difficult to compare approaches to ILE design and delivery of system feedback without accounting for the types of skills they support. Clancey (1986) describes how problem solving operators and inference procedures differ between formal, closed domains such as mathematics and natural, open domains such as medical diagnosis. McKendree (1990) suggests that more complex or ambiguous tasks may require a greater degree of informative feedback than more constrained ones, for which more directive feedback often suffices. Others in the field have described the process of learning from an ILE as a four-way interaction of learner style, desired knowledge outcome, type of instructional environment, and subject matter (Shute & Glaser, 1990). After reviewing several different systems that were designed to support problem solving in a variety of domains, we noted that "it is difficult to identify any domain-specific effects of, or any clear preferences between, the various approaches to providing feedback" (Connelly & Lesgold, 1999, p. 539). However, we believed that to be due partly to the overrepresentation of well-defined domains in the literature. Problems in such domains have a constrained set of correct answers, making them amenable to expert and student modeling. Problems in more ill-defined domains, by contrast, are usually not as clear-cut, with multiple solutions (as well as multiple paths leading to those solutions) that can be reached only by using rough heuristics rather than algorithms (Voss & Post, 1988). For these reasons expert and student modeling are often intractable in such domains, giving the pedagogical component of an ILE less to work with in deciding what feedback to present to users. A question posed by the present research is to what extent feedback approaches that have proven beneficial in many well-defined domains can also be of help in a more ill-defined domain.
I turn now to one such approach.
Immediate Feedback
A major issue in the design of interactive learning environments is that of
deciding when a system should provide feedback to its user(s). Many
successful systems provide corrective feedback immediately after their users
make any mistakes. For example, most model tracing tutors work by generating
feedback any time a student's solution path deviates from a path that will
lead to a correct answer (Merrill et al., 1992). One reason for preferring
this approach is to ensure that feedback is delivered in the context in which
it is needed: that of the student's current goal and working memory states
(Anderson, Corbett, Koedinger, & Pelletier, 1995; Corbett & Anderson, 1992).
Another reason to provide corrective feedback immediately is to prevent
students from floundering while trying to recover from lengthy incorrect
solution paths (Anderson, Boyle, Farrell, & Reiser, 1984; Corbett & Anderson,
1992; Gertner & VanLehn, 2000; McKendree, 1990).
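To make the timing of model-tracing feedback concrete, the following Python sketch flags a student step as soon as it leaves every known correct solution path. It is a deliberately simplified illustration and not the mechanism of any tutor cited here: it assumes the correct paths can be enumerated as short lists of step labels, whereas actual model tracers derive them from an expert problem solving model. The names `on_path` and `trace` are hypothetical.

```python
from __future__ import annotations

from typing import Iterable, Sequence


def on_path(partial: Sequence[str], correct_paths: Iterable[Sequence[str]]) -> bool:
    """True if the steps so far are a prefix of at least one correct path."""
    return any(tuple(path[:len(partial)]) == tuple(partial) for path in correct_paths)


def trace(steps: Sequence[str], correct_paths: list[Sequence[str]]) -> list[str]:
    """Replay a student's steps and collect immediate corrective feedback.

    Assumption for this sketch: an off-path step is rejected (it is not added
    to the accepted path), as if the student were asked to retry at once.
    """
    accepted: list[str] = []
    feedback: list[str] = []
    for step in steps:
        if on_path(accepted + [step], correct_paths):
            accepted.append(step)
        else:
            feedback.append(f"Step '{step}' does not lead to a correct answer.")
    return feedback
```

With the correct paths `[["a", "b", "c"], ["a", "d"]]`, the step sequence `a, x, b, c` produces a single message about `x`, delivered before the student can build further on the error. A delayed-feedback variant would instead collect the same messages and present them only after the final step.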
Although used to varying extents in some operative skill tutors (Chu et al., 1995; Legree et al., 1993) and in limited ways in an economics microworld (Shute & Glaser, 1990), immediate feedback approaches dominate some physics tutors (e.g., Gertner & VanLehn, 2000; VanLehn, 1996) and many of the tutors for programming, geometry, and mathematics, including all of the tutors based upon the ACT* theory of cognitive skill acquisition (e.g., Anderson & Reiser, 1985; Anderson et al., 1985; Koedinger & Anderson, 1993b; Koedinger, Anderson, Hadley, & Mark, 1995; McKendree, 1990). Indeed, the ACT* commitment to providing immediate feedback in its tutors is one of the theory's most controversial features (Anderson et al., 1995; Corbett & Anderson, 1992). Although the revised ACT-R theory and its newer tutorial instantiations permit off-path problem solving (Anderson et al., 1995), they still focus students toward correct solution paths, and immediate feedback still plays a major role in the interaction.[1]
However, research has shown immediate feedback to be disadvantageous in certain situations and with particular tasks (Kulik & Kulik, 1988; VanLehn et al., 2000). In one experiment using a modified version of the ACT* group's famous LISP Tutor for programming (Anderson & Reiser, 1985), students who received immediate feedback solved training problems faster than students who received delayed feedback, but they took more time and made more errors on test problems than did delayed-feedback students (Schooler & Anderson, 1990). In addition, delayed-feedback students seemed to be better at planning problem solutions than were immediate-feedback students. The authors argued that the absence of immediate feedback in the delayed condition allowed students to redeploy their working memory resources toward developing secondary skills such as error detection and correction. A study comparing two versions of the GIL tutor (Graphical Instruction in LISP; Reiser et al., 1988) provides further evidence of this: Students who did not receive GIL's immediate model-tracing feedback scored better on a transfer test of program debugging skills than those who did (cited in Merrill et al., 1992).
With even more complex tasks, feedback may be best left for post-problem reflection, when working memory resources are no longer being taxed by immediate problem-solving demands (Lesgold, 1994a; Sweller, 1988). Sherlock II, an ILE for training a complex avionics troubleshooting task, has facilities to support reflective follow-up after problem solving, including goal-related presentations such as intelligent replays of problem solving steps, critiques of those steps, and information about what an expert might have done (Lesgold, 1994b). These capabilities were added to help compensate for the learning opportunities that are precluded by the high cognitive effort expended during problem solving (Lesgold, 1994a; Lesgold, Katz, Greenberg, Hughes, & Eggan, 1992), as well as to coach situations in which students were able to solve the problems but did so in a non-optimal way (Gott, Lesgold, & Kane, 1997).
In short, the value of immediate feedback seems to vary with not only the task but also the desired learning outcomes of the intervention. Nevertheless, for many systems that support well-defined problem solving, the immediate delivery of feedback has proven beneficial in fostering user attainment of the primary skills that the system was designed to teach, based on various achievement measures. Such measures from within the laboratory include shorter time to problem completion, fewer errors committed, less time needed to correct errors, and in some cases higher post-test scores, than appropriate controls (Anderson et al., 1985; Connelly, 1989; Corbett & Anderson, 1992; McKendree, 1990). In some studies laboratory measures were supplemented by classroom achievement measures, ranging from better exam scores and course grades to higher levels of participation than controls (Koedinger et al., 1995; Schofield, Evans-Rhodes, & Huber, 1990; Wertheimer, 1990). Based on system evaluations using these measures, it is commonly accepted that many of these systems "have an enviable track record" (VanLehn et al., 2000, p. 475).
Feedback at What Cost?
Costs to the User
Even when achievement measures indicate clear benefits of providing immediate
feedback, we must also consider the potential costs to the user of
providing that feedback. Do users of intelligent learning environments
desire immediate feedback, even if it is ultimately helpful? In a
series of experiments manipulating feedback delivery by the LISP Tutor
(Corbett & Anderson, 1990, 1992), students rarely requested immediate feedback
from the ITS; most of them wanted feedback only when they were finished coding
a problem (see also Anderson et al., 1995). Students using the PACT Geometry
Tutor have shown a similar resistance to immediate online help (Aleven &
Koedinger, 2000). Although there are many possible reasons for this, it is
clear that "developers must assess not only the effectiveness of a system but
also the likelihood that it will be fully accepted into the culture or domain
of the target audience" (Connelly & Lesgold, 1999, p. 531), in order to avoid
the danger of having the costs of interacting with an ILE outweigh its
benefits for the users (see also Mark & Greer, 1993).
A list of ILE evaluation criteria by Mark and Greer (1993) focuses on measures of both achievement and affect. Affective measures include student motivation, self-esteem measures, attitude measures, and time on task. While most achievement measures are objective, many affective ones are subjective and usually gathered by questionnaire. Thus, affective measures may be open to interpretation and are best used as supplements to the more tangible achievement measures, especially because the two may not always correlate (Corbett & Anderson, 1990; Mark & Greer, 1993). For example, in an evaluation study of the RAND Algebra Tutor (Stasz, Ormseth, McArthur, & Robyn, 1989), despite some modest increases in achievement scores many students felt that the tutor did not help them learn algebra. Moreover, the few students who reported thinking that the tutor did help them learn algebra actually received lower course grades. Therefore, in an effort to obtain a broader sense of the relative costs and benefits to the user of interacting with Belvedere, the research I describe herein is motivated by ILE evaluation studies in which attitude measures supplement more objective performance measures (e.g., Burton & Brown, 1982; Chu et al., 1995; Corbett & Anderson, 1990; Fix & Wiedenbeck, 1996; Reusser, 1996; Stasz et al., 1989; Wan & Johnson, 1994).
Costs to the Developer
While weighing the potential costs and benefits to the user of providing
automated feedback, often system developers must also assess the costs to
themselves relative to both potential and actual benefits for the user
(Suthers et al., in press). The engineering of sufficient domain knowledge
for a computerized tutor to solve problems and diagnose student errors can be
quite an expensive proposition for system developers, although in some domains
the benefits can justify the costs. As stated earlier, most systems that
employ immediate feedback do so in the context of problems with a finite set
of correct answers (Connelly, 1997; Connelly & Lesgold, 1999), and among these
systems, those that do model tracing generate feedback based on deviations of
a student's solution path from a path that will lead to a correct answer
(Merrill et al., 1992). Often the amount of knowledge needed to solve such
problems is constrained enough that all of it can be represented by the
system, allowing it to infer the genesis of a student's incorrect solution
path. However, as problems become less constrained and more ill-defined, and
both the knowledge and the cognitive skills needed to solve the problems
become more complex, the prospects for knowledge engineering and student
modeling become more unwieldy (Aleven & Ashley, 1995). When the knowledge
bases required to tackle such tasks are too large for model tracing to be a
viable option, on what should immediate feedback be based?
A Solution: Balancing Costs and Benefits
The main issue in trying to realize the benefits of immediate feedback in an
ill-defined problem solving domain is: For ill-defined problems such as
scientific inquiry tasks, for which there are usually no right answers and in
which the amount of knowledge that must be brought to bear may be too large to
be represented in the system, on what basis and to what extent can immediate
feedback be of any use?
Feedback on open-ended scientific problems that lack any "correct"
solutions cannot be based on student deviations from an ideal solution path.
Likewise, an ILE that lacks complete knowledge of a problem domain is
limited in its ability to provide domain-specific feedback.
However, to the extent that particular solution processes or components of
solutions applicable across different domains can be identified as ideal or
correct, and to the extent that those general processes can be represented
by the system, feedback could be delivered on that basis (Aleven & Ashley,
1995; Conati & VanLehn, 1999). Naturally, the representation of detailed
domain knowledge in the system would enable it to tailor such feedback to the
specific problem at hand, when appropriate. However, for a system user
struggling with a complex argumentation task, timely feedback about even
domain-general solution processes or components, based on general principles
of scientific inquiry and evidential reasoning, could still be helpful.
We describe elsewhere (Suthers et al., in press) an approach to ILE design that we characterize as "minimalist" AI and education, a way of applying basic AI principles to ILEs while circumventing the aforementioned knowledge representation and student modeling problems. Instead of attempting to build relatively complete knowledge representations, reasoning capabilities, or pedagogical agent functionality characteristic of model tracing tutors, this alternative approach provides ILEs with minimal abilities to respond (in a manner believed to be pedagogically relevant) to selected components of student activities and constructions, such as their basic syntax or some other categorical, easily discernible features. The feedback provided by a minimalist approach may be characterized as "state-based" rather than "knowledge-based" (Nathan, 1988): The software helps students recognize important features of their problem solving state, leaving most of the burden of knowledge representation and management in the hands of the students. This is the approach taken by the developers of Belvedere, which delivers feedback via an online Coach "that can provide reasonable advice with no domain specific knowledge engineering" (Suthers et al., in press). More specifically, Belvedere represents an incremental design approach that seeks to determine the value of low-cost, domain-general feedback alternatives before trying to assess the potential value added by more expensive domain-specific knowledge[2] to the system, an important consideration when working with broad, ill-defined problem domains that are not conducive to model tracing. I turn now to a brief overview of Belvedere.
Belvedere
Overview
Belvedere (Suthers & Jones, 1997; Suthers, Toth, & Weiner, 1997; Suthers &
Weiner, 1995; Suthers et al., 1995) is a networked graphical environment
designed to foster scientific argumentation skills in students of
middle-school age and older. Students use Belvedere's on-screen node and link
primitives (e.g., Hypothesis, Data, For, Against)
to construct graphical argument representations of ill-defined problems in
scientific domains, either individually or collaboratively over a network.
These problems are presented as inquiry exercises in which students are asked
to seek out and map the relationships between relevant hypotheses and
evidence, using actual unsolved scientific "mysteries" as domain content.
Problems can come from any source, although Belvedere's developers have
created specialized, self-contained hypertext databases about several
scientific debates, which are accessible via standard web browsers[3] (see Appendix A for an illustrated excerpt of a typical
problem-solving session using Belvedere and Netscape). Belvedere's graphical
interface was designed to resemble that of familiar computer drawing programs,
so that students can learn to create argument diagrams with only minimal
training. With a web browser and Belvedere's diagram and minimal chat[4] facilities running concurrently,
Belvedere enables students to synchronously discuss and reflect upon their
argumentation processes and products while exploring alternative answers to
the given problems.
Coaching
In the standard Belvedere system, a computerized Coach is available on demand
to provide guidance for developing argument diagrams (Paolucci, Suthers, &
Weiner, 1996). When asked for advice, Belvedere's Coach presents feedback, in
the form of suggestions or questions in a dialog box, about the current state
of the evolving diagram. Primarily the Coach looks for possible deviations of
a user's diagram constructs from those that represent the good argumentation
and inquiry practices embodied in its pattern-matching rules. For example, if
a user appears to be succumbing to confirmation bias (i.e., inclusion of
evidence in favor of a hypothesis but no evidence against it), the Coach will
suggest that disconfirming evidence be considered as well. Belvedere does no
student modeling or diagnosis (e.g., Clancey, 1986; VanLehn, 1988a); it bases
its coaching advice on the structural features of diagrams alone. Belvedere's
Coach generates advice by applying its 20 syntactic rules to features of the
current diagram after each incremental change to it. Appendix B shows the textual message contents of the
advice associated with each coaching rule.
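As an illustration of what a purely structural coaching rule might look like, the Python sketch below checks a diagram for the confirmation-bias pattern described above: a hypothesis with evidence for it but none against it. This is a sketch under stated assumptions rather than Belvedere's implementation. The `Diagram` class, its undirected link triples, and the function name are hypothetical; only the primitive names (Hypothesis, Data, For, Against) come from the text.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Diagram:
    # node id -> statement type, e.g. "Hypothesis" or "Data"
    nodes: dict[str, str] = field(default_factory=dict)
    # undirected links stored as (node_a, node_b, link type), e.g. "For" or "Against"
    links: list[tuple[str, str, str]] = field(default_factory=list)

    def link_types(self, node: str) -> set[str]:
        """Types of all links that touch the given node."""
        return {kind for a, b, kind in self.links if node in (a, b)}


def confirmation_bias(diagram: Diagram) -> list[str]:
    """Return hypotheses that have a For link but no Against link.

    The check inspects link structure only; it never reads statement text.
    """
    flagged = []
    for node, kind in diagram.nodes.items():
        if kind != "Hypothesis":
            continue
        types = diagram.link_types(node)
        if "For" in types and "Against" not in types:
            flagged.append(node)
    return flagged
```

Because the rule never examines what the statements say, the same check applies unchanged to any topic, which is the sense in which such advice requires no additional knowledge engineering.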
The Coach provides advice about abstracted patterns of relationships among
statements, but it does not address the specific contents of these statements.
Its strengths are in its potential for pointing out principles of scientific
inquiry in the context of students' own evidential reasoning, and its
generality and applicability to new topics or domains with no additional
knowledge engineering. These are the qualities that make Belvedere's feedback
state-based as opposed to knowledge-based (Nathan, 1988).
Although the knowledge-blind characteristic of the evidence pattern Coach allows it to work effectively for problems in virtually any scientific domain, it does have one potential drawback: The advice it presents may be irrelevant if the pertinent node primitives are used incorrectly (see also Wan & Johnson, 1994). Examples of such incorrect usage from sessions with an older version of Belvedere include typing a hypothesis into a Theory statement box or drawing an Explains link in one direction when a Supports link in the opposite direction would be more appropriate. We have considered redesigning Belvedere to enforce "correct" usage of primitives via immediate feedback and of coherent argument patterns via delayed feedback (Suthers et al., 1995). However, given the Belvedere project's overall focus on supporting collaborative discussion, such interventionist measures remain unimplemented in the standard Belvedere system (Suthers & Weiner, 1995). Instead, we minimized potential occurrences of this problem by significantly reducing the number of primitives in Belvedere's diagramming palette (cf. Cavalli-Sforza, 1998), including the removal of directional links. With only three types of statements and three types of links from which to choose in the version I used, there are no errors of subtle distinction that would affect coaching relevance. Although coaching-relevant usage errors are still possible (e.g., labeling evidence as a Hypothesis or using a For link instead of Against to show a negative relationship between statements), now they are far less likely to occur.[5]
Early Formative Evaluations
Several formative evaluation studies of an earlier incarnation of Belvedere
were conducted with middle- and high-school students (Suthers et al., 1995).
The first was a laboratory study in which single students worked on a
scientific problem. In the second study, some of the same students came back
to work on a different problem, in pairs, at a single computer. Without
prompting from the experimenters, students would divide the labor between
themselves; one would control the mouse, while the other would use the
keyboard. This often led to censorship, favoring the student who controlled
the keyboard (cf. Wertheimer, 1990). A third laboratory study and a
subsequent school study had dyads work together on a problem from separate
computers. These students had their own input devices, alleviating (but not
completely eliminating) the censorship problem (Suthers & Weiner, 1995), and
their monitors were situated such that students could point to each other's
screens while discussing their shared diagrams.
Most students required little or no assistance from the experimenters to begin using Belvedere. They varied in their willingness to add information by typing it themselves, many preferring to copy text from the online databases provided by the developers. Students used the older Belvedere's many node and link primitives in ways that were inconsistent both with their intended usage and with their own and other students' usage. Although Belvedere's developers concluded that such unintended usage actually served to stimulate collaborative discussions (Suthers & Weiner, 1995), such usage caused some problems for the older Belvedere's automated Coach, motivating the aforementioned reduction in number of primitives.
The redesigned Belvedere was made available in Department of Defense dependent school (DoDDS) classrooms in Germany and Italy, in part to empirically evaluate its Coach (Suthers et al., in press). Data available to us from DoDDS in the form of limited personal observations, third party observations, videotapes, and computer logs indicate that (a) the on-demand Coach was almost never invoked; (b) there were situations where students did not know what to do next in which the Coach would have helped if it had been invoked; and (c) the Coach's advice and its relevance to the students' activities were sometimes ignored, as if not understood. Items (a) and (b) indicate that, in spite of the developers' initial reluctance to interfere with students' deliberations, unsolicited advice is sometimes needed (see also Aleven & Koedinger, 2000; VanLehn et al., 2000).
Intrusive Coaching
Early attempts to address the issue of unsolicited advice involved the use of
a "minimally intrusive" Coach, such that the Belvedere menu icon that is used
to invoke the Coach (a light bulb) would slowly blink on and off when the
Coach had something important to say (Suthers & Jones, 1997). Some of the
coaching rules (6 of 20) were deemed important enough to warrant such an
immediate interruption. Examples include the rules that look for confirmation
bias and the need for discriminating evidence between two hypotheses (see Appendix B). However, in early usability studies of
the so-modified Belvedere, very few students reported ever noticing the
blinking icon. Therefore, for the current research the Coach was further
modified to immediately interject advice (i.e., without waiting for the user
to ask for it). Furthermore, rather than restrict intrusive advice to the
subset of rules for which the minimally intrusive Coach would have blinked,
the new intrusive Coach presents advice whenever it has anything appropriate
to say based on any of its rules.
Because each coaching pattern-matching rule responds to different aspects of the diagram state, only a subset of the 20 rules will apply to a user's diagram at any given time. In order to ensure that the Coach does not present newly applicable advice that could be rendered irrelevant by the user's next diagramming action, each rule has an associated delay factor. This delay factor, defined as the number of subsequent diagramming actions through which the rule must apply (ranging from 0 to 4), governs how long the Coach will wait before it will consider presenting the applicable advice. Furthermore, many rules have assigned priority levels that reflect the relative importance of their associated advice. Advice selection by the Coach is performed by a preference-based quick-sort algorithm, following a mechanism used by Suthers (1993) for selecting between alternate explanations. Preferences take into account factors such as prior advice already given, recency of the relevant diagram changes, and various categorical attributes of the applicable advice (Suthers et al., in press). The result of the algorithm is a sorted list of advice rules that apply (after any applicable delay factors) to the current diagram state. When such a list exists, the modified intrusive Coach will provide immediate feedback to the user by automatically presenting the advice at the top of the list.
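The selection mechanism just described can be illustrated with a minimal Python sketch. It is not the Coach's actual LISP/LOOM implementation. The rule names, the field names, the convention that a larger priority number means more important advice, and the reduction of the preference scheme to a two-part sort key (advice not yet given first, then higher priority) are all assumptions made for illustration. Python's built-in sort is also used in place of the preference-based quick-sort.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    name: str          # hypothetical rule identifier
    delay: int         # subsequent actions through which the rule must keep applying (0-4)
    priority: int = 0  # assumed convention: larger means more important


def update_streaks(streaks, applicable, rules):
    """After one diagramming action, count for how many consecutive
    actions each rule has applied; a rule that stops applying resets to 0."""
    return {
        r.name: (streaks.get(r.name, 0) + 1 if r.name in applicable else 0)
        for r in rules
    }


def select_advice(rules, streaks, already_given):
    """Return the rules that have outlasted their delay factor, best first.
    An intrusive Coach would present the first element, if any."""
    # A delay of d means the rule must survive d actions after first applying,
    # so it becomes eligible once its streak exceeds d.
    eligible = [r for r in rules if streaks.get(r.name, 0) > r.delay]
    # Simplified preferences: advice not yet given first, then higher priority.
    return sorted(eligible, key=lambda r: (r.name in already_given, -r.priority))
```

With a delay of 0, a rule becomes eligible as soon as it first applies; with a delay of 2, it must still apply after two further diagramming actions. A fuller model would add the other preference factors named above, such as the recency of the relevant diagram changes.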
Assessing the Effects of Belvedere's Coaching
Because Belvedere's standard Coach delivers advice only on demand, and because
the data we have on hand (previously discussed in the Early Formative
Evaluations section) show that the on-demand Coach is rarely invoked, it
has been difficult for us to obtain an objective measure of its effectiveness
during sessions with target users.
Coaching has been a focus of at least two studies conducted during Belvedere's
development: one involving the coaching of students by actual human domain
experts using Belvedere's chat facility (Katz & Suthers, 1998), and another
involving offline comparisons of consistency relations in student diagrams to
those of an expert, using a prototype extension of the automated Coach
(Paolucci et al., 1996). However, neither study involved online user
interactions with the automated Coach. The Coach was apparently used to some
extent in a set of external evaluation studies conducted in Europe (Veerman,
2000), but the focus of these studies was on collaboration of dyads and small
groups using Belvedere and on the nature of chat discussions between the
collaborators, with only passing mention of the automated coaching. Another
study conducted in the overseas DoDDS classrooms (Toth, Suthers, & Lesgold, in
press) examined the effects of different representations (Belvedere vs. text)
and of the users' reflective assessments of these representations.
Assessments were based on rubrics that codified evaluation criteria, much as
such criteria are embodied within Belvedere's coaching rules. However, for this study
Belvedere's automated Coach was disabled because there was no counterpart
available for the text-based conditions. In short, we still lack a clear
empirical picture of the Coach's effectiveness within the overall Belvedere
framework, and this deficit was a motivation of the present research.
Two Experiments
A driving purpose of my dissertation research was to investigate the effects
of automated coaching on user performance with Belvedere. Specifically, in an
effort to extend immediate feedback principles into the ill-defined problem
solving supported by Belvedere, I chose to manipulate two aspects of
Belvedere's feedback delivery component (its Coach), and to measure the impact
of its feedback on the problem solving behaviors of individual students
working with Belvedere. I conceived of two experiments that hold constant
every other aspect of the Belvedere software system, manipulating only the
presence and the timing/control of coaching feedback, respectively.
Experiment 1. In an attempt to isolate the effects of coaching in the face of infrequent user requests for it, my first experiment compares students using Belvedere with immediate, intrusive coaching (as described earlier in the Intrusive Coaching section) to students using Belvedere without any coaching at all. By enforcing frequent feedback delivery to one group and denying it to the other, and then comparing users' problem-solving activities between groups, my goal was to isolate the gross, overall effects of immediate coaching feedback on user interaction with Belvedere.
In addition to investigating the role of immediate feedback, this first experiment also serves as an additive design manipulation (Legree et al., 1993), to help determine the value added by automated coaching to the overall Belvedere environment. Other researchers have performed the same manipulation by investigating the effects of removing tutorial feedback entirely from their systems (e.g., comparing versions of an ILE with and without feedback). An informal evaluation of the WEST system, which is a "guided discovery learning" environment (Burton & Brown, 1982, p. 80) for mastering the arithmetic strategies needed to play the simple board game How the West was Won, showed that students who used the system with coaching gained a broader understanding of the different moves in the game, as well as more favorable attitudes toward the game, than did those who used a version without coaching (Burton & Brown, 1982). The aforementioned study using the GIL programming tutor (cited in Merrill et al., 1992) compared students using the standard version of GIL to those using an exploratory version without model tracing feedback. Although the exploratory students scored as well as the model-tracing students on post-tests, they took twice as much time as the standard GIL students to complete the training problems. Thus, the value added by tutorial feedback was to cut the user's learning time in half. In comparison to these and many other systems, the coaching currently provided by Belvedere is relatively unsophisticated. Consequently, it is important to test its educational value and, consistent with the incremental design approach outlined earlier, to add more complex coaching functionality as needed to address deficiencies in the utility of the Belvedere system.
Experiment 2. With the expectation of having established some effects of coaching in the first experiment, I designed a follow-up experiment that compares Belvedere with immediate, intrusive coaching to the standard Belvedere system with on-demand coaching, with an added provision to encourage more frequent advice-seeking by users of the latter. This second experiment investigates the relative effectiveness of feedback timing and control (system- vs. user-initiated); that is, whether the type of feedback Belvedere provides is more useful when provided immediately and automatically, or whether it is best provided only upon users' requests, when they feel help is needed. This experiment permits me not only to compare the relative costs and benefits to Belvedere's users of both feedback approaches, but also to investigate under what circumstances Belvedere's users ask for feedback from the Coach.
Rationale for Single-User Sessions
Although the Belvedere system was designed partly to support collaboration
among users, I chose to conduct each session in both experiments with single
users, for several reasons. Firstly, Belvedere's online Coach does not foster
collaboration explicitly; that is, all of its feedback is worded generically
with respect to cardinality of users, applying equally well to both single and
multiple users. Therefore, for the sole purposes of assessing the general
effects of Belvedere's feedback, there was no principled reason to prefer
collaborative sessions to single-user sessions. Secondly, single-user data
reveal feedback effects with greater sensitivity than would be possible in
multiple-user sessions, because more of each individual user's cognitive
effort and time on task are devoted to explicit problem-solving actions than
to verbalization, input censorship, and extraneous off-task discussion
(cf. Suthers & Weiner, 1995). Therefore, not only does the Coach have more
total input to which it can respond, but the participants also are not
distracted by interactions with other users. Thirdly, because the standard
Belvedere delivers coaching only on demand, even during collaborative sessions
only the individual user who asks for it sees the advice on her screen.
Therefore, because the argument diagram is a shared enterprise between the
collaborating users, it would have been difficult to assess the effects of
coaching on either individual or collective user activities. Fourthly, in
some of our prior sessions with multiple users (Suthers et al., 1995) we
observed students coaching each other, often just before entering
information into their shared diagram (e.g., discussing which Belvedere box
primitive to use for a given statement). Such peer coaching would limit the
opportunities for an intrusive Coach to interject advice. Finally, data from
collaborative sessions with Belvedere are inherently more difficult and
expensive to collect and interpret than are data from single-user sessions.
Except for some limited communications using the somewhat constraining Chat
facility, none of the collaborative activities in which our Belvedere users
have engaged were traceable by a computer. Our prior collaborative sessions
required the use of at least one video camera, with an additional experimenter
manning each camera, to capture user dialogue and gestures. I obviated the
need to collect and analyze video protocols by limiting data collection to
single-user sessions and by using computerized event logs, which captured all
user browsing and diagramming actions in Belvedere, as the primary sources of
data in my experiments.
Overview of the Task
The problem domain. The problem assigned to each participant was a
specific scientific mystery: What caused the crash of TWA Flight 800?
Information about the crash and its possible causes was presented to students
in a self-contained web database. The database is an adaptation of one of
several topic databases initially constructed by former members of the
Belvedere research group (see Footnote 3). I
chose this particular database for my experiments because it is smaller and
more manageable than the other databases, many of which have required multiple
sessions to explore thoroughly in previous investigations of students using
Belvedere. Pilot testing also showed this database to be one of the more
accessible ones to students, making it ideal for sessions of relatively short
duration that focus not on learning domain knowledge but on applying
principles of scientific inquiry.
The original database was constructed during the months following the July 1996 crash, while the investigation was still ongoing. Although the National Transportation Safety Board (NTSB) has since released at least two "final" reports[6] on its investigation, each of which names a most likely cause of the crash, I felt that our web database of several possible causes would still be a viable vehicle by which to test the coaching manipulation in my experiments. My decision was guided by the fact that, in my pilot studies, only 1 student out of 56 reported having heard of the NTSB's first final report when queried during debriefing. However, the NTSB released its most recent report in late August of 2000, shortly before I was to begin data collection for this research. Therefore, as detailed in the Method section of Experiment 1, students were queried at the end of their sessions to ascertain whether they knew about that report and whether it had any influence on their reasoning.
I adapted the original TWA problem database by reducing the grain size of information on each page and by introducing indexing conventions to make it easier to track user browsing. My version of the database consists of 38 individual web pages, accessible from a home page that divides the web browser window into panes with a menu in the narrow left pane and the main page content in the wider right pane (see Figure A1 in Appendix A). The menu pane includes six links to other pages, one of which (labeled "Consider Possible Causes") is a link to an index of four hypothetical causes of the crash (see Figure A6). Each hypothesis listed in this index is a link to a separate web page about a possible cause, and each such page includes two hyperlinks -- one leading to an index of evidence for the hypothetical cause, and one leading to an index of evidence against it (see Figure A1). Each such evidence index consists of hyperlinks to textual bits of evidence, each on its own web page. This level of indexing allows tracking of which hypothesis and which type of evidence a participant is considering at any given time. The hierarchical link structure of the database is represented in Appendix C.
Solving the problem. The problem solving task presented to participants
was to try to "make sense" of the information in the web database (a la
Toth et al., in press) and to try to determine the most likely cause of the
crash based on the information available. As students worked their way
through the database using a web browser (Netscape), they were asked to record
their thoughts in a Belvedere diagram. That is, each time students came
across a hypothesis, a piece of evidence, or any other type of information
they deemed relevant to the problem, they were to insert the information into
Belvedere using the appropriate box primitive (e.g., using a Data box
to contain evidence). They were also asked to indicate the relationships
between the statements they entered by interconnecting them using Belvedere's
link primitives (e.g., using an Against link to show a contradiction).
Students were told that they were not expected to come up with a definitive
answer to the question of what caused the crash; rather, they simply had to
try to sort through the information and determine what they thought was the
most likely cause. No other endpoint was specified, so students proceeded
with the diagramming task until they felt they had satisfied the goal.
Performance Measures
Of primary interest are the apparent effects of coaching feedback on student
activity during the sessions.
Direct effects of coaching were inferred in a number of ways: (a) by analysis
of student activities following the presentation of coaching feedback; (b) by
comparison of final student diagrams to an expert diagram; and, to the extent
deemed necessary after analysis of diagramming session events, (c) by
comparison of verbal argument summaries between students in the different
feedback conditions. Therefore, I used a battery of dependent measures, drawn
from various sources including: chronological records of relevant session
events, culled from time-stamped log files of all user diagramming actions,
browsing actions, and coaching feedback received; users' final Belvedere
diagrams; and written notes and tape recordings of users' end-of-session
verbal argument summaries.
These multiple avenues allowed for
analysis of possible coaching effects on both the processes and the products
of the students' scientific inquiry activities.
Diagram-creation log files were used to determine user actions in a diagram before and after the delivery of selected coaching advice (e.g., to see whether students chose to implement actions recommended by the Coach). Similarly, web browsing log files were used to determine possible coaching effects on user navigation within the hyperlinked problem database. Final student diagrams were coded for numbers and types of elements present and were compared against an expert diagram (described in Experiment 1), using overlay conventions similar to those of Cavalli-Sforza (1998). Such conventions include noting which diagram elements are present in both student and expert diagrams, which expert elements are missing from student diagrams, and which additional student elements are extraneous to the expert diagram.
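The overlay conventions described above amount to three set comparisons, which the following Python sketch illustrates. It assumes that each diagram has already been coded by hand into a set of canonical element labels; the tuple format and the labels in the usage example (H1, D1, and so on) are hypothetical and are not taken from the actual coding scheme. The sketch leaves out the substantive step of matching free-text student statements to expert elements.

```python
def overlay(student, expert):
    """Compare a coded student diagram against a coded expert diagram.

    Each diagram is an iterable of hashable element labels, for example
    ("Data", "D1") for a box or ("For", "D1", "H1") for a link.
    """
    student, expert = set(student), set(expert)
    return {
        "shared": student & expert,      # elements present in both diagrams
        "missing": expert - student,     # expert elements the student lacks
        "extraneous": student - expert,  # student elements absent from the expert diagram
    }


# Usage with hypothetical coded diagrams.
student = {("Hypothesis", "H1"), ("Data", "D1"), ("For", "D1", "H1"), ("Data", "D9")}
expert = {("Hypothesis", "H1"), ("Hypothesis", "H2"), ("Data", "D1"), ("For", "D1", "H1")}
result = overlay(student, expert)
```

Counts of each category (for example, `len(result["missing"])`) could then serve as per-participant measures.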
The verbal summaries were intended as a secondary data source because they can reveal nothing about the direct effects of coaching during a problem solving session with Belvedere. However, they were viewed as a complement to the diagrams and time-stamped event records so that, in the absence of clear and direct coaching effects, they might provide a rough gauge of participants' overall understanding of their inquiry diagram products, which could indirectly reflect coaching effectiveness. At the end of each diagramming session, verbal summary information was collected from each participant in three phases: (a) a free-form summary, without structured prompts and without the Belvedere diagram visible; (b) questions involving structured prompts about possible causes for the crash, again without the diagram visible; and (c) any additions or changes to the summary after redisplay of the diagram.
Affective Measures
As indicated in my cost/benefit discussion above, also of interest are the
effects of coaching on users' attitudes about using Belvedere. Therefore, a
brief battery of end-of-session attitude ratings[7] was collected from each participant,
tailored to the condition to which she was assigned (i.e., only users who
received coaching were asked to rate the Coach). As detailed in the Method
section for Experiment 1, rating items were accompanied by standard nine-point
Likert scales.
The indirect effects of coaching on student attitudes toward using the
Belvedere environment were inferred by comparison of attitude ratings
between students in the different feedback conditions. In the first
experiment, attitude ratings about the Coach from students in the coaching
condition were compared to their overall ratings of Belvedere. Analyses of
these measures are outlined in the following experiment sections.
Covariate Measures
As noted in similar studies, students' problem solving performance in
ill-defined domains can depend on their reasoning ability (Means & Voss, 1996;
Toth et al., in press). Therefore, I sought to include some kinds of ability
measures for possible use as covariates in data analyses. I had considered
several ability assessment measures with varying degrees of directness and
relevance to my experimental task, such as: (a) having participants provide
definitions of common argumentation terms (cf. Cavalli-Sforza, 1998); (b)
presenting a textual debate and asking participants to identify relevant
claims and evidence; (c) asking participants to give an open-ended analysis of
a short, accessible article on a scientific topic; (d) presenting a partial
Belvedere diagram and (after explaining the diagramming conventions in it)
asking participants to identify its strong or weak points; and even (e)
administering a standardized test (e.g., the California Critical Thinking
Skills Test). However, regardless of their directness or relevance, each of
these measures posed the dual danger for the participants of (a) "priming"
them to interact with Belvedere in ways that would have reduced the impact of
my coaching manipulations, and (b) diverting their cognitive effort away from
their actual problem-solving session with Belvedere. Therefore, I felt it
justified to settle for less powerful covariate measures that neither
contaminated nor fatigued my participants. The measures I chose to collect
from students were their current grade point average (GPA), their scores on
the Scholastic Assessment Test (SAT) or American College Testing (ACT)
examination, and their scores on the short form of the Need for Cognition
(NFC) scale (Cacioppo, Petty, & Kao, 1984). The latter scale, which entails
assignment of agreement ratings to a number of propositions, was presented to
participants at the end of their sessions so as not to fatigue them before
their interactions with Belvedere. It was hoped that some combination of
these covariate measures would serve as surrogates for a more direct
assessment of prior inquiry skill.
General Hypotheses
As described earlier, Belvedere's online Coach can recognize and provide
feedback about abstracted patterns of relationships among the statements in a
user's inquiry diagram. After every change to the diagram by the user, the
Coach examines the types and configurations of boxes and links in the diagram,
looking for indications that the user may not be employing good argumentation
or inquiry practices. If it finds any such indications, it may want to
suggest remediation to the user, depending on the severity of the deviations.
Feedback messages on the inquiry patterns monitored by the Coach are presented
in Appendix B.
As indicated earlier, in prior work with Belvedere there were many situations in which users could have been helped if they had sought coaching (Suthers et al., in press). Therefore, to the extent that my experiment participants lacked well-developed inquiry skills or familiarity with at least some of the rules of good scientific inquiry that are embodied within the Coach, I expected to find positive effects of coaching in my primary performance measures. My predictions were as follows:
On the other hand, based on the aforementioned findings from prior work with the standard Coach (Suthers et al., in press) as well as findings from some pilot work with my intrusive Coach, I had reason to predict some variability in my affective measures based on the type and amount of coaching received. As stated earlier, prior work with the standard Coach showed that its advice was often ignored by students as if not understood. In several rounds of piloting with both standard and intrusive coaching, I asked students to inform me when they did not understand the coaching feedback during their Belvedere sessions, and I reiterated this request at the end of each session while asking them for reflective follow-up comments about the Coach. Although most of my pilot-study participants reported that they understood the Coach, a recurring theme among many of their post-session verbal impressions of the intrusive Coach was that it was "annoying". More specifically, many students reported that it offered advice too often, in some cases even before they had had the opportunity to follow up on its earlier advice. Some pilot participants also likened the Coach to similar advice-giving features of some commercial software packages.[8] Therefore, to the extent that students in my intrusive coaching conditions found the advice to be unwanted, I predicted the following:
Technical Details
Belvedere. Belvedere version 2.0.1 was used for both experiments, on
two different platforms: the Windows client for my students and the less
robust Solaris client, which was more susceptible to crashing, for myself.
Only the Inquiry Diagram component of the Belvedere system was used; the Chat
facility was not. In this version of Belvedere, nondirectional links replaced
the directional links of its immediate predecessor, further simplifying the
graphical language of the diagrams. Although a more recent Belvedere version
(2.1) was available at the time I began this research, I chose not to use it
for practical reasons. In the newer version, the Coach is integrated into the
Java-based Belvedere client itself, whereas the older version uses a Coach
that runs as a LISP process on a separate computer. The behavior of the Coach
was much easier to modify in the LISP environment without affecting the rest
of the Belvedere system. The Coach ran in a Common LISP environment (via
Harlequin LispWorks v3.2.2), also using LOOM version 3.0, on a Sun SPARC
workstation running Solaris UNIX. This workstation was the same one on which
I ran the Solaris Belvedere client to monitor student diagram construction.
Yet a third networked computer was required to run my experiments. This third computer, also a Sun SPARC, ran a Java-based "Connection Manager" that allowed diagram updates in the students' Belvedere client to automatically display in my client. Belvedere diagram information was stored on this third computer using the Postgres95 database management system, which maintains information about each element added to or changed in each Belvedere diagram. Finally, this SPARC also ran the web server for the TWA problem database. All browser page accesses by the students were logged to this machine, as were all of their major Belvedere actions.
Netscape. Students used Netscape Communicator for Windows (version 4.74) to browse the self-contained database about the TWA crash. To ensure that I would be able to track participants' web browsing activity from the server side, I reduced the user's Netscape memory and disk caches to zero and configured Netscape to load from the network every time a web page was visited. I also removed from view all of Netscape's toolbars except the navigation toolbar with the directional buttons, to facilitate web navigation for the students.
Internet Explorer. The end-of-session survey presented to each participant was contained within a web form with radio buttons, configured such that survey responses would be sent to me electronically only after responses were indicated for all items. That is, any attempt to submit an incomplete form would result in a browser error message. To make it easier for participants to correct such oversights, I needed to use a separate browser with caching enabled. Therefore the survey was presented using Microsoft's Internet Explorer browser (version 5.00) with standard cache settings.
Design
The experiment employed a single-factor between-subjects design. Participants
were block-randomly assigned to either Condition IC (intrusive coaching) or
Condition NC (no coaching). In Condition IC, participants received intrusive
coaching feedback from Belvedere while constructing their inquiry diagrams.
Coaching feedback was also available on demand in this condition. In
Condition NC, Belvedere's automated coach was disabled, and the button from
which it can normally be invoked on demand was removed from the Belvedere
diagramming interface.
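For readers unfamiliar with block randomization, the following Python sketch shows one common form of it. The block size (one participant per condition in each block), the function name, and the use of a seeded generator are assumptions made for illustration; the text does not specify how the blocks were actually constructed.

```python
import random


def block_randomize(n_participants, conditions=("IC", "NC"), seed=None):
    """Assign participants to conditions in shuffled blocks.

    Each block contains every condition exactly once, in random order,
    so group sizes stay balanced as participants arrive.
    """
    rng = random.Random(seed)
    assignments = []
    while len(assignments) < n_participants:
        block = list(conditions)
        rng.shuffle(block)          # random order within this block
        assignments.extend(block)
    return assignments[:n_participants]
```

Within any completed block the two conditions are equally represented, which limits drift in group sizes over the course of data collection.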
Apparatus
I conducted each session in a single room, with one computer and desk for the
participant and another for myself. Throughout the session the participant
used a Pentium® tower with a 15" color monitor, standard keyboard, and a
two-button mouse; from the other computer (a SPARC workstation) I monitored
the data and coaching logs as well as the participant's diagram construction
using Belvedere's networking capabilities. The room was configured such that
I also was able to monitor the participant's screen surreptitiously from my
desk across the room. The participant's screen layout was configured such
that the Belvedere Inquiry Diagram window filled one vertical half of the
screen and the Netscape Navigator window filled the other half (see Figure A1
in Appendix A for an example setup; the two
applications were reversed in my sessions). This configuration allowed the
user to browse information in Netscape and insert it easily into Belvedere, if
desired, without having to switch between applications. Toward the end of
each session, I used a standard cassette recorder and written notes to capture
participants' verbal argument summaries.
Materials
As discussed in my Introduction, participants were asked to construct their
Belvedere diagrams based on the contents of a self-contained web database
about a scientific mystery: the possible causes for the crash of TWA Flight
800.
Appendix D is a generalized form of the actual
script I used during verbal interactions with participants, including
condition-dependent instructions to the participants and requests for
information from them. For brevity, I modified the script as it appears in Appendix D to apply to both experiments.
An end-of-session survey was given to each participant. The survey consisted
of: (a) the screening questions (mentioned in the Introduction) regarding the
user's familiarity with the NTSB's final report on the crash; (b) a short
battery of attitude rating items regarding the user's interaction with
Belvedere and, for users in Condition IC, its Coach; and (c) the 18-item short
form of the NFC rating scale (Cacioppo et al., 1984). The attitude and NFC
items used a common, nine-point Likert rating scale, with response anchors
modeled after those used by Cacioppo and Petty (1982) with their longer,
34-item form of the NFC scale. My response scale ranged from 1 (very
strongly disagree) to 9 (very strongly agree), with 5 being
neutral and with intervening anchors qualified with strongly,
moderately, and slightly. Two versions of the survey were
constructed, one with and one without the Coach-related attitude statements.
The rating scale and an abbreviated form of the survey (with the Coach items
but without the NFC items) may be found in Appendix
E.
I informed each participant in Condition IC that a computerized Coach would be monitoring her diagram construction, and that periodically it may want to suggest possible ways to improve her diagram. I then told her that the Coach is available on demand by clicking the light bulb icon, and that it may also "speak up" on its own even when she does not click on the light bulb. I explained the appearance of the Coach's feedback (i.e., it appears in a pop-up dialog box and may highlight some diagram elements in yellow) so that the participant would be able to recognize it as such. Participants in Condition NC did not receive any information about coaching, and they used a modified version of the interface with the light bulb icon removed.
After asking the participant if she had any questions, I brought up the TWA problem's home page in Netscape to begin the diagramming session and then retired to my own desk to monitor the session. The participant worked through the problem as she saw fit, with me intervening only to help her with any problems she may have had with her computer or with the Netscape or Belvedere software.[11] I also remained available to answer any questions the participant may have had during the session. The session proceeded with the participant creating and incrementally refining an argument diagram of the problem, with participants in Condition IC receiving periodic, usually intrusive feedback from the Coach. The diagramming session ended when the participant verbally indicated to me that she thought she was done working on the problem, or when the allotted session time neared its end.
At the conclusion of the diagramming session I cleared the participant's screen, removing her Belvedere and Netscape windows from view, and I turned on the tape recorder. I then asked the participant to provide a free-form verbal summary of the argument she had constructed during her diagramming session. When the participant indicated that her summary was complete, I used structured verbal prompts as needed to encourage the participant to evaluate each of the possible causes in turn and to select a "winning" causal hypothesis, if she had not already done so in her free-form summary. I then restored the Belvedere window containing her final argument diagram and asked if she wished to change or add anything to her summary. I followed each significant pause between the participant's utterances with the content-free prompt "anything else?" (a la Means & Voss, 1996), until she indicated that she was finished. I then turned off the tape recorder.
Finally, I launched the Internet Explorer web browser (with standard cache settings) and presented a web form to the participant containing the end-of-session survey[12] (see the Technical Details in the Introduction for more details). After the participant successfully submitted her survey responses, I presented a consent form and asked her for written permission to access her official GPA and SAT scores from the University. I then debriefed the participant, asking about her knowledge of the final NTSB report and about her reactions to the Coach, and I gave her a credit slip for her participation.
Results and Discussion
Data Limitations
Data omission. An unusual client-server network communication error
occurred during one of the 20 coached sessions, resulting in the Coach's
failure to recognize one of the participant's Data statements. I was alerted
to the error by a series of "bogus edge" warning messages in the coaching log
for that session, which began to appear after the participant drew a For link
between that Data statement and one of her Hypothesis statements (the only
other statement to which it was ever linked). The warning message recurred
each time Belvedere redrew that link to the phantom Data statement, which
apparently existed in the participant's diagram but not in the Postgres95
relational database on the server. Therefore, when generating advice messages
the Coach could account for neither that Data statement nor the For link
between it and the Hypothesis. Because the error occurred early in the
diagramming phase of her session (after only 5 of 47 diagram actions and only
1 of 25 total coaching messages), and because a later recreation of her
diagram without the error produced a different series of messages from
the Coach, her data were excluded from all analyses. After her omission, data
from 36 participants (19 coached students and 17 uncoached students) remained.
Covariate data. I was granted access to official student records for all but one participant; therefore, I disregarded student self-reports of GPA and SAT scores in favor of the official figures. One participant denied me access to both his SAT scores and his GPA (he also did not self-report them). Two other participants lacked any SAT scores, but their ACT composite scores were available. Using a recent concordance table published by the College Board (Table 3 of Schneider & Dorans, 1999), I determined the equivalent SAT total scores for their ACT composites and substituted them in covariate analyses where feasible. However, I was unable to determine any equivalent SAT Math or Verbal subscores for these students. Remaining after these limitations were 35 viable GPAs and SAT total scores, as well as 33 viable SAT Math and Verbal subscores. Data from the NFC scale were available for all 36 viable participants.
Downtimes due to software failures. Software crashes occurred during the diagramming phases of two uncoached sessions and one coached session. The respective crash downtimes were 0.95, 6.37, and 4.88 min. In the first case only Netscape was affected, but the other two cases required me to reboot the student's computer, thereby affecting both Belvedere and Netscape. Fortunately, in all cases I was able to restore both the student's most recently browsed web page in Netscape and the student's Belvedere diagram in its entirety. Therefore, as detailed below, the impact of these failures seems to have been limited to inflated session times.
Statistical Notes
An alpha level of .05 was used for all statistical tests. Unless otherwise
indicated, each of my between-group statistical tests involved a separate test
for equality of group variances (e.g., Bartlett's F-test). For cases in which
the equality test indicated a violation of the equal-variance assumption, I
report the more conservative statistics and probability values assuming
unequal variances. In most such cases, adjusted degrees of freedom
using Satterthwaite's approximation (as shown in Snedecor & Cochran, 1980)
were computed as decimal numbers. Therefore, any degrees of freedom expressed
as a decimal number imply unequal variances.
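As an illustration of the unequal-variance procedure described above, the following minimal Python sketch computes the t statistic and Satterthwaite's approximate degrees of freedom from group summary statistics. The function name welch_t is hypothetical, and the sketch is not the software used for the original analyses; the inputs are the And-link means, standard deviations, and group sizes reported under Link types.

```python
import math

def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """Unequal-variance t statistic with Satterthwaite's approximate df.

    Hypothetical helper for illustration; works from summary statistics only.
    """
    v1 = sd1 ** 2 / n1  # squared standard error of the first group mean
    v2 = sd2 ** 2 / n2  # squared standard error of the second group mean
    t = (mean1 - mean2) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

# And links per final diagram: uncoached (n = 17) vs. coached (n = 19)
t, df = welch_t(2.47, 2.40, 17, 0.68, 0.89, 19)
print(f"t({df:.1f}) = {t:.2f}")  # t(19.9) = 2.90
```

The decimal degrees of freedom arise because the approximation weights each group's variance by its own sample size rather than pooling the two variances.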
Covariates: Descriptive Statistics
NFC scale. Each participant's ratings of the 18 items on the NFC short
form (Cacioppo et al., 1984) were averaged into a single composite measure,
equal to the arithmetic mean of the 18 individual ratings, after reverse
scoring of the 9 negatively worded items. The 36 NFC composite scores ranged
from 4.11 to 8.22 with a mean of 6.19 (SD = 1.01, Mdn = 5.97),
indicating on average a modest need for cognition among the students. The
mean NFC score was slightly but not significantly higher for uncoached
students (M = 6.28) than for coached students (M = 6.11).
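The composite scoring described above can be sketched as follows. This is a minimal illustration that assumes ratings on the nine-point scale run from 1 to 9, so that a reverse-scored rating equals 10 minus the raw rating. The function name and the set of reverse-scored item numbers are placeholders for illustration only; the actual negatively worded items are those on the survey form.

```python
def nfc_composite(ratings, reversed_items, scale_max=9):
    """Mean of 18 item ratings after reverse-scoring the listed items.

    ratings: list of 18 integers in 1..scale_max, in item order.
    reversed_items: set of 1-based item numbers to reverse-score.
    """
    if len(ratings) != 18:
        raise ValueError("the NFC short form has 18 items")
    total = 0
    for item_number, rating in enumerate(ratings, start=1):
        if item_number in reversed_items:
            rating = scale_max + 1 - rating  # 9 becomes 1, 5 stays 5
        total += rating
    return total / len(ratings)

# Placeholder item numbers (an assumption, not taken from the survey form)
PLACEHOLDER_REVERSED = {1, 2, 3, 4, 5, 6, 7, 8, 9}

neutral = nfc_composite([5] * 18, PLACEHOLDER_REVERSED)
print(neutral)  # 5.0
```

A respondent who chooses the neutral rating on every item receives a composite of 5 regardless of which items are reversed, which is why 5 serves as the neutral reference point for composite scores.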
GPA. Student GPAs (at the end of the term during which the experiment was conducted) ranged from 1.64 to 3.95, with a mean of 2.85 (SD = 0.58, Mdn = 2.88). There was no significant difference between mean GPAs of coached (2.82) and uncoached (2.89) students.
SAT scores. I noted the students' most recent SAT subscores and, if different, their highest subscores from any prior test administrations. Math subscores ranged from 380 to 740 with a mean of 548 (SD = 87, Mdn = 550), and Verbal subscores ranged from 340 to 750 with a mean of 573 (SD = 92, Mdn = 570). The mean total SAT score including the two equated ACT scores was 1120 (n = 35), one point less than the mean sum of subscores (n = 33). The means of students' highest Math and Verbal subscores were 559 and 586, respectively. Means for uncoached students were slightly but nonsignificantly higher than those for coached students on all SAT measures (see Table 1).
Table 1
Mean SAT Scores by Condition in Experiment 1
| Condition | Statistic | SAT-M | SAT-V | SAT total^a | HiSATM | HiSATV |
| IC | M | 533 | 559 | 1092 | 547 | 573 |
| IC | SD | 102 | 93 | 168 | 90 | 76 |
| NC | M | 565 | 589 | 1153 | 573 | 602 |
| NC | SD | 65 | 92 | 130 | 68 | 83 |
Note. Means did not differ significantly between groups. IC = intrusive coaching; NC = no coaching. ^a n = 35 after converting ACT scores of two students.
Correlations among covariates. Using a two-tailed rejection criterion, analyses of the 33 students for whom all subscores were available showed positive correlations of GPA with SAT Verbal subscores (.52, p < .005) and with SAT total scores (.46, p < .01) but not with SAT Math subscores (.28, p = .11). SAT Math and Verbal subscores were positively correlated with each other (r = .54, p < .005). NFC composite scores were uncorrelated with the other covariate measures.
Median splits. In addition to covariate analyses, I performed median splits on each of the three covariate measures for use in two-way analyses of variance (ANOVAs), crossed with coaching condition, on my dependent measures. Any significant interactions revealed by these analyses are reported throughout. Note that median-split sample sizes were 18 per cell for NFC and SAT totals, but there were 19 students in the low-GPA group and 17 in the high-GPA group. F ratios for these analyses were approximate because of unequal sample sizes, so any interactions significant at or near the .05 level should be interpreted with caution.
Session Durations
I recorded the approximate total duration of each participant's experiment
session, rounded to the nearest 5-minute increment. Session duration
(including the three crash downtimes) ranged from 40 to 95 min, with a mean of
64.72 min (SD = 13.57, Mdn = 65). Subtracting the downtimes
slightly reduced the mean session duration to 64.38 min. Sessions involving
the Coach tended to last longer (M = 67.11, SD = 13.78) than
uncoached sessions (M = 61.33, SD = 12.67), although the
difference was not statistically significant, t(34) = 1.30, p =
.20. Possible reasons for the tendency include (a) addition of the coaching
overview to the verbal introductions in the IC sessions, (b) addition of
Coach-related items to the IC end-of-session surveys, and (c) interaction with
the Coach itself.
I used the time-stamped log files to more precisely determine the duration of the diagramming phase of each session. I measured the time interval between the initial browse of the TWA problem home page and the student's final action, either in Belvedere or in Netscape. Diagramming durations ranged from 19.50 to 64.57 min, with a mean of 40.24 min (SD = 12.62, Mdn = 38.07). Diagram sessions involving the Coach tended to last longer (M = 42.06, SD = 13.97) than the uncoached sessions (M = 38.20, SD = 10.98), possibly due to interaction with the Coach. However, this difference also was not significant (t < 1).
Amount and Frequency of Coaching
The 19 viable coached students received a total of 471 messages from the
Coach, of which 434 (92%) were presented intrusively and 37 (8%) were
presented upon request. Of the 471 messages, 462 were substantive and 9 were
null advice messages (i.e., instances in which the message was "The
coach doesn't have anything to suggest."). A null message is the result of a
user request for advice when the Coach has no list of rules that currently
apply to the diagram. By design all intrusive messages from the Coach were
substantive; therefore, only 28 of the 37 requested messages (76%) were
substantive. The total number of substantive coaching messages displayed to
each coached student ranged from 6 to 43, with a mean of 24.32 (SD =
9.78) and a median of 24.
To determine how often advice was presented during the coached sessions, I
defined two versions of an inter-coaching interval (ICI) measure between
successive advice presentations: one for elapsed time (ICIt) in seconds[13] and one for the number of
diagramming events (ICIe). More specifically, for each substantive coaching
instance I measured the interval between it and the previous substantive
coaching instance (or, in the case of the first instance, the interval between
it and the first diagramming action). The means of the individual student
means for ICIt and ICIe were 91.95 s and 1.87 diagram events, respectively.
When I restricted calculations to intrusive messages only, the per-student
mean ICIe was 1.97 (i.e., on average the Coach presented an intrusive advice
message after every other diagramming action). Both ICI measures show that,
as expected, the frequency of advice presentation by the intrusive Coach was
relatively high on average.
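The two interval measures can be sketched in Python as follows. The event-log format here, a time-ordered list of (seconds, kind) pairs with kind equal to "action" or "advice", is a hypothetical simplification of the time-stamped logs, and the function name is likewise an assumption. The sketch also assumes that the first diagramming action counts toward the first event interval, which the definition above leaves open.

```python
def inter_coaching_intervals(events):
    """Return (time gaps in seconds, diagram-event gaps) between advice.

    events: time-ordered list of (seconds, kind) pairs, where kind is
    "action" (a diagramming action) or "advice" (a substantive message).
    """
    time_gaps, event_gaps = [], []
    reference_time = None  # previous advice, or the first diagramming action
    actions_since = 0
    for seconds, kind in events:
        if kind == "action":
            if reference_time is None:
                reference_time = seconds  # first action starts the clock
            actions_since += 1
        elif kind == "advice" and reference_time is not None:
            time_gaps.append(seconds - reference_time)
            event_gaps.append(actions_since)
            reference_time = seconds
            actions_since = 0
    return time_gaps, event_gaps

# Invented log for one student, for illustration only
log = [(0, "action"), (30, "action"), (45, "advice"),
       (100, "action"), (160, "action"), (170, "advice")]
time_gaps, event_gaps = inter_coaching_intervals(log)
print(time_gaps, event_gaps)  # [45, 125] [2, 2]
print(sum(time_gaps) / len(time_gaps))  # 85.0
```

Averaging each student's gaps first and then averaging those per-student means, as in the figures reported above, keeps students who received many messages from dominating the overall estimate.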
Completeness of Final Diagrams
Total element count. As a first step in trying to determine the effects
of coaching on diagram completeness, I computed some gross, overall "body
count" measures by totalling the numbers of boxes and links in students' final
diagrams (a la Toth et al., in press). The total number of boxes in
each final diagram ranged from 7 to 26 with a mean of 16.31 (SD = 5.26,
Mdn = 17.50). Total number of links ranged from 4 to 36 with a mean of
19.11 (SD = 7.79, Mdn = 17.50). Contrary to my expectations,
the final diagrams of uncoached students tended to have more boxes and more
links (Ms = 17.12 and 19.47, respectively) than did those of coached
students (Ms = 15.58 and 18.79, respectively); however, neither
difference in means was statistically significant.
Box types. I further analyzed counts of diagram elements by the specific type of box. The number of Hypothesis boxes per final diagram ranged from 1 to 7, with a mean and median of 4.00 (SD = 1.41). There was no significant difference between coached and uncoached means (3.95 and 4.06, respectively). The number of Data boxes per final diagram ranged from 5 to 22, with a mean of 11.56 (SD = 4.46, Mdn = 12). Respective means of coached and uncoached students (10.89 and 12.29) did not differ significantly. The number of Unspecified boxes per final diagram ranged from 0 to 5, with a mean of 0.75 (SD = 1.36, Mdn = 0). There was no significant difference between coached and uncoached means (0.74 and 0.76, respectively). I also counted the total number of unlinked boxes (i.e., boxes with no relational link of any kind) in each final diagram, regardless of box type. The number of unlinked boxes per final diagram ranged from 0 to 2, with a mean of 0.31 (SD = 0.58, Mdn = 0). Respective means of coached and uncoached students (0.32 and 0.29) did not differ significantly. Of the 11 total boxes left unlinked by the 36 viable students, 3 were Unspecified boxes, 5 were unique Data boxes, 2 were unique Hypothesis boxes, and 1 was a duplicate of a Data box appearing elsewhere in a very large diagram (scrolling was required to see both copies of the box). Except for the slightly lower coached mean number of Unspecified boxes, each nonsignificant trend among box types ran counter to my predictions. I revisit the issue of unlinked boxes in my discussion of diagram errors below.
Link types. I also coded diagram relations by type of link. The number of For links per final diagram ranged from 1 to 22, with a mean of 10.08 (SD = 5.36, Mdn = 9.50). There was no significant difference between coached and uncoached means (10.37 and 9.77, respectively), but the trend was in line with my predictions. The number of Against links per final diagram ranged from 1 to 17, with a mean and median of 7.50 (SD = 3.51). There was no significant difference between coached and uncoached means (7.74 and 7.24, respectively), but this trend was predicted as well. The number of And links per final diagram ranged from 0 to 7, with a mean of 1.53 (SD = 1.96, Mdn = 1). Coached final diagrams had significantly fewer And links (M = 0.68, SD = 0.89) than did uncoached final diagrams (M = 2.47, SD = 2.40), t(19.9) = 2.90, p < .01. This difference, which accounted for the unexpected trend in total link counts, could be due to specific coaching on And links (or, more to the point, to the lack thereof for uncoached students). Of the 36 viable students, 25 (12 coached and 13 uncoached) used at least one And link during their diagramming sessions. However, one of the 20 coaching rules (conjunct-for-hypothesis?) specifically targets the resolution of ambiguous support relationships involving And links (see Figure 2), and this rule was triggered at least once during 9 of the 19 viable coached sessions. One of the simplest ways for a user to address the advice is to delete the And link in question. Indeed, five of the nine students who were so coached on their And link(s) deleted at least half of them before the end of their diagramming sessions, whereas the uncoached students had no such impetus.
Correlations with covariates. I noted significant or near significant correlations between several of my gross diagram completeness measures and my covariate measures. NFC (n = 36) was positively correlated with number of Data boxes (.36) and number of Hypothesis boxes (.34), ps < .05, and it showed marginal positive correlations with total numbers of boxes (.32, p = .06) and links (.28, p = .10) and a marginal negative correlation with number of Unspecified boxes (-.32, p = .06). GPA (n = 35) showed marginal positive correlations with number of Data boxes and with total number of boxes, rs = .30, ps = .08. SAT total score (n = 35) was positively correlated with both number of Data boxes and total number of boxes (rs = .44, p < .01), and with numbers of For links (.39) and Against links (.42) as well as total number of links (.42), ps < .05.
Figure 2. Coaching on an ambiguous support relation involving an And link.
ANCOVAs on box counts. An ANCOVA on total number of boxes using NFC and SAT totals showed a significant overall effect of covariates (F(2, 32) = 5.76, p < .01), with significant effects of both NFC (t = 2.05) and SAT total (t = 2.72), ps < .05. Adjusted means were 16.15 boxes for coached students and 16.54 boxes for uncoached students, reducing the still nonsignificant (F < 1) unexpected trend favoring uncoached students. An ANCOVA on number of Hypothesis boxes showed a significant effect of NFC, (F(1, 33) = 4.20, p < .05), deflating the difference between coached and uncoached means (3.99 and 4.02, respectively). An ANCOVA on number of Data boxes showed a significant overall effect of the covariates (F(3, 31) = 4.61, p < .01), with significant effects of SAT total (t = 2.28) and NFC (t = 2.33), ps < .05. Adjusted coached and uncoached means were 11.39 and 11.80, respectively, also reducing the unexpected trend favoring uncoached students. An ANCOVA on number of Unspecified boxes showed an almost significant effect of NFC with a negative regression coefficient, F(1, 33) = 3.78, p = .06. There was still no significant effect of coaching condition (F < 1), but the predicted difference between adjusted coached and uncoached means (0.70 and 0.80, respectively) was larger than that of the unadjusted means. ANCOVAs on the number of unlinked boxes did not show a significant effect of covariates using any regression model. However, a two-way ANOVA revealed a significant interaction of coaching condition and median-split SAT total, F(1, 32) = 7.42, p = .01. Uncoached students left more boxes unlinked if they had high SAT totals (M = 0.56) than if they had low SAT totals (M = 0), whereas coached students left more boxes unlinked if they had low SAT totals (M = 0.50) than if they had high SAT totals (M = 0.11). Please note, however, that across conditions only 11 total boxes were left unlinked, so the subsample sizes here are small.
ANCOVAs on link counts. An ANCOVA on total number of links showed a significant overall effect of the covariates (F(3, 31) = 3.45, p < .05), with a significant effect of SAT total (t = 2.45, p < .05) and a marginal effect of NFC (t = 1.80, p = .08). Although there was still no significant effect of coaching condition (F < 1), the adjusted means (19.64 for coached and 18.62 for uncoached students) were in line with my predictions, unlike the unadjusted mean link totals. An ANCOVA on number of For links using NFC and SAT totals showed a significant overall effect of covariates (F(2, 32) = 3.36, p < .05), with a significant effect of SAT total (t = 2.52, p < .05). Although there was still no significant effect of coaching condition (F < 1), adjusted means showed inflated differences in the predicted direction (10.88 for coached and 9.26 for uncoached students). An ANCOVA on number of Against links showed a significant overall effect of the covariates (F(3, 31) = 3.73, p < .05), with a significant effect of SAT total (t = 2.71, p < .05) and a marginal effect of NFC (t = 1.70, p = .10). Although there was still no significant effect of coaching condition (F(1, 32) = 1.58, p = .22), adjusted means showed an even stronger trend in the predicted direction (8.14 for coached and 6.84 for uncoached students). ANCOVAs on the number of And links did not show a significant effect of covariates using any regression model.
Expert Diagram Comparisons
Figure 3 shows an expert diagram for the TWA 800
problem. This diagram is an extension of an earlier expert representation
that was compiled by former members of the Belvedere research group, prior to
my reindexing of the evidence in the TWA 800 problem database. The four
hypothetical causes for the crash appear near the center of the diagram, with
supporting and contradicting evidence surrounding them along the periphery.
The thicker For and Against links in the diagram represent relationships made
explicit in the reindexed hyperlink structure of the database, while the
thinner ones denote relationships only implicitly represented in the
database. The diagram contains a total of 34 boxes and 40 links, with the
following type breakdown: 4 Hypotheses (one for each possible cause), 29 Data,
1 Unspecified box,[14] 20 For links, 13 Against
links, and 7 And links. The expert diagram includes And links solely in cases
where conjunctions of two Data statements have a positive or negative
relationship with one or more of the hypotheses.
Figure 3. Expert Belvedere diagram for the TWA 800 crash problem.
Errors in Final Diagrams
I counted instances of uncorrected diagramming errors in final student
diagrams, relative to the expert diagram where applicable. That is, I
disregarded any extraneous hypothesis boxes in student diagrams and considered
errors relative to the four key hypotheses only. I coded final diagrams for
the following errors: (a) the number of missing hypotheses, (b) the number of
hypotheses subject to confirmation bias (no Against links), (c) the number of
unsupported hypotheses (no For links), (d) the number of unique hypotheses
without any links at all, and (e) the number of unique data without any links.
Error counts per condition are shown in Table 2 along
with total errors per condition, with student subsample sizes shown in
parentheses. Note from the total-errors column that slightly fewer uncoached
students (8 vs. 9 coached) accounted for a higher number of uncorrected errors
(20 vs. 16) in their final diagrams.
Table 2
Final Diagram Error Counts by Condition in Experiment 1
| Condition | Hyps Missing | Hyps C.Bias | Hyps Unsupp. | Hyps NoLinks | Data NoLinks | Total Errors |
| IC | 7 (4) | 2 (1) | 5 (4) | 0 (0) | 2 (2) | 16 (9) |
| NC | 9 (5) | 6 (4) | 2 (2) | 0 (0) | 3 (2) | 20 (8) |
Note. Condition subsample sizes appear in parentheses. Proportional means did not differ significantly between groups. IC = intrusive coaching; NC = no coaching.
There were no significant differences between groups on proportional error counts. There were significant or near significant negative correlations between the covariates and some of the error counts: SAT total scores with total errors (-.42, p = .01), number of unsupported hypotheses (-.33, p = .05), and number of missing hypotheses (-.31, p = .07); GPA with number of unlinked data (-.34, p = .04) and total errors (-.25, p = .15); and NFC with number of missing hypotheses (-.29, p = .09). However, although ANCOVAs did show significant effects of covariates on some of the error measures, none of the adjusted proportional means showed significant between-group differences. The ANCOVA results that came closest to reaching significance were confirmation bias count per student (F(1, 30) = 1.56, p = .22), with adjusted proportional means of 0.11 coached and 0.36 uncoached, and total error count per student (F(1, 31) = 1.40, p = .25), with adjusted proportional means of 0.80 coached and 1.27 uncoached.
Although the data in Table 2 reflect nonsignificant trends, all trends were in the predicted direction (i.e., more errors committed by uncoached students than by coached students) except for one: Coached students tended to have more unsupported hypotheses than did uncoached students. However, this measure appears to be linked to the count of missing hypotheses. Of the nine students who failed to include one or more of the key hypotheses, the most common omission was the HE hypothesis (n = 7), which was the most underrepresented hypothesis in the database with regard to supporting and disconfirming evidence. This hypothesis also accounted for five of the seven instances of unsupported hypotheses present in student diagrams. Thus, it appears that many students either omitted the HE hypothesis from their diagrams or included it but left it unsupported.
Diagram quality. Based on the diagramming errors noted above, I defined a general categorical measure of final diagram quality: A student's final diagram was classified as adequate if it included (a) all four key hypotheses, (b) at least one piece of supporting evidence linked to each key hypothesis, and (c) at least one piece of contradictory evidence linked to each key hypothesis. Any diagram not meeting all three criteria was classified as inadequate. Of the 36 viable students, 20 created adequate diagrams and 16 did not. The 20 adequate diagrams were evenly divided among conditions (10 coached and 10 uncoached), and the 16 inadequate diagrams were nearly so (9 coached and 7 uncoached). Therefore, there was no obvious effect of coaching on overall diagram quality per my general definition of adequacy.
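The adequacy criteria above can be expressed as a short Python sketch. The diagram representation, a mapping from each hypothesis label to its counts of For and Against links, is a hypothetical simplification and not the format of the Belvedere database; the function names are also assumptions. The labels B, M, MF, and HE are the abbreviations used in this chapter for the four key hypotheses.

```python
KEY_HYPOTHESES = ("B", "M", "MF", "HE")

def code_errors(diagram):
    """Count missing, one-sided, and unsupported key hypotheses.

    diagram: dict mapping a hypothesis label to its link counts, e.g.
    {"B": {"for": 2, "against": 1}}. A key hypothesis absent from the
    mapping is treated as missing from the diagram.
    """
    errors = {"missing": 0, "confirmation_bias": 0, "unsupported": 0}
    for label in KEY_HYPOTHESES:
        counts = diagram.get(label)
        if counts is None:
            errors["missing"] += 1
            continue
        if counts["against"] == 0:
            errors["confirmation_bias"] += 1  # no Against links
        if counts["for"] == 0:
            errors["unsupported"] += 1  # no For links
    return errors

def is_adequate(diagram):
    """Adequate only if no key hypothesis is missing or lacks either link type."""
    return all(count == 0 for count in code_errors(diagram).values())

# Invented diagrams for illustration only
complete = {h: {"for": 1, "against": 1} for h in KEY_HYPOTHESES}
without_he = {h: complete[h] for h in ("B", "M", "MF")}
print(is_adequate(complete), is_adequate(without_he))  # True False
```

Extraneous hypotheses are ignored by this sketch because only the four key labels are examined, mirroring the decision to consider errors relative to the key hypotheses only.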
Distinct Coaching Effects on Diagramming
The general lack of significant between-group differences on final diagramming
measures could be due to the fact that uncoached students can self-correct
many of the diagramming errors flagged by the Coach (e.g., confirmation bias
and unsupported hypotheses) simply by browsing the entire database and by
entering and linking information as it is encountered. Indeed, the reindexed
link structure of the database makes it apparent that evidence exists both for
and against each hypothesis. Therefore, I sought to isolate more distinct,
local reactions to coaching that might better discriminate between coached and
uncoached student performance. One particular coaching rule seemed like a
good candidate for this purpose, because its associated advice recommends a
diagram action that average users probably would not think to do on their own.
The coaching rule, attend-to-discrepant-evidence (see Figure 4), advises users to weigh the relative strength of
evidence for and against a hypothesis and to modify the default neutral belief
strength assigned to each linked Data box (see also Appendix B). The coaching feedback also advises users
to toggle a diagram display filter ("Show Strength"), which is off by default,
to show the belief levels of all constructs in the diagram. The stronger the
assigned belief level, the thicker the outline of the box or link will appear
with the Show Strength filter turned on, as in Figure
4. Although the option to assign a non-default belief strength is
presented any time a user creates a new statement box (see Figure A3 in Appendix A) and, therefore, could become salient to
observant users even without coaching, the option to activate the display
filter is "hidden" within the Filters menu at the top of the Belvedere
window. Therefore, unless a user were curious enough to explore the Belvedere
menu options (which were not discussed in the verbal introduction), coaching
on this rule would be the only way the user could find out about it.
Figure 4. Coaching that recommends a non-obvious diagramming action.
Coaching Effects on Web-Browsing
Of the 36 viable participants, only 17 (9 coached and 8 uncoached) browsed all
38 pages of the TWA database at least once. The other 19 students (10 coached
and 9 uncoached) skipped from 1 to 16 pages each, with a mean of 4.26 skipped
pages (SD = 4.01, Mdn = 3). Interestingly, within just the
reduced subsample of page-skippers, coached students skipped significantly
more pages (M = 6.20, SD = 4.57) than did uncoached students
(M = 2.11, SD = 1.69), t(11.6) = 2.64, p < .05.
Within the complete sample of 36 students, the difference between coached
(M = 3.26, SD = 4.53) and uncoached (M = 1.12, SD
= 1.62) page skips was almost significant, t(23.0) = 1.93, p =
.07. However, I noted a strong negative correlation between number of skipped
pages and SAT total score, r(33) = -.49, p < .005. Page skips
also had slight negative correlations with the other two covariates, NFC
(-.12) and GPA (-.05). An ANCOVA using all three covariate measures showed a
significant overall covariate effect, F(3, 31) = 3.67, p < .05.
The covariate effects reduced the between-group difference in skipped pages
for the complete sample, with adjusted group means of 2.89 for coached and
1.49 for uncoached students (F(1, 32) = 1.89, p = .18). The
covariates had no significant effect for the reduced subsample.
Of the 38 total pages in the database, 26 were skipped by at least one student. The most commonly skipped page (n = 9) was a parenthetically referenced government meeting that appeared below the four possible causes in the hypothesis index (see Figure A6 in Appendix A). One coached student skipped both the MF and HE hypothesis pages, and another coached student skipped the B hypothesis page; they therefore also skipped the respective evidence sub-pages for and against these hypotheses, possibly contributing to the higher number of page skips among coached students. None of the uncoached students skipped any hypothesis pages, raising the possibility that the frequent advice may have annoyed the coached students into opting out of the problem early. Among all students who browsed each of the four hypothesis pages (17 in each condition), more skipped the indexes of evidence against them than skipped the indexes of evidence for them, for all but the HE hypothesis (respective ns were 6 vs. 1 for B, 4 vs. 0 for M, 1 vs. 0 for MF, and 1 vs. 1 for HE), illustrating the possible tendency toward confirmation bias noted in the diagram errors. Unlike in the diagrams, however, this tendency did not appear to be stronger among uncoached students; of the 12 students who skipped an evidence index against at least one hypothesis, 7 were coached and 5 were uncoached.
Attitude Ratings
Ratings of Belvedere. On the end-of-session survey all participants
rated six statements about their experiences with Belvedere, on the same
nine-point Likert scale used for the NFC items at the end of the survey (see
Appendix E). Two of the items (B2 and B5) were
negatively worded to attenuate response bias, much like Cacioppo and Petty
(1982) did for their NFC scale. After reverse-scoring of those items, the
respective mean ratings for the six statements (B1 through B6) were 6.50,
7.41, 7.11, 6.75, 6.39, and 6.69. Median ratings ranged from 6.50 to 8.00,
all above the neutral rating of 5 and all in the moderately favorable range of
the scale. The mean composite Belvedere rating, defined as the arithmetic
mean of the six individual statement ratings, was 6.81 (SD = 1.26,
Mdn = 7).
I predicted that my uncoached students would report more positive attitudes toward Belvedere than would my coached students. Although the composite Belvedere ratings of the 17 uncoached students (M = 7.07, SD = 0.74) were nearly a half point higher than those of the 19 viable coached students (M = 6.58, SD = 1.58), the difference was not significant, t(26.1) = 1.21, p = .24. However, on individual Belvedere item B3 ("Belvedere helped me keep track of the various pieces of information relevant to the problem"), uncoached students did report significantly higher ratings (M = 7.77, SD = 1.09) than did coached students (M = 6.53, SD = 2.34), t(26.1) = 2.07, p < .05. Between-group differences in mean ratings for four of the other five items were in the predicted direction (ranging from 0.13 to 0.61 in favor of uncoached students) but did not approach statistical significance. The only exception to the trend was statement B2 ("I found Belvedere to be difficult to use"), which received a very slightly less favorable rating (after reverse-scoring) from uncoached students (M = 7.41) than from coached students (M = 7.42).
Student ratings of Belvedere did not differ significantly on the basis of diagram quality (as defined earlier under Errors in Final Diagrams). Composite ratings of students with adequate diagrams (M = 7.13, SD = 0.96) tended to be higher than those of students with inadequate diagrams (M = 6.42, SD = 1.51), t(24.3) = 1.63, p = .11. Statement B2 also tended to receive higher ratings from students with adequate diagrams (M = 7.80, SD = 1.15) than from those with inadequate diagrams (M = 6.94, SD = 1.69), t(34) = 1.82, p = .08. No other differences approached statistical significance, and there were no significant interactions between diagram quality and coaching condition.
Ratings of the Coach. In addition to the six Belvedere items, students in Condition IC rated six items about the Coach, three of which were reverse-scored (see Appendix E). The 19 viable coached students' respective mean ratings for the six statements (C1 through C6) are shown in Table 3. Note the higher variability in mean and median Coach ratings in comparison to the Belvedere ratings. Note also that only two statements, C4 ("The feedback I received from the Coach was easy to understand") and the reverse-scored C5 ("The Belvedere system would be better off without the Coach"), received favorable ratings, and only slightly favorable at that. Also note the unfavorable mean composite rating (the arithmetic mean of the six Coach-related statements). None of the Coach-related mean ratings differed on the basis of diagram quality.
Table 3
Attitude Ratings of Coach-Related Statements in Experiment 1
| | C1 | C2 | C3* | C4 | C5* | C6* | Composite |
|---|---|---|---|---|---|---|---|
| M | 4.26 | 4.47 | 2.26 | 5.90 | 5.47 | 4.11 | 4.41 |
| SD | 2.71 | 2.61 | 1.76 | 2.13 | 2.53 | 2.98 | 2.03 |
| Mdn | 4.00 | 4.00 | 1.00 | 6.00 | 6.00 | 3.00 | 4.00 |
Note. * Reverse scoring was used on this item.
Ratings correlations. As predicted, the coached students' composite Belvedere ratings were highly positively correlated with their composite Coach ratings, r(17) = .74, p < .0005 (one-tailed). Using one-tailed rejection criteria, composite ratings of the Coach were correlated with each individual Belvedere item rating at the .05 level or better, and composite Belvedere ratings were correlated at the .01 level or better with ratings of each individual Coach item except for C3 ("Often I found the feedback from the Coach to be repetitive"), r(17) = .18, p = .23. As shown in Table 3, this statement had by far the most unfavorable mean rating of the six Coach-related statements. Most of the coached students (10 out of 19) gave this statement the strongest possible agreement rating of 9, which reverse-scored to 1 as indicated by the median rating in Table 3. Given the disparity in mean ratings between C3 and the Belvedere composite, the lack of a significant correlation for this statement is not surprising. Although the second least favorable mean rating among the Coach-related statements went to C6 ("I found the Coach to be annoying"), it is reassuring to see that not all students were annoyed by the repetitive feedback of the intrusive Coach.
Verbal Summaries
Having not found as many significant between-group differences as expected on
my primary dependent measures, I turned to my secondary data source, the
verbal end-of-session summaries. Because the groups did not differ
significantly on most diagram completeness measures, I focused on the phase of
the verbal summaries that I surmised could have the highest payoff: additions
to summaries following diagram redisplays. I predicted that coached students
might be less likely to need to add to their summaries, because they would
remember more of their diagram content from having paid closer attention to
the specific referents of the frequent coaching they received. I coded the
third phase of each verbal summary (the part after redisplay of the final
Belvedere diagram) for the number of statements uttered and for the number of
relations (For, Against, or even And) that were either stated or strongly
implied. Students added 0 to 7 statements to their summaries, with a mean of
1.47 statements (SD = 1.83, Mdn = 1). As predicted, uncoached
students tended to add more statements (M = 1.88, SD = 2.29)
than did coached students (M = 1.11, SD = 1.24); however, this
difference was not significant, t(24.1) = 1.25, p = .22.
Students mentioned 0 to 6 relations in their additions to their summaries,
with a mean of 1.19 relations (SD = 1.65, Mdn = 0). Uncoached
students tended to add more relations (M = 1.41) than did coached
students (M = 1.00), but this difference also was not significant
(t < 1). Not surprisingly, both addition measures were highly
intercorrelated, r(34) = .74, p < .0001.
There were significant or near significant positive correlations between the covariates and both addition measures, ranging from .24 (p = .16) to .38 (p = .02). However, ANCOVAs for both measures showed no significant differences between groups (Fs < 1), with adjusted means slightly less divergent than unadjusted means. Adjusted means for number of statements added were 1.28 for coached and 1.71 for uncoached students. Adjusted means for number of relations added were 1.13 for coached and 1.29 for uncoached students. There was a barely significant interaction of coaching condition and median-split NFC in a two-way ANOVA on number of added statements, F(1, 32) = 4.39, p = .044. Coached students added more statements to their summaries if they had a high NFC (M = 2.00) than if they had a low NFC (M = 0.27), whereas uncoached students added more statements if they had a low NFC (M = 1.57) than if they had a high NFC (M = 1.30). This evidence, although weak, could be suggestive of the role of NFC on attention to detail in the diagramming task.
There were tendencies for students with adequate diagrams to include more statements (M = 1.80, SD = 1.99) and relations (M = 1.60, SD = 1.67) than students with inadequate diagrams (respective Ms = 1.06 and 0.69, SDs = 1.57 and 1.54). This is not surprising, given that those with inadequate diagrams had less to report in their summaries than did those with more fully developed diagrams. However, neither difference in means was statistically significant (respective ts = 1.21 and 1.69, ps = .23 and .10). There were no significant interactions between diagram quality and coaching condition.
Lag Times Following Advice Delivery
The dearth of significant between-group differences on many of my performance
measures has another possible explanation: Coached students may not have been
reading the advice presented to them. Indeed, many coached students admitted
during debriefing that they read only the first few advice presentations and
thereafter simply clicked the "Close" button any time a coaching dialog
appeared. However, I was not systematic about asking coached students how
often they actually read the advice. Unfortunately, there is no way for me to
determine exactly how long a coaching dialog box was even displayed on the
screen, much less how long the student may have been attending to it.
However, I thought I could glean a rough idea of how long each coaching
message was processed by measuring the elapsed log time between the action
that triggered the coaching and the following logged action, be it browsing or
diagramming.
This post-coaching lag time measure is not perfect, for several reasons. Firstly, due to the delay between diagram updates and advice presentations, a facile student's subsequent action may have actually preceded the coaching associated with the previous, triggering action, making it appear as if the coaching had been ignored. Secondly, if the action immediately following delivery of coaching is the creation (or textual update) of a statement box, its time-stamp corresponds to the time when the new (or updated) box was placed in the diagram, not to the time when the Add (or Edit) box dialog was opened. Therefore, lag times for such actions would be artificially inflated, especially for slow typists, and would not reflect time spent processing advice. Thirdly, students sometimes left coaching dialogs open while they moved boxes around in their diagrams or browsed the database, either for later review or to simply get the dialog "out of the way" until they finished what they were doing. In either case, a short or long lag time may not reflect time actually spent processing the coaching feedback. Finally, lag times may not properly account for downtimes during software crashes, although I scoured my notes and logs for the crashes of which I was aware and found only one instance of a crash immediately following a coaching delivery. However, these imperfections aside, I felt the lag time measure could provide at least a general idea of how much time coached students spent reading the advice.
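The lag measure described above can be sketched as follows. This is only an illustration: the log format (a list of timestamp and flag pairs) and the name `post_coaching_lags` are assumptions, since the actual Belvedere log layout is not reproduced here.

```python
from typing import List, Tuple

# Hypothetical log entry: (timestamp in seconds, did this action trigger coaching?)
LogEntry = Tuple[float, bool]

def post_coaching_lags(log: List[LogEntry]) -> List[float]:
    """Elapsed time from each coaching-triggering action to the next logged action."""
    lags = []
    # Pair every entry with its successor; a triggering action that is the
    # last entry in the log has no successor and therefore yields no lag.
    for (t, triggered), (t_next, _) in zip(log, log[1:]):
        if triggered:
            lags.append(t_next - t)
    return lags

example_log = [(0, False), (10, True), (25, False), (40, True), (44, True)]
example_lags = post_coaching_lags(example_log)
```

Note that this sketch inherits every imperfection listed above: it cannot tell whether the interval was spent reading advice, typing in a dialog, or doing something else.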
For each of the 19 viable coached sessions, I measured the lag time in seconds following each advice delivery. In cases where the advice-triggering action was the final action taken by the user during the diagramming session (n = 3), the subsequent logged action was my closing of the Belvedere software after debriefing; therefore the extreme lag times for these cases (1079 to 3715 s) were omitted from analyses. The remaining 468 lag times ranged from 1 to 86 s, with a mean of 20.04 s (SD = 16.20) and a median of 15 s. The distribution of lags was positively skewed (coefficient of 1.50), with a kurtosis measure of 2.31. An alarming 14% of the lag times (67 of 468) were of 5 or fewer seconds, arguably too short a time span for most students to have read even the briefest of the Coach's feedback messages. Fully one third of the lag times (154) were 10 s or less. Visual inspection of the distribution revealed a possible bimodal characteristic to it, with modes of approximately 4 or 5 s and 14 or 15 s. This suggests the possibility that the lag times may represent two different distributions: one for advice messages that are ignored, and one for those that are read. I computed a bimodality coefficient of 0.610 using a formula from the manual of a popular statistical analysis software package (SAS Institute, 1999). Because this coefficient was somewhat higher than the criterion of 0.555 listed in the manual (the maximum value is 1.0), there is some evidence of possible bimodality in the lag time distribution. However, despite a search of several statistical references I was unable to locate an appropriate significance test.
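As a rough check on the bimodality coefficient, the sketch below applies the formula given in the SAS documentation, b = (skewness^2 + 1) / (kurtosis + 3(n - 1)^2 / ((n - 2)(n - 3))). It assumes that the reported kurtosis of 2.31 is excess kurtosis, which is how SAS reports it; the function name is illustrative.

```python
def bimodality_coefficient(skewness: float, excess_kurtosis: float, n: int) -> float:
    """SAS-style bimodality coefficient from sample skewness and excess kurtosis."""
    # Small-sample correction term; approaches 3 as n grows.
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return (skewness ** 2 + 1) / (excess_kurtosis + correction)

# Values reported above for the 468 retained lag times.
b = bimodality_coefficient(skewness=1.50, excess_kurtosis=2.31, n=468)
print(f"bimodality coefficient = {b:.3f}")
```

Under these assumptions the result agrees with the reported 0.610 to three decimal places. The 0.555 criterion corresponds to the value expected for a uniform distribution, so exceeding it is suggestive rather than a formal test.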
Recall that a small proportion of the advice messages (8%) were in response to user requests. While the mean lag for the 431 intrusive messages was 19.02 s (SD = 15.23), the mean lag for the 37 requested messages was a much higher 31.89 s (SD = 21.81), and despite the marked disparity in sample sizes and variances the difference was highly significant, t(39.1) = 3.52, p < .005. After factoring out lag times for the 9 null advice requests, the mean for the requested messages rose to 34.93 s (SD = 21.74) and differed even more significantly from the intrusive mean (t(28.7) = 3.81, p < .001), despite an even smaller sample size. These findings suggest that students may have spent significantly more time processing requested advice than the intrusive advice that was thrust upon them.
Summary
The intrusive Coach presented advice frequently in this experiment, with the
average student in Condition IC receiving feedback from the Coach every minute
and a half or after every other diagramming action. On the error count
measures, intrusive coaching appeared to have many effects in the predicted
direction, although many of them were not statistically significant. Some
unexpected trends were noted as well, but they were also nonsignificant.
Unpredicted significant findings were that coached students had fewer And
links and skipped more pages of the web database. Local reactions to a unique
advice message were noted, showing that at least some students responded
positively to coaching. However, lag time analyses raised the possibility
that students may not have been attending to much of the coaching feedback,
especially the feedback that was presented intrusively (which accounted for
92% of all coaching presented). I revisit this issue in Experiment 2.
The latest existing version of the on-demand Coach was too dissimilar from the intrusive version I used in Experiment 1; for example, it lacked the delay factors that were added during early piloting with the intrusive Coach. In order to ensure that both Coaches for Experiment 2 would be as similar as possible, I modified the LISP source code for the intrusive Coach used in Experiment 1 and created a separate, nonintrusive version for the on-demand condition. The new on-demand Coach was identical to the intrusive Coach except for the following: (a) it presented advice only when the user requested it by clicking on the light bulb icon (or on the Next Idea button after an initial advice request); and (b) the light bulb would blink when the Coach had pending advice that was deemed important enough to warrant a minimal intrusion, as described in my Introduction under the section on Intrusive Coaching.
To help counteract the old problem of on-demand Coach users never asking for coaching, I used a periodic reminder prompt in the on-demand condition of this experiment. The prompt, a series of audible beeps, was chosen so as to be as minimally intrusive as possible. To that end, I controlled the prompt signals myself, requiring only brief verbal acknowledgments from the participants. As for when and how often to issue the prompts, I decided to use a time-based rather than an event-based criterion, for two reasons. First, because users vary widely in the speed with which they perform diagramming actions in Belvedere (e.g., some are more facile with the keyboard and mouse than others), during diagramming sessions of equal length faster users would log many more diagram events than would slower users, thereby inflating the relative frequency with which they would receive the reminder prompts under an event-based criterion. A fixed time-based criterion seemed more in keeping with the goal of my on-demand condition, which was to see whether coaching could be helpful without the annoyance of the intrusive condition. Second, while running Experiment 1 (as well as earlier pilot studies) I noted that users often pause from diagramming activity for up to several minutes, while engaged in browsing or in reviewing the current state of their diagrams. These pauses often occur later in the sessions, after users have explored the database and have generated a partial diagram, at points when (based on their comments) users are unsure of what to do next -- points at which coaching could be helpful to them. Under a time-based criterion, reminder prompts could sound during such pauses, whereas they would never sound during such "idle" times under an event-based criterion. I set the length of time between prompts to be 3 min, approximately twice the average inter-coaching time interval in Condition IC of Experiment 1 (91.95 s).
Method
Experiment 2 used the same method as Experiment 1, with the following
modifications:
Participants
Participants were 46 undergraduate students from the University of Pittsburgh
(32 males and 14 females) and were also from the Introductory Psychology
research participation pool. All participants except one were fluent in
English, and all had normal or corrected vision.
Design
Participants were block-randomly assigned to either Condition IC (intrusive
coaching) or Condition DC (on-demand coaching). Condition IC was identical to
that of Experiment 1. In Condition DC, Belvedere's automated Coach was
available on demand but it never intervened on its own; however, its light
bulb icon would blink when it had important advice to deliver. The Belvedere
diagramming interface was otherwise identical for both conditions.
Apparatus
I used a digital watch with a repeating 3-minute countdown timer to issue the
beeping reminder prompts to participants in Condition DC.
Procedure
Sufficiently in advance of each scheduled session, I ensured that the Coach
LISP process corresponding to the participant's assigned condition was running
on my workstation. My verbal instructions to participants in Condition IC
were the same as in Experiment 1. Instructions for Condition DC were the same
as those for IC with the following exception: After telling the participant
about the online Coach, I told her that the Coach is available only on
demand by clicking the light bulb icon. I then told her that periodically she
would hear some beeping sounds coming from my desk, that these beeps were
simply to remind her that the Coach was available whenever she wanted it
(i.e., she was not compelled to ask for coaching when she heard them), and
that she should simply acknowledge hearing them. In both conditions, I then
explained the appearance of the Coach's feedback as in Experiment 1 (see Appendix D for my run script).
In addition, whereas I intervened only to help the Condition IC participants with hardware or software problems (as in Experiment 1), in Condition DC I also intervened as needed to ask whether the participants heard the reminder beeps (i.e., if they did not acknowledge hearing them on their own).
Also, whereas in Experiment 1 the end-of-session survey included attitude rating items about the Coach only for participants in Condition IC, in this experiment participants in both conditions received identical surveys with the complete ratings battery.
Results and Discussion
Data Limitations
Data omissions. Data from one participant were omitted because (a) by
mistake the feedback he received during his session was from an older,
sufficiently different version of the on-demand coach (without the delay
factor on rule activations); and (b) his coaching log file was accidentally
overwritten. Data from another participant were omitted because the Coach
software failed during his session (after 25 of 43 diagramming actions). The
participant, who was in Condition IC, created an unusual And-link construction
in his diagram that the Coach was not equipped to parse, causing the Coach
LISP process to abort even after multiple restart attempts. Therefore, the
participant received no coaching feedback on any of his final 18 diagramming
actions. Data from a third participant were omitted for multiple reasons: (a)
At the end of his diagramming session he admitted to knowing much more about
the TWA 800 crash than was available in the online database, from having
watched CNN reports and even a Discovery Channel special about the crash; (b)
his prior knowledge strongly influenced his problem solving during the
session, as indicated not only by his survey responses but also by the fact
that his diagram and his browsing history considered only the most likely
cause named in the NTSB's final report (see Footnote
6); (c) a screen display problem prevented me from conducting the third
stage of his verbal summary; and (d) it is likely he did not meet the
screening criterion of English fluency.[15] After these three omissions, data
from 43 participants (22 in Condition IC and 21 in Condition DC) remained.
Covariate data. I was granted access to official SAT scores of all participants; however, three students had neither SAT nor ACT scores on record. Of these three students, two reported that they did not remember their SAT scores and the third reported imprecise estimates. There were three other students who had composite ACT scores in lieu of SAT subscores; I determined their equivalent SAT total scores as in Experiment 1. Therefore, for the 43 viable participants I was able to record only 37 SAT Math and Verbal subscores and 40 SAT total scores. One student denied me access to his GPA, leaving me with 42 accessible GPA figures. All 43 viable participants had usable NFC scale data.
Attitude ratings. There were two students in Condition DC who never asked for any coaching. During debriefing both students reported having given a neutral (5) rating to each of the six Coach-related statements on the end-of-session survey, for lack of a better option (e.g., a not applicable response). To best reflect actual student attitudes toward the Coach, I omitted the Coach-related statement ratings of these two students from the attitude ratings analyses below.
Downtimes due to software failures. For reasons unknown, software crashes on the student's computer were much more prevalent in Experiment 2 than in Experiment 1, affecting 13 of the 43 viable sessions. Three of the affected sessions were plagued by multiple crashes. In all but one case (an Internet Explorer crash when I tried to launch it for the survey), crashes occurred during the diagramming phases of the sessions and involved Belvedere, Netscape, or both. Many of the software failures required me to reboot the student's computer. Fortunately, in all cases I was able to restore both the student's most recently browsed web page in Netscape and the student's Belvedere diagram in its entirety (or, in the worst case, the diagram state immediately preceding the software crash). Therefore, as detailed below, the impact of these failures seems to have been limited to inflated session times as in Experiment 1. Total downtimes (for one or more crashes) for the 13 affected sessions ranged from 47 s to 13.15 min, with a mean downtime of 4.40 min (SD = 3.47, Mdn = 3.90). Software crashes occurred almost equally often in both conditions (during seven IC and six DC sessions). Mean downtime did not differ significantly between IC sessions (M = 4.48, SD = 2.72) and DC sessions (M = 4.30, SD = 4.48).
Statistical Notes
As in Experiment 1, an alpha level of .05 was used for all statistical tests,
and any degrees of freedom expressed as a decimal number reflect an
adjustment for unequal variances using Satterthwaite's approximation (as shown
in Snedecor & Cochran, 1980).
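For readers who wish to check the fractional degrees of freedom, the following is a minimal sketch of an unequal-variance t test computed from summary statistics, with Satterthwaite's approximation for the degrees of freedom. The function name is illustrative, and the example values are the substantive-message counts reported under Amount and Frequency of Coaching (22 IC and 21 DC students).

```python
import math

def welch_from_summary(m1, s1, n1, m2, s2, n2):
    """Return (t, df) for an unequal-variance t test from means, SDs, and ns."""
    v1 = s1 ** 2 / n1  # squared standard error of the first mean
    v2 = s2 ** 2 / n2  # squared standard error of the second mean
    t = (m1 - m2) / math.sqrt(v1 + v2)
    # Satterthwaite's approximation to the degrees of freedom.
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

# IC: M = 29.73, SD = 14.31, n = 22; DC: M = 4.57, SD = 3.88, n = 21.
t, df = welch_from_summary(29.73, 14.31, 22, 4.57, 3.88, 21)
print(f"t({df:.1f}) = {t:.2f}")
```

The result should land very close to the reported t(24.2) = 7.94; any small discrepancy in the second decimal place reflects rounding of the published means and standard deviations.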
Covariates: Descriptive Statistics
NFC scale. As in Experiment 1, each participant's NFC item ratings were
averaged into a single composite measure after reverse scoring of the nine
applicable items (Cacioppo et al., 1984). The 43 NFC composite scores ranged
from 4.11 to 8.17 with a mean of 6.48 (SD = 0.93, Mdn = 6.56),
indicating a slightly higher average need for cognition than the students in
Experiment 1. The mean NFC score was slightly but not significantly higher
for IC students (M = 6.53) than for DC students (M = 6.43).
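The composite scoring can be sketched as follows, assuming a 1-to-9 rating scale (consistent with the attitude ratings, where a 9 reverse-scored to 1). The function name and the item positions in the example are hypothetical; the actual reverse-scored items are those specified by Cacioppo et al. (1984).

```python
from statistics import mean

def composite_score(ratings, reverse_items, scale_min=1, scale_max=9):
    """Average item ratings after reverse scoring the items at the given positions."""
    adjusted = [
        # Reverse scoring maps scale_max to scale_min and vice versa.
        (scale_min + scale_max - r) if i in reverse_items else r
        for i, r in enumerate(ratings)
    ]
    return mean(adjusted)

# Hypothetical three-item example; position 1 (zero-based) is reverse-scored.
score = composite_score([9, 1, 5], reverse_items={1})
```

In this toy example the second rating of 1 becomes 9, so the composite is the mean of 9, 9, and 5.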
GPA. Student GPAs (at the end of the term during which the experiment was conducted) ranged from 1.29 to 3.63, with a mean of 2.77 (SD = 0.54, Mdn = 2.86), somewhat lower than in Experiment 1. There was no significant difference between mean GPAs of IC (2.73) and DC (2.82) students.
SAT scores. As in Experiment 1, for each student I recorded both most recent and highest SAT subscores, if different. Math subscores ranged from 400 to 800 with a mean of 586 (SD = 94, Mdn = 590), and Verbal subscores ranged from 350 to 770 with a mean of 582 (SD = 81, Mdn = 600). The mean total SAT score including the three equated ACT scores was 1163 (n = 40), 5 points less than the mean sum of subscores. The means of the students' highest Math and Verbal subscores were 595 and 594, respectively. Means for IC students were slightly but nonsignificantly higher than those for DC students on all SAT measures (see Table 4). These scores were somewhat higher on average than those in Experiment 1.
Table 4
Mean SAT Scores by Condition in Experiment 2
Median splits. I again performed median splits using each of the three covariate measures, for use in two-way ANOVAs on my dependent measures. Any significant interactions revealed by these analyses are reported throughout. Note that respective median-split sample sizes for low and high NFC were 21 and 22, but for low and high GPA and SAT totals they were 22 and 21. F-ratios for these analyses were approximate due to unequal sample sizes, so again, any interactions significant at or near the .05 level should be interpreted with caution.
Session Durations
Approximate total session duration (rounded to the nearest 5-minute increment
and including all crash downtimes) ranged from 45 to 120 min, with a mean of
80.00 min (SD = 20.41, Mdn = 80). Subtraction of all downtimes
reduced the overall mean session duration to 78.67 min. The IC sessions
lasted significantly longer (M = 85.39, SD = 21.67) than did DC
sessions (M = 71.63, SD = 15.91), t(41) = 2.36, p
< .05. Diagramming durations (discounting the downtimes), using the same
definition as in Experiment 1, ranged from 19.93 to 92.02 min, with a mean of
49.49 min (SD = 18.58, Mdn = 46.45). Diagram sessions of IC
students were significantly longer in duration (M = 56.14, SD =
20.23) than were those of DC students (M = 42.53, SD = 14.00),
t(41) = 2.55, p < .05. Having ruled out crash downtimes as a
possible cause for these duration differences, I investigated the relative
frequencies of coaching in the two conditions.
Amount and Frequency of Coaching
The 43 viable students received a total of 812 messages from the Coach, 678 in
Condition IC and 134 in Condition DC. Of the 678 IC messages, 623 (92%) were
presented intrusively and 55 (8%) were presented upon request, in the same
relative proportions as in Experiment 1. Of the 55 requested IC messages, 24
(44%) were null advice messages and 31 were substantive (all 623 intrusive
messages were substantive). Of the 134 DC messages, all of which were
requested, 96 (72%) were substantive and 38 (28%) were null advice messages.
The total number of substantive coaching messages displayed to each IC student ranged from 10 to 68, with a mean of 29.73 (SD = 14.31, Mdn = 30.5). Each DC student received 0 to 15 substantive messages, with a mean of 4.57 (SD = 3.88, Mdn = 4). The difference in means was significant, t(24.2) = 7.94, p < .0001. Therefore, not surprisingly, students requested coaching much less frequently than it was presented in the intrusive condition. Each IC student received 0 to 10 null coaching messages (upon request) with a mean of 1.09 per student (SD = 2.39, Mdn = 0), while each DC student received 0 to 6 null messages with a mean of 1.81 (SD = 1.83, Mdn = 1). These means did not differ significantly, t(39.2) = 1.11, p = .27.
Completeness of Final Diagrams
Total element count. As in Experiment 1 I computed some gross, overall
measures of diagram completeness. The total number of boxes in each final
diagram ranged from 10 to 33 with a mean of 17.19 (SD = 5.63,
Mdn = 16). Total number of links ranged from 7 to 45 with a mean of
21.21 (SD = 10.59, Mdn = 19). Although IC diagrams tended to
have slightly fewer boxes than DC diagrams (Ms = 17.09 and 17.29,
respectively; t < 1), they tended to have more total links than DC
diagrams (Ms = 23.09 and 19.24, respectively; t(41) = 1.20,
p = .24), consistent with my expectations.
Box types. The number of Hypothesis boxes per final diagram ranged from 2 to 10, with a mean of 4.47 and median of 4 (SD = 1.84). There was no significant difference between IC and DC means (4.27 and 4.67, respectively). The number of Data boxes per final diagram ranged from 6 to 23, with a mean of 11.79 (SD = 4.34, Mdn = 11). Respective means of IC and DC students (12.09 and 11.48) did not differ significantly. The number of Unspecified boxes per final diagram ranged from 0 to 7, with a mean of 0.93 (SD = 1.52, Mdn = 0). Although IC diagrams tended to have fewer Unspecified boxes than did DC diagrams (Ms = 0.73 and 1.14, respectively), consistent with my expectations, the difference was not significant (t < 1). The total number of unlinked boxes per final diagram, regardless of box type, ranged from 0 to 3, with a mean of 0.28 (SD = 0.70, Mdn = 0). Respective means of IC and DC students (0.27 and 0.29) did not differ significantly. Of the 12 total boxes left unlinked by the 43 viable students, 3 were Unspecified boxes, 3 were unique Data boxes, 4 were unique Hypothesis boxes, and the other 2 were duplicate boxes (1 Data and 1 Hypothesis) entered by one student in a large diagram (scrolling was required to see both copies of each box). I revisit the issue of unlinked boxes in my discussion of diagram errors below.
Link types. The number of For links per final diagram ranged from 1 to 30, with a mean of 11.81 (SD = 6.97, Mdn = 12). There was no significant difference between IC and DC means (13.23 and 10.33, respectively; t(41) = 1.38, p = .18), but the trend was in line with my predictions. The number of Against links per final diagram ranged from 1 to 22, with a mean of 8.16 (SD = 4.57) and a median of 7. There was no significant difference between IC and DC means (8.46 and 7.86, respectively), but this trend was predicted as well. The number of And links per final diagram ranged from 0 to 7, with a mean of 1.23 (SD = 1.67, Mdn = 1). Unlike in Experiment 1 there was no significant difference between groups, and the IC mean (1.41) was slightly higher than the DC mean (1.05). However, feedback on the coaching rule that targets And links (conjunct-for-hypothesis?) was presented to users in both conditions (10 in IC and 9 in DC), so the nonsignificant difference is not surprising.
Correlations with covariates. The only significant two-tailed correlation between my gross diagram completeness measures and my covariate measures was between number of And links and SAT Math subscore (r(35) = .36, p < .05). SAT total scores, with the higher sample size (n = 40), showed only marginal positive correlations with number of And links (.28, p = .07) and number of Data boxes (.26, p = .10). An ANCOVA on number of And links showed an only marginal effect of SAT totals (t = 1.78, p = .08), with no significant difference between adjusted means (1.37 for IC and 1.09 for DC). An ANOVA showed significant interactions of coaching condition and median-split GPA on number of Data boxes (F(1, 39) = 4.91, p = .033) and on number of Against links (F(1, 39) = 5.20, p = .028). DC students with low GPAs tended to have fewer Data boxes (M = 11.11) than those with high GPAs (M = 11.75), whereas IC students with low GPAs had more Data boxes (M = 13.92) than those with high GPAs (M = 9.44). Similarly, DC students with low GPAs had fewer Against links (M = 6.56) than those with high GPAs (M = 8.83), whereas IC students with low GPAs had more Against links (M = 9.85) than those with high GPAs (M = 6.44). There was also a significant interaction of coaching condition and median-split SAT total on the number of And links, F(1, 39) = 8.76, p = .005. IC students with high SATs had more And links (M = 2.27) than those with low SATs (M = 0.55), but DC students with high SATs had fewer And links (M = 0.50) than those with low SATs (M = 1.55).
Expert Diagram Comparisons
I compared student diagrams to the same expert diagram used for Experiment 1
(refer back to Figure 3). Once again, some of the
students who included more than the four key hypotheses in their diagrams
included the "Sabotage" and "Accident" group headings from the hypothesis
index as well (see Figure A6 in Appendix A). As
in Experiment 1, some of the other extraneous hypotheses included by students
were statements by witnesses or investigators, and some students entered some
specific conjectures of their own as well.
Expert diagram Data statements that did not appear in most student diagrams
included the regularly scheduled flight to Paris (n = 1), the flight
bound from JFK airport to France (n = 3), the 1997 NTSB statement about
remaining possible causes (n = 2), the service history of the plane
(n = 7), and the bomb-related statements about altitude (n = 6).
The other two low-frequency Data statements from Experiment 1 were included
with greater frequency in Experiment 2: the crash in Colombia due to pilot
error (n = 11) and the statement regarding the split-second noise on
the flight recorder (n = 11).
Errors in Final Diagrams
As in Experiment 1 I counted instances of uncorrected diagramming errors in
final student diagrams, disregarding any extraneous hypothesis boxes not
present in the expert diagram. I coded final diagrams for the following
errors: (a) the number of missing hypotheses, (b) the number of hypotheses
subject to confirmation bias (no Against links), (c) the number of unsupported
hypotheses (no For links), (d) the number of unique hypotheses without any
links at all, and (e) the number of unique data without any links. Error
counts per condition are shown in Table 5 along with
total errors per condition, with student subsample sizes shown in
parentheses.
Table 5
Final Diagram Error Counts by Condition in Experiment 2
As in Experiment 1, the most popular omitted hypothesis was HE (n = 9). Nine other students included it but left it unsupported; however, eight of these nine students were in Condition DC. The next most popular hypothesis to leave unsupported was MF (n = 6, 4 in DC and 2 in IC). These patterns help to explain the significant difference between groups on proportion of unsupported hypotheses. The most popular hypotheses subjected to confirmation bias were M and B (respective ns = 7 and 5), with only one student showing the bias on HE. No students showed confirmation bias on MF.
Diagram quality. Using the same definition of diagram quality as in Experiment 1, I found proportionally more inadequate final diagrams (27) than adequate ones (16) among the 43 viable students. The 22 IC student diagrams were evenly divided (11 adequate and 11 inadequate), but in the DC condition inadequate diagrams far outnumbered adequate ones (16 vs. 5, respectively). A chi-square analysis showed this apparent non-homogeneity between conditions to be marginally significant, χ²(1, N = 43) = 3.15, p = .08, suggesting a possible advantage of more frequent coaching for overall diagram quality.
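The chi-square statistic for this 2 x 2 table can be reproduced with the short sketch below. It assumes a Pearson chi-square without a continuity correction, which is an inference from the fact that the uncorrected formula matches the reported value; the function name is illustrative.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) and p value for a 2 x 2 table.

    The table is laid out as [[a, b], [c, d]].
    """
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # With 1 degree of freedom, the upper-tail probability equals erfc(sqrt(chi2 / 2)).
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p

# Rows: IC, DC; columns: adequate, inadequate.
chi2, p = chi_square_2x2(11, 11, 5, 16)
print(f"chi-square(1, N = 43) = {chi2:.2f}, p = {p:.2f}")
```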
Distinct Coaching Effects on Diagramming
I once again examined student reactions (or lack thereof) to the coaching rule
that elicits the most distinct diagramming response,
attend-to-discrepant-evidence (refer back to Figure
4). Recall that coaching on this rule advises users to modify the default
belief strengths of selected diagram constructs and to activate a display
filter that is hidden within one of Belvedere's pull-down menus.
Of the 43 viable students, 30 received coaching on this rule at least once (20
in IC and 10 in DC). Of these 30 students, 16 changed the default belief
level of at least one of their boxes after delivery of the advice (10 in IC
and 6 in DC), but only 2 (both in IC) turned on the display filter as well.[16] The number of boxes modified by IC
students ranged from 1 to 16, while the number modified by DC students ranged
from 1 to 8, probably because the rule was triggered more frequently in
Condition IC (up to 10 times per session) than in Condition DC (only once per
session).
Coaching Effects on Web-Browsing
Of the 43 viable participants, only 14 (7 in IC and 7 in DC) browsed all 38
pages of the TWA database at least once. The other 29 students (15 in IC and
14 in DC) skipped from 1 to 10 pages each, with a mean of 3.38 (SD =
2.47, Mdn = 3). In this experiment, respective means of the more
coached (IC) and less coached (DC) groups did not differ significantly in
either the reduced sample of page-skippers (3.53 vs. 3.21) or the complete
sample (2.41 vs. 2.14), ts < 1.
Of the 38 total pages in the database, 27 were skipped by at least one student. The most commonly skipped page (n = 12) was the same as in Experiment 1, the government meeting mentioned almost in passing at the bottom of the hypothesis index (see Figure A6 in Appendix A). Three IC students skipped the HE hypothesis page and one DC student skipped the M hypothesis page (and thus their respective data indexes as well). Among all students who browsed each of the four hypothesis pages (19 in IC and 20 in DC), more skipped the indexes of evidence against them than skipped the indexes of evidence for them, for all but the MF hypothesis (respective ns were 8 vs. 1 for B, 7 vs. 2 for M, 0 vs. 0 for MF, and 3 vs. 0 for HE). As in Experiment 1, this browsing pattern helps to explain the tendency toward confirmation bias noted in the diagram errors, especially for the B and M hypotheses. However, this tendency appeared to be more equally distributed in browsing than in diagramming; of the 17 viable students who skipped an evidence index against at least one hypothesis, 9 were IC students and 8 were DC students.
Timing of Coaching Requests
Coaching reminder prompts. During the 21 viable DC sessions, the number
of reminder prompts (discounting those that sounded during crash downtimes)
ranged from 6 to 26 with a mean of 12.76 prompts per student (SD =
4.63, Mdn = 13). The number of prompts that were followed by a student
request for coaching (within 1 min of the prompt) ranged from 0 to 11 with a
mean of 3.33 prompts per student (SD = 2.96, Mdn = 3). In other
words, within a minute after a reminder prompt, the average student requested
advice 26% of the time. Only three DC students never followed up on a
reminder prompt; two never requested coaching at all, and the other requested
coaching only once, midway between two prompts. Prompted coaching requests
resulted in null advice 0 to 4 times per student, with a mean of 0.90
(SD = 1.04), a mode of 0 (n = 9), and a median of 1 (n =
7). Advice requests following reminder prompts accounted for 70 of the 134
total (52%) DC advice requests. Half of the null advice messages received by
DC students (19 of 38) followed a reminder prompt.
Not surprisingly, the number of prompted requests for advice was positively correlated with the number of reminder prompts, r(19) = .44, p < .05 (two-tailed). However, I also noted an even stronger positive correlation between the number of prompted requests and the number of null advice results (r(19) = .77, p < .00005). Although the frequency of null results would naturally be lower for students who requested advice less often, I wondered whether the occurrence of null results would make students less willing to follow up on the reminder prompts. I regressed the number of prompted requests simultaneously on both the number of reminder prompts and the number of null advice results. The overall test was significant (F(2, 18) = 18.33, p < .0001, R² = .67). Number of null results had a significant effect (B = 2.01, t = 5.12, p = .0001) and number of prompts had a nearly significant effect (B = 0.18, t = 2.01, p = .06). However, I have insufficient basis to claim that null results caused DC students to disregard their reminder prompts. In fact, the student who received the most null results (4) also made the most prompted coaching requests (11), both in absolute terms and proportionally to his number of reminder prompts (i.e., he followed up on 79% of 14 prompts).
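As a rough consistency check on the regression summary, the overall F statistic can be recovered from the proportion of variance explained and the degrees of freedom. This is a minimal sketch using the standard identity F = (R²/k) / ((1 - R²)/(n - k - 1)); the function name is hypothetical, and the input of .67 is the rounded value given in the text.

```python
def f_from_r_squared(r_squared, n_predictors, n_cases):
    """Overall regression F computed from R-squared, predictor count, and sample size."""
    df_model = n_predictors
    df_error = n_cases - n_predictors - 1
    f = (r_squared / df_model) / ((1 - r_squared) / df_error)
    return f, df_model, df_error

# Two predictors (reminder prompts, null results) and 21 viable DC sessions.
f, df1, df2 = f_from_r_squared(0.67, 2, 21)
print(f"F({df1}, {df2}) = {f:.2f}")
```

The result, about 18.27, is close to the reported 18.33; a gap of this size is what one would expect from the explained-variance figure having been rounded to two decimals.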
Light bulb blinks. As mentioned in my Introduction, a minimally intrusive coaching feature had been added to Belvedere (beginning with version 2.0), by which the light bulb icon in the palette would slowly flash on and off (four times) when the Coach had some particularly important new advice to offer the user. Even though most pilot users never noticed the flashing, I decided to leave it enabled for the DC condition[17] in my experiment for its potential as an additional prompt to seek advice. Each such series of four flashes is hereafter counted as a single blink. The number of times such blinks occurred during each of the 21 viable DC sessions ranged from 0 to 25, with a mean of 8.95 blinks per session (SD = 6.51, Mdn = 9). During two sessions, the bulb never flashed because none of the crucial advice rules (see Appendix B) ever applied. For each session, the number of blinks was uncorrelated with the number of coaching requests, r(19) = -.06, p = .79. Over all DC sessions there were 188 blinks, but only 27 of them (14%) were followed immediately by a coaching request. Of the 19 students for whom the bulb ever blinked at all, there were 12 students who requested coaching immediately afterwards at least once. However, most of them (n = 8) responded to the blinks only once or twice. For their 12 sessions there were 11.50 mean blinks per session (SD = 3.83, Mdn = 10.5), but only 1 to 5 immediate coaching requests, with a mean of 2.25 requests per session (SD = 1.14, Mdn = 2).
One question to ask of these data is how many of the students actually noticed the blinking. Unfortunately, during session debriefings I was not as systematic as I should have been in asking whether students ever noticed the bulb blinking. As a result, I lack answers to this question for 7 of the 19 viable DC students for whom the bulb blinked at least once. However, of the other 12 students, 7[18] reported having noticed them and 5 reported not noticing them. Of the 7 students for whom I did not record a response, only 2 never requested coaching immediately after a bulb blink, so it is possible that the other 5 did notice them. It is also possible that their advice requests were coincidental to the blinks, or that they followed reminder prompts instead of blinks. However, because the clock used to time-stamp the bulb blinks in the log was not perfectly synchronized to the watch used to deliver the reminder prompts, and because the bulb blinks often began after a delay similar to that of the advice presentations, any attempt to synchronize the reminder prompts with the bulb blinks would be difficult at best.
Attitude Ratings
Ratings of Belvedere.
On the end-of-session survey (see Appendix E),
respective mean ratings of the six Belvedere statements B1 through B6 (after
reverse-scoring of items B2 and B5) were 5.81, 6.95, 6.93, 6.98, 6.09, and
6.23, each with a median rating of either 6 or 7 on the nine-point scale. The
mean composite rating was 6.50 (SD = 1.22, Mdn = 6.67), somewhat
less favorable than in Experiment 1. Once again I predicted that less
frequently coached (DC) students would report more positive attitudes toward
Belvedere than would IC students. I did find a marginally significant
difference in mean composite ratings between the two conditions, with the DC
students (M = 6.85, SD = 0.94) giving more favorable ratings on
average than IC students (M = 6.17, SD = 1.37), t(41) =
1.90, p = .065. Group mean ratings for the individual statements are
shown in Table 6. DC students gave a significantly more
favorable rating than IC students (after reverse-scoring) to statement B5
("It would have been easier for me to work on the assigned problem without
using Belvedere"), t(34.6) = 3.84, p = .0005. The
DC-favored difference in means for statement B4 ("Belvedere would be
helpful in collecting and organizing information for a paper or report")
was marginal, t(41) = 1.67, p = .10. All other differences were
also in the predicted direction, but none of them approached statistical
significance.
Table 6
Belvedere Attitude Ratings by Condition in Experiment 2
Ratings of the Coach. As noted in the Data Limitations section above, the neutral ratings of the two uncoached DC students were omitted from the following analyses. Table 7 shows the mean Coach-related ratings by condition for the remaining 41 students. Respective overall means for the six statements were 4.71, 4.90, 3.90, 6.61, 5.42, and 5.24. Median ratings ranged from 3 to 7, with an overall mean composite rating of 5.13 (SD = 1.82, Mdn = 5.33). The only significant difference in ratings between groups was for statement C5 ("The Belvedere system would be better off without the Coach"), to which the DC students gave a more favorable rating after reverse scoring (i.e., DC students disagreed and IC students agreed), t(39) = 2.53, p < .05. However, most other means fit the predicted pattern. Note from Table 7 that the mean composites of IC and DC students were on opposite sides of the neutral mark. In fact, the only statement to receive a favorable mean rating from IC students was C4 ("The feedback I received from the Coach was easy to understand"), which was also the only statement to receive an appreciably (but nonsignificantly) higher mean rating from IC than from DC students. Statement C3 ("Often I found the feedback from the Coach to be repetitive") received a very slightly more favorable rating from IC than DC students after reverse scoring, and it was also the only statement to receive an unfavorable mean rating from DC students. That is, while mean DC ratings were favorable in most cases, students in both conditions slightly agreed on average that the Coach was repetitive. However, neither group agreed to this statement as strongly as did the coached students in Experiment 1 (refer back to Table 3).
Table 7
Coach Attitude Ratings by Condition in Experiment 2
Ratings correlations. One-tailed rejection criteria were used for all analyses reported in this paragraph. As predicted, the students' composite Belvedere ratings were positively correlated with their composite Coach ratings, r(39) = .39, p = .005. Interestingly, the correlation was much higher for DC students (r(17) = .63, p < .005) than for IC students, among whom the correlation did not even reach statistical significance (r(20) = .24, p = .15). Among the IC students (n = 22), composite Coach rating was significantly correlated only with Belvedere item B6 (r = .44, p = .02), and composite Belvedere rating only with Coach-related items C1 and C2, respective rs = .38 and .41, ps < .05. Among the DC students (n = 19 after the uncoached omissions), composite Coach rating was significantly correlated at the .05 level or better with all Belvedere item ratings except for B4 (r = .24, p = .16) and B3 ("Belvedere helped me keep track of the various pieces of information relevant to the problem"; r = .32, p = .09), and composite Belvedere rating with all Coach-related items except C3 (r = .27, p = .14), C4 (r = .34, p = .08), and C5 (r = .37, p = .06).
Verbal Summaries
Having found mainly nonsignificant trends in my primary dependent measures, as
in Experiment 1, I again turned to my secondary data source, the verbal
end-of-session summaries. As before, I focused on the third phase of the verbal
summaries (additions to summaries following diagram redisplays), coding
the third phase of each verbal summary for the number of statements uttered
and for the number of relations either stated or strongly implied. Students
added 0 to 5 statements to their summaries, with a mean of 1.14 statements
(SD = 1.57, Mdn = 0). As predicted, DC students tended to add
more statements (M = 1.29, SD = 1.65) than did IC students
(M = 1.00, SD = 1.51); however, this difference was not
significant (t < 1). Students mentioned 0 to 6 relations in their
additions to their summaries, with a mean of 0.95 relations (SD = 1.56,
Mdn = 0). DC students tended to add more relations (M = 1.19,
SD = 1.83) than did IC students (M = 0.73, SD = 1.24),
but this difference also was not significant (t < 1). As in Experiment
1, both addition measures were highly intercorrelated, r(41) = .86,
p < .0001.
Number of added statements had a significant negative correlation with QPA (r = -.39, p < .01) and with NFC (r = -.31, p < .05), and NFC also had a marginal negative correlation with the number of relations added (r = -.27, p = .08). An ANCOVA using all three covariates showed a significant overall covariate effect on the number of statements added, F(3, 38) = 3.96, p < .05. Although the adjusted means showed inflated differences in the predicted direction, there remained no significant difference between group means, which were 0.97 statements for IC students and 1.31 statements for DC students (F < 1). ANCOVAs on the number of added relations did not show a significant effect of covariates using any regression model.
As in Experiment 1 there were tendencies for students with adequate diagrams
to include more statements (M = 1.38, SD = 1.71) and relations
(M = 1.06, SD = 1.57) than students with inadequate diagrams
(respective Ms = 1.00 and 0.89, SDs = 1.49 and 1.58), although
neither difference in means was significant (ts < 1). However, a
two-way ANOVA showed a significant interaction between diagram quality and
coaching condition on the number of statements added, F(1, 39) = 9.88,
p < .005. The IC students added more statements if they had adequate
diagrams (M = 1.73) than if they had inadequate diagrams (M =
0.27), but DC students showed the opposite pattern, adding more statements if
they had inadequate diagrams (M = 1.50) than if they had
adequate diagrams (M = 0.27).
Lag Times Following Advice Delivery
As in Experiment 1, for each viable session I measured the lag time in seconds
between delivery of each advice message and the action immediately following
it. I again omitted any lag times for which the advice-triggering action
(i.e., a diagram action for intrusive messages, or a coaching request for
on-demand messages) was the user's final action; such lag times (n = 7)
ranged from 942 to 1426 s. I omitted 14 additional lag-time outliers ranging
from 106 to 238 s; these lag times followed documented cases of crash
downtimes, user scrolling of either the Netscape or Belvedere window (neither
of which is a logged action), verbal interactions between the user and myself,
slow typing or updating of text in a box dialog (as discussed at the end of
Experiment 1), or user inactivity. The remaining 791 lag times ranged from 1
to 98 s, with a mean of 24.23 s (SD = 20.25) and a median of 17 s. For
the 664 lag times in Condition IC, the mean was 22.81 s (SD = 19.63)
and the median was 16 s. As in Experiment 1, the distribution of IC lags was
positively skewed (coefficient of 1.43), with a kurtosis measure of 1.54 and a
slight hint of bimodality. The bimodality coefficient (computed from the same
formula used in Experiment 1) was 0.669, even higher than the coefficient for
the IC lags in Experiment 1.
The 127 lag times for Condition DC, on the other hand, had a mean of 31.64 s
(SD = 21.85) and a median of 28 s, and the lag distribution exhibited
much less skewness (coefficient of 0.86) and kurtosis (.008) with fewer signs
of bimodality. The bimodality coefficient for the DC lags was only 0.565, not
much higher than the criterion value representing a uniform distribution (SAS
Institute, 1999).
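The two bimodality coefficients can be recomputed from the reported skewness, kurtosis, and sample sizes. The sketch below assumes the SAS-style formula b = (skewness² + 1) / (kurtosis + 3(n - 1)² / ((n - 2)(n - 3))), with kurtosis expressed as excess kurtosis and 5/9 (about 0.555) as the value for a uniform distribution. The formula itself is not restated in this section, so it is an assumption here, although it does reproduce both reported coefficients.

```python
def bimodality_coefficient(skewness, excess_kurtosis, n):
    """Sample bimodality coefficient (assumed SAS-style formula)."""
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return (skewness ** 2 + 1) / (excess_kurtosis + correction)

UNIFORM_CRITERION = 5 / 9  # assumed benchmark for a uniform distribution

ic = bimodality_coefficient(1.43, 1.54, 664)   # Condition IC lag times
dc = bimodality_coefficient(0.86, 0.008, 127)  # Condition DC lag times
print(f"IC: {ic:.3f}  DC: {dc:.3f}  uniform: {UNIFORM_CRITERION:.3f}")
```

Both values exceed the uniform benchmark, but the DC coefficient does so only narrowly, which is consistent with the weaker signs of bimodality described above.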
The difference in mean lag times between groups was highly significant,
t(789) = 4.56, p < .0001. Bartlett's test gave an almost
significant result for inequality of group variances (F(126, 663) =
1.24, p = .0515); however, the difference in means was significant even
assuming unequal variances (t(167.1) = 4.24, p = .0001).
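Both versions of the group comparison can be reconstructed from the summary statistics given above. The following is a minimal sketch of the pooled-variance test and the unequal-variance (Welch) test with Satterthwaite degrees of freedom; the function names are illustrative only.

```python
import math

def pooled_t(m1, s1, n1, m2, s2, n2):
    """Independent-samples t assuming equal variances; returns (t, df)."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / df
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (m1 - m2) / se, df

def welch_t(m1, s1, n1, m2, s2, n2):
    """Independent-samples t without the equal-variance assumption; returns (t, df)."""
    v1, v2 = s1 ** 2 / n1, s2 ** 2 / n2
    t = (m1 - m2) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

# (mean, SD, n) for lag times in seconds, as reported for each condition.
dc = (31.64, 21.85, 127)
ic = (22.81, 19.63, 664)

t_pooled, df_pooled = pooled_t(*dc, *ic)
t_welch, df_welch = welch_t(*dc, *ic)
print(f"pooled: t({df_pooled}) = {t_pooled:.2f}")
print(f"Welch:  t({df_welch:.1f}) = {t_welch:.2f}")
```

With these inputs the pooled test gives t(789) of about 4.56 and the Welch test gives t of about 4.24 on roughly 167 degrees of freedom, in line with the values reported above.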
Given the higher frequency of null advice messages in comparison to Experiment 1 (mostly in Condition DC), I then restricted analyses to substantive advice messages only, for which group mean lags were 22.35 s (SD = 19.01, N = 642) for IC students and 34.71 s (SD = 22.25, N = 93) for DC students. These means also differed significantly, t(112.3) = 5.09, p < .0001. Among the DC students, respective mean lag times for null and substantive messages were 23.24 s (SD = 18.54, N = 34) and 34.71 s (SD = 22.25, N = 93). The difference in means was significant, t(125) = 2.68, p < .01, confirming my suspicion that the null advice messages would have shorter lag times. Indeed, the mean lag time for null messages was only slightly longer than the mean IC lag time for substantive messages.
In order to investigate possible differences in lag time between intrusive and requested advice messages, I pooled the lag time data from the IC conditions in both experiments, resulting in a set of 1043 intrusive messages and 58 requested messages. Despite the inordinately unequal sample sizes, the mean lag time of 20.96 s for intrusive messages (SD = 17.60) was significantly shorter than the 28.72 s mean for requested messages (SD = 21.40), t(61.4) = 2.71, p < .01. When I also pooled the lag times from the DC condition, all of which followed requests for advice, the mean for requested messages increased to 32.41 s (SD = 22.05, N = 151), which differed even more significantly from the intrusive mean, t(178.7) = 6.11, p < .0001. Thus, lag time data from both experiments suggest that students spend much less time processing intrusive advice than they do requested advice.
Summary
Students in Condition IC received five times the number of advice messages
requested by DC students. However, most performance measures continued to
show only trends in the predicted direction, with fewer unpredicted trends
than in Experiment 1. Predicted trends in attitude ratings were stronger than
in Experiment 1, although ratings correlations were weaker among IC students
in this experiment. Once again, local reactions to a unique advice message
were noted, showing that students in both conditions responded positively to
at least some of the coaching. However, the lag time analyses show that the
time between coaching and subsequent actions is significantly longer in the
on-demand condition than in the intrusive conditions of both experiments.
Collapsed across groups, lag time is significantly shorter for intrusive
messages than for requested messages, despite the huge sample-size advantage
for intrusive messages. Therefore, even though the lag-time measure is
imperfect as described above, it seems to indicate that Belvedere users paid
more attention to requested coaching than to intrusive
coaching.
Ceiling Effects
It is conceivable that the problem solving task used in my experiments was too
easy. Although the TWA crash problem was selected for its accessibility to
students, it may have been too accessible. The proliferation of media
coverage of airplane crashes may have "lowered the bar" on the task of
evaluating and analyzing the evidence and hypotheses in my online database.
In fact, my reindexing of the original TWA database may have made the task
easier still. Although the reindexing allowed me to determine the relative
frequency with which users considered evidence for and against the various
hypotheses, the link structure made the evidential relationships obvious to
the users, providing more inquiry scaffolding than they would otherwise have
had with a database of unconnected hypotheses and data.
Self-Correction of Errors
As discussed in my results of Experiment 1, students who make a conscientious
effort to wade through a problem database with explicit evidential indexing
like the one I used will likely either avoid or self-correct any errors they
might otherwise have left in their final diagrams. The similarity in error
counts between conditions should not be surprising, given that half of the
students in Experiment 1 and a third of those in Experiment 2 browsed the
entire database, while the remaining students browsed the majority of it. It
is possible that, had I not made the evidential relations explicit in the
database, more of the between-group differences in final diagram error counts
might have been significant.
Student Ability Level
It is also possible that the task was too easy for college-aged students in
general, as opposed to younger students. Belvedere was originally designed
with middle-school students in mind, to address curricular deficiencies in the
teaching of scientific inquiry skills. Although one could claim that many
college freshmen (the typical population enrolled in undergraduate courses in
introductory psychology) are deficient in these skills as well, it is likely
that they would perform at a higher level than their younger counterparts, by
virtue of their greater knowledge if not of their age or grade in school
(Means & Voss, 1996). Although some of my covariate measures suggested
differences between groups on some of my performance measures, they had
generally little effect (possibly due to the potential ceiling effect).
The Apparent Advantage of Requested Advice
As noted in my results, students appeared to spend significantly more time
processing requested advice than they did intrusive advice. Although this
conclusion is based solely on analyses of post-coaching lag times, a measure
that suffers from many drawbacks as discussed in Experiment 1, the finding was
replicated with a larger sample size in Experiment 2. Therefore, it seems my
students were more receptive to advice when they asked for it themselves, even
if it was not provided when it became immediately applicable. However, it is
difficult to tease apart the issues of feedback timing and control in my
second experiment. Given the nature of on-demand coaching, in which feedback
delivery is by definition under the control of the user and may not occur at a
time when the feedback may first be helpful, the two aspects of feedback
timing (immediate vs. delayed) and feedback control (user-
vs. system-initiated) are inextricably linked in this study. The question
remains as to whether immediacy or locus of control is the more important
aspect of feedback with Belvedere.
When Was Advice Requested?
One exploratory question to ask of the data from participants in Condition DC
is when (i.e., under what circumstances) they requested advice from the
on-demand Coach. Although my introduction of a reminder prompt was meant to
increase their interactions with the Coach, it also made it difficult to
determine any patterns indicating when users felt they needed advice. Some
earlier speculative notions that student requests for advice might be
impasse-driven (e.g., VanLehn, 1988b) do not seem to apply because working
with Belvedere represents problem solving in the absence of correct answers
(D. Suthers, personal communication, May 23, 1998; A. Lesgold, personal
communications, October 23 & November 16, 1998, May 30, 2000). Not only are
there no true impasses in the sense of becoming blocked on the path toward a
correct answer, but also the ill-defined nature of the problem may make it
difficult for a user to even recognize an impasse in any other sense (e.g., a
knowledge deficit or a procedural stumbling block). Even if users in my
experiments did recognize any deficits in their problem-relevant knowledge,
the web database provided enough scaffolding for them to correct such deficits
without having to rely on coaching.
Of course, there was the possibility that DC students would simply request coaching only when reminded of its availability by the experimenter. Indeed, based on my definition of what constituted a prompted request for advice, more than half of all DC advice requests followed a reminder prompt. However, regardless of the criterion chosen for the timing of reminder prompts, I expected that students probably would not request advice after every reminder, and indeed they did not. Unfortunately, it is difficult to partial out the relative effects of reminder prompts and minimally intrusive light bulb blinks on DC student requests for coaching. That is, it is difficult to determine any emerging patterns from their advice requests beyond these coaching prompts.
When to Present Advice Intrusively?
Although no patterns are immediately apparent from the DC student advice
requests, one thing is certain: They did not request advice nearly as often as
it was provided by my intrusive Coach. The modifications that created this
intrusive Coach reflected an admittedly brute-force approach, adopted to
ensure the delivery of sufficient coaching feedback to examine its effects
within the Belvedere framework. However, given the dearth of positive
performance effects and the apparent negative affective impact of its advice,
it is probably safe to conclude that any future versions of an intrusive Coach
for Belvedere should scale back its frequency of advice presentation. Instead
of providing feedback every time a critical rule applies, as in my
experiments, an enhanced intrusive Coach might delay its presentation of
advice until one of several possible "key points" in an interaction, such as a
period of inactivity on the part of the user. Although such an enhancement
would require sensitivity to information not currently available to the Coach
(e.g., time spent typing into dialog boxes or browsing web pages), adding such
time-based sensitivities might be worth the investment.
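The idea of holding advice until a period of user inactivity can be illustrated with a small sketch. Everything in it is hypothetical: the class, its method names, and the 20-second idle threshold are assumptions made for illustration, and none of it reflects how Belvedere's Coach is actually implemented.

```python
from collections import deque

class DeferredCoach:
    """Hypothetical sketch: queue advice and release it only after user idle time."""

    def __init__(self, idle_threshold_s=20.0):
        self.idle_threshold_s = idle_threshold_s  # assumed value, not from the study
        self.pending = deque()
        self.last_action_time = 0.0

    def on_user_action(self, now):
        """Record any user activity (diagramming, typing, browsing)."""
        self.last_action_time = now

    def queue_advice(self, message):
        """Hold advice triggered by a rule instead of presenting it immediately."""
        if message not in self.pending:
            self.pending.append(message)

    def poll(self, now):
        """Return the oldest pending advice if the user has been idle long enough."""
        idle = now - self.last_action_time
        if self.pending and idle >= self.idle_threshold_s:
            return self.pending.popleft()
        return None

coach = DeferredCoach(idle_threshold_s=20.0)
coach.on_user_action(now=100.0)
coach.queue_advice("Consider evidence against this hypothesis.")
during_activity = coach.poll(now=105.0)  # user acted 5 s ago: advice is held
after_pause = coach.poll(now=125.0)      # user idle for 25 s: advice is released
```

A design along these lines would depend on the Coach receiving timestamps for activities it does not currently track, such as typing in dialog boxes or browsing web pages, which is the added sensitivity described above.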
Limitations of the Coach
In addition to the drawbacks inherent in its domain-general nature, the
argument pattern Coach suffers from several other limitations as well, any
number of which could negatively skew the attitudes of frequently coached
users like my Condition IC participants. Below I discuss some of the
limitations mentioned informally by several of my participants.
"Jumping the Gun"
Despite the introduction of a delay factor to the advice rules, which caused
the Coach to wait until a sufficient number of diagramming actions had passed
before offering applicable advice (see my Introduction section on Intrusive
Coaching), many of my students complained that the Coach would sometimes
advise them to take actions they were about to take anyway. In some cases,
students claimed that the Coach interjected such advice while they were
actually in the process of carrying out the action it recommended. One way to
address this problem would be to adjust the delay factors of the coaching
rules, so that the Coach would wait even longer than it currently does before
presenting advice. However, in order to thoroughly determine which specific
rules to adjust, not to mention how to adjust them, one would have to solicit
reactions from users at every advice delivery about the appropriateness of its
timing. While asking users to verbalize their thoughts when advice is
presented during problem solving would have several advantages, at least from
the standpoint of improving the effectiveness of the Coach, it would also
interfere with the very task the Coach aims to support. Soliciting such
frequent reactions to coaching would run the risk of placing users' comments
outside the context of the inquiry task.
Imposing Order on the Inquiry Process
Another possible drawback to the current Coach is the way it tries to
structure the process of creating an inquiry diagram. The Coach guides
Belvedere users to consider hypotheses and evidence together, such that if
they enter too many boxes of one type (Data or Hypotheses), the Coach will ask
the users to link them to some boxes of the other type. Some of my students
complained that they preferred to collect all of the relevant evidence
surrounding the plane crash before considering any of the possible causes.
Other students preferred to enter both hypotheses and evidence as they
encountered them, but to add relational links only after they had collected
all of their "thoughts" in the diagram. This inflexibility of the Coach,
while helping to ensure that hypotheses are supported and that data are
explained, does not allow for the possibility of multiple solution paths as
indicated by the preferences of these students.
Findings of a Similar Study
An aforementioned study by Toth et al. (in press) also investigated the
problem solving products of groups of students using Belvedere. Like mine,
this study focused not on domain knowledge but on the scientific inquiry
skills acquired while problem solving via "evidence mapping" in an ill-defined
domain. Although the study did not involve the automated Coach at all, it did
involve the use of reflective assessment rubrics, which made explicit the
criteria for evaluating reasoning representations such as a Belvedere diagram
or a prose summary. These rubrics were analogous to many of the
evaluation criteria encoded into the Coach's pattern-matching rules,
although the rubrics were introduced before and after
problem solving rather than online. The rubrics were seen as complementary to
the representational scaffolding provided by Belvedere, with the authors
concluding that "rubrics seem to encourage students to look for and record
disconfirming as well as confirming information more than mapping alone".
Although Toth et al. investigated graphical (Belvedere) versus textual (word processor) representations, I focus here only on the former aspect of their work, particularly on comparisons of the two groups who used Belvedere with and without the rubrics. They measured students' information searching behaviors by counting the number of relevant Hypotheses and Data entered in diagrams, as well as the number of For, Against, and And links. Their student participants were asked to write prose conclusions after problem solving with Belvedere, during which their diagrams were not visible (cf. my students' verbal summaries). The authors scored these prose conclusions based partly on whether they included evidence for a main hypothesis as well as evidence against it, similar to how I evaluated my students' diagrams. They also found no significant differences between their basic information search measures (akin to my body count measures), with only a trend favoring those who used the rubrics. They found that rubric users had significantly more links than non-users, supporting the nonsignificant trend favoring the IC condition in my Experiment 2. However, they found no significant differences between rubric groups on the specific numbers of For, Against, or And links, which paralleled my own findings (with the exception of the And links in Experiment 1). They also found no differences in reasoning scores on their students' prose conclusions, consistent with the lack of differences in my selective verbal summary comparisons.
Toth et al. concluded that further evidence mapping studies with Belvedere or other representation formats should seek to account for differences in student reasoning ability, which is something I tried to do with my surrogate covariate measures. They also stressed the need to evaluate the impact of such representations on the problem solving activities of individual users, something else my research has done. They also discussed the potential need for different representational supports for "inductive" reasoning (i.e., entering Data first, before Hypotheses) versus "deductive" reasoning (i.e., starting with Hypotheses before entering and linking Data), touching upon one of the major limitations of Belvedere's Coach as described above.
Conclusions
Returning now to assessing the impact of Belvedere's minimally intelligent
automated coaching, I am forced to conclude that immediate feedback seems to
have done more harm than good, at least in the present experiments. Although
there were several trends in the data to suggest performance benefits of such
feedback, there was stronger evidence of its negative affective consequences
on user attitudes. However, it is difficult to assess whether the harm done
by immediate feedback in these studies was due to its timing or to locus of
control. Many would agree that the intrusive Coach intervened too frequently
(and, perhaps, too immediately). However, its feedback may have been
too limited to warrant such frequent delivery. It is possible that better,
more useful feedback than currently provided may not be judged as annoying
even if delivered with the same frequency. Alternatively, if the current
intrusive Coach could be made to scale back its frequency of advice delivery
by intervening more selectively, control could become less of an issue.
Further research on the timing of advice requests could inform the design of a
more selective intrusive Coach, such that perhaps it could be made to
intervene at points when users tend to ask for advice anyway. Also, as
mentioned earlier in this discussion, if the Coach could be made aware of user
idle times perhaps it could be made to intervene at those times, instead of
popping up while the user is engaged in a flurry of browsing and diagramming
activity. Such modest delays of otherwise immediate coaching feedback could
help tip the cost/benefit scales back toward optimality.
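The idle-time idea above can be made concrete with a small sketch. It is not Belvedere's implementation: the class name `IdleGatedCoach`, the 10-second threshold, and the advice wording are all hypothetical, and timestamps are passed in explicitly so that the behavior is easy to follow. The sketch only shows the gating logic, in which advice that a rule has generated is held in a queue until the user has paused.

```python
from collections import deque


class IdleGatedCoach:
    """Hold generated advice until the user has been idle long enough.

    Hypothetical sketch; the threshold value is an assumption, not a
    parameter of the actual Belvedere Coach.
    """

    def __init__(self, idle_threshold_s=10.0):
        self.idle_threshold_s = idle_threshold_s
        self.last_activity = 0.0
        self.pending = deque()

    def record_activity(self, now):
        """Call on every browsing or diagramming action."""
        self.last_activity = now

    def queue_advice(self, message):
        """Call when an advice rule fires; nothing is shown yet."""
        self.pending.append(message)

    def poll(self, now):
        """Return one queued message if the user is idle, else None."""
        idle_for = now - self.last_activity
        if self.pending and idle_for >= self.idle_threshold_s:
            return self.pending.popleft()
        return None


coach = IdleGatedCoach(idle_threshold_s=10.0)
coach.record_activity(now=100.0)
coach.queue_advice("Consider looking for evidence against this hypothesis.")
during_flurry = coach.poll(now=103.0)  # user acted 3 s ago: advice is held
after_pause = coach.poll(now=112.0)    # user idle for 12 s: advice is released
```

Whether any particular threshold would actually reduce annoyance is an empirical question of the kind raised above; the sketch only illustrates that such a delay is mechanically simple to add.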
Future Directions
Extensions to the Coach
Expert advice. One extension to Belvedere's Coach that has already been
implemented (see Footnote 2) is the addition of
expert domain knowledge, from which the Coach can draw in generating more
domain-specific advice to users. Although such a capability requires
additional knowledge engineering, the payoff could be high for reusable
problem domains (e.g., scientific debates relevant to a school curriculum in
which many students would need to tackle the same issues). One shortcoming of
the current argument pattern Coach is that it cannot recognize the contents of
statements entered in a diagram. Therefore, some of its advice rules (e.g.,
explain-all-the-data) are prone to suggesting possible relationships
between hypotheses and data where simply none exist. As described elsewhere
(Suthers et al., in press), adding minimal semantic annotations to a
self-contained web database (using a knowledge representation language
accessible to the Coach) can enable the Coach to recognize the basic content
of information chunks transferred into Belvedere from the database. The Coach
can then provide more specific feedback relevant to the actual information
entered, using the domain-general argument pattern Coach as a fall-back when
any non-annotated information (e.g., from outside the self-contained database)
is entered. The benefits relative to the costs of the additional knowledge
engineering still remain to be seen, but it is a logical next step from the
present research.
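The fall-back arrangement described above can be sketched as follows. This is an illustration under stated assumptions rather than the scheme reported in Suthers et al. (in press): the annotation table, the chunk identifiers, the field names, and the message wording are all hypothetical placeholders.

```python
# Hypothetical annotation table: chunk ID -> minimal semantic annotation.
# The IDs, field names, and message wording are illustrative only.
ANNOTATIONS = {
    "chunk-01": {"relation": "against", "hypothesis": "hypothesis-A"},
}

GENERIC_PROMPT = (
    "Does this statement support or contradict any hypothesis in your diagram?"
)


def advise(chunk_id, annotations=ANNOTATIONS):
    """Return (source, message): expert advice if annotated, else generic."""
    note = annotations.get(chunk_id)
    if note is None:
        # No annotation is available (e.g., the statement was typed in by
        # the user), so fall back to domain-general argument pattern advice.
        return ("pattern", GENERIC_PROMPT)
    message = f"This statement may count {note['relation']} {note['hypothesis']}."
    return ("expert", message)


annotated = advise("chunk-01")  # information copied from the annotated database
typed_by_user = advise(None)    # information with no annotation attached
```

The design choice worth noting is that the generic advice remains the default path, so unannotated material never leaves the user without feedback.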
Graphical advice. Another possible limitation of the current advice delivered by Belvedere's Coach is the modality in which it is presented. Belvedere is a graphical environment in which the units of representation are visual objects (boxes and links). However, although many of the Coach's advice messages are accompanied by graphical highlighting of diagram objects, the advice itself is largely text-based. A possible extension to the current argument pattern coach would be for the Coach to present temporary placeholder boxes or links in the diagram when it provides advice (D. Suthers, personal communication, September 23, 2000). For example, on the confirmation bias rule, the Coach could present a temporary Data box with a temporary Against link between it and the target Hypothesis box. The temporary Data box could be empty or, if combined with the expert coaching functionality, could contain an actual piece of disconfirming evidence from an annotated database. These temporary constructs would disappear from the diagram once the user dismissed the advice (using an analog to the "Close" button on the current advice dialog boxes). Such an extension might make the advice more salient to users, and embedding the advice in the same modality as the diagramming task may make it less of a distraction. However, one argument against this scheme is that providing such placeholders, especially ones containing actual domain information, would be akin to providing users a "bottom out hint" (Aleven & Koedinger, 2000, p. 294) in a hierarchical sequence of help messages. Many tutorial systems prefer to begin with general advice on principles or solution processes that will help users to arrive at a correct answer on their own, reserving an actual correct answer itself for the final (bottom) hint in the sequence. 
Providing Belvedere users with an actual piece of disconfirming evidence for a confirmation-biased hypothesis would relieve them of the burden of finding and analyzing the status of such evidence on their own. If a task goal is to learn scientific inquiry skills, then providing help on that level would seem to run counter to the goal. However, the less intelligent version of the idea, in which empty boxes are temporarily presented by the Coach, would seem to be a viable compromise extension to the current Coach.
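The "less intelligent" version of the placeholder idea can be sketched in a few lines. The data structures and function names below are hypothetical and do not reflect Belvedere's internal representation. The trigger condition (a Hypothesis with at least one For link and no Against link) is my simplifying assumption about when the confirmation bias rule would apply.

```python
from dataclasses import dataclass, field


@dataclass
class Box:
    kind: str                # "Hypothesis", "Data", or "Unspecified"
    text: str = ""
    temporary: bool = False  # True for Coach-supplied placeholders


@dataclass
class Link:
    kind: str                # "For", "Against", or "And"
    source: Box
    target: Box
    temporary: bool = False


@dataclass
class Diagram:
    boxes: list = field(default_factory=list)
    links: list = field(default_factory=list)


def suggest_disconfirming(diagram, hypothesis):
    """Add an empty temporary Data box with a temporary Against link."""
    incoming = [l for l in diagram.links if l.target is hypothesis]
    has_for = any(l.kind == "For" for l in incoming)
    has_against = any(l.kind == "Against" for l in incoming)
    if not has_for or has_against:
        return False  # assumed trigger condition not met
    placeholder = Box("Data", temporary=True)
    diagram.boxes.append(placeholder)
    diagram.links.append(Link("Against", placeholder, hypothesis, temporary=True))
    return True


def dismiss_advice(diagram):
    """Remove all temporary constructs, as when the user closes the advice."""
    diagram.boxes = [b for b in diagram.boxes if not b.temporary]
    diagram.links = [l for l in diagram.links if not l.temporary]


h = Box("Hypothesis", "example hypothesis")
d = Box("Data", "example supporting evidence")
diagram = Diagram(boxes=[h, d], links=[Link("For", d, h)])
shown = suggest_disconfirming(diagram, h)
boxes_while_shown = len(diagram.boxes)
dismiss_advice(diagram)
```

Because the placeholder box is empty, the user is prompted about the shape of the missing argument without being handed its content, which keeps the suggestion short of a bottom-out hint.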
Same Coach, Different Scenarios
Finally, an obvious extension to the current research would be to use the same
types of coaching with (a) more difficult ill-defined problems, (b) less able
experimental participants, or (c) both. As mentioned earlier in this paper
(see Footnote 3), other self-contained web
databases already exist for more complex ill-defined problems (databases with
much less scaffolding of evidential relations than the one I used), and the
domain-general nature of coaching in Belvedere opens it to countless other
domains as well. To the extent that the relative lack of positive effects in
my experiments was due to the simplicity of the chosen ill-defined problem or
to the structure of its associated database, solvers of more complex problems
could conceivably get some more mileage out of the current Coach than did my
students. Alternatively, younger students or others more deficient than
college freshmen in the skills of scientific inquiry might find the Coach's
current feedback to be more helpful. Even the intrusive Coach in its current
incarnation might be less likely to jump the gun on users who did not find its
feedback to be superfluous to their own trains of thought. This could hold
true even in problem domains that provide as much inquiry scaffolding as did
my self-contained web database, depending on user reasoning ability. Only
after exploration of at least some of these avenues could one draw any general
conclusions about the impact of immediate feedback, even in its current form,
on ill-defined problem solving with Belvedere.
Below is an illustrated excerpt of a typical user's interaction with Belvedere and Netscape.
Figure A1. Example screen layout for the experimental sessions.
Figure A2. Screen after clicking on "Evidence for this hypothesis".
Figure A3. Screen after clicking on "Plane's proximity to dry land", clicking on Data icon, and copying statement.
Figure A4. Screen after clicking on "Plane's proximity to dry land".
Figure A5. Screen after drawing For link from new Data box to an existing Hypothesis box.
Figure A6. Screen after clicking on "Consider Possible Causes".
Notes:
2-statement-circular-argument *
Notes:
Home page: http://advlearn.lrdc.pitt.edu/experiments/materials/JC/TWA/
[displays menu.htm in left pane and crshprob.htm in right pane]
menu.htm: [six unnumbered links, numbered here for clarity]
The scientific problem we'll be dealing with today is an actual, real-life problem that remains unsolved to this day. Therefore, there is no correct answer to the problem, and you will not be expected to come up with a definitive solution to it yourself.
Before we get started on the problem, I'd like to request some information from you. We would like to compare your problem solving performance in this experiment today with some other general measures of academic and reasoning ability. Do you happen to remember your SAT scores offhand (Math? Verbal?)? Do you happen to know your current GPA (as of this term)?
The specific scientific problem we'll be dealing with today is: What caused the crash of TWA flight 800? Like I said, it's an actual problem that remains unsolved to this day; scientists still haven't figured out a definite cause for the crash. So again, there is no correct answer to this problem. Your task today is to try to "make sense" of the information we have in the database and to try to come up with an account of what you think may have caused the crash.
While you're working on the problem, I'm going to ask you to record your thoughts in Belvedere (POINT). Belvedere is a piece of software we've developed here at LRDC that allows you to graphically map out the relationships between hypotheses and evidence. Basically, it lets you draw argument diagrams on the screen.
If while browsing through the database you find a piece of evidence that you think is relevant, you can enter it into Belvedere using a Data box (POINT). When you click on the little Data icon (DEMO), a dialog box will appear here (POINT) and you can type in a summary of the evidence here (POINT), and then click here (POINT: "Add this to diagram") to place your little Data box somewhere in the diagram. In the same manner, if you find a hypothesis that you think is relevant, you can put it in a Hypothesis box (POINT). It works the same way as a Data box, but it will appear as a box with rounded corners, as opposed to square corners like a Data box. Also, if you find a statement that you think is relevant, but you're not sure whether it's Data or a Hypothesis, then you can put it in an Unspecified box, which has a kind of cloud-like shape. So those are the three types of boxes you can use for different types of statements.
To show the relationships between the statements in your boxes, you would connect them using these links (POINT). You would use a For link to show that one statement supports or explains another, or is somehow for another. This type of link would be colored green in your diagram. You would use an Against link to show that one statement contradicts another or is somehow against another. This link would show up as red in your diagram, with a red "X" drawn over the middle of it. And you could use the And link to show that two statements together are for or against another statement. This link would appear in black, with a little "ball" in the middle of it. In just a minute I'll show you an example of what these boxes and links look like in a Belvedere diagram. One thing about drawing links: Belvedere doesn't work like many other drawing programs, and link-drawing is a little counterintuitive. To draw a link, click on its icon, move the cursor inside the box you want to draw it from, then click and hold the mouse while dragging the cursor inside the box you want to draw the link to... Do you have any questions at this point?
Sample Text & Diagram (NS)
I'm going to show you a short excerpt of some text about an unrelated scientific problem, along with a corresponding representation of the text as a Belvedere diagram. I'm going to ask you to read the text and compare it to the diagram, so that you understand how the information in the diagram matches up to what's in the text. Let me know when you're done... (after reading) Do you have any questions about how the text and the diagram correspond?
If in Condition NC, skip to *** below
Belvedere has a computerized Coach that will keep track of what you're doing in the diagram and may want to make suggestions about how to improve your diagram or about what you may want to look for next in the database. You can get advice from the Coach whenever you want it by clicking on the lightbulb icon (POINT). The advice will appear in a little box in the upper left corner of the screen, with "Here's an Idea" across the top of it.
| Condition I (IC) | Condition D (DC) |
| I'm telling you this because sometimes the Coach may also speak up on its own, even if you don't click on it -- so if a box appears there you'll know it's a message from the Coach. | Periodically throughout the session, you will hear beeping sounds from behind you. These beeps are simply to remind you that advice is available from Belvedere whenever you want it (you don't have to click the bulb when you hear the beeps). Please let me know each time you hear them... |
If a Coach box pops up such that it covers up part of your diagram, you can move the Coach box around on the screen. I'm telling you this because sometimes the Coach will highlight parts of your diagram in yellow, when its advice refers to specific parts of it; so you may need to move the Coach box around on the screen to see what parts turned yellow.
***
That's about it. I've told you the basics about how to use Belvedere, but I
may have left out some of the details. But, I'm not trying to put you through
the wringer here; if I see you having trouble with something, I'll jump in and
help you. I'll be sitting back here doing other work, so I may not be paying
close attention to what you're doing the entire session; so if
you are having trouble with something, please feel free to ask me for help at
any point. Also, feel free to ask me questions at any time during the
session.
Remember -- Your final goal is not to come up with a definitive answer to the question (not even the best experts have been able to do that!), but rather to try to "make sense" of the information and try to figure out the most likely cause of the crash...
Do you have any questions at this point? Then I'll ask you to start by reading the Problem Statement here (POINT), and then you can proceed through the database however you see fit.
(prepare SAT perm form and feedback sheet)
Verbal Summaries (stop timer if Cond. DC)
OK, for the next part of the session, I'm going to ask you some questions and ask you to give verbal responses to them. So I want to tape record just this part of the session... (click Desktop icon to hide all)
For the last major part of the session today, I'd like you to fill out a brief survey about the experiment. It has some questions about the crash, some items about Belvedere, and some more general items...
SAT/GPA permission form
OK, I have one last thing to ask of you. I mentioned earlier that we'd like to compare your performance today with other more general measures. So I'm asking each participant in this experiment for permission to access their official SAT scores and GPA at the end of this term, on the University's ISIS system. If you do give me permission to do so, rest assured that your data would be held completely confidential (only I will see it), that it would be used only for statistical averaging purposes, and that it would not be tied to your name in any way -- it would only be stored with your arbitrary ID number for this experiment...
Debrief: ask whether S has heard of the NTSB's final report? Ever see bulb blink (DC only)? Find the Coach or beeps annoying?
Ask S not to discuss problem or experiment with classmates
Notes:
| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
| Very Strongly Disagree | Strongly Disagree | Moderately Disagree | Slightly Disagree | Neutral | Slightly Agree | Moderately Agree | Strongly Agree | Very Strongly Agree |
Aleven, V., & Koedinger, K. R. (2000). Limitations of student control: Do students know when they need help? In G. Gauthier, C. Frasson, & K. VanLehn (Eds.), ITS 2000: Proceedings of the 5th International Conference on Intelligent Tutoring Systems (pp. 292-303). Berlin: Springer-Verlag.
Anderson, J. R., Boyle, C. F., Farrell, R., & Reiser, B. (1984). Cognitive principles in the design of computer tutors. In Proceedings of the Sixth Annual Conference of the Cognitive Science Society (pp. 2-9). Boulder: University of Colorado, Institute of Cognitive Science.
Anderson, J. R., Boyle, C. F., & Reiser, B. J. (1985). Intelligent tutoring systems. Science, 228, 456-462.
Anderson, J. R., Corbett, A. T., Koedinger, K. R., & Pelletier, R. (1995). Cognitive tutors: Lessons learned. The Journal of the Learning Sciences, 4(2), 167-207.
Anderson, J. R., & Reiser, B. J. (1985). The LISP tutor. Byte, 10, 159-175.
Brown, J. S., Collins, A., & Duguid, P. (1989). Situated cognition and the culture of learning. Educational Researcher, 18(1), 32-41.
Burton, R. R., & Brown, J. S. (1982). An investigation of computer coaching for informal learning activities. In D. Sleeman & J. S. Brown (Eds.), Intelligent tutoring systems (pp. 79-98). New York: Academic Press.
Cacioppo, J. T., & Petty, R. E. (1982). The need for cognition. Journal of Personality and Social Psychology, 42, 116-131.
Cacioppo, J. T., Petty, R. E., & Kao, C. F. (1984). The efficient assessment of need for cognition. Journal of Personality Assessment, 48, 306-307.
Cavalli-Sforza, V. (1998). Constructed vs. received graphical representations for learning about scientific controversy: Implications for learning and coaching. Unpublished doctoral dissertation, Intelligent Systems Program, University of Pittsburgh, PA.
Chu, R. W., Mitchell, C. M., & Jones, P. M. (1995). Using the operator function model and OFMspert as the basis for an intelligent tutoring system: Towards a tutor/aid paradigm for operators of supervisory control systems. IEEE Transactions on Systems, Man, and Cybernetics, 25(7), 1054-1075.
Clancey, W. J. (1986). Qualitative student models. Annual Review of Computer Science, 1, 381-450.
Collins, A. (1996). Design issues for learning environments. In S. Vosniadou, E. De Corte, R. Glaser, & H. Mandl (Eds.), International perspectives on the design of technology-supported learning environments (pp. 347-361). Mahwah, NJ: Erlbaum.
Conati, C., & VanLehn, K. (1999). Teaching meta-cognitive skills: Implementation and evaluation of a tutoring system to guide self-explanation while learning from examples. In AIED '99: Proceedings of the 9th World Conference of Artificial Intelligence and Education. Amsterdam: IOS Press.
Connelly, J. W. (1989). An empirical investigation of the effective degrees of feedback content in GIL, an intelligent tutor for programming. Unpublished manuscript, Princeton University, Princeton, NJ.
Connelly, J. (1997). Specialty exam. Cognitive psychology program, University of Pittsburgh. Available: http://www.pitt.edu/~connelly/comps.html
Connelly, J., & Lesgold, A. (1999). Intelligent tutoring systems. In J. G. Webster (Ed.), Encyclopedia of electrical and electronics engineering (Vol. 10, pp. 529-541). New York: Wiley.
Corbett, A. T., & Anderson, J. R. (1990). The effect of feedback control on learning to program with the Lisp tutor. In Proceedings of the Twelfth Annual Conference of the Cognitive Science Society (pp. 796-803). Hillsdale, NJ: Erlbaum.
Corbett, A. T., & Anderson, J. R. (1992). LISP Intelligent Tutoring System: Research in skill acquisition. In J. H. Larkin & R. W. Chabay (Eds.), Computer-assisted instruction and intelligent tutoring systems: Shared goals and complementary approaches (pp. 73-109). Hillsdale, NJ: Erlbaum.
De Corte, E. (1996). Changing views of computer-supported learning environments for the acquisition of knowledge and thinking skills. In S. Vosniadou, E. De Corte, R. Glaser, & H. Mandl (Eds.), International perspectives on the design of technology-supported learning environments (pp. 129-145). Mahwah, NJ: Erlbaum.
Fischer, P. M., & Mandl, H. (1988). Improvement of the acquisition of knowledge by informing feedback. In H. Mandl & A. Lesgold (Eds.), Learning issues for intelligent tutoring systems (pp. 187-241). New York: Springer-Verlag.
Fix, V., & Wiedenbeck, S. (1996). An intelligent tool to aid students in learning second and subsequent programming languages. Computers and Education, 27(2), 71-83.
Gertner, A. S., & VanLehn, K. (2000). Andes: A coached problem solving environment for physics. In G. Gauthier, C. Frasson, & K. VanLehn (Eds.), ITS 2000: Proceedings of the 5th International Conference on Intelligent Tutoring Systems (pp. 133-142). Berlin: Springer-Verlag.
Gott, S. P., Lesgold, A., & Kane, R. S. (1997). Tutoring for transfer of technical competence. In S. Dijkstra, F. Schott, N. Seel, & R. D. Tennyson (Eds.), Instructional design: Vol II: Solving instructional design problems (pp. 221-250). Mahwah, NJ: Erlbaum.
Katz, S., & Lesgold, A. (1993). The role of the tutor in computer-based collaborative learning situations. In S. P. Lajoie & S. J. Derry (Eds.), Computers as cognitive tools (pp. 289-317). Hillsdale, NJ: Erlbaum.
Katz, S., & Suthers, D. (1998). Guiding the development of critical inquiry skills: Lessons learned by observing students interacting with subject-matter experts and a simulated inquiry coach. Paper presented at the American Educational Research Association 1998 Annual Meeting, April 13-17 1998, San Diego, CA.
Koedinger, K. R., & Anderson, J. R. (1993b). Reifying implicit planning in geometry: Guidelines for model-based intelligent tutoring system design. In S. P. Lajoie & S. J. Derry (Eds.), Computers as cognitive tools (pp. 15-45). Hillsdale, NJ: Erlbaum.
Koedinger, K. R., Anderson, J. R., Hadley, W. H., & Mark, M. A. (1995). Intelligent tutoring goes to school in the big city. In AI-ED 95: Proceedings of the 7th World Conference on Artificial Intelligence in Education (pp. 421-428). Washington, DC: Association for the Advancement of Computing in Education.
Kulik, J. A., & Kulik, C.-L. C. (1988). Timing of feedback and verbal learning. Review of Educational Research, 58(1), 79-97.
Legree, P. J., Gillis, P. D., & Orey, M. A. (1993). The quantitative evaluation of intelligent tutoring system applications: Product and process criteria. Journal of Artificial Intelligence in Education, 4(2/3), 209-226.
Lesgold, A. (1994a). Assessment of intelligent training technology. In E. L. Baker & H. F. O'Neil Jr. (Eds.), Technology assessment in education and training (pp. 97-116). Hillsdale, NJ: Erlbaum.
Lesgold, A. (1994b). Ideas about feedback and their implications for intelligent coached apprenticeship. Machine-Mediated Learning, 4, 67-80.
Lesgold, A., Katz, S., Greenberg, L., Hughes, E., & Eggan, G. (1992). Extensions of intelligent tutoring paradigms to support collaborative learning. In S. Dijkstra, H. P. M. Krammer, & J. J. G. van Merriënboer (Eds.), Instructional models in computer-based learning environments (pp. 291-311). Berlin: Springer-Verlag.
Mark, M. A., & Greer, J. E. (1993). Evaluation methodologies for intelligent tutoring systems. Journal of Artificial Intelligence in Education, 4(2/3), 129-153.
McKendree, J. (1990). Effective feedback content for tutoring complex skills. Human-Computer Interaction, 5(4), 381-413.
Means, M. L., & Voss, J. F. (1996). Who reasons well? Two studies of informal reasoning among children of different grade, ability, and knowledge levels. Cognition and Instruction, 14(2), 139-178.
Merrill, D. C., Reiser, B. J., Ranney, M., & Trafton, J. G. (1992). Effective tutoring techniques: A comparison of human tutors and intelligent tutoring systems. The Journal of the Learning Sciences, 2(3), 277-306.
Nathan, M. J. (1998). Knowledge and situational feedback in a learning environment for algebra story problem solving. Interactive Learning Environments, 5, 135-159.
Paolucci, M., Suthers, D., & Weiner, A. (1996). Automated advice-giving strategies for scientific inquiry. In C. Frasson, G. Gauthier, & A. Lesgold (Eds.), ITS96: Proceedings of the Third International Conference on Intelligent Tutoring Systems (pp. 372-381). New York: Springer-Verlag.
Polson, M. C., & Richardson, J. J. (1988). Foundations of intelligent tutoring systems. Hillsdale, NJ: Erlbaum.
Reiser, B. J., Friedmann, P., Gevins, J., Kimberg, D. Y., Ranney, M., & Romero, A. (1988). A graphical programming language interface for an intelligent LISP tutor. In Proceedings of CHI'88, Conference on Human Factors in Computing Systems (pp. 39-44). New York: ACM.
Reusser, K. (1996). From cognitive modeling to the design of pedagogical tools. In S. Vosniadou, E. De Corte, R. Glaser, & H. Mandl (Eds.), International perspectives on the design of technology-supported learning environments (pp. 81-103). Mahwah, NJ: Erlbaum.
SAS Institute Inc. (1999). SAS OnlineDoc®, Version 8. Cary, NC: Author. Available: http://v8doc.sas.com/sashtml/stat/chap23/sect13.htm#idxclu0263
Schneider, D., & Dorans, N. (1999, June). Concordance Between SAT® I and ACT™ Scores for Individual Students. Research Notes (RN-07). New York, NY: The College Board. Available: http://www.collegeboard.org/research/html/rn07.pdf
Schofield, J. W., Evans-Rhodes, D., & Huber, B. R. (1990). Artificial intelligence in the classroom: The impact of a computer-based tutor on teachers and students. Social Science Computer Review, 8(1), 24-41.
Schooler, L. J., & Anderson, J. R. (1990). The disruptive potential of immediate feedback. In Proceedings of the Twelfth Annual Conference of the Cognitive Science Society (pp. 702-708). Hillsdale, NJ: Erlbaum.
Seidel, R. J., & Park, O. C. (1994). An historical perspective and a model for evaluation of intelligent tutoring systems. Journal of Educational Computing Research, 10(2), 103-128.
Shute, V., & Glaser, R. (1990). A large scale evaluation of an intelligent discovery world: Smithtown. Interactive Learning Environments, 1, 51-77.
Shute, V. J., & Regian, J. W. (1993). Principles for evaluating intelligent tutoring systems. Journal of Artificial Intelligence in Education, 4(2/3), 245-271.
Snedecor, G. W., & Cochran, W. G. (1980). Statistical methods (7th Ed.). Ames, IA: Iowa State University Press.
Stasz, C., Ormseth, T., McArthur, D., & Robyn, A. (1989, March). An intelligent tutor for basic algebra: Perspectives on evaluation. In Instructional views of intelligent computer-assisted instruction: Data and issues. Symposium conducted at the annual meeting of the American Educational Research Association, San Francisco, CA.
Suthers, D. (1993). Preferences for Model Selection in Explanation. Paper presented at the Proceedings of the 13th International Joint Conference on Artificial Intelligence (IJCAI-93), Chambery, France.
Suthers, D., Connelly, J., Lesgold, A., Paolucci, M., Toth, E. E., Toth, J., & Weiner, A. (in press). Representational and advisory guidance for students learning scientific inquiry. To appear in K. Forbus & P. J. Feltovich (Eds.), Smart machines in education. Menlo Park, CA: AAAI Press.
Suthers, D., & Jones, D. (1997). An architecture for intelligent collaborative educational systems. In B. du Boulay & R. Mizoguchi (Eds.), Proceedings of AI-ED 97 World Conference on Artificial Intelligence in Education (pp. 87-94). Tokyo, Japan: IOS Press.
Suthers, D. D., Toth, E. E., & Weiner, A. (1997). An integrated approach to implementing collaborative inquiry in the classroom. Proceedings of the Second International Conference on Computer Supported Collaborative Learning (CSCL'97), Toronto, December 10-14, 1997. pp. 272-279.
Suthers, D., & Weiner, A. (1995). Groupware for developing critical discussion skills. In J. L. Schnase & E. L. Cunnius (Eds.), Proceedings of CSCL '95: The First International Conference on Computer Support for Collaborative Learning (pp. 341-348). Mahwah, NJ: Erlbaum.
Suthers, D., Weiner, A., Connelly, J., & Paolucci, M. (1995). Belvedere: Engaging students in critical discussion of science and public policy issues. In AI-ED 95: Proceedings of the 7th World Conference on Artificial Intelligence in Education (pp. 266-273). Washington, DC: Association for the Advancement of Computing in Education.
Sweller, J. (1988). Cognitive load during problem solving: Effects on learning. Cognitive Science, 12, 257-285.
Toth, E. E., Suthers, D. D., & Lesgold, A. M. (in press). Mapping to know: The effects of evidence maps and reflective assessments on scientific inquiry skills. Science Education.
Toth, J. A., Suthers, D., & Weiner, A. (1997). Providing expert advice in the domain of collaborative scientific inquiry. In B. du Boulay & R. Mizoguchi (Eds.), Proceedings of AI-ED 97 World Conference on Artificial Intelligence in Education. Tokyo, Japan: IOS Press.
Twidale, M. (1993). Redressing the balance: The advantages of informal evaluation techniques for intelligent learning environments. Journal of Artificial Intelligence in Education, 4(2/3), 155-178.
VanLehn, K. (1988a). Student modeling. In M. C. Polson & J. J. Richardson (Eds.), Foundations of intelligent tutoring systems (pp. 55-78). Hillsdale, NJ: Erlbaum.
VanLehn, K. (1988b). Toward a theory of impasse-driven learning. In H. Mandl & A. Lesgold (Eds.), Learning issues for intelligent tutoring systems (pp. 19-41). New York: Springer-Verlag.
VanLehn, K. (1996). Conceptual and meta learning during coached problem solving. In C. Frasson, G. Gauthier, & A. Lesgold (Eds.), ITS96: Proceedings of the Third International Conference on Intelligent Tutoring Systems (pp. 29-47). New York: Springer-Verlag.
VanLehn, K., Freedman, R., Jordan, P., Murray, C., Osan, R., Ringenberg, M., Rose, C., Schulze, K., Shelby, R., Treacy, D., Weinstein, A., & Wintersgill, M. (2000). Fading and deepening: The next steps for Andes and other model-tracing tutors. In G. Gauthier, C. Frasson, & K. VanLehn (Eds.), ITS 2000: Proceedings of the 5th International Conference on Intelligent Tutoring Systems (pp. 474-483). Berlin: Springer-Verlag.
Veerman, A. L. (2000). Computer-supported collaborative learning through argumentation. Unpublished doctoral dissertation, University of Utrecht, the Netherlands. Available: http://eduweb.fss.uu.nl/arja/Veerman-thesis-pdf.zip
Voss, J. F., & Post, T. A. (1988). On the solving of ill-structured problems. In M. T. H. Chi, R. Glaser, & M. J. Farr (Eds.), The nature of expertise (pp. 261-285). Hillsdale, NJ: Erlbaum.
Wan, D., & Johnson, P. M. (1994). Experiences with CLARE: A computer-supported collaborative learning environment. International Journal of Human-Computer Studies, 41(6), 851-879.
Wenger, E. (1987). Artificial intelligence and tutoring systems: Computational and cognitive approaches to the communication of knowledge. Los Altos, CA: Morgan Kaufmann.
Wertheimer, R. (1990). The geometry proof tutor: An "intelligent" computer-based tutor in the classroom. Mathematics Teacher, 84(4), 308-317.