The Discovery of Argon: A Case for Error-Statistical Reasoning
Aris Spanos
Virginia Tech,
Lord Rayleigh and Sir William Ramsay discovered the inert gas ‘argon’ in
atmospheric air in 1895, after a series of experiments involving the production
of nitrogen using two different procedures, broadly defined as ‘atmospheric’ and
‘chemical’. The discovery process was based on a carefully designed sequence of
experiments combined with an informal (by today’s standards) analysis of the
resulting data. This combination guided them to further and further experiments
until they were able to establish the existence of argon in the atmospheric
air, by eliminating other possible explanations. The basic objective of the paper
is twofold. First, to capture the sound scientific intuition underlying their
reasoning and their pragmatic attempts to address Duhemian problems in their
efforts to explain the source of the initial discrepancy. Second, to formalize
their ‘reasoning from error’ which led them to infer the existence of argon in
the atmospheric air and eventually to the discovery of the other noble gases. It
is argued that the error statistical account proposed by Mayo (1996) provides
an appropriate framework for accomplishing both objectives.
∗I would like to thank Alan Chalmers for invaluable advice on the history of this episode. I am
grateful to Deborah Mayo for numerous suggestions that helped to improve the paper.
1 Introduction
Lord Rayleigh (1842-1919) was awarded the 1904 Nobel Prize in physics and his
collaborator, Sir William Ramsay (1852-1916), was awarded the 1904 Nobel Prize in
chemistry, for their role in the discovery of argon, an inert gas in the atmosphere.
The discovery of argon was of paramount importance for Chemistry because it was
instrumental in the discovery of helium, neon, krypton, xenon and radon by Ramsay
in the late 1890s and early 1900s, discoveries which helped to revise and complete
the periodic table; see Freund (1968). At the same time, Mendeleev’s (1834-1907)
antagonistic reaction to the discovery of argon might have been the primary reason
for his not being awarded the Nobel Prize in Chemistry in 1906; see Gordin (2004).
The discovery of argon resulted from the careful unraveling of an empirical
‘discrepancy’, detected initially by Rayleigh (professor at Cambridge University’s
Cavendish Laboratory), in the density of nitrogen gas produced by two different procedures,
broadly described as ‘atmospheric’ and ‘chemical’. The conclusion, based
on joint work between Rayleigh and Ramsay (professor of chemistry at University
College, London), that the atmospheric air contains an inert gas they called argon,
hitherto unknown, was reached after a long ‘discovery’ process based on trial and error.
The idea of this paper is to revisit the Rayleigh-Ramsay historical episode with
a view to ‘capturing’ and systematizing their learning-from-error process. Their ‘trial
and error’ process was based on a carefully designed sequence of experiments, combined
with an informal (by today’s standards) analysis of the resulting data, which
guided them to further and further experiments until they reached the conclusion
that atmospheric nitrogen contained argon absorbed from atmospheric air.
The perspective used for the proposed formalization is that of the error-statistical
account (see Mayo, 1996). Error-probing reasoning can be used to formalize
the scientists’ resourceful ways of dealing with Duhemian problems, as they arose
in their actual research situations. In particular, the emphasis placed on a variety
of techniques for avoiding and uncovering error (ibid, pp. 47) captures essential
strategies that Rayleigh and Ramsay used to convince both themselves and the numerous
skeptics that their results were not experimental artifacts but that they had,
indeed, discovered a new chemical element. The reasoning is that of arguing from
error (ibid, p. 7), and it is precisely this reasoning, Mayo argues, that is formally
captured in some of the standard statistical techniques, e.g., significance tests, analysis
of variance, etc. Those who question whether statistical reasoning could possibly
lie at the heart of scientific learning in general on the grounds that these techniques
are developments of the 20th century overlook the extent to which they are actually
systematizing and sharpening tools that scientists (and humans, in general) have
used for many centuries to learn by induction in the face of errors and uncertainty.
Indeed, the discovery of argon provides an excellent example of arguing from error
which began as ‘an empirical anomaly’ whose thorough investigation eventually led
to a substantial growth in experimental knowledge.
In section 2 the paper summarizes the story of the discovery of argon, paying
particular attention to the systematic efforts first by Rayleigh and then jointly with
Ramsay to establish the robustness of the empirical discrepancy between atmospheric
and chemical nitrogen, and then explain its source, by eliminating other possible explanations.
Section 3 of the paper attempts a formalization of the statistical analysis
performed by Rayleigh on his experimental data in order to make a case that the kind
of inductive reasoning employed can be captured using the error-statistical approach.
An important dimension of the error-statistical approach is concerned with the validity
of the premises of inductive inference. It is shown that some of the (informal)
empirical tests used by Rayleigh can be viewed as informal misspecification tests in
order to establish the validity of the premises for his inferences. Section 4 applies the
statistical analysis of section 3 to the more precise data generated by the collaborative
efforts of Rayleigh and Ramsay. It is shown that the new data give rise to more
incisive inferences.
2 The story of the discovery of Argon
The story of the discovery of argon is interesting from the philosophy of science
perspective for several reasons. First, it constitutes a case where the new knowledge
was not the result of prediction from a well-established theory, but a systematic search
to explain an empirical discrepancy detected in an experiment. Indeed, one of the
primary reasons for the initial reluctance to accept it was the fact that it could not be
accommodated into the periodic table as understood at the time. Second, the fact
that the new knowledge could not be accommodated in the theoretical framework at
the time led to a major overhaul of that framework, including a complete revamping
of the periodic table, changing its basis from atomic weights to atomic numbers.
Third, the discovery of argon provides an example where the careful analysis of the
data from experiments led the scientists in question to a systematic sequence of other
experiments that eventually led them to the unequivocal conclusion of the existence of
argon, by trial and error; a clear example of learning from error. Fourth, the statistical
analysis performed by these scientists in the 1890s is technically primitive by today’s
standards, but highly sophisticated from the inductive reasoning perspective. As
demonstrated below, Rayleigh was able to draw reliable inductive inferences because
his data analysis involved establishing a primitive form of statistical adequacy for his
premises of inference; see Spanos (2006) for further discussion.
2.1 The initial detection of an empirical discrepancy
Lord Rayleigh’s investigations that led to the discovery of argon began in the early
1890s with an attempt to measure the (molar) mass of nitrogen, after his successful
determination of the mass of oxygen and hydrogen. In the 1890s the conventional
wisdom was that atmospheric air was “a mixture of oxygen, nitrogen, and small
quantities of carbon dioxide and water vapour, together with a trace of ammonia.”
(Ramsay, 1915, p. 148). Rayleigh produced nitrogen gas from atmospheric air by removing
the other known constituents, calling it "atmospheric nitrogen". The original
procedure he used was to bubble air through liquid ammonia and then pass it through
a hot tube. The water produced could be removed by drying agents and the nitrogen
product joined nitrogen from the atmospheric sample. He was ready to publish his
results, but decided to pursue some additional experiments in order to eliminate any
potential errors due to the particular process used to produce the nitrogen:
“But then I reflected that it is always advisable to employ more than one method,
and that the method I had used — Mr. Vernon Harcourt’s method — was not that
which had been used by any of those who had preceded me in weighing nitrogen.
The usual method consists in absorbing the oxygen of air by means of red-hot
copper; and I thought that I ought at least to give that method a trial, fully
expecting to obtain forthwith a value in harmony with that already afforded
by the ammonia method.” (Rayleigh, 1895, p. 702)
To his surprise, the nitrogen produced by passing air over hot copper, thus removing
the oxygen as copper oxide, was heavier than that obtained by the ammonia method, in which part of the nitrogen comes from the ammonia itself (hence "chemical nitrogen").
That is, when he compared the mass of nitrogen from these two different methods
he found a small but ‘persistent’ (and repeatable) discrepancy that could not be
explained away as experimental error:
“... the gas obtained by the copper method, ..., proved to be one-thousandth
part heavier than that obtained by the ammonia method; and on repetition,
that difference was only brought out more clearly.” (1895, p. 702)
2.2 Magnifying the detected empirical discrepancy
Rayleigh’s next step exemplifies the powerful intuition of an exceptional scientist:
“The next step in the enquiry was, if possible, to exaggerate the discrepancy.
One’s instinct at first is to try to get rid of a discrepancy, but I believe that
experience shows such an endeavour to be a mistake. What one ought to do
is to magnify a small discrepancy with a view to finding out the explanation;
and, as it appeared in the present case that the root of the discrepancy lay in the
fact that part of the nitrogen prepared by the ammonia method was nitrogen
out of ammonia, although the greater part remained of common origin in both
cases, the application of the principle suggested pure oxygen for atmospheric
air in the ammonia method, so that the whole, instead of only part, of the
nitrogen collected should be derived from the ammonia itself. The discrepancy
was at once magnified some five times” (Rayleigh, 1895) [emphasis added].
That is, he devised a way to magnify the observed discrepancy by making specific
changes to the experimental setup in order to eliminate any errors due to using
atmospheric air instead of pure oxygen.
2.3 Establishing the robustness of the empirical discrepancy
Rayleigh used a variety of different methods to produce both atmospheric and chemical
nitrogen in order to eliminate any errors due to the peculiarities of the method
of production. The sound experimental intuition exemplified by Rayleigh in utilizing
such a variety of methods can be best understood as a way to ensure the robustness
of his results to certain background assumptions (Mayo, 1996, p. 6). The idea was
that if the different ways to produce, say, atmospheric nitrogen gave rise to the same
measurement of density, one may argue that it is impossible that they all conspired to
do so; this similarity is strong evidence of a reliable or robust measurement. In other
words, if the results were due to experimental artifacts, these would have had an excellent chance
of being detected by at least one of the distinct ways of producing nitrogen. In
this case, he wanted to ensure that the results of the two broad methods of producing
nitrogen were not ‘sensitive’ to the particular process, instruments or chemicals used.
For the atmospherically produced nitrogen Rayleigh used four different methods
to eliminate the oxygen using ammonia: hot iron; ferrous hydrate; hot iron,
electrified; and copper. For the chemically derived nitrogen he also used four different
methods: from nitric oxide (NO) by hot iron; from nitrous oxide (N2O) by hot iron; from N2O
by hot iron, electrified; and from ammonium nitrite (NH4NO2). Each of the four methods
is used to obtain an average, and it is expected that there will be some variability even
if they are all measuring the same value. The question is whether they are close
enough (within the experimental error bounds) to be regarded as having arisen from
the same process. Rayleigh compared the averages of each particular method within
the atmospheric nitrogen category:
by hot iron (1893):
$\bar{x}_1 = \frac{2.31017 + 2.30986 + 2.31010 + 2.31001}{4} = 2.31003$,
by copper (1892):
$\bar{x}_2 = 2.31026$,
by hot iron, electrified (1894):
$\bar{x}_3 = \frac{2.31163 + 2.30956}{2} = 2.31059$,
by ferrous hydrate (1894):
$\bar{x}_4 = \frac{2.31024 + 2.31010 + 2.31028}{3} = 2.3102$,
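with the overall average $\bar{x} = 2.310221$ (the mean of all ten weighings), and concluded that the absolute deviations:
$|\bar{x}-\bar{x}_1| = .000191$, $|\bar{x}-\bar{x}_2| = .000039$, $|\bar{x}-\bar{x}_3| = .000369$, $|\bar{x}-\bar{x}_4| = .000021$ (1)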
were ‘small enough’ to infer that all four production processes can be realistically
viewed as giving rise to the same (homogeneous) product. That is, the differences in
(1) are not ‘large enough’ to suggest that the nitrogen produced by each method was
qualitatively different.
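The within-method comparison can be retraced with a few lines of code. The following is a minimal sketch, not Rayleigh's own procedure: the weighings are those listed above, the function name `method_deviations` is hypothetical, and the overall average is assumed to be the mean of all ten weighings (which reproduces the reported 2.310221).

```python
from statistics import fmean

# Rayleigh's atmospheric-nitrogen weighings, grouped by production method,
# exactly as listed in the text.
atmospheric = {
    "hot iron (1893)": [2.31017, 2.30986, 2.31010, 2.31001],
    "copper (1892)": [2.31026],
    "hot iron, electrified (1894)": [2.31163, 2.30956],
    "ferrous hydrate (1894)": [2.31024, 2.31010, 2.31028],
}


def method_deviations(groups):
    """Return the pooled mean and, per method, (method mean, |pooled - method mean|)."""
    # Assumption: the overall average pools every individual weighing.
    pooled = fmean(w for weighings in groups.values() for w in weighings)
    table = {
        name: (fmean(weighings), abs(pooled - fmean(weighings)))
        for name, weighings in groups.items()
    }
    return pooled, table


pooled_mean, deviations = method_deviations(atmospheric)
print(f"overall average = {pooled_mean:.6f}")
for name, (mean, dev) in deviations.items():
    print(f"{name:30s} mean = {mean:.5f}  |deviation| = {dev:.6f}")
```

Because the method means are not rounded before subtraction, the printed deviations can differ from those in (1) in the last digit or two.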
The same conclusion was reached by Rayleigh when comparing the means from the
four different ways of producing chemical nitrogen (NO: nitric oxide; N2O: nitrous
oxide; NH4NO2: ammonium nitrite):
from NO by hot iron (1893):
$\bar{y}_1 = \frac{2.30143 + 2.29890 + 2.29816 + 2.30182}{4} = 2.30008$,
from N2O by hot iron (1893):
$\bar{y}_2 = \frac{2.29869 + 2.29940}{2} = 2.29904$,
from NH4NO2 (1894):
$\bar{y}_3 = \frac{2.29849 + 2.29889}{2} = 2.29869$,
from N2O by hot iron, electrified (1894):
$\bar{y}_4 = \frac{2.30074 + 2.30054}{2} = 2.30064$.
The overall average was $\bar{y} = 2.299706$, and the small absolute deviations:
$|\bar{y}-\bar{y}_1| = .000374$, $|\bar{y}-\bar{y}_2| = .000666$, $|\bar{y}-\bar{y}_3| = .001016$, $|\bar{y}-\bar{y}_4| = .000934$ (2)
indicated that the nitrogen produced by each method was not qualitatively different.
The statistical analysis performed on this data by Lord Rayleigh was rudimentary
(by today’s standards), based primarily on ‘averaging’ and comparing the differences
of these averages. Averaging was a well-known method to enhance the reliability of
estimating the magnitude of a ‘true’ quantity when the observed data differ from it
by non-systematic random effects, going back to Gauss in 1809. It was known at
the time that the sample mean ($\bar{Y} = \frac{1}{n}\sum_{k=1}^{n} Y_k$) provides a more precise estimate of
the ‘true’ mean ($\mu$) than any one individual observation ($Y_k$) because the variance of
the sample mean is $n$ times smaller, since $Var(\bar{Y}) = \frac{Var(Y_k)}{n}$; the larger the number of
observations $n$, the higher the precision. This, and much else in his effort to probe the
different ways his inferences could be in error, will be discussed more extensively in
section 3.
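The precision claim can be made explicit with a short derivation. The sketch below assumes, beyond what the text states, that the weighings $Y_1, \ldots, Y_n$ are uncorrelated with a common mean $\mu$ and a common variance $\sigma^2$ (the symbol $\sigma^2$ is introduced here purely for illustration):

```latex
\begin{aligned}
E(\bar{Y}) &= \frac{1}{n}\sum_{k=1}^{n} E(Y_k) = \mu, \\
Var(\bar{Y}) &= \frac{1}{n^{2}}\sum_{k=1}^{n} Var(Y_k)
             = \frac{n\,\sigma^{2}}{n^{2}} = \frac{\sigma^{2}}{n}.
\end{aligned}
```

Under these assumptions the standard deviation of an average of four weighings is half that of a single weighing. If the errors were correlated or contained a systematic component, averaging would not deliver this gain, which is why the validity of the premises matters.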
Having established the homogeneity of the two different broad ways of producing
nitrogen using the atmospheric process as well as the chemical process, Rayleigh
proceeded to compare the overall average weight of the nitrogen produced by the two
processes, $\bar{x}$ and $\bar{y}$, respectively. The difference between the two overall means:
$(\bar{x} - \bar{y}) = 0.010515$, (3)
was more than 67 times the average difference within the atmospheric nitrogen in (1)
and about 14 times that within the chemical nitrogen in (2). His reasoning has
a strong family resemblance to the Analysis of Variance (ANOVA) comparison based
on within and between groups variation. On this basis Rayleigh concluded that:
“The difference is about 11 milligrams, or about one-half per cent.;
and it was sufficient to prove conclusively that the two kinds of nitrogen
— the chemically derived nitrogen and the atmospheric nitrogen — differed
in weight, therefore, of course, in quality, for some reason hitherto
unknown.” (Rayleigh, 1895, p. 702) [emphasis added].
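Rayleigh's comparison of the gap between the two kinds of nitrogen with the spread among methods within each kind can be reproduced from the weighings listed earlier. This is a hedged sketch: the helper names are hypothetical, the overall averages are assumed to pool all ten weighings in each category, and 'average difference' is read as the mean absolute deviation of the method means from the overall average.

```python
from statistics import fmean

# Weighings grouped by production method, as listed in the text.
atmospheric = {
    "hot iron": [2.31017, 2.30986, 2.31010, 2.31001],
    "copper": [2.31026],
    "hot iron, electrified": [2.31163, 2.30956],
    "ferrous hydrate": [2.31024, 2.31010, 2.31028],
}
chemical = {
    "NO by hot iron": [2.30143, 2.29890, 2.29816, 2.30182],
    "N2O by hot iron": [2.29869, 2.29940],
    "NH4NO2": [2.29849, 2.29889],
    "N2O by hot iron, electrified": [2.30074, 2.30054],
}


def overall_mean(groups):
    """Mean of all individual weighings in a category (assumed pooling rule)."""
    return fmean(w for weighings in groups.values() for w in weighings)


def mean_abs_method_deviation(groups):
    """Average absolute deviation of the method means from the overall mean."""
    centre = overall_mean(groups)
    return fmean(abs(centre - fmean(weighings)) for weighings in groups.values())


between = overall_mean(atmospheric) - overall_mean(chemical)
ratio_atmospheric = between / mean_abs_method_deviation(atmospheric)
ratio_chemical = between / mean_abs_method_deviation(chemical)

print(f"between-category difference = {between:.6f}")
print(f"ratio to within-atmospheric spread = {ratio_atmospheric:.1f}")
print(f"ratio to within-chemical spread    = {ratio_chemical:.1f}")
```

This is only the informal between/within comparison; a formal ANOVA would use sums of squares and an F statistic rather than ratios of absolute deviations.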
The ‘atmospheric nitrogen’ had a density of 1.2572 grams per liter, whereas the
‘chemical nitrogen’ had a density of 1.2511 grams per liter. How reliable was this
result?
“I need not spend time in explaining the various precautions that were
necessary in order to establish surely that conclusion. One had to be on
one’s guard against impurities, especially against the presence of hydrogen,
which might seriously lighten any gas in which it was contained.”
(Rayleigh, 1895, p. 702) [emphasis added].
2.4 ‘Reasoning from error’ to explain the discrepancy
Rayleigh (1895) describes numerous precautions and experimental controls in establishing
the robustness of these results, including storing the nitrogen for eight months
and reweighing it in order to ensure that the discrepancy persisted. He concluded
that the difference $(\bar{x} - \bar{y}) = 0.010515$ is not an artifact because it was well outside
the magnitude of his experimental error. Again, Rayleigh’s analysis uses reasoning
which is analogous to a modern t-test assessing the difference between two means;
see section 3. He published his results in an 1894 paper entitled “On an Anomaly
Encountered in Determinations of the Density of Nitrogen Gas”. He was so concerned
with this empirical discrepancy that he asked for help from other scientists, especially
chemists:
“In order, if possible, to get further light upon a discrepancy which puzzled
me very much, and which, at the time, I regarded only with disgust and
impatience, I published a letter in Nature inviting criticisms from chemists
who might be interested in such questions. I obtained various useful suggestions,
but none going to the root of the matter.” (Rayleigh, 1895, p. 702)
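The analogy with a two-sample t-test noted above can be illustrated numerically. The sketch below is a modern reconstruction, not Rayleigh's calculation: it treats the ten atmospheric and ten chemical weighings as two independent samples (ignoring the production method), and it uses Welch's unequal-variance form of the statistic, a choice made here for illustration.

```python
from math import sqrt
from statistics import fmean, variance

# All weighings in each category, pooled across production methods.
atmospheric = [2.31017, 2.30986, 2.31010, 2.31001, 2.31026,
               2.31163, 2.30956, 2.31024, 2.31010, 2.31028]
chemical = [2.30143, 2.29890, 2.29816, 2.30182, 2.29869,
            2.29940, 2.29849, 2.29889, 2.30074, 2.30054]


def welch_t(a, b):
    """Welch's t statistic and its approximate degrees of freedom."""
    # Estimated variance of each sample mean.
    va = variance(a) / len(a)
    vb = variance(b) / len(b)
    t = (fmean(a) - fmean(b)) / sqrt(va + vb)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


t_stat, df = welch_t(atmospheric, chemical)
print(f"difference of means = {fmean(atmospheric) - fmean(chemical):.6f}")
print(f"Welch t = {t_stat:.1f} on about {df:.0f} degrees of freedom")
```

A statistic of this size lies far beyond any conventional threshold, which matches Rayleigh's informal judgment that the difference was well outside his experimental error, provided the independence assumption is adequate.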
On the basis of his own evidence, Rayleigh reached a preliminary conclusion:
“Upon the assumption that similar gas should be obtained by both methods,
we may explain the discrepancy by supposing either that:
h1 : the atmospheric nitrogen was too heavy on account of imperfect removal
of oxygen, or that
h2 : the ammonia was too light on account of contamination with gases lighter
than pure nitrogen,” (Rayleigh, 1894, p. 340) [h1 and h2 inserted]
Rayleigh went on to investigate both of these possible explanations thoroughly by
designing new experiments. Using the results of these experiments, in conjunction
with established substantive knowledge in both physics and chemistry, he was able
to eliminate both explanations.
He eliminated h1 both on experimental as well as theoretical grounds. New experiments
introduced oxygen which was then burned, ensuring that no oxygen was left.
Using the similarity of the densities of oxygen and nitrogen, he argued that the observed
discrepancy would have required that the nitrogen should have contained 1/30 of
its volume of oxygen, or 1/6 of that present in air, which was impossible on theoretical
grounds! (ibid., 1894, p. 341)
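The orders of magnitude in this argument can be checked with a simple mixing rule. The following sketch rests on assumptions not in the original: ideal mixing by volume, modern molar masses for N2 and O2 as a stand-in for the density ratio, an oxygen share of air of about 21 percent, and the relative excess taken from the overall averages reported earlier. The function name is hypothetical.

```python
# Assumed inputs (modern values, not from the source).
MOLAR_MASS_N2 = 28.014
MOLAR_MASS_O2 = 31.998
OXYGEN_SHARE_OF_AIR = 0.21

# Relative excess weight of atmospheric over chemical nitrogen,
# from the overall averages reported in the text.
relative_excess = 0.010515 / 2.299706


def contaminant_volume_fraction(excess, density_ratio):
    """Volume fraction v of a heavier gas that makes a mixture (1 + excess)
    times as dense as the pure gas, from v * (density_ratio - 1) = excess."""
    return excess / (density_ratio - 1.0)


v_oxygen = contaminant_volume_fraction(relative_excess, MOLAR_MASS_O2 / MOLAR_MASS_N2)
share_of_air_oxygen = v_oxygen / OXYGEN_SHARE_OF_AIR

print(f"oxygen needed: about 1/{1 / v_oxygen:.0f} of the nitrogen volume")
print(f"that is about 1/{1 / share_of_air_oxygen:.1f} of the oxygen present in air")
```

Under these assumptions the required contamination comes out close to the 1/30 and 1/6 quoted above, far more residual oxygen than the purification methods could plausibly have left behind.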
Using similarly convincing arguments he eliminated the possible explanation h2:
“Of the possible impurities lighter than nitrogen, those most demanding
consideration are hydrogen, ammonia, and water vapour. The last may be
dismissed at once, and the absence of ammonia is almost equally certain.
The question of hydrogen appears the most important. But this gas, and
hydrocarbons, such as CH4, could they be present, should be burnt by the
copper oxide; and the experiments already referred to, in which hydrogen
was purposefully introduced into atmospheric nitrogen, seem to prove
conclusively that the burning would really take place.” (ibid., p. 342)
Rayleigh went on to consider and eliminate other possible explanations that might
have given rise to h2, such as:
h3 : atmospheric nitrogen might contain three-atom molecules, a heavier
allotropic form of nitrogen, say, N3.
It was known at the time that ozone (O3) can be created by electrifying oxygen
using ‘the action of an electric discharge’:
“Further experiments were tried upon the action of the silent electric
discharge — both upon the atmospheric nitrogen and upon the chemically
derived nitrogen — but neither of them seemed to be sensibly affected by
such treatment” (Rayleigh, 1895, p. 703)
Reflecting on the hypotheses h1 and h2 we can see that they are not exhaustive.
To render them exhaustive one needs to redefine them as follows:
h1' : the atmospheric nitrogen was too heavy on account of a heavier gas, and
h2' : the chemical nitrogen was too light on account of a lighter gas.
The obvious culprits for a heavier and lighter gas were oxygen and hydrogen,
respectively. Both were eliminated, however, by designing experiments where the
atmospheric and chemical nitrogen were purposely mixed with additional oxygen and
hydrogen, respectively, and removing these contaminations using the various methods
originally used to produce the nitrogen. Rayleigh proved that these methods were
very effective in removing these two contaminants. His arguments constitute excellent
examples of dealing with Duhem’s problem by eliminating potential explanations.
At the end of a long trial and error process, Rayleigh’s primary conclusion was:
“... altogether, the balance of evidence seemed to incline against the
hypothesis of abnormal lightness in the chemically derived nitrogen being
due to dissociation, and to suggest strongly, as almost the only possible
alternative, that there must be in atmospheric nitrogen some constituent
heavier than true nitrogen.” (Rayleigh, 1895, p. 703) [emphasis added]
That is, Rayleigh revised his original hypothesis h1 to h1'. His intuition that
the atmospheric nitrogen may contain a heavier gas led him to seek advice from
a famous chemist, Sir William Ramsay, and that initiated an intensive collaborative
research effort that probed exhaustively the hypotheses h1' and h2'. They first repeated
Rayleigh’s experiments with additional controls in order to eliminate any form of
impurity as the source of the discrepancy (Rayleigh and Ramsay, 1895, pp. 191-2).
2.5 The discovery of Argon
On account of their training, Rayleigh focused primarily on the physical aspects and
Ramsay on the chemical dimensions of this empirical discrepancy puzzle. As an
experienced chemist Ramsay recalled the older experiments by Cavendish in 1785
who demonstrated that after both the oxygen and nitrogen were extracted from atmospheric
air there was still a small residue: “... That residue amounted to about
1/120 part of the nitrogen taken” (Rayleigh, 1895, p. 704).
After employing a thorough ‘reasoning from error’, Rayleigh and Ramsay eliminated
h2', and began a joint intense effort to probe h1' by isolating and identifying the
residual gas noted earlier by Cavendish:
“The simplest explanation in many respects was to admit the existence
of a second ingredient in air from which oxygen, moisture, and carbonic
anhydride had already been removed. The proportional amount required
was not great. If the density of the supposed gas were double that of
nitrogen, one-half per cent. only by volume would be needed; or, if the
density were but half as much again as that of nitrogen, then one per cent.
would still suffice. But in accepting this explanation, even provisionally,
we had to face the improbability that a gas surrounding us on all sides,
and present in enormous quantities, could have remained so long
unsuspected.” (Rayleigh and Ramsay, 1895, p. 192)
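The proportions quoted here follow from a simple mixing rule. The sketch below assumes ideal mixing by volume and uses symbols that are not in the original: $\rho_N$ for the density of pure nitrogen, $r$ for the ratio of the unknown gas's density to $\rho_N$, $v$ for its volume fraction, and $\delta \approx 0.005$ for the relative excess density ('about one-half per cent').

```latex
\begin{aligned}
\rho_{mix} &= (1 - v)\,\rho_N + v\,r\,\rho_N = \rho_N\,[\,1 + v\,(r - 1)\,]
  \quad\Longrightarrow\quad v = \frac{\delta}{r - 1}, \\
r = 2:\quad   & v = \frac{0.005}{1}   = 0.005 \ \ (\text{one-half per cent}), \\
r = 1.5:\quad & v = \frac{0.005}{0.5} = 0.01  \ \ (\text{one per cent}).
\end{aligned}
```

The smaller the density ratio $r$, the larger the volume of the unknown gas needed to account for the same excess weight.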
Ramsay devised an experiment with much smaller experimental error than that
of Cavendish to remove all the nitrogen from his sample of ‘atmospheric nitrogen’
by passing it repeatedly over heated magnesium, with which nitrogen reacts to form
magnesium nitride, a solid. As a result of this experiment he was left with a residual of 1%
of the original volume which would not react. The same experiment applied
to ‘chemical nitrogen’ left no residual! In their own words:
“... the conclusion seems inevitable that "atmospheric nitrogen" is a
mixture and not a simple body.” (ibid., p. 208)
When they measured the density of the residual in the ‘atmospheric nitrogen’ they
discovered that it was denser than nitrogen. In order to eliminate even the remote
possibility that the unidentified gas was the result of the process used to isolate
it, Ramsay and Rayleigh devised yet another experiment where atmospheric nitrogen
was passed through a long sequence of porous clay pipes surrounded by a vacuum.
The portion of nitrogen diffused through the clay was less dense than the remainder,
proving that the heavier gas must have been in the original air. The last alternative
explanation they wanted to eliminate again was the possibility that the residual gas
was a heavier allotropic form of nitrogen N3. This possibility was championed by two
famous chemists, James Dewar (professor of chemistry in the Royal Institution) and
Dmitri Mendeleev (a professor of chemistry at St. Petersburg University), in their
attempt to save the periodic table as it was shaped at the time; see Gordin (2004),
p. 211.
The most crucial test to eliminate allotropic nitrogen as the heavier gas was the
spectral analysis applied by Ramsay to the residual which established unequivocally
that it was a hitherto unidentified gas. It should be noted that the spectral analysis
was, at the time, the best test to identify the nature of any chemical element. Hence,
using reasoning from error they were able to demonstrate experimentally that the discrepancy
is exclusively due to a hitherto unknown inert gas found in the atmosphere,
which they called ‘argon’ (the Greek word for ‘inactive’) because of its chemical inertness.
It is now known that argon has atomic number 18 (the total number of
protons in its atomic nucleus) and atomic weight 39.948, and that it constitutes 1.3 percent
of the atmospheric air by weight and 0.94 percent by volume; see Emsley (2001).
Having established the presence of argon in atmospheric nitrogen they went further
and argued that:
“If the newly discovered gas were not in the atmosphere, the discrepancies
in the density of "chemical" and "atmospheric" nitrogen would remain
unexplained.” (p. 235)
In Mayo’s (1996) terminology, they reached this conclusion by ensuring that h1'
passed a severe test in the sense that it “has withstood a scrutiny that it would very
likely have failed, were it not correct.” (ibid., p. 30).
2.6 Growth of experimental knowledge
Convincing the scientific establishment that they had identified a new element that could
not even fit into Mendeleev’s periodic table, as it stood at the time, was not an easy
task. To achieve that they embarked on a sequence of new improved experiments using
additional controls in an attempt to eliminate as many impurities from the original
data as possible. The new experiments of Rayleigh and Ramsay (1895) yielded more
precise data which enabled them to enhance the discrepancy to:
$(\bar{x} - \bar{y}) = 0.011167$.
The monatomic nature of the new substance was established by measurements
of constant-pressure and constant-volume heat capacities, but its existence remained
controversial because it did not fit into Mendeleev’s periodic table. Eventually they
were able to provide conclusive physical evidence that the residual was indeed a
new element, argon, and determine its density: 1.7824 grams per liter, compared to
nitrogen’s density of 1.2506 grams per liter. The additional probing enabled Rayleigh
and Ramsay to ensure that the claim:
h4 : the atmospheric nitrogen was too heavy because it contained
a heavier gas from atmospheric air, identified as argon,
also passed a severe test; see Mayo (1996).
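As a rough consistency check on this claim, one can ask what share of argon the reported densities imply. The sketch below is illustrative only: it assumes ideal mixing by volume, treats 'atmospheric nitrogen' as a two-gas mixture of pure nitrogen and argon, and combines density figures quoted at different points in the text, which may stem from different measurements. The function name is hypothetical.

```python
# Densities in grams per liter, as quoted in the text.
RHO_ATMOSPHERIC_NITROGEN = 1.2572
RHO_PURE_NITROGEN = 1.2506
RHO_ARGON = 1.7824


def heavier_gas_fraction(rho_mix, rho_light, rho_heavy):
    """Volume fraction v of the heavier gas in a two-gas mixture,
    from rho_mix = (1 - v) * rho_light + v * rho_heavy."""
    return (rho_mix - rho_light) / (rho_heavy - rho_light)


argon_share = heavier_gas_fraction(
    RHO_ATMOSPHERIC_NITROGEN, RHO_PURE_NITROGEN, RHO_ARGON
)
print(f"implied argon share of 'atmospheric nitrogen': {100 * argon_share:.2f}% by volume")
```

Under these assumptions the implied share is a little over one percent, the same order as the roughly 1% residual left after the magnesium experiment.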
Rayleigh and Ramsay made a preliminary announcement of their discovery of
argon at the Oxford meeting of the British Association in August of 1894. Predictably,
their claim was met with skepticism bordering on hostility; it was not the first time,
and it will not be the last, that someone has erroneously claimed the discovery of a
new chemical element. This particular discovery was even more problematic since:
(i) it did not fit into the periodic table,
(ii) it raised the possibility of a zero-valency group of elements, and
(iii) its inertness made it impossible to determine its atomic weight by chemical
means; the chemists of that period distrusted physical evidence concerning atomic
weights. Dewar and Mendeleev were convinced that argon was in fact triadic nitrogen
N3; Dewar registered his dissension by not attending the formal presentation of the
results at the Royal Society in January of 1895. Rayleigh must have been greatly
frustrated by these incessant and unjustified (in his mind) criticisms to such an extent
that he would soon return to his research in physics excusing his move by stating that
in that field: "second rate men seem to know their place" (see Brock, 1992, p. 336).
The hostility against their claim dissipated only when the other noble gases were
discovered and Mendeleev’s periodic table was reshaped by the early 1900s. Even
Mendeleev, who argued vehemently against the claim, changed his mind:
“In 1903, he considered Ramsay’s findings "some of the most brilliant
experimental discoveries of the end of the 19th century," and admitted
that his early hypothesis of triatomic nitrogen was incorrect. What
changed his mind? Mendeleev cited five pieces of evidence that swayed
him: the finding that argon’s density was just barely greater than 19,
while N3 would have been around 21; Ramsay’s discovery of helium
in 1895, which also displayed chemical inertness; the later discoveries
of the other inert gases neon and krypton; the uniqueness of their spectra;
and Ramsay’s proof of the constancy of chemical features when correlated
with density.” (Gordin, 2004, p. 211)
Ramsay (1899) went on to use the established densities, as well as the atomic
weights of nitrogen and argon, to determine the ‘theoretical’ value of the discrepancy
to be $\gamma^{*} = 0.01186$; note that the estimated discrepancy $\hat{\gamma} = 0.011167$ was reasonably accurate.
Ramsay (1915) provides a retrospective detailed account of the quest that led
to the discovery of argon.
Rayleigh and Ramsay arrived at the further conclusion that if the periodic law
and the discovery of a new inert elemental gas were both correct, then there must
be a whole family of such elements, rendering Mendeleev’s periodic table incomplete;
see Freund (1968). Ramsay’s subsequent work confirmed that conclusion by isolating
helium first and then discovering neon, krypton, and xenon by the end of the century;
adding a completely new column to the periodic table. Hence, the discovery of argon
was more than just isolating yet another element. It was instrumental in giving rise
to major revisions of the experimental knowledge of the 1870s.
Argon’s discovery began the process of questioning the basic pillars of how matter
was understood at the time:
(i) integrability (each atom was integral: it had no substructure),
(ii) immutability (each specific element had fixed mass and could
not be transformed into another), and
(iii) valency (a numerical charge that determined how a given atom
would combine with others).
The discovery of the other noble gases, followed immediately by the discovery of
radioactivity by Pierre and Marie Curie in 1898, and of the electron (a subelement of
the atom) by J. J. Thomson in 1897, completely undermined the three basic pillars
of knowledge on matter as it stood at the end of the 19th century; see Gordin (2004),
p. 209. Argon called into question the valency of matter by initiating the zero-valency
group. It also created a major problem with the periodic table because its atomic
weight of 39.948 was bigger than that of potassium, 39.098, but their atomic numbers
turned out to be 18 and 19, respectively. The periodic table was recast by Moseley
in 1914 in terms of atomic numbers, replacing Mendeleev’s periodic table in terms
of atomic weights. It turned out that the atomic number not only identifies the
chemical properties of an element, but also facilitates the description of other aspects
of atoms and nuclei. These early developments, which began with the discovery of
argon, had an enormous impact on both chemistry and nuclear physics over the next
several decades of the 20th century; see Brock (1992).
3 Data analysis and the Error-Statistical approach
3.1 A brief summary of the error-statistical approach
The term Error-Statistical approach was coined by Mayo (1996) to denote a modification/extension
of the framework for frequentist inductive inference, usually associated
with Fisher, Neyman and Pearson. The modifications/extensions come primarily
in the form of:
(i) Emphasizing the learning from data (about the phenomenon of interest)
objective of empirical modeling.
(ii) Paying due attention to the validity of the premises of induction via statistical
adequacy, using thorough misspecification testing and respecification.
(iii) Emphasizing the central role of error probabilities in assessing the reliability
of inference, both pre-data as well as post-data.
(iv) Supplementing the original framework with a post-data assessment of
inference in the form of severity evaluations.
(v) Bridging the gap between theory and data using a sequence of interconnected
models, theory (primary), structural (experimental), statistical (data) built on
two different, but related, sources of information: substantive subject matter
and statistical information (chance regularity patterns); see Spanos (1999).
(vi) Advancing thorough probing of the different ways an inductive inference
might be in error, by localizing the error probe in the context of the different
models in (v); see Mayo (1996), Spanos (2006).
The primary objective of this section is to revisit Rayleigh’s data analysis and
inference, using the error-statistical perspective. The discussion in section 2 above,
describing the sequence of experiments by both Rayleigh and Ramsay and their ‘learning
from error’ process, constitutes an excellent example of probing for error as described
in (vi) at the level of an experimental model. Some of the relevant errors in
their probing were quantitative and statistical, but most of them were not. This is
important to bring out because probing for error using severe testing reasoning does
not have to be quantitative; see Mayo (1996).
Having said that, this section of the paper focuses on formalizing the ‘learning from
error’ process that is based on quantitative assessments of the experimental data
using (i)-(v). The formalization is interesting because it demonstrates the capacity
of the error-statistical approach to capture and systematize the ‘learning from error’
experience of scientists in the ‘trenches’. It is important to emphasize that this does
not amount to a reconstruction of the form advocated by Bayesians; see Howson and
Urbach (1993).
3.2 Embedding the material experiment into a statistical
model
The statistical model that one can use to embed Rayleigh’s material experiment in
its context is a modification of the simple Normal model, where $\{Z_k,\ k\in\mathbb{N}\}$
is assumed to be a Normal, Independent and Identically Distributed (NIID) process
with mean $\mu$ and variance $\sigma^2$:
$$M_0:\ Z_k \sim \mathrm{NIID}(\mu, \sigma^2),\quad k\in\mathbb{N}.$$
The modification consists in allowing for the possibility that the data
constitute two heterogeneous subsets with different means. That is,
$$z := (z_1, z_2, \ldots, z_n) = (x_1, x_2, \ldots, x_{n_1}, y_1, y_2, \ldots, y_{n_2}),$$
where $n = n_1 + n_2$, constitute a realization of the same NI process but with two
different means (see Cox and Hinkley, 1974), i.e. the two means simple Normal
model:
$$M_1:\ X_k \sim \mathrm{NIID}(\mu_1, \sigma^2),\quad Y_k \sim \mathrm{NIID}(\mu_2, \sigma^2),\quad k\in\mathbb{N}. \qquad (4)$$
Viewed in the context of M1, Rayleigh’s substantive (primary) hypothesis can be
recast in the form of the difference between the two means:
H0 : μ := (μ1 − μ2) = 0, vs. H1 : μ := (μ1 − μ2) > 0. (5)
The alternative hypothesis is one-sided because there was substantive information
ensuring that μ1 > μ2, rendering the subset μ1 < μ2 of the parameter space irrelevant.
Using Neyman-Pearson (N-P) hypothesis testing we can test (5) using (Cox and
Hinkley, 1974):
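$$\tau(Z) = \frac{\bar{X}-\bar{Y}}{\sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}}} \overset{H_0}{\sim} \mathrm{St}(\nu), \qquad C_1(\alpha) := \{z : \tau(z) > c_\alpha\}, \qquad (6)$$
$$\bar{X} = \frac{1}{n_1}\sum_{j=1}^{n_1} X_j, \quad \bar{Y} = \frac{1}{n_2}\sum_{j=1}^{n_2} Y_j, \quad s_1^2 = \frac{\sum_{j=1}^{n_1}(X_j-\bar{X})^2}{n_1-1}, \quad s_2^2 = \frac{\sum_{j=1}^{n_2}(Y_j-\bar{Y})^2}{n_2-1},$$
and ‘$\overset{H_0}{\sim} \mathrm{St}(\nu)$’ reads ‘distributed under the null as Student’s t with $\nu=(n_1+n_2-2)$
degrees of freedom’. For α = .05, cα = 1.734 (α = .025, cα = 2.101).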
3.3 Establishing a statistical discrepancy
Table 1 - Rayleigh data (1894)
Obs.   Atmospheric (x) in g   Chemical (y) in g
 1.    2.31017                2.30143
 2.    2.30986                2.29890
 3.    2.31010                2.29816
 4.    2.31001                2.30182
 5.    2.31024                2.29869
 6.    2.31010                2.29940
 7.    2.31028                2.29849
 8.    2.31163                2.29889
 9.    2.30956                2.30074
10.    2.31026                2.30054
Rayleigh’s data z0 in table 1 yield the following estimates:
$$\bar{x}=2.310221, \quad \bar{y}=2.299706, \quad s_1^2=.000000292699, \quad s_2^2=.000001724, \quad n_1=n_2=10, \qquad (7)$$
which give rise to an observed test statistic:
$$\tau(z_0) = \frac{\bar{x}-\bar{y}}{\sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}}} = \frac{2.310221-2.299706}{\sqrt{\frac{.000000292699}{10}+\frac{.000001724}{10}}} = 23.415\,[.0000000], \qquad (8)$$
where the number in square brackets denotes the p-value. The tiny p-value leads
to a rejection of the null of no discrepancy and, in conjunction with the small sample
size, indicates that z0 provides strong evidence against the null; see Mayo and Spanos
(2006).
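As a check on (7)-(8), the estimates and the observed statistic can be recomputed directly from table 1. The following is a minimal Python sketch and not part of the original analysis; the function name `tau_statistic` is introduced here for illustration only, and the critical value is the one quoted after (6).

```python
import math
import statistics

# Table 1 (Rayleigh, 1894): weights in grams
x = [2.31017, 2.30986, 2.31010, 2.31001, 2.31024,
     2.31010, 2.31028, 2.31163, 2.30956, 2.31026]   # atmospheric nitrogen
y = [2.30143, 2.29890, 2.29816, 2.30182, 2.29869,
     2.29940, 2.29849, 2.29889, 2.30074, 2.30054]   # chemical nitrogen


def tau_statistic(x, y):
    """Test statistic in (6): mean difference over its estimated standard error."""
    n1, n2 = len(x), len(y)
    s1_sq = statistics.variance(x)   # sample variance with divisor (n1 - 1)
    s2_sq = statistics.variance(y)   # sample variance with divisor (n2 - 1)
    se = math.sqrt(s1_sq / n1 + s2_sq / n2)
    return (statistics.mean(x) - statistics.mean(y)) / se


tau = tau_statistic(x, y)
c_alpha = 1.734   # critical value for alpha = .05 quoted in the text
print(f"tau(z0) = {tau:.3f}, reject H0: {tau > c_alpha}")
```

The p-value is not recomputed here; the sketch only confirms that the observed statistic lies far beyond the quoted critical value.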
Rayleigh’s intuition that the detected discrepancy was not an artifact is confirmed
by this formal test, but the issue is whether one can appraise the size of the discrepancy
μ = (μ1 − μ2) > 0 warranted by his data z0, that is, go beyond the established
statistical significance to the substantive significance. Traditional N-P hypothesis
testing does not allow one to establish the presence of a substantive discrepancy because
it is too coarse, in the sense that it concludes with the ‘accept/reject H0’ decision.
One way to establish the presence or absence of a substantive discrepancy is to
use the severity evaluation proposed by Mayo (1996).
3.4 Establishing a substantive discrepancy
In the context of Mayo’s (1996) error-statistical account, a test T provides evidence for
a hypothesis H only to the extent that the data x0, not only accord with hypothesis
H, but, in addition, such a good result would have been highly unlikely if H were
false. This requires the evaluation of the test’s probativeness with respect to the ways
a hypothesis H might be false. If a test T has very high probativeness to detect all
departures from H, when the test is applied, and no departure from H is detected,
then we are justified in concluding that there is no evidence that any of the probed
departures are present. Moreover, the high probativeness of the test allows us to go
a step further and conclude that the absence of departures provides evidence that
H is true — the higher the probativeness to detect departures from H the stronger
the evidence for H. That is, we have evidence for H only to the extent that the
hypothesis withstood a severe probe of the ways it could be false and survived; see
Mayo and Spanos (2006).
In the case of Neyman-Pearson testing where H0 is rejected, the severe testing
reasoning can be used to establish the maximum size of the discrepancy from H0
warranted by data z0. In this case one can argue that, by rejecting the null, one
‘passes’ the claim μ = (μ1 − μ2) > γ, for some γ ≥ 0. The idea underlying the
severity assessment is to evaluate the probativeness of test Tα in detecting departures
from the claim μ > γ, i.e. departures in the direction μ ≤ γ, under the scenario
that μ > γ is false. That is, the relevant severity assessment comes in the form
of evaluating ‘the probability of observing sample realizations z less in accord with
μ > γ than z0, under the scenario that μ > γ is false, or equivalently, μ ≤ γ is true’.
More formally (see Mayo and Spanos, 2006):
$$\mathrm{Sev}(\tau(z_0),\ \mu > \gamma) = P(\tau(Z) \le \tau(z_0);\ \mu > \gamma \text{ is false}) = P(\tau(Z) \le \tau(z_0);\ \mu \le \gamma \text{ is true}).$$
Evaluation of severity for different values of γ, in the case of test Tα, takes the form:
$$\mathrm{Sev}(23.415,\ \mu > \gamma) = P\left(\tau(Z) \le 23.415 - \gamma\left(\sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}}\right)^{-1}\right),$$
giving rise to the numerical results shown below:
Relevant inference: (μ1 − μ2) > γ
γ                     .008    .009    .01     .0105   .011    .012
Sev(23.415, μ > γ)    1.000   0.998   0.867   0.513   0.147   0.002
(9)
On the basis of the above severity evaluations, one can deduce, with high severity,
that the maximum discrepancy licensed by data z0 is in the range [.009, .01).
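The entries in (9) can be reproduced numerically. The sketch below assumes that the probability in the severity formula is evaluated with Student's t with ν = 18 degrees of freedom, as in (6); the helper names `student_t_cdf` and `severity` are illustrative and not taken from the paper, and the t distribution function is obtained by Simpson's rule only to keep the example free of external libraries.

```python
import math
import statistics


def student_t_cdf(t, nu, steps=4000):
    """P(T <= t) for Student's t with nu degrees of freedom (Simpson's rule)."""
    const = math.gamma((nu + 1) / 2) / (math.sqrt(nu * math.pi) * math.gamma(nu / 2))

    def pdf(u):
        return const * (1 + u * u / nu) ** (-(nu + 1) / 2)

    upper = abs(t)
    h = upper / steps                      # steps must be even for Simpson's rule
    total = pdf(0.0) + pdf(upper)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    area = total * h / 3                   # integral of the density over [0, |t|]
    return 0.5 + area if t >= 0 else 0.5 - area


# Table 1 (Rayleigh, 1894): weights in grams
x = [2.31017, 2.30986, 2.31010, 2.31001, 2.31024,
     2.31010, 2.31028, 2.31163, 2.30956, 2.31026]   # atmospheric nitrogen
y = [2.30143, 2.29890, 2.29816, 2.30182, 2.29869,
     2.29940, 2.29849, 2.29889, 2.30074, 2.30054]   # chemical nitrogen

n1, n2 = len(x), len(y)
se = math.sqrt(statistics.variance(x) / n1 + statistics.variance(y) / n2)
tau = (statistics.mean(x) - statistics.mean(y)) / se
nu = n1 + n2 - 2


def severity(gamma):
    """Sev(tau(z0), mu > gamma) = P(T_nu <= tau(z0) - gamma / se)."""
    return student_t_cdf(tau - gamma / se, nu)


for gamma in (0.008, 0.009, 0.01, 0.0105, 0.011, 0.012):
    print(f"gamma = {gamma:<6}  severity = {severity(gamma):.3f}")
```

Severity falls steeply between γ = .01 and γ = .011, which is the pattern the range quoted in the text summarizes.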
3.5 Statistical adequacy: securing the reliability of inference
It is often insufficiently appreciated that the above statistical analysis that gave rise
to the inference of the presence of both a statistical and a substantive discrepancy,
depends crucially on the probabilistic assumptions constituting the underlying statistical
model (the premises of inference); see Mayo and Spanos (2004). When any of
the assumptions of the statistical model M1 are invalid, any inferences based on it are
likely to be unreliable because the nominal error probabilities tend to differ from the
actual error probabilities (Mayo, 1996). An important facet of the error-statistical
account is the statistical adequacy of the premises (see (ii) in section 3.1).
The model assumptions underlying M1 are:
[i] Normality: $X_k \sim N(\cdot)$, $Y_k \sim N(\cdot)$, $k\in\mathbb{N}=\{1, 2, \ldots\}$,
[ii] Independence over k,
[iii] Different means, $E(X_k) = \mu_1$, $E(Y_k) = \mu_2$, but constant over k,
[iv] Same variance, $Var(X_k) = Var(Y_k) = \sigma^2$, constant over k.
How does one assess the appropriateness of these assumptions? In practice the
assessment takes the form of an informal checking using data plots, as well as a formal
testing using misspecification tests.
3.5.1 Misspecification Testing (assumptions [i]-[ii])
The k-plots of data x := (x1, x2, ..., xn1) and y := (y1, y2, ..., yn2), shown in fig. 1,
suggest no obvious departures from assumptions [i]-[iii] (μ1 > μ2). The Normality
assumption [i] seems reasonable; the absence of any cycles in the k-plots indicates
no departures from assumption [ii]; the arithmetic average seems constant over the
observation period, suggesting no departures from assumption [iii]; see Spanos (1999),
ch. 5.
[Figure: time series plot of x and y; vertical axis ‘Data’, horizontal axis ‘Index’ (1 to 10)]
Fig. 1: k-plots of (xk, yk), k = 1, ..., n. (10)
A closer comparison of the two plots in fig. 1, however, indicates that there might
be a problem with assumption [iv] because the yk data (dotted line) exhibit more
variation.
The small sample size is an issue for formal misspecification testing, but it’s not
a major handicap.
• Testing Normality using the Shapiro-Wilk test yielded: SW(z) = .879[.121].
• Testing Independence using a runs test yielded: R(z) = 0.000078[.782];
neither test indicates any departures; see Spanos (1986), Mayo and Spanos
(2004).
Assumptions [iii] and [iv] will be formally tested in the next two subsections. Testing
[iii] formally will confirm Rayleigh’s sound scientific intuition when he deduced the
homogeneity of the two types of nitrogen using a simple comparison based on (1)-(2).
3.5.2 Testing the constancy of the two means (assumption [iii])
As mentioned in section 3, before Rayleigh went on to compare the difference $(\bar{x}-\bar{y})=0.010515$
based on the two broad categories of nitrogen, atmospheric and chemical, he performed
an informal homogeneity of the means assessment within each group, and
concluded that each of the two categories separately comprised homogeneous nitrogen.
This was an important assessment for Rayleigh because if nitrogen within each
of the two categories was not homogeneous the comparison of the overall means of
the two groups would have been meaningless.
In modern statistical terminology Rayleigh performed a misspecification test (see
Spanos, 1986, and Mayo and Spanos, 2004) to assess the homogeneity of the means
within each group. That is, he tested:
[iii] $E(X_k) = \mu_1$, $E(Y_k) = \mu_2$, constant over $k\in\mathbb{N}$.
The homogeneity within each of the two groups is assumed by the statistical model,
and thus ensuring its validity is part of securing the reliability of inference; this is testing
outside the boundaries of the statistical model; see Spanos (1999). One way to
formally test assumption [iii] is to use an ANOVA test (see Cox and Hinkley, 1974)
as a misspecification test.
The ANOVA test for m subgroups, with ni observations in the ith subgroup,
takes the general form:
$$F(\mathbf{Y}) = \frac{\sum_{i=1}^{m} n_i(\bar{y}_i-\bar{y})^2/(m-1)}{\sum_{i=1}^{m}(n_i-1)s_i^2/(n-m)} \overset{H_0}{\sim} F(m-1,\ n-m), \qquad C_1 := \{y : F(y) > c_\alpha\}, \qquad (11)$$
where $n=\sum_{i=1}^{m} n_i$ is the total number of observations;
$\bar{y}_i = \frac{1}{n_i}\sum_{j=1}^{n_i} y_{ij}$ is the sample mean of the ith subgroup;
$\bar{y}$ is the overall mean based on all observations;
$s_i^2 = \frac{1}{n_i-1}\sum_{j=1}^{n_i}(y_{ij}-\bar{y}_i)^2$ is the sample variance of the ith subgroup;
‘$\overset{H_0}{\sim} F(k, l)$’ reads ‘under H0 is F distributed with (k, l) degrees of freedom’, and C1
denotes the rejection region for an α-significance level test. The F-test statistic is
the ratio of the variation explained by the mean heterogeneity, $\sum_{i=1}^{m} n_i(\bar{y}_i-\bar{y})^2$, and
the residual variation, $\sum_{i=1}^{m}(n_i-1)s_i^2$.
The ANOVA information for the chemical nitrogen data in table 1 is given in
table 2.
Table 2 - ANOVA for chemically produced nitrogen gas
Source of variation                                               DF   Sum of squares
Explained: $\sum_{i=1}^{m} n_i(\bar{y}_i-\bar{y})^2$              3    .0000004198
Residual: $\sum_{i=1}^{m}(n_i-1)s_i^2$                             6    .0000022143
Total: $\sum_{i=1}^{m}\sum_{j=1}^{n_i}(y_{ij}-\bar{y})^2$          9    .0000026341
where ‘DF’ stands for ‘Degrees of Freedom’. The observed value of the test statistic:
$$F(y_0) = \frac{.0000004198/3}{.0000022143/6} = 0.37917\,[.772],$$
in conjunction with a high p-value (in square brackets), indicates that there is no strong
evidence against the homogeneity assumption. No evidence against H0, combined
with the information that the nitrogen gas produced by these methods is qualitatively
similar, can be interpreted as indicating that all the chemically based methods do
produce homogeneous nitrogen gas.
Similarly, for the atmospherically produced nitrogen gas data (table 1), the ANOVA
information is given in table 3.
Table 3 - ANOVA for atmospherically produced nitrogen gas
Source of variation                                               DF   Sum of squares
Explained: $\sum_{i=1}^{m} n_i(\bar{x}_i-\bar{x})^2$              3    .00000042781
Residual: $\sum_{i=1}^{m}(n_i-1)s_i^2$                             6    .0000022143
Total: $\sum_{i=1}^{m}\sum_{j=1}^{n_i}(x_{ij}-\bar{x})^2$          9    .0000026421
The value of the test statistic:
$$F(x_0) = \frac{.00000042781047/3}{.0000022143/6} = 0.3864\,[.767],$$
and the associated p-value indicate no evidence for departures from the homogeneity
assumption.
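The two F ratios can be checked from the sums of squares reported in tables 2 and 3. The sketch below is illustrative only: the function names are hypothetical, and m = 4 subgroups with n = 10 observations are inferred from the reported degrees of freedom (3 and 6) rather than stated in this section. The subgroup membership of the observations is not given here, so `anova_f` is included only to show how (11) would be computed from grouped data.

```python
import statistics


def f_from_sums_of_squares(ss_explained, ss_residual, m, n):
    """F ratio in (11) from the explained and residual sums of squares."""
    return (ss_explained / (m - 1)) / (ss_residual / (n - m))


def anova_f(groups):
    """F ratio in (11) computed from a list of subgroups (lists of observations)."""
    m = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    explained = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups)
    # Residual variation: squared deviations from each subgroup's own mean,
    # which equals the sum of (n_i - 1) * s_i^2 and also handles singleton groups.
    residual = sum((v - statistics.mean(g)) ** 2 for g in groups for v in g)
    return (explained / (m - 1)) / (residual / (n - m))


# Sums of squares as reported in tables 2 and 3; m and n inferred from DF = 3 and 6.
f_chemical = f_from_sums_of_squares(0.0000004198, 0.0000022143, m=4, n=10)
f_atmospheric = f_from_sums_of_squares(0.00000042781047, 0.0000022143, m=4, n=10)

print(f"F(y0) = {f_chemical:.5f}")
print(f"F(x0) = {f_atmospheric:.4f}")
```

Both ratios are well below one, in line with the high p-values reported in square brackets.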
Hence, the misspecification tests for the homogeneity of the means within each
subgroup (assumption [iii]) revealed no departures for the data in table 1, confirming
Rayleigh’s original inference based on an informal ANOVA test. It is interesting to
note that the numerator of the F-test statistic (11) provides a more sophisticated way
to evaluate the individual differences $(\bar{y} - \bar{y}_i)$, i = 1, 2, ..., m, than the comparison
in (2) used by Rayleigh.
3.5.3 Testing the equality of the two variances (assumption [iv])
Another model assumption which needs to be tested is [iv], the homogeneity of the
variances between the two categories, which can be assessed by testing the hypotheses:
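$$H_0: \sigma_1^2 = \sigma_2^2, \quad \text{vs.} \quad H_1: \sigma_2^2 > \sigma_1^2, \qquad (12)$$
in the context of the ‘encompassing’ statistical model:
$$M_2:\ X_k \sim \mathrm{NIID}(\mu_1, \sigma_1^2),\quad Y_k \sim \mathrm{NIID}(\mu_2, \sigma_2^2),\quad k\in\mathbb{N}. \qquad (13)$$
Note that under H0, model M2 reduces to the original model M1. A well-known test
for (12) is:
$$v(Z) = \frac{s_2^2}{s_1^2} \overset{H_0}{\sim} F(n_2-1,\ n_1-1), \qquad C_1 := \{z : v(z) > c_\alpha\}, \qquad (14)$$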
where F(n2 − 1, n1 − 1) denotes the F-distribution with (n2 − 1, n1 − 1) degrees of
freedom; see Cox and Hinkley (1974). For F(9, 9), α = .05, cα = 3.179, and thus for
the data in table 1:
$$v(z_0) = \frac{s_2^2}{s_1^2} = \frac{.000001724}{.000000292699} = 5.890\,[.007],$$
which, with a p-value of .007 and in conjunction with the small sample size, provides strong evidence
against the null.
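The variance ratio can be recomputed from table 1 with a few lines of Python. This is a minimal sketch; the critical value 3.179 is the one quoted in the text, and the p-value is not recomputed.

```python
import statistics

# Table 1 (Rayleigh, 1894): weights in grams
x = [2.31017, 2.30986, 2.31010, 2.31001, 2.31024,
     2.31010, 2.31028, 2.31163, 2.30956, 2.31026]   # atmospheric nitrogen
y = [2.30143, 2.29890, 2.29816, 2.30182, 2.29869,
     2.29940, 2.29849, 2.29889, 2.30074, 2.30054]   # chemical nitrogen

s1_sq = statistics.variance(x)   # sample variance of the atmospheric data
s2_sq = statistics.variance(y)   # sample variance of the chemical data

v = s2_sq / s1_sq                # observed variance ratio v(z0)
c_alpha = 3.179                  # critical value for alpha = .05 quoted in the text
print(f"v(z0) = {v:.3f}, reject equal variances: {v > c_alpha}")
```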
In view of this rejection and the reasonableness of the other probabilistic assumptions
[i]-[iii] underlying M1, it might seem judicious to respecify the original model
by replacing it with M2. However, it can be shown that using a test of the primary
hypothesis (5) which allows for the heterogeneity of the two variances, such as the
Welch (1938) test, does not affect the inferences in sections 3.3-3.4.
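The text does not spell out the Welch computation. One common form, assumed here purely for illustration, keeps the statistic in (6), which already uses separate variance estimates in its standard error, and replaces ν = n1 + n2 − 2 by the Welch-Satterthwaite approximate degrees of freedom. The variable names below are hypothetical.

```python
import statistics

# Table 1 (Rayleigh, 1894): weights in grams
x = [2.31017, 2.30986, 2.31010, 2.31001, 2.31024,
     2.31010, 2.31028, 2.31163, 2.30956, 2.31026]   # atmospheric nitrogen
y = [2.30143, 2.29890, 2.29816, 2.30182, 2.29869,
     2.29940, 2.29849, 2.29889, 2.30074, 2.30054]   # chemical nitrogen

n1, n2 = len(x), len(y)
a = statistics.variance(x) / n1   # s1^2 / n1
b = statistics.variance(y) / n2   # s2^2 / n2

# Welch-Satterthwaite approximation to the degrees of freedom (an assumption
# about the intended form of the test, not taken from the paper).
welch_df = (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))
pooled_df = n1 + n2 - 2

print(f"Welch df = {welch_df:.2f} versus nu = {pooled_df}")
```

Under this assumption the degrees of freedom drop from 18 to roughly 12, for which the one-sided 5% critical value is about 1.78, so an observed statistic of 23.415 remains far inside the rejection region.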
4 ‘Sharpening’ the Experimental Results
In a deliberate effort to improve the quality of the original data (table 1), Rayleigh
and Ramsay (1895) implemented additional controls and took more precautions to
eliminate as many of the experimental errors as possible. Despite the fact that table
4 contains fewer observations, the ‘cleaning up’ of the data renders them more
‘trustworthy’ in measuring the discrepancy in ‘weight’ between the chemical and atmospheric
nitrogen gas.
Table 4 - Rayleigh-Ramsay data (1895)
Obs.   Atmospheric (x)   Chemical (y)
1.     2.3103            2.3001
2.     2.3100            2.2990
3.     2.3102            2.2987
4.                       2.2985
5.                       2.2987
The Rayleigh-Ramsay data (table 4) yield the following estimates:
$$\bar{x}=2.310166667, \quad \bar{y}=2.299, \quad s_1^2=.00000002334, \quad s_2^2=0.00000041, \quad n_1=3, \quad n_2=5,$$
where the test statistic is:
$$\tau(z_0) = \frac{\bar{x}-\bar{y}}{\sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}}} = \frac{2.310166667-2.299}{\sqrt{\frac{.00000002334}{3}+\frac{0.00000041}{5}}} = 37.2678\,[.000000],$$
which leads to a strong rejection of the null of no discrepancy at α = .05, cα = 1.943
(α = .025, cα = 2.447). In view of the fact that the sample size is small, the p-value
in square brackets suggests that data z0 provide strong evidence against the null; see
Mayo and Spanos (2006).
Evaluating the severity of the claim μ := (μ1 − μ2) > γ, for different values of γ
yields:
Relevant inference: (μ1 − μ2) > γ
γ                      .008    .009    .01     .0105   .0108   .011    .012
Sev(37.2678, μ > γ)    1.000   1.000   0.996   0.966   0.867   0.701   0.016
(15)
On the basis of the above severity evaluations based on test {τ(Z), C1}, with only 8
observations, one can deduce that the maximum size of the discrepancy warranted
by the Rayleigh-Ramsay data z0 is in the range [.009, .011).
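The statistic and the severity values in (15) can be recomputed from table 4 in the same way as for table 1, now with ν = n1 + n2 − 2 = 6 degrees of freedom. As before, this is a sketch under stated assumptions: the helper names are illustrative, and the Student's t distribution function is approximated by Simpson's rule to avoid external libraries. Computed from the raw data rather than from the rounded estimates, the statistic comes out at about 37.27.

```python
import math
import statistics


def student_t_cdf(t, nu, steps=4000):
    """P(T <= t) for Student's t with nu degrees of freedom (Simpson's rule)."""
    const = math.gamma((nu + 1) / 2) / (math.sqrt(nu * math.pi) * math.gamma(nu / 2))

    def pdf(u):
        return const * (1 + u * u / nu) ** (-(nu + 1) / 2)

    upper = abs(t)
    h = upper / steps                      # steps must be even for Simpson's rule
    total = pdf(0.0) + pdf(upper)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    area = total * h / 3                   # integral of the density over [0, |t|]
    return 0.5 + area if t >= 0 else 0.5 - area


# Table 4 (Rayleigh and Ramsay, 1895)
x = [2.3103, 2.3100, 2.3102]                       # atmospheric nitrogen
y = [2.3001, 2.2990, 2.2987, 2.2985, 2.2987]       # chemical nitrogen

n1, n2 = len(x), len(y)
se = math.sqrt(statistics.variance(x) / n1 + statistics.variance(y) / n2)
tau = (statistics.mean(x) - statistics.mean(y)) / se
nu = n1 + n2 - 2


def severity(gamma):
    """Sev(tau(z0), mu > gamma) = P(T_nu <= tau(z0) - gamma / se)."""
    return student_t_cdf(tau - gamma / se, nu)


print(f"tau(z0) = {tau:.3f}")
for gamma in (0.008, 0.009, 0.01, 0.0105, 0.0108, 0.011, 0.012):
    print(f"gamma = {gamma:<6}  severity = {severity(gamma):.3f}")
```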
As in the case of the data in table 1, the reliability of the inference, as well as
the severity evaluations (15), depend crucially on the appropriateness of the assumed
statistical model M1. It turns out that assumption [iv] is less problematic in this
case and the use of the Welch (1938) test does not change the above inferences.
On the basis of the above severity evaluations one can deduce that the maximum
discrepancy warranted by data z0 is within the range [.01, .011]. Taking into account
the fact that no experimental data, even today, can be totally free of all possible
impurities, as well as the small sample size, the maximum
size of the discrepancy warranted by the data in table 4, as established by the severity
evaluations in (15), turned out to provide a very good approximation to the substantive
value of γ∗ = .01186 given by Ramsay (1899).
In concluding this section, it is interesting to note that the use of Confidence
Intervals (CI), as a method to capture the substantive discrepancy γ∗, can be shown
to be less effective in this particular case; the general reasons for this are discussed
in Mayo and Spanos (2006).
5 Conclusion
The error-statistical perspective has been used to capture, as well as systematize, the
‘learning from error’ process that led Rayleigh and Ramsay to the discovery of argon.
Their probing for error at all stages of the ‘trial and error’ process, and their careful
but rudimentary statistical analysis of the data from their experiments, played an
important role in guiding their learning process. It is shown that the error-statistical
account, based on severe testing reasoning (Mayo, 1996), can be used to describe,
codify and provide a naturalistic account of their quest. In particular, this approach
was shown to accomplish three related objectives:
(a) it provides a coherent rationale for the probing for error process that
systematically eliminates potential explanations in order to address Duhem’s
problem ‘locally’, at the level of a particular experiment,
(b) it captures the intuition underlying their informal probing for error procedures,
including their ‘diagnostic checks’ to secure the validity of the (implicit)
statistical model,
(c) it formalizes and systematizes their intuitive inductive reasoning that gave
rise to a reliable measure of the magnitude of the discrepancy warranted by
their data, using post-data severe testing evaluations.
Of particular interest is their use of primitive (by today’s standards) statistical
analysis in guiding them toward additional experiments which eventually led to the
establishment of a substantive discrepancy between the density of atmospheric and
chemical nitrogen. That in turn led them to the discovery of argon in the atmospheric
air, which precipitated a major revision of the periodic table. The importance of the
discovery of argon was recognized early in the 20th century by awarding two Nobel
prizes to Rayleigh and Ramsay in 1904, in physics and chemistry, respectively, and
not awarding one in chemistry to Mendeleev in 1906.
