Introduction to Experimental Methods

The researcher Robert S. Woodworth, in his 1938 text Experimental Psychology, laid out the blueprint we follow in psychological research today. He stated that the defining feature of the experimental method is the manipulation of what he called the independent variable (although he didn't coin the term) and the observation of its effect on the measured dependent variable, which is then interpreted in a systematic fashion. The dependent variable, he argued, ought to be an observable phenomenon that can be measured in some meaningful way and defined precisely in empirical, operationalizable terms. In order to find a causal relationship, the experimenter must also "hold all other (extraneous) conditions constant, save for the independent variable" and then observe, record and publish the changes.

Woodworth also detailed the steps of the correlational method, which has no true independent variable. While it cannot be used to uncover causality, it still has its uses, and at times - such as when working with non-equivalent groups - it is the only method possible.


Establishing Independent and Dependent Variables

Independent Variables

Independent variables must have at least two levels. The simplest experiments test the effects of the independent variable on one group and compare them against a second group which does not receive the independent variable. These groups are referred to as the experimental group and the control (or placebo) group. In cases where withholding the experimental IV is unethical, such as withholding a medicine from sick people, a waiting list is used instead of a placebo group. Yoked groups are control groups that experience some, but not all, of what the experimental group experiences. A research design testing the effects of perceived control on stress used "executive rats" who could "pull a lever" to reduce a shock. Yoked rats received the same shocks, but could not "pull any lever". (It wasn't really a lever, but I don't have all day to explain the apparatus.)

Variables come in many varieties. They can be grouped as situational variables, which are features of the environment that we can control; task variables, which are different kinds of problems to solve; and instructional variables, which are different instructions for approaching the same task.

Controlling Extraneous Variables

Extraneous variables are variables other than the independent variable that may influence the dependent variable. When the experimenter fails to control for them, they produce confounds: influences that cannot be separated from the effects of the independent variable. In scientific terms, this basically wrecks the experiment.

Manipulated Vs. Subject Variables

Manipulated Variables

Manipulated variables are things that we can control within the environment, such as the dosage of a drug. A study using only manipulated variables and random assignment to groups is called a "true" experiment. You are allowed to make causal claims in this situation, because the independent variable precedes the dependent variable and covaries with it. Some scientists go as far as to say you really shouldn't call anything other than a manipulated variable an independent variable. They're probably right...

Subject Variables

Subject variables are pre-existing characteristic differences between people, such as gender. These are also referred to as ex post facto variables, non-manipulated variables and natural variables, mainly in a desperate attempt to make the obvious sound complex and to convince you that something utterly unfathomable is going on.

Pity a study using subject variables, for it can only be a "quasi" experiment. This status limits the kinds of conclusions you can draw - i.e., you are basically stuck with correlational findings. Since we cannot rule out that uncontrollable, post-hoc aspects of the subject variables are responsible for the measured difference, we can never say the groups are equivalent; all we can really say is that the groups performed differently, although you would never say it that simply.


First, let me define the word "construct" as it will appear here. A construct is a theoretical entity - a working, operational hypothesis that strives to give us a meaningful understanding of some studied phenomenon.

Next, let me quickly define validity (we'll get back to reliability later). Validity concerns how close a construct comes to measuring what you seek to measure. How do you measure validity? You can't quite do it in the abstract - you must look at each sort of validity in turn and follow its methods.

Let's take a look at some important kinds of validity.

Predictive Validity

Predictive validity is the entire point of research. If a theory has no predictive value - if it can't give us an idea of what sort of things we will learn in the future - there is no point in worrying about any of the other validity measures on this page; it's already dead. The 19th-century sociologist Comte said it best:

"Know in order to predict" - Auguste Comte

You measure predictive validity easily - if your theory can predict events clearly and accurately ahead of time (i.e., the point of prediction), then it has predictive validity. Theories such as evolution and quantum theory have superb predictive validity. Theories such as religion have absolutely miserable predictive validity.

Face Validity

Face validity is a learned opinion of whether a construct appears valid, based on experience. It's intuitive - you measure it according to your experience. It's not the most scientific form of validity, and it's never the sole form of validity used, but it is important to recognize that experimenters do use it.

Statistical-Conclusion Validity

This validity is the extent to which a researcher uses statistics properly. You measure it by applying correct knowledge of statistics to a review of the statistics under consideration. This requires expertise.

Construct validity

Construct validity refers to the adequacy of the definitions of both the independent and dependent variables. These variables should be defined in empirical terms, in a manner that lets others measure them the same way. This is what we mean when we say we operationalize our terms. A variable has good construct validity if it is measurable and if the measure accurately reflects the construct. The construct validity of a test is the extent to which the test may be said to measure a theoretical construct or trait. It's not a simple matter to decide how valid a construct is - sometimes it is hotly debated, because construct validity is directly influenced by one's theory of psychology. Examples include scholastic aptitude, neuroticism, anxiety, etc.

Construct Identification Procedures

A construct is developed to explain and organize observed response consistencies. It derives from established interrelationships among behavioral measures. Specific techniques that contribute to construct identification include:

  • Developmental changes - such as age differentiation
  • Correlation with other pre-established tests

Factor Analysis was developed as a means of identifying traits, and is a method of analyzing interrelationships between data. It is used to uncover clusters of behaviors that suggest common traits.

Internal Consistency

Internal consistency - a measure of homogeneity, i.e., the extent to which the items of a test all measure the same thing.

Convergent and Discriminant Validity

In order to show construct validity, we must show not only that a test correlates well with other variables with which it should correlate, but also that it does not correlate significantly with variables from which it should differ. Convergent validity refers to the former kind of correlation; discriminant validity, to the latter.

Campbell and Fiske (1959) proposed a systematic experimental design for the dual approach of convergent and discriminant validation, which they called the multitrait-multimethod matrix. Logically, this assessment requires that at least two traits and two methods be examined. Let's use an example to learn about multitrait-multimethod matrices. In our example, we will assume three traits and three methods are to be used. The three traits are dominance, sociability and achievement motivation. The three methods of evaluation could be a self report (lies!), a projective test and a peer rating. We will use letters (A, B, C) to symbolize traits, and numbers (1, 2, 3) to represent methods of evaluation. Therefore, "A1" would indicate a measure of dominance according to a self report, "A2" would be dominance as measured by a projective test, and so on.

The hypothetical correlations in the following table include reliability coefficients and validity coefficients (the values in parentheses). Reliability is a monotrait, monomethod correlation - a correlation of the measure against itself! Theoretically, this should be perfect (i.e., r = 1.0), but to be scientific, we substitute an estimate of reliability. You can estimate reliabilities a number of different ways (e.g., test-retest, internal consistency). There are as many correlations in the reliability diagonal as there are measures - in this example there are nine measures and nine reliabilities. The first reliability in the example is the correlation of Trait A, Method 1 with Trait A, Method 1.

In the validity coefficients, the scores obtained are comparisons of measures of the same trait by different methods. Each method is checked against the other independent measures of the same trait. The table also includes correlations between different traits measured by the same method, in red triangles (heterotrait, monomethod), and correlations between different traits measured by different methods, in orange triangles (heterotrait, heteromethod). Now, as complex as this all is, the results should be simple: leaving aside your reliability coefficients, your validity coefficients should be the highest scores (monotrait, heteromethod), followed by the heterotrait, monomethod scores (red), and (hopefully) lagging woefully behind, your heterotrait, heteromethod scores (orange). Going back to our concrete examples, self reports on dominance should correlate higher with projective tests of dominance than they do with self reports of sociability. If this is not the case, you have uncovered method variance - error or deception! In our example, you can see that method two correlated better with method three - indicating perhaps that self reports (method 1) were confounded by the social desirability bias.

In order to find these scores in the matrix, you'll first have to refer back to which letter represents dominance (it's "A") and then look at the block containing A2, which indicates a correlation between A1 and A2 of .57. The block just below it contains the correlation comparing A1 to A3 - .56. Now, move over to your right, and you'll see the block containing the correlation between A2 and A3 - .67, a bit higher, as we might expect. Now, to check the other measures, you look in the triangles. The red triangles contain correlations among measures that share the same method of measurement. For instance, A1-B1 = .51 in the upper-left heterotrait-monomethod triangle. Note that what these correlations share is method, not trait or concept. If these correlations are high, it is because measuring different things with the same method results in correlated measures. Or, in more straightforward terms, you've got a strong "methods" factor. Then look at the correlations in the orange triangles. These are correlations that differ in both trait and method. For instance, A1-B2 is .22 in the example. Again, you'll want these to be the lowest scores, as these are correlations between measures that differ both on trait and method.
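The cell-by-cell bookkeeping above is mechanical enough to sketch in code. Below is a minimal Python sketch using the hypothetical labels and correlation values from the example; `mtmm_category` is an illustrative helper, not a standard routine:

```python
def mtmm_category(a, b):
    """Classify a correlation between two labels like 'A1' (trait letter
    plus method number) into its multitrait-multimethod matrix cell."""
    same_trait = a[0] == b[0]
    same_method = a[1] == b[1]
    if same_trait and same_method:
        return "reliability (monotrait, monomethod)"
    if same_trait:
        return "validity (monotrait, heteromethod)"
    if same_method:
        return "heterotrait, monomethod"
    return "heterotrait, heteromethod"

# Hypothetical correlations quoted in the example above
correlations = {("A1", "A2"): .57, ("A1", "A3"): .56, ("A2", "A3"): .67,
                ("A1", "B1"): .51, ("A1", "B2"): .22}
```

With these values the expected ordering holds: the validity coefficients (.57, .56, .67) exceed the heterotrait-monomethod score (.51), which in turn exceeds the heterotrait-heteromethod score (.22).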


Content Validity

Content validity depends on the relevance of the experiment to the phenomenon being studied. Content validity is often discussed in the making of tests - the question is, how relevant are these questions to measuring a behavior or ability? Concern for content validity begins during the very formation of the experiment or test, and content validity is often measured by correlative studies concerning real-life successes or failures of selected samples.

Criterion Prediction Validity

Criterion-Prediction Validity procedures indicate the effectiveness of a test in predicting an individual's performance in a specified activity.

External Validity

External Validity refers to the generalizability of results to the population from which the sample was drawn. Ecological and chronological validity refer to the generalizability of results to other environments and to other times.

Internal Validity

Internal validity refers to the degree to which an experiment is methodologically sound and free of confounds. Threats to internal validity, particularly confounds in within-subject designs, include the following:


History

When an event occurs between a pre- and post-test that produces large changes unrelated to the treatment, a history confound has occurred.


Maturation

Maturation is another change that may be unrelated to the treatment. It is obviously more likely to occur when the pre- and post-tests are held far apart chronologically.

Regression to the Mean

Regression to the mean can be a confound when subjects are selected precisely because of their extreme scores.
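A quick simulation makes the point. The sketch below (plain Python, made-up numbers) gives every subject a stable "true ability" plus test noise, selects the extreme scorers on a first test, and shows their retest mean drifting back toward the population mean with no treatment at all:

```python
import random

random.seed(1)  # fixed seed so the illustration is repeatable

# True ability is stable; each test score is ability plus random noise.
ability = [random.gauss(100, 10) for _ in range(10_000)]
test1 = [a + random.gauss(0, 10) for a in ability]
test2 = [a + random.gauss(0, 10) for a in ability]

# Select only the subjects with extreme first-test scores...
extreme = [i for i, score in enumerate(test1) if score > 125]
mean1 = sum(test1[i] for i in extreme) / len(extreme)
mean2 = sum(test2[i] for i in extreme) / len(extreme)
# ...and their retest mean falls back toward 100, though nothing was done to them.
```

The high first-test group owes part of its extremity to lucky noise, and luck does not repeat, so the group scores lower on retest.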

Practice Effect

In a pretest-posttest situation, having experienced the test the first time helps the subject improve upon the second attempt.

Instrumentation Effects

Confounds occur when there are subtle changes in the measures from pre- to post-testing. Since the researcher's bias will most likely tilt any errors in favor of rejecting the null hypothesis, most researchers use double-blind or even triple-blind measures. (A triple-blind measure means even the statistician is unaware of which group is the experimental group.)


Reliability is a function of whether a tool arrives at the same result each time it measures the same phenomenon. A simple rule in experimentation is that valid results are reliable, but reliable results are not necessarily valid. You can be consistently (reliably) wrong! For evidence, see any religion.

Two measures of reliability in within-subjects designs are split-half reliability and test-retest reliability. You use split-half reliability when you are only testing subjects once: the test is split into two equivalent halves, and each subject's scores on the two halves are correlated. You use test-retest reliability when you are testing the same subjects multiple times.

As you might guess, split-half tests suffer from the fact that the two halves may still differ in some significant manner, and test-retest methods suffer from practice effects and possible maturation effects.

Subject Confounds

Selection bias is a factor in any study of subject variables. Is there something inherently different about a person who would go for psychological treatment, as opposed to a person with a similar concern who would not? This factor would confound any research dealing with the efficacy of therapy.


Attrition is a related confound. Are there characterological differences between the experimental subjects who remain in a study and those who drop out?

Control Problems in Experimental Research

Between Subject Designs

If the independent variable is a subject variable, subjects cannot be randomly placed into experimental groups! Therefore, there must be a between-subject design - a comparison of different groups. While steps may be taken to control all other extraneous variables and create equivalent groups, experimental effects may always be confounded by subject differences between the groups. For example, no matter how egalitarian the U.S. may be at present compared to the past, there is no real way to ensure equality of groups across gender and race.

The Problem of Creating Equivalent Groups

The best way to create equivalent groups is to randomly assign a randomly selected sample to each experimental group. This does not really eliminate confounds - rather, because each subject has an equal chance of appearing in any group, the confounds are spread equally across all measures, offsetting each other! In other words, to reduce error effects, you spread the error everywhere! There are various computer programs to randomly place subjects.
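Random assignment takes only a few lines; here is a minimal Python sketch (the function name is illustrative, not any particular program):

```python
import random

def random_assignment(subjects, n_groups):
    """Shuffle the whole sample, then deal subjects out round-robin so
    every subject has an equal chance of landing in any group."""
    pool = list(subjects)
    random.shuffle(pool)
    return [pool[i::n_groups] for i in range(n_groups)]

# Assign 30 subjects to 3 groups of 10
groups = random_assignment(range(30), 3)
```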

When random assignment is not possible, as with between-subject studies of subject variables, matching can be used. In matching, subjects are paired together on some trait that is deemed both measurable and important. When it is a measure such as IQ, a list is created in descending order. Then, for each pair (or set of 3, 4, etc.), one subject is randomly assigned to each group.
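The matching procedure just described can be sketched as follows, using hypothetical subjects and IQ scores:

```python
import random

def matched_assignment(iq_scores, n_groups):
    """Rank subjects by the matching variable, slice the ranking into
    consecutive blocks of n_groups, then randomly assign one member of
    each block to each group."""
    ranked = sorted(iq_scores, key=iq_scores.get, reverse=True)
    groups = [[] for _ in range(n_groups)]
    for i in range(0, len(ranked) - len(ranked) % n_groups, n_groups):
        block = ranked[i:i + n_groups]
        random.shuffle(block)  # the random assignment within each matched set
        for group, subject in zip(groups, block):
            group.append(subject)
    return groups

# Hypothetical IQ scores for six subjects, matched into pairs
iq = {"s1": 130, "s2": 128, "s3": 121, "s4": 119, "s5": 110, "s6": 108}
group_a, group_b = matched_assignment(iq, 2)
```

Each matched pair (s1/s2, s3/s4, s5/s6) is split across the two groups, so the groups end up roughly equivalent on IQ.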

Within Subject Designs

Within-subject designs avoid some of the previous confounds, because each experimental subject is exposed to all levels of the independent variable(s). For this reason this design is also referred to as a repeated-measures design. Since everyone experiences each level of the IV, the confounds of nonequivalent groups are eliminated. Also, since each subject experiences every level of the IV, the researcher can recruit fewer subjects for the study, which is good news if you are studying left-handed Albanian Siamese twins.

However, there are still confounds to control. Sequence effects are confounds that occur due to the order of presentation of the levels of the independent variable(s). The practice (or carryover) effect is a confound wherein experience with one repeated measure helps subjects improve when tested again.

Various methods exist to control these confounds. The simplest is to test each person at each level of the IV, but only once; this removes practice effects. Counterbalancing changes the order of presentation for randomly selected subjects and therefore helps reduce sequence effects. Reverse counterbalancing is the simplest manner of changing the order - you simply reverse the presentation of the experimental conditions, provided that you are going to test each level more than once. However, in some cases, it is impractical to counterbalance every permutation of the levels of the IV. For example, if there are six experimental conditions in an experiment, there are 720 possible orderings of the conditions. In this case, partial counterbalancing randomly selects a reasonable number of these permutations to examine. To do this professionally, you use what is called a Balanced Latin Square - an ordering scheme in which each condition appears once in each ordinal position and immediately follows every other condition equally often.
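A Balanced Latin Square can be generated with the classic construction (first row 1, 2, n, 3, n-1, ...; each later row adds one to the previous, modulo n). This Python sketch assumes an even number of conditions; for odd n, the usual fix is to run the square together with its reverse:

```python
def balanced_latin_square(n):
    """Build a balanced Latin square for an even number of conditions:
    each condition appears once per row and once per ordinal position,
    and each condition immediately follows every other exactly once."""
    if n % 2:
        raise ValueError("this classic construction needs an even n; "
                         "for odd n, use the square plus its reverse")
    first = [0, 1]            # standard opening: 0, 1, n-1, 2, n-2, ...
    low, high = 2, n - 1
    while len(first) < n:
        first.append(high)
        high -= 1
        if len(first) < n:
            first.append(low)
            low += 1
    return [[(c + r) % n for c in first] for r in range(n)]

square = balanced_latin_square(4)   # one row = one subject's order of conditions
```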

In some cases, reverse counterbalancing is possible but too predictable. In such cases, block randomization procedures are used to cull out randomly assigned permutations of the levels of the study.
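Culling a random subset of the possible orderings can be sketched like this (illustrative function name; standard library only):

```python
import itertools
import random

def partial_counterbalance(n_conditions, n_orders):
    """Randomly sample a manageable subset of the n! possible
    presentation orders, without replacement."""
    all_orders = list(itertools.permutations(range(n_conditions)))
    return random.sample(all_orders, n_orders)

orders = partial_counterbalance(6, 12)   # 12 of the 720 possible orders
```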

The Problem of Controlling Sequence Effects

In some cases, counterbalancing fails altogether, because one condition in the experiment may lead to insight learning. Insight learning provides general, abstract knowledge that may aid in figuring out any other similar puzzle. When this occurs, the experimental sequence that leads to insight learning is said to produce an asymmetric transfer of knowledge. If asymmetric transfer is occurring in a study, the researcher really has no choice but to turn to a between-subjects design.

Control Problems in Developmental Research

Problems with Cross Sectional Studies

Cross-sectional studies use a between-subject approach to studying a phenomenon. A study of the differences among three-year-olds, five-year-olds, and seven-year-olds would collect three different groups. The benefit is that the study can be done NOW. However, these studies suffer from a problem of nonequivalency known as cohort effects: environmental and historical influences on a particular age group that are not shared by another age group growing up at a different time. For example, children of the 1960s dealt with the assassination of their president and the threat of nuclear war, while children of the 1980s dealt with AIDS and a drug problem that had grown since the 1960s.

Problems with Longitudinal Studies

Longitudinal studies use a within-subject, or repeated-measures, approach - they study the same group over time and so avoid cohort effects. However, they suffer from attrition and selection bias and, worst of all, can require YEARS of time. Terman's "Termites" is the most famous example of a longitudinal study.

Problems with Biasing

Experimenter Bias

This biasing occurs when the hopes, expectations and desires of the researcher unduly influence measurement. Since this can happen even unconsciously, the way to reduce this error is through triple-blind procedures, where the experimenter, the subjects and the statistician are all unaware of which experimental condition is being tested.

Subject Bias

Subject bias occurs when the expectations of the experimental subjects confound research results.

Hawthorne Effect

A particular example is the Hawthorne effect: the tendency of people to work harder when they know they are being observed. This is also known as the observer effect.

Social Desirability

The social desirability bias unduly affects self-reports of people's behaviors.

Good Subject Problem

The Good Subject Problem occurs when experimental subjects do their very best to ensure that the researcher’s hypothesis is supported. Presumably, there is a Bad Subject bias as well. Since subjects want to be seen as intelligent and good, there is Evaluation Apprehension - effects caused by anxiety over being observed.

One way to avoid these problems is to mask the demand characteristics of a study. The demand characteristics are the aspects of the study that reveal the hypothesis being tested. Benign deception is therefore used. To ensure that the deception worked, a manipulation check can be used during the debriefing: subjects are asked whether they knew the true hypothesis being tested.

In the end, however, there will always be some bias in most research, because research volunteers, by their very nature, tend to be more intelligent, more helpful, and more in need of social approval than non-volunteers. While the problem of reluctant volunteers exists, it is less common.
