Inferential Statistics

From Scientificmetho

Descriptive Vs. Inferential Statistics

Descriptive statistics allow you to describe attributes of a distribution of data. Two ways of doing that are with measures of central tendency and measures of dispersion. Measures of central tendency include the Mean, Median and Mode. Each of these describes the center point of the distribution, the value that the data points cluster about. Measures of dispersion include the Range, Variance, and Standard Deviation, among others (but these three are the most common). These statistics describe how spread-out the distribution is. So, with one measure of each kind (just two values), anyone can summarize where a distribution is centered and how widely it spreads. The most common statistics reported are the mean and standard deviation.
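As a small illustration of these measures, here is a sketch using Python's standard statistics module. The scores are made up for the example and are not from any study.

```python
import statistics

# Hypothetical sample of test scores (invented for illustration).
scores = [70, 75, 75, 80, 85, 90, 95]

# Measures of central tendency
mean = statistics.mean(scores)      # arithmetic average
median = statistics.median(scores)  # middle value of the sorted scores
mode = statistics.mode(scores)      # most frequent value

# Measures of dispersion
data_range = max(scores) - min(scores)  # largest minus smallest
variance = statistics.variance(scores)  # sample variance (n - 1 denominator)
std_dev = statistics.stdev(scores)      # square root of the sample variance

print("mean:", round(mean, 2), "median:", median, "mode:", mode)
print("range:", data_range, "variance:", round(variance, 2),
      "std dev:", round(std_dev, 2))
```

Reporting just the mean and standard deviation from this output would match the common practice described above.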

Inferential Statistics

A limitation of descriptive statistics is that they cannot tell us anything about distributions other than the one we have collected data for. This might not seem like a limitation, but consider the fact that researchers do not study whole populations directly. They sample from those populations and collect data from those samples. Descriptive statistics do a fine job of describing those sample distributions, but the researcher really wants to know if the data collected on the sample are representative of the population from which the sample was drawn. In other words, the researcher wants to be able to infer whether or not the results he/she found in the sample would occur in the population at large. That's where inferential statistics come in. With inferential statistics, you are trying to reach conclusions that extend beyond the immediate data alone. For instance, we use inferential statistics to try to infer from the sample data what the population might think. Or, we use inferential statistics to make judgments of the probability that an observed difference between groups is a dependable one or one that might have happened by chance in this study. Thus, we use inferential statistics to make inferences from our data to more general conditions; we use descriptive statistics simply to describe what's going on in our data.

Why the inferential process is important to research

Remember that the scientific method is used to generate theories that attempt to describe, explain, predict and control some studied phenomena. Descriptive analysis/statistics are used to describe a chosen sample's characteristics. Inferential statistics are used to infer or explain something about the population that the (hopefully) representative sample was drawn from. These explanations will allow us to predict future outcomes (which serve us doubly by further validating the theory). Finally, in cases where we can manipulate the independent variable, we can control or shape the phenomena to meet some supposedly desired outcome. Inferential analysis can only come after descriptive analysis, but without the inferential process, science would be nothing more than the endless cataloguing of disconnected facts.

Our book sucks at actually explaining what inferential analysis is, so I will say that inferential analysis is another name for hypothesis testing. Hypothesis testing takes place in experimental research, where we hope to find statistically significant differences between experimental conditions that may lead us to reject the null hypothesis in favor of our alternative hypothesis.


Ok. The Null Hypothesis is the assumption that there is no real difference between the control conditions and the experimental conditions on your dependent variable in your experiment. You can think of this as "the status quo". It is denoted as H0, pronounced "H sub zero" (or "H-naught"). Without evidence to the contrary, retaining the null hypothesis is the most parsimonious explanation for the relationship between your research conditions. We therefore seek to show that the null hypothesis is not the best explanation for the studied phenomena.

The research hypothesis, also called the alternative hypothesis, is denoted as "H1". (Multifactorial research can have several alternative hypotheses, H2, H3...) Research is all about rejecting the null hypothesis in favor of finding a significant difference (H1) between effects for the control group and effects for the experimental group, but we can only do this when the alternative hypothesis is the most parsimonious explanation for this difference. We can make this claim when an inferential test shows that a difference as large as the one we observed would be unlikely to arise from error and chance alone, which leaves the effect of our independent variable as the better explanation.

Failing to reject H0 means that any difference you did find between groups was small enough that it could plausibly have been due to chance.

The role that chance plays in any scientific endeavor

Determining absolute causality in a complex universe is a scientific impossibility. We can never know every determinant of an effect, first, because we just don't have the time to find them all, and second, because we don't know all the kinds of things to look for yet. What this means is that it is always possible that your experiment will reveal exactly the outcome that your hypothesis predicted, even though the hypothesis is wrong.

Chance is a term to describe this reality. Chance is any variability from the null hypothesis that is due to variables you are not studying. Some of this may be due to individual differences between groups, random unpredictable factors, or, for all we know, unseen mystical forces, such as the ghost of Yul Brynner. The real point of an experiment is to control for this error variance to reduce the role that chance plays in our understanding of the relationship between variables. The better we do this, the more we can claim that differences are due to Systematic Variance - a term denoting the variance produced by the Independent variable(s) we are studying.

As Mel Brooks's "Young Frankenstein" might say: "Systematic Variance - GOOD! Error Variance - HUNNNNMMM!"

The Central Limit Theorem and its relevance to Inferential Statistics

The central limit theorem, one of the most interesting results of the theory of probability, is the basis for inferential statistics. The theorem states that the sum (and therefore the mean) of a large number of independent observations from the same distribution has, under certain general conditions, an approximately normal distribution. Moreover, the approximation steadily improves as the number of observations increases. (We witnessed the power of the central limit theorem when we observed that dice rolling experiment. After a certain number of rolls, we saw the formation of a normal bell curve.) This occurs regardless of the shape of the population - whether it is normally distributed, skewed, or otherwise irregular, the means of samples of about 30 or more will be approximately normally distributed. (The individual scores in a sample still follow the shape of the population; it is the distribution of sample means that becomes normal. The cutoff of 30 is a rule of thumb. If the sample is smaller than that and the population cannot be assumed to be normal, we need to apply nonparametrics that do not assume a normal distribution.)

Many natural phenomena are themselves approximately normally distributed. A suggested explanation is that these phenomena are sums of a large number of independent random effects and hence are approximately normally distributed by the central limit theorem. The relevance of the theorem to inference is that very little about the distribution of scores within the population need be known to generalize results about means from a sample to a population.
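The dice experiment can be sketched as a short simulation. This is only an illustration under assumed settings (a fair six-sided die, samples of 30 rolls, 2000 repetitions, and a fixed random seed), none of which come from the original exercise.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

def sample_mean(n):
    """Mean of n rolls of a fair six-sided die."""
    return statistics.mean(random.randint(1, 6) for _ in range(n))

# Single rolls are uniformly distributed (flat, not bell shaped),
# but the means of many samples of 30 rolls pile up around 3.5.
means = [sample_mean(30) for _ in range(2000)]
grand_mean = statistics.mean(means)
spread = statistics.stdev(means)
print("mean of sample means:", round(grand_mean, 3))
print("std dev of sample means:", round(spread, 3))

# Crude text histogram of the sample means.
for i in range(8):
    low = 2.5 + 0.25 * i
    count = sum(low <= m < low + 0.25 for m in means)
    print(f"{low:.2f}-{low + 0.25:.2f} {'#' * (count // 20)}")
```

The histogram should show a roughly symmetric hump centered near 3.5 even though no single roll is normally distributed.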

Statistical significance

Just as we can never know all the causal determinants for an effect, we can never draw a perfectly representative sample from a population, short of sampling the entire population itself. Sampling Error is the term for this problem. (We've already discussed methods of reducing this error, such as random sampling and matching groups.) A more prosaic error to consider is the simple mistake of claiming that a significant difference in our experiment exists when it doesn't. This leads to a discussion of Type I and Type II error.

What Type I and Type II errors are, and their importance to research

We already know that in science a researcher's hypothesis is never proven. Instead, we tentatively support a hypothesis when we fail to disprove it and instead reject the Null hypothesis. Since we cannot prove things in science, we must state with what confidence we are supporting our hypothesis. We use the term Alpha Level to denote the probability of mistakenly rejecting the Null Hypothesis when it is actually true. The most commonly used alpha is .05, which means that when there is really no difference, about one test in every 20 will still commit the Type I error - claiming a significant difference between experimental conditions and control conditions, when none exists. (Just one more reason why we need to replicate findings and cite more than one source in our research...)
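The meaning of alpha can be checked with a small simulation. The sketch below is an assumption-laden illustration: both groups are drawn from the same normal population (so the Null Hypothesis is true by construction), the population standard deviation is treated as known (so a z test can be used), and the group size, number of trials and seed are arbitrary.

```python
import random
from statistics import NormalDist, mean

random.seed(1)  # fixed seed so the run is repeatable

ALPHA = 0.05
N = 30         # scores per group (arbitrary choice)
TRIALS = 2000  # number of simulated experiments (arbitrary choice)

# Two-tailed critical z for the chosen alpha (about 1.96 for .05).
critical_z = NormalDist().inv_cdf(1 - ALPHA / 2)

false_alarms = 0
for _ in range(TRIALS):
    # Both groups come from the SAME population: there is no real effect.
    control = [random.gauss(0, 1) for _ in range(N)]
    treated = [random.gauss(0, 1) for _ in range(N)]
    # z statistic for a difference in means with known sigma = 1.
    z = (mean(treated) - mean(control)) / (2 / N) ** 0.5
    if abs(z) > critical_z:
        false_alarms += 1  # a Type I error

type_one_rate = false_alarms / TRIALS
print("proportion of Type I errors:", type_one_rate)
```

The printed proportion should land close to the alpha that was set, which is what "one test in 20" refers to.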

We can reduce the risk of a Type I error to a point. After all, we set the alpha, we decide the level of risk we will take, and we are responsible for reducing experimental and sample error. Type II errors, or the erroneous retention of a false null hypothesis, are related more to sample size and the sensitivity of our measures of the dependent variable, factors that tend to be less under our control.

We can reduce the occurrence of Type II errors by increasing the size of our sample, but there is a serious concern with this solution. A trivial difference between groups becomes significant if you pull a large enough sample - but at some point this statistically significant difference becomes meaningless.
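To see how sample size alone can turn a trivial difference into a significant one, here is a rough sketch. It assumes a fixed difference of 0.02 standard deviations between two groups and uses a two-sample z test with known variance; the function name and the numbers are hypothetical choices for illustration.

```python
from statistics import NormalDist

def two_tailed_p(diff_in_sd, n_per_group):
    """Two-tailed p value for a mean difference expressed in SD units,
    using a two-sample z test with equal group sizes."""
    z = diff_in_sd / (2 / n_per_group) ** 0.5
    return 2 * (1 - NormalDist().cdf(abs(z)))

# The same tiny difference, tested with ever larger samples.
for n in (100, 1_000, 10_000, 100_000):
    print(n, round(two_tailed_p(0.02, n), 4))
```

The difference never changes, only the sample size does, yet the p value eventually drops below .05.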

The difference between significance and meaningfulness

A proposed intervention may represent a statistically significant improvement over some pre-existing method, without actually providing a meaningful difference. The new method's slight improvements may not warrant the extra cost of both time and money to implement it. This distinction may be referred to as practical significance, which is a judgment call based on economic and political considerations known as a cost-benefit analysis.

The steps in completing a test of statistical significance

Remember how we state our hypothesis in a conditional statement?

1) The negation of your hypothesis forms the Null Hypothesis
2) Establishing the alpha level
3) Selection of the appropriate statistic
4) Computation of the obtained value for statistical significance
5) Determination of the critical value - the minimum value needed for significance
6) Comparison of the obtained value to the critical value
7) Reject the Null hypothesis if the obtained value is more extreme than the critical value
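The steps above can be walked through with a small worked example. This is only a sketch: the scores are invented, the independent-samples t-test with pooled variance is one possible choice of statistic, and the critical value of 2.101 (two-tailed, alpha = .05, 18 degrees of freedom) is copied from a standard t table rather than computed.

```python
from statistics import mean, variance

# Hypothetical scores, invented purely for illustration.
control = [12, 15, 11, 14, 13, 16, 12, 14, 13, 15]
treatment = [16, 18, 15, 19, 17, 20, 16, 18, 17, 19]

# Steps 1-2: H0 says the two group means are equal; alpha is set at .05.
alpha = 0.05

# Step 3: the chosen statistic is an independent-samples t-test
# with pooled variance.
n1, n2 = len(control), len(treatment)
df = n1 + n2 - 2
pooled_var = ((n1 - 1) * variance(control)
              + (n2 - 1) * variance(treatment)) / df
std_error = (pooled_var * (1 / n1 + 1 / n2)) ** 0.5

# Step 4: the obtained value.
t_obtained = (mean(treatment) - mean(control)) / std_error

# Step 5: the critical value for a two-tailed test at alpha = .05
# with df = 18, taken from a standard t table.
t_critical = 2.101

# Steps 6-7: compare, then reject H0 if the obtained value is more extreme.
reject_null = abs(t_obtained) > t_critical
print("t obtained:", round(t_obtained, 3))
print("reject the Null hypothesis:", reject_null)
```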

Some of the basic types of statistical tests and how they are used

Statistical tools deal with data. So, in order to address the appropriate use of statistics, we should begin by reviewing the data types. There are four basic data types:

  • Ratio Data (continuous)
  • Interval Data (continuous)
  • Ordinal Data (discrete)
  • Nominal Data (discrete)

They are shown in order of power, from highest to lowest. Power here is a measure of the amount of information contained in the data.

Nominal data have the lowest power because they contain the least amount of information. The information in nominal data consists of names, categories, or frequencies of occurrence. The numbers on football players' jerseys, for example, are nominal data. They only indicate the category of position a player occupies on the field; otherwise, the numbers have no meaning. They don't relate to a player's age, weight, or anything else, other than perhaps IQ.

The problem with nominal data is that it does not have the mathematical properties necessary to permit the meaningful computation of means (averages). For example, you could add all the football jersey numbers on the offensive team, divide by 11, and compute a mean of those numbers. But the mean you compute would be meaningless, because the original numbers are themselves meaningless.

Surveys that use Likert-type data scales (Strongly Agree, Agree, Neutral, Disagree, Strongly Disagree) to collect attitudinal data generate categorical data that are treated here as nominal. (Strictly speaking, the ordered categories make them ordinal, but the problem with means is the same.) The following example will illustrate the problem of computing (and interpreting) means with nominal data. Suppose I ask 1000 people a single question: "How do you feel about the President's economic policy?" In this example, I am using the five point Likert scale above. In this hypothetical example, let's say all 1000 people return their surveys, and that 500 respondents Strongly Agree with his policy while the other 500 Strongly Disagree with it. Some people erroneously try to convert these Likert scale data into numerical data by assigning weights to each category like this:

Strongly Agree = 5, Agree = 4, Neutral = 3, Disagree = 2, and Strongly Disagree = 1

Then, they would attempt to compute a mean response value by multiplying the number responding in each category by the weight of that category, adding up all these products, and dividing by the total number of surveys. When we do that with our survey, we get:

(5 * 500 + 1 * 500) / 1000 = (2500 + 500) / 1000 = 3000 / 1000 = 3

So, what we have is a mean response value of 3 (Neutral). What this says is that "on average, those who responded to the survey are neutral toward the President's economic policy." Obviously, this is an erroneous conclusion, since not a single respondent chose Neutral. That's the essence of the problem you can run into when trying to compute means with nominal data -- the results are almost always un-interpretable. The most accurate way to report nominal data is to stay descriptive and use percentages of responses.
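The survey arithmetic can be reproduced in a few lines. The sketch below uses the hypothetical counts from the example and contrasts the misleading weighted mean with the percentage breakdown recommended above; the variable names are illustrative.

```python
# Weights that some people (erroneously) assign to the categories.
weights = {"Strongly Agree": 5, "Agree": 4, "Neutral": 3,
           "Disagree": 2, "Strongly Disagree": 1}

# Hypothetical survey results from the example: 1000 surveys returned.
counts = {"Strongly Agree": 500, "Agree": 0, "Neutral": 0,
          "Disagree": 0, "Strongly Disagree": 500}

total = sum(counts.values())

# The misleading "mean response": sum of weight * count, over total.
weighted_mean = sum(weights[k] * counts[k] for k in counts) / total
print("weighted mean:", weighted_mean)  # suggests "Neutral"

# The descriptive alternative: percentage of responses per category.
percentages = {k: 100 * counts[k] / total for k in counts}
for category, pct in percentages.items():
    print(f"{category}: {pct:.0f}%")
```

The percentages make the split opinion obvious, while the mean hides it completely.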

Ordinal Data contain both name and position information. That is, not only do we know what category a particular piece of data is in, we also know what position it occupies relative to all other data. An example is the order of finishing in a race. First, second, third, etc., are categories that racers occupy at the end of the race. But, by knowing the particular category, we know exactly where they finished relative to all other racers. One limitation with ordinal data is that it is impossible to say whether the distance between, say, first and second is the same as between second and third. In other words, the intervals between the points on the ordinal scale cannot be assumed to be equal. Because of that, the ordinal scale is also not useful for computing means.

We call Nominal and Ordinal data scales discrete data scales because the intervals (spacing) between values on these scales are not necessarily equal from one value to the next, and because there are no intermediate points between each value (for example, there can be no position between first and second in a race, or no position between Strongly Agree and Agree on a Likert Scale). These facts make it impossible to use these scales to reliably compute mean values.

Interval Data contain all the attributes of Nominal and Ordinal data as well as possessing one additional attribute -- there are equal intervals between each point on the interval scale. This attribute allows us to reliably compute means. The zero point on an interval data scale, however, does not indicate the absence of the attribute being measured -- it's just another value on the scale. An example is the Fahrenheit (NOT KELVIN) temperature scale: a temperature of 0 on the scale does not mean the absence of all movement/heat, it's just another temperature value.

Ratio Data include all the attributes of Nominal, Ordinal, and Interval data as well as possessing an "absolute zero." This means that there is a point on the ratio scale that indicates the absence of the attribute being measured. Some examples of ratio scales are weight, age, wealth, velocity, etc.

Both the Interval and Ratio data scales have equal intervals between points on their scales and there is a continuous range of values between any two values on the scales (for example, between the numbers 0 and 1 on a continuous scale, there are an infinite number of fractional values). Because of these attributes, we call these continuous data scales. These scales can be reliably used to compute means.
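One practical consequence of the missing absolute zero on an interval scale is that ratios of values are not meaningful. The sketch below illustrates this by converting two Fahrenheit temperatures to Kelvin, a ratio scale; the temperatures and the function name are arbitrary choices for the example.

```python
def fahrenheit_to_kelvin(deg_f):
    """Convert degrees Fahrenheit (interval scale) to kelvins (ratio scale)."""
    return (deg_f - 32) * 5 / 9 + 273.15

# On the Fahrenheit scale, 80 looks like "twice" 40 ...
naive_ratio = 80 / 40

# ... but measured from absolute zero, the ratio is only about 1.08.
true_ratio = fahrenheit_to_kelvin(80) / fahrenheit_to_kelvin(40)

print("ratio of Fahrenheit values:", naive_ratio)
print("ratio of Kelvin values:", round(true_ratio, 3))
```

Differences and means still work on the interval scale, because they depend only on equal spacing, not on where zero sits.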

Why have we been talking so much about the ability to reliably compute means? Because some of the most powerful statistics available to analyze data require the computation of mean values in the data (for example, if you are looking for statistically significant differences between means of two or more groups). If these very powerful statistics are applied to nominal or ordinal data sets, their results will be unreliable because of the problems associated with computing means of these types of data, as illustrated above. So, in order to know if the researcher is using an appropriate statistical test on data he/she has collected, determine the type of data being collected to see if it is discrete or continuous. Knowing the data type is not the only thing needed to assess the appropriateness of a statistical test, but it is one of the most important.

You also need to consider the research design when looking for the appropriate statistical test.

Parametrics Vs. Nonparametrics

There are two classes of inferential statistics: parametric and nonparametric. The more powerful class is parametric. What makes parametric statistics so powerful is their ability to estimate and cancel out random sampling error. Parametric statistics can do this because they rely on certain assumptions about the population containing the attribute(s) being measured:

  • the attribute is normally distributed throughout the population
  • the variability in the attribute is fairly evenly distributed throughout the population
  • the data are continuous (ratio or interval), so that the sample mean and standard deviation are meaningful

Researchers also study problems involving variables that cannot be measured on continuous scales, such as questions concerning attitudes and preferences. It is not appropriate to use parametric statistics with these variables because the assumptions listed above are often violated: the population is not normally distributed, and ordinal or nominal data are used. To enable researchers to study these sorts of questions, a different class of statistics was developed that does not rely on these assumptions about population parameters. They are called nonparametric statistics.

Because they do not rely on assumptions about the population, these statistics cannot estimate or cancel random sampling error. They are generally weaker than parametric tests. What this means is that a difference between two or more groups or a relationship between two variables must usually be larger to register as statistically significant with a nonparametric statistic than with a comparable parametric statistic.

Despite their lack of statistical power, nonparametric statistics are ideal for use with variables that generate discrete data (nominal or ordinal data). So, they permit researchers to answer a wider range of questions than would be allowed with parametric statistics alone. The following table shows some of the more typical parametric and nonparametric statistical tests used in social science research today. They are categorized by parametric/nonparametric and by the two primary types of inferential studies: group difference and relationship (association) studies.

Group Difference Studies

Parametric Tests

  • t-test

Nonparametric Tests

  • Chi Square
  • Mann-Whitney U Test
  • Wilcoxon signed-rank test
  • Kruskal-Wallis test

Relationship (Association) Studies

Parametric Tests

  • Pearson Correlation Coefficient
  • Correlation ratio, eta

Nonparametric Tests

  • Contingency Coefficient
  • Rank-difference correlation, rho
  • Kendall's tau
  • Biserial correlation
  • Widespread biserial correlation
  • Point biserial correlation
  • Tetrachoric correlation
  • Phi coefficient
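As an example of one nonparametric relationship statistic from the list, here is a sketch of the rank-difference correlation (rho) computed by hand. The data are invented, the helper names are hypothetical, and the simple formula used here assumes there are no tied scores.

```python
def ranks(values):
    """Rank each value from 1 (smallest) upward; assumes no ties."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]

def rank_difference_rho(x, y):
    """Rank-difference correlation: 1 - 6 * sum(d^2) / (n * (n^2 - 1)),
    where d is the difference between paired ranks."""
    n = len(x)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

# Hypothetical data: hours studied and exam score for five students.
hours = [2, 5, 1, 8, 4]
scores = [55, 70, 60, 90, 65]

rho = rank_difference_rho(hours, scores)
print("rho:", round(rho, 2))
```

Because only the ranks enter the calculation, the statistic makes no use of the size of the gaps between scores, which is why it suits ordinal data.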

Appropriate and Inappropriate Uses of Inferential Statistical Tests

An inferential statistical test is appropriately used if the statistical test (parametric or nonparametric) matches the type of data being analyzed. If parametrics are used with discrete data, the researcher may find a statistically significant result that is not real, because the high power of the parametric test depends on assumptions the data do not meet. The trouble with very powerful tools is that if they are used indiscriminately, they sometimes amplify random occurrences to make them appear real.

To determine whether or not a researcher is using an appropriate inferential statistic, examine the data being analyzed and apply the following rule:

If the data being analyzed are discrete in nature, then the most appropriate inferential statistic a researcher can use is a non-parametric statistic. If the data being analyzed are continuous, then either parametric or non-parametric statistics are appropriate.

Multivariate statistics and their application

A bivariate approach investigates the relationship between two variables. A Multivariate analysis considers the relationship of more than two variables. Just think how annoying this would be to remember if it were the other way around.

A Multivariate analysis of variance (MANOVA) is a test that examines whether group differences occur on more than one dependent variable. A MANOVA is similar to running a bunch of t-tests, but without the inflation of the Type I error rate that comes from running many separate tests. You then use a Post Hoc procedure to compare means to one another while controlling for the Type I error.
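The error inflation from running many separate tests can be made concrete. The sketch below assumes the tests are independent and each uses the same alpha; under that assumption the chance of at least one Type I error across k tests is 1 - (1 - alpha)^k. The function name is a hypothetical choice.

```python
def familywise_error(alpha, num_tests):
    """Probability of at least one Type I error across num_tests
    independent tests, each run at the given alpha."""
    return 1 - (1 - alpha) ** num_tests

for k in (1, 3, 5, 10):
    print(k, "tests:", round(familywise_error(0.05, k), 3))
```

With ten separate tests at alpha = .05 the chance of at least one false alarm is about 40 percent, which is the problem a single omnibus test is meant to avoid.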

Factor analysis and its application

A second and widely used multivariate technique is called a factor analysis. In this procedure, a large number of variables are measured and correlated with each other. A correlation matrix may then be used to illustrate which groups of variables cluster together to form factors. A good example is the factors of verbal fluency and spatial skills tests that are clustered in the Wechsler Adult Intelligence Scale. Scores on the tests belonging to each factor correlate well with each other. Tests belonging to different factors correlate much less strongly - giving the factors discriminant validity as well.

This analysis also uncovers factor loadings, which are correlations between each of the tests and each of the identified factors. You would expect a test's loading on its own factor to be high.
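The clustering idea can be seen in a small correlation matrix. The sketch below is not a factor analysis; it only computes the pairwise Pearson correlations that a factor analysis would start from. The four test names and all scores are invented for illustration.

```python
from statistics import mean

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

# Hypothetical scores of six people on four subtests.
tests = {
    "vocabulary":   [10, 12, 14, 16, 18, 20],
    "similarities": [11, 12, 15, 15, 19, 20],
    "block_design": [15, 9, 18, 10, 16, 12],
    "puzzles":      [14, 10, 19, 9, 17, 12],
}

names = list(tests)
matrix = {(a, b): pearson_r(tests[a], tests[b]) for a in names for b in names}

# Print the correlation matrix, one row per test.
for a in names:
    row = " ".join(f"{matrix[(a, b)]:6.2f}" for b in names)
    print(f"{a:<13}{row}")
```

In this made-up data the two verbal tests correlate highly with each other, the two spatial tests correlate highly with each other, and the cross correlations are low, which is the pattern that would suggest two factors.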

The use of meta-analysis in behavioral and social science research

One of the greatest strengths of the scientific method is that results can be replicated. Meta-analysis, a term coined by Gene Glass, is a method of analyzing the results of a group of studies to see what an entire body of research can tell us about the studied phenomenon.

Meta-analysis is a difficult process, because different research studies rarely use the same methodologies and measures.

First, as many studies as possible or as representative a group of studies as possible on a particular phenomenon are collected.

Second, the results of the studies need to be converted to some common metric so that they can be compared to one another. The metric used is called the effect size. This value is derived through a comparison of the observed differences between the results for the experimental group and the control group as measured by some standard unit. The larger the effect size, the larger the difference between the two groups. The use of the standard unit allows comparisons between different groups and outcomes.
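One way to compute an effect size is sketched below. It assumes that the "standard unit" is the standard deviation of the control group, which is one common choice but not the only one, and the scores are invented for illustration.

```python
from statistics import mean, stdev

# Hypothetical outcome scores from a single study.
control = [10, 12, 14, 16, 18]
treatment = [12, 14, 16, 18, 20]

# Difference between group means, expressed in control-group
# standard deviation units (an assumed choice of standard unit).
effect_size = (mean(treatment) - mean(control)) / stdev(control)
print("effect size:", round(effect_size, 3))
```

Because the result is expressed in standard deviation units rather than raw score points, effect sizes from studies that used different measures can be placed on the same scale.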

Third, the researchers develop a system to code the various dimensions of the study including a description of the subjects, type of independent variable used, research design selected, type of outcome measured and conclusions reached.

Finally, a variety of descriptive and correlational techniques are used to examine the outcomes of the studies as a whole. The researcher looks for trends or commonalities in the direction of the outcomes across the factors that were identified and coded according to the previous steps.

A classic example is the oft-cited meta-analysis by Smith and Glass (1977) showing that while psychotherapy works better than a placebo, there was little or no significant difference between the types of therapy used.
