https://www.uws.edu.au/observatorypenrith
Statistics Overview
Dr Mahsa Razavi
Week 9
2020
Week 7
‘Statistics is a body of methods and theory that is applied to quantitative
data’ (Collis and Hussey, 2014, p. 226) so you will need to quantify any
qualitative data
Two main branches
• Descriptive statistics are a group of statistical methods used to
summarize, describe or display quantitative data (may be sufficient for
an undergraduate research project)
• Inferential statistics are a group of statistical methods used to draw
conclusions about a population from quantitative data relating to a
random sample
A statistic is a number that describes a sample
A parameter is a number that describes a population
Introduction
Descriptive Statistics
A frequency distribution shows us a summarized grouping of data
divided into mutually exclusive classes and the number of occurrences
in a class.
It is a way of organizing raw (unorganized) data, e.g. to show the results of an
election, the incomes of people in a certain region, sales of a product within
a certain period, or student loan amounts of graduates.
Some of the graphs that can be used with frequency distributions are
histograms, line charts, bar charts and pie charts.
Frequency distributions are used for both qualitative and quantitative
data.
Simple Frequency Distribution
These are the numbers of newspapers sold at a local shop over the last 10
days:
22, 20, 18, 23, 20, 25, 22, 20, 18, 20
Count how many times each number occurs:
Papers Sold Frequency
18 2
19 0
20 4
21 0
22 2
23 1
24 0
25 1
Simple Frequency Distribution
It is also possible to group the values. Here
they are grouped in 5s:
Papers Sold Frequency
15-19 2
20-24 7
25-29 1
Binning by 5
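The two tables above can be reproduced with a short Python sketch. It uses the newspaper data from the slide; the variable names and the choice of `collections.Counter` are illustrative assumptions, not part of the original material.

```python
from collections import Counter

# Newspapers sold over the last 10 days (data from the slide).
papers_sold = [22, 20, 18, 23, 20, 25, 22, 20, 18, 20]

# Simple frequency distribution: count each value, and also list
# values inside the observed range that never occur (frequency 0).
counts = Counter(papers_sold)
simple = {value: counts[value]
          for value in range(min(papers_sold), max(papers_sold) + 1)}

# Grouped frequency distribution with a class width of 5 (15-19, 20-24, ...).
width = 5
grouped = Counter((value // width) * width for value in papers_sold)
grouped_labels = {f"{low}-{low + width - 1}": n
                  for low, n in sorted(grouped.items())}

print(simple)
print(grouped_labels)
```

Because each value falls into exactly one class, the classes are mutually exclusive and the frequencies still sum to 10.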
Frequency distribution
• A frequency is the number of observations for a
particular data value in a variable (Collis and Hussey,
2014, p. 235)
– A frequency distribution is an array that summarizes the
frequencies for all the data values in a particular variable
– A percentage frequency distribution is a descriptive
statistic that summarizes a frequency as a proportion of
100
• E.g. the survey found that 633 companies out of 790 in
the sample had a turnover of less than £1m
– Percentage frequency = (633 / 790) x 100 ≈ 80%
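As a quick check of the percentage frequency calculation, here is a minimal Python sketch. The function name `percentage_frequency` is hypothetical and chosen only for illustration.

```python
def percentage_frequency(frequency: int, total: int) -> float:
    """Express a frequency as a percentage of the total number of observations."""
    if total <= 0:
        raise ValueError("total must be positive")
    return frequency / total * 100


# 633 companies out of 790 had a turnover of less than £1m (figures from the slide).
pct = percentage_frequency(633, 790)
print(f"{pct:.1f}%")
```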
Measures of central tendency
Several statistics can be used to represent the
“center” of a distribution. These statistics are
commonly referred to as measures of central
tendency.
Measures of central tendency
• Suppose these 6 marks were your exam results
| Module 1 | Module 2 | Module 3 | Module 4 | Module 5 | Module 6 |
| 82% | 78% | 80% | 64% | 70% | 64% |
– The mean is the arithmetic average: 438 / 6 = 73%
– The median is the mid-value of the data values
arranged in size order: 64% 64% 70% 78% 80% 82%,
so between 70% and 78% = (70 + 78) / 2 = 74%
– The mode is the most frequently occurring value =
64%
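The three measures for the six exam marks can be verified with Python's standard `statistics` module. This is only a sketch; the variable names are illustrative.

```python
import statistics

marks = [82, 78, 80, 64, 70, 64]  # Modules 1 to 6 from the slide

mean_mark = statistics.mean(marks)      # sum of marks (438) divided by 6
median_mark = statistics.median(marks)  # average of the two middle values
mode_mark = statistics.mode(marks)      # most frequently occurring value

print(mean_mark, median_mark, mode_mark)
```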
Measures of dispersion
• Measures of dispersion should only be calculated for
ratio or interval variables
– The range represents the difference between the
maximum value (the upper extreme or Eu) and the
minimum value (the lower extreme or EL) in a frequency
distribution arranged in size order (Range = Eu – EL)
– The interquartile range represents the difference between
the upper quartile (Q3) and the lower quartile (Q1) which is
the spread of the middle 50% of a frequency distribution
arranged in size order (Interquartile range = Q3 – Q1)
• But neither takes account of all the data values
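A minimal sketch of both measures, reusing the exam marks from the central tendency slide as sample data. Note that several conventions exist for locating quartiles in small samples; `statistics.quantiles` uses its default "exclusive" method here, so other software may report slightly different Q1 and Q3 values.

```python
import statistics

marks = [82, 78, 80, 64, 70, 64]  # exam marks used as illustrative data

# Range = Eu - EL (upper extreme minus lower extreme).
data_range = max(marks) - min(marks)

# Quartile cut points Q1, Q2 (the median) and Q3.
q1, q2, q3 = statistics.quantiles(marks, n=4)

# Interquartile range = Q3 - Q1 (spread of the middle 50%).
iqr = q3 - q1

print(data_range, q1, q3, iqr)
```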
Measures of dispersion
• The standard deviation (sd, stdv) takes account of all the
data values
– It is based on the error and the variance, which are two statistical
models used to measure how well the mean represents the data
– In this context, the error is the difference between the mean and
the data value (observation) and the variance is the average
squared error between the mean and the data
• The standard error (se) is the standard deviation of
the means of different samples
– A large standard error relative to the sample mean suggests the
sample might not be representative of the population
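The sketch below computes the errors, the sample standard deviation and an estimate of the standard error for the exam marks used earlier. The formula se = sd / sqrt(n) is the usual estimate of the standard error of the mean from a single sample; it is not stated on the slide and is added here as an assumption.

```python
import math
import statistics

marks = [82, 78, 80, 64, 70, 64]  # illustrative sample (exam marks from an earlier slide)

mean_mark = statistics.mean(marks)

# Error (deviation) of each observation from the mean; these always sum to zero.
errors = [x - mean_mark for x in marks]

# Sample standard deviation (divides the squared errors by N - 1).
sample_sd = statistics.stdev(marks)

# Estimated standard error of the mean: sd / sqrt(n).
standard_error = sample_sd / math.sqrt(len(marks))

print(errors, round(sample_sd, 2), round(standard_error, 2))
```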
Interquartile Range (IQR)
Provides a measure of the spread of the middle 50% of the scores.
The IQR is defined as the 75th percentile minus the 25th percentile (Q3 – Q1).
The interquartile range plays an important role in the
graphical method known as the boxplot.
It is easy to compute and extreme scores in the distribution have
much less impact but it suffers as a measure of variability because
it discards too much data.
Researchers want to study variability while eliminating
scores that are likely to be accidents.
Variance
The variance is a measure based on the deviations of individual scores from
the mean. As noted in the definition of the mean, however, simply summing the
deviations will result in a value of 0. To get around this problem the variance is
based on squared deviations of scores about the mean.
The sample variance is then:
$$s^2 = \frac{\sum (X - \bar{X})^2}{N - 1}$$
When the deviations are squared, the relative distance of scores from the mean
is preserved while negative values are eliminated. Then to
control for the number of subjects in the distribution, the sum of the squared
deviations, $\sum (X - \bar{X})^2$, is divided by N (population) or by N – 1 (sample). The
result is the average of the squared deviations and it is called the
variance.
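The steps described for the variance can be traced in a few lines of Python. The data are the newspaper sales from the frequency distribution slide, used purely for illustration; the cross-check against `statistics.variance` and `statistics.pvariance` is an added convenience.

```python
import statistics

scores = [22, 20, 18, 23, 20, 25, 22, 20, 18, 20]  # illustrative data
n = len(scores)
mean = sum(scores) / n

# Deviations from the mean sum to zero, so they cannot measure spread directly.
deviations = [x - mean for x in scores]

# Squaring removes the negative signs before summing.
sum_of_squares = sum(d ** 2 for d in deviations)

population_variance = sum_of_squares / n        # divide by N
sample_variance = sum_of_squares / (n - 1)      # divide by N - 1

print(round(sum(deviations), 10), population_variance, sample_variance)
```

Dividing by N – 1 gives a slightly larger value than dividing by N, and the difference shrinks as the sample grows.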
Standard deviation
• The standard deviation (sd, stdv) is the square root of
the variance
– A large standard deviation relative to the mean suggests the
mean does not represent the data well
• The standard deviation is related to a theoretical
frequency distribution known as the normal distribution
– It is bell-shaped and symmetrical and has tails extending
indefinitely either side of the centre
– The mean, median and mode coincide at the centre
– About 68% of the data will fall within 1 sd of the mean, about 95% will fall
within 2 sd and about 99.7% will fall within 3 sd of the mean
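The 68 / 95 / 99.7 percentages can be checked against the theoretical normal distribution using `statistics.NormalDist` from the Python standard library. The helper name `share_within` is hypothetical.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)


def share_within(k: float) -> float:
    """Proportion of a normal distribution lying within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"within {k} sd: {share_within(k):.1%}")
```

The result does not depend on the particular mean or standard deviation, which is why the rule can be quoted for any normal distribution.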
Measures of normality
• A normal distribution is a theoretical frequency
distribution – it is a mathematical model representing
perfect symmetry, against which empirical data can be
compared. If you intend to use parametric statistics,
you should check that the distributions are
approximately normal.
• A common way to do this is to check the skew and kurtosis
measures from the frequency output from SPSS. For a
relatively normal distribution:
• skew ≈ 0
• kurtosis ≈ 0
• If a distribution deviates markedly from normality then
you take the risk that the statistic will be inaccurate. The
safest thing to do is to use an equivalent non-parametric
statistic.
Normal distribution
Skew & Kurtosis
A positive skew (as a rule of thumb, greater than 1) indicates a
distribution with a longer right-hand tail than
a normal distribution, i.e. the peak is more
towards the lower values and the mean is greater than
the median
A positive kurtosis (as a rule of thumb, greater than 1) indicates
a distribution that is more peaked than a
normal distribution
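To make the two measures concrete, the sketch below computes skew and excess kurtosis from the moments of the data. These are the simple population (moment) formulas and are an assumption made for illustration; SPSS applies small-sample adjustments, so its output will differ somewhat for small data sets. The function name and both data sets are hypothetical.

```python
import statistics


def skew_and_excess_kurtosis(data):
    """Moment-based skew and excess kurtosis (both are 0 for a normal distribution)."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n  # second central moment (variance)
    if m2 == 0:
        raise ValueError("all values are identical; shape measures are undefined")
    m3 = sum((x - mean) ** 3 for x in data) / n  # third central moment
    m4 = sum((x - mean) ** 4 for x in data) / n  # fourth central moment
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3


symmetric = [1, 2, 3, 4, 5]          # hypothetical symmetric data
right_tailed = [1, 1, 2, 2, 3, 10]   # hypothetical data with a long right-hand tail

skew_sym, kurt_sym = skew_and_excess_kurtosis(symmetric)
skew_right, kurt_right = skew_and_excess_kurtosis(right_tailed)

print(skew_sym, skew_right)
print(statistics.mean(right_tailed) > statistics.median(right_tailed))
```

In the right-tailed data the single large value pulls the mean above the median, which matches the description of positive skew.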
[Figure: mean, median and mode]
Inferential Statistics
Stating the objectives
• The analysis is guided by the hypotheses you
developed from the theoretical framework you
described in your literature review
– A hypothesis is a proposition that can be tested for
association or causality against empirical evidence
(data based on observation or experience, eg survey
data)
• Each hypothesis is formulated as a statement
about a relationship between two variables and
can be expressed in the null or the alternative form
The null and the alternative hypothesis
• The null hypothesis (H0) states that the variables
are independent of one another (ie there is no
association)
• The alternative hypothesis (H1) states that the
variables are associated (ie there is an
association)
– H0 is the default
– H1 is accepted only if the test result provides
significant evidence to reject H0
• If you predict the IV has an effect on the DV in a
Population parameters
• A random sample is needed to obtain
estimates of theoretical population parameters
• Inferential statistics include parametric and
non-parametric tests, and you need to
examine your population to determine whether
parametric tests are appropriate
– Parametric tests make certain assumptions about
the distributional characteristics of the population
under investigation
Parametric tests
• To use parametric tests, four basic assumptions
about the research data must be met (Field, 2000)
1. The variable is measured on a ratio or interval scale
2. The data are from a population with a normal
distribution
3. There is homogeneity of variance (variances are
stable in a test across groups of subjects, or the
variance of one variable is stable at all levels in a test
against another variable)
4. The data values in the variable are independent (they
come from different cases, or the behaviour of one
subject does not influence the behaviour of another)
Non-parametric tests
• The reason why these assumptions are so
important is that the calculations that underpin
parametric tests are based on the mean of the
data values
• However, non-parametric tests do not rely on the
data meeting these assumptions because the
statistical software typically arranges the data values in
size order and performs the calculations on the
ranks rather than the data values
• Non-parametric tests must be used for
– Variables measured on a ratio or interval scale that do
not have a normal distribution
– All ordinal or nominal variables
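The idea of calculating on ranks rather than data values can be sketched as follows. The function `average_ranks` is a hypothetical helper; giving tied values the mean of the ranks they occupy is a common convention, assumed here rather than taken from the slides.

```python
def average_ranks(values):
    """Rank values from smallest (rank 1) upward; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        # Find the end of the run of values tied with the one at `pos`.
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared_rank = (pos + end) / 2 + 1  # mean of the 1-based positions pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared_rank
        pos = end + 1
    return ranks


marks = [82, 78, 80, 64, 70, 64]  # exam marks used as illustrative data
print(average_ranks(marks))
```

Because only the ordering is used, an extreme value affects its rank no more than any other largest value would, which is why rank-based tests are less sensitive to non-normal data.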
Parametric vs Non-Parametric
• The basic distinction for parametric versus non-parametric is:
• If your measurement scale is nominal or ordinal then you use
non-parametric statistics
• If you are using interval or ratio scales and the data are approximately
normally distributed, you use parametric statistics; otherwise use a
non-parametric equivalent.
Bivariate analysis
• A bivariate analysis tests data from two
variables
• It can examine a hypothesised relationship
between two measured variables, as
suggested by a theoretical framework
Bivariate and multivariate analysis
| Purpose | For parametric data | For non-parametric data |
| Tests of difference for independent or dependent samples | t-test | Mann-Whitney test |
| Tests of association between two nominal variables | Not applicable | Chi-square test |
| Tests of association between two quantitative variables | Pearson’s correlation | Spearman’s correlation |
| Predicting an outcome from one or more variables | Linear regression | Logistic regression |
Correlation
• ‘Correlation is a measure of the direction and strength of
association between two quantitative variables.
Correlation may be linear or non-linear, positive or
negative’ (Collis and Hussey, 2014, p. 270)
• Most statistics try to fit straight-line models to the data
and the correlation coefficient measures the linear
dependency of the two variables
– +1 represents perfect positive linear association (both
variables increase together)
– 0 represents no linear association
– -1 represents perfect negative linear correlation (one
variable increases as the other decreases)
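The three reference points (+1, 0 and -1) can be demonstrated with a small hand-written Pearson correlation function. The function name and the data sets are hypothetical and chosen only to produce the extreme cases.

```python
import math


def pearson_r(x, y):
    """Pearson's correlation coefficient for two equal-length lists of numbers."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    co_deviation = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return co_deviation / math.sqrt(ss_x * ss_y)


r_positive = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])        # both increase together
r_negative = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])        # one rises as the other falls
r_curved = pearson_r([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])  # y = x squared

print(r_positive, r_negative, r_curved)
```

The last case shows why the coefficient measures only linear dependency: y is completely determined by x, yet the linear correlation is 0.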
Linear vs Logistic regression
• Linear regression is a measure of the ability
of an IV to predict an outcome in a DV where
there is a linear relationship between them
• Logistic regression is used where the DV is a
dummy variable and one or more of the IVs
are continuous quantitative variables (others
can be ordinal or dummy variables)
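For the linear case, a least-squares line with one IV can be fitted in a few lines. This is a sketch under stated assumptions: the data are invented so that the relationship is exactly linear, and `fit_line` is a hypothetical helper. Logistic regression needs an iterative fitting procedure and is not shown.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x for a single IV."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    ss_x = sum((a - mean_x) ** 2 for a in x)
    if ss_x == 0:
        raise ValueError("the IV must vary")
    slope = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / ss_x
    intercept = mean_y - slope * mean_x
    return intercept, slope


iv = [1, 2, 3, 4, 5]    # hypothetical independent variable
dv = [3, 5, 7, 9, 11]   # hypothetical dependent variable (exactly 1 + 2 * iv)

intercept, slope = fit_line(iv, dv)
predicted = intercept + slope * 6  # predicted DV for a new IV value of 6

print(intercept, slope, predicted)
```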
Time series analysis
• Time series analysis is a statistical technique for
forecasting future events from time series data
– A time series is a sequence of measurements of a
variable taken at regular intervals over time
• The purpose of time series analysis is to examine
the trend and any seasonal variation, both of which
can be further analysed using linear regression
– A trend is a consistently upward or downward
movement in time series data
– Seasonal variation is where a pattern in the movement
of time series data repeats itself at regular intervals
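One simple way to separate trend from seasonal variation is to fit a straight line against the time index and then average what is left over for each season. The quarterly figures below are invented for illustration, and this decomposition is only one of several possible approaches.

```python
# Hypothetical quarterly sales for two years (illustrative only).
sales = [105, 99, 100, 108, 113, 107, 108, 116]
period = 4  # four quarters per year
times = list(range(len(sales)))

# Trend: least-squares line of sales against the time index.
mean_t = sum(times) / len(times)
mean_y = sum(sales) / len(sales)
slope = (sum((t - mean_t) * (y - mean_y) for t, y in zip(times, sales))
         / sum((t - mean_t) ** 2 for t in times))
intercept = mean_y - slope * mean_t

# Seasonal variation: average deviation from the trend line for each quarter.
residuals = [y - (intercept + slope * t) for t, y in zip(times, sales)]
seasonal = [sum(residuals[q::period]) / len(residuals[q::period])
            for q in range(period)]

print(round(slope, 2), [round(s, 2) for s in seasonal])
```

A positive slope indicates an upward trend, and the per-quarter averages show which quarters sit consistently above or below it.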
Setting the significance level
• When using a statistical test, we want to be sure that
the effect genuinely exists, but there are two cases
when a test result leads to an incorrect result (an error)
– H0 is true, but the test leads to its rejection (a Type I error,
or false positive)
– H1 is true, but the test fails to reject H0 (a Type II
error, or false negative)
• We specify the critical region that determines whether a
test result is significant by setting the significance level
– At a significance level of 0.05, we are accepting a 5%
probability of a Type I error, i.e. of rejecting H0 when it is
in fact true. The significance level does not by itself fix the
probability of a Type II error, and a significant result does not
mean we can be 95% certain that the effect exists
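The meaning of a 0.05 significance level can be illustrated by simulation: when H0 is true, a test at that level should reject it in roughly 5% of repeated samples. The sketch below assumes a two-sided z-test on the mean of a normal population with known standard deviation 1; the test, the function name and all parameter values are illustrative assumptions, not part of the slides.

```python
import random
from statistics import NormalDist


def type_i_error_rate(alpha=0.05, n=30, trials=4000, seed=1):
    """Share of samples in which a true H0 (population mean = 0) is wrongly rejected."""
    rng = random.Random(seed)
    critical_z = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    rejections = 0
    for _ in range(trials):
        sample = [rng.gauss(0, 1) for _ in range(n)]   # H0 is true by construction
        z = (sum(sample) / n) / (1 / n ** 0.5)         # z statistic with known sd = 1
        if abs(z) > critical_z:
            rejections += 1                            # a Type I error (false positive)
    return rejections / trials


rate = type_i_error_rate()
print(f"observed Type I error rate: {rate:.3f}")
```

The observed rate should be close to alpha. Lowering alpha reduces Type I errors but, for a fixed sample size, makes Type II errors more likely.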