1 Data
Types of Statistical Data
Data Processing
2 Sampling
Sampling Terminology
Sampling Designs
Probability Designs
Simple Random Sampling
Systematic Sampling
Stratified Sampling
Cluster Sampling
Multistage Sampling
Sampling Issues
3 Excel Examples
Dr Wei Wei (Monash University) ETF2121/5912 March 19, 2020 2 / 34
Data
All forms of statistical investigation depend on data.
Suppose that we have a research question to address. We must
ascertain:
which concepts we wish to measure;
whether or not appropriate data are available;
if data are available, what form they are in.
If appropriate data are not available, they must be collected, and we must
decide how to do so.
Once appropriate data are available or have been collected, we must consider
how to present and analyse them.
Primary or Secondary Data
Data are either primary or secondary.
Primary data are collected for a specific need.
Generating them requires a sample survey, experiment or specific study.
Secondary data were collected for some other purpose and are already
available.
External sources of secondary data include government departments,
industry associations, academic institutions, and commercial research
organisations.
Example: information from the Australian Bureau of Statistics (ABS),
Trading Economics, FRED, World Bank Open Data, and so on.
Internal sources of secondary data include sales figures, publication
records or customer evaluations.
Time series, Cross-sectional, and Panel Data
Cross-Sectional Data are data on many units collected at a single point in time,
e.g., GDP of all OECD countries in 2015.
Time Series Data are data on one variable collected at different points
in time,
e.g., annual GDP of Australia from 1985 to 2015.
Panel Data are data on multiple units or variables collected at different
points in time,
e.g., annual GDP of all OECD countries from 1985 to 2015;
e.g., annual GDP and unemployment rate of Australia from 1985 to 2015.
Big Data
Everything we do increasingly leaves a digital trace.
Big data are extremely large and complex data sets that are:
based on everyday activities (such as your shopping trips to
Woolworths, your Twitter conversations, and your smartphone photos);
usually recorded before any research questions are asked.
How is it turned into useful information?
Check this reference out for more information.
Data Processing
Database software such as Microsoft Access, or a query language such as
Structured Query Language (SQL), makes it relatively simple to capture
survey information in a format that can be readily accessed in numerous
ways.
The data can then be presented in various formats or exported to other
analysis software (such as Excel and EViews) for statistical analysis.
Common data formats: .csv, .txt, .xlsx and .dat
We can take a look at the Gender Data.
Sampling Terminology
Sample survey: a process of gathering data from a representative
subset of the (theoretical) population.
Example: opinion polls from market research companies such as Ipsos, and
political opinion polls such as those on FiveThirtyEight.
A questionnaire must be developed. See an example of a US presidential
approval questionnaire here.
A list of accessible population members, also called a frame, must be
compiled.
Sampling units: the members in a frame. Depending on the context,
sampling units could be individual people, households, companies,
cities, etc.
Sampling design: the sample must be selected from the frame (the list of
accessible population members) according to a chosen design.
Sampling Designs
The first step is designing an effective sampling plan that will yield
representative samples of the population under study.
A sampling plan is a description of the approach that will be used to
select a subset from a frame prior to any data collection activity.
Non-Random (Non-Probability) Sampling
Random (Probability) Sampling
Population size: denoted by N
Sample size: denoted by n
Sampling fraction: $c = n/N$
Probability Designs
Simple Random Sampling
Systematic Sampling
Stratified Sampling
Cluster Sampling
Multistage Sampling
Simple Random Sampling
To select a simple random sample:
Number the members in the frame consecutively.
Use a random number generator to select each member of the sample
by number.
Each member has a probability of $n/N$ of being selected.
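The two steps above can be sketched in Python. This is a minimal illustration, not the course's Excel method: the frame of 67 labels is a hypothetical stand-in for the cereal list, and the function name `simple_random_sample` is an assumption made for this example.

```python
import random

def simple_random_sample(frame, n, seed=None):
    """Draw n distinct members from the frame, each equally likely."""
    rng = random.Random(seed)
    # Step 1: number the members of the frame consecutively, 1..N.
    numbers = range(1, len(frame) + 1)
    # Step 2: let a random number generator pick n distinct numbers
    # (sampling without replacement).
    chosen = rng.sample(numbers, n)
    return [frame[i - 1] for i in sorted(chosen)]

# Hypothetical frame: 67 product labels standing in for the cereal list.
frame = [f"product_{i}" for i in range(1, 68)]
sample = simple_random_sample(frame, 10, seed=1)
print(sample)
```

Fixing `seed` makes the draw reproducible, which plays the same role as freezing random numbers in a spreadsheet.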
Example 2.1
A dietician wants to compare cereal products to assess whether the
amounts of calories, sodium, fibre, carbs and sugars (in milligrams), as well
as the shelf life (in years), stated on the boxes are accurate.
In order to do so, she wants to select a random sample of size 10 from
67 commonly sold cereal products and subject their contents to
rigorous laboratory testing. The data is in the file Cereal Data.xlsx.
Draw a random sample of 10 products.
Random Sampling
Figure: Scott Adams’s Dilbert cartoon
Systematic Sampling
The units in the population are randomly ordered (at least with respect
to the characteristics you are measuring).
The sampling interval $k = N/n$ is determined.
A random number between 1 and k is selected.
The population member corresponding to this random number is selected.
Thereafter every kth member is selected until the sample of size n is
obtained.
Example 2.2
Consider the data from the Excel file Cereal Data.xlsx in Example
2.1. Draw a systematic sample of size 10 from the 67 cereal products.
Determine k: $k = N/n = 67/10 = 6.7$.
If k is not a whole number, it must be truncated. In this case, k = 6
Select a random number between 1 and 6.
Suppose the random number obtained is 4. Then first select the 4th
cereal and every 6th thereafter, namely the 10th, 16th, etc., until
10 cereals are selected (the last is the 58th).
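The calculation in this example can be reproduced with a short Python sketch. The function name `systematic_positions` and its parameters are assumptions made for illustration. It returns 1-based positions in the frame rather than the cereals themselves.

```python
import random

def systematic_positions(N, n, start=None, seed=None):
    """Return the 1-based frame positions of a systematic sample of size n."""
    k = N // n  # sampling interval, truncated to a whole number
    if start is None:
        # Random start between 1 and k inclusive.
        start = random.Random(seed).randint(1, k)
    # The start, then every kth member until n members are chosen.
    return [start + j * k for j in range(n)]

# N = 67, n = 10, so k = 6; the random start is fixed at 4 as in the example.
positions = systematic_positions(67, 10, start=4)
print(positions)  # [4, 10, 16, 22, 28, 34, 40, 46, 52, 58]
```

Because k is truncated, the largest possible position is n × k, which never exceeds N, so the list always stays inside the frame.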
Example 2.3
Consider the wages of employees for a construction company as our
population. The company has 200 employees and a sample of 20
employees is required. The employees are organised by teams; each
team consists of a team leader and 9 other workers.
Example 2.3: Systematic sampling
Suppose we use systematic sampling: N = 200, n = 20, k = 10.
Assume the frame lists the employees team by team, with each team leader
first.
If the random number 5 is selected, the sample consists of only other
workers.
If the random number 1 is selected, the sample consists of only team
leaders.
No random start yields a sample that includes both team leaders and other
workers.
Hence a systematic sample may not be a representative sample under
these circumstances.
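The problem can be checked by brute force over every possible random start. The sketch below assumes a hypothetical frame in which the 20 teams are listed one after another with the team leader first in each team. That ordering is an assumption, since the example does not state how the list is arranged.

```python
# Hypothetical frame: 20 teams of 10, each listed with its leader first.
frame = []
for team in range(1, 21):
    frame.append((team, "leader"))
    frame.extend((team, "worker") for _ in range(9))

N, n = len(frame), 20
k = N // n  # sampling interval: 200 // 20 = 10

# For each possible random start, record which roles end up in the sample.
roles_by_start = {}
for start in range(1, k + 1):
    positions = range(start, N + 1, k)  # start, start + k, start + 2k, ...
    roles_by_start[start] = {frame[p - 1][1] for p in positions}

for start, roles in roles_by_start.items():
    print(start, sorted(roles))
```

Every start produces a sample with a single role: start 1 yields only leaders, and starts 2 to 10 yield only workers. The sampling interval coincides with the period of the list, which is exactly the situation in which systematic sampling fails.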
Stratified Sampling
Identify various sub-populations referred to as strata within the total
population.
Select a simple random subsample or a systematic subsample from
each stratum instead of from the entire population.
Useful when there is considerable variation between the various
strata and relatively little variation within a given stratum.
Most appropriate when the population exhibits large heterogeneity.
Example 2.4
Suppose that a radio station wants information about FM Radio
listeners between the ages of 20 and 50 years in Melbourne. Assuming
they have access to a frame of FM Radio listeners, discuss a possible
method of drawing a sample of FM Radio listeners.
Stratified Sampling: from frame to strata
We first assign all members in the frame into appropriate strata.
Most important aspects are:
Deciding on the characteristics that determine each stratum.
Determining the number of strata needed.
Rule:
Each element in the frame must be included in only one stratum.
All elements in the frame must appear in some stratum.
Let $N_i$ denote the number of members in stratum $i$; then
$\sum_i N_i = N$
Stratified Sampling: deciding sample size in each stratum
Each stratum $i$ has $N_i$ members. We need to choose a sample size $n_i$
for each stratum $i$ such that $\sum_i n_i = n$.
Proportional allocation
Sample size in each stratum is proportional to the total number of
elements in that stratum.
The advantage of proportional sample sizes is that they are very easy
to determine.
The disadvantage is that they ignore differences in variability among
the strata.
Disproportional allocation
Takes into account the standard deviation of the stratum when
determining sample size.
Proportional Allocation
$n_1 = \frac{N_1}{N} \times n$
$n_2 = \frac{N_2}{N} \times n$
$n_3 = \frac{N_3}{N} \times n$
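The formulas rarely give whole numbers, so some rounding rule is needed to keep the stratum sample sizes summing to n. The slides do not specify one. The sketch below assumes a largest-remainder rule (round everything down, then give the leftover units to the strata with the largest fractional parts). The function name and the stratum sizes are hypothetical.

```python
def proportional_allocation(stratum_sizes, n):
    """Split a total sample of size n across strata in proportion to N_i."""
    N = sum(stratum_sizes.values())
    # Whole-number part of (N_i / N) * n, computed with integer arithmetic.
    alloc = {s: size * n // N for s, size in stratum_sizes.items()}
    # What was lost by rounding down, per stratum (in units of 1/N).
    remainders = {s: size * n % N for s, size in stratum_sizes.items()}
    leftover = n - sum(alloc.values())
    # Assumed rule: leftover units go to the largest remainders.
    for s in sorted(remainders, key=remainders.get, reverse=True)[:leftover]:
        alloc[s] += 1
    return alloc

# Hypothetical strata: 20 team leaders and 180 other workers, n = 20.
print(proportional_allocation({"leaders": 20, "workers": 180}, 20))
# {'leaders': 2, 'workers': 18}

# Hypothetical strata of sizes 30, 25 and 12 (N = 67), n = 10.
print(proportional_allocation({"A": 30, "B": 25, "C": 12}, 10))
# {'A': 4, 'B': 4, 'C': 2}
```

In the second case the exact values are about 4.48, 3.73 and 1.79, so rounding down gives 4 + 3 + 1 = 8 and the two leftover units go to C and B.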
Stratified Sampling: selecting members in strata
Random selection within a stratum:
Treat each stratum as a separate population;
for each $i$, select $n_i$ out of $N_i$ using simple random sampling or
systematic sampling.
Combine all samples from strata to obtain the sample.
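A minimal sketch of this selection step, assuming the stratum sample sizes have already been decided and that simple random sampling is used within each stratum. The strata, the labels and the allocation are hypothetical.

```python
import random

def stratified_sample(strata, allocation, seed=None):
    """Draw allocation[name] members at random from each stratum and combine."""
    rng = random.Random(seed)
    sample = []
    for name, members in strata.items():
        # Treat each stratum as its own population.
        sample.extend(rng.sample(members, allocation[name]))
    return sample

# Hypothetical strata: 20 team leaders and 180 other workers.
strata = {
    "leaders": [f"leader_{i}" for i in range(1, 21)],
    "workers": [f"worker_{i}" for i in range(1, 181)],
}
allocation = {"leaders": 2, "workers": 18}  # assumed stratum sample sizes

sample = stratified_sample(strata, allocation, seed=7)
print(sample)
```

Unlike the systematic sample with k = 10, this sample contains both groups by construction, in the proportions set by the allocation.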
Example 2.5
Consider the sampling problem from Example 2.3. The company has
200 employees and a sample of 20 employees is required. The
employees are organised by teams; each team consists of a team leader
and 9 other workers. How would you design a stratified sample?
Cluster Sampling
Elements in the population are divided into a number of clusters or groups.
Commonly based on geographical characteristics.
A random sample of these clusters is chosen.
Single-stage sampling: all people in the chosen clusters are
surveyed.
Two-stage sampling: a random sample of items is chosen within
each sampled cluster.
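Both variants can be sketched with one Python function. The names `cluster_sample` and `per_cluster` are assumptions for this illustration, and the clusters are hypothetical teams of 10 employees.

```python
import random

def cluster_sample(clusters, n_clusters, per_cluster=None, seed=None):
    """Randomly choose clusters, then take all members (single-stage)
    or a random subsample of each chosen cluster (two-stage)."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    sample = []
    for c in chosen:
        members = clusters[c]
        if per_cluster is None:
            sample.extend(members)  # single-stage: survey everyone
        else:
            sample.extend(rng.sample(members, per_cluster))  # two-stage
    return chosen, sample

# Hypothetical clusters: 20 teams of 10 employees each.
clusters = {t: [f"team{t}_emp{j}" for j in range(1, 11)] for t in range(1, 21)}

teams_1, single_stage = cluster_sample(clusters, n_clusters=2, seed=0)
teams_2, two_stage = cluster_sample(clusters, n_clusters=4, per_cluster=5, seed=0)
print(len(single_stage), len(two_stage))  # 20 20
```

Both calls produce 20 employees: two whole teams in the single-stage case, and five employees from each of four teams in the two-stage case.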
Cluster Sampling vs Stratified Sampling
Both cluster sampling and stratified sampling divide the frame into
subgroups before sampling.
Cluster sampling works best if
the clusters are similar to each other (homogeneous across groups);
each cluster is as heterogeneous as possible within.
Stratified sampling works best if
the groups differ from each other (heterogeneous across groups);
each group is homogeneous within.
Example 2.6
Consider the sampling problem from Example 2.3. The company has
200 employees and a sample of 20 employees is required. The
employees are organised by teams; each team consists of a team leader
and 9 other workers. How would you design a cluster sample for this problem?
Multistage Sampling
Many real sampling applications are more complex than the ones
described so far, resulting in multistage sampling schemes.
For example, the Gallup organization uses multistage sampling in
nationwide surveys.
First stage: draw a random sample of 300 locations
Second stage: city blocks or other geographical areas randomly selected
from the first stage locations
Third stage: draw a simple random sample (SRS) or systematic sample
of households from each second stage area
A total of about 1500 households makes up a typical Gallup poll.
Errors in sampling
Non-sampling errors: arise from the way the data are collected and
processed rather than from sampling itself, and can leave the sample
unrepresentative of the population. Sources include a faulty sampling
frame, non-response bias, failure of respondents to understand the
questions, and errors in recording and processing the data.
Sampling errors: arise because a sample statistic cannot be expected
to agree exactly with the unknown population parameter it is designed
to estimate, since a sample is only a subset of the population.
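Sampling error can be seen in a small simulation. The population below is hypothetical (1000 values generated for illustration), which lets us compute the population mean that would normally be unknown and compare it with the means of repeated simple random samples.

```python
import random
import statistics

rng = random.Random(0)

# Hypothetical population of 1000 values; in practice mu would be unknown.
population = [rng.gauss(50, 10) for _ in range(1000)]
mu = statistics.mean(population)

def sampling_error(n):
    """Sample mean minus population mean for one simple random sample."""
    sample = rng.sample(population, n)
    return statistics.mean(sample) - mu

# Average absolute sampling error over 500 repeated samples of each size.
errors_small = [abs(sampling_error(10)) for _ in range(500)]
errors_large = [abs(sampling_error(200)) for _ in range(500)]
print(statistics.mean(errors_small), statistics.mean(errors_large))

# A census (n = N) has no sampling error at all.
census_error = sampling_error(len(population))
print(census_error)
```

The sample mean almost never equals the population mean exactly, but the typical gap shrinks as n grows and vanishes when the whole population is observed. Non-sampling errors, by contrast, are not reduced by taking a larger sample.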
Excel Example
A dietician wants to compare cereal products to assess whether the
amounts of calories and fibre stated on the boxes are accurate.
In order to do so, she wants to select a random sample of size 10 from
67 commonly sold cereal products and subject their contents to
rigorous laboratory testing. The data is in the file Cereal Data.xlsx.
Use Excel to:
Draw a simple random sample of 10 products.
Draw a systematic sample of 10 products.
Using manufacturer as the criterion for the strata, draw a stratified sample
of 10 products using proportional allocation.
Excel: Simple random sampling
1 Click Data – Data Analysis – Sampling to generate 10 indexes for the
selected cereals.
2 Use the Excel function VLOOKUP to fill in the cereal name, calories and
fibre content of the selected cereals.
Syntax: =VLOOKUP(value, table, col_index, [range_lookup])
value – the value to look for in the first column of the table.
table – the table from which to retrieve a value.
col_index – the number of the column in the table from which to retrieve a value.
range_lookup – [optional] TRUE = approximate match (default);
FALSE = exact match.
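The difference between the two range_lookup settings matters when looking up sampled indexes. The Python sketch below only emulates the documented behaviour for illustration and is not Excel's implementation. It assumes the first column is sorted in ascending order, returns `None` where Excel would show #N/A, and uses a made-up table.

```python
from bisect import bisect_right

def vlookup(value, table, col_index, range_lookup=True):
    """Emulate VLOOKUP on a list of rows; col_index is 1-based as in Excel."""
    keys = [row[0] for row in table]
    if range_lookup:
        # Approximate match: the largest key that is <= value.
        # Only meaningful when the first column is sorted ascending.
        pos = bisect_right(keys, value) - 1
        if pos < 0:
            return None  # value is smaller than every key
    else:
        # Exact match: the key must be present in the first column.
        if value not in keys:
            return None
        pos = keys.index(value)
    return table[pos][col_index - 1]

# Hypothetical table: index, cereal name, calories. Index 3 is missing.
table = [(1, "Cereal A", 110), (2, "Cereal B", 90), (4, "Cereal D", 120)]

print(vlookup(3, table, 2))         # Cereal B (approximate match falls back to index 2)
print(vlookup(3, table, 2, False))  # None (no exact match)
print(vlookup(4, table, 3, False))  # 120
```

Passing FALSE for range_lookup is the safer choice here: a missing index then produces an error that can be noticed, rather than silently returning the neighbouring row.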
Excel: Systematic sampling
1 Determine k: $k = N/n = 67/10 = 6.7$.
If k is not a whole number, it must be truncated. In this case, k = 6,
which the Excel function =TRUNC(67/10) also returns.
2 Generate a random number between 1 and 6 using the Excel function
=RANDBETWEEN(bottom, top), here =RANDBETWEEN(1, 6). Freeze the random
number with the Copy / Paste Special / Values commands.
3 Generate the series of increasing indexes: start from the random number
and add k each time until there are 10 indexes.
4 Use VLOOKUP to fill in the cereal name, calories and fibre content of
the selected cereals.
Excel: Stratified sampling
1 Sort the data by manufacturer.
2 Obtain the name (characteristic) of each stratum using Advanced Filter
(Unique records only).
3 Count the number of members $N_i$ in each stratum using
=COUNTIF(range, criteria).
4 Obtain the subsample size $n_i$ for each stratum using proportional allocation.
5 Use simple random sampling to select $n_i$ indexes within each stratum.