7089CEM: Introduction to Statistical Methods for Data Science

This document is for Coventry University students for their own use in completing their
assessed work for this module and should not be passed to third parties or posted on any
website. Any infringements of this rule should be reported to
facultyregistry.eec@coventry.ac.uk.
Faculty of Engineering, Environment and Computing
Assignment Brief

Module Title: Introduction to Statistical Methods for Data Science
Module Code: 7089CEM
Cohort (Sept/Jan): Jan and May start
Coursework Title: Modelling and analysis of gene expression data
Coursework type: Individual assignment
Lecturer: Dr Fei He
Hand out date: 22/05/2020
Due date and time: 19/06/2020, 18:00
Estimated Time (hrs): 4 weeks
Word Limit*: 3000 – 4000
% of Module Mark: 100%
• Submission arrangement: online via CUMoodle, and Turnitin.
• File types and method of recording: Report (Word), Programme code (R or Matlab script)
• Mark and Feedback date: 2 weeks after submission
• Mark and Feedback method (e.g. in lecture, written via Gradebook): provided in Moodle

Module Learning Outcomes Assessed:
• Demonstrate knowledge of underlying concepts in probability and statistics used in Data
Science.
• Select and apply appropriate statistical methods or techniques to solve problems or analyse
data sets.
• Use modern software to solve real world problems and analyse large data sets.
• Interpret the results of their analyses and communicate those results accurately.
Task and Mark distribution:
Coursework Description:
The aim of this assignment is to fit a non-linear time series model to the gene expression data set. Gene
expression is one of the most important biological processes, in which information from a gene is used to
synthesize a functional gene product, such as a protein. The expression of a gene can be controlled (or
regulated) by another gene or several other genes, through a gene product (protein) called a transcription
factor. Understanding how genes regulate each other, i.e. gene regulation, is important for investigating
complex diseases and how cells respond to environmental stimuli.
Data:
The ‘simulated’ gene expression time-series data for 5 genes are given in the CSV file (gene_data.csv). The
first column contains the sampling time in minutes; the remaining 5 columns contain the time-course
expression data of the 5 genes $x_1$, $x_2$, $x_3$, $x_4$, $x_5$, respectively. All 5 genes are subject to
additive noise (assumed to be independent and identically distributed (“i.i.d.”) Gaussian with zero mean)
with unknown variance.
Task 1: Preliminary data analysis
You should first perform an initial exploratory data analysis, by investigating:
• Time series plots
• Distribution for each gene
• Correlation and scatter plots (for each pair of genes) to examine their dependencies
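As a minimal sketch of the pairwise correlation step, the snippet below computes the Pearson correlation coefficient for every pair of series. The gene names `g1` to `g3` and their values are made-up placeholders, not the contents of gene_data.csv, and the `pearson` helper is written here for illustration only.

```python
import itertools
import math

def pearson(u, v):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    return cov / math.sqrt(var_u * var_v)

# Placeholder series (assumption: NOT the coursework data)
genes = {
    "g1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "g2": [2.1, 3.9, 6.2, 8.1, 9.8],
    "g3": [5.0, 4.1, 2.9, 2.2, 1.0],
}

# One coefficient per unordered pair of genes
pairwise = {
    (a, b): pearson(genes[a], genes[b])
    for a, b in itertools.combinations(genes, 2)
}
for pair, r in pairwise.items():
    print(pair, round(r, 3))
```

A coefficient near +1 or -1 suggests a strong linear dependency between two genes; a scatter plot of the same pair helps reveal nonlinear dependencies that the coefficient alone can miss.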
Task 2: Dimension reduction
• We would like to reduce the dimension of time (for all 5 genes) to two using PCA. You can choose
to use either the eigen-decomposition method or the singular value decomposition method.
• Plot these 5 genes in the reduced 2-dimensional space, using different markers or colours.
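The two routes mentioned above give the same projection. The sketch below assumes the data are arranged as a matrix $X$ with one row per gene and one column per time point (so 5 rows and $T$ columns); the symbols $X_c$, $V_2$, $Z$ and $C$ are introduced here for illustration and do not come from the brief.

```latex
\begin{aligned}
X_c &= X - \mathbf{1}\,\bar{x}^{\top},
  & \bar{x} &= \tfrac{1}{5}\,X^{\top}\mathbf{1}
  && \text{(centre each time-point column)} \\
X_c &= U \Sigma V^{\top},
  & Z &= X_c V_2 = U_2 \Sigma_2
  && \text{(SVD route; } V_2 \text{ = first two columns of } V\text{)} \\
C &= \tfrac{1}{5-1}\,X_c^{\top} X_c = V \Lambda V^{\top},
  & \lambda_i &= \frac{\sigma_i^{2}}{5-1}
  && \text{(eigen-decomposition route)}
\end{aligned}
```

Each row of $Z$ (a $5 \times 2$ matrix) gives the coordinates of one gene in the reduced space. Because only 5 genes are available, $C$ has at most 4 non-zero eigenvalues.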
Task 3: Nonlinear regression – modelling gene regulation
We know that one of the genes, $x_3$, is regulated by the other two genes $x_4$ and $x_5$; however, we do
not know whether this regulation is activation or repression, or whether the regulatory interaction is
linear or nonlinear. Therefore, we will fit a generic nonlinear polynomial regression model (with 2 inputs)
to the data with the following exemplar structure:
$$x_3 = \alpha_0 + a_1 x_4 + a_2 x_4^2 + a_3 x_4^3 + \cdots + b_1 x_5 + b_2 x_5^2 + b_3 x_5^3 + \cdots + \varepsilon$$
Here $\alpha_0$ is a bias term (denoting the basal transcription rate); $a_1, a_2, a_3, \cdots, b_1, b_2, b_3, \cdots$
are the parameters of the regression model to be estimated, and $\varepsilon$ denotes an additive, Gaussian,
zero-mean noise.
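A polynomial model of this kind is linear in its parameters, so the standard least-squares results apply under the i.i.d. Gaussian noise assumption. The sketch below uses notation introduced here rather than in the brief: $y$ is the vector of observed outputs, $X$ has one column per selected term (a column of ones for the bias) and one row $x_t^{\top}$ per time point, $\theta$ stacks the parameters, $n$ is the number of samples and $p$ the number of parameters. The factor 1.96 is the large-sample normal approximation; for small $n$ a Student-t quantile with $n - p$ degrees of freedom is the usual substitute.

```latex
\begin{aligned}
y &= X\theta + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \sigma^{2} I_n) \\
\hat{\theta} &= (X^{\top}X)^{-1}X^{\top}y \\
\hat{\sigma}^{2} &= \frac{\lVert y - X\hat{\theta} \rVert^{2}}{n - p} \\
\widehat{\operatorname{Cov}}(\hat{\theta}) &= \hat{\sigma}^{2}\,(X^{\top}X)^{-1} \\
\hat{y}_t &\pm 1.96\,\sqrt{x_t^{\top}\,\widehat{\operatorname{Cov}}(\hat{\theta})\,x_t}
  \qquad \text{(approximate 95\% interval for the mean prediction)}
\end{aligned}
```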
The main objective of this task is to identify the (polynomial) model structure, estimate model
parameters from the training data, and use the identified model to predict the response/output signal.
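One possible way to organise structure identification and parameter estimation is sketched below, assuming an exhaustive search scored by BIC. Everything in it is illustrative: the inputs are called `u` and `v`, the data are synthetic (generated from a known model, not taken from gene_data.csv), and the helper names `solve`, `design_row`, `fit_least_squares` and `bic` are hypothetical. The coursework itself asks for R or Matlab; Python is used here only to show the logic.

```python
import itertools
import math
import random

def solve(A, b):
    """Solve the square system A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def design_row(terms, u, v):
    """Bias plus one column per term; a term (g, p) means input g raised to power p."""
    return [1.0] + [(u if g == "u" else v) ** p for g, p in terms]

def fit_least_squares(terms, us, vs, ys):
    """Return (theta, rss) by solving the normal equations (X'X) theta = X'y."""
    X = [design_row(terms, u, v) for u, v in zip(us, vs)]
    k = len(X[0])
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    Xty = [sum(r[i] * y for r, y in zip(X, ys)) for i in range(k)]
    theta = solve(XtX, Xty)
    rss = sum((y - sum(t * c for t, c in zip(theta, r))) ** 2 for r, y in zip(X, ys))
    return theta, rss

def bic(rss, n, k):
    """BIC for Gaussian residuals, up to an additive constant."""
    return n * math.log(rss / n) + k * math.log(n)

# Synthetic stand-in data (assumption): y = 0.5 + 1.2*u - 0.8*v^2 + noise
rng = random.Random(0)
us = [rng.uniform(0, 2) for _ in range(100)]
vs = [rng.uniform(0, 2) for _ in range(100)]
ys = [0.5 + 1.2 * u - 0.8 * v ** 2 + rng.gauss(0, 0.05) for u, v in zip(us, vs)]

# Candidate terms: each input up to 4th order; each model is a bias plus 1 or 2 terms
candidates = [(g, p) for g in ("u", "v") for p in range(1, 5)]
results = []
for size in (1, 2):
    for terms in itertools.combinations(candidates, size):
        theta, rss = fit_least_squares(terms, us, vs, ys)
        results.append((bic(rss, len(ys), len(theta)), terms, theta))

best_bic, best_terms, best_theta = min(results, key=lambda r: r[0])
print(best_terms, [round(t, 2) for t in best_theta])
```

Forward subset selection would follow the same pattern, replacing the exhaustive loop with a greedy one that adds, at each iteration, the term giving the largest drop in MSE on the held-out test portion.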
To do this, you need to:
• Identify the correct model structure (using a model selection approach, e.g. subset selection,
AIC/BIC, or exploring all possible model structures), so that the model gives a low mean
square error (MSE) and the model residual/error is close to Gaussian. You can either:
i) Split the input and output dataset into two parts: one used to train the model, the
other used for testing (e.g. 80% for training, 20% for testing). Apply the forward subset
selection approach to select the best model structure iteratively (in each iteration,
select the most significant term, i.e. the one that most reduces the MSE on the testing
data, and add it to the current model).
ii) Or select the best model using the BIC or AIC goodness-of-fit criteria, by exploring all
possible combinations of terms (or a chosen set of candidate model structures).
The underlying nonlinear polynomial model may contain a bias term, a linear term, and one or a
few nonlinear (input) terms; the nonlinear terms can have a maximum order of 4, and the model
will contain no more than 3 terms in total (including bias, linear and nonlinear terms).
• Estimate the model parameters using the least squares method. This step will be embedded within
the above model structure identification process (since for each candidate model structure, you
will need to estimate its parameters in order to evaluate the model’s performance against the
observed data).
• Once the best model structure is selected and its parameters are estimated, estimate the
parameter covariance matrix and plot the corresponding parameter uncertainty p.d.f. as a 3D
surface and/or contour plot (similar to the example given in the lecture/lab notes). If the
selected model has more than 2 parameters, plot all pair-wise combinations of parameters.
• Compute the model’s output/prediction (on the training data), and also compute the 95%
confidence intervals and plot them (with error bars) together with the mean values of the model
prediction.
• Validate the model using a train-test split validation approach (you may use a different split
proportion from the one used at the subset model selection stage), to check whether the
identified model provides good predictions on the testing dataset.
• Use the “Approximate Bayesian Computation (ABC)” method to compute the posterior distribution
of the regression model parameters (using rejection ABC and assuming a Uniform prior). Plot the
marginal posterior distribution for each parameter, and the joint posterior probability
distribution for all pair-wise combinations of parameters.
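The rejection ABC step in the last item can be sketched as follows. This is a toy illustration under stated assumptions, not the method from the lecture notes: it uses a two-parameter model `y = a + b*x`, synthetic data, a noise standard deviation treated as known, RMSE as the distance measure, and hand-picked prior ranges and tolerance `eps`. All function and variable names are hypothetical.

```python
import math
import random

def simulate(theta, xs, noise_sd, rng):
    """Simulate outputs from the toy model y = a + b*x with Gaussian noise."""
    a, b = theta
    return [a + b * x + rng.gauss(0, noise_sd) for x in xs]

def rmse(p, q):
    """Root mean square difference between two equal-length sequences."""
    return math.sqrt(sum((s - t) ** 2 for s, t in zip(p, q)) / len(p))

def rejection_abc(xs, ys, prior, noise_sd, eps, n_draws, seed=0):
    """Keep prior draws whose simulated data fall within eps of the observations."""
    rng = random.Random(seed)
    accepted = []
    for _ in range(n_draws):
        # Uniform prior: one (low, high) range per parameter
        theta = [rng.uniform(lo, hi) for lo, hi in prior]
        if rmse(simulate(theta, xs, noise_sd, rng), ys) <= eps:
            accepted.append(theta)
    return accepted

# Synthetic observations (assumption): y = 0.5 + 1.2*x + noise
data_rng = random.Random(1)
xs = [data_rng.uniform(0, 2) for _ in range(50)]
ys = [0.5 + 1.2 * x + data_rng.gauss(0, 0.1) for x in xs]

prior = [(0.0, 1.0), (0.5, 2.0)]  # assumed ranges for a and b
posterior = rejection_abc(xs, ys, prior, noise_sd=0.1, eps=0.2, n_draws=5000)
post_mean = [sum(t[i] for t in posterior) / len(posterior) for i in range(2)]
print(len(posterior), [round(m, 2) for m in post_mean])
```

A histogram of each column of `posterior` approximates the corresponding marginal posterior, and a scatter or 2D density plot of two columns approximates their joint posterior. A smaller `eps` gives a tighter approximation at the cost of fewer accepted samples.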
Marking Scheme
This coursework is worth 15 credits (100%). It will be marked according to the following:
• 15% will be given for performing an initial data analysis (histogram plots, simple input-output
correlation measures, time series plots, fitting a linear model …). If you create any R code, you must
include this in the report.
• 10% will be given for performing dimension reduction using PCA and plotting the result.
• 25% will be given for writing the R code to select the correct model structure, estimate the
model’s parameters, and use these estimates to calculate new predictions.
• 20% will be given for estimating the parameter estimation uncertainties (covariance matrix, plot of the
corresponding parameter estimate distribution) and the model’s prediction confidence intervals (on
the training input data). Again, if you create any R code, you must include this in the report.
• 5% will be given for performing model validation and analysing the performance of the identified
nonlinear model.
• 5% will be given for performing the Approximate Bayesian Computation to compute the (approximated)
posterior distribution of the regression model parameters.
• 10% will be given for appropriate discussion and interpretation of the results you obtained.
• 10% will be given for writing the report (around 3000-4000 words) in a structured, readable form and
submitting the executable R scripts. The report should be in sections with appropriate headings, an
introduction and a conclusion.
Notes:
1. You are expected to use the Coventry University Harvard Referencing Style. For support and
advice on this, students can contact the Centre for Academic Writing (CAW).
2. Please notify your registry course support team and module leader for disability support.
3. Any student requiring an extension or deferral should follow the university process as outlined
here.
4. The University cannot take responsibility for any coursework lost or corrupted on disks, laptops
or personal computers. Students should therefore regularly back up any work and are advised to
save it on the University system.
5. If there are technical or performance issues that prevent students submitting coursework
through the online coursework submission system on the day of a coursework deadline, an
appropriate extension to the coursework submission deadline will be agreed. This extension
will normally be 24 hours or the next working day if the deadline falls on a Friday or over the
weekend period. This will be communicated via your Module Leader.
6. *(ML’s delete if not applying to this assessment) Assignments that are more than 10% over the
word limit will result in a deduction of 10% of the mark i.e. a mark of 60% will lead to a
reduction of 6% to 54%. The word limit includes quotations, but excludes the bibliography,
reference list and tables.
7. You are encouraged to check the originality of your work by using the draft Turnitin links on your
Moodle Web.
8. Collusion between students (where sections of your work are similar to the work submitted by
other students in this or previous module cohorts) is taken extremely seriously and will be
reported to the academic conduct panel. This applies to both courseworks and exam answers.
9. A marked difference between your writing style, knowledge and skill level demonstrated in class
discussion, any test conditions and that demonstrated in a coursework assignment may result in
you having to undertake a Viva Voce in order to prove the coursework assignment is entirely
your own work.
10. If you make use of the services of a proof reader in your work you must keep your original version
and make it available as a demonstration of your written efforts.
11. You must not submit work for assessment that you have already submitted (partially or in full),
either for your current course or for another qualification of this university, unless this is
specifically provided for in your assignment brief or specific course or module information.
Where earlier work by you is citable, i.e. it has already been published/submitted, you must
reference it clearly. Identical pieces of work submitted concurrently will also be considered to be
self-plagiarism.
Mark allocation guidelines to students (to be edited by staff per assessment)

0-39: Work mainly incomplete and/or weaknesses in most areas
40-49: Most elements completed; weaknesses outweigh strengths
50-59: Most elements are strong, minor weaknesses
60-69: Strengths in all elements
70+: Most work exceeds the standard expected
80+: All work substantially exceeds the standard expected

Marking Rubric (To be edited by staff per each assessment)

Each grade band is described under four criteria: Answer relevance, Argument & coherence, Evidence, and
Summary.

First (≥70)
Answer relevance: Innovative response, answers the question fully, addressing the learning objectives of
the assessment task. Evidence of critical analysis, synthesis and evaluation.
Argument & coherence: A clear, consistent in-depth critical and evaluative argument, displaying the ability
to develop original ideas from a range of sources. Engagement with theoretical and conceptual analysis.
Evidence: Wide range of appropriately supporting evidence provided, going beyond the recommended texts.
Correctly referenced.
Summary: An outstanding, well-structured and appropriately referenced answer, demonstrating a high degree
of understanding and critical analytic skills.

Upper Second (60-69)
Answer relevance: A very good attempt to address the objectives of the assessment task with an emphasis on
those elements requiring critical review.
Argument & coherence: A generally clear line of critical and evaluative argument is presented.
Relationships between statements and sections are easy to follow, and there is a sound, coherent structure.
Evidence: A very good range of relevant sources is used in a largely consistent way as supporting evidence.
There is use of some sources beyond recommended texts. Correctly referenced in the main.
Summary: The answer demonstrates a very good understanding of theories, concepts and issues, with evidence
of reading beyond the recommended minimum. Well organised and clearly written.

Lower Second (50-59)
Answer relevance: Competently addresses objectives, but may contain errors or omissions and critical
discussion of issues may be superficial or limited in places.
Argument & coherence: Some critical discussion, but the argument is not always convincing, and the work is
descriptive in places, with over-reliance on the work of others.
Evidence: A range of relevant sources is used, but the critical evaluation aspect is not fully presented.
There is limited use of sources beyond the standard recommended materials. Referencing is not always
correctly presented.
Summary: The answer demonstrates a good understanding of some relevant theories, concepts and issues, but
there are some errors and irrelevant material included. The structure lacks clarity.

Third (40-49)
Answer relevance: Addresses most objectives of the assessment task, with some notable omissions. The
structure is unclear in parts, and there is limited analysis.
Argument & coherence: The work is descriptive with minimal critical discussion and limited theoretical
engagement.
Evidence: A limited range of relevant sources used without appropriate presentation as supporting or
conflicting evidence, coupled with very limited critical analysis. Referencing has some errors.
Summary: Some understanding is demonstrated but is incomplete, and there is evidence of limited research on
the topic. Poor structure and presentation, with few and/or poorly presented references.

Fail (<40)
Answer relevance: Some deviation from the objectives of the assessment task. May not consistently address
the assignment brief. At the lower end fails to answer the question set or address the learning outcomes.
There is minimal evidence of analysis or evaluation.
Argument & coherence: Descriptive with no evidence of critical discussion or theoretical engagement. At the
lower end displays a minimal level of understanding.
Evidence: Very limited use and application of relevant sources as supporting evidence. At the lower end
demonstrates a lack of real understanding. Poor presentation of references.
Summary: Whilst some relevant material is present, the level of understanding is poor with limited evidence
of wider reading. Poor structure and poor presentation, including referencing. At the lower end there is
evidence of a lack of comprehension, resulting in an assignment that is well below the required standard.

Late submission
Answer relevance: 0. Argument & coherence: 0. Evidence: 0. Summary: 0.
