Create a Presentation
Tip: Read through this document in its entirety before you begin.
Situation: You are assigned to analyze the data gathered for the World Happiness Report (2019).
Problem: Is the healthy life expectancy predictable?
Hypothesis: The healthy life expectancy of a person can be predicted with greater than 85% accuracy from a European country’s happiness (ladder) score, gross domestic product per capita, social support, freedom to make life choices, perception of government corruption, government democratic quality, and government delivery quality, using the survey results found in the World Happiness Report (World Happiness Report, 2019).
Data Collection: The data and data dictionaries are online.
World Happiness Report. (2019). [Data file and codebook]. Retrieved from https://worldhappiness.report/ed/2019/
The information you need is on the right side, under Downloads. The data is Chapter 2: Online Data. The data dictionaries are Statistical Appendix 1 for Chapter 2 and Statistical Appendix 2 for Chapter 2.
Data Cleaning:
· View a summary of the data.
· Create a subset of the data based on the hypothesis. You can add a column for the continent and use the library countrycode, for easier subsetting of European countries.
· View a summary of the data to validate that your subset is accurate. If there are any erroneous data types, address them. Do not include the country’s name in the subset.
· Round all numeric values in the data frame to two decimal places.
a. Do not store them this way, change the value.
· Omit the observations with missing values after the previous steps are complete.
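The cleaning steps above can be sketched as follows. This is a minimal illustration on a toy data frame, not the actual Chapter 2 file: the column names (country, ladder, log_gdp, life_exp) and the helper continent_of() are hypothetical, and the sketch keeps only two predictors for brevity. It uses countrycode when the package is installed and falls back to a small hand-written lookup that covers the toy data only.

```r
# Toy stand-in for the Chapter 2 online data; column names are hypothetical.
whr <- data.frame(
  country  = c("France", "Germany", "Japan", "Brazil", "Norway"),
  ladder   = c(6.6661, 7.1183, 5.7934, 6.1909, NA),
  log_gdp  = c("10.5842", "10.7321", "10.5791", "9.5583", "11.0853"), # read in as text
  life_exp = c(73.8, 72.2, 75.0, 66.4, 73.2),
  stringsAsFactors = FALSE
)

# Map each country to a continent. Uses countrycode when it is installed;
# the lookup vector is a fallback for this toy data only.
continent_of <- function(countries) {
  if (requireNamespace("countrycode", quietly = TRUE)) {
    countrycode::countrycode(countries, origin = "country.name",
                             destination = "continent")
  } else {
    lookup <- c(France = "Europe", Germany = "Europe", Japan = "Asia",
                Brazil = "Americas", Norway = "Europe")
    unname(lookup[countries])
  }
}

summary(whr)                                  # first look at the raw data

# Subset to European rows and drop the country name and continent columns.
whr$continent <- continent_of(whr$country)
europe <- whr[whr$continent %in% "Europe", c("ladder", "log_gdp", "life_exp")]

europe$log_gdp <- as.numeric(europe$log_gdp)  # address the erroneous data type

# Round every numeric column to two decimals, overwriting the values in place.
num_cols <- vapply(europe, is.numeric, logical(1))
europe[num_cols] <- lapply(europe[num_cols], round, digits = 2)

europe <- na.omit(europe)                     # drop rows with missing values last
summary(europe)                               # validate the subset
```

With the real data, the same pattern applies after selecting the seven predictors and the outcome named in the hypothesis.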
Analyze:
· The plan is to perform multiple linear regression (MLR) modeling on the subset after splitting it into a training set and a test set. Eighty percent of the data should fall in the training set.
· Identify the assumptions of linear regression.
· Prepare the data for linear regression.
o If you feel the need to transform the data, stop. There is no need to transform the data.
· Generate a model so that the assumptions are testable.
· View a summary of the model, but only address the summary statistics of the residuals.
o What do you think the measures of central tendency indicate here?
· Test the assumptions.
· Did the data meet the assumptions?
o If the data set met the assumptions, view a summary of the model, again.
§ Store the model.
§ Interpret the model output.
§ Test the MLR model using the test set.
§ Did the model meet the criteria in the problem of 85% accuracy?
o If the data set did not meet the assumptions, move on to the next Analyze stage to attempt a new approach.
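One way to organize the MLR steps above is sketched here on synthetic data, so it runs without the cleaned subset. The variable names are hypothetical, and the checks shown are only a starting point, not the full list of assumptions you are asked to identify. This handout does not define "accuracy" for a continuous outcome; the sketch assumes 100 × (1 − mean absolute percentage error), which is one possible reading. State and justify whichever definition you use.

```r
set.seed(42)  # make the random split reproducible

# Synthetic stand-in for the cleaned European subset (names are hypothetical).
n <- 200
dat <- data.frame(ladder  = rnorm(n, mean = 6.5, sd = 0.8),
                  log_gdp = rnorm(n, mean = 10.3, sd = 0.5))
dat$life_exp <- 20 + 2 * dat$ladder + 3.5 * dat$log_gdp + rnorm(n, sd = 1)

# 80/20 split by row index.
train_idx <- sample(seq_len(n), size = floor(0.8 * n))
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]

# Fit the model on the training set only.
mlr_model <- lm(life_exp ~ ., data = train)

# Summary statistics of the residuals (compare the median with the mean).
summary(residuals(mlr_model))

# A few base-R checks; packages such as car or lmtest offer formal tests.
shapiro.test(residuals(mlr_model))      # normality of residuals
cor(train[, c("ladder", "log_gdp")])    # screen predictors for collinearity
# plot(mlr_model, which = 1:2)          # residuals vs fitted, normal Q-Q

# Evaluate on the held-out test set (assumed accuracy definition: 1 - MAPE).
mlr_pred <- predict(mlr_model, newdata = test)
mape <- mean(abs((test$life_exp - mlr_pred) / test$life_exp))
mlr_accuracy <- 100 * (1 - mape)
mlr_accuracy
```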
Analyze:
· The new plan is to generate a random forest (RF) model with the same subset of data used for the MLR model.
· Identify the assumptions of an RF model.
· Prepare the data for the RF.
· Generate the RF model with 100 trees.
· Store the model.
· View a summary of the model. Interpret the model output.
· Test the model. Is the model accuracy more than 85%?
· As a final step in the analysis, in the presentation describe how the significance of the predictors in the MLR model compares to the importance of predictors in the RF model. Why are they the same or why are they different? What does that indicate? Explain.
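The RF steps above, and the comparison with the MLR model, could look like the sketch below. It assumes the randomForest package is installed and runs on synthetic data with hypothetical variable names (noise is a deliberately irrelevant predictor). The accuracy measure is the same assumed definition, 100 × (1 − mean absolute percentage error), and is not prescribed by this handout.

```r
library(randomForest)  # assumed installed: install.packages("randomForest")

set.seed(42)

# Synthetic stand-in for the cleaned subset; "noise" has no real effect.
n <- 200
dat <- data.frame(ladder  = rnorm(n, mean = 6.5, sd = 0.8),
                  log_gdp = rnorm(n, mean = 10.3, sd = 0.5),
                  noise   = rnorm(n))
dat$life_exp <- 20 + 2 * dat$ladder + 3.5 * dat$log_gdp + rnorm(n, sd = 1)

train_idx <- sample(seq_len(n), size = floor(0.8 * n))
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]

# Random forest with 100 trees; importance = TRUE also stores permutation importance.
rf_model <- randomForest(life_exp ~ ., data = train,
                         ntree = 100, importance = TRUE)
print(rf_model)  # out-of-bag mean squared residuals and % variance explained

# Test-set accuracy under the assumed definition (1 - MAPE).
rf_pred <- predict(rf_model, newdata = test)
rf_accuracy <- 100 * (1 - mean(abs((test$life_exp - rf_pred) / test$life_exp)))
rf_accuracy

# Side-by-side view: MLR p-values versus RF importance scores.
mlr_model <- lm(life_exp ~ ., data = train)
summary(mlr_model)$coefficients[, "Pr(>|t|)"]
importance(rf_model)  # columns: %IncMSE and IncNodePurity
```

The p-values and the importance scores measure different things, so ranking the predictors both ways is only the starting point for the explanation requested above.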
Visualize:
· Generate a visualization to depict the model accuracy. A scatter plot of predicted and test values is an option.
· Generate a plot to depict the learning curve of the RF model.
· Generate a feature importance plot. Is post hoc analysis of the RF model possible to reduce the number of independent variables? Explain.
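A possible starting point for the three plots above, again on synthetic data with hypothetical variable names and assuming the randomForest package is installed. The "learning curve" here is taken to mean the out-of-bag error as trees are added, which is an assumption. If the lectures define the learning curve differently, follow that definition.

```r
library(randomForest)  # assumed installed

set.seed(42)

# Synthetic stand-in for the cleaned subset (names are hypothetical).
n <- 200
dat <- data.frame(ladder  = rnorm(n, mean = 6.5, sd = 0.8),
                  log_gdp = rnorm(n, mean = 10.3, sd = 0.5),
                  noise   = rnorm(n))
dat$life_exp <- 20 + 2 * dat$ladder + 3.5 * dat$log_gdp + rnorm(n, sd = 1)

train_idx <- sample(seq_len(n), size = floor(0.8 * n))
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]

rf_model <- randomForest(life_exp ~ ., data = train,
                         ntree = 100, importance = TRUE)
rf_pred <- predict(rf_model, newdata = test)

# 1. Model accuracy: predicted versus observed test values.
plot(test$life_exp, rf_pred,
     xlab = "Observed healthy life expectancy (years)",
     ylab = "Predicted healthy life expectancy (years)",
     main = "Random forest: predicted vs observed")
abline(a = 0, b = 1, lty = 2)  # points on this line are perfect predictions

# 2. Learning curve (assumed meaning): out-of-bag MSE after each added tree.
plot(seq_len(rf_model$ntree), rf_model$mse, type = "l",
     xlab = "Number of trees", ylab = "Out-of-bag MSE",
     main = "Random forest error by number of trees")

# 3. Feature importance plot.
varImpPlot(rf_model, main = "Random forest feature importance")
```

In RMarkdown slides, each plot can go in its own code chunk with echo = FALSE so that the audience sees the figure without the code.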
For the presentation:
Tell your data story, based on the programming above. You may add visualizations to improve your story as you see fit. You do not need to show R code in your presentation. (I would keep it to a minimum and only show programming if it adds to the story.) Be sure to include your findings and a conclusion.
Your audience: The presentation is at a meeting with experts in the field of statistics and it is open to the general public. There will be people in the audience looking for unsupported findings (such as failing to test assumptions). There may be children in attendance, as well.
You must generate your slides in RMarkdown. There are multiple options for generating slides in RMarkdown; demonstrations in the lectures cover ioslides and revealjs. You may choose which slide platform you use. You may not use something other than slides.
You must narrate your slides. One option is to embed audio into your slides. The other option is to create a video of your audio narration over the slides. These are the only two options.
Required files to submit:
· If you program your .Rmd file with embedded audio, submit everything that makes your slides function, including, but not limited to your .Rmd file and the embedded audio file.
· If you recorded audio outside of .Rmd, submit everything needed to make the program function and submit a video in 480p, with your voice narrating the slides.
Note: There is no need to send the .html file.
The presentation is worth 84 points:
· Complete assignment prescribed instructions – 44 points –
o Submitting appropriate documentation
o Presentation
§ The slides contain accurate information
§ The slides are easy to understand
§ The slides are neat in appearance individually and neat in composition with all slides
§ The presenter is clearly understood and amplifies the content of the slides (audio)
§ The presenter interprets every analysis in the presentation accurately (audio)
o Strategy/Focus/References –
§ Every analysis interpretation is accurate; every analysis has an interpretation.
· Remember that analysis is not necessarily a statistical test. It is anything you use to gain knowledge from the data.
§ The presentation has a tight, cohesive focus
§ Any external figures, such as pictures from the internet, include a reference and are not subject to copyright infringement.
§ Annotations of references and citations are accurate.
· Programming – 30 points –
o Includes all necessary functions from the instructions
o Organization
o Program runs
· Programming comments – 10 points –
o Appropriate commenting included in the documentation
Some tips when creating your presentation:
· Do not speak in variables; instead of “hp,” say “horsepower”
· Avoid too many words per slide
· Don’t read the slide to the audience
· Keep in mind – there is a difference between a presentation and a presentation of data!
· Do not present any statistical output without interpretations, including charts, tables, and test outputs.
· Avoid the use of “clearly” or “obviously”
· Practice narrating your slides! I need to be able to understand what you say.
· All references have citations; all citations have references.
· Anything that is not common knowledge has a reference to support the assertion.
· Have a conclusion that summarizes the purpose of the presentation of data
· Your data subset will have seven predictors and one outcome variable. The dataset will have 411 observations if the subsetting is correct.