ICT706 SouthBank 2019 Semester 1 Task 2
This assignment will be done completely inside this Jupyter notebook.
Background
A medium-size Australian company (Imaginary) has given you one year of data about the online purchases that their customers have made. They want you to analyse the data using statistical and machine learning techniques and produce:
- a prediction algorithm for predicting how much money each customer is likely to spend in a year;
- a classification algorithm for predicting which customers will be ‘big spenders’;
- some recommendations on what marketing strategy they should use to attract more ‘big spender’ customers.
Instructions
Follow all the instructions in this notebook to complete these tasks. Note that some cells contain ‘assert’ statements – these will automatically mark your work so that you can check that you have done the preceding steps correctly. (If they give errors, go back and correct your previous work until you fix those errors. Once those ‘assert’ cells execute without errors, you know that you have achieved the marks for that step.)
When you have finished, this notebook is the only file that you will need to submit to Blackboard.
Note: If you want some space to try out some Python code of your own, feel free to add extra cells to this notebook. Just make sure that, before you submit, those extra cells either execute without error or have been deleted.
Overview
You have five sections to complete in this Notebook (total = 100 marks):
- Part A: Load and Clean Data (20 points)
- Part B: Data Exploration (30 points)
- Part C: Predicting Spending Levels (20 points)
- Part D: Predicting Big Spenders (20 points)
- Part E: Business Recommendations (10 points)
In [1]: import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
import matplotlib.pyplot as plt
Part A: Load and Clean Data (20 points)
Save your CSV data file into the same folder as this notebook.
Write Python code to load your dataset into a Pandas DataFrame called ‘sales’
In [2]:
sales = pd.read_csv('GreenHat_Sales.csv')
Cleaning the Data
Some of the columns are strings, with dollar signs. We need to convert them to numbers (float) so that we can do calculations on them. The next cell shows what goes wrong if we try doing calculations before converting them to floats!
In [22]: s2 = sales["Spend"] * 4
s2.head()
Out[22]: 0 $1615.00$1615.00$1615.00$1615.00
1 $1927.20$1927.20$1927.20$1927.20
2 $1660.80$1660.80$1660.80$1660.80
3 $3041.10$3041.10$3041.10$3041.10
4 $1764.40$1764.40$1764.40$1764.40
Name: Spend, dtype: object
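The output above repeats each value four times because the column holds Python strings, and multiplying a string by an integer concatenates copies of it. A minimal sketch of the effect, using a small made-up Series (an assumption, not the real dataset):

```python
import pandas as pd

# Hypothetical sample values standing in for the real "Spend" column.
spend = pd.Series(["$1615.00", "$1927.20"], dtype=object, name="Spend")

# Each string is repeated four times; no arithmetic takes place.
repeated = spend * 4
print(repeated[0])  # $1615.00$1615.00$1615.00$1615.00

# Once the dollar sign is stripped and the values are floats,
# the same expression multiplies numerically.
as_float = spend.str.replace("$", "", regex=False).astype(float)
print(as_float * 4)
```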
In [23]: # Complete the following remove_dollar function
# so that it removes any dollar signs and spaces
# and then returns the string as a number (float).
def remove_dollar(s):
    """Removes dollar signs and spaces from s. Returns it as a float."""
In [8]: """Check that remove_dollar() removes dollars and spaces properly (5 points)."""
assert remove_dollar("12") == 12.0
assert remove_dollar("$123") == 123.0
assert remove_dollar(" $1234") == 1234.0
assert remove_dollar(" $42.3 ") == 42.3
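One possible way to satisfy the checks above is sketched below. It is only an illustration, not the official solution, and it assumes the input contains nothing except digits, a decimal point, dollar signs and spaces.

```python
def remove_dollar(s):
    """Removes dollar signs and spaces from s. Returns it as a float."""
    # Drop every "$" and every space, then let float() parse the remainder.
    cleaned = str(s).replace("$", "").replace(" ", "")
    return float(cleaned)


print(remove_dollar(" $42.3 "))  # 42.3
```

Values with thousands separators (for example "$1,234.00") would need an extra replace step. Whether the dataset contains any is not stated here.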
Clean up the Spend columns
Apply your remove_dollar function to the “Spend” column (every row), and put the cleaned-up float values into a new column of your ‘sales’ DataFrame called “SpendValue”.
Then do the same for the “LastSpend” column and put the float values into a new column called “LastSpendValue”
# YOUR CODE HERE
raise NotImplementedError()
sales.dtypes
# check the new SpendValue columns (5 points)
assert "SpendValue" in sales.columns
assert "LastSpendValue" in sales.columns
# check that they are floats
assert sales["SpendValue"].dtype == "float64"
assert sales["LastSpendValue"].dtype == "float64"
# check that the values are greater than zero.
assert (sales["SpendValue"] > 0.0).all()
assert (sales["LastSpendValue"] >= 0.0).all()
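The column clean-up step checked above can be sketched with `Series.apply`, which calls a function on every row of a column. The two-row DataFrame and the `remove_dollar` body below are illustrative assumptions, not the real GreenHat data or the required solution.

```python
import pandas as pd


def remove_dollar(s):
    """Removes dollar signs and spaces from s. Returns it as a float."""
    return float(str(s).replace("$", "").replace(" ", ""))


# Hypothetical stand-in for the real 'sales' DataFrame.
sales = pd.DataFrame({
    "Spend": ["$1615.00", " $1927.20"],
    "LastSpend": ["$0.00", "$35.50 "],
})

# apply() runs remove_dollar on each value and returns a float column.
sales["SpendValue"] = sales["Spend"].apply(remove_dollar)
sales["LastSpendValue"] = sales["LastSpend"].apply(remove_dollar)

print(sales.dtypes)
```

The original string columns are left in place, so the cleaned values can be compared against them if a conversion looks wrong.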
Make Sex and State numeric
