KPMG VIRTUAL INTERNSHIP PROGRAM

A step-by-step approach to improving the quality of a dataset

Potential Business: Optimizing marketing strategy

Here is the background information on the task:

Sprocket Central Pty Ltd is a medium-sized bikes & cycling accessories organization. The company is looking to leverage the insights and analysis provided by KPMG to grow its business by optimizing its marketing strategy.

The client provided KPMG with 3 datasets to start the preliminary data exploration and identify ways to improve the quality of Sprocket Central Pty Ltd’s data:

  1. Customer Demographic
  2. Customer Addresses
  3. Transaction data

In a nutshell, the tasks involve analyzing the business datasets, improving their data quality, and recommending ways to ensure the quality of future data; creating customer segments based on customer behavior trends and patterns; and creating visualizations and dashboards in Tableau to derive insights from existing and new customers.

Data Quality Framework Table

Below is a list of the Data Quality dimensions used to evaluate a dataset.

Let's load the required libraries

Read the Excel datasets into pandas DataFrames

Load Transaction data
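A minimal sketch of the loading step. The workbook file name and sheet name below are assumptions (the source does not name the file), so the runnable part uses a small made-up frame with the same kinds of columns:

```python
import pandas as pd

# In the actual notebook the client workbook would be read with something like
# (file name, sheet name, and header row are assumptions):
# transactions = pd.read_excel("KPMG_raw_data.xlsx", sheet_name="Transactions", header=1)

# Small stand-in frame with illustrative values, for a self-contained example:
transactions = pd.DataFrame({
    "transaction_id": [1, 2, 3],
    "customer_id": [2950, 3120, 402],
    "order_status": ["Approved", "Approved", "Cancelled"],
    "list_price": [71.49, 2091.47, 1793.43],
})

# First look at the data: dtypes, non-null counts, and a few rows.
transactions.info()
print(transactions.head())
```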

Overview of Data

The data types have to be changed appropriately.
Also there are some missing values in the dataset.

For now we will delete the entries from the above list [2660, 869, 3236, 852, 3229, 2074, 1373, 5034]. customer_id 5034 in particular looks like an outlier, so let's delete the entries with that customer_id.
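The deletion step above can be sketched with a boolean mask; the sample frame is made up, only the ID list comes from the text:

```python
import pandas as pd

# Stand-in transaction data (illustrative values).
transactions = pd.DataFrame({
    "transaction_id": range(1, 6),
    "customer_id": [5034, 2660, 1, 2, 5034],
    "list_price": [10.0, 20.0, 30.0, 40.0, 50.0],
})

# IDs flagged above as problem entries; 5034 in particular looks like an outlier.
bad_ids = [2660, 869, 3236, 852, 3229, 2074, 1373, 5034]

# Keep only rows whose customer_id is NOT in the flagged list.
transactions = transactions[~transactions["customer_id"].isin(bad_ids)].reset_index(drop=True)
print(transactions)
```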

Check for duplicates in columns

As expected, transaction_id has all unique values in this dataset, so it can serve as a primary key.
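The uniqueness check is a one-liner in pandas; the sample frame here is illustrative:

```python
import pandas as pd

# Stand-in data: transaction_id unique, customer_id repeated.
transactions = pd.DataFrame({
    "transaction_id": [1, 2, 3, 4],
    "customer_id": [10, 10, 11, 12],
})

# A column with zero duplicates is a candidate primary key.
dup_count = transactions["transaction_id"].duplicated().sum()
is_primary_key = transactions["transaction_id"].is_unique
print(dup_count, is_primary_key)
```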

Looking at categorical variables

There are rows where order_status is Cancelled; we will delete those rows.

Since all the remaining values in order_status are Approved, the column carries no information, so we will delete it.
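Both steps, dropping the cancelled rows and then the now-constant column, can be sketched as follows (sample data is made up):

```python
import pandas as pd

transactions = pd.DataFrame({
    "transaction_id": [1, 2, 3],
    "order_status": ["Approved", "Cancelled", "Approved"],
})

# Drop cancelled orders.
transactions = transactions[transactions["order_status"] != "Cancelled"]

# Every remaining value is "Approved", so the column carries no information; drop it.
assert (transactions["order_status"] == "Approved").all()
transactions = transactions.drop(columns=["order_status"]).reset_index(drop=True)
```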

Missing Value treatment

There are still 360 values missing from 'online_order'; replace them with the mode of online_order.
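A sketch of the mode imputation, with a made-up sample series in place of the real 360 missing values:

```python
import numpy as np
import pandas as pd

transactions = pd.DataFrame({"online_order": [1.0, 1.0, 0.0, np.nan, 1.0, np.nan]})

# mode() ignores NaN; take the most frequent value and fill the gaps with it.
mode_value = transactions["online_order"].mode()[0]
transactions["online_order"] = transactions["online_order"].fillna(mode_value)
```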

Get the list of customer_ids for rows with NaN in brand, product_line, product_class, product_size, standard_cost, and product_first_sold_date

Looks like all 197 rows are missing entries in the same columns: brand, product_line, product_class, product_size, standard_cost, and product_first_sold_date.

We will look at these rows later.
But before that, we can extract the customer_id information from them.
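Finding the rows where a whole set of columns is missing together, and pulling out their customer_ids, can be sketched as below. The sample frame uses only two of the affected columns for brevity:

```python
import numpy as np
import pandas as pd

transactions = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "brand": ["Solex", np.nan, "Trek"],
    "standard_cost": [53.6, np.nan, 101.0],
})

# Columns the text says are missing together (shortened list for the sketch;
# the real set also includes product_line, product_class, product_size, etc.).
cols = ["brand", "standard_cost"]

# Rows where ALL of those columns are NaN, and their customer_ids.
missing = transactions[transactions[cols].isna().all(axis=1)]
missing_customer_ids = missing["customer_id"].tolist()
print(missing_customer_ids)
```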

For now, we will delete the rows where data is missing.

Change data type

Load CustomerDemographic and CustomerAddress data

Overview of Data

cust_demographics dataset

cust_address dataset

Join cust_demographics and cust_address datasets

Looks like there are customers with demographic information but no transactions.

default

There is a default column without valid data; delete it.

Convert customer_id and postcode from int and float to object.

DOB

Sort the DOB values to check for implausible entries

1843-12-21 looks like an incorrect entry;
we can change it to 1943-12-21, assuming the year was entered incorrectly.
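The single-value correction can be sketched as a masked assignment (the second sample date is made up):

```python
import pandas as pd

customers = pd.DataFrame({
    "customer_id": [1, 2],
    "DOB": pd.to_datetime(["1843-12-21", "1976-03-05"]),
})

# Assume the century was mistyped: 1843 -> 1943.
mask = customers["DOB"] == pd.Timestamp("1843-12-21")
customers.loc[mask, "DOB"] = pd.Timestamp("1943-12-21")
```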

Looking at categorical variables

gender

In gender, the values Female, F, and Femal all refer to the same category, as do Male and M; we can standardize these values.
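A sketch of the standardization with `replace` on made-up sample values:

```python
import pandas as pd

customers = pd.DataFrame({"gender": ["F", "Female", "Femal", "M", "Male", "U"]})

# Collapse spelling variants into canonical labels.
customers["gender"] = customers["gender"].replace(
    {"F": "Female", "Femal": "Female", "M": "Male"}
)
print(customers["gender"].value_counts())
```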

state

In state, NSW and New South Wales are the same, as are VIC and Victoria; we can standardize these values.
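The same `replace` pattern works for the state column (sample values are illustrative):

```python
import pandas as pd

addresses = pd.DataFrame({"state": ["NSW", "New South Wales", "VIC", "Victoria", "QLD"]})

# Map the long names onto the abbreviations already used elsewhere.
addresses["state"] = addresses["state"].replace({"New South Wales": "NSW", "Victoria": "VIC"})
```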

owns_car

Encode owns_car as "Yes": 1, "No": 0
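A sketch of the encoding with `map` (sample values are made up):

```python
import pandas as pd

customers = pd.DataFrame({"owns_car": ["Yes", "No", "Yes"]})

# Binary-encode the flag: Yes -> 1, No -> 0.
customers["owns_car"] = customers["owns_car"].map({"Yes": 1, "No": 0})
```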

deceased_indicator

Delete customers who are deceased, then delete the column

Now all remaining data points are customers who are alive, so we will delete this column.

drop deceased_indicator column

Create new columns

age
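A sketch of deriving age from DOB. Measuring against a fixed reference date keeps the result reproducible; the reference date used here (end of the transaction period) is an assumption, as is the sample data:

```python
import pandas as pd

customers = pd.DataFrame({"DOB": pd.to_datetime(["1980-06-15", "1990-03-01"])})

# Age in whole years relative to a fixed reference date (assumed: 2017-12-31,
# the end of the transaction period). The 365-day division is approximate
# across leap years but adequate for segmentation.
ref = pd.Timestamp("2017-12-31")
customers["age"] = (ref - customers["DOB"]).dt.days // 365
print(customers)
```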

Check for duplicates in columns

As expected, customer_id has all unique values in this dataset, so it can serve as a primary key.

Change data type

All columns are in order

Missing Value treatment

First name

We can delete these rows, as there is no information in most of the columns; but before that, we will check whether these customer_ids have any transactions in the transaction dataset.

Although there are transactions relating to these customer_ids, most of the other columns are empty, so we will delete these rows anyway.

This addresses missing values from most of the columns

job_industry_category

Replaced the missing values in job_industry_category with the group mode over 'gender', 'age', and 'postcode'. Some values are still missing; replace them with the mode after grouping by state.
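The grouped-mode fill can be sketched as below. For brevity this sketch groups only on state (the text's fallback step); the sample data and the helper name `group_mode_fill` are made up:

```python
import numpy as np
import pandas as pd

customers = pd.DataFrame({
    "state": ["NSW", "NSW", "NSW", "VIC", "VIC"],
    "job_industry_category": ["Health", np.nan, "Health", "IT", np.nan],
})

def group_mode_fill(df, target, by):
    """Fill NaNs in `target` with the mode of each group defined by `by`."""
    return df.groupby(by)[target].transform(
        lambda s: s.fillna(s.mode()[0]) if not s.mode().empty else s
    )

customers["job_industry_category"] = group_mode_fill(customers, "job_industry_category", "state")
print(customers)
```

The same helper applies to the finer groupings ('gender', 'age', 'postcode') by passing a list of columns as `by`.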

job_title

Replaced the missing values in job_title with the group mode over 'gender', 'age', 'job_industry_category', and 'postcode'.

Replaced the remaining missing values in job_title with the group mode over 'state' and 'job_industry_category'.

Replaced any still-missing values in job_title with the overall mode.

deceased_indicator

Though deceased_indicator has no missing values, it is difficult to say if the values are correct.
If we wish to retain such data, it has to be updated frequently.

Rows where address, postcode, state, country, and property_valuation are blank

These are the rows where address, postcode, state, country, and property_valuation are all blank.
Let's check whether there is any transaction information for them.

There are 26 transactions with these customer_ids, so we will not delete these rows.

Replace missing values in country with Australia

We will replace missing values in state with the overall mode, missing values in postcode with the mode after grouping by state, and missing values in property_valuation with the mean after grouping by state.
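The three fills can be sketched together on made-up sample data:

```python
import numpy as np
import pandas as pd

addresses = pd.DataFrame({
    "state": ["NSW", "NSW", np.nan, "VIC"],
    "postcode": [2000.0, 2000.0, np.nan, 3000.0],
    "property_valuation": [8.0, 10.0, np.nan, 6.0],
})

# 1) state: overall mode.
addresses["state"] = addresses["state"].fillna(addresses["state"].mode()[0])

# 2) postcode: mode within each state.
addresses["postcode"] = addresses.groupby("state")["postcode"].transform(
    lambda s: s.fillna(s.mode()[0]) if not s.mode().empty else s
)

# 3) property_valuation: mean within each state.
addresses["property_valuation"] = addresses.groupby("state")["property_valuation"].transform(
    lambda s: s.fillna(s.mean())
)
```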

customer_id

'first_name','last_name','address','postcode','country'

Columns such as 'first_name', 'last_name', 'address', 'postcode', and 'country' are not relevant for model building; we will drop these columns.

Delete remaining rows with null values

There are still some rows with missing values; for now we will delete the rows where data is missing.

Join customer data and transaction datasets

Missing Value treatment

delete rows without transaction

delete all rows where gender and other details are not available

Prepare data for model building

RFM analysis

RFM stands for recency, frequency, monetary value.
In business analytics, we often use this concept to divide customers into different segments, such as high-value, medium-value, and low-value customers.

Calculating Recency

Here we are calculating recency for customers who have made a purchase with the company.

Since all the date values belong to 2017, we will take Jan 1, 2018 as the benchmark for calculating recency.
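A sketch of the recency calculation against the 2018-01-01 benchmark, with made-up transaction dates:

```python
import pandas as pd

transactions = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "transaction_date": pd.to_datetime(["2017-03-01", "2017-12-02", "2017-06-30"]),
})

# Recency = days between each customer's LAST purchase and the benchmark date.
benchmark = pd.Timestamp("2018-01-01")
recency = (
    transactions.groupby("customer_id")["transaction_date"].max()
    .apply(lambda d: (benchmark - d).days)
    .rename("recency")
)
print(recency)
```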

Calculating Frequency

Here we are calculating how frequently each customer orders/buys products from the company: count the unique transactions for each customer.
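The per-customer transaction count can be sketched as (sample data is made up):

```python
import pandas as pd

transactions = pd.DataFrame({
    "customer_id": [1, 1, 1, 2],
    "transaction_id": [101, 102, 103, 104],
})

# Frequency = number of unique transactions per customer.
frequency = transactions.groupby("customer_id")["transaction_id"].nunique().rename("frequency")
```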

Calculating Monetary Value

Here we are calculating the monetary value of each customer's spend on the company's products: total list_price (spending) for each customer.
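The per-customer spend can be sketched as (sample data is made up):

```python
import pandas as pd

transactions = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "list_price": [100.0, 250.0, 75.0],
})

# Monetary value = total list_price per customer.
monetary = transactions.groupby("customer_id")["list_price"].sum().rename("monetary")
```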

Here online_order indicates the most preferred order mode, online or offline.

Rename columns

Ranking customers based upon their recency, frequency, and monetary scores

Here we are normalizing the customers' ranks so that the three dimensions are comparable.

Calculating RFM score

The RFM score is calculated from the normalized recency, frequency, and monetary-value ranks.
Based upon this score we divide our customers, rating them on a scale of 5.
The formula used for calculating the RFM score is: 0.15 × Recency score + 0.28 × Frequency score + 0.57 × Monetary score.

Weights.

With the weights above, the highest importance when calculating scores is given to the monetary value, followed by frequency, and then recency. If required, you can amend the weighting of one or several of these to change which is given the highest importance.

The RFM score is calculated as follows: (Recency score x Recency weight) + (Frequency score x Frequency weight) + (Monetary score x Monetary weight).
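The scoring and rating steps can be sketched together. The R, F, M values are made up, the weights are the ones stated above, and the cut edges for the 1-5 rating are an assumption:

```python
import pandas as pd

# Made-up R/F/M values for three customers (index = customer_id).
rfm = pd.DataFrame(
    {"recency": [30, 185, 90], "frequency": [3, 1, 2], "monetary": [350.0, 75.0, 200.0]},
    index=[1, 2, 3],
)

# Normalized ranks in (0, 1]; lower recency is better, so rank it descending.
r_rank = rfm["recency"].rank(ascending=False, pct=True)
f_rank = rfm["frequency"].rank(pct=True)
m_rank = rfm["monetary"].rank(pct=True)

# Weighted score on a 0-5 scale, using the weights stated in the text.
rfm["rfm_score"] = 5 * (0.15 * r_rank + 0.28 * f_rank + 0.57 * m_rank)

# Rate customers 1-5 by binning the score (bin edges are an assumption).
rfm["rating"] = pd.cut(rfm["rfm_score"], bins=[0, 1, 2, 3, 4, 5], labels=[1, 2, 3, 4, 5]).astype(int)
print(rfm)
```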

Rating Customer based upon the RFM score

Load New customer details

There are some unnamed columns; we will delete them.
We can also delete first_name and last_name.

Create a dummy column called customer_id for convenience

Change datatypes

Create new columns

age

Looking at categorical variables

gender

Change "U" to "Unidentified"

owns_car

Encode owns_car as "Yes": 1, "No": 0

state

In state, NSW and New South Wales are the same, as are VIC and Victoria; we can standardize these values.

drop columns deceased_indicator and country

Missing Value Treatment

There are some null values in the age, job_industry_category, and job_title columns; we will replace them as done previously.

job_industry_category

Replaced the missing values in job_industry_category with the group mode over 'gender', 'age', and 'postcode'. Some values are still missing; replace them with the mode after grouping by state.

job_title

Replaced the missing values in job_title with the group mode over 'gender', 'age', 'job_industry_category', and 'postcode'.

Replaced the remaining missing values in job_title with the group mode over 'state' and 'job_industry_category'.

Replaced any still-missing values in job_title with the overall mode.

age

Replaced the missing values in age with the group mean over 'postcode', 'gender', 'job_industry_category', and 'job_title'.

No more missing values.