Used Car Price Prediction

A step-by-step approach to predicting used car prices using machine learning

Potential Business: Pandemic-Driven Purchase

In recent times, due to the pandemic, many individuals who previously preferred public transport are looking forward to owning a vehicle and reducing their dependency on public transport.
The fear of contracting the virus while using public transport has prompted consumers to own a vehicle. Also, used cars are mostly preferred by those who cannot afford to buy new cars at higher prices. With the increase in prices of new vehicles, and considering affordability, we can observe a growth trend in the used car market.
In general, a seller decides a price at random and the buyer has no idea about the vehicle or its value in the market. It is also possible that the seller has no idea about the vehicle's value or the price at which he should be selling the car.
To address this problem I have developed a Used Car Price Prediction system which can effectively determine the price of a vehicle using various features.
I used regression algorithms, which provide a continuous value as the recommended selling price.

If you like this notebook, please vote it up and share your feedback in the comment box.

Data Description

Selling_Price : The price of the used car, in INR Lakhs.
Name : The brand and model of the car.
Location : The location where the car is being sold or is available for purchase.
Year : The year or edition of the model.
Kilometers_Driven : The total kilometers driven by the previous owner(s), in km.
Fuel_Type : The type of fuel used by the car (Petrol, Diesel, Electric, CNG, LPG).
Transmission : The type of transmission used by the car (Automatic / Manual).
Seller_Type : Whether the seller is an Individual, Dealer or Trustmark Dealer.
Owner_Type : Whether the ownership is first hand, second hand or other.
Mileage : The standard mileage offered by the car company, in kmpl (Petrol/Diesel) or km/kg (CNG/LPG).
Current_Mileage : The current mileage claimed by the seller.
Engine : The displacement volume of the engine, in CC.
Power : The maximum power of the engine, in bhp.
Seats : The number of seats in the car.
New_Price : The latest price of a new vehicle of the same model.

Importing necessary libraries

Load Data

Data Overview

Let us look at the data.

Train Data

Selling Price

Selling price is given in INR Lakhs; we will multiply the column by 1,00,000 (one Lakh) to get the price in INR.
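As a sketch of this step (the sample values are illustrative, not from the dataset), the conversion is a single vectorized multiplication:

```python
import pandas as pd

# Illustrative sample; Selling_Price is recorded in INR Lakhs (1 Lakh = 100,000 INR)
train = pd.DataFrame({"Selling_Price": [1.75, 12.5, 4.5]})

# Convert Lakhs to absolute INR
train["Selling_Price"] = train["Selling_Price"] * 100_000
```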

Test Data

There are columns such as Current_Mileage, Engine, Power, Seats and New_Price which have null values.
Also, the data types of some columns have to be changed to meet the requirements.

Converting to appropriate data type

Though the data types of Year and Seats are 'int64' and 'float64' respectively, using them as-is will not be useful for our evaluation.
We need to convert these parameters into datetime and categorical (nominal) types respectively.
Also, Location, Fuel_Type, Transmission, Owner_Type and Seller_Type can all be changed to categorical variables.
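The conversions above can be sketched as follows (the sample frame is hypothetical):

```python
import pandas as pd

# Hypothetical sample mirroring the columns described above
df = pd.DataFrame({
    "Year": [2014, 2018],
    "Seats": [5.0, 7.0],
    "Fuel_Type": ["Petrol", "Diesel"],
    "Transmission": ["Manual", "Automatic"],
})

# Year -> datetime; Seats and the nominal columns -> category
df["Year"] = pd.to_datetime(df["Year"].astype(str), format="%Y")
for col in ["Seats", "Fuel_Type", "Transmission"]:
    df[col] = df[col].astype("category")
```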

Seats Column

There are records where Seats is given as 0, which is misleading. We can assume that the number of seats was not recorded for those entries. We will treat these as NaN and impute them using make and model information.
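A minimal sketch of marking the zero seat counts as missing (the imputation by make and model follows later):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Seats": [5.0, 0.0, 7.0]})  # illustrative values

# A seat count of 0 cannot be real; treat it as missing for later imputation
df["Seats"] = df["Seats"].replace(0, np.nan)
```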

Year column

We will calculate the age of the vehicle from the Year column and then drop the Year column.
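A sketch of deriving the age, assuming a fixed reference year (2021 here is a placeholder for whenever the data was collected):

```python
import pandas as pd

CURRENT_YEAR = 2021  # assumed reference year; set to the data-collection year

df = pd.DataFrame({"Year": [2014, 2018]})  # illustrative values
df["Vehicle_Age"] = CURRENT_YEAR - df["Year"]
df = df.drop(columns=["Year"])
```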

Current_Mileage and Mileage

Current_Mileage and Mileage have units which have to be removed, and the values should be converted to the 'float64' data type. When Fuel_Type is CNG/LPG, the units for Current_Mileage and Mileage are km/kg; when Fuel_Type is Petrol/Diesel, the units are kmpl. In addition, there are literal 'null' values in these columns, which can be converted to NaN and imputed later on. The units can be removed using regex or string operations.
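A sketch of the unit stripping with a regex; the string 'null' contains no digits, so the extract leaves it as NaN:

```python
import pandas as pd

df = pd.DataFrame({"Mileage": ["23.4 kmpl", "26.6 km/kg", "null"]})  # illustrative

# Pull out the leading number; non-matching entries ('null') become NaN
df["Mileage"] = (
    df["Mileage"].str.extract(r"([\d.]+)", expand=False).astype("float64")
)
```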

Feature Engineering

Missing Value Treatment

Check for null/ nan in Dataset

For electric vehicles, instead of mileage there is a parameter called range (distance in km per charge).
Also, there is no Electric fuel type in the test data. To maintain uniformity, these two rows can be dropped.

There are records where Current_Mileage is given as 0, which is misleading. We can assume that the Current_Mileage was not recorded. We will treat these as NaN and, during imputation, replace them with the mileage value given by the manufacturer.
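A sketch of that fallback, replacing zeros with NaN and then filling from the manufacturer's Mileage column:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({  # illustrative values
    "Mileage": [23.4, 18.9, 26.6],
    "Current_Mileage": [21.0, 0.0, np.nan],
})

# 0 means 'not recorded'; mark it missing, then fall back to the manufacturer figure
df["Current_Mileage"] = df["Current_Mileage"].replace(0, np.nan)
df["Current_Mileage"] = df["Current_Mileage"].fillna(df["Mileage"])
```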

Removing suffixes from values in a column

For Engine the units are CC.
For Power the units are bhp. In addition, there are 'null bhp' values in the column, which can be converted to NaN and imputed later on.
The units can be removed using regex or string operations.
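The same extract-based approach works for Engine and Power; 'null bhp' has no digits, so it becomes NaN automatically:

```python
import pandas as pd

df = pd.DataFrame({  # illustrative values
    "Engine": ["1197 CC", "1498 CC"],
    "Power": ["81.8 bhp", "null bhp"],
})

for col in ["Engine", "Power"]:
    df[col] = df[col].str.extract(r"([\d.]+)", expand=False).astype("float64")
```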

Splitting columns

We can extract the make and model information from Name.
We can split Make and Model into separate columns, which can be used further, and drop the Name column.
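A sketch of the split, taking the first token as the make and the second as the model (a simplification: multi-word model names such as "Wagon R" are truncated to their first word):

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Maruti Wagon R LXI CNG", "Hyundai Creta 1.6 CRDi SX"]})

tokens = df["Name"].str.split()
df["Make"] = tokens.str[0]
df["Model"] = tokens.str[1]
df = df.drop(columns=["Name"])
```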

Impute missing values

Impute missing values in 'Power','Mileage', 'Seats','Engine','Seller_Type'.

Different variants under the same make and model tend to have the same engine capacity and power.
Missing values in the Engine and Power columns can therefore be imputed based on the available Make and Model information.
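A sketch of group-wise imputation using a per-group median (the median here is an assumption; any group statistic works the same way):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({  # illustrative values
    "Make":  ["Maruti", "Maruti", "Hyundai", "Hyundai"],
    "Model": ["Swift",  "Swift",  "Creta",   "Creta"],
    "Power": [81.8,     np.nan,   126.2,     126.2],
})

# Fill a missing Power with the median of the same Make/Model group
df["Power"] = df.groupby(["Make", "Model"])["Power"].transform(
    lambda s: s.fillna(s.median())
)
```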

Checking Distribution of Numeric Variables

As a rule of thumb, skewness can be interpreted like this:

Fairly symmetrical : -0.5 to 0.5
Moderately skewed : -1.0 to -0.5 or 0.5 to 1.0
Highly skewed : < -1.0 or > 1.0
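Skewness can be computed directly with pandas; in this illustrative sample, a single extreme entry pushes the statistic well past the "highly skewed" threshold:

```python
import pandas as pd

km = pd.Series([10_000, 30_000, 45_000, 60_000, 650_000])  # illustrative values
skew = km.skew()  # the one extreme value drives skewness far above 1.0
```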

Correlation between variables

Relation between Mileage, Engine and Power

Mileage, Engine and Power are specifications given by the manufacturer; over time their real values change due to wear and tear, and on their own they cannot be the deciding factor for buying a used vehicle. Hence we will drop these columns.
For understanding, we can first check their correlation with the target value (used car price).

Outlier Treatment

Relation between Kilometers_Driven and vehicle_age

Values above the third quartile of Kilometers_Driven are very high. We will look at them closely to get a better understanding of them.

There is one entry where Kilometers_Driven is 6500000, which is not plausible considering the age of the vehicle.
We will replace this value with the maximum kilometers driven among vehicles of the same age.

There are very few vehicles with very high Kilometers_Driven values; we will cap these values at 300000.
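A sketch of the cap using clip (the 300,000 threshold comes from the discussion above):

```python
import pandas as pd

df = pd.DataFrame({"Kilometers_Driven": [45_000, 720_000, 300_000]})  # illustrative

# Cap implausibly high odometer readings at 300,000 km
df["Kilometers_Driven"] = df["Kilometers_Driven"].clip(upper=300_000)
```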

Though we have addressed the extreme values in Kilometers_Driven, from the above plot we can see that the distribution of the data is still skewed.
We will use a Box-Cox transformation to address the skewness and normalize the data.
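A sketch of the transformation with scipy (the sample values are illustrative); boxcox requires strictly positive input and returns both the transformed data and the fitted lambda:

```python
import numpy as np
from scipy import stats

km = np.array([5_000, 20_000, 45_000, 60_000, 90_000, 300_000], dtype=float)

# Box-Cox needs strictly positive values; lam is the fitted lambda
km_bc, lam = stats.boxcox(km)
```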

Box-Cox Transformation for numeric variables

We have successfully imputed all the missing values and normalized the data, so we can be assured that we have not lost any data. Since the number of missing values is not high, we could also have chosen to delete those rows, but that would have led to loss of data.

Data Grouping

We can group categories within the categorical variables. In this method we check the unique categories within each categorical variable and reduce their number, keeping only the most relevant categories.
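One common way to do this (a sketch; the frequency threshold is an assumption) is to keep frequent categories and collapse the rest into 'Other':

```python
import pandas as pd

makes = pd.Series(["Maruti"] * 5 + ["Hyundai"] * 4 + ["Isuzu", "Force"])  # illustrative

# Keep categories seen at least 3 times; collapse the long tail into 'Other'
counts = makes.value_counts()
keep = counts[counts >= 3].index
makes = makes.where(makes.isin(keep), "Other")
```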

Converting categorical variables into binary variables and adding new columns

Delete Make and Model columns

Dummy Variable Creation
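With pandas this step is a one-liner via get_dummies (shown here on a single illustrative column):

```python
import pandas as pd

df = pd.DataFrame({"Fuel_Type": ["Petrol", "Diesel", "Petrol"]})
# One binary column per category, named <column>_<category>
dummies = pd.get_dummies(df, columns=["Fuel_Type"])
```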

Correlation matrix

We can delete columns which have a high correlation with other columns. The following pairs are highly correlated with each other:

Fuel_Type_Diesel - Fuel_Type_Petrol
Transmission_Automatic - Transmission_Manual
Seller_Type_Dealer - Seller_Type_Individual
Owner_Type_Second - Owner_Type_First
Seats_6nabove - Seats5.0

Hence we can drop 'Fuel_Type_Diesel', 'Transmission_Automatic', 'Seller_Type_Dealer', 'Owner_Type_Second' and 'Seats_6nabove'.

Feature Importance

With feature importance we can understand which features are most important for price prediction.

To reduce model complexity, we will reduce the number of features based on their importance: features that the feature importance plot shows to be unimportant can be deleted before model building.
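A sketch on synthetic data (the feature names and the price formula are invented for illustration): the importances from a fitted random forest sum to 1 and can be sorted to rank features:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = pd.DataFrame({
    "Vehicle_Age": rng.integers(1, 15, 200),
    "Kilometers_Driven": rng.integers(5_000, 200_000, 200),
    "noise": rng.normal(size=200),
})
# Synthetic price, driven mainly by age
y = 1_000_000 - 50_000 * X["Vehicle_Age"] + rng.normal(0, 10_000, 200)

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
importances = (
    pd.Series(model.feature_importances_, index=X.columns)
    .sort_values(ascending=False)
)
```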

Model Building

Choosing best fit model for our dataset :

We need to select a model which can make predictions on non-linear data and on a combination of categorical and numerical features.
We will evaluate linear regression, decision tree, random forest and XGBoost.
The model with the highest accuracy can be selected.
We will choose our best-fit model using the cross-validation score.
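The comparison can be sketched with cross_val_score on synthetic non-linear data (XGBoost is omitted here to keep the sketch dependency-free); the random forest captures the non-linearity that linear regression cannot:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 3))
y = 10 * np.sin(X[:, 0]) + X[:, 1] + rng.normal(0, 0.5, 200)  # non-linear target

models = {
    "linear": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
}
# Mean R^2 over 5 folds for each candidate model
results = {name: cross_val_score(m, X, y, cv=5).mean() for name, m in models.items()}
```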

From the accuracy scores above, we can say that the Random Forest regressor gives better accuracy with a very low standard deviation.

Random Forest Regressor

We will build a Random Forest regressor model.
We will also use RandomizedSearchCV for hyperparameter tuning.

Hyper parameter tuning

The result of a hyperparameter optimization is a single set of well-performing hyperparameters that you can use to configure your model.
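A sketch of the search on synthetic data (the parameter ranges are illustrative; tune them to the data, and use a larger n_iter in practice):

```python
import numpy as np
from scipy.stats import randint
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(150, 3))
y = 10 * np.sin(X[:, 0]) + X[:, 1] + rng.normal(0, 0.5, 150)

param_dist = {            # illustrative search space
    "n_estimators": randint(20, 100),
    "max_depth": randint(3, 15),
    "min_samples_leaf": randint(1, 5),
}
search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions=param_dist,
    n_iter=5,             # small for this sketch
    cv=3,
    random_state=0,
)
search.fit(X, y)
best = search.best_params_   # the single best-performing parameter set
```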

Comparison of Predictions with Actual Values

We can use a scatter plot and a distribution graph to visually understand how our predicted values compare to the actual values.

From the scatter plot above we can understand that, though the distribution plot is close to a Gaussian, the difference between 'y_test' (actual value) and 'y_predicted' (predicted value) is spread across a wide range.

Make Predictions on Test Data