Predicting the probability that a given blight ticket will be paid¶
Blight violations in the city of Detroit have been a difficult problem to solve. Every year, the city issues millions of dollars in fines to residents, and every year many of these fines remain unpaid. Enforcing unpaid blight fines is a costly and tedious process, so the city wants to know: how can we increase blight ticket compliance?
Blighted property includes commercial or residential structures that have been declared uninhabitable and hazardous by the City and that exhibit objectively determinable signs of deterioration constituting a threat to human health, safety, and public welfare.
In this notebook, I will focus on the problem at hand: helping the city enforce fines by predicting whether a person will comply with a property maintenance fine.
The first step in answering this question is understanding when and why a resident might fail to comply with a blight ticket. This is where predictive modeling comes in: the task is to predict whether a given blight ticket will be paid on time.
The Dataset¶
I have two data files for use in training and validating my models: train.csv and test.csv. Each row in these two files corresponds to a single blight ticket, and includes information about when, why, and to whom each ticket was issued.
The target variable is compliance, which is True if the ticket was paid early, on time, or within one month of the hearing date; False if the ticket was paid after the hearing date or not at all; and Null if the violator was found not responsible.
Note: tickets where the violator was found not responsible are not considered during evaluation. They are included in the training set as an additional source of data for visualization and to enable unsupervised and semi-supervised approaches; however, they are not included in the test set.
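Since the Null tickets carry no label, a natural first step for supervised modeling is to set them aside. A minimal sketch (column names as in the data description below):

import pandas as pd

train = pd.read_csv('train.csv', encoding='ISO-8859-1')
# Tickets where the violator was found not responsible have a Null
# compliance label; exclude them before supervised training.
labeled = train[train['compliance'].notnull()].copy()
labeled['compliance'] = labeled['compliance'].astype(int)  # 0/1 target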
File descriptions
train.csv - the training set (all tickets issued 2004-2011)
test.csv - the test set (all tickets issued 2012-2016)
addresses.csv & latlons.csv - mapping from ticket id to addresses, and from addresses to lat/lon coordinates.
Note: misspelled addresses may be incorrectly geolocated.
Data fields
train.csv & test.csv
ticket_id - unique identifier for tickets
agency_name - Agency that issued the ticket
inspector_name - Name of inspector that issued the ticket
violator_name - Name of the person/organization that the ticket was issued to
violation_street_number, violation_street_name, violation_zip_code - Address where the violation occurred
mailing_address_str_number, mailing_address_str_name, city, state, zip_code, non_us_str_code, country - Mailing address of the violator
ticket_issued_date - Date and time the ticket was issued
hearing_date - Date and time the violator's hearing was scheduled
violation_code, violation_description - Type of violation
disposition - Judgment and judgment type
fine_amount - Violation fine amount, excluding fees
admin_fee - $20 fee assigned to responsible judgments
state_fee - $10 fee assigned to responsible judgments
late_fee - 10% fee assigned to responsible judgments
discount_amount - discount applied, if any
clean_up_cost - DPW clean-up or graffiti removal cost
judgment_amount - Sum of all fines and fees
grafitti_status - Flag for graffiti violations
train.csv only (outcome fields that are absent from the test set; see the sketch after this list)
payment_amount - Amount paid, if any
payment_date - Date payment was made, if it was received
payment_status - Current payment status as of Feb 1 2017
balance_due - Fines and fees still owed
collection_status - Flag for payments in collections
compliance [target variable for prediction]
Null = Not responsible
0 = Responsible, non-compliant
1 = Responsible, compliant
compliance_detail - More information on why each ticket was marked compliant or non-compliant
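Because these fields describe the eventual outcome and do not exist in test.csv, they must be dropped before feature engineering; otherwise the model trains on information that is unavailable at prediction time. Continuing the labeled frame from the sketch above:

# outcome columns present only in train.csv; keeping them would leak the target
leaky_cols = ['payment_amount', 'payment_date', 'payment_status',
              'balance_due', 'collection_status', 'compliance_detail']
features = labeled.drop(columns=leaky_cols)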
Evaluation¶
The predictions are given as the probability that the corresponding blight ticket will be paid on time.
The evaluation metric for this assignment is the Area Under the ROC Curve (AUC).
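AUC scores how well the predicted probabilities rank compliant tickets above non-compliant ones, independent of any decision threshold; 0.5 is random and 1.0 is perfect. A tiny self-contained example with scikit-learn:

from sklearn.metrics import roc_auc_score

# four tickets: two unpaid (0), two paid (1), with predicted probabilities
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]
print(roc_auc_score(y_true, y_prob))  # 0.75: three of four (paid, unpaid) pairs are ranked correctly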
Import Libraries¶
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings("ignore")
# Importing re package
import re
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import GridSearchCV
%matplotlib inline
class color:
    BLUE = '\033[94m'
    PURPLE = '\033[95m'
    RED = '\033[91m'
    BOLD = '\033[1m'
    END = '\033[0m'
Load Train and Test Data¶
train_data = pd.read_csv('train.csv', encoding='ISO-8859-1')
test_data = pd.read_csv('test.csv', encoding='ISO-8859-1')
Data Overview¶
Let us look at the data.
def overview(df):
    print(color.BOLD + "There are {} rows and {} columns in the dataset.".format(df.shape[0], df.shape[1]), "\n" + color.END)
    print(color.BOLD + color.BLUE + "Let's look at the data types available in the dataset" + color.END)
    df.info()
overview(train_data)
print(color.BOLD +color.BLUE +"\n","Summary statistics of dataset"+ color.END)
train_data.describe()
There are 250306 rows and 34 columns in the dataset.

Let's look at the data types available in the dataset
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 250306 entries, 0 to 250305
Data columns (total 34 columns):
 #   Column                      Non-Null Count   Dtype
---  ------                      --------------   -----
 0   ticket_id                   250306 non-null  int64
 1   agency_name                 250306 non-null  object
 2   inspector_name              250306 non-null  object
 3   violator_name               250272 non-null  object
 4   violation_street_number     250306 non-null  float64
 5   violation_street_name       250306 non-null  object
 6   violation_zip_code          0 non-null       float64
 7   mailing_address_str_number  246704 non-null  float64
 8   mailing_address_str_name    250302 non-null  object
 9   city                        250306 non-null  object
 10  state                       250213 non-null  object
 11  zip_code                    250305 non-null  object
 12  non_us_str_code             3 non-null       object
 13  country                     250306 non-null  object
 14  ticket_issued_date          250306 non-null  object
 15  hearing_date                237815 non-null  object
 16  violation_code              250306 non-null  object
 17  violation_description       250306 non-null  object
 18  disposition                 250306 non-null  object
 19  fine_amount                 250305 non-null  float64
 20  admin_fee                   250306 non-null  float64
 21  state_fee                   250306 non-null  float64
 22  late_fee                    250306 non-null  float64
 23  discount_amount             250306 non-null  float64
 24  clean_up_cost               250306 non-null  float64
 25  judgment_amount             250306 non-null  float64
 26  payment_amount              250306 non-null  float64
 27  balance_due                 250306 non-null  float64
 28  payment_date                41113 non-null   object
 29  payment_status              250306 non-null  object
 30  collection_status           36897 non-null   object
 31  grafitti_status             1 non-null       object
 32  compliance_detail           250306 non-null  object
 33  compliance                  159880 non-null  float64
dtypes: float64(13), int64(1), object(20)
memory usage: 64.9+ MB

Summary statistics of dataset
| | ticket_id | violation_street_number | violation_zip_code | mailing_address_str_number | fine_amount | admin_fee | state_fee | late_fee | discount_amount | clean_up_cost | judgment_amount | payment_amount | balance_due | compliance |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 250306.000000 | 2.503060e+05 | 0.0 | 2.467040e+05 | 250305.000000 | 250306.000000 | 250306.000000 | 250306.000000 | 250306.000000 | 250306.0 | 250306.000000 | 250306.000000 | 250306.000000 | 159880.000000 |
mean | 152665.543099 | 1.064986e+04 | NaN | 9.149788e+03 | 374.423435 | 12.774764 | 6.387382 | 21.494506 | 0.125167 | 0.0 | 268.685356 | 48.898986 | 222.449058 | 0.072536 |
std | 77189.882881 | 3.188733e+04 | NaN | 3.602034e+04 | 707.195807 | 9.607344 | 4.803672 | 56.464263 | 3.430178 | 0.0 | 626.915212 | 222.422425 | 606.394010 | 0.259374 |
min | 18645.000000 | 0.000000e+00 | NaN | 1.000000e+00 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | -7750.000000 | 0.000000 |
25% | 86549.250000 | 4.739000e+03 | NaN | 5.440000e+02 | 200.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
50% | 152597.500000 | 1.024400e+04 | NaN | 2.456000e+03 | 250.000000 | 20.000000 | 10.000000 | 10.000000 | 0.000000 | 0.0 | 140.000000 | 0.000000 | 25.000000 | 0.000000 |
75% | 219888.750000 | 1.576000e+04 | NaN | 1.292725e+04 | 250.000000 | 20.000000 | 10.000000 | 25.000000 | 0.000000 | 0.0 | 305.000000 | 0.000000 | 305.000000 | 0.000000 |
max | 366178.000000 | 1.415411e+07 | NaN | 5.111345e+06 | 10000.000000 | 20.000000 | 10.000000 | 1000.000000 | 350.000000 | 0.0 | 11030.000000 | 11075.000000 | 11030.000000 | 1.000000 |
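The compliance mean of roughly 0.073 already tells the key story: only about 7% of responsible violators paid on time, so the classes are heavily imbalanced, which is one reason AUC is a more informative metric here than accuracy. A quick check of the class balance (a sketch using train_data as loaded above):

# share of each class among labeled tickets (Null rows are dropped by default)
print(train_data['compliance'].value_counts(normalize=True))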
overview(test_data)
print(color.BOLD +color.BLUE +"\n","Summary statistics of dataset"+ color.END)
test_data.describe()
There are 61001 rows and 27 columns in the dataset.

Let's look at the data types available in the dataset
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 61001 entries, 0 to 61000
Data columns (total 27 columns):
 #   Column                      Non-Null Count  Dtype
---  ------                      --------------  -----
 0   ticket_id                   61001 non-null  int64
 1   agency_name                 61001 non-null  object
 2   inspector_name              61001 non-null  object
 3   violator_name               60973 non-null  object
 4   violation_street_number     61001 non-null  float64
 5   violation_street_name       61001 non-null  object
 6   violation_zip_code          24024 non-null  object
 7   mailing_address_str_number  59987 non-null  object
 8   mailing_address_str_name    60998 non-null  object
 9   city                        61000 non-null  object
 10  state                       60670 non-null  object
 11  zip_code                    60998 non-null  object
 12  non_us_str_code             0 non-null      float64
 13  country                     61001 non-null  object
 14  ticket_issued_date          61001 non-null  object
 15  hearing_date                58804 non-null  object
 16  violation_code              61001 non-null  object
 17  violation_description       61001 non-null  object
 18  disposition                 61001 non-null  object
 19  fine_amount                 61001 non-null  float64
 20  admin_fee                   61001 non-null  float64
 21  state_fee                   61001 non-null  float64
 22  late_fee                    61001 non-null  float64
 23  discount_amount             61001 non-null  float64
 24  clean_up_cost               61001 non-null  float64
 25  judgment_amount             61001 non-null  float64
 26  grafitti_status             2221 non-null   object
dtypes: float64(9), int64(1), object(17)
memory usage: 12.6+ MB

Summary statistics of dataset
| | ticket_id | violation_street_number | non_us_str_code | fine_amount | admin_fee | state_fee | late_fee | discount_amount | clean_up_cost | judgment_amount |
|---|---|---|---|---|---|---|---|---|---|---|
count | 61001.000000 | 6.100100e+04 | 0.0 | 61001.000000 | 61001.0 | 61001.0 | 61001.000000 | 61001.000000 | 61001.000000 | 61001.000000 |
mean | 331724.532811 | 1.256638e+04 | NaN | 272.714185 | 20.0 | 10.0 | 25.116219 | 0.239340 | 20.649711 | 347.895541 |
std | 25434.932141 | 1.414373e+05 | NaN | 360.101855 | 0.0 | 0.0 | 36.310155 | 3.245894 | 242.375180 | 460.058043 |
min | 284932.000000 | -1.512600e+04 | NaN | 0.000000 | 20.0 | 10.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 310111.000000 | 6.008000e+03 | NaN | 50.000000 | 20.0 | 10.0 | 5.000000 | 0.000000 | 0.000000 | 85.000000 |
50% | 332251.000000 | 1.213400e+04 | NaN | 200.000000 | 20.0 | 10.0 | 10.000000 | 0.000000 | 0.000000 | 250.000000 |
75% | 353031.000000 | 1.716500e+04 | NaN | 250.000000 | 20.0 | 10.0 | 25.000000 | 0.000000 | 0.000000 | 305.000000 |
max | 376698.000000 | 2.010611e+07 | NaN | 10000.000000 | 20.0 | 10.0 | 1000.000000 | 250.000000 | 15309.000000 | 15558.800000 |
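The training set has 34 columns while the test set has only 27, so it is worth confirming exactly which columns are train-only before building features. A small sketch with the two frames loaded above:

# columns present in train.csv but absent from test.csv (the outcome fields)
print(sorted(set(train_data.columns) - set(test_data.columns)))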
Load Address Data and Create a Map of Property Locations¶
from geopy.geocoders import Nominatim
from folium.plugins import HeatMap
import folium
from tqdm import tqdm
# load the address datasets
addresses = pd.read_csv('addresses.csv')
lat_lons = pd.read_csv('latlons.csv')
# join them: ticket_id -> address -> lat/lon
id_address = addresses.set_index('address').join(lat_lons.set_index('address')).reset_index().set_index('ticket_id')
print(color.BOLD + "There are {} rows and {} columns in the dataset.".format(id_address.shape[0],id_address.shape[1]),"\n"+ color.END)
There are 311307 rows and 3 columns in the dataset.
print(color.BOLD +color.BLUE +"\n","Summary statistics of dataset"+ color.END)
id_address.describe()
Summary statistics of dataset
| | lat | lon |
|---|---|---|
count | 311299.000000 | 311299.000000 |
mean | 42.389822 | -83.112668 |
std | 0.038155 | 0.095320 |
min | 41.785926 | -88.081348 |
25% | 42.364200 | -83.188845 |
50% | 42.393794 | -83.126422 |
75% | 42.419709 | -83.036160 |
max | 45.809387 | -82.433593 |
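The lat/lon count of 311299 against 311307 rows shows that eight tickets failed to geocode, consistent with the note above about misspelled addresses. One hedged way to handle them is mean imputation (dropping those rows would be just as reasonable):

# eight tickets have no coordinates; fall back to the citywide mean location
id_address['lat'] = id_address['lat'].fillna(id_address['lat'].mean())
id_address['lon'] = id_address['lon'].fillna(id_address['lon'].mean())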
address_locations = pd.DataFrame(id_address['address'].value_counts().reset_index())
address_locations.columns = ['address', 'count']
# merge the coordinates back in, then keep one row per address so the
# heatmap weight ('count') is not multiplied by the number of duplicate rows
address_locations = address_locations.merge(id_address, on='address', how='left').dropna().drop_duplicates(subset='address')
address_locations['count'].max()
203
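So the single most-ticketed address accounts for 203 tickets. Before plotting, it can help to inspect which addresses dominate (a sketch using address_locations as built above):

# five most-ticketed addresses with their coordinates
print(address_locations.nlargest(5, 'count')[['address', 'count', 'lat', 'lon']])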
def defaultmap(default_location=[42.3314, -83.0458], default_zoom_start=9):
    base_map = folium.Map(location=default_location, control_scale=True, zoom_start=default_zoom_start)
    return base_map
basemap = defaultmap()
HeatMap(address_locations[['lat', 'lon', 'count']].values.tolist(), zoom=20, radius=15).add_to(basemap)
<folium.plugins.heat_map.HeatMap at 0x286dbb2e8f0>
basemap
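The map renders inline; to keep a shareable copy, a folium map can also be written to a standalone HTML file (the filename here is arbitrary):

# save the interactive heatmap outside the notebook
basemap.save('blight_ticket_heatmap.html')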