Data cleansing plays a key role in building machine learning models. A significant share of any data scientist's time goes to data cleansing. In many cases, cleaner data gives us the desired results even with a basic algorithm. There are different types of data and they all require different techniques, but the steps below should serve as a guide to kick off your journey in data cleaning using Python.
The examples in this post use a CSV file, train.csv, with columns including Survived, Pclass, Sex, SibSp, Parch, Fare, and Age.
Let us start by importing the necessary libraries and dataset:
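As a minimal sketch of this step, the snippet below imports pandas and loads CSV data into a data frame. The inline sample text and its values are made up for illustration so that the snippet runs on its own; with the real file you would pass its path to pd.read_csv instead.

```python
import io
import pandas as pd

# Hypothetical stand-in for the real file; these rows are made up for illustration.
sample_csv = """Survived,Pclass,Sex,SibSp,Parch,Fare,Age
0,3,male,1,0,7.25,22
1,1,female,1,0,71.28,38
1,3,female,0,0,7.92,
0,3,male,0,0,8.05,35
0,3,male,0,0,8.46,
"""

# With the real dataset this would be: data = pd.read_csv("train.csv")
data = pd.read_csv(io.StringIO(sample_csv))
print(data.shape)  # (rows, columns)
```

Empty fields in the CSV, such as the missing Age entries above, are read in as NaN.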
The first step is to take a look at the data. This gives us a brief idea of how the data looks and if there are any anomalies in the dataset. The most common issue encountered during data cleaning is missing values. You can look at the first few rows in the data frame to understand more about the data.
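A quick first look might be sketched as follows. The small data frame is a made-up stand-in for the real dataset; head, info, and describe are standard pandas methods.

```python
import numpy as np
import pandas as pd

# Made-up rows standing in for the real dataset.
data = pd.DataFrame({
    "Pclass": [3, 1, 3, 3],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, np.nan, 35.0],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

print(data.head(3))     # first three rows
data.info()             # column types and non-null counts
print(data.describe())  # summary statistics for the numeric columns
```

The non-null counts reported by info are often the fastest way to spot which columns have gaps.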
Fields in a dataset that have missing values are represented as NaN (not a number) or None. Here we can see that our dataset has empty cells, and they can skew our results. In fact, most models do not accept missing values.
We can detect the cells with missing values and count them using the following commands:
missing_values = data.isnull()  # True wherever a cell is missing
missing_counts = data.isnull().sum()  # number of missing cells per column
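To make the counting concrete, here is a small worked example on made-up data. The column names follow the dataset used in this post, but the values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Made-up values; only the pattern of missing cells matters here.
data = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, np.nan],
    "Cabin": [None, "C85", None, None],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

missing_counts = data.isnull().sum()       # missing cells per column
missing_share = data.isnull().mean()       # fraction of missing cells per column
total_missing = int(missing_counts.sum())  # missing cells in the whole frame

print(missing_counts)
print(missing_share)
print(total_missing)
```

The per-column share is useful later when deciding whether a column is too empty to keep.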
Let us discuss some of the ways to deal with missing values:
1) Deleting rows/columns with missing values:
This is the simplest way to handle missing values in your dataset. As the name suggests, this step simply involves deleting the rows or columns that contain NaN or None. This approach is only recommended when you have enough data in your dataset. The downside is that you lose data, which can affect your results. Hence, this is usually not the best way to deal with missing values. However, it can be useful when most values in a column are empty.
To drop data with missing values we can use:
my_data_without_missing_value = original_data.dropna()  # with inplace=True, dropna returns None
We can choose to drop a feature if we deem it to be unimportant to us. To drop a feature we can use the following:
columns_to_drop = ["Cabin", "Embarked"]
df_without_features = df.drop(columns_to_drop, axis=1)
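The deletion options above can be compared side by side on a small made-up data frame. The 50% cutoff for a "mostly empty" column is an assumption chosen for illustration, not a fixed rule.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration.
df = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, 54.0, 2.0],
    "Cabin": [None, "C85", None, None, "E46"],
    "Fare": [7.25, 71.28, 7.92, 8.05, 21.07],
})

rows_dropped = df.dropna()        # keep only rows with no missing values
cols_dropped = df.dropna(axis=1)  # keep only columns with no missing values

# Drop only the columns where more than half of the values are missing
# (the 0.5 threshold is an illustrative assumption).
mostly_empty = df.columns[df.isnull().mean() > 0.5]
trimmed = df.drop(columns=mostly_empty)

print(len(rows_dropped), list(cols_dropped.columns), list(trimmed.columns))
```

Dropping rows leaves a single row out of five here, which shows how quickly this approach can eat into a small dataset.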
2) Replacing with mean/median/mode:
When you have numeric data, you can use the mean, median, or mode to fill in the empty values. This is particularly useful when you are working with data like age, price, etc. and you encounter missing values.
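This method keeps the rows and columns you would otherwise lose, and it often yields better results than removing them. Keep in mind that filling many cells with the same value reduces the variance of that feature.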
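A minimal sketch of this approach with pandas, using made-up values: the numeric column is filled with its mean or median, and a text column is filled with its mode (the most frequent value).

```python
import numpy as np
import pandas as pd

# Made-up values for illustration.
data = pd.DataFrame({
    "Age": [20.0, np.nan, 30.0, 40.0, np.nan],
    "Embarked": ["S", "C", None, "S", "S"],
})

# mean() and median() skip missing values by default.
age_mean_filled = data["Age"].fillna(data["Age"].mean())
age_median_filled = data["Age"].fillna(data["Age"].median())

# mode() returns a Series because ties are possible; take the first entry.
embarked_filled = data["Embarked"].fillna(data["Embarked"].mode()[0])

print(age_mean_filled.tolist())
print(embarked_filled.tolist())
```

The median is usually the safer choice when a column has outliers, since a few extreme values can pull the mean away from the typical case.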
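We can also use scikit-learn's imputation tools to fill in the NaN values. An imputed value is only an estimate, but it lets us keep rows that would otherwise be dropped. In current scikit-learn versions this is done with SimpleImputer, which replaced the older Imputer class.
from sklearn.impute import SimpleImputer
imp = SimpleImputer(strategy="mean")  # missing_values defaults to np.nan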
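As a hedged usage sketch, the snippet below applies scikit-learn's SimpleImputer (the class available in current versions) to a made-up numeric data frame. Wrapping the result back into a DataFrame is a convenience chosen here, since fit_transform returns a NumPy array.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up values for illustration.
data = pd.DataFrame({
    "Age": [20.0, np.nan, 40.0],
    "Fare": [10.0, 20.0, np.nan],
})

imp = SimpleImputer(missing_values=np.nan, strategy="mean")

# fit_transform learns each column's mean and fills the gaps with it.
imputed = pd.DataFrame(
    imp.fit_transform(data), columns=data.columns, index=data.index
)
print(imputed)
```

Fitting the imputer once and reusing it means the same fill values can be applied to new data with imp.transform.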
3) Predicting missing values:
We can use the features without missing values to help us predict missing values. We can use the correlation between features with machine learning algorithms to help predict the empty values.
Either regression or classification can help us with prediction. In our data frame, we have missing values in the “Age” feature, so let’s predict them using linear regression.
For linear regression, we require data to be split into two parts:
1) Training data
2) Testing data
Let us split them:
X_train = Rows where “Age” is not null, with the “Age” column removed
y_train = The “Age” values from those same rows
X_test = Rows where “Age” is null, with the “Age” column removed
y_test = The missing “Age” values that we want to predict
from sklearn.linear_model import LinearRegression
import pandas as pd

data = pd.read_csv("train.csv")
data = data[["Survived", "Pclass", "Sex", "SibSp", "Parch", "Fare", "Age"]]
# Encode Sex as a number: 1 for male, 0 for female
data["Sex"] = [1 if x == "male" else 0 for x in data["Sex"]]

# Rows with a missing Age are the ones we want to predict
test_data = data[data["Age"].isnull()]
# Keep only the rows with a known Age for training
data = data.dropna()

y_train = data["Age"]
X_train = data.drop("Age", axis=1)
X_test = test_data.drop("Age", axis=1)

model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
This often gives better results than the previous methods because it uses the relationships between columns.
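Once the predictions exist, they still have to be written back into the data frame. One way to do that is sketched below on made-up data, where Age is constructed as exactly 2 * Fare + 10 so that the result is easy to check. The column choice and the values are assumptions for illustration, not the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up data in which every known Age equals 2 * Fare + 10.
data = pd.DataFrame({
    "Fare": [5.0, 10.0, 15.0, 20.0, 25.0, 30.0],
    "Pclass": [3, 3, 2, 1, 1, 2],
    "Age": [20.0, 30.0, np.nan, 50.0, np.nan, 70.0],
})

known = data[data["Age"].notnull()]   # rows used for training
unknown = data[data["Age"].isnull()]  # rows whose Age we predict

features = ["Fare", "Pclass"]
model = LinearRegression()
model.fit(known[features], known["Age"])

# Write the predictions back into the rows where Age was missing.
filled = data.copy()
filled.loc[unknown.index, "Age"] = model.predict(unknown[features])
print(filled)
```

Working on a copy keeps the original data frame intact, which makes it easy to compare this result with the other filling methods.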
Handling missing values is a problem that every aspiring data scientist needs to handle carefully. There are more advanced machine learning methods for handling missing data, but the methods above should suffice for most beginners. There is no hard and fast way to get rid of missing values in your dataset; what matters most is choosing the method that performs best on your data.