Have you ever come across the term “data preparation & cleaning”?

It is probably the most important part of the entire Machine Learning process. Real-world data is often messy and riddled with errors, inconsistencies and gaps, which is why it is so vital to ensure that your data is clean and prepared for analysis.

So, in the simplest terms, data preparation & cleaning is all about avoiding garbage in, garbage out. The process includes identifying and removing or fixing incorrect, corrupted, duplicate, or incomplete data within a dataset, filling in missing values, and dealing with outliers. As you can imagine, this can be time-consuming, but it is well worth it, as it ensures that your data is accurate and ready for analysis. Even the fanciest algorithms struggle to learn when the data isn't clean or isn't in the right format, so clean data is essential for a successful project.

To make you feel confident about your Machine Learning projects, we cover everything you need to know in this "Data Preparation & Cleaning for Machine Learning" mini-course. We begin with a checklist of the eight most important steps to remember each time you start a project. Then we go through some theory, where you learn about missing values, outliers, feature selection and more, before getting into the hands-on part for each topic, where we work through an application in Python.

Course Outline

ML Prep Checklist

Before getting started with the theory and hands-on training of this course, we go through our Machine Learning prep checklist, where you get to know the eight most common data preparation and cleaning steps that need to be considered in a project, such as dealing with missing values and duplicate or incorrect data, feature scaling, and the validation split.


Working with Missing Values

In this part we cover some of the high-level intuition around dealing with the missing values that can exist in our data.

Learn why we even have to worry about missing values and the different ways to deal with them. 

Learn how to make use of Pandas and its isna() method to check for missing values, and about the options you have if you find some.
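As a minimal sketch of what that check looks like in practice (the toy DataFrame below is purely hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset with a few missing entries
df = pd.DataFrame({
    "age": [25, np.nan, 31, 47],
    "city": ["Berlin", "Paris", None, "Madrid"],
})

# isna() returns a boolean mask of missing values;
# summing it gives the number of missing values per column
missing_per_column = df.isna().sum()
print(missing_per_column)
```

Once you know which columns are affected and how badly, you can decide whether to drop rows, drop columns, or impute replacement values.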

Get to know imputation - a data preprocessing and preparation technique where you impute replacement values for those that are missing. Learn how to put some powerful imputation approaches into action, such as SimpleImputer and KNNImputer, which enable static and dynamic imputation of missing values, respectively.
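To sketch the difference between the two approaches, here is a small example using scikit-learn's SimpleImputer and KNNImputer on a made-up feature matrix:

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

# Hypothetical numeric feature matrix with missing entries
X = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [5.0, np.nan],
              [7.0, 8.0]])

# Static imputation: every missing value in a column is
# replaced with the same statistic (here, the column mean)
X_simple = SimpleImputer(strategy="mean").fit_transform(X)

# Dynamic imputation: each missing value is replaced with the
# mean of its k nearest rows (k=2 here), so the fill depends
# on which rows the incomplete sample resembles
X_knn = KNNImputer(n_neighbors=2).fit_transform(X)

print(X_simple)
print(X_knn)
```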

Furthermore, we show you how to deal with categorical variables so you can ensure that your model can extract the meaningful information those variables hold. For this, you will learn one of the most common approaches: One-Hot Encoding.
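A quick sketch of one-hot encoding with Pandas (the "colour" column is a hypothetical example): each category becomes its own binary column, so the model sees numbers instead of text labels.

```python
import pandas as pd

# Hypothetical categorical feature
df = pd.DataFrame({"colour": ["red", "green", "red", "blue"]})

# One-hot encoding: one binary indicator column per category
encoded = pd.get_dummies(df, columns=["colour"])
print(encoded)
```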


Outliers & Feature Scaling

In this section you learn what outliers are and why they matter when preprocessing and preparing your data for a machine learning project. You also get to know some ways to detect and deal with them - and learn when it is OK to just leave them be and do nothing.
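One common detection technique is the interquartile-range (IQR) rule, sketched below on a hypothetical column: points lying more than 1.5 times the IQR beyond the quartiles are flagged as potential outliers.

```python
import pandas as pd

# Hypothetical column with one suspiciously extreme value
values = pd.Series([12, 14, 13, 15, 14, 13, 95])

# IQR rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]
print(outliers)
```

Whether you then remove, cap, or keep the flagged points depends on whether they are genuine measurements or data errors.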

We also discuss and run through ways to scale the values of your features or columns, an approach known as feature scaling. Here, you learn what feature scaling is and why it is such an important step, and you get to know the two most common techniques: standardization and normalization. Learn the difference between the two and the logic behind them.
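As a minimal illustration of the two techniques, using scikit-learn on a made-up feature column:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical feature column with a wide value range
X = np.array([[10.0], [20.0], [30.0], [40.0]])

# Standardization: rescale to mean 0 and standard deviation 1
standardized = StandardScaler().fit_transform(X)

# Normalization (min-max scaling): rescale into the [0, 1] range
normalized = MinMaxScaler().fit_transform(X)

print(standardized.ravel())
print(normalized.ravel())
```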


Feature Selection

Next, we discuss what exactly feature selection is and the scenarios in which you need to apply it. We also look at some simple and effective approaches for finding the best features before building and training your model: creating a quick correlation matrix between the features, using univariate feature selection (a more automated method), and applying recursive feature elimination (RFE).
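A short sketch of two of these approaches, using scikit-learn's built-in iris dataset purely as a stand-in for your own data (the choice of model and of keeping two features is arbitrary here):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Small built-in dataset, loaded as a DataFrame
data = load_iris(as_frame=True)
X, y = data.data, data.target

# Quick look: pairwise correlations between the features
corr_matrix = X.corr()
print(corr_matrix.round(2))

# Recursive feature elimination: repeatedly fit the model and
# drop the weakest feature until the requested number remains
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=2)
rfe.fit(X, y)
selected = X.columns[rfe.support_].tolist()
print(selected)
```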


ML Model Validation

Last but not least, you learn all about model validation. Learn what overfitting is and why it can be problematic for your machine learning project, and get to know some great approaches for model validation: cross validation and k-fold cross validation.
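To give a flavour of k-fold cross validation, here is a minimal sketch with scikit-learn, again using the built-in iris dataset as a placeholder for your own:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)

# 5-fold cross validation: train on four folds, validate on the
# held-out fold, and repeat so every fold is used for validation once
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)

print(scores)        # one accuracy score per fold
print(scores.mean()) # average performance estimate
```

Averaging over the folds gives a much more reliable estimate of out-of-sample performance than a single train/test split, which helps you spot overfitting.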