How to quickly perform EDA

Exploratory Data Analysis (EDA) is the process of understanding and analyzing data using visual and statistical techniques. The goal of EDA is to discover patterns, relationships, and insights in the data that can help with further analysis and modeling.

The EDA process often involves steps like:

Data Understanding - understanding the structure, range, distribution, and patterns of the data
Data Cleaning - handling or imputing missing values, dealing with duplicate data, and ensuring data quality and consistency
Data Transformation - changing the structure of the data for better model learning
Data Reduction - removing unnecessary features that do not provide helpful information
Data Preprocessing - encoding categorical variables and splitting the data into training and testing sets
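
To make these steps concrete, here is a minimal sketch of what they can look like when done by hand in pandas. The tiny data frame, its column names, and the specific choices (median imputation, min-max scaling, a 75/25 split) are our own illustrative assumptions, not a prescribed recipe.

```python
import pandas as pd

# Tiny made-up dataset, used only to illustrate the steps.
df = pd.DataFrame(
    {
        "channel": ["A", "B", "B", "C", "D"],
        "category": ["Music", "Gaming", "Gaming", None, "Music"],
        "subscribers": [120.0, 95.0, 95.0, 80.0, None],
        "constant": [1, 1, 1, 1, 1],
    }
)

# Data understanding: column types and summary statistics.
print(df.dtypes)
print(df.describe(include="all"))

# Data cleaning: drop duplicate rows, then fill in missing values.
clean = df.drop_duplicates().copy()
clean["category"] = clean["category"].fillna("Unknown")
clean["subscribers"] = clean["subscribers"].fillna(clean["subscribers"].median())

# Data transformation: rescale the numeric column to the range [0, 1].
subs = clean["subscribers"]
clean["subscribers_scaled"] = (subs - subs.min()) / (subs.max() - subs.min())

# Data reduction: drop columns that hold a single value and carry no information.
single_valued = [col for col in clean.columns if clean[col].nunique() <= 1]
clean = clean.drop(columns=single_valued)

# Data preprocessing: one-hot encode the categorical column, then split the rows.
encoded = pd.get_dummies(clean, columns=["category"])
train = encoded.sample(frac=0.75, random_state=0)
test = encoded.drop(train.index)
```

Even on a toy table this takes a fair amount of code, which is the gap the libraries covered in this article aim to close.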

Overall, EDA is a critical step in the ML pipeline. It is often an iterative process, meaning that you may have to repeat certain steps multiple times to gain a deeper understanding of the data, so it can be time-consuming. Thankfully, some capable libraries can come to the rescue.

Here are three such libraries that can do the job for you with minimal code: DataPrep, Pandas Profiling, and Sweetviz.

Let's have a look at each of these libraries with sample code.

DataPrep

This Python library simplifies the process of EDA by providing a set of pre-built functions for data cleaning, transformation, and visualization. It is built on top of popular data manipulation libraries such as pandas and NumPy, and it works seamlessly with other libraries such as seaborn and Matplotlib.
With minimal code, it provides functions for handling missing data, removing duplicate rows, and converting data types.
It also provides a wide range of data visualization options, including histograms, scatter plots, and box plots.
According to the project's documentation, DataPrep EDA can be 10X faster than other pandas-based profiling tools because of its highly optimized Dask-based computing module.
DataPrep provides a function called create_report() that can be used to create a detailed and interactive EDA report.

First, install the DataPrep library by running pip install dataprep in a terminal.

Next, import the necessary libraries and load the dataset. For this demonstration, we have selected an open-source dataset available on Kaggle, ‘Most Subscribed 1000 Youtube Channels’. This data provides insight into the current trends and popularity of various content creators and helps us understand which channels are leading in terms of audience engagement and growth on YouTube, a popular video streaming platform.

With the data loaded, a single call to DataPrep's create_report() function builds the detailed, interactive EDA report.
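
As a minimal sketch of this step, the snippet below wraps the call in a small helper. The helper name build_dataprep_report and the CSV file name in the usage comments are placeholders of ours, not part of DataPrep or of the Kaggle dataset; create_report(), show_browser(), save(), and show() are the DataPrep calls being illustrated.

```python
import pandas as pd


def build_dataprep_report(df: pd.DataFrame, title: str = "EDA Report"):
    """Return a DataPrep EDA report for the given data frame."""
    # Imported here so this module still loads where dataprep is not installed.
    from dataprep.eda import create_report

    return create_report(df, title=title)


# Typical use in a notebook (the file name is a placeholder for the Kaggle CSV):
# df = pd.read_csv("most_subscribed_youtube_channels.csv")
# report = build_dataprep_report(df, title="Most Subscribed YouTube Channels")
# report.show_browser()          # open the report in a new browser tab
# report.save("youtube_report")  # write the report to an HTML file
# report.show()                  # render the report inside the notebook
```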

Showing the report in the browser opens a new tab and displays it interactively. The user can also save the report as an HTML file or show it in the notebook itself.

The report conveniently shows the approximate distinct count, approximate unique percentage, and similar statistics for every feature, which helps in getting a quick grasp of the data.
Flexibility, automation, and minimal coding are pluses of the library, but like everything else it has its limitations: its visualization options may not be as powerful as those of dedicated visualization libraries, and its documentation is limited.
Also, the DataPrep library is in an early stage of development and is constantly evolving and improving. It's a good idea to check the documentation and release notes regularly.

Pandas Profiling

This library uses Matplotlib for graphs and Jinja2 as the template engine for its interface. With just a couple of lines of code, it generates a detailed report about the data frame, including information about the variables, their types, and their values.

It provides a quick overview of the data, including the number of rows and columns, missing values, and basic statistics.
It can help identify potential issues with the data, such as outliers or duplicate values.
The report can come in handy for feature selection.
It can also help identify columns with high cardinality, which is useful for data preprocessing.
Here's how we can do it in a notebook.

Install the library by running pip install pandas-profiling in a terminal.

Once the necessary libraries and data are imported, we can generate the report with a couple of commands and also save it as an HTML file.
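
A minimal sketch of those commands follows. The helper name build_profile_report and the file names in the usage comments are placeholders of ours. Because the package is being renamed, the sketch tries the newer ydata_profiling import first and falls back to pandas_profiling; ProfileReport, to_file(), and to_notebook_iframe() are the library calls being illustrated.

```python
import pandas as pd


def build_profile_report(df: pd.DataFrame, title: str = "Profiling Report"):
    """Return a profiling report for the given data frame."""
    # Imported here so this module still loads where the library is not installed.
    # Try the newer package name first, then the older one.
    try:
        from ydata_profiling import ProfileReport
    except ImportError:
        from pandas_profiling import ProfileReport

    return ProfileReport(df, title=title)


# Typical use in a notebook (the file names are placeholders):
# df = pd.read_csv("most_subscribed_youtube_channels.csv")
# profile = build_profile_report(df, title="Most Subscribed YouTube Channels")
# profile.to_notebook_iframe()             # show the report inside the notebook
# profile.to_file("youtube_profile.html")  # save the report as an HTML file
```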

The report generated by Pandas Profiling quickly gives us an idea of the distribution of categories among the most subscribed YouTube channels, the distribution of channel start years, and so on.
A friendly heads-up: the 'pandas-profiling' package is in a transitional stage and is being renamed to 'ydata-profiling'. We suggest that you keep an eye on the updates and make the necessary changes to your projects accordingly.
The main downside of this library is its computational cost, which can make report generation slow on large datasets.
All in all, it can be a useful tool for quickly getting an overview of a dataset and identifying potential issues, but it should not be relied upon as the only method for exploring and understanding data.

Sweetviz

Sweetviz allows you to quickly generate a report about the data that contains high-density visualizations along with all the necessary information about variables, missing values, duplicates, and other important statistics.

Some of the key features of Sweetviz include:

Ability to compare two datasets
Ability to handle missing values and duplicate values
Built on top of the widely used and well-documented pandas and Matplotlib libraries
Reports are exported as self-contained HTML files or displayed inline in a notebook
Here is how to generate the report.

We need to install the library by running pip install sweetviz in a terminal.

Once we import the necessary libraries and read the data, we can generate a report that compares two data frames using Sweetviz's compare() function.
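
Here is a minimal sketch of such a comparison. The helpers split_frame and build_comparison_report, the 80/20 split, and the file names are illustrative choices of ours; compare() and show_html() are the Sweetviz calls being illustrated.

```python
import pandas as pd


def split_frame(df: pd.DataFrame, test_fraction: float = 0.2, seed: int = 0):
    """Randomly split a data frame into (train, test) parts."""
    test = df.sample(frac=test_fraction, random_state=seed)
    train = df.drop(test.index)
    return train, test


def build_comparison_report(train: pd.DataFrame, test: pd.DataFrame):
    """Return a Sweetviz report comparing two data frames."""
    # Imported here so this module still loads where sweetviz is not installed.
    import sweetviz as sv

    # Each data frame is paired with the label shown for it in the report.
    return sv.compare([train, "Train"], [test, "Test"])


# Typical use (the file names are placeholders):
# df = pd.read_csv("most_subscribed_youtube_channels.csv")
# train, test = split_frame(df)
# report = build_comparison_report(train, test)
# report.show_html("youtube_compare.html")  # write the HTML file and open it
```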

Sweetviz provides a convenient way to compare two datasets and understand the differences between them. This can be useful for checking if the training and testing datasets are representative of each other, or if there are any disparities that could impact modeling performance. The comparison feature generates a report that summarizes the main characteristics of the datasets, including the distribution of the features, the missing values, the correlations, and more.

This feature of the Sweetviz library makes data frame comparison fast and comprehensive.
Additionally, Sweetviz has a compare_intra() function, which compares two populations within the same dataset. This becomes really powerful when coupled with target feature analysis.
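
A minimal sketch of compare_intra() is shown below. The helper name build_intra_report, its parameters, and the column names in the usage comments are hypothetical, since the real column names depend on the dataset. The call itself takes the data frame, a boolean condition that splits it into two populations, a name for each population, and optionally a target feature.

```python
import pandas as pd


def build_intra_report(df: pd.DataFrame, column: str, value,
                       names=("Match", "Rest"), target=None):
    """Return a Sweetviz report comparing rows where column == value to the rest."""
    # Imported here so this module still loads where sweetviz is not installed.
    import sweetviz as sv

    # Boolean series: True for the first population, False for the second.
    condition = df[column] == value
    return sv.compare_intra(df, condition, list(names), target_feat=target)


# Typical use (the column names and values are hypothetical):
# report = build_intra_report(df, "category", "Music",
#                             names=("Music", "Other"), target="subscribers")
# report.show_html("youtube_intra.html")
```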
That concludes this short introduction to these powerful EDA libraries.
Overall, Pandas Profiling generates a comprehensive report of the dataset, including statistics, distributions, missing values, correlations, and more. It has more customization options but can be limited in the amount of data it can handle.
Sweetviz is a library for generating easy-to-read visualizations for exploring and comparing datasets. It generates a report similar to that of Pandas Profiling but with more visually appealing and interactive graphics.
DataPrep also supports a wide range of data sources, including databases and cloud services, making it a more comprehensive data preparation tool than the other two.

As you can see, each library has its own strengths and weaknesses. It's important to try out a few different libraries and see which one works best for your specific use case.


Hi folks, I am Saurabh Kedare. As a data scientist and avid badminton player, I endeavor to enhance both my work and play.
