How to quickly perform EDA

In this blog, discover the secrets to speeding up your Exploratory Data Analysis (EDA) process and gaining better insights from your data.

Exploratory Data Analysis (EDA) is the process of understanding and analyzing data using visual and statistical techniques. The goal of EDA is to discover patterns, relationships, and insights in the data that can help with further analysis and modeling.

The EDA process often involves steps like:

  • Data Understanding - understanding the structure, range, distribution, and patterns of the data
  • Data Cleaning - handling missing values through imputation, removing duplicate data, and ensuring data quality and consistency
  • Data Transformation - changing the structure of the data for better model learning
  • Data Reduction - removing unnecessary features that do not provide helpful information
  • Data Preprocessing - encoding categorical variables and splitting the data into training and testing sets
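
To make these steps concrete, here is a minimal sketch of doing them by hand with plain pandas. The small DataFrame, its column names, and the choices made (median imputation, min-max scaling, an 80/20 split) are all assumptions for illustration, not part of any particular dataset or recipe.

```python
import pandas as pd

# Tiny made-up frame standing in for a real dataset (values are illustrative only).
raw = pd.DataFrame({
    "channel": ["A", "B", "B", "C", "D", "E"],
    "category": ["Music", "Gaming", "Gaming", None, "Music", "Education"],
    "subscribers": [120.0, 95.5, 95.5, 80.0, None, 60.2],
    "constant_flag": [1, 1, 1, 1, 1, 1],
})

# 1. Data understanding: structure, types, and summary statistics.
print(raw.shape)
print(raw.dtypes)
print(raw.describe(include="all"))

# 2. Data cleaning: drop exact duplicate rows, then fill missing values.
clean = raw.drop_duplicates().copy()
clean["subscribers"] = clean["subscribers"].fillna(clean["subscribers"].median())
clean["category"] = clean["category"].fillna("Unknown")

# 3. Data transformation: rescale a numeric column to the [0, 1] range.
subs = clean["subscribers"]
clean["subscribers_scaled"] = (subs - subs.min()) / (subs.max() - subs.min())

# 4. Data reduction: drop columns that hold a single value and so carry no information.
uninformative = [col for col in clean.columns if clean[col].nunique() <= 1]
reduced = clean.drop(columns=uninformative)

# 5. Data preprocessing: one-hot encode the categorical column, then split 80/20.
encoded = pd.get_dummies(reduced, columns=["category"])
train = encoded.sample(frac=0.8, random_state=42)
test = encoded.drop(train.index)
```

Even on a toy table this takes a fair amount of code, which is the repetition the libraries below aim to remove.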

Overall, EDA is a critical step in the ML pipeline. It is often iterative, meaning that you may have to repeat certain steps multiple times to gain a deeper understanding of the data, so it can be time-consuming. Thankfully, some capable libraries can come to the rescue.

Here are three such libraries that can do the job with minimal code:

  1.  DataPrep 
  2.  Pandas-profiling
  3.  Sweetviz

Let's look at each of these libraries with sample code.

DataPrep

  • This Python library simplifies EDA by providing a set of pre-built functions for data cleaning, transformation, and visualization. It is built on top of popular data manipulation libraries such as pandas and NumPy, and it works alongside visualization libraries such as seaborn and matplotlib.
  • With minimal code, it provides functions for handling missing data, removing duplicate rows, and converting data types.
  • It also provides a wide range of data visualization options, including histograms, scatter plots, and box plots.
  • According to the project, DataPrep EDA can be 10x faster than other pandas-based profiling tools because of its highly optimized Dask-based computing module.
  • DataPrep provides a function called create_report() that creates a detailed, interactive EDA report.
  • Here is how to use create_report() to build an EDA report:
  1. First, install the DataPrep library with pip: pip install dataprep
  2. Next, import the necessary libraries and load the dataset. For this demonstration, we use the open-source 'Most Subscribed 1000 Youtube Channels' dataset available on Kaggle. It offers insight into the current popularity of various content creators and shows which channels lead in audience engagement and growth on YouTube.
  3. Call create_report() on the loaded DataFrame to generate the detailed, interactive EDA report.
  4. The report can be opened in a new browser tab for interactive exploration. You can also save it as an HTML file or display it directly in the notebook.
  • The report conveniently shows statistics such as the approximate distinct count and approximate unique percentage for every feature, which gives a quick grasp of the data.
  • Flexibility, automation, and minimal coding are the library's strengths, but it has limitations too: its visualization options may not be as powerful as those of dedicated visualization libraries, and its documentation is limited.
  • Also, the DataPrep library is at an early stage of development and is constantly evolving. It's a good idea to check its documentation and release notes regularly.
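
The DataPrep steps above might look like the following sketch. The helper name build_dataprep_report, the report title, and the CSV file name are assumptions for illustration (the Kaggle download may use a different file name). The import of create_report sits inside the function so that the snippet loads even before dataprep is installed.

```python
# Install once from a terminal or notebook cell:  pip install dataprep
import pandas as pd


def build_dataprep_report(df: pd.DataFrame, title: str = "YouTube Channels EDA"):
    """Return a DataPrep EDA report object for the given DataFrame."""
    # Imported lazily so this snippet can be loaded before dataprep is installed.
    from dataprep.eda import create_report

    return create_report(df, title=title)


# Typical usage (the CSV file name is a placeholder for the downloaded Kaggle file):
# df = pd.read_csv("most_subscribed_youtube_channels.csv")
# report = build_dataprep_report(df)
# report.show_browser()                    # open the report in a new browser tab
# report.save("youtube_eda_report.html")   # write the report to an HTML file
# report.show()                            # or render it inside the notebook
```
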
Pandas Profiling

This library uses matplotlib for graphs and jinja2 as the template engine for its interface. With just a couple of lines of code, it generates a detailed report about a data frame, covering its variables, their types, and their values.

  • It provides a quick overview of the data, including the number of rows and columns, missing values, and basic statistics.
  • It can help identify potential issues with the data, such as outliers or duplicate values.
  • The report can come in handy for feature selection.
  • It can also identify columns with high cardinality, which is helpful for data preprocessing.
  • Here's how to use it in a notebook:
  1. Install the library with pip: pip install ydata-profiling (on older setups, pip install pandas-profiling).
  2. After importing the necessary libraries and loading the data, create a ProfileReport from the data frame. The report can be displayed in the notebook and saved to an HTML file as well.
  • The report generated by Pandas Profiling quickly gives us an idea of, for example, the distribution of categories among the most subscribed YouTube channels and the distribution of the years in which channels were started.
  • A friendly heads-up: the 'pandas-profiling' package has been renamed to 'ydata-profiling'. New projects should install and import it under the new name, and existing projects should update accordingly.
  • The main downside of this library is its computational cost, which can make profiling large datasets slow. Its minimal mode reduces the cost by skipping the more expensive analyses, such as correlations.
  • All in all, it is a useful tool for quickly getting an overview of a dataset and identifying potential issues, but it should not be relied upon as the only method for exploring and understanding data.
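
A minimal sketch of the profiling workflow follows. It assumes the current package name, ydata-profiling; older installs import ProfileReport from pandas_profiling instead. The helper name build_profile_report, the title, and the file names are illustrative assumptions.

```python
# Install once:  pip install ydata-profiling   (formerly published as pandas-profiling)
import pandas as pd


def build_profile_report(
    df: pd.DataFrame,
    title: str = "YouTube Channels Profile",
    minimal: bool = False,
):
    """Return a profiling report; minimal=True skips the more expensive analyses."""
    # Imported lazily so this snippet can be loaded before the package is installed.
    from ydata_profiling import ProfileReport

    return ProfileReport(df, title=title, minimal=minimal)


# Typical usage (the CSV file name is a placeholder for the downloaded Kaggle file):
# df = pd.read_csv("most_subscribed_youtube_channels.csv")
# profile = build_profile_report(df)
# profile.to_notebook_iframe()                      # display inside the notebook
# profile.to_file("youtube_profile_report.html")    # save as an HTML file
```

For large datasets, passing minimal=True is one way to trade detail for speed.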

Sweetviz

Sweetviz lets you quickly generate a report containing high-density visualizations along with the necessary information about variables, missing values, duplicates, and other important statistics.

Some of the key features of Sweetviz include:
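
As a rough sketch, a single-dataset Sweetviz report can be produced as shown below. The helper name build_sweetviz_report and the file names are assumptions for illustration; analyze(), show_html(), and show_notebook() are the Sweetviz calls being demonstrated.

```python
# Install once:  pip install sweetviz
from typing import Optional

import pandas as pd


def build_sweetviz_report(df: pd.DataFrame, target: Optional[str] = None):
    """Return a Sweetviz report for one DataFrame, optionally against a target column."""
    # Imported lazily so this snippet can be loaded before sweetviz is installed.
    import sweetviz as sv

    return sv.analyze(df, target_feat=target)


# Typical usage (the CSV file name is a placeholder for the downloaded Kaggle file):
# df = pd.read_csv("most_subscribed_youtube_channels.csv")
# report = build_sweetviz_report(df)
# report.show_html("sweetviz_report.html")   # write the HTML file and open it
# report.show_notebook()                     # or embed the report in the notebook
```
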

  • Ability to compare two datasets
  • Ability to report missing values and duplicate values
  • Built on top of the widely used and well-documented pandas and matplotlib libraries
  • It can save the report as a self-contained HTML file or display it inline in a notebook
  • Here is how to generate the report:
  1. Install the library with pip: pip install sweetviz
  2. After importing the necessary libraries and reading the data, generate a report that compares two data frames with Sweetviz's compare() function.
  • Sweetviz provides a convenient way to compare two datasets and understand the differences between them. This is useful for checking whether the training and testing datasets are representative of each other, or whether there are disparities that could affect modeling performance. The comparison report summarizes the main characteristics of the datasets, including the distribution of the features, the missing values, the correlations, and more.
  • This feature makes data frame comparison fast and comprehensive.
  • Additionally, Sweetviz has a compare_intra() function that compares two populations within the same dataset, which becomes really powerful when coupled with target feature analysis.
  • That concludes this short introduction to these powerful EDA libraries.
  • Overall, Pandas Profiling generates a comprehensive report of the dataset, including statistics, distributions, missing values, correlations, and more. It has more customization options but can be limited in the amount of data it can handle.
  • Sweetviz is a library for generating easy-to-read visualizations for exploring and comparing datasets. It generates a report similar to Pandas Profiling but with more visually appealing and interactive graphics.
  • DataPrep also supports a wide range of data sources, including databases and cloud services, making it a more comprehensive data preparation tool than the other two.
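
To illustrate the Sweetviz comparison workflow described earlier, here is a hedged sketch. The helpers split_frame, compare_splits, and compare_within are hypothetical names, the 80/20 split is an arbitrary choice, and the "category" column in the usage comments is assumed; compare() and compare_intra() are the Sweetviz functions being demonstrated.

```python
from typing import List, Optional, Tuple

import pandas as pd


def split_frame(
    df: pd.DataFrame, train_fraction: float = 0.8, seed: int = 42
) -> Tuple[pd.DataFrame, pd.DataFrame]:
    """Randomly split a DataFrame (with a unique index) into train and test parts."""
    train = df.sample(frac=train_fraction, random_state=seed)
    test = df.drop(train.index)
    return train, test


def compare_splits(train: pd.DataFrame, test: pd.DataFrame, target: Optional[str] = None):
    """Return a Sweetviz report comparing the training and testing sets."""
    # Imported lazily so this snippet can be loaded before sweetviz is installed.
    import sweetviz as sv

    return sv.compare([train, "Train"], [test, "Test"], target_feat=target)


def compare_within(df: pd.DataFrame, condition: pd.Series, names: List[str]):
    """Return a Sweetviz report comparing two populations inside one dataset."""
    import sweetviz as sv

    return sv.compare_intra(df, condition, names)


# Typical usage (file and column names are placeholders):
# df = pd.read_csv("most_subscribed_youtube_channels.csv")
# train, test = split_frame(df)
# compare_splits(train, test).show_html("train_vs_test.html")
# compare_within(df, df["category"] == "Music", ["Music", "Other"]).show_html("music_vs_other.html")
```
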

As you can see, each library has its own strengths and weaknesses. It's important to try out a few different libraries and see which one works best for your specific use case.

Hi folks, I am Saurabh Kedare. As a data scientist and avid badminton player, I endeavor to enhance both my work and play.
