We wanted to process raw data to identify new patterns and grasp difficult concepts, which we could later share with our customers in pictorial or graphical form; as the saying goes, ‘a picture is worth a thousand words’.
What we were looking for was a data clean-up / data quality tool that can run sanity checks on the results, and a visualization tool that can be used to share those insights.
We explored 3 tools:
- Trifacta – A data wrangling tool
- Mode Analytics
- Tableau
Let’s talk about each tool in detail.
Trifacta – A data wrangling tool
As a data wrangling tool, Trifacta helps with cleaning, structuring, and enriching raw data into a desired format for better decision making. It works with cloud and on-premise data platforms.
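To make "cleaning, structuring and enriching" concrete, here is a minimal sketch of those three steps in plain Python. It is not Trifacta's API or recipe language; the sample data, the column names, and the `REGION_BY_COUNTRY` lookup are all hypothetical and exist only for illustration.

```python
import csv
import io

# Hypothetical raw export: inconsistent casing, stray whitespace, a missing amount.
RAW_CSV = """customer, country ,amount
 Alice ,in,120.50
BOB,US,
carol,us,75
"""

# Hypothetical lookup table used for the "enriching" step.
REGION_BY_COUNTRY = {"IN": "APAC", "US": "AMER"}


def wrangle(raw_text):
    """Clean, structure and enrich raw CSV text; return (good_rows, rejected_rows)."""
    reader = csv.reader(io.StringIO(raw_text))
    header = [name.strip() for name in next(reader)]  # cleaning: trim header names
    good, rejected = [], []
    for values in reader:
        if not values:
            continue  # skip blank lines
        row = dict(zip(header, (v.strip() for v in values)))
        # Cleaning: normalise casing.
        row["customer"] = row["customer"].title()
        row["country"] = row["country"].upper()
        # Structuring: convert amount to a number; set aside rows where it is missing.
        if not row["amount"]:
            rejected.append(row)
            continue
        row["amount"] = float(row["amount"])
        # Enriching: add a derived column from the lookup table.
        row["region"] = REGION_BY_COUNTRY.get(row["country"], "UNKNOWN")
        good.append(row)
    return good, rejected


if __name__ == "__main__":
    good_rows, rejected_rows = wrangle(RAW_CSV)
    for r in good_rows:
        print(r)
    print("rejected rows:", len(rejected_rows))
```

A wrangling tool lets you build the same kind of steps visually instead of in code, which is the main reason we looked at one.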
Proof of Concept
We deployed Trifacta on our cloud platform and built several data flows to check Trifacta’s functionality, performance, benefits, and limitations. Below are a few observations on Trifacta:
- Trifacta is a wrangling tool that is good for filtering and transformation
- Once a flow is built, it can be scheduled to rerun at a specific time
- A Trifacta flow can be parameterized so that we don’t need to make changes for every run
- A new feature for data quality checks has been released, which gives Trifacta an extra advantage for data preparation. However, we have not tried it, since the new version launched in September 2020 and our POC was done before that
- The full data can’t be seen in Trifacta, as it loads only a 10 MB sample of the data, which is very small. This limits the EDA capability
- One has to run the job to see the effect of the filters/transformations on the full data
- Complex joins are not yet available, which makes flows unnecessarily complex and lengthy
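Two of the observations above, parameterized flows and the small sample, can be illustrated with a short Python sketch. This is not how Trifacta works internally: `make_rows`, `head_sample`, and `run_flow` are hypothetical helpers, and we assume a first-rows preview with a row cap standing in for the 10 MB size cap.

```python
def make_rows(n):
    """Build hypothetical order rows; rare REFUND rows appear only in the last 10%."""
    rows = []
    for i in range(n):
        if i >= n * 9 // 10 and i % 50 == 0:
            status = "REFUND"
        else:
            status = "PAID" if i % 2 == 0 else "PENDING"
        rows.append({"order_id": i, "status": status, "amount": (i * 37) % 500 + 1})
    return rows


def head_sample(rows, max_rows):
    """Mimic a capped preview by keeping only the first max_rows rows (an assumption)."""
    return rows[:max_rows]


def run_flow(rows, min_amount, statuses):
    """A parameterized 'flow': the same filter and aggregation, driven by parameters."""
    kept = [r for r in rows if r["amount"] >= min_amount and r["status"] in statuses]
    totals = {}
    for r in kept:
        totals[r["status"]] = totals.get(r["status"], 0) + r["amount"]
    return totals


if __name__ == "__main__":
    full = make_rows(10_000)
    sample = head_sample(full, 500)

    # The preview never shows the REFUND status, so EDA on the sample misses it.
    print("statuses in sample:", sorted({r["status"] for r in sample}))
    print("statuses in full data:", sorted({r["status"] for r in full}))

    # Same flow, different parameter values per run; no change to the steps themselves.
    print("sample run:", run_flow(sample, min_amount=0, statuses={"REFUND"}))
    print("full run:", run_flow(full, min_amount=0, statuses={"REFUND"}))
```

The filter on the sample returns nothing, while the same flow on the full data finds the refund rows, which is why we had to run the job to trust a transformation.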
Mode Analytics
Mode is an analytics platform designed to analyze, visualize, and share data. We can create charts and dashboards, and then share the reports/dashboards using its embed functionality.
Figure: Mode’s own query editor
Figure: Mode chart builder
Figure: Mode sample dashboard
Proof of Concept
We connected our database to Mode Analytics and created a few dashboards and reports. We also tested the embed functionality to get a feel for the shared dashboard experience. Below are a few observations on Mode Analytics:
- A dedicated SQL editor and a Python notebook are available, so we can use SQL as well as Python to query the data
- Good query performance
- Easy embed functionality
- Various user-level and data-level permissions are available
- Overall, it is simple and easy to use, and has good rendering capabilities
- Filter interconnectivity and sorting are not possible. Ideally, filters should be configurable so that a filter applied to one chart/graph is reflected in all the other charts on the dashboard
- Fewer visualization options
- Limited customization available
- Processing large data is a bit slow
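The filter interconnectivity we were looking for can be sketched as one shared filter value that feeds every chart's query. The example below uses Python's built-in `sqlite3` module with a made-up `sales` table; it is not Mode's API, and `CHART_QUERIES` and `render_dashboard` are hypothetical names used only to show the idea.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a small, hypothetical sales table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, product TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?, ?)",
        [
            ("APAC", "widget", 100.0),
            ("APAC", "gadget", 50.0),
            ("AMER", "widget", 70.0),
            ("AMER", "gadget", 30.0),
        ],
    )
    return conn


# Each "chart" is a query that reads the same :region filter parameter.
# A NULL filter means "no filter", so every chart shows all regions.
CHART_QUERIES = {
    "revenue_by_product": (
        "SELECT product, SUM(amount) FROM sales "
        "WHERE (:region IS NULL OR region = :region) "
        "GROUP BY product ORDER BY product"
    ),
    "order_count": (
        "SELECT COUNT(*) FROM sales "
        "WHERE (:region IS NULL OR region = :region)"
    ),
}


def render_dashboard(conn, region=None):
    """Run every chart query with the same filter value and return the results."""
    params = {"region": region}
    return {
        name: conn.execute(sql, params).fetchall()
        for name, sql in CHART_QUERIES.items()
    }


if __name__ == "__main__":
    db = build_demo_db()
    print(render_dashboard(db))                 # no filter: all regions
    print(render_dashboard(db, region="APAC"))  # one filter change updates both charts
```

Because both charts read the same parameter, changing the region once changes every chart, which is the behaviour we could not configure in our Mode dashboards.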
Tableau
Tableau is a full-package data analysis tool that covers everything from data cleaning to data visualization and data sharing. It provides a variety of products, such as Tableau Prep, Tableau Online, Tableau Server, and Tableau Desktop, to cover all data-related goals.
Figure: Tableau chart builder
Figure: Tableau sample dashboard
Proof of Concept
We started with Tableau Online, a visualization tool that is fully hosted on Tableau’s cloud. We built several widgets and a dashboard on it. Later we ran into some performance bottlenecks and shifted to Tableau Server, which is hosted on a public cloud where we can choose the instance type depending on our use. It can also be scaled horizontally and vertically to gain performance.
We also tested the Tableau Prep tool, which is used to clean, filter, and transform data by building a flow, just like Trifacta. The output of this flow can be used directly as a data source for building reports. Below is our experience with Tableau:
- Query performance is fast even with large data. It loads 40 million+ records in under 5 seconds, approximately
- It provides a wide range of data source connectors
- Filter connectivity, sortability, and flexibility are available
- Lots of analytical features are available
- Almost all kinds of customization and visualization are available
- White-label embed functionality is available per dashboard as well as at the individual widget level
- In Tableau Prep, the sample data size limit is quite large, which enhances EDA capability
- A data sample limit of 1 million rows is applied to the Aggregate and Union step types, and a data sample limit of 3 million rows is applied to the Join and Pivot step types
- We have not seen any big drawback on the visualization side so far, but Tableau is a bit more complex than Mode Analytics, perhaps because it provides a vast number of features
- Overall ease of use is slightly lower compared with Mode Analytics
- In data preparation, Tableau Prep does not have data quality check features
- Tableau Prep doesn’t have an Amazon S3 connector, so if you want to access S3 objects, you need to go through Amazon Athena. This seems a considerable drawback for Tableau Prep
Comparison
- Trifacta is a pure data wrangling tool with data quality check features (released in the new version), but it doesn’t have built-in visualization capabilities (dashboards). It also loads only a 10 MB sample of data at a time, which is very small and limits EDA, whereas Tableau Prep loads samples of 1 to 3 million rows at a time.
- The table below shows a feature-wise comparison of Mode Analytics and Tableau:
|Feature||Mode Analytics||Tableau|
|Aesthetics & Polish||Great to look at. Not a lot of customization available for color codes or chart types.||A number of options to play around with for the visualizations and the layout of the dashboard/widget.|
|Query Performance in Embeds||Good and easy to query. Dedicated SQL editor available.||Queries are faster than in Mode.|
|Render/Other Performance in Embeds||Rendering is good. Widget-level embed is not available, but it can be achieved by building separate reports, which leads to duplication.||Rendering is good, but not as good as Mode. Widget-level sharing/embed is available.|
|Filter sorting ability and Flexibility||Not available||Available|
|Ease of Use Overall||Slightly better than Tableau||More steps are needed to achieve the same result, because more functionality is available.|
|Ease of Use – Embeds||Same||Same|
|Dynamic Filters in Embeds||Not available||Available|
|Filters & Parameters (widget level and report level)||Filters are not interdependent.||All kinds of customization available.|
Conclusion
- If you are only looking for a data quality/data wrangling tool, Trifacta is a good option to consider. If you are looking for a complete package, from data clean-up to good visualization, go for Tableau.
- Tableau provides almost all the features that we were looking for and is a one-stop solution for our use case, from data clean-up through to sharing reports with the end user.