SmartEDA: An R Package for Automated Exploratory Data Analysis

This paper introduces SmartEDA, an R package for performing Exploratory Data Analysis (EDA). EDA is generally the first step that one needs to perform before developing any machine learning or statistical model. The goal of EDA is to support an initial investigation of the data via descriptive statistics and visualizations; in other words, its objective is to summarize and explore the data. The need for EDA was one of the factors that led to the development of various statistical computing packages over the years, including the R programming language, which is very popular and currently the most widely used software for statistical computing. However, EDA is a tedious task that requires manual effort, and some of the open source packages available in R are not quite up to the mark. In this paper, we propose a new open source package for R, SmartEDA, to address the need for automation of exploratory data analysis. We discuss the various features of SmartEDA and illustrate some of its applications for generating actionable insights using a couple of real-world datasets. We also perform a comparative study of SmartEDA with respect to other packages available for exploratory data analysis in the Comprehensive R Archive Network (CRAN).


Introduction: Exploratory Data Analysis
arXiv:1903.04754v1 [stat.CO] 12 Mar 2019
Nowadays, we see applications of Data Science almost everywhere. Some of the most highlighted aspects of data science are the various statistical and machine learning techniques applied for solving a problem. However, any data science activity starts with an Exploratory Data Analysis (EDA). The term "Exploratory Data Analysis" was coined by Tukey (1977). EDA can be defined as the art and science of performing an initial investigation of the data by means of statistical and visualization techniques that bring out the important aspects of the data that can be used for further analysis (Tukey 1977). EDA puts an emphasis on hypothesis generation and pattern recognition from the raw data (Liu 2014). Many studies on EDA have been reported in the Statistics literature (please see Section 2 for more details).
EDA is a very important component of the Data Mining process as per the industry standard CRISP-DM framework. CRISP-DM stands for "CRoss Industry Standard Process for Data Mining" (Wirth 2000). Data mining is a creative process that requires a different set of skills and knowledge. However, earlier there was no standard framework for data mining projects, which meant that the success or failure of a data mining project was highly dependent on the skill set of the particular individual or team executing the project. To address this need, Wirth (2000) proposed the CRISP-DM process model, a framework for executing any data mining project. The CRISP-DM framework is independent of the tools used and the industry sector. Figure 1 shows the different components (viz. Business understanding, Data understanding, Data preparation, Modeling, Evaluation and Deployment) of the CRISP-DM process model for data mining. It is a cyclic process in which there are feedback loops between some of the components.
We can see that "Data Understanding" is a very important component which affects "Business Understanding" as well. EDA helps in Data understanding and thus directly impacts the quality and success of a data mining project.
EDA can be categorized into descriptive statistical techniques and graphical techniques (Jaggi 2013). The first category encompasses various univariate and multivariate statistical techniques, whereas the second category comprises the various visualization techniques. Both kinds of techniques are used to explore the data, understand the patterns in the data, understand the existing relationships between the variables and, most importantly, generate data-driven insights that can be used by the business stakeholders. However, EDA requires a lot of manual effort and also a substantial amount of coding effort in statistical computing packages such as R (Venables and Ripley 2002). There is a huge need for automation of the EDA process, and this motivated us to develop the SmartEDA package and write this paper.
The contribution of this paper is the development of a novel R package, SmartEDA, that addresses the need for automating the EDA process. The main benefits of SmartEDA are savings in development time, a lower error rate and reproducibility. Although there are other packages available in the Comprehensive R Archive Network (CRAN) for EDA (such as Hmisc, DataExplorer and more), SmartEDA has additional functionalities when compared to them, including an extension to the data.table package, the capability to compute summary statistics, the ability to plot both numerical and categorical variables, and many more (please see Section 5 for more details).
The rest of this paper is structured as follows. In Section 2, we give a brief review of the literature. Section 3 gives an overview of the SmartEDA package available in CRAN and its various functionalities. In Section 4, we apply SmartEDA to generate actionable insights for a couple of real world datasets. We then follow it up with Section 5 where we compare SmartEDA with some of the other packages for EDA available in CRAN. Finally, Section 6 concludes this paper.

Related Work
Some of the earliest work on Exploratory Data Analysis (EDA), including coining the term and defining some of the basic EDA techniques, was done by Tukey (1977). However, many researchers have formulated different definitions of EDA over the years. One of the widely accepted definitions is that "Exploratory data analysis isolates patterns and features of the data and reveals these forcefully to the analyst" (Hoaglin, Mosteller, and Tukey 1983). Velleman and Hoaglin (1981) formulated the four basic elements of EDA, namely (a) residual, (b) re-expression, (c) resistant and (d) display. Behrens and Yu (2003) built on this framework, updated the four elements with relevant techniques and renamed "display" as "revelation". Behrens (1997) contrasted Exploratory Data Analysis (EDA) with Confirmatory Data Analysis (CDA) and proposed that EDA complements CDA. Gelman (2004) proposed a unified approach to confirmatory data analysis and exploratory data analysis using graphical data displays.
Chon Ho (2010) introduced EDA in the context of data mining and resampling with a focus on pattern recognition, cluster detection and variable selection. Over the years, EDA has been used in various applications across different domains such as geoscience research (Ma et al. 2017), auditing (Liu 2014), game-based assessments (DiCerbo et al. 2015), clinical study groups (Konopka et al. 2018) and more.

An overview of the SmartEDA package
The SmartEDA R package is publicly available in the Comprehensive R Archive Network (CRAN) (Ubrangala, Rama, and Kondapalli 2018). It has more than 3,100 downloads as of March 2019, which indicates its acceptability and maturity in the Statistics and Machine learning community.
The SmartEDA package automatically selects the variables and computes the related descriptive statistics. Moreover, it also analyzes information value and weight of evidence, produces custom tables and summary statistics, and applies graphical techniques for both numeric and categorical variables.
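To make the notions of weight of evidence (WoE) and information value (IV) concrete, the following is a minimal base R sketch using the textbook definitions for a single categorical predictor and a binary target. It is an illustration only and not SmartEDA's own implementation: the helper name woe_iv, the toy data and the sign convention (events over non-events) are assumptions made for this example.

```r
# Textbook WoE and IV for one categorical predictor and a binary (0/1) target.
# woe_iv is a hypothetical helper written for this illustration only.
woe_iv <- function(x, y) {
  stopifnot(length(x) == length(y), all(y %in% c(0, 1)))
  tab <- table(x, y)                               # counts per level and class
  pct_event     <- tab[, "1"] / sum(tab[, "1"])    # share of events per level
  pct_non_event <- tab[, "0"] / sum(tab[, "0"])    # share of non-events per level
  # Levels with zero events or zero non-events would give infinite WoE;
  # this sketch assumes every level contains both classes.
  woe <- log(pct_event / pct_non_event)
  iv  <- sum((pct_event - pct_non_event) * woe)
  list(woe = woe, iv = iv)
}

# Toy data: level "A" is mostly events, level "B" is mostly non-events.
x <- c(rep("A", 4), rep("B", 4))
y <- c(1, 1, 1, 0, 1, 0, 0, 0)

res <- woe_iv(x, y)
print(res$woe)
print(res$iv)
```

Levels with a positive WoE are over-represented among events, and the IV aggregates these differences into a single measure of how well the predictor separates the two classes.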
Some of the most important advantages of the SmartEDA package are that it supports the end-to-end EDA process without the user having to remember different R package names or write lengthy R scripts, that it requires no manual effort to prepare the EDA report and, finally, that it automatically categorizes the variables into the right data type (viz. Character, Numeric, Factor and more) based on the input data. Thus, the main benefits of SmartEDA are savings in development time, a lower error rate and reproducibility.
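The idea of automatically categorizing variables can be illustrated with a short base R sketch. The rule used here (a numeric column with more than nlim distinct values is treated as numeric, everything else as categorical) is an assumption chosen for illustration, and classify_columns is a hypothetical helper, not a SmartEDA function.

```r
# Classify each column of a data frame as "numeric" or "categorical".
# Assumed rule: numeric columns with few distinct values behave like categories.
classify_columns <- function(df, nlim = 10) {
  vapply(df, function(col) {
    if (is.numeric(col) && length(unique(col)) > nlim) "numeric" else "categorical"
  }, character(1))
}

# mtcars ships with base R; all of its columns are stored as numbers.
types <- classify_columns(mtcars, nlim = 10)
print(types)

# Type-appropriate summaries: quantiles for numeric columns,
# frequency tables for categorical ones.
print(summary(mtcars[types == "numeric"]))
print(lapply(mtcars[types == "categorical"], table))
```

With this rule, a column such as cyl (three distinct values) is summarized by a frequency table even though it is stored as a number, which is the kind of decision an automated EDA tool has to make on the user's behalf.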
Moreover, the SmartEDA package has customized options for the data.table package such as (1) Generate appropriate summary statistics depending on the data type, (2) Data reshaping using data.
Figure 2 summarizes the various functionalities of SmartEDA. The SmartEDA R package has four unique functionalities as described in Table 1. To know more about the specific commands/functions used to execute the above-mentioned functionalities of SmartEDA, please refer to the detailed package documentation available in CRAN (Ubrangala et al. 2018).

Illustrations
In this section, we illustrate the various functionalities of SmartEDA to generate actionable insights for a couple of publicly available datasets, namely the Carseats and NYC flights data. Note that we have used version 3 of the SmartEDA package for all the illustrations in this paper.

EDA for Sales of Child Car seats at different locations
We apply SmartEDA to generate insights on the sales of child car seats at different locations. We will use the "Carseats" data available in the ISLR package (James, Witten, Hastie, and Tibshirani 2017), which contains 11 variables: unit sales at each location (Sales), price charged by competitors (CompPrice), community income level (Income), population size in region (Population), advertising budget (Advertising), price the company charges for car seats at each site (Price), quality of shelving location (ShelveLoc), average age of local population (Age), education level at each location (Education), urban/rural location indicator (Urban) and US store/non-US store indicator (US).
We will now use SmartEDA for understanding the dimensions of the dataset, variable names, overall missing summary and data types of each variable.
R> library("SmartEDA")
R> library("ISLR")
R> Carseats <- ISLR::Carseats
We will now check the summary of the categorical variables, namely ShelveLoc, Urban and US.
Stsize=c(10,15,20), Nvar=c("Price","Income","Advertising","Population","Age","Education"))
When some of the above functions, such as ExpNumViz, ExpCatViz and ExpOutQQ, are executed without the "sample" argument, they produce all the possible plots for the various combinations of the relevant variables. For example, ExpNumViz() and ExpCatViz() will generate the required plots for all the possible combinations of numerical variables and categorical variables respectively. Similarly, ExpOutQQ() will generate the normality plot for all the variables in the Carseats dataset. Here, the argument sample = 1 indicates that the function should display one plot only. Moreover, the ExpParcoord() function draws the parallel coordinate plot for the specified variables such as Price, Income and more. The "Stsize" argument represents the stratified sample size for each class of the group variable, i.e. the shelving location ("ShelveLoc"). The parallel coordinate plot is generally used to detect outliers.
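The calls discussed in this subsection can be sketched as follows. This is a hedged reconstruction rather than the exact code used for the illustrations: it assumes the SmartEDA and ISLR packages are installed from CRAN, the argument values other than Stsize, Nvar and sample = 1 are illustrative choices, and the grouping argument of the plotting functions is left at its default because its name may differ between package versions.

```r
# Assumes SmartEDA and ISLR are installed from CRAN.
library("SmartEDA")
library("ISLR")

Carseats <- ISLR::Carseats

# Overview of the data (type = 1) and structure of each variable (type = 2).
print(ExpData(data = Carseats, type = 1))
print(ExpData(data = Carseats, type = 2))

# Frequency tables for the categorical variables (ShelveLoc, Urban, US).
# clim and nlim values are illustrative, not taken from the paper.
print(ExpCTable(Carseats, Target = NULL, clim = 10, nlim = 5))

# sample = 1 asks each function for a single plot instead of all combinations.
num_plots <- ExpNumViz(Carseats, nlim = 10, sample = 1)
cat_plots <- ExpCatViz(Carseats, clim = 10, sample = 1)
qq_plots  <- ExpOutQQ(Carseats, nlim = 10, sample = 1)

# Parallel coordinate plot with a stratified sample of 10, 15 and 20 rows
# from the classes of the grouping variable ShelveLoc.
ExpParcoord(Carseats, Group = "ShelveLoc", Stsize = c(10, 15, 20),
            Nvar = c("Price", "Income", "Advertising",
                     "Population", "Age", "Education"))
```

Dropping the sample argument from the three plotting calls would return every available plot, which is convenient for a full report but can be slow on wide datasets.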

EDA for NYC Flights Departure Data
We apply SmartEDA to generate insights on the airline on-time data for all flights departing NYC in 2013. We will use the various datasets namely, flights, airlines, planes and airports available in the nycflights13 package (Wickham 2018) and combine these datasets into a single
We will now use SmartEDA for understanding the dimensions of the dataset, variable names, overall missing summary and data types of each variable.
R> library("SmartEDA")
R> ExpData(data=flight_data,type=1)
The following code chunk shows the functions in SmartEDA that are applied on the NYC flights data to get a summary of all the categorical and character variables. We can also check the degree of association between the target variable, which in this case is taken as "dst_dest" (the daylight saving time zone of the destination), and other variables such as origin, type, timezone and more. We can see in Table 4 that the degree of association between the target and the timezone variables is very high, which is expected.
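One common way to quantify the degree of association between two categorical variables is Cramér's V, which rescales the chi-squared statistic to the interval from 0 to 1. The base R sketch below assumes that this is the kind of measure meant here. The helper cramers_v is hypothetical and the toy vectors are made-up values, not rows from the nycflights13 data.

```r
# Cramér's V for two categorical vectors: sqrt(chi2 / (n * (min(r, c) - 1))).
# cramers_v is a hypothetical helper written for this illustration only.
cramers_v <- function(x, y) {
  tab  <- table(x, y)
  # Small toy tables trigger an approximation warning that is irrelevant here.
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE))$statistic
  n    <- sum(tab)
  k    <- min(nrow(tab), ncol(tab)) - 1
  sqrt(unname(chi2) / (n * k))
}

# Made-up values: the time zone fully determines the daylight saving code,
# while the origin airport is unrelated to it.
tzone  <- c("East", "East", "East", "West", "West", "West")
dst    <- c("A", "A", "A", "N", "N", "N")
origin <- c("JFK", "LGA", "EWR", "JFK", "LGA", "EWR")

v_strong <- cramers_v(tzone, dst)
v_none   <- cramers_v(origin, dst)
print(c(strong = v_strong, none = v_none))
```

A value close to 1, as for the time zone in this toy example, signals that one variable almost determines the other, which matches the expectation stated above for the timezone variables.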

Conclusion
The contribution of this paper is the development of a new R package, SmartEDA, for automated Exploratory Data Analysis. The SmartEDA package helps in implementing the complete Exploratory Data Analysis just by running a function instead of writing lengthy R code. The users of SmartEDA can automate the entire EDA process on any dataset with easy-to-use functions and export EDA reports that follow industry and academic best practices. SmartEDA can provide summary statistics along with graphical plots for both numerical and categorical variables. It also provides an extension to the data.table package, which none of the other packages available in CRAN provides. Overall, the main benefits of SmartEDA are savings in development time, a lower error rate and reproducibility. As of March 2019, the SmartEDA package has more than 3,100 downloads, which indicates its acceptability and maturity in the Statistics and Machine learning community.
Some of the current limitations of SmartEDA are that it cannot perform variable transformation, feature engineering or dynamic visualization. Another area for further development is functions to generate automated Shiny dashboards with standard templates for basic EDA presentation. We are working on adding these functionalities in future releases of SmartEDA.