Overlapping: an R package for Estimating Overlapping in Empirical Distributions

Overlapping can be defined as the area intersected by two or more probability density functions. The idea of overlapping was introduced in a formal way by Gini & Livada (1943) and, more recently, it has been applied in several research problems involving, for instance, data fusion (Moravec, 1988), information processing (Viola & Wells III, 1997), applied statistics (Inman & Bradley Jr, 1989), economics (Milanovic & Yitzhaki, 2001) and psychology, as a basis for Cohen’s U index (Cohen, 1988), McGraw and Wong’s CL measure (McGraw & Wong, 1992), and Huberty’s I degree of non-overlap index (Huberty & Lowman, 2000).

overlapping is an R package for estimating the overlapping area of two or more kernel density estimations from empirical data. The main idea of the package is to offer an easy way to quantify the similarity (or the difference) between two or more empirical distributions. In addition, the package can plot the density distributions, highlighting the overlapped area, by using the ggplot2 R package (Wickham, 2009).
A recent R package, overlap (Ridout & Linkie, 2009), offers an implementation of the overlapping index that can be used to analyse temporal activity patterns of animals and species in ecology. Compared with the latter, the overlapping package offers a more general approach: overlapping can be computed for any type of numerical variable, and the computation can involve more than two variables.

Examples
Suppose we have collected data in two groups of 100 subjects each on a generic variable Y, expressed by scores ranging between 0 and 30, and that we are interested in assessing whether the two groups can be considered samples from populations with the same average. We can simulate the groups' scores as follows:

    set.seed( 1 )
    n <- 100
    G1 <- sample( 0:30, size = n, replace = TRUE )
    G2 <- sample( 0:30, size = n, replace = TRUE, prob = dbinom( 0:30, 31, .55 ) )

For Group 1 (G1) we randomly sampled n = 100 values from a uniform distribution; for Group 2 (G2) we randomly sampled 100 values from a binomial distribution. In the first group, scores range between 0 and 30 with mean 15.55 and standard deviation 8.32. In the second group, scores range between 10 and 24 with mean 16.72 and standard deviation 2.74.
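The descriptive statistics quoted for the two groups can be checked with a short base-R sketch such as the one below. The helper name `describe` and the object `stats` are illustrative assumptions, not part of any package. The exact values may differ from those reported in the text depending on the R version, because the default sampling algorithm behind `sample()` changed in R 3.6.0.

```r
# Re-create the simulated scores from the example
set.seed(1)
n <- 100
G1 <- sample(0:30, size = n, replace = TRUE)
G2 <- sample(0:30, size = n, replace = TRUE, prob = dbinom(0:30, 31, .55))

# Illustrative helper (hypothetical name): range, mean and standard deviation
describe <- function(g) c(min = min(g), max = max(g), mean = mean(g), sd = sd(g))

# One column per group, one row per statistic
stats <- sapply(list(G1 = G1, G2 = G2), describe)
round(stats, 2)
```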
We can display the score distributions as follows:

    library( ggplot2 )
    Data <- data.frame( y = c( G1, G2 ), group = rep( c( "G1", "G2" ), each = n ) )
    ggplot( Data, aes( x = group, y = y ) ) + geom_boxplot() + ylab( "scores" )

obtaining Figure 1. The figure makes the heterogeneity of the variances in the two groups evident. In such a case, the statistical comparison between means can be biased and not very informative; for example, a t-test corrected for heterogeneity gives t(120.24) = −1.34, p = 0.18, from which we cannot draw any conclusion (Wilkinson & Task Force on Statistical Inference, 1999).
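The text does not show how the corrected t-test was run. One plausible way, assuming the correction meant is Welch's unequal-variance test, is base R's `t.test()` with `var.equal = FALSE` (its default). The object name `res` is an illustrative choice, and the numbers obtained may differ from those reported above depending on the R version used for the simulation.

```r
# Re-create the simulated scores from the example
set.seed(1)
n <- 100
G1 <- sample(0:30, size = n, replace = TRUE)
G2 <- sample(0:30, size = n, replace = TRUE, prob = dbinom(0:30, 31, .55))

# var.equal = FALSE requests Welch's t-test, which does not assume
# equal variances in the two groups
res <- t.test(G1, G2, var.equal = FALSE)

res$statistic  # t value
res$parameter  # Welch-Satterthwaite degrees of freedom (usually non-integer)
res$p.value    # two-sided p-value
```

Because the degrees of freedom are adjusted for unequal variances, they fall below the n1 + n2 − 2 = 198 of the classical pooled-variance test.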
So, let us take a different perspective: rather than assessing the similarity between the two groups on the basis of averages (and standard deviations) only, we use all the information available in the data. In practice, we estimate the degree of overlap between groups as the overlap between their kernel density estimates. We expect 0% to indicate the absence of overlapping (i.e., maximum distance between groups), and 100% to indicate perfect overlap between the two distributions (i.e., the groups are identically distributed). We can use the overlapping package as follows: with the command library() we load the overlapping package, next we create a list (dataList) containing the two groups' scores, and finally, by using the overlap() function, we compute the overlap index. The index value (43.22) is an estimate of the percentage of overlapping between the estimated densities.

We can obtain a graphical representation by adding the option plot = TRUE as follows:

    overlap( dataList, plot = TRUE )

obtaining Figure 2. The figure shows the estimated densities of the two groups' scores in different colors. The shaded region is the area where the densities overlap.
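To make the idea of "overlap between kernel density estimates" concrete, the following base-R sketch evaluates both densities on a common grid and integrates their pointwise minimum. It is an illustration under stated assumptions, not the package's implementation: the function name `overlap_sketch`, the grid size, and the trapezoidal integration are choices made here, so the result need not match the value returned by overlap().

```r
# Illustrative helper (hypothetical name, not part of the overlapping package)
overlap_sketch <- function(x, y, n_grid = 1024) {
  # Evaluate both kernel density estimates on the same grid,
  # spanning the pooled range of the two samples
  lo <- min(x, y)
  hi <- max(x, y)
  dx <- density(x, from = lo, to = hi, n = n_grid)
  dy <- density(y, from = lo, to = hi, n = n_grid)

  # Pointwise minimum of the two curves: the region lying under both
  h <- pmin(dx$y, dy$y)

  # Area under that minimum, approximated by the trapezoidal rule
  step <- dx$x[2] - dx$x[1]
  sum((h[-1] + h[-n_grid]) / 2) * step
}

# Re-create the simulated scores from the example
set.seed(1)
n <- 100
G1 <- sample(0:30, size = n, replace = TRUE)
G2 <- sample(0:30, size = n, replace = TRUE, prob = dbinom(0:30, 31, .55))

overlap_sketch(G1, G2) * 100  # overlap expressed as a percentage
```

Because the grid stops at the observed minimum and maximum, the kernel tails beyond that range are cut off; even two identical samples therefore give a value slightly below 100% in this sketch.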

Figure 1: Scores distribution of simulated groups of 100 subjects each.

Figure 2: Comparison between densities of two groups. The overlap (43%) is represented by the shaded area.