Data with Density-Based Clusters: A Generator for Systematic Evaluation of Clustering Algorithms

Published in Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 2024, 2024

Recommended citation: Data with Density-Based Clusters: A Generator for Systematic Evaluation of Clustering Algorithms. P Jahn, CMM Frey, A Beer, C Leibler, T Seidl - Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 2024

Abstract

Mining data containing density-based clusters is well-established and widespread but faces problems when it comes to systematic and reproducible comparison and evaluation. Although the success of clustering methods hinges on data quality and availability, reproducibly generating suitable data for this setting is not easy, leading to mostly low-dimensional toy datasets being used. To resolve this issue, we propose DENSIRED (DENSIty-based Reproducible Experimental Data), a novel data generator for data containing density-based clusters. It is highly flexible w.r.t. a large variety of properties of the data and produces reproducible datasets in a two-step approach. First, skeletons of the clusters are constructed following a random walk. In the second step, these skeletons are enriched with data samples. DENSIRED enables the systematic generation of data for a robust and reliable analysis of methods aimed toward examining data containing density-connected clusters. In extensive experiments, we analyze the impact of user-defined properties on the generated datasets and the intrinsic dimensionalities of synthesized clusters. Our code and novel benchmark datasets are publicly available at: https://github.com/PhilJahn/DENSIRED.