This special issue focuses on fundamental and practical issues in data clustering [16]. Data clustering aims at organizing a set of records into groups so that the similarity between records within the same group is maximized while the similarity with records in other groups is minimized. Data clustering remains an active research problem with an increasing number of applications. For decades, data clustering problems have been identified in many applications and domains such as computer vision and pattern recognition (e.g., video and image analysis for information retrieval, object recognition, image segmentation, and point clustering), networks (e.g., identification of web communities), databases and computing (e.g., privacy protection in databases), and statistical physics and mechanics (e.g., understanding phase transitions, vibration control, and fracture identification using acoustic emission data). In addition, several definitions and validation measures [3, 7] of the data clustering problem have been used in different engineering applications. For instance, the goal of the classical clustering problem is to find the clusters that optimize a predefined criterion, while the goal of the microaggregation problem [8] is to determine the clusters under the constraint of a given minimum cluster size for masking microdata.
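
The difference between optimizing a criterion and satisfying a size constraint can be illustrated with a minimal Python sketch. The criterion used here (within-cluster sum of squared Euclidean distances) and the function names `within_cluster_sse` and `satisfies_min_size` are assumptions made for illustration only; the text above does not prescribe a particular criterion.

```python
from collections import defaultdict


def within_cluster_sse(points, labels):
    """Sum of squared Euclidean distances from each point to its cluster centroid.

    Illustrative criterion only; other criteria could be used instead.
    """
    clusters = defaultdict(list)
    for point, label in zip(points, labels):
        clusters[label].append(point)
    total = 0.0
    for members in clusters.values():
        dim = len(members[0])
        centroid = [sum(m[d] for m in members) / len(members) for d in range(dim)]
        total += sum((m[d] - centroid[d]) ** 2 for m in members for d in range(dim))
    return total


def satisfies_min_size(labels, min_size):
    """Microaggregation-style constraint: every cluster has at least min_size records."""
    counts = defaultdict(int)
    for label in labels:
        counts[label] += 1
    return all(count >= min_size for count in counts.values())


# Toy data (hypothetical): two tight pairs and one isolated record.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0), (20.0, 1.0)]
unconstrained = [0, 0, 1, 1, 2]  # low criterion value, but one singleton cluster
constrained = [0, 0, 1, 1, 1]    # every cluster holds at least two records

for name, labels in (("unconstrained", unconstrained), ("constrained", constrained)):
    print(name, within_cluster_sse(points, labels), satisfies_min_size(labels, 2))
```

On this toy data the unconstrained partition has the lower criterion value but violates a minimum cluster size of two, whereas the constrained partition satisfies the size requirement at the cost of a higher criterion value.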

In this special issue, the selected papers focus on the theory and applications of data clustering. They propose new methods that have been successfully applied to several clustering problems, including image segmentation [9, 10], time series clustering [4], graph clustering (community detection) [11, 12], and (stock) recommendation systems [13, 14]. Image segmentation is a key step in many image analysis and interpretation tasks. Finding semantic regions is the ultimate goal of segmentation for image understanding. It has become a necessity for many applications, such as content-based image retrieval and object recognition. The goal of time series clustering is to partition time series into clusters based on similarity or distance criteria, so that time series in the same cluster are similar to each other and dissimilar to the time series in other clusters. Concerning the community detection problem, networks are usually composed of subgroups whose internal connections are dense and whose interconnections are sparse; this is called community structure. Detecting the community structure of a network is a fundamental problem in complex networks and has many variations. Community detection is often an NP-hard problem, and traditional methods for detecting communities in networks can be grouped into two categories: graph partitioning and hierarchical clustering. A recommender system tries to predict the behavior of a complex system by producing a list of recommendations. In stock recommendation, which has become a hot topic, most methods try to integrate multiple technologies, such as data mining, machine learning, herd psychology, and other nontraditional technologies.

Over the last decades, thousands of clustering algorithms have been published [1]. Clustering methods can be classified into five major categories [2]: partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods. A partitioning method constructs (crisp or fuzzy) partitions of the data, where each partition represents a cluster. The partition is called crisp if each object belongs to exactly one cluster, or fuzzy if an object is allowed to belong to more than one cluster at the same time. Hierarchical clustering algorithms recursively find nested clusters either in agglomerative (bottom-up) mode or in divisive (top-down) mode. Agglomerative algorithms start with each point as a separate cluster and successively merge the most similar pair of clusters. In contrast, divisive algorithms start with all the data points in one cluster and recursively divide each cluster into smaller clusters. In both cases, a hierarchical structure (e.g., a dendrogram) is produced that represents the merging or dividing steps of the method. Density-based methods continue growing a cluster as long as its density (the number of data objects in the “neighborhood”) exceeds a threshold. Grid-based methods quantize the object space into a finite number of cells that form a grid structure. They then compute statistical attributes over the data objects located in each cell, and clustering is performed on the grid rather than on the data objects themselves. Model-based methods assume a model for each of the clusters and attempt to best fit the data to the assumed model.
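
The agglomerative (bottom-up) mode described above can be sketched in a few lines of Python. This is a naive illustration, not an implementation taken from any of the cited works: it assumes Euclidean distance and single linkage (the distance between two clusters is the distance between their closest members), and the function name `single_linkage` and the toy data are hypothetical.

```python
import math


def single_linkage(points, num_clusters):
    """Naive agglomerative clustering with single linkage (assumed for illustration).

    Starts with every point as its own cluster and repeatedly merges the two
    closest clusters until num_clusters remain. Returns the final clusters
    (as sorted lists of point indices) and the list of merge steps, which
    together describe the dendrogram.
    """
    if num_clusters < 1:
        raise ValueError("num_clusters must be at least 1")
    clusters = [[i] for i in range(len(points))]
    merges = []

    def linkage(a, b):
        # Single linkage: distance between the closest pair of members.
        return min(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > num_clusters:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = linkage(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        d, x, y = best
        merges.append((sorted(clusters[x]), sorted(clusters[y]), d))
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return [sorted(c) for c in clusters], merges


# Toy data (hypothetical): two tight pairs and one distant point.
points = [(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)]
clusters, merges = single_linkage(points, 3)
print(clusters)
for left, right, distance in merges:
    print(left, "+", right, "at distance", round(distance, 3))
```

The recorded merge steps correspond to the levels of the dendrogram. A divisive method would produce a similar hierarchy in the opposite direction, starting from a single cluster and splitting it.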

The definition of a metric that can be used to validate clusters of different densities and/or sizes is an open problem. In the literature, several clustering validity measures have been proposed to measure the quality of clustering [3, 7, 15]. In addition, clustering validity measures make it possible to compare the performance of clustering algorithms and to improve their results by searching for a local optimum of the chosen measure.
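
As an illustration of how a validity measure can be used to compare clusterings, the following Python sketch computes the mean silhouette coefficient of a labeling. The silhouette is chosen here only as a familiar example and is not named in the text above; the function name `silhouette_score`, the use of Euclidean distance, and the toy data are assumptions.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient of a crisp clustering (Euclidean distance assumed).

    For each point, a is the mean distance to the other members of its own
    cluster and b is the smallest mean distance to the members of any other
    cluster. The point's score is (b - a) / max(a, b). Points in singleton
    clusters receive a score of 0 by convention.
    """
    clusters = {}
    for index, label in enumerate(labels):
        clusters.setdefault(label, []).append(index)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")
    scores = []
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)
            continue
        a = sum(math.dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        b = min(
            sum(math.dist(points[i], points[j]) for j in other) / len(other)
            for other_label, other in clusters.items()
            if other_label != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Toy data (hypothetical): two well-separated pairs of points.
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
natural = [0, 0, 1, 1]  # groups nearby points together
mixed = [0, 1, 0, 1]    # groups distant points together

print(silhouette_score(points, natural))
print(silhouette_score(points, mixed))
```

A higher value indicates more compact and better separated clusters, so the labeling that groups nearby points scores higher than the mixed one. A search procedure could use such a score as its objective, with the caveat stated above that no single measure is known to handle clusters of different densities and sizes well.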

The papers published in this special issue are novel and present interesting methods and applications in data clustering. We believe that they will motivate further research in the field of data clustering.

Acknowledgments

The guest editors wish to express their sincere gratitude to the authors and reviewers who contributed greatly to the success of this special issue. We would also like to thank the editorial board members of this journal for their support and help throughout the preparation of this special issue.

Costas Panagiotakis
Emmanuel Ramasso
Paraskevi Fragopoulou
Daniel Aloise