- Internal Quality Measures for Subspace Clusterings

Subspace Clustering

### Overview and Background

Clustering is a well-known unsupervised data mining technique for finding groups of similar data objects. Different data distributions and cluster shapes call for different algorithms (e.g., k-means for roughly spherical clusters, or DBSCAN for arbitrarily shaped clusters). Applying clustering to high-dimensional data (data objects with a large number of attributes) is hampered by the so-called curse of dimensionality. Put simply, as the number of dimensions grows, the distances between objects become increasingly similar, so the similarity measures (e.g., Euclidean distance) that form the foundation of any distance-based clustering algorithm lose their ability to discriminate.
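The effect of dimensionality on distances can be made concrete with a small experiment. The following is a minimal sketch, assuming uniformly random points in the unit hypercube; the function name `relative_contrast` and the chosen sample sizes are illustrative assumptions, not part of the project. It measures how far the farthest point is from a query point relative to the nearest one.

```python
import math
import random


def relative_contrast(num_points, num_dims, rng):
    """Return (d_max - d_min) / d_min for the Euclidean distances from one
    random query point to num_points uniformly random points."""
    query = [rng.random() for _ in range(num_dims)]
    distances = []
    for _ in range(num_points):
        point = [rng.random() for _ in range(num_dims)]
        distances.append(math.dist(query, point))
    return (max(distances) - min(distances)) / min(distances)


rng = random.Random(0)  # fixed seed so the run is reproducible

# Assumed settings: 500 points, a handful of dimensionalities.
contrast_by_dim = {d: relative_contrast(500, d, rng) for d in (2, 10, 100, 1000)}

for dims, contrast in contrast_by_dim.items():
    print(f"{dims:>5} dimensions: relative contrast = {contrast:.2f}")
```

In low dimensions the nearest neighbour is many times closer than the farthest one; in high dimensions the two distances become almost equal, which is why a distance threshold or nearest-centroid assignment carries little information there.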

To overcome the curse of dimensionality, subspace clustering algorithms have been developed. These algorithms search for (a large number of) different clusters in different subspaces (i.e., subsets of the dimensions). However, the results of subspace clustering algorithms are often difficult to analyze or interpret, especially for non-experts. Furthermore, most of the algorithms depend on a large number of parameters that strongly influence the detected clusters. There is therefore a need to compute the quality of a subspace clustering result in order to determine the best parameter setting automatically.
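Why a cluster may only be visible in a subspace can be illustrated with synthetic data. The sketch below rests on assumptions made purely for illustration: 20 dimensions, a cluster that is compact only in the hypothetical subspace `RELEVANT_DIMS`, and a simple compactness ratio (mean pairwise distance inside the cluster divided by that of uniform background objects). It is not one of the subspace clustering algorithms referred to above.

```python
import math
import random
from itertools import combinations

NUM_DIMS = 20
RELEVANT_DIMS = (0, 1)  # hypothetical subspace in which the cluster lives


def mean_pairwise_distance(points, dims):
    """Mean Euclidean distance over all pairs, using only the given dimensions."""
    pairs = list(combinations(points, 2))
    total = sum(
        math.dist([a[d] for d in dims], [b[d] for d in dims]) for a, b in pairs
    )
    return total / len(pairs)


rng = random.Random(42)

# Cluster members: tightly grouped around 0.5 in the relevant dimensions,
# uniform noise in all remaining dimensions.
cluster = []
for _ in range(50):
    point = [rng.random() for _ in range(NUM_DIMS)]
    for d in RELEVANT_DIMS:
        point[d] = rng.gauss(0.5, 0.02)
    cluster.append(point)

# Background objects: uniform noise in every dimension.
background = [[rng.random() for _ in range(NUM_DIMS)] for _ in range(50)]


def compactness_ratio(dims):
    """Values near 1 mean the cluster is no tighter than the background."""
    return mean_pairwise_distance(cluster, dims) / mean_pairwise_distance(background, dims)


full_space_ratio = compactness_ratio(tuple(range(NUM_DIMS)))
subspace_ratio = compactness_ratio(RELEVANT_DIMS)

print(f"full space: {full_space_ratio:.2f}")
print(f"subspace {RELEVANT_DIMS}: {subspace_ratio:.2f}")
```

In the full space the 18 noise dimensions dominate the distances, so the cluster looks almost as spread out as the background; restricted to the two relevant dimensions it is clearly compact.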

### Problem Statement

The goal of this project is to develop novel quality measures for subspace clusters and subspace clustering results. While quite a few external quality measures for subspace clusterings exist (they compare a result against a given ground truth), internal measures, which are based purely on the cluster characteristics (e.g., density), do not exist. This makes it impossible to apply subspace clustering to real-world data where no ground-truth information exists.
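The difference between the two kinds of measures can be sketched on toy data. The example below is an illustration under stated assumptions, not a proposed measure: purity stands in for an external measure, and the silhouette coefficient, a well-known internal measure for full-space clustering, is naively restricted to a subspace via the hypothetical `dims` parameter. The data points and function names are made up, and at least two clusters are assumed.

```python
import math
from collections import Counter


def purity(cluster_labels, true_labels):
    """External measure: fraction of objects that carry the majority
    ground-truth label of their cluster. Requires a ground truth."""
    by_cluster = {}
    for found, truth in zip(cluster_labels, true_labels):
        by_cluster.setdefault(found, []).append(truth)
    majority_total = sum(
        Counter(members).most_common(1)[0][1] for members in by_cluster.values()
    )
    return majority_total / len(cluster_labels)


def mean_silhouette(points, cluster_labels, dims):
    """Internal measure: mean silhouette coefficient, with distances computed
    only over the dimensions in dims. Requires no ground truth."""
    def dist(a, b):
        return math.dist([a[d] for d in dims], [b[d] for d in dims])

    scores = []
    for i, (p, own) in enumerate(zip(points, cluster_labels)):
        # Mean distance from p to the members of each cluster (excluding p).
        mean_dist = {}
        for label in set(cluster_labels):
            others = [
                q for j, (q, l) in enumerate(zip(points, cluster_labels))
                if l == label and j != i
            ]
            if others:
                mean_dist[label] = sum(dist(p, q) for q in others) / len(others)
        if own not in mean_dist:  # singleton cluster: silhouette is 0 by convention
            scores.append(0.0)
            continue
        a = mean_dist[own]                                      # cohesion
        b = min(v for k, v in mean_dist.items() if k != own)    # separation
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Toy data: dimensions 0 and 1 separate the two groups, dimension 2 is noise.
points = [
    (0.1, 0.1, 0.9), (0.2, 0.1, 0.1), (0.1, 0.2, 0.5),
    (0.9, 0.8, 0.2), (0.8, 0.9, 0.8), (0.9, 0.9, 0.4),
]
found = [0, 0, 0, 1, 1, 1]
truth = ["a", "a", "a", "b", "b", "b"]  # only needed for the external measure

external_score = purity(found, truth)
internal_full = mean_silhouette(points, found, dims=(0, 1, 2))
internal_subspace = mean_silhouette(points, found, dims=(0, 1))

print(external_score, internal_full, internal_subspace)
```

One limitation of this naive restriction is that distances scale with the number of dimensions, so scores computed in subspaces of different dimensionality are not directly comparable.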

### Tasks

- Literature review and analysis of existing external quality measures.
- Literature review and analysis of internal quality measures for full-space clustering.
- Development of different novel quality measures.
- Thorough evaluation of the developed measures.
- Investigation of how visualization can help in the quality assessment.

### Requirements

- Good knowledge of data mining (especially clustering).
- Good understanding of complex concepts.
- Motivation to read scientific papers containing mathematical formulas.
- Good programming skills in Java.

### Scope/Duration/Start

- Scope: Bachelor/Master
- 6-month project, 3-month thesis (Bachelor) / 6-month thesis (Master)
- Start: immediately

- Theoretical (Analytical): 5 / 5
- Practical (Implementation): 3 / 5
- Literature Work: 5 / 5

### Contact

**Michael Hund** (E-Mail)

Research Associate, Ph.D. Student

Office Room: D215

### References

- Subspace Clustering for High Dimensional Data: A Review [Parsons et al., 2004]
- Visual Quality Assessment of Subspace Clusterings [Hund et al., 2016]
- Evaluating Clustering in Subspace Projections of High Dimensional Data [Müller et al., 2009]
- On using class-labels in evaluation of clusterings [Färber et al., 2010]