Internal Quality Measures for Subspace Clusterings
Overview and Background
Clustering is a well-known unsupervised data mining technique for finding groups of similar data objects. Different data distributions and cluster shapes call for different algorithms (e.g., k-means for roughly spherical clusters, or DBSCAN for arbitrarily shaped clusters). Clustering high-dimensional data (data objects with a large number of attributes) suffers from the so-called curse of dimensionality. Put simply, as the number of dimensions grows, the distances between objects become increasingly similar, which undermines the similarity measures (e.g., Euclidean distance) that form the foundation of any distance-based clustering algorithm.
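The effect on distance measures can be observed with a small experiment. The following is a minimal sketch, assuming uniformly distributed random data; the class name `DistanceConcentration`, the sample size, and the seed are illustrative choices and not part of the project. It computes the relative contrast `(dMax - dMin) / dMin` of the distances from one query point to all other points, which tends to shrink as the number of dimensions grows.

```java
import java.util.Random;

final class DistanceConcentration {

    /** Euclidean distance between two points of equal dimensionality. */
    static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    /** Draws one point uniformly at random from the unit hypercube [0,1]^dims. */
    static double[] randomPoint(Random rnd, int dims) {
        double[] p = new double[dims];
        for (int i = 0; i < dims; i++) {
            p[i] = rnd.nextDouble();
        }
        return p;
    }

    /**
     * Relative contrast (dMax - dMin) / dMin of the distances between one
     * random query point and n random data points (assumed setup: uniform data).
     */
    static double relativeContrast(int n, int dims, long seed) {
        Random rnd = new Random(seed);
        double[] query = randomPoint(rnd, dims);
        double min = Double.POSITIVE_INFINITY;
        double max = 0.0;
        for (int i = 0; i < n; i++) {
            double d = euclidean(query, randomPoint(rnd, dims));
            min = Math.min(min, d);
            max = Math.max(max, d);
        }
        return (max - min) / min;
    }

    public static void main(String[] args) {
        // The contrast is expected to drop sharply with growing dimensionality.
        for (int dims : new int[] {2, 10, 100, 1000}) {
            System.out.printf("dims=%4d  relative contrast=%.3f%n",
                    dims, relativeContrast(500, dims, 42L));
        }
    }
}
```

When the contrast approaches zero, the nearest and the farthest neighbor are almost equally far away, so a distance-based algorithm can no longer tell similar objects from dissimilar ones.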
To overcome the curse of dimensionality, subspace clustering algorithms have been developed. These algorithms search for (often a large number of) different clusters in different subspaces (i.e., subsets of the dimensions). However, the results of subspace clustering algorithms are often difficult to analyze or interpret, especially for non-experts. Furthermore, most algorithms depend on a large number of parameters that strongly influence the detected clusters. There is therefore a need to quantify the quality of a subspace clustering result in order to automatically determine the best parameter setting.
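To make the notion of a subspace cluster concrete, the sketch below models it as a set of object indices together with a set of dimension indices, and restricts the Euclidean distance to those dimensions. The class `SubspaceCluster` and the method `subspaceDistance` are hypothetical names chosen for illustration; actual subspace clustering algorithms may represent their results differently.

```java
final class SubspaceCluster {

    final int[] members;     // row indices of the objects that belong to the cluster
    final int[] dimensions;  // column indices that span the cluster's subspace

    SubspaceCluster(int[] members, int[] dimensions) {
        this.members = members.clone();
        this.dimensions = dimensions.clone();
    }

    /** Euclidean distance computed only over the given dimensions. */
    static double subspaceDistance(double[] a, double[] b, int[] dims) {
        double sum = 0.0;
        for (int d : dims) {
            double diff = a[d] - b[d];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        double[][] data = {
            {1.0, 2.0, 9.0},
            {1.0, 2.0, 1.0}
        };
        // Both objects form a cluster in the subspace spanned by dimensions 0 and 1.
        SubspaceCluster cluster = new SubspaceCluster(new int[] {0, 1}, new int[] {0, 1});

        double inSubspace = subspaceDistance(data[0], data[1], cluster.dimensions);
        double inFullSpace = subspaceDistance(data[0], data[1], new int[] {0, 1, 2});
        System.out.println("distance in subspace:   " + inSubspace);
        System.out.println("distance in full space: " + inFullSpace);
    }
}
```

The two objects coincide in dimensions 0 and 1 but differ strongly in dimension 2, so they look similar only when the irrelevant dimension is ignored.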
The goal of this project is to develop novel quality measures for subspace clusters and subspace clustering results. While quite a few external quality measures (which compare a result against a given ground truth) exist for subspace clusterings, internal measures, which are based purely on cluster characteristics (e.g., density), do not exist. This makes it impossible to assess the quality of subspace clustering results on real-world data, where no ground-truth information exists.
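As a rough illustration of what an internal measure looks like, the sketch below computes a simple compactness ratio for one subspace cluster: the mean pairwise distance among its members divided by the mean distance between members and non-members, both measured in the cluster's subspace. This ratio is an assumption borrowed from the compactness/separation idea used in full-space clustering, not a measure proposed by this project, and all names (`CompactnessSketch`, `compactnessRatio`) are hypothetical.

```java
final class CompactnessSketch {

    /** Euclidean distance computed only over the given dimensions. */
    static double subspaceDistance(double[] a, double[] b, int[] dims) {
        double sum = 0.0;
        for (int d : dims) {
            double diff = a[d] - b[d];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    /** Mean pairwise distance among the cluster members, measured in the subspace. */
    static double meanIntraDistance(double[][] data, int[] members, int[] dims) {
        double sum = 0.0;
        int pairs = 0;
        for (int i = 0; i < members.length; i++) {
            for (int j = i + 1; j < members.length; j++) {
                sum += subspaceDistance(data[members[i]], data[members[j]], dims);
                pairs++;
            }
        }
        return pairs == 0 ? 0.0 : sum / pairs;
    }

    /** Mean distance between members and non-members, measured in the subspace. */
    static double meanInterDistance(double[][] data, int[] members, int[] dims) {
        boolean[] inCluster = new boolean[data.length];
        for (int m : members) {
            inCluster[m] = true;
        }
        double sum = 0.0;
        int pairs = 0;
        for (int m : members) {
            for (int o = 0; o < data.length; o++) {
                if (!inCluster[o]) {
                    sum += subspaceDistance(data[m], data[o], dims);
                    pairs++;
                }
            }
        }
        if (pairs == 0) {
            throw new IllegalArgumentException("cluster contains all objects");
        }
        return sum / pairs;
    }

    /** Illustrative ratio: lower values mean a more compact, better separated cluster. */
    static double compactnessRatio(double[][] data, int[] members, int[] dims) {
        return meanIntraDistance(data, members, dims) / meanInterDistance(data, members, dims);
    }

    public static void main(String[] args) {
        double[][] data = {
            {0, 0, 5}, {0, 1, -3}, {1, 0, 7},   // close together in dimensions 0 and 1
            {10, 10, 0}, {10, 11, 2}            // far away in dimensions 0 and 1
        };
        int[] members = {0, 1, 2};
        System.out.println("subspace {0,1}: " + compactnessRatio(data, members, new int[] {0, 1}));
        System.out.println("full space:     " + compactnessRatio(data, members, new int[] {0, 1, 2}));
    }
}
```

No ground truth is needed here: the score depends only on the data and the cluster itself. One known difficulty with such a naive score is that distances computed in subspaces of different dimensionality are not directly comparable.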
- Literature review and analysis of existing external quality measures.
- Literature review and analysis of internal quality measures for full-space clustering.
- Development of different novel quality measures.
- Thorough evaluation of the developed measures.
- How can visualization help in the quality assessment?
- Good knowledge in data mining (especially clustering).
- Good understanding of complex concepts.
- Motivation to read scientific papers with mathematical formulas.
- Good programming skills in Java.
- Scope: Bachelor/Master
- 6-month project, 3-month thesis (Bachelor) / 6-month thesis (Master)
- Start: immediately
|Theoretical (Analytical)|5 / 5|
|Practical (Implementation)|3 / 5|
|Literature Work|5 / 5|
- Subspace Clustering for High Dimensional Data: A Review [Parsons et al., 2004]
- Visual Quality Assessment of Subspace Clusterings [Hund et al., 2016]
- Evaluating Clustering in Subspace Projections of High Dimensional Data [Müller et al., 2009]
- On using class-labels in evaluation of clusterings [Färber et al., 2010]