Image
ISL events

From Large to Small Datasets: Size Generalization for Clustering Algorithm Selection

Summary
Prof Ellen Vitercik (Stanford)
Packard 202
Mar
29
Date(s)
Content

Abstract: In clustering algorithm selection, we are given a massive dataset and must efficiently select which clustering algorithm to use. We study this problem in a semi-supervised setting, with an unknown ground-truth clustering that we can only access through expensive oracle queries. Ideally, the clustering algorithm’s output will be structurally close to the ground truth. We approach this problem by introducing a notion of size generalization for clustering algorithm accuracy. We identify conditions under which we can (1) subsample the massive clustering instance, (2) evaluate a set of candidate algorithms on the smaller instance, and (3) guarantee that the algorithm with the best accuracy on the small instance will have the best accuracy on the original big instance. We verify these findings both theoretically and empirically.

Bio: Ellen Vitercik is an Assistant Professor at Stanford University with a joint appointment between the Management Science & Engineering department and the Computer Science department. Her research revolves around machine learning theory, discrete optimization, and the interface between economics and computation. Before joining Stanford, she spent a year as a Miller Fellow at UC Berkeley after receiving a PhD in Computer Science from Carnegie Mellon University. Her thesis won the SIGecom Doctoral Dissertation Award and the CMU School of Computer Science Distinguished Dissertation Award.