Over the last several years, we have witnessed the emergence of datasets of unprecedented scale across scientific disciplines. The sheer volume of such datasets poses new computational challenges: diverse, feature-rich, and often high-resolution data does not lend itself to direct data-intensive inference. In this regard, data summarization is a compelling (and sometimes the only) approach that both exploits the richness of large-scale data and remains computationally tractable. Instead of operating on complex and large data directly, carefully constructed summaries not only enable the execution of various data analytics tasks but also improve their efficiency and scalability.
A systematic way to summarize data is to cast the problem as selecting a subset of data elements that optimizes a utility function quantifying the "representativeness" of the selected set. Oftentimes, these objective functions satisfy submodularity, an intuitive notion of diminishing returns stating that selecting any given element earlier helps more than selecting it later. Thus, many problems in data summarization reduce to maximizing a submodular set function subject to a cardinality constraint, and massive data means we have to solve these problems at scale.
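To make the problem concrete, here is a minimal sketch (not any of the algorithms from the talk) of the classic greedy method for maximizing a monotone submodular function under a cardinality constraint, which achieves a (1 - 1/e) approximation. The coverage function and the data are toy examples chosen only to illustrate diminishing returns:

```python
def coverage_utility(selected, sets):
    """f(S) = number of items covered by the chosen sets -- monotone and submodular."""
    covered = set()
    for i in selected:
        covered |= sets[i]
    return len(covered)

def greedy_max(ground, utility, k):
    """Select k elements, each time adding the one with the largest marginal gain."""
    selected = []
    for _ in range(k):
        best, best_gain = None, -1
        for e in ground:
            if e in selected:
                continue
            # Marginal gain of e: how much adding it improves the summary.
            gain = utility(selected + [e]) - utility(selected)
            if gain > best_gain:
                best, best_gain = e, gain
        selected.append(best)
    return selected

# Toy instance: pick 2 "representatives" that cover as many items as possible.
sets = {0: {1, 2, 3}, 1: {3, 4}, 2: {4, 5, 6, 7}, 3: {1, 7}}
summary = greedy_max(list(sets), lambda S: coverage_utility(S, sets), 2)
# summary is [2, 0], covering 7 of the 7 items.
```

Note that each greedy step scans the whole ground set, which is exactly why the fast centralized, streaming, and distributed variants discussed in the talk matter at scale.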
In this talk, I will present our recent efforts in developing practical schemes for data summarization. In particular, I will first discuss the fastest centralized solution, whose query complexity is only linear in the data size. However, to truly summarize massive data we need scalable methods. I will then present a streaming algorithm that, with a single pass over the data, provides a constant-factor approximation guarantee to the optimum solution. Finally, I will talk about a distributed approach that summarizes tens of millions of data points in a timely fashion. I will also demonstrate experiments on several applications, including sparse Gaussian process inference and exemplar-based clustering, using Apache Spark.
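The single-pass streaming idea can be illustrated with a simple thresholding rule in the spirit of sieve-streaming (not necessarily the algorithm from the talk): keep an element only if its marginal gain clears a threshold derived from an estimate of the optimum value. The sketch below assumes the estimate is given; in practice it is guessed geometrically. The coverage utility and data are hypothetical:

```python
def threshold_streaming(stream, utility, k, opt_est):
    """One pass over the stream: keep e if its marginal gain is at least
    (opt_est/2 - f(S)) / (k - |S|), which guarantees f(S) >= opt_est / 2
    when opt_est lower-bounds the true optimum."""
    selected = []
    for e in stream:
        if len(selected) == k:
            break
        gain = utility(selected + [e]) - utility(selected)
        if gain >= (opt_est / 2 - utility(selected)) / (k - len(selected)):
            selected.append(e)
    return selected

# Toy submodular utility: set coverage over hypothetical data.
sets = {0: {1, 2, 3}, 1: {3, 4}, 2: {4, 5, 6, 7}, 3: {1, 7}}
cover = lambda S: len(set().union(*[sets[i] for i in S]))
picked = threshold_streaming([0, 1, 2, 3], cover, 2, opt_est=7)
# picked is [0, 1] with coverage 4, within a factor 1/2 of the optimum (7).
```

Each element is examined once and either kept or discarded, so memory stays proportional to k rather than to the stream length.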
Amin Karbasi is currently an assistant professor of Electrical Engineering and Computer Science at Yale University. He has been the recipient of a 2017 Grainger grant from the National Academy of Engineering, the DARPA Young Faculty Award (2016), a Simons-Berkeley fellowship (2016), a Google Faculty Award (2015), and an ETH fellowship (2013). His work has been recognized with a variety of paper awards from AISTATS 2015, IEEE Data Storage 2013, ICASSP 2011, ACM SIGMETRICS 2010, and ISIT 2010 (runner-up). His Ph.D. work received the 2013 Patrick Denantes Memorial Prize for the best doctoral thesis in the School of Computer and Communication Sciences at EPFL.