Abstract: As commonplace as big data now are, there are few statistical methods and computational algorithms that are both widely applicable and account for uncertainty in their results. Theories and applications of non-probabilistic approaches account for most of the significant advances in statistical methodology for big data. A major limitation of these methods is that it is unclear how to account for uncertainty in their results. The Bayesian framework, which is based on Bayes' theorem, provides a probabilistic approach for analyzing big data and quantifies uncertainty using a probability measure. Bayesian methods are popular in machine learning and statistics, and they are widely used in the data sciences because they are easily extended to capture data-specific patterns. This flexibility comes at a cost: the computations become intractable for realistic modeling of massive data sets. Divide-and-conquer Bayesian methods provide a general approach for tractable computations in massive data sets. These methods first divide the data into smaller subsets, then estimate a probability measure on each subset in parallel, and finally combine the probability measures from all the subsets to approximate the probability measure computed using the full data. In this talk, I will introduce one such approach that relies on the geometry of the probability measures estimated across different subsets and combines them through their barycenter in a Wasserstein space of probability measures. The geometric method has attractive theoretical properties and shows superior empirical performance on a large movie ratings database.
This talk is based on joint work with David B. Dunson (Duke University) and Cheng Li (National University of Singapore). A preliminary version of this work is available at https://arxiv.org/abs/1508.05880.
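
The divide, estimate, and combine steps described in the abstract can be illustrated with a toy sketch. It is not the implementation from the paper: it assumes a one-dimensional normal mean with known variance and a flat prior, so each subset posterior is available in closed form, and it raises each subset likelihood to the power k so that the subset posteriors have the same scale as the full-data posterior (a rescaling assumed here for illustration). It also uses the fact that, in one dimension, the 2-Wasserstein barycenter of equally weighted empirical measures with the same number of atoms is obtained by averaging their sorted draws. The function names and the simulated data are hypothetical.

```python
import numpy as np


def subset_posterior_draws(x, k, sigma2, n_draws, rng):
    """Draws from one subset posterior of a normal mean (flat prior, known variance).

    The subset likelihood is raised to the power k (an assumption of this
    sketch), so the posterior variance is sigma2 / (k * m) rather than sigma2 / m.
    """
    m = x.size
    return rng.normal(x.mean(), np.sqrt(sigma2 / (k * m)), size=n_draws)


def wasserstein_barycenter_1d(draws):
    """Equal-weight 2-Wasserstein barycenter of k one-dimensional empirical measures.

    draws has shape (k, n_draws). In 1D the barycenter's quantile function is
    the average of the subset quantile functions, so we sort each row and
    average across rows.
    """
    return np.sort(draws, axis=1).mean(axis=0)


rng = np.random.default_rng(0)
n, k, sigma2 = 100_000, 10, 4.0            # simulated sizes, not from the talk
data = rng.normal(1.5, np.sqrt(sigma2), size=n)

# Divide: random partition into k subsets of equal size.
subsets = np.array_split(rng.permutation(data), k)

# Estimate: this loop is where the parallel per-subset computation would go.
draws = np.stack(
    [subset_posterior_draws(s, k, sigma2, 2000, rng) for s in subsets]
)

# Combine: barycenter of the k subset posteriors.
combined = wasserstein_barycenter_1d(draws)

# Compare with the exact full-data posterior N(mean(data), sigma2 / n).
print("barycenter mean, sd:", combined.mean(), combined.std())
print("full-data  mean, sd:", data.mean(), np.sqrt(sigma2 / n))
```

In this conjugate toy case the barycenter should closely match the full-data posterior, which makes the sketch easy to check. In realistic models the subset posteriors would come from a sampler such as MCMC, and in more than one dimension the barycenter requires a dedicated optimal transport solver rather than sorting.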