Tests of Missing Completely At Random based on sample covariance matrices

Abstract

We study the problem of testing whether the missing values of a potentially high-dimensional dataset are Missing Completely at Random (MCAR). We relax the problem of testing MCAR to the problem of testing the compatibility of a collection of covariance matrices, motivated by the fact that this procedure is feasible when the dimension grows with the sample size. Our first contributions are to define a natural measure of the incompatibility of a collection of correlation matrices, which can be characterised as the optimal value of a Semi-definite Programming (SDP) problem, and to establish a key duality result allowing its practical computation and interpretation. By analysing the concentration properties of the natural plug-in estimator for this measure, we propose a novel hypothesis test, which is calibrated via a bootstrap procedure and demonstrates power against any distribution with incompatible covariance matrices. By considering key examples of missingness structures, we demonstrate that our procedures are minimax rate optimal in certain cases. We further validate our methodology with numerical simulations that provide evidence of validity and power, even when data are heavy tailed. Furthermore, tests of compatibility can be used to test the feasibility of positive semi-definite matrix completion problems with noisy observations, and thus our results may be of independent interest.
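As a toy illustration of what compatibility of a collection of correlation matrices means, consider three variables where each missingness pattern observes only one pair, so each pattern yields a single pairwise correlation. In this special case the three 2x2 matrices are compatible exactly when the assembled 3x3 matrix is positive semi-definite. The sketch below checks this with an eigenvalue computation. It is not the SDP-based incompatibility measure of the paper, nor the interface of the MCARtest package; the function names `min_eig_of_assembled` and `is_compatible_toy` are hypothetical and introduced only for this example.

```python
import numpy as np


def min_eig_of_assembled(r12, r23, r13):
    """Smallest eigenvalue of the 3x3 matrix assembled from pairwise correlations.

    Assumption: each pair (1,2), (2,3), (1,3) is observed under a different
    missingness pattern, so every off-diagonal entry is pinned down by
    exactly one pattern.
    """
    sigma = np.array(
        [
            [1.0, r12, r13],
            [r12, 1.0, r23],
            [r13, r23, 1.0],
        ]
    )
    # eigvalsh returns the eigenvalues of a symmetric matrix in ascending order.
    return float(np.linalg.eigvalsh(sigma)[0])


def is_compatible_toy(r12, r23, r13, tol=1e-10):
    """True if the three pairwise correlations admit a joint correlation matrix."""
    return min_eig_of_assembled(r12, r23, r13) >= -tol


# Three strong positive correlations can coexist in one joint distribution.
print(is_compatible_toy(0.9, 0.9, 0.9))

# Variables 1 and 3 both track variable 2 closely, so they cannot be strongly
# negatively correlated with each other: the assembled matrix has a negative
# eigenvalue and no joint correlation matrix exists.
print(is_compatible_toy(0.9, 0.9, -0.9))
```

With sample correlations in place of population ones, a slightly negative smallest eigenvalue can arise from noise alone, which is why a calibrated test is needed rather than a hard threshold.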

Publication
In The Annals of Statistics, to appear. I contributed to the R package MCARtest.
Alberto Bordino
PhD Student in Statistics, Warwick CDT in Statistics

Third-year PhD candidate developing nonparametric, minimax-optimal methods for learning with missing or heterogeneous data.