Matching methods for observational studies derived from large administrative databases

Ruoqi Yu, UC Berkeley
Sloan 380C

Tue, Jul 5 2022, 4:30pm

Ideally, causal relationships are studied with randomized experiments, but these are not always practical or ethical, so causal effects are often estimated from non-randomized observational studies. Matching is a common way to mimic a randomized experiment at the design stage: it constructs treated and control groups that are similar in their observed covariates. With the rapid development of technology in recent decades, data sets have grown in size and become more accessible for analysis, e.g., electronic health records, medical claims data, educational databases, and social media data. The increasing sample size poses tremendous computational challenges for optimal matching in observational studies. In current practice, very large matched samples are constructed by subdividing the population and solving a series of smaller problems, which can restrict the possible matches in undesirable ways. In the first part of this talk, I introduce a method that forms a single match using everyone in the data set and accelerates the computation in a different way, without sacrificing the quality of the subsequent statistical analysis. In particular, I reduce the number of candidate matches by using an iterative form of Glover’s algorithm for a doubly convex bipartite graph to determine an optimal caliper for the propensity score. After a matched sample is constructed, it is essential to assess its covariate balance, since residual imbalances can bias the estimated treatment effects. The common informal diagnostics have several limitations. In the second part, I discuss a new framework for covariate balance evaluation that compares the matched sample with complete randomizations. The method controls the probability of falsely detecting a covariate imbalance among many comparisons, yet has a high chance of correctly identifying a major problem.
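To make the idea of comparing a matched sample with complete randomizations concrete, here is a minimal sketch. It is an illustration under stated assumptions, not the framework presented in the talk: the function names, the choice of the largest absolute standardized mean difference as the test statistic, and the Monte Carlo p-value are all hypothetical. Using a single maximum over covariates is one simple way to keep the chance of a false alarm controlled across many comparisons.

```python
import random
import statistics


def max_abs_std_diff(rows, labels):
    """Largest absolute standardized mean difference across covariates.

    rows   : list of covariate tuples, one per matched unit
    labels : list of 1 (treated) / 0 (control), same length as rows
    Assumes both groups are non-empty.
    """
    worst = 0.0
    for j in range(len(rows[0])):
        column = [row[j] for row in rows]
        sd = statistics.pstdev(column)  # pooled SD, unchanged by relabeling
        if sd == 0:
            continue  # a constant covariate cannot be imbalanced
        treated = [x for x, z in zip(column, labels) if z == 1]
        control = [x for x, z in zip(column, labels) if z == 0]
        diff = statistics.mean(treated) - statistics.mean(control)
        worst = max(worst, abs(diff) / sd)
    return worst


def randomization_balance_pvalue(rows, labels, n_draws=2000, seed=0):
    """Monte Carlo p-value for the observed worst imbalance.

    The reference distribution comes from completely re-randomizing the
    treatment labels while keeping the number of treated units fixed.
    """
    rng = random.Random(seed)
    observed = max_abs_std_diff(rows, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_draws):
        rng.shuffle(shuffled)
        if max_abs_std_diff(rows, shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the p-value strictly positive.
    return (hits + 1) / (n_draws + 1)
```

A small p-value suggests the matched sample is less balanced than a completely randomized experiment of the same size would typically be; a large one gives no such signal.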
The methods are applied to a data set from US Medicaid with 198,368 surgical admissions to study the causal effects of having surgery at a children's hospital on children's mortality within 30 days of surgery.
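The caliper step mentioned in the abstract can also be sketched in a few lines. The code below is a simplified illustration, not the speaker's implementation: it assumes one-to-one pair matching on a scalar propensity score, uses a greedy feasibility check in the spirit of Glover's algorithm (valid here because, with both groups sorted, each treated unit's eligible controls form an interval whose endpoints increase together), and finds the smallest workable caliper by bisection. All function names and the tolerance parameter are hypothetical.

```python
def pair_match_feasible(treated, controls, caliper):
    """Check whether every treated score can get its own control
    within `caliper`. Both lists must be sorted in increasing order."""
    j = 0
    for t in treated:
        # Controls below t - caliper are also too low for later treated units.
        while j < len(controls) and controls[j] < t - caliper:
            j += 1
        if j == len(controls) or controls[j] > t + caliper:
            return False
        j += 1  # use this control for the current treated unit
    return True


def smallest_feasible_caliper(treated, controls, tol=1e-6):
    """Bisect for (approximately) the smallest caliper that still allows
    a pair match for every treated unit."""
    treated, controls = sorted(treated), sorted(controls)
    if len(controls) < len(treated):
        raise ValueError("need at least as many controls as treated units")
    if pair_match_feasible(treated, controls, 0.0):
        return 0.0
    everyone = treated + controls
    lo, hi = 0.0, max(everyone) - min(everyone)  # hi is always feasible
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if pair_match_feasible(treated, controls, mid):
            hi = mid
        else:
            lo = mid
    return hi


if __name__ == "__main__":
    # Made-up propensity scores, for illustration only.
    print(smallest_feasible_caliper([0.2, 0.5, 0.8],
                                    [0.1, 0.25, 0.45, 0.9, 0.95]))
```

Each feasibility check is a single linear pass over the sorted scores, so the cost is dominated by the initial sort. A caliper found this way removes candidate pairs that are far apart on the propensity score before any optimal matching is attempted on the remaining, much sparser set of pairs.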

Statistics and Probability Seminars