High-throughput sequencing (HTS) allows the quantification of non-culturable microbial organisms in human health and disease states, including infectious diseases. However, contaminating DNA from external sources may lead to misidentification of a taxon's provenance. Sequencing controls can help to identify most of these contaminants through the use of statistical mixture models. We propose a Bayesian reference analysis, based on a hierarchical model for the observed data, that infers the true intensities of a specimen's microbial DNA in the presence of microbial DNA contamination. Using the partial information about contamination intensities available in negative controls, we define a marginal likelihood and a reference prior for the true intensities. We then obtain a marginal posterior distribution for the true intensities.
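To make the inference chain above concrete (negative controls, marginal likelihood, marginal posterior), here is a toy single-taxon sketch in Python. It is not the implementation in our R package, and every modeling choice in it is an assumption made for illustration: the observed count is taken to be Poisson with rate equal to the true intensity plus the contaminant intensity, the contaminant intensity gets a gamma posterior from the negative-control counts under a Jeffreys prior, library-size scaling is ignored, and a flat prior on a grid stands in for the reference prior. All function names are hypothetical.

```python
import math

def log_poisson(k, lam):
    """Log Poisson pmf, with the lam == 0 edge case handled explicitly."""
    if lam == 0.0:
        return 0.0 if k == 0 else -math.inf
    return k * math.log(lam) - lam - math.lgamma(k + 1)

def log_negbin(j, a, b):
    """Log pmf of a Poisson count whose rate is Gamma(shape=a, rate=b)."""
    return (math.lgamma(a + j) - math.lgamma(a) - math.lgamma(j + 1)
            + a * math.log(b / (b + 1.0)) - j * math.log(b + 1.0))

def marginal_likelihood(k, lam_true, a, b):
    """P(k | lam_true) with the contaminant intensity integrated out.

    The observed count is the sum of a Poisson(lam_true) count and a
    gamma-Poisson contaminant count, so the marginal is their convolution.
    """
    terms = [log_poisson(k - j, lam_true) + log_negbin(j, a, b)
             for j in range(k + 1)]
    m = max(terms)
    return math.exp(m) * sum(math.exp(t - m) for t in terms)

def posterior_on_grid(k, control_counts, grid):
    """Discrete marginal posterior of the true intensity over `grid`.

    Assumptions (illustrative only): Jeffreys prior on the contaminant
    intensity, giving a Gamma(0.5 + sum, n) posterior from the controls,
    and a flat prior over the grid in place of a reference prior.
    """
    a = 0.5 + sum(control_counts)
    b = float(len(control_counts))
    weights = [marginal_likelihood(k, lam, a, b) for lam in grid]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical numbers: four negative controls and one specimen count.
controls = [3, 5, 4, 6]
grid = [0.5 * i for i in range(161)]          # true intensity from 0 to 80
posterior = posterior_on_grid(40, controls, grid)
posterior_mean = sum(lam * p for lam, p in zip(grid, posterior))
```

In this sketch, a specimen count well above the control counts gives a posterior concentrated away from zero, while a count comparable to the controls pulls the posterior toward zero, which is the behavior a contamination-removal rule would act on.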
In this talk, I will present the performance of the contamination removal method on a dilution series of the ZymoBIOMICS microbial community standard. I will also demonstrate our approach on two low-biomass plasma specimen datasets. Our method is available as an open-source R package on GitHub. In addition, to help identify contaminant sources, we provide a topic modeling approach that infers contaminant topics.