In this talk we describe our knowledge extraction and fusion efforts at Google, including the Knowledge Vault project and the Knowledge-based Trust project. We use 15 extractors to periodically extract knowledge from 1B+ webpages, yielding 3B+ distinct (subject, predicate, object) knowledge triples. Errors can creep in at every stage of this process, both from erroneous data provided by the Web sources and from mistakes made by the extractors. As a result, only about 20% of the extracted triples are correct.
We adapt state-of-the-art data fusion techniques to solve the knowledge fusion problem. By leveraging the collective wisdom from different extractors and from different Web sources, we are able to compute well-calibrated probabilities for the correctness of each triple as well as the correctness of extractions. In addition, we are able to compute trustworthiness for 119M webpages and 5.6M websites. We discuss our observations and provide insights on future research directions.
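To make the idea of fusing conflicting claims concrete, here is a minimal sketch of accuracy-weighted voting, a simplified form of classic data fusion. It is an illustration only, not the method used in Knowledge Vault or Knowledge-based Trust: it models sources but not extractors, and the function name `fuse`, the parameters `n_false` and `init_accuracy`, and the example sites and triples are all assumptions made for this example.

```python
import math
from collections import defaultdict


def fuse(claims, n_false=10, init_accuracy=0.8, iterations=20):
    """Estimate value probabilities and source accuracies from conflicting claims.

    claims: list of (source, data_item, value) tuples, where data_item is a
    (subject, predicate) pair and value is the claimed object.
    n_false: assumed number of wrong values per data item (hypothetical parameter).
    """
    sources = {s for s, _, _ in claims}
    accuracy = {s: init_accuracy for s in sources}

    # Group the claims: data item -> candidate value -> sources asserting it.
    by_item = defaultdict(lambda: defaultdict(set))
    for source, item, value in claims:
        by_item[item][value].add(source)

    probs = {}
    for _ in range(iterations):
        # Step 1: score each candidate value by the summed log-odds votes
        # of the sources that assert it, then normalise with a softmax.
        for item, values in by_item.items():
            scores = {
                v: sum(math.log(n_false * accuracy[s] / (1.0 - accuracy[s]))
                       for s in srcs)
                for v, srcs in values.items()
            }
            top = max(scores.values())  # subtract the max for numerical stability
            weights = {v: math.exp(score - top) for v, score in scores.items()}
            total = sum(weights.values())
            probs[item] = {v: w / total for v, w in weights.items()}

        # Step 2: re-estimate each source's accuracy as the mean probability
        # of the values it claims, clamped away from 0 and 1 so the log-odds
        # in step 1 stay finite.
        sums = defaultdict(float)
        counts = defaultdict(int)
        for source, item, value in claims:
            sums[source] += probs[item][value]
            counts[source] += 1
        for s in sources:
            accuracy[s] = min(max(sums[s] / counts[s], 0.01), 0.99)

    return probs, accuracy


# Hypothetical toy input: two sites agree with each other, a third disagrees.
example_claims = [
    ("siteA", ("Barack Obama", "born_in"), "Honolulu"),
    ("siteB", ("Barack Obama", "born_in"), "Honolulu"),
    ("siteC", ("Barack Obama", "born_in"), "Kenya"),
    ("siteA", ("Paris", "capital_of"), "France"),
    ("siteB", ("Paris", "capital_of"), "France"),
    ("siteC", ("Paris", "capital_of"), "Texas"),
]
example_probs, example_accuracy = fuse(example_claims)
```

In this toy input the value backed by two agreeing sites ends up with the higher probability, and the dissenting site ends up with the lower estimated accuracy. The sketch normalises only over values that were actually claimed, so its outputs should not be read as well-calibrated probabilities in the sense described above.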
Xin Luna Dong is a Senior Research Scientist at Google Inc. She is one of the major contributors to the Knowledge Vault project, and has led the Knowledge-based Trust project, the Solomon data fusion project, and the Semex personal data management project. She has co-authored a book, "Big Data Integration", published over 50 papers in top conferences and journals, given over 10 tutorials, and received the Best Demo award at SIGMOD 2005. She is the PC co-chair for WAIM 2015 and has served as an area chair for SIGMOD 2015, ICDE 2013, and CIKM 2011.