The Higher Criticism (HC) test is a useful tool for detecting the presence of a signal spread across a vast number of features, especially in the sparse setting where only a few features are useful while the rest contain only noise. We adapt the HC test to the two-sample setting of detecting changes between two frequency tables. We apply this adaptation to authorship attribution challenges, where the goal is to identify the author of a document using other documents whose authorship is known. The method is simple yet performs well without handcrafting or tuning. Furthermore, as an inherent side effect, the HC calculation identifies a subset of discriminating words, which allows additional interpretation of the results. Our examples include authorship in the Federalist Papers and machine-generated texts.
We take two approaches to analyze the success of our method. First, we show that, in practice, the discriminating words identified by the test have low variance across documents belonging to a corpus of homogeneous authorship. We conclude that in testing a new document against the corpus of an author, HC is mostly affected by words characteristic of that author and is relatively unaffected by topic structure. Second, we analyze the power of the test in discriminating between two multinomial distributions under a sparse and weak perturbation model. We show that our test has maximal power over a wide range of the model parameters, even though these parameters are unknown to the user.
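To make the two-sample idea concrete, here is a minimal Python sketch of an HC-style comparison of two word-frequency tables. It is an illustration under stated assumptions, not the speaker's implementation: the function names `binom_two_sided_pvalue` and `hc_two_sample` are hypothetical, the per-word null proportion is simplified to the share of all tokens in the first table, the HC normalization is one common variant, and the search fraction `gamma=0.25` is an arbitrary choice.

```python
import math
from collections import Counter


def binom_two_sided_pvalue(k, n, p):
    """Exact two-sided binomial p-value: total probability of all
    outcomes that are no more likely than the observed count k."""
    pmf = [math.comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(n + 1)]
    cutoff = pmf[k] * (1 + 1e-9)  # small tolerance for floating-point ties
    return min(1.0, sum(q for q in pmf if q <= cutoff))


def hc_two_sample(counts1, counts2, gamma=0.25):
    """Return (HC score, p-value threshold, discriminating words) for two
    word-count tables. All modelling choices here are assumptions."""
    if not 0 < gamma < 1:
        raise ValueError("gamma must lie strictly between 0 and 1")
    total1, total2 = sum(counts1.values()), sum(counts2.values())
    if total1 == 0 or total2 == 0:
        raise ValueError("both tables must contain at least one token")

    # Assumed null: each occurrence of a word falls in table 1 with
    # probability equal to table 1's share of all tokens.
    p_null = total1 / (total1 + total2)

    pvals = {}
    for word in set(counts1) | set(counts2):
        c1, c2 = counts1.get(word, 0), counts2.get(word, 0)
        pvals[word] = binom_two_sided_pvalue(c1, c1 + c2, p_null)

    ranked = sorted(pvals.items(), key=lambda kv: kv[1])
    n = len(ranked)
    if n < 2:
        raise ValueError("need at least two distinct words")

    # Search only the smallest gamma-fraction of the sorted p-values.
    max_rank = max(1, int(gamma * n))
    best_hc, threshold = -math.inf, ranked[0][1]
    for i in range(1, max_rank + 1):
        frac = i / n
        p_i = ranked[i - 1][1]
        hc = math.sqrt(n) * (frac - p_i) / math.sqrt(frac * (1 - frac))
        if hc > best_hc:
            best_hc, threshold = hc, p_i

    # Words whose p-value falls at or below the HC threshold.
    words = [w for w, p in ranked if p <= threshold]
    return best_hc, threshold, words


# Toy, made-up counts: the tables agree except for the word "upon".
doc_a = Counter({"the": 50, "of": 30, "and": 20, "to": 20,
                 "in": 15, "a": 15, "is": 10, "upon": 20})
doc_b = Counter({"the": 50, "of": 30, "and": 20, "to": 20,
                 "in": 15, "a": 15, "is": 10})
score, threshold, words = hc_two_sample(doc_a, doc_b)
print(score, threshold, words)
```

The list of words at or below the threshold plays the role of the discriminating words mentioned above. In an attribution setting, one table would hold the counts of the disputed document and the other the pooled counts of a candidate author's corpus, with a smaller score suggesting a closer match.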
Alon Kipnis is a postdoctoral scholar in the Department of Statistics at Stanford University. He received his B.Sc. degree in mathematics (summa cum laude) and his B.Sc. degree in electrical engineering (summa cum laude), both in 2010, and his M.Sc. degree in mathematics in 2012, all from Ben-Gurion University of the Negev. He received his Ph.D. degree in electrical engineering from Stanford University. His research combines data compression and dimensionality reduction techniques with classical methods in signal processing and machine learning.