# Statistics Department Seminar presents "Two-sample problem for high-dimensional multinomials and testing authorship"

Topic:
Two-sample problem for high-dimensional multinomials and testing authorship
Date:
Tuesday, February 11, 2020 - 4:30pm
Venue:
Sloan Mathematics Center, Room 380C
Speaker:
Alon Kipnis (Stanford Statistics)
Abstract / Description:

The Higher Criticism (HC) test is a useful tool for detecting the global significance of multiple independent tests, especially for rare and weak effects. We adapt the HC test to a discrete two-sample setting and use it as a measure of similarity between the samples. We apply this measure to word-frequency tables and authorship attribution challenges, where the goal is to identify the author of a document using other documents whose authorship is known. The method is simple yet performs well without handcrafting or tuning. Furthermore, as an inherent side effect, the HC calculation identifies a subset of discriminating words, which allows additional interpretation of the results. Our examples include authorship in the Federalist Papers and machine-generated texts.

We take two approaches to analyze the success of our method. First, we show that, in practice, the discriminating words identified by the test have low variance across documents belonging to a corpus of homogeneous authorship. We conclude that in testing a new document against the corpus of an author, HC is mostly affected by words characteristic of that author and is relatively unaffected by topic structure. Second, we analyze the power of the test in discriminating two multinomial distributions under rare and weak perturbations. We derive a phase transition curve for the power of the test, which separates the parameter space into an area where the test is successful and an area where it fails. This phase curve is different from the phase curve in the Gaussian means model.
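The idea of using HC as a similarity measure between two word-frequency tables can be sketched roughly as follows. This is a minimal illustration and not necessarily the speaker's exact formulation: it assumes the classical form of the HC statistic, a per-word exact two-sided binomial test, and a null proportion pooled from the two document lengths. The function names (`binom_two_sided_pvalue`, `higher_criticism`, `hc_similarity`) and the `gamma` cutoff are hypothetical choices made for this example.

```python
import math


def binom_two_sided_pvalue(k, n, p):
    """Exact two-sided binomial p-value: total probability of all
    outcomes that are no more likely than the observed count k."""
    pmf = [math.comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(n + 1)]
    tol = pmf[k] * (1 + 1e-9)  # small slack for floating-point ties
    return min(1.0, sum(q for q in pmf if q <= tol))


def higher_criticism(pvalues, gamma=0.5):
    """Return (HC*, threshold p-value), maximizing the standardized gap
    between i/N and the i-th smallest p-value over the smallest
    gamma-fraction of p-values (assumed classical form)."""
    n = len(pvalues)
    sorted_p = sorted(pvalues)
    limit = max(1, int(gamma * n))
    best_hc, best_p = -math.inf, None
    for i, p in enumerate(sorted_p[:limit], start=1):
        if p <= 0.0 or p >= 1.0:
            continue  # the standardization is undefined at 0 and 1
        hc = math.sqrt(n) * (i / n - p) / math.sqrt(p * (1 - p))
        if hc > best_hc:
            best_hc, best_p = hc, p
    return best_hc, best_p


def hc_similarity(counts_a, counts_b, gamma=0.5):
    """Compare two word-count dicts. A larger HC* suggests the tables
    are less alike; the second return value lists the words whose
    p-values fall at or below the HC threshold."""
    vocab = sorted(set(counts_a) | set(counts_b))
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    p_null = total_a / (total_a + total_b)  # assumption: pooled null proportion
    pvals = {}
    for word in vocab:
        a = counts_a.get(word, 0)
        b = counts_b.get(word, 0)
        pvals[word] = binom_two_sided_pvalue(a, a + b, p_null)
    hc, threshold = higher_criticism(list(pvals.values()), gamma)
    words = [w for w in vocab if threshold is not None and pvals[w] <= threshold]
    return hc, words


if __name__ == "__main__":
    doc_a = {"the": 50, "of": 30, "to": 40, "and": 35, "in": 25, "upon": 20}
    doc_b = {"the": 50, "of": 30, "to": 40, "and": 35, "in": 25, "while": 20}
    score, words = hc_similarity(doc_a, doc_b)
    print(score, words)
```

In this toy input the two tables agree on every common word and differ only in "upon" and "while", so those are the words the thresholding step singles out. For attribution, one would compute such a score between a disputed document and the pooled counts of each candidate author and prefer the author with the smallest score.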

The Statistics Seminars for Winter Quarter will be held in Room 380C of Sloan Mathematics Center in the Main Quad at 4:30pm on Tuesdays. Refreshments are served at 4pm in the Lounge on the first floor of Sequoia Hall.