Abstract
It is quite common in modern research for a researcher to test many hypotheses. The statistical (frequentist) hypothesis testing framework does not scale with the number of hypotheses, in the sense that naively performing many hypothesis tests will probably yield many false findings. Indeed, statistical "significance" is evidence for the presence of a signal within the noise expected in a single test, not in a multitude. To protect against an uncontrolled number of erroneous findings, a researcher has to consider the types of errors he wishes to avoid and select the procedure adequate for that error type and data structure.
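To make the inflation of errors concrete, the following short sketch (an illustration, not part of the manuscript) computes the probability of at least one false rejection when m independent true-null hypotheses are each tested at level alpha = 0.05:

```python
# Illustrative calculation: with m independent tests of true nulls at level alpha,
# the probability of at least one false rejection is 1 - (1 - alpha)^m.
alpha = 0.05
for m in (1, 10, 50, 100):
    p_any_false_finding = 1 - (1 - alpha) ** m
    print(f"m = {m:3d} tests: P(at least one false rejection) = {p_any_false_finding:.3f}")
```

Even at m = 50, the chance of at least one false finding already exceeds 90%.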
A quick search of the [multiple-comparisons] tag on the statistics question-and-answer site Cross Validated (https://stats.stackexchange.com) demonstrates the amount of confusion this task can actually cause. This was also a point made at the 2009 Multiple Comparisons conference in Tokyo. In an attempt to offer guidance, we review possible error types for multiple testing and demonstrate them with practical examples that clarify the formalism. Finally, we include some notes on the software implementations of the methods discussed.
The emphasis of this manuscript is on the error rates, not on the procedures themselves, although we do name several procedures where appropriate. P-value adjustment will not be discussed, as it is procedure specific: it is the choice of a procedure that defines the p-value adjustment, not the error rate itself. Simultaneous confidence intervals will also not be discussed.
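As a hedged illustration of why the adjustment is defined by the procedure rather than by the error rate, the sketch below (not taken from the manuscript) adjusts the same hypothetical p-values with two different procedures using the multipletests function from the Python statsmodels package; the two procedures target different error rates and produce different adjusted values:

```python
from statsmodels.stats.multitest import multipletests

# Hypothetical raw p-values from five tests (made up for illustration).
pvals = [0.001, 0.012, 0.034, 0.047, 0.210]

# The same p-values under two procedures: Bonferroni (controls the
# family-wise error rate) and Benjamini-Hochberg (controls the false
# discovery rate). The adjusted p-values differ by procedure.
for method in ("bonferroni", "fdr_bh"):
    reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method=method)
    print(method, [round(p, 3) for p in p_adj], reject)
```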