Reliability and Validity in Experiment Design

(text copied from Wikipedia)

The key traditional concepts in classical test theory are reliability and validity. A reliable measure measures something consistently, while a valid measure measures what it is supposed to measure. A reliable measure may be consistent without being valid: a measurement instrument such as a broken ruler may under-measure a quantity by the same amount each time (consistently), yet the resulting quantity is still wrong, that is, invalid. For another analogy, a reliable rifle will produce a tight cluster of bullets on the target, while a valid one will center its cluster on the target's center, whether or not the cluster is tight.

Both reliability and validity may be assessed conceptually and mathematically. Stability of scores over repeated administrations of the same test can be assessed with the Pearson correlation coefficient and is often called test-retest reliability. Similarly, the equivalence of different versions of the same measure can be indexed by a Pearson correlation and is called equivalent-forms (or parallel-forms) reliability. Internal consistency, which addresses the homogeneity of a single test form, may be assessed by correlating performance on two halves of a test, termed split-half reliability; the Pearson product-moment correlation between the two half-tests is then adjusted with the Spearman-Brown prediction formula to estimate the correlation between two full-length tests. Finally, the most commonly used index of reliability is probably Cronbach's α, which is equivalent to the mean of all possible split-half coefficients. Other approaches include the intra-class correlation, which expresses the proportion of the total variance in the measurements that is attributable to differences between targets rather than to measurement error within targets.
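As a rough, illustrative sketch (not part of the Wikipedia text), the following Python functions compute three of these reliability indices from a hypothetical respondents-by-items score matrix; the function names, the odd/even item split, and the use of NumPy are assumptions made here for the example.

```python
import numpy as np

def test_retest_reliability(scores_t1, scores_t2):
    """Pearson correlation between total scores from two administrations
    of the same test to the same respondents (test-retest reliability)."""
    return np.corrcoef(scores_t1.sum(axis=1), scores_t2.sum(axis=1))[0, 1]

def split_half_reliability(scores):
    """Correlate odd- and even-item half-test totals, then apply the
    Spearman-Brown prediction formula to estimate full-length reliability."""
    half1 = scores[:, 0::2].sum(axis=1)  # odd-numbered items
    half2 = scores[:, 1::2].sum(axis=1)  # even-numbered items
    r_half = np.corrcoef(half1, half2)[0, 1]
    return 2 * r_half / (1 + r_half)     # correction for doubled test length

def cronbach_alpha(scores):
    """Cronbach's alpha: (k / (k - 1)) * (1 - sum of item variances / variance of totals)."""
    k = scores.shape[1]
    item_variances = scores.var(axis=0, ddof=1)
    total_variance = scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)
```

Splitting on odd and even items is only one of many possible splits; part of the appeal of Cronbach's α is that, by being equivalent to the mean over all possible splits, it removes the arbitrariness of choosing any particular one.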

Validity may be assessed by correlating a measure with a criterion measure known to be valid. When the criterion measure is collected at the same time as the measure being validated, the goal is to establish concurrent validity; when the criterion is collected later, the goal is to establish predictive validity. A measure has construct validity if it is related to other variables as required by theory. Content validity is a demonstration that the items of a test are drawn from the domain being measured.
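To make the criterion-correlation idea concrete, here is a minimal sketch in the same hypothetical style as the reliability functions above; whether the coefficient speaks to concurrent or predictive validity depends only on when the criterion data were collected, not on the computation itself.

```python
import numpy as np

def criterion_validity(test_scores, criterion_scores):
    """Validity coefficient: Pearson correlation between the measure being
    validated and a criterion measure already accepted as valid.  If the
    criterion is collected at the same time, this estimates concurrent
    validity; if it is collected later, predictive validity."""
    return np.corrcoef(test_scores, criterion_scores)[0, 1]
```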