Validity
Cal State Northridge, Psy 427
Andrew Ainsworth, PhD

Validity - Definitions
- The extent to which a test measures what it was designed to measure.
- Agreement between a test score or measure and the quality it is believed to measure.
- Proliferation of definitions led to a dilution of the meaning of the word into all kinds of "validities":

Internal validity - cause and effect in experimentation; high levels of control; elimination of confounding variables.

External validity - the extent to which one may safely generalize the (internally valid) causal inference (a) from the sample studied to the defined target population and (b) to other populations (i.e., across time and space). Generalizing to other people.
- Population validity - whether the sample results can be generalized to the target population.
- Ecological validity - whether the results can be applied to real-life situations. Generalizing to other (real) situations.

Content validity - when trying to measure a domain, are all sub-domains represented?
- When measuring depression, are all 16 clinical criteria represented in the items?
- Very complementary to domain sampling theory and reliability.
- However, high content validity will often lead to lower internal consistency reliability.

Construct validity - overall, are you measuring what you intend to measure?
- Intentional validity - are you measuring what you intend and not something else? Requires that constructs be specific enough to differentiate.
- Representation validity (translation validity) - how well the constructs have been translated into measurable outcomes.
That is, the validity of the operational definitions.
- Face validity - does a test "appear" to measure the content of interest? Do questions about depression contain the words "sad" or "depressed"?
- Observation validity - how good are the measures themselves? Akin to reliability.
- Convergent validity - the degree to which a measure correlates with other measures it is theoretically predicted to correlate with.
- Discriminant validity - the degree to which the operationalization does not correlate with other operationalizations it theoretically should not correlate with.

Criterion-related validity - the success of measures used for prediction or estimation. There are two types:
- Concurrent validity - the degree to which a test correlates with an external criterion measured at the same time (e.g., does a depression inventory correlate with clinical diagnoses?).
- Predictive validity - the degree to which a test predicts (correlates with) an external criterion measured some time in the future (e.g., does a depression inventory score predict later clinical diagnosis?).

Social validity - the social importance and acceptability of a measure.

There is a total mess of "validities" and their definitions. What to do?
In 1985, a Joint Committee of
- AERA: American Educational Research Association
- APA: American Psychological Association
- NCME: National Council on Measurement in Education
developed the Standards for Educational and Psychological Testing (revised in 1999).

According to the Joint Committee:
- Validity is the evidence for inferences made about a test score.
- Three types of evidence: content-related, criterion-related, and construct-related.
- This is different from the notion of "different types of validity."

Content-related evidence (content validity): based upon an analysis of the body of knowledge surveyed.
Criterion-related evidence (criterion validity): based upon the relationship between scores on a particular test and performance or abilities on a second measure (or in real life).
Construct-related evidence (construct validity): based upon an investigation of the psychological constructs or characteristics of the test.

Validity That Isn't: Face Validity
- The mere appearance that a test has validity: does the test look like it measures what it is supposed to measure?
- Do the items seem reasonably related to the perceived purpose of the test? Does a depression inventory ask questions about being sad?
- Not a "real" measure of validity, but one commonly seen in the literature.
- Not considered a legitimate form of validity by the Joint Committee.

Content-Related Evidence
- Does the test adequately sample the content or behavior domain it is designed to measure? If the items are not a good sample, the results of testing will be misleading.
- Usually developed during test construction; not generally empirically evaluated; rests on the judgment of subject matter experts.
- To develop a test with strong content-related evidence of validity, you need good logic, intuitive skill, and perseverance.
- You must also consider wording and reading level.

Other content-related evidence terms:
- Construct underrepresentation: failure to capture important components of a construct (e.g., the test is designed to cover chapters 1-10, but only chapters 1-8 show up on the test).
- Construct-irrelevant variance: scores are influenced by factors irrelevant to the construct. The test is well intentioned, but problems secondary to the test negatively influence the results (e.g., reading level, vocabulary, unmeasured secondary domains).

Criterion-Related Evidence
- Tells us how well a test corresponds with a particular criterion (a behavioral or measurable outcome): the SAT predicting GPA (GPA is the criterion); BDI scores predicting suicidality (suicide is the criterion).
- Used to "predict the future" or "predict the present."

Predictive validity evidence
- Forecasting the future: how well does a test predict future outcomes (e.g., SAT predicting 1st-year GPA)?
- Most tests don't have great predictive validity; estimates decrease due to time and method variance.

Concurrent validity evidence
- Forecasting the present: how well does a test predict current, similar outcomes?
- Job samples and alternative tests are used to demonstrate concurrent validity evidence.
- Estimates are generally higher than predictive validity estimates.

Quantifying Criterion-Related Evidence
Validity coefficient
- The correlation between the test and the criterion.
- Usually between .30 and .60.