Probabilistic Methodology for Genealogical Record Linkage: Determining Weight Robustness
Krista Jensen
John S Lawson
Brigham Young University
Statistics Department

Record Linkage
- What is record linkage?
  - Process that joins two records of information for a particular individual or family
- Applications of record linkage
  - Genealogical research
  - Census records
  - Ecclesiastical records
  - Medical research
  - Data storage
  - Government

Census Data
- Benefits of census data
  - Information completeness
  - Starting point for genealogical research
- Collection methods
  - Training
  - Instruction given to enumerators
- Concerns with census data
  - Correctness of data
  - Age
  - Place of origin

Census Indexes
- What is a census index
  - Head of household
  - Individuals with different last names
  - Subset of questions
- Availability of census records
  - Census record access limited from 1930 to present for privacy
- Fields available in census record indexes
  - Surname, given name, age, gender, race, place of origin, state, county, census page information

Probabilistic Methodology
Overview of Theory
- 3 decisions possible (e_i), where i = 1, 2, 3
- Definitions of events e_i, where i = 1, 2, 3
  - e_1: two fields are a match (positive link)
  - e_2: two fields are of undetermined status
  - e_3: two fields are a non-match (positive non-link)

Probabilistic Methodology
- A weight is calculated for each field based on conditional and unconditional probabilities
- Definitions of probabilities
  - P(e_i | M) can be calculated from a known set of matches
  - P(e_i) can be estimated using sample pairs
  - P(M) is constant for all comparisons
- A score for each comparison is calculated (sum of the weights)
- Threshold values are used to determine the classification of each record comparison

Probabilistic Methodology
Calculating the Weights

  w_k = \ln[P(M \mid e_i)]

Using Bayes' Rule:

  P(M \mid e_i) = \frac{P(e_i \mid M)\, P(M)}{P(e_i)}

Probabilistic Methodology
The Scores

  w_k = \ln[P(M \mid e_i)] = \ln[P(M)] + \ln\frac{P(e_i \mid M)}{P(e_i)}, \qquad W = \sum_k w_k

A weight is calculated for each of the k fields; the score is the sum of those weights.

Probabilistic Methodology
Threshold values
  T = 1.806
  T = 2.504

Project Data
- Census record availability
- Geographical areas sampled
  - California
  - Connecticut
  - Illinois
  - Michigan
  - Louisiana

Project Data
- Counties were sampled from 1910 and 1920
- Counties whose boundaries changed were eliminated from selection
- Records were extracted for the counties of interest

Project Data
State | Record Size | Matches
Connecticut | 18,799 | 2,405
Illinois | 32,211 | 4,984
Louisiana | 18,233 | 596
Michigan | 31,497 | 4,539
Southern California | 32,684 | 2,779
Northern California | 21,436 | 1,943

Algorithm Adaptations
- Place of Origin Index
  - Prussia in 1920 matches Germany in 1910
  - Hungary and Austria match for either year
- Enumeration Locality Index
- Considerations for Age (difference in reported age between the 1910 and 1920 records)
  - A difference of 8-12 years is classified as “same”
  - A difference of 7 or 13 years is classified as “close”
  - A difference greater than 13 years or less than 7 years is classified as “different”

Results
Averaged weights by field
Field | Weight for “Same” | Weight for “Close” | Weight for “Different”
Given Name | 4.18009 | -1.2599 | -4.76084
First 3 letters of Given Name | 3.3928
First letter of Given Name | 0.357
Last 3 letters of Given Name | -0.2251
Age | 2.45507 | -0.1078 | -2.63094
Race | 0.18305 | 0.84357 | -1.58802
Place of Origin | 1.49957 | -0.9575 | -2.66818
Locale of Census | 2.02468 | 1.52134 | -1.35869
County | 0.50254 | -3.16472

Score Calculation
Given name - match, Age - match, Gender - match, Race - match, Origin - match, State - match, County - match and District - match.
Provides a score of
4.18 + 2.45 + 0.18 - 2.67 + 2.02 + 0.50 + 2.02 = 6.498

1920-8780DRECHSKER OTTO C D62248 M W SAXO CT TOLLAND 4-WD ROCKVILLE VERNONT6251981
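
The weight, age-classification, and scoring steps described in the slides can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation. The function and variable names are hypothetical. The weight is taken as ln[P(e|M) / P(e)], assuming the constant ln[P(M)] term is dropped (the positive weights in the results table suggest this, but the slides do not say so explicitly). The roles of the two thresholds (1.806 as the lower cutoff, 2.504 as the upper) and the handling of scores exactly at a cutoff are also assumptions. Only fields with all three weights in the results table are included.

```python
import math

# Averaged weights copied from the Results slide (fields with all three outcomes).
WEIGHTS = {
    "given_name": {"same": 4.18009, "close": -1.2599, "different": -4.76084},
    "age": {"same": 2.45507, "close": -0.1078, "different": -2.63094},
    "race": {"same": 0.18305, "close": 0.84357, "different": -1.58802},
    "place_of_origin": {"same": 1.49957, "close": -0.9575, "different": -2.66818},
    "locale": {"same": 2.02468, "close": 1.52134, "different": -1.35869},
}

# ASSUMPTION: the slides give two threshold values but not their roles.
LOWER_T = 1.806
UPPER_T = 2.504


def field_weight(p_e_given_m: float, p_e: float) -> float:
    """Weight for one outcome: ln[P(e|M) / P(e)].

    ASSUMPTION: the constant ln[P(M)] term from Bayes' rule is dropped.
    """
    return math.log(p_e_given_m / p_e)


def classify_age(age_1910: int, age_1920: int) -> str:
    """Classify the age comparison using the ranges from the slides."""
    diff = age_1920 - age_1910
    if 8 <= diff <= 12:
        return "same"
    if diff in (7, 13):
        return "close"
    return "different"


def score(outcomes: dict) -> float:
    """Sum the weights for the observed outcome of each compared field."""
    return sum(WEIGHTS[field][outcome] for field, outcome in outcomes.items())


def classify(total: float) -> str:
    """Map a score to one of the three decisions (cutoff handling is assumed)."""
    if total >= UPPER_T:
        return "link"
    if total <= LOWER_T:
        return "non-link"
    return "undetermined"


if __name__ == "__main__":
    outcomes = {
        "given_name": "same",
        "age": classify_age(38, 48),
        "place_of_origin": "different",
    }
    total = score(outcomes)
    print(round(total, 5), classify(total))
```

Keeping the weights in a lookup table separate from the scoring function makes it straightforward to swap in weights estimated from a different state or county sample and compare the resulting classifications.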