Probabilistic Methodology for Genealogical Record Linkage: Determining Weight Robustness


Krista Jensen, John S Lawson
Brigham Young University, Statistics Department

Record Linkage

What is record linkage?
  - Process that joins two records of information for a particular individual or family

Applications of record linkage
  - Genealogical research
  - Census records
  - Ecclesiastical records
  - Medical research
  - Data storage
  - Government

Census Data

Benefits of census data
  - Information completeness
  - Starting point for genealogical research

Collection methods
  - Training
  - Instruction given to enumerators

Concerns with census data
  - Correctness of data
  - Age
  - Place of origin

Census Indexes

What is a census index?
  - Head of household
  - Individuals with different last names
  - Subset of questions

Availability of census records
  - Census record access limited from 1930 to present for privacy

Fields available in census record indexes
  - Surname, given name, age, gender, race, place of origin, state, county, census page information

Probabilistic Methodology: Overview of Theory

Three decisions are possible, e_i where i = 1, 2, 3:
  - e_1: two fields are a match (positive link)
  - e_2: two fields are of undetermined status
  - e_3: two fields are a non-match (positive non-link)

Probabilistic Methodology

A weight is calculated for each field based on conditional and unconditional probabilities.

Definitions of probabilities
  - P(e_i | M) can be calculated from a known set of matches
  - P(e_i) can be estimated using sample pairs
  - P(M) is constant for all comparisons

A score for each comparison is calculated (sum of the weights).
Threshold values are used to determine the classification of each record comparison.

Probabilistic Methodology: Calculating the Weights

  w_k = \ln[P(M \mid e_i)]

Using Bayes' rule:

  P(M \mid e_i) = \frac{P(e_i \mid M) \, P(M)}{P(e_i)}

Probabilistic Methodology: The Scores

  \ln[P(M \mid e_i)] = \ln[P(M)] + \ln\left[\frac{P(e_i \mid M)}{P(e_i)}\right]

A weight w_k is calculated for each of the k fields; the score W is the sum of those weights:

  W = \sum_k w_k

Probabilistic Methodology

Threshold values shown: T = 1.806 and T = 2.504

Project Data

Census record availability

Geographical areas sampled
  - California
  - Connecticut
  - Illinois
  - Michigan
  - Louisiana

Project Data

  - Counties were sampled from the 1910 and 1920 censuses.
  - Counties whose boundaries changed were eliminated from selection.
  - Records were extracted for the counties of interest.

Project Data

  STATE                 Record Size   Matches
  Connecticut           18,799        2,405
  Illinois              32,211        4,984
  Louisiana             18,233        596
  Michigan              31,497        4,539
  Southern California   32,684        2,779
  Northern California   21,436        1,943

Algorithm Adaptations

Place of Origin Index
  - Prussia in 1920 matches Germany in 1910
  - Hungary and Austria match for either year

Enumeration Locality Index

Considerations for Age
  - Age difference of 8 to 12 years classified as "same"
  - Age difference of 7 or 13 years classified as "close"
  - Age difference greater than 13 years or less than 7 years classified as "different"

Results

Averaged weights by field:

  Field               Same      Close     Different
  Given Name          4.18009   -1.2599   -4.76084
  Age                 2.45507   -0.1078   -2.63094
  Race                0.18305   0.84357   -1.58802
  Place of Origin     1.49957   -0.9575   -2.66818
  Locale of Census    2.02468   1.52134   -1.35869
  County              0.50254             -3.16472

Partial given-name comparisons (one weight each):

  First 3 letters of Given Name   3.3928
  First letter of Given Name      0.357
  Last 3 letters of Given Name    -0.2251

Score Calculation

Given name - match, Age - match, Gender - match, Race - match, Origin - match, State - match, County - match and District - match provides a score of

  4.18 + 2.45 + 0.18 - 2.67 + 2.02 + 0.50 + 2.02 = 6.498

(total as printed on the slide; the seven terms shown add to 8.68, and -2.67 is the "Different" weight for Place of Origin rather than its match weight)

Sample census index record:

  1920-8780DRECHSKER OTTO C D62248 M W SAXO CT TOLLAND 4-WD ROCKVILLE VERNONT6251981
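The weight definition in the Probabilistic Methodology slides can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names `field_weight` and `log_posterior` and the probabilities in the usage note are hypothetical. It assumes that, because P(M) is constant for all comparisons, the tabulated field weight corresponds to the log ratio ln[P(e_i|M) / P(e_i)], which is consistent with the positive weights in the Results table but is not stated outright in the slides.

```python
import math


def field_weight(p_e_given_m: float, p_e: float) -> float:
    """Log ratio ln[P(e_i|M) / P(e_i)] for one field outcome.

    Assumption: this is the part of ln[P(M|e_i)] that varies by field,
    since P(M) is constant for all comparisons.
    """
    if not (0.0 < p_e_given_m <= 1.0 and 0.0 < p_e <= 1.0):
        raise ValueError("probabilities must lie in (0, 1]")
    return math.log(p_e_given_m / p_e)


def log_posterior(p_e_given_m: float, p_e: float, p_m: float) -> float:
    """ln[P(M|e_i)] = ln[P(M)] + ln[P(e_i|M) / P(e_i)], by Bayes' rule."""
    if not 0.0 < p_m <= 1.0:
        raise ValueError("P(M) must lie in (0, 1]")
    return math.log(p_m) + field_weight(p_e_given_m, p_e)
```

For example, with made-up values P(e_i|M) = 0.9 and P(e_i) = 0.1, `field_weight(0.9, 0.1)` is ln 9, a positive weight: the outcome is nine times more common among known matches than among sampled pairs in general. An outcome that is rarer among matches than overall gives a negative weight.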
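The age rule, the weight table, and the threshold step can be combined into a small scoring sketch. The weights are copied from the Results table for the five fields that list all three outcomes. Several things here are assumptions rather than details from the slides: the function and dictionary names, the reading of T = 1.806 as a lower threshold and T = 2.504 as an upper threshold, and the rule that scores between the two are left undetermined.

```python
# Averaged weights from the Results table (fields with all three outcomes).
WEIGHTS = {
    "given_name": {"same": 4.18009, "close": -1.2599, "different": -4.76084},
    "age": {"same": 2.45507, "close": -0.1078, "different": -2.63094},
    "race": {"same": 0.18305, "close": 0.84357, "different": -1.58802},
    "origin": {"same": 1.49957, "close": -0.9575, "different": -2.66818},
    "locale": {"same": 2.02468, "close": 1.52134, "different": -1.35869},
}


def classify_age_gap(age_1910: int, age_1920: int) -> str:
    """Apply the slide's age rule to the 1910-to-1920 difference in years."""
    gap = age_1920 - age_1910
    if 8 <= gap <= 12:
        return "same"
    if gap in (7, 13):
        return "close"
    return "different"


def score(outcomes: dict) -> float:
    """Sum the weights of the observed outcome for each compared field."""
    return sum(WEIGHTS[field][outcome] for field, outcome in outcomes.items())


def classify(total: float, lower: float = 1.806, upper: float = 2.504) -> str:
    """Assumed decision rule: link above upper, non-link below lower."""
    if total > upper:
        return "link"
    if total < lower:
        return "non-link"
    return "undetermined"


if __name__ == "__main__":
    outcomes = {
        "given_name": "same",
        "age": classify_age_gap(30, 41),
        "race": "same",
        "origin": "different",
        "locale": "same",
    }
    total = score(outcomes)
    print(round(total, 5), classify(total))
```

Keeping the weights in a nested dictionary keeps the score a plain sum over fields, so adding a field such as County only requires another table entry.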

