SWARTHMORE CS 97 - Drawing Isoglosses Algorithmically - D2033712

School name Swarthmore College

Course Cs 97- Computer Perception


Drawing Isoglosses Algorithmically

Bryce [email protected] La [email protected]

Abstract

In this paper, we apply algorithms for defining regions from sets of points to the problem of drawing isoglosses, the boundaries between dialect regions. We discuss the justifications for our method, and alternative models that could be constructed from this data. We evaluate the resultant model by comparison to the traditional method of drawing isoglosses, by hand.

1 Introduction

In the linguistic subfield of dialectology, an important activity is the drawing of isoglosses, or boundaries between dialect areas. It is often difficult to pin down the meaning of these terms, as within a region people come and go, and bring their speech with them, but broadly speaking, an isogloss is a boundary between where people speak like this and where people speak like that. Typically, an isogloss will not be a sharp line, but will be an area of overlap between speakers of one type and the other.

Unfortunately, isoglosses are typically drawn by hand, as an approximate dividing line. This leads to two problems: one, they should not be thought of or represented as lines, but rather as approximate transition zones, and two, they could be better drawn, we think, by algorithm than by eyeballing. As the noted sociolinguist William Labov writes,

    Every dialect geographer yearns for an automatic method for drawing dialect boundaries which would insulate this procedure from the preconceived notions of the analyst. No satisfactory program has yet been written. (Labov et al., 2005)

We have therefore attempted, as something like a proof-of-concept, to redraw the isoglosses for certain dialect differences in American English. We have taken as our data the results of the telephone survey of speakers across the contiguous USA done for the Atlas of North American English (Labov et al., 2005).
To this data, we have applied a number of algorithms from the field of computational geometry, and hope that the result will better represent the boundaries between dialect regions.

One way we expect that our method will improve on a hand-drawn line is by clearly showing areas that are thoroughly mixed. It can be tempting, when drawing by hand, to mark an area as primarily speaking one way, and to draw your line as though that is the case. It may well be the case, however, that in such situations, the region is much more evenly mixed than it appears to the eye, and should probably not be marked decidedly in either direction. We suspect that an algorithmic approach will see these cases more clearly.

The clearest way to test whether our method improves on a hand-drawn line would be to see if this model has greater predictive power, such that if we were to randomly make telephone calls to people, their proximity to our lines would be a better indicator of the features of their dialect. However, doing so is outside the scope of the project. Other changes to our method that would make it more explicitly a predictive model, such as predicting the value for a point based on the inverse-distance weighted average of the n-nearest neighbors,[1] while interesting, would also be outside the scope of this project. Such an approach would really be a machine learning task, and not a cartographic task.

As such, our evaluation is limited to observing in a qualitative way the differences between our method and the hand-drawn method. We expect that if there is no significant difference, this will at least provide a way of automating the creation of a machine-readable form, assuming data points are available.
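
The inverse-distance weighted nearest-neighbor prediction mentioned above, which the paper sets aside as out of scope, might look roughly like the following. This is only a minimal sketch and not the authors' code: the names `Sample`, `haversine_km`, and `idw_predict`, the 1.0/0.0 encoding of a feature, the use of great-circle distance, and the `power` parameter are all assumptions made for illustration.

```python
import math
from typing import List, Tuple

# One observation: (latitude, longitude, value), where value is 1.0 if the
# speaker makes the distinction and 0.0 if not. This encoding is an assumption.
Sample = Tuple[float, float, float]


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometers between two lat/long points."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def idw_predict(samples: List[Sample], lat: float, lon: float,
                n: int = 5, power: float = 1.0) -> float:
    """Inverse-distance weighted average of the n nearest samples."""
    if not samples or n < 1:
        raise ValueError("need at least one sample and n >= 1")
    by_distance = sorted(
        (haversine_km(lat, lon, s_lat, s_lon), value)
        for s_lat, s_lon, value in samples
    )
    nearest = by_distance[:n]
    # A sample exactly at the query point would get infinite weight, so
    # average the coincident samples directly instead.
    coincident = [value for dist, value in nearest if dist == 0.0]
    if coincident:
        return sum(coincident) / len(coincident)
    weights = [1.0 / dist ** power for dist, _ in nearest]
    return sum(w * v for w, (_, v) in zip(weights, nearest)) / sum(weights)
```

A result near 1.0 would suggest that nearby speakers mostly make the distinction, a result near 0.0 that they mostly do not, and a result near 0.5 that the neighborhood is mixed. How to treat speakers recorded as unclear is left open here.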
If there is a significant difference, then perhaps more investigation into the predictive powers of the two models is warranted.

[1] As suggested by George Dahl.

2 The data

The data from the Atlas of North American English (Labov et al., 2005) are in the form of approximately 600 points, identified geographically by ZIP code, which we then converted to latitude and longitude coordinates. At each point, there was an indication of whether the speaker at that point makes or does not make each of a set of possible linguistic distinctions. The data are generally denser around the Northeast and Great Lakes regions, but this is in part due to greater population density in these areas. Of course, more data points would always be desirable, but these data are still useful.

A dialect area can be considered an area of overlap between polygons from different feature sets, where the area of overlap consists of most of the area of the parent polygons. Thus, it is an area where, at least with respect to the features under consideration, people speak the same way, and that way is distinct from the surrounding area. The scale of this can vary, of course: a city may form a distinct dialect area within a state, if it has sufficiently different speech, but there may also be dialect areas within that city, which are each more like each other than they are like the speech outside the city, but still differ from each other.

The specific features which we mapped were the following: whether the speaker distinguishes ɑ and ɔ, whether the speaker distinguishes ʍ and w, and whether the speaker distinguishes ɪ and ɛ before a nasal, such as n or m.[2] These are a set of easily explicable features, with the first and third being generally considered to be characteristic of large US dialects. For each feature, following the format in the Atlas, a speaker could make the distinction, not make the distinction, or be unclear; that is, the interviewer was unable to tell whether they reliably made or did not make the distinction.
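
One way to hold data of the shape just described (a geographic point plus a three-way judgment per feature) is sketched below. It is an illustrative assumption rather than the format used by the Atlas or by the authors: the names `Judgment`, `SpeakerPoint`, and `tally`, the feature keys, and the coordinates are all invented for the example.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Iterable


class Judgment(Enum):
    """The interviewer's three possible judgments for one feature."""
    DISTINCT = "makes the distinction"
    MERGED = "does not make the distinction"
    UNCLEAR = "unclear"


@dataclass
class SpeakerPoint:
    """One surveyed speaker, located by coordinates derived from a ZIP code."""
    latitude: float
    longitude: float
    features: Dict[str, Judgment]  # hypothetical feature key -> judgment


def tally(points: Iterable[SpeakerPoint], feature: str) -> Counter:
    """Count how many speakers received each judgment for one feature."""
    return Counter(
        p.features[feature] for p in points if feature in p.features
    )


# Made-up records, for illustration only.
points = [
    SpeakerPoint(39.95, -75.17, {"feature_1": Judgment.DISTINCT}),
    SpeakerPoint(40.44, -79.99, {"feature_1": Judgment.MERGED}),
    SpeakerPoint(37.77, -122.42, {"feature_1": Judgment.MERGED,
                                  "feature_3": Judgment.UNCLEAR}),
]
counts = tally(points, "feature_1")
```

Keeping the unclear judgment as its own value, rather than folding it into one of the other two, leaves later steps free to decide how such speakers should be handled.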
In all cases, we used the interviewer's judgment, rather than the speaker's self-reporting.

The scale of our project is across the contiguous USA, so many smaller-scale regional variations don't show up. This is dependent on a number of parameters in the algorithm, and the size and granularity of the dataset. For example, the data set and our techniques might allow us to note the speech of western Pennsylvania and West Virginia as distinct from the area around it, but would not create a distinct region for the dialect of San Francisco and its countryside.

[2] These symbols are explained in Section 7.

3 Methods

Our goal of identifying dialect regions based on individual phoneme information required us to first locate areas where speakers pronounced a phoneme similarly. For most phonemic variables in the Atlas of North American English, the speakers with similar pronunciation are not located all in one
