SWARTHMORE CS 97 - A Robust Mid-level Representation for Harmonic Content in Music Signals

A Robust Mid-level Representation for Harmonic Content in Music Signals

Juan P. Bello and Jeremy Pickens
Centre for Digital Music
Queen Mary, University of London
London E1 4NS, UK
[email protected]

Abstract

When considering the problem of audio-to-audio matching, determining musical similarity using low-level features such as Fourier transforms and MFCCs is an extremely difficult task, as there is little semantic information available. Full semantic transcription of audio is an unreliable and imperfect task in the best case, an unsolved problem in the worst. To this end we propose a robust mid-level representation that incorporates both harmonic and rhythmic information, without attempting full transcription. We describe a process for creating this representation automatically, directly from multi-timbral and polyphonic music signals, with an emphasis on popular music. We also offer various evaluations of our techniques. More so than most approaches working from raw audio, we incorporate musical knowledge into our assumptions, our models, and our processes. Our hope is that by utilizing this notion of a musically-motivated mid-level representation we may help bridge the gap between symbolic and audio research.

Keywords: Harmonic description, segmentation, music similarity

1 Introduction

Mid-level representations of music are measures that can be computed directly from audio signals using a combination of signal processing, machine learning and musical knowledge. They seek to emphasize the musical attributes of audio signals (e.g. chords, rhythm, instrumentation), attaining higher levels of semantic complexity than low-level features (e.g. spectral coefficients, MFCCs), but without being bounded by the constraints imposed by the rules of music notation.
Their appeal resides in their ability to provide a musically-meaningful description of audio signals that can be used for music similarity applications, such as retrieval, segmentation, classification and browsing in musical collections.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. © 2005 Queen Mary, University of London.

Previous attempts to model music from complex audio signals concentrate mostly on the attributes of timbre and rhythm (Aucouturier and Pachet, 2002; Yang, 2002). These methods are usually limited by the simplicity of their selected feature sets, which can often be regarded as low-level. Dixon et al. (2004) demonstrated that it is possible to successfully characterize music according to rhythm by adding higher-level descriptors to a low-level feature set. Such descriptors are more readily available for rhythm than for harmony, as the state of the art in beat tracking, meter tracking and tempo estimation has had more success than similar efforts in chord and melody estimation.

Pickens et al. (2002) showed success at identifying harmonic similarities between a polyphonic audio query and symbolic polyphonic scores. The approach relied on automatic transcription, a process which is only partially effective within a highly constrained subset of musical recordings (e.g. mono-timbral, no drums or vocals, small polyphonies). To retrieve effectively despite transcription errors, all symbolic data was converted to harmonic distributions, and similarity was measured by computing the distance between two distributions over the same event space.
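The retrieval step described above, measuring the distance between two harmonic distributions over a shared event space, can be sketched as follows. The 24-triad event space mirrors the representation proposed later in this paper, but the symmetrised KL divergence is an illustrative assumption, as the excerpt does not specify which distance Pickens et al. used:

```python
import math

# Shared event space: 24 triads (12 major, then 12 minor).
PITCHES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
EVENTS = [p + q for q in ("maj", "min") for p in PITCHES]

def normalize(counts):
    """Raw per-triad counts -> probability distribution over EVENTS."""
    total = sum(counts)
    return [c / total for c in counts]

def kl(p, q, eps=1e-9):
    """D(p || q), lightly smoothed so empty bins stay finite."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def harmonic_distance(p, q):
    """Symmetrised KL divergence: near zero for similar harmonic profiles."""
    return 0.5 * (kl(p, q) + kl(q, p))
```

Identical distributions give a distance of zero, and the value grows as two triad profiles diverge, which is what a similarity-based retrieval application needs.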
Such a pipeline is inefficient, taking the unnecessary step of transcription before the construction of an abstract representation of the harmony of the piece.

In this paper we propose a method for semantically describing harmonic content directly from music signals. Our goal is not to perform a formal harmonic analysis, but to produce a robust and consistent harmonic description useful for similarity-based applications. We do this without attempting to estimate the pitch of the notes in the mixture. By avoiding the transcription step, we also avoid its constraints, allowing us to operate on a wide variety of music. The approach combines a chroma-based representation with a hidden Markov model (HMM) initialized with musical knowledge and partially trained on the signal data. The output, which is a function of beats (tactus) rather than of time, represents the sequence of major and minor triads that describes the harmonic character of the input signal.

The remainder of this paper is organized as follows: Section 2 reviews previous work in this area; Section 3 gives details about the construction of the feature vector; Section 4 explains the model used and justifies our initialization and training choices; Section 5 evaluates the representation against a database of annotated pop-music recordings; Section 6 discusses the application of our representation to long-term segmentation; and finally, Section 7 presents our conclusions and directions for future work.

2 Background

We are by no means the first to use either chroma-based representations or HMMs for automatically estimating chords, harmony or structure from audio recordings. Previous systems (Gomez and Herrera, 2004; Pauws, 2004) correlate chromagrams, to be explained in Section 3.1, with cognition-inspired models of key profiles (Krumhansl, 1990) to estimate the overall key of music signals.
Similarly, Harte and Sandler (2005) correlate tuned chromagrams with simple chord templates for the frame-by-frame estimation of chords in complex signals. While differing in their goals, these studies identified the lack of contextual information about chord/key progressions as a weakness of their approaches, as at the level of analysis frames there are a number of factors (e.g. transients, arpeggios, ornamentations) that can negatively affect the local estimation.

In their research on audio thumbnailing, Bartsch and Wakefield (2001) found that the structure of a piece, as seen by calculating a similarity matrix, is more salient when using beat-synchronous analysis of chromas. Longer analysis frames help to overcome the noise introduced by transients and short ornamentations. However, this solution still does not make use of the fact that in a harmonic progression certain transitions are more likely to occur than others.

An alternative way of embedding the idea of harmonic progression into the estimation is by using HMMs. The work by Raphael and Stoddard (2003) is a
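The idea running through this section, decoding a chord sequence from chroma frames with an HMM so that likely progressions inform local estimates, can be illustrated minimally. Everything below is a hypothetical sketch rather than any of the cited systems: binary triad templates over a 24-chord state space, a fixed self-loop transition bias standing in for learned chord progressions, and Viterbi decoding.

```python
import math

STATES = 24  # 12 major + 12 minor triads

def triad_template(state):
    """Binary 12-bin chroma template for one triad: root, third, fifth."""
    root, minor = state % 12, state >= 12
    third = 3 if minor else 4
    template = [0.0] * 12
    for interval in (0, third, 7):
        template[(root + interval) % 12] = 1.0
    return template

TEMPLATES = [triad_template(s) for s in range(STATES)]

def emission_score(chroma, state):
    """Log of the correlation between a chroma frame and a triad template."""
    return math.log(sum(c * t for c, t in zip(chroma, TEMPLATES[state])) + 1e-6)

def viterbi(frames, self_loop=0.8):
    """Most likely triad per frame; self_loop > 1/24 encodes the prior
    that chords tend to be sustained across consecutive beats."""
    stay = math.log(self_loop)
    move = math.log((1.0 - self_loop) / (STATES - 1))
    score = [emission_score(frames[0], s) for s in range(STATES)]
    backptrs = []
    for frame in frames[1:]:
        prev, score, ptrs = score, [], []
        for s in range(STATES):
            best = max(range(STATES),
                       key=lambda r: prev[r] + (stay if r == s else move))
            ptrs.append(best)
            score.append(prev[best] + (stay if best == s else move)
                         + emission_score(frame, s))
        backptrs.append(ptrs)
    path = [max(range(STATES), key=lambda s: score[s])]
    for ptrs in reversed(backptrs):
        path.append(ptrs[path[-1]])
    return list(reversed(path))
```

On a repeated C-major chroma (energy in pitch classes 0, 4 and 7), the decoder returns state 0 (C major) for every frame; the self-loop bias is what lets context smooth over a noisy individual frame, which is exactly what the frame-wise template methods above lack.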