10th International Society for Music Information Retrieval Conference (ISMIR 2009)

EASY AS CBA: A SIMPLE PROBABILISTIC MODEL FOR TAGGING MUSIC

Matthew D. Hoffman, Dept. of Computer Science, Princeton University (mdhoffma at cs.princeton.edu)
David M. Blei, Dept. of Computer Science, Princeton University (blei at cs.princeton.edu)
Perry R. Cook, Dept. of Computer Science and Dept. of Music, Princeton University (prc at cs.princeton.edu)

ABSTRACT

Many songs in large music databases are not labeled with semantic tags that could help users sort out the songs they want to listen to from those they do not. If the words that apply to a song can be predicted from audio, those predictions can be used to automatically annotate the song with tags, giving users a sense at a glance of what qualities characterize it. Automatic tag prediction can also drive retrieval by allowing users to search for the songs most strongly characterized by a particular word. We present a probabilistic model that learns to predict from audio the probability that a word applies to a song. Our model is simple to implement, fast to train, predicts tags for new songs quickly, and achieves state-of-the-art performance on annotation and retrieval tasks.

1. INTRODUCTION

It has been said that talking about music is like dancing about architecture, but people nonetheless use words to describe music. In this paper we present a simple system that addresses tag prediction from audio: the problem of predicting what words people would be likely to use to describe a song.

Two direct applications of tag prediction are semantic annotation and retrieval. If we have an estimate of the probability that a tag applies to a song, then we can say which words in our vocabulary of tags best describe a given song (automatically annotating it) and which songs in our database a given word best describes (allowing us to retrieve songs from a text query).

We present the Codeword Bernoulli Average (CBA) model, a probabilistic model that attempts to predict the probability that a tag applies to a song from a vector-quantized (VQ) representation of that song's audio. Our CBA-based approach to tag prediction

• Is easy to implement using a simple EM algorithm.
• Is fast to train.
• Makes predictions efficiently on unseen data.
• Performs as well as or better than state-of-the-art approaches.

2. DATA REPRESENTATION

2.1 The CAL500 data set

We train and test our method on the CAL500 data set [1, 2]. CAL500 is a corpus of 500 tracks of Western popular music, each of which has been manually annotated by at least three human labelers. We used the "hard" annotations provided with CAL500, which give a binary value y_{jw} ∈ {0, 1} for every song j and tag w, indicating whether tag w applies to song j.

CAL500 is distributed with a set of 10,000 39-dimensional Mel-Frequency Cepstral Coefficient Delta (MFCC-Delta) feature vectors for each song. Each MFCC-Delta vector summarizes the timbral evolution of three successive 23 ms windows of a song. CAL500 provides these feature vectors in a random order, so no temporal information beyond a 69 ms timescale is available.
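CAL500 ships these MFCC-Delta vectors precomputed, so nothing needs to be re-extracted to follow the paper. Purely as an illustrative sketch, the snippet below shows one common way to assemble 39-dimensional MFCC-Delta-style frames (13 MFCCs plus first- and second-order deltas over roughly 23 ms windows) using the librosa library; the sample rate, window, hop, and delta settings here are assumptions, not the exact CAL500 extraction parameters.

```python
# Illustrative sketch only: CAL500 already provides MFCC-Delta features.
# This shows one plausible way to build 39-dimensional frames
# (13 MFCCs + first- and second-order deltas) from raw audio with librosa.
import librosa
import numpy as np

def mfcc_delta_features(path, sr=22050, n_mfcc=13):
    # Load audio at a fixed sample rate (assumed; CAL500's rate may differ).
    y, sr = librosa.load(path, sr=sr)
    # ~23 ms analysis windows with a half-window hop (assumed parameters).
    n_fft = int(0.023 * sr)
    hop_length = n_fft // 2
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=n_fft, hop_length=hop_length)
    d1 = librosa.feature.delta(mfcc)           # first-order deltas
    d2 = librosa.feature.delta(mfcc, order=2)  # second-order deltas
    # One 39-dimensional row per frame: 13 MFCCs + 13 deltas + 13 delta-deltas.
    return np.vstack([mfcc, d1, d2]).T
```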
Our goals are to use these features to predict which tags apply to a given song and which songs are characterized by a given tag. The first task yields an automatic annotation system; the second yields a semantic retrieval system.

2.2 A vector-quantized representation

Rather than work directly with the MFCC-Delta feature representation, we first vector quantize all of the feature vectors in the corpus, ignoring for the moment which feature vectors came from which songs. We:

1. Normalize the feature vectors so that they have mean 0 and standard deviation 1 in each dimension.

2. Run the k-means algorithm [3] on a subset of randomly selected feature vectors to find a set of K cluster centroids.

3. For each normalized feature vector f_{ji} in song j, assign that feature vector to the cluster k_{ji} with the smallest squared Euclidean distance to f_{ji}.

This vector quantization procedure allows us to represent each song j as a vector n_j of counts of a discrete set of codewords:

    n_{jk} = \sum_{i=1}^{N_j} 1(k_{ji} = k)    (1)

where n_{jk} is the number of feature vectors assigned to codeword k, N_j is the total number of feature vectors in song j, and 1(a = b) is a function that returns 1 if a = b and 0 if a ≠ b.

This discrete "bag-of-codewords" representation is less rich than the original continuous feature vector representation, but it is effective: such VQ codebook representations have produced state-of-the-art performance in image annotation and retrieval systems [4], as well as in systems for estimating timbral similarity between songs [5, 6].
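The following is a minimal sketch of the three-step vector quantization procedure above and of the count vector in Eq. (1), using NumPy and scikit-learn's KMeans. The codebook size K, the subsample size, and the helper names are illustrative assumptions; the preview does not state the values used in the paper.

```python
# Minimal sketch of the VQ pipeline described above, using scikit-learn.
# K, the subsample size, and all names are illustrative assumptions.
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(all_frames, K=500, n_subsample=100_000, seed=0):
    """all_frames: (N, 39) array pooling MFCC-Delta frames from every song."""
    rng = np.random.default_rng(seed)
    # Step 1: normalize each dimension to zero mean and unit standard deviation.
    mean, std = all_frames.mean(axis=0), all_frames.std(axis=0)
    normed = (all_frames - mean) / std
    # Step 2: run k-means on a random subset of frames to get K centroids.
    idx = rng.choice(len(normed), size=min(n_subsample, len(normed)),
                     replace=False)
    km = KMeans(n_clusters=K, n_init=10, random_state=seed).fit(normed[idx])
    return km, mean, std

def codeword_counts(song_frames, km, mean, std):
    """Step 3 and Eq. (1): assign each frame to its nearest centroid and count."""
    assignments = km.predict((song_frames - mean) / std)
    # n_j: length-K vector with n_jk = number of frames assigned to codeword k.
    return np.bincount(assignments, minlength=km.n_clusters)
```

Stacking these count vectors over all songs gives the bag-of-codewords representation that the tagging model consumes.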


3. THE CODEWORD BERNOULLI AVERAGE MODEL

In order to predict which tags will apply to a song and which songs are characterized by a tag, we developed the Codeword Bernoulli Average (CBA) model. CBA models the conditional probability of a tag w appearing in a song j, conditioned on the empirical distribution n_j of codewords extracted from that song. Once we have estimated CBA's hidden parameters from our training data, we will be able to quickly estimate this conditional probability for new songs.
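The preview ends before CBA's parameterization and EM updates are presented, so the following is only a hedged sketch of the prediction step suggested by the model's name: average per-codeword Bernoulli parameters, weighted by a song's empirical codeword distribution. The parameter matrix beta and the helper predict_tag_probs are hypothetical placeholders, not the paper's actual estimator; see the full paper for the model itself.

```python
# Hedged sketch of a prediction step consistent with the "codeword Bernoulli
# average" idea. `beta` (shape K x W) is a hypothetical stand-in for whatever
# parameters CBA's EM procedure estimates; it is not defined in this preview.
import numpy as np

def predict_tag_probs(n_j, beta):
    """n_j: length-K codeword count vector for one song (Eq. 1).
    beta: hypothetical (K, W) array, beta[k, w] ~ p(tag w | codeword k).
    Returns a length-W vector of estimated tag probabilities for the song."""
    weights = n_j / n_j.sum()   # empirical codeword distribution of the song
    return weights @ beta       # codeword-weighted average of tag parameters

# Annotation: report the highest-scoring tags for a song.
# Retrieval: for a fixed tag w, rank songs by their scores in column w.
```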

3.1 Related work

One class of approaches treats audio tag prediction as a set of binary classification problems to which variants of standard classifiers such as the Support Vector Machine (SVM) [7, 8] or AdaBoost [9] can be applied. Once a set of classifiers has been trained, the classifiers attempt to predict whether or not each tag applies to previously unseen songs. These predictions come with confidence scores that can be used to rank songs by relevance to a given tag (for retrieval), or tags by relevance to a given song (for annotation). Classifiers like SVMs and AdaBoost focus on binary classification accuracy rather than on directly optimizing the continuous confidence scores used for retrieval, which might lead to suboptimal results on those tasks.

Another approach is to fit a generative probabilistic model, such as a Gaussian Mixture Model (GMM), for each tag to the audio feature data for all of the songs manifesting that tag [2]. The posterior likelihood p(tag|audio) of the feature data for a new
