Automatic Image Captioning

Jia-Yu Pan (1), Hyung-Jeong Yang (1), Pinar Duygulu (2), and Christos Faloutsos (1)
(1) Computer Science Department, Carnegie Mellon University, Pittsburgh, U.S.A.
(2) Department of Computer Engineering, Bilkent University, Ankara, Turkey
{jypan, hjyang, christos}@cs.cmu.edu, duygulu@cs.bilkent.edu.tr

Abstract

In this paper, we examine the problem of automatic image captioning. Given a training set of captioned images, we want to discover correlations between image features and keywords, so that we can automatically find good keywords for a new image. We experiment thoroughly with multiple design alternatives on large datasets of various content styles, and our proposed methods achieve up to a 45% relative improvement in captioning accuracy over the state of the art.

1. Introduction and related work

Given a large image database, find images that contain tigers. Given an unseen image, find terms which best describe its content. These are some of the problems that many image/video indexing and retrieval systems deal with (see [4, 5, 10] for recent surveys). Content-based image retrieval systems, which match images based on visual similarity, have limitations due to missing semantic information. Manually annotated words can provide semantic information; however, manual annotation is time-consuming and error-prone. Several automatic image annotation (captioning) methods have been proposed for better indexing and retrieval of large image databases [1, 2, 3, 6, 7].

We are interested in the following problem: given a set of images, where each image is captioned with a set of terms describing the image content, find the association between the image features and the terms. Furthermore, with the association found, caption an unseen image.

Previous works caption an image by captioning its constituent regions, via a mapping from image regions to terms. Mori et al. [10] use co-occurrence statistics of image grids and words to model the association. Duygulu et al. [3] view the mapping as a translation from image regions to words and learn the mapping between region groups and words by an EM algorithm. Recently, probabilistic models such as the cross-media relevance model [6] and latent semantic analysis (LSA) based models [11] have also been proposed for captioning.

In this study, we experiment thoroughly with multiple design alternatives for a better association model: better clustering decisions, weighting of image features and keywords, and dimensionality reduction for noise suppression. The proposed methods achieve a 45% relative improvement in captioning accuracy over the result of [3] on large datasets of various content styles.

The paper is organized as follows. Section 2 describes the data set used in the study. Section 3 describes an adaptive method for obtaining image region groups. The proposed uniqueness weighting scheme and the correlation-based image captioning methods are given in Sections 4 and 5. Section 6 presents the experimental results, and Section 7 concludes the paper.

2. Input representation

We learn the association between image regions and words from manually annotated images; examples are shown in Figure 1.

[Figure 1. Top: annotated images with their captions (e.g., "sea, sun, sky, waves" and "cat, forest, grass, tiger"); bottom: the corresponding blob tokens and word tokens (w1, w2, ..., w11).]

* This material is based upon work supported by the National Science Foundation under Grants No. IRI-9817496, IIS-9988876, IIS-0113089, IIS-0209107, IIS-0205224, INT-0318547, SENSOR-0329549, and EF-0331657, by the Pennsylvania Infrastructure Technology Alliance (PITA) Grant No. 22-901-0001, and by the Defense Advanced Research Projects Agency under Contract No. N66001-00-1-8936. Additional funding was provided by donations from Intel and by a gift from Northrop Grumman Corporation.

An image region is represented by a vector of features describing its color, texture, shape, size, and position. These feature vectors are clustered into B clusters, and each region is assigned the label of the closest cluster center, as in [3]. These labels are called blob tokens.
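As a concrete illustration of this clustering step, here is a minimal sketch in Python, assuming the per-region feature vectors have already been extracted into an array; the function name and the scikit-learn-based implementation are illustrative, not the authors' code:

```python
import numpy as np
from sklearn.cluster import KMeans

def make_blob_tokens(region_features: np.ndarray, n_tokens: int = 500, seed: int = 0):
    """Cluster pooled region feature vectors; each cluster label is a blob token.

    region_features: (n_regions, n_features) array, one row per image region,
    pooled over the whole training collection.
    """
    kmeans = KMeans(n_clusters=n_tokens, random_state=seed, n_init=10)
    labels = kmeans.fit_predict(region_features)  # blob token of each training region
    return kmeans, labels

# Regions of an unseen image are then mapped to the nearest cluster center:
#   model, train_tokens = make_blob_tokens(train_region_features)
#   new_tokens = model.predict(new_region_features)
```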
Formally, let $\mathcal{I} = \{I_1, \dots, I_N\}$ be a set of annotated images, where each image $I_i$ is annotated with a set of terms $W_i = \{w_{i,1}, \dots, w_{i,L_i}\}$ and a set of blob tokens $B_i = \{b_{i,1}, \dots, b_{i,M_i}\}$, where $L_i$ is the number of words and $M_i$ is the number of regions in image $I_i$. The goal is to construct a model that captures the association between the terms and the blob tokens, given the $W_i$'s and $B_i$'s.

3. Blob token generation

The quality of the blob tokens affects the accuracy of image captioning. In [3], the blob tokens are generated by applying the K-means algorithm to the feature vectors of all image regions in the collection, with the number of blob tokens B set at 500. However, the choice of B = 500 is by no means optimal. In this study, we determine the number of blob tokens B adaptively, using the G-means algorithm [12]. G-means clusters the data set starting from a small number of clusters B, and increases B iteratively if some of the current clusters fail a Gaussianity test (e.g., the Kolmogorov-Smirnov test). In our work, the blob tokens are the labels of the clusters adaptively found by G-means. The numbers of blob tokens generated for the 10 training sets are all less than 500, ranging from 339 to 495, and mostly around 400.

4. Weighting by uniqueness

If there are W possible terms and B possible blob tokens, the entire annotated image set of N images can be represented by an $N \times (W+B)$ data matrix D. We now define two such matrices, one unweighted and the other uniqueness-weighted, as initial data representations.

Definition 1 (Unweighted data matrix). Given an annotated image set $\mathcal{I} = \{I_1, \dots, I_N\}$ with a set of terms $\mathcal{W}$ and a set of blob tokens $\mathcal{B}$, the unweighted data matrix $D_0 = [D_0^W \; D_0^B]$ is an $N \times (W+B)$ matrix, where the $(i,j)$-element of the $N \times W$ matrix $D_0^W$ is the count of term $w_j$ in image $I_i$, and the $(i,j)$-element of the $N \times B$ matrix $D_0^B$ is the count of blob token $b_j$ in image $I_i$.

We weight the counts in the data matrix D according to the uniqueness of each term and blob token. If a term appears only once in the image set, say with image $I_1$, then we will use that term for captioning only when we see the blob tokens of $I_1$ again, and those form a small set of blob tokens. The more common a term is, the more blob tokens it is associated with, and the uncertainty of finding the correct term/blob-token association goes up. The idea is therefore to give higher weights to terms and blob tokens which are more unique in the training set, and lower weights to noisy, common terms and blob tokens.

Definition 2 (Uniqueness-weighted data matrix). Given an unweighted data matrix $D_0 = [D_0^W \; D_0^B]$, let $z_j$ ($y_j$) be the number of images which contain the term $w_j$ (the blob token $b_j$).
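To make Section 3's adaptive choice of B concrete, the following is a simplified sketch of a G-means-style loop, using the Kolmogorov-Smirnov test mentioned above: each cluster's points are projected onto their principal axis and tested for Gaussianity, and K-means is re-run with more clusters until every cluster passes. The significance level, the minimum cluster size, and the strategy of simply re-running K-means (true G-means splits failing clusters in place) are assumptions for illustration:

```python
import numpy as np
from scipy import stats
from sklearn.cluster import KMeans

def gmeans(X: np.ndarray, alpha: float = 1e-4, k_init: int = 2,
           k_max: int = 500, seed: int = 0):
    """Adaptively grow the number of clusters, requesting a split for any
    cluster whose points do not look Gaussian along their principal axis."""
    k = k_init
    while True:
        km = KMeans(n_clusters=k, random_state=seed, n_init=10).fit(X)
        k_new = k
        for c in range(k):
            pts = X[km.labels_ == c]
            if len(pts) < 8:                  # too few points to test reliably
                continue
            centered = pts - pts.mean(axis=0)
            # First right singular vector = principal axis of the cluster.
            _, _, vt = np.linalg.svd(centered, full_matrices=False)
            proj = centered @ vt[0]
            proj = (proj - proj.mean()) / (proj.std() + 1e-12)
            # Kolmogorov-Smirnov test against a standard normal.
            _, pval = stats.kstest(proj, 'norm')
            if pval < alpha:                  # fails Gaussianity: ask for a split
                k_new += 1
        if k_new == k or k_new > k_max:
            return km                         # km.labels_ are the blob tokens
        k = k_new
```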
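Definition 1 maps directly onto code. Below is a minimal sketch, assuming each annotated image is given as a pair of index lists (its caption terms and its regions' blob tokens); the input format and the function name are assumptions for illustration:

```python
import numpy as np

def build_data_matrix(images, W: int, B: int) -> np.ndarray:
    """Build the N-by-(W+B) unweighted data matrix D0.

    images: list of (term_ids, blob_ids) pairs, one per image, where
    term_ids are indices in 0..W-1 and blob_ids are indices in 0..B-1.
    """
    N = len(images)
    D0 = np.zeros((N, W + B))
    for i, (term_ids, blob_ids) in enumerate(images):
        for w in term_ids:
            D0[i, w] += 1          # count of term w_j in image I_i
        for b in blob_ids:
            D0[i, W + b] += 1      # count of blob token b_j in image I_i
    return D0

# Example: two images, W = 3 terms, B = 2 blob tokens.
# D0 = build_data_matrix([([0, 2], [1, 1]), ([1], [0])], W=3, B=2)
```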
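Building on Definition 2's counts $z_j$ and $y_j$, a minimal sketch of one uniqueness weighting in the stated spirit (rarer tokens get higher weight, common tokens lower weight) uses an inverse-document-frequency style factor $\log(N / z_j)$; this particular formula is an assumption for illustration, not necessarily the paper's exact definition:

```python
import numpy as np

def uniqueness_weight(D0: np.ndarray) -> np.ndarray:
    """Reweight the columns of D0 so rarer terms/blob tokens count more.

    For each column j, the document frequency (z_j for term columns,
    y_j for blob-token columns) is the number of images containing that
    token. The IDF-style factor log(N / freq) below is an illustrative
    stand-in for the paper's uniqueness weight.
    """
    N = D0.shape[0]
    doc_freq = np.count_nonzero(D0, axis=0)        # z_j / y_j per column
    weights = np.log(N / np.maximum(doc_freq, 1))  # rarer -> larger weight
    return D0 * weights                            # broadcasts across rows

# D1 = uniqueness_weight(D0)   # a uniqueness-weighted data matrix
```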