# Image2Vec: Learning word and image representations for reasoning

Lerrel J. Pinto, Gunnar A. Sigurdsson

## How would you summarize this image in 3 words?

## Joint word and image embedding

- Given an image, get words that approximate it.
- Example: Tower+Clock+Night

## Joint word and image embedding

- Given words, find an unlabeled image.
- Example: Snowy+Young

## Current embeddings

- Use word similarity.
- Learn to map images to a word space.
- Does not apply to sums.
- Cannot recover the sum of words given an image.
- FC7 features: [image] - [image] + [image] = [image]

## Formulation

- Before we start: if the elements of x are zero except at 1, 6, and 15, then Dx combines only columns 1, 6, and 15 of D (their plain sum when x is an indicator vector).
- Intuition: each column of D is a word.
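The "sum of words" intuition can be checked numerically. The following is a minimal sketch, assuming a dictionary D with one unit-norm column per word and an indicator vector x. The sizes, the random D, and the 0-based reading of indices 1, 6, 15 are illustrative assumptions, not values from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
embed_dim, vocab_size = 8, 20

# Random stand-in for a learned dictionary: one column per vocabulary word.
D = rng.standard_normal((embed_dim, vocab_size))
D /= np.linalg.norm(D, axis=0)  # unit-norm columns

# Indicator vector: zero everywhere except at words 1, 6 and 15
# (treated here as 0-based column indices, which is an assumption).
active_words = [1, 6, 15]
x = np.zeros(vocab_size)
x[active_words] = 1.0

# Multiplying by the indicator vector just adds up the selected word columns.
image_embedding = D @ x
sum_of_columns = D[:, active_words].sum(axis=1)
```

Both `image_embedding` and `sum_of_columns` hold the same vector, which is the quantity an image representation is asked to approximate under this formulation.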

## Formulation

- Similar images = similar sums.
- The objective relates the similarity between images to the indicator vectors for their words.
- First solve this per image, then solve for D.
- Example caption: "A black dog laying on his big dog bed"

## Formulation (Simple model)

- This is a non-linear dimensionality reduction.
- Multidimensional scaling costs O(N^3).
- Instead we use a JL transform (which adds noise) of the FC7 features.
- Thus, given R, find D.

## Models (Learn all model)

- Goal: learn a better image representation.
- Learn R too (we could backprop to the deep network).
- We need more constraints to do this.
- Find D and R subject to a constraint that applies if words i and j occur in the same image.
- The columns of D and R have unit norm.
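The JL-transform step of the simple model can be illustrated with a Gaussian random projection. This is a sketch under stated assumptions: the features are random stand-ins for real FC7 activations, 4096 is the usual FC7 width in AlexNet/VGG-style networks, and `target_dim` and the Gaussian construction of R are choices made here, not details given in the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

feature_dim = 4096  # usual FC7 width in AlexNet/VGG-style networks
target_dim = 512    # hypothetical projected dimension
num_images = 50     # hypothetical number of images

# Random stand-in for real FC7 features, one row per image.
features = rng.standard_normal((num_images, feature_dim))

# Gaussian JL matrix, scaled so squared distances are preserved in expectation.
R = rng.standard_normal((target_dim, feature_dim)) / np.sqrt(target_dim)
projected = features @ R.T


def pairwise_distances(points):
    """Euclidean distance between every pair of rows."""
    squared_norms = (points ** 2).sum(axis=1)
    squared = squared_norms[:, None] + squared_norms[None, :] - 2 * points @ points.T
    return np.sqrt(np.maximum(squared, 0.0))


original = pairwise_distances(features)
reduced = pairwise_distances(projected)

# Ratio of projected to original distance for every distinct pair of images.
off_diagonal = ~np.eye(num_images, dtype=bool)
ratios = reduced[off_diagonal] / original[off_diagonal]
```

The ratios cluster around 1 but are not exactly 1, which is the "adds noise" caveat on the slide: the projection is cheap compared with multidimensional scaling, at the cost of small distortions in pairwise distances.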

## Models (Weak tag model)

- Goal: allow for weak tags (Flickr).
- Find D and R subject to three constraints, one of which applies if words i and j occur in the same image.
- The columns of D and R have unit norm.
- Example tags: camping, wilderness

## Training

- COCO dataset: 80k training images, 5 captions each.
- Build a vocabulary from nouns and adjectives.
- Number of training examples: N = 400k.
- The words for an image are the vocabulary words in its caption.
- Models are solved with:
  - Block coordinate descent
  - Stochastic gradient descent
  - Orthogonal Matching Pursuit
- Example caption: "A black dog laying on his big dog bed"

## Summarizing image, successes

- pulling+horses+grassy
- bedroom+wooden+small
- surf+sandy

Given a trained model, find the words that summarize an image (sparse recovery).
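The sparse-recovery step can be sketched with a small Orthogonal Matching Pursuit routine. The slides list OMP among the solvers but do not show how it is applied, so the following is an assumption about how summarization might work: treat the image embedding as a target and greedily pick the few columns of D that best explain it. The dictionary is random, and the vocabulary and sizes are hypothetical.

```python
import numpy as np


def omp(D, y, k):
    """Greedy Orthogonal Matching Pursuit: pick k columns of D that approximate y."""
    residual = y.copy()
    selected = []
    coeffs = np.zeros(0)
    for _ in range(k):
        # Score every word by how strongly it correlates with what is left to explain.
        correlations = np.abs(D.T @ residual)
        if selected:
            correlations[selected] = -np.inf  # never pick the same word twice
        selected.append(int(np.argmax(correlations)))
        # Refit all chosen words jointly by least squares, then update the residual.
        coeffs, *_ = np.linalg.lstsq(D[:, selected], y, rcond=None)
        residual = y - D[:, selected] @ coeffs
    return selected, coeffs


rng = np.random.default_rng(0)
embed_dim, vocab_size = 256, 500  # hypothetical sizes

# Random stand-in for a trained dictionary with unit-norm word columns.
D = rng.standard_normal((embed_dim, vocab_size))
D /= np.linalg.norm(D, axis=0)

# Hypothetical vocabulary; three entries are named after the slide example.
vocab = [f"word{i}" for i in range(vocab_size)]
vocab[3], vocab[17], vocab[42] = "tower", "clock", "night"

# Synthetic "image embedding": the exact sum of three word columns.
true_support = [3, 17, 42]
image_embedding = D[:, true_support].sum(axis=1)

selected, coeffs = omp(D, image_embedding, k=3)
summary = "+".join(vocab[i] for i in selected)
```

On this synthetic input the selected words are the three that were summed, in whatever order the greedy search finds them. Real image embeddings are only approximately sparse sums, so recovery there would be noisier than in this toy setting.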

## Summarizing image (w2v and imnet)

- trombone+accordion+kimono
- poodle
- dog+rabbit
- ice-cream+vase
- black-bear+cat+guenon
- canoe+sea-lion

## Summarizing image (simple model)

- hipster+rabbit
- lap+bride
- hipster+lamb
- cheesecake+polka
- bear+hydrant
- wave+motorbike

## Summarizing image (learn all model)

- leaning+player+video
- toddler+lying+living
- kneeling+outside
- vase
- bear+rocky+kneeling
- wet+rocky

## Summarize image (weak tags model)

- Before: concrete, gbr, london. After: crane, london.
- Before: diving, scuba. After: brain-coral, scuba.
- Before: jet, lynx. After: airliner.
- These represent successes.
- Flickr images.
- Uses ImageNet and Word2vec.

## Why is a visual model different

Enter word (break to quit): 'yellow'

Using word: yellow

- bus: 0.39
- orange: 0.32
- station: 0.30
- road: 0.28
- bananas: 0.27

Word2Vec most similar words, using word: yellow

- red: 0.75
- bright-yellow: 0.69
- orange: 0.64
- blue: 0.64
- purple: 0.63

## Summary and takeaways

- L2 distance works better than inner product.
  - It allows scaling down noisy non-visual words.
- Better optimization, learning rate, and initialization are needed.
  - Current issues: poor local minima, overfitting.
- Learning word and image embeddings jointly improves summarization.
- Example: bear+rocky+kneeling
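The word-similarity query shown under "Why is a visual model different" can be reproduced in outline with a nearest-neighbor lookup over word columns. This is a minimal sketch, assuming cosine similarity as the score, which the slides do not confirm. The five-word vocabulary and the 3-dimensional vectors are made up for illustration and are not the trained embeddings, so the scores will not match the numbers on the slide.

```python
import numpy as np

# Hypothetical toy vocabulary and embeddings, one column per word.
vocab = ["yellow", "bus", "orange", "road", "purple"]
D = np.array([
    [1.0, 0.9, 0.7, 0.2, -0.5],
    [0.0, 0.4, 0.6, 0.9, 0.8],
    [0.0, 0.1, -0.3, 0.4, 0.3],
])
D = D / np.linalg.norm(D, axis=0)  # unit-norm columns, so dot product = cosine


def most_similar(word, top_k=3):
    """Return the top_k other words ranked by cosine similarity to `word`."""
    query = D[:, vocab.index(word)]
    scores = D.T @ query
    ranked = [i for i in np.argsort(-scores) if vocab[i] != word]
    return [(vocab[i], float(scores[i])) for i in ranked[:top_k]]


neighbors = most_similar("yellow")
```

Running the same lookup against a visually trained D and against Word2vec vectors is what produces the two contrasting lists on the slide: objects that tend to be yellow in images versus other color words.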

