Image Captioning Approaches
Azam Moosavi

Overview

Tasks involving sequences: image and video description

Long-term recurrent convolutional networks for visual recognition and description
From Captions to Visual Concepts and Back

Long-term recurrent convolutional networks for visual recognition and description
Jeff Donahue, et al.
Berkeley

LRCN

LRCN is a class of models that is deep both spatially and temporally, and is flexible enough to be applied to a variety of vision tasks involving sequential inputs and outputs.

Sequential inputs/outputs
Image credit: main paper

Activity recognition
[Figure: each video frame passes through a CNN and then an LSTM, which predicts an activity label per frame (sitting, jumping, jumping, running); the per-frame predictions are averaged into a single label (jumping).]

Activity Recognition Evaluation

Image description
[Figure: a single image passes through a CNN, and LSTMs emit the caption one word per time step ("a dog is jumping"). The slides show the model with one LSTM layer, with two LSTM layers, and as a two-layered factored variant.]

Image Description Evaluation

Video description
[Figure: three variants, each producing "a dog is jumping". (1) CNN features of the frames are averaged and decoded by an LSTM. (2) LSTMs read the video frames over N time steps, then produce the description over M time steps. (3) An LSTM decodes the predictions of a pre-trained detector.]
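The "Average" step in the activity-recognition figure can be illustrated with a small sketch. It assumes each time step has already produced one score per activity class; the label set and the score values below are hypothetical and are not taken from the paper.

```python
import math

def softmax(logits):
    """Turn one frame's raw class scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def average_predictions(frame_logits):
    """Average the per-frame class distributions over all time steps."""
    per_frame = [softmax(row) for row in frame_logits]
    num_frames = len(per_frame)
    num_classes = len(per_frame[0])
    return [sum(p[c] for p in per_frame) / num_frames for c in range(num_classes)]

# Hypothetical labels and per-frame scores (one row per video frame).
LABELS = ["sitting", "jumping", "running"]
frame_logits = [
    [2.0, 0.5, 0.1],  # frame 1 looks like "sitting"
    [0.2, 2.5, 0.3],  # frame 2 looks like "jumping"
    [0.1, 2.2, 0.4],  # frame 3 looks like "jumping"
    [0.3, 0.6, 2.0],  # frame 4 looks like "running"
]

avg = average_predictions(frame_logits)
predicted = LABELS[avg.index(max(avg))]
print(predicted)
```

Averaging probabilities rather than taking a majority vote lets confident frames count for more than uncertain ones, which is one reasonable reading of the figure.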

Figure credit: main paper

Video Description Evaluation

TACoS multilevel dataset, which has 44,762 video/sentence pairs (about 40,000 for training/validation).

From Captions to Visual Concepts and Back

Saurabh Gupta
UC Berkeley
Work done at Microsoft Research

Hao Cheng, Li Deng, Jacob Devlin, Piotr Dollár, Hao Fang, Jianfeng Gao, Xiaodong He, Forrest Iandola, Margaret Mitchell, John C. Platt, Rupesh Srivastava, C. Lawrence Zitnick, Geoffrey Zweig

[Figure: example image with the detected words overlaid: crowd, woman, holding, camera, cat, purple.]

1. Word Detection: woman, crowd, cat, camera, holding, purple
2. Sentence Generation: "A purple camera with a woman." "A woman holding a camera in a crowd." ... "A woman holding a cat."
3. Sentence Re-Ranking: #1 "A woman holding a camera in a crowd."

Slide credit: Saurabh Gupta

1. Word Detection

Caption generation: visual detectors
Detect a set of words that are likely to be part of the image caption.

Image → CNN (FC6, FC7, FC8 as fully convolutional layers) → spatial class probability maps → Multiple Instance Learning (MIL) → per-class probability

Spatial class probability for word $w$ at region $j$ of image $i$:

$$p_{ij}^{w} = \frac{1}{1 + \exp\left(-\left(W_w^{\top}\,\phi(\mathrm{FC7}) + b_w\right)\right)}$$

Slide credit: Saurabh Gupta

Multiple Instance Learning (MIL)

In MIL, instead of giving the learner labels for individual examples, the trainer labels only collections of examples, which are called bags.

Where multi-instance learning sits between unsupervised and supervised learning:

Unsupervised (K-means, PCA, mixture models, ...)
  1. Recover latent structure, but not a classifier
  2. Hope that the structure is useful
  3. No labels required

Multi-instance learning (alongside transductive inference, co-training, ...)

Supervised (neural nets, perceptrons, SVM, ...)
  1. Train a classifier
  2. Success = low test error
  3. Requires labels

A bag is labeled positive if there is at least one positive example in it, and negative if all the examples in it are negative.

[Figure: negative bags ($B_i^-$) and positive bags ($B_i^+$).]

We want to predict a target class for an image based on its visual content. For instance, the target class might be "beach", where the image contains both "sand" and "water". In MIL terms, the image is described as a bag $X = \{x_1, x_2, \ldots, x_N\}$, where each $x_i$ is the feature vector (called an instance) extracted from the $i$-th region of the image and $N$ is the total number of regions (instances) partitioning the image. The bag is labeled positive ("beach") if it contains both "sand" region instances and "water" region instances.

Image credit: Jaume Amores, "Multiple instance classification: Review, taxonomy and comparative study" (2013)

Noisy-OR for Estimating the Density

It is assumed that the event can only happen if at least one of the causes occurred.

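It is also assumed that the probability of any cause failing to trigger the event is independent of any other cause.

Caption generation: visual detectors
Detect a set of words that are likely to be part of the image caption.

Image → CNN (FC6, FC7, FC8 as fully convolutional layers) → spatial class probability maps → Multiple Instance Learning (MIL) → per-class probability

Spatial class probability for word $w$ at region $j$ of image $i$:

$$p_{ij}^{w} = \frac{1}{1 + \exp\left(-\left(W_w^{\top}\,\phi(\mathrm{FC7}) + b_w\right)\right)}$$

Per-class probability for word $w$ in image $i$ (noisy-OR over the regions $j$ in the bag $b_i$):

$$p_{i}^{w} = 1 - \prod_{j \in b_i} \left(1 - p_{ij}^{w}\right)$$

Slide credit: Saurabh Gupta

Visual detectors
Slide credit: Saurabh Gupta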
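As a rough illustration of the word detector, the sketch below scores each image region with a sigmoid and then combines the region probabilities with noisy-OR, so the image-level probability is one minus the probability that no region triggers the word. The weights, bias, and two-dimensional region features are hypothetical stand-ins for the learned parameters and FC7 features.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def region_probability(weights, features, bias):
    """Probability that one region shows the word: sigmoid(w . x + b)."""
    score = sum(w * f for w, f in zip(weights, features)) + bias
    return sigmoid(score)

def noisy_or(region_probs):
    """Bag-level probability: 1 - prod_j (1 - p_j)."""
    prob_no_region_fires = 1.0
    for p in region_probs:
        prob_no_region_fires *= 1.0 - p
    return 1.0 - prob_no_region_fires

# Hypothetical detector parameters for a single word, and three regions.
weights = [1.0, -0.5]
bias = -1.0
regions = [[0.0, 0.0], [3.0, 0.0], [1.0, 2.0]]

region_probs = [region_probability(weights, r, bias) for r in regions]
image_prob = noisy_or(region_probs)
print(region_probs, image_prob)
```

Because of the product, a single confident region is enough to push the image-level probability close to 1, which matches the MIL rule that a bag is positive if at least one instance is positive.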

[Figure: caption pipeline repeated. 1. Word Detection: woman, crowd, cat, camera, holding, purple. 2. Sentence Generation: "A purple camera with a woman." "A woman holding a camera in a crowd." ... "A woman holding a cat." 3. Sentence Re-Ranking: #1 "A woman holding a camera in a crowd."]

Slide credit: Saurabh Gupta

Probabilistic language modeling

Goal: compute the probability of a sentence or sequence of words:

$$P(W) = P(w_1, w_2, w_3, w_4, w_5, \ldots, w_n)$$

Related task: probability of an upcoming word:

$$P(w_5 \mid w_1, w_2, w_3, w_4)$$

A model that computes either of these, $P(W)$ or $P(w_n \mid w_1, w_2, \ldots, w_{n-1})$, is called a language model.

Probabilistic language modeling
[Figure: the caption "A woman holding a camera in a crowd." is generated word by word, starting from "A woman holding", shown next to the detected words (holding, camera, cat, purple, crowd).]

Slide credit: Saurabh Gupta
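To make the two quantities concrete, here is a minimal sketch of a language model that computes P(W) as a product of next-word probabilities. It makes a simplifying assumption that is not in the slides: each word depends only on the previous word (a bigram model), with add-one smoothing. The training captions are the example sentences from the slides.

```python
import math
from collections import Counter

def train_bigram(captions):
    """Count contexts and (previous word, word) pairs; collect the vocabulary."""
    context_counts = Counter()
    bigram_counts = Counter()
    vocab = set()
    for caption in captions:
        tokens = ["<s>"] + caption.lower().split() + ["</s>"]
        vocab.update(tokens[1:])            # every token that can be predicted
        context_counts.update(tokens[:-1])  # every token that has a successor
        bigram_counts.update(zip(tokens[:-1], tokens[1:]))
    return context_counts, bigram_counts, vocab

def word_prob(word, prev, model, alpha=1.0):
    """P(word | prev) with add-alpha smoothing."""
    context_counts, bigram_counts, vocab = model
    return (bigram_counts[(prev, word)] + alpha) / (
        context_counts[prev] + alpha * len(vocab)
    )

def sentence_log_prob(sentence, model, alpha=1.0):
    """log P(W) = sum of log P(w_k | w_{k-1}) under the bigram assumption."""
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    return sum(
        math.log(word_prob(w, p, model, alpha))
        for p, w in zip(tokens[:-1], tokens[1:])
    )

CAPTIONS = [
    "a purple camera with a woman",
    "a woman holding a camera in a crowd",
    "a woman holding a cat",
]
model = train_bigram(CAPTIONS)
fluent = sentence_log_prob("a woman holding a camera in a crowd", model)
scrambled = sentence_log_prob("crowd a in camera a holding woman", model)
print(fluent, scrambled)
```

The fluent caption scores higher than its scrambled version because every one of its word pairs was seen in training, while the scrambled version relies almost entirely on smoothed, unseen pairs.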

Maximum entropy language model

The maximum entropy (ME) LM estimates the probability of a word conditioned on the preceding words and on the set of dictionary words that have yet to be mentioned in the sentence.

To train the ME LM, the objective function is the log-likelihood of the captions conditioned on the corresponding set of detected objects.

Language model
Slide credit: Saurabh Gupta
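A maximum entropy model is log-linear: each candidate word gets a weighted sum of feature values, and a softmax turns the sums into probabilities. The sketch below shows only that general shape. The two features (whether the word is still unmentioned, and whether it follows the previous word in a small table of pairs), the weights, and the vocabulary are all hypothetical; the slides do not list the features the authors actually used.

```python
import math

def features(word, prev_word, remaining, seen_bigrams):
    """Hypothetical feature vector for one candidate next word."""
    return [
        1.0 if word in remaining else 0.0,                  # detected, not yet mentioned
        1.0 if (prev_word, word) in seen_bigrams else 0.0,  # plausible continuation
    ]

def next_word_distribution(prev_word, remaining, vocab, weights, seen_bigrams):
    """P(word | prev_word, remaining) as a softmax over weighted feature sums."""
    scores = {
        w: sum(lam * f for lam, f in
               zip(weights, features(w, prev_word, remaining, seen_bigrams)))
        for w in vocab
    }
    m = max(scores.values())  # subtract the max for numerical stability
    exps = {w: math.exp(s - m) for w, s in scores.items()}
    z = sum(exps.values())
    return {w: e / z for w, e in exps.items()}

# Hypothetical vocabulary, weights, and word-pair table.
VOCAB = ["a", "woman", "holding", "camera", "cat", "crowd", "purple", "in", "</s>"]
WEIGHTS = [1.0, 2.0]
SEEN_BIGRAMS = {("a", "woman"), ("woman", "holding"), ("holding", "a")}

remaining = {"woman", "crowd", "cat", "camera", "holding", "purple"}
dist = next_word_distribution("a", remaining, VOCAB, WEIGHTS, SEEN_BIGRAMS)
best = max(dist, key=dist.get)

# Once "woman" has been emitted it leaves the set of unmentioned words.
remaining_after = remaining - {best}
dist_after = next_word_distribution(best, remaining_after, VOCAB, WEIGHTS, SEEN_BIGRAMS)
best_after = max(dist_after, key=dist_after.get)
print(best, best_after)
```

Removing a word from the remaining set once it has been emitted lowers its probability at later steps, which is one way to read "conditioned on the words yet to be mentioned".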

[Figure: caption pipeline repeated. 1. Word Detection: woman, crowd, cat, camera, holding, purple. 2. Sentence Generation: "A purple camera with a woman." "A woman holding a camera in a crowd." ... "A woman holding a cat." 3. Sentence Re-Ranking: #1 "A woman holding a camera in a crowd."]

Slide credit: Saurabh Gupta

Re-rank hypotheses globally

Deep Multimodal Similarity Model (DMSM)

Text vector: $y_D$
Image vector: $y_Q$

We measure the similarity between images and text by the cosine similarity between their corresponding vectors. This cosine similarity score is used by MERT to re-rank the sentences. We can compute the posterior probability of the text being relevant to the image via: [equation not preserved in the extracted text]

[Figure: the caption "A woman holding a camera in a crowd." is passed through the DMSM and re-ranked.]
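The similarity computation can be sketched as follows. The cosine score is the one described on the slide. The posterior, however, is an assumption: the slide's equation did not survive extraction, so this sketch uses a softmax over the candidate captions with a hypothetical smoothing factor `gamma`. The three-dimensional vectors are made-up stand-ins for learned embeddings.

```python
import math

def cosine_similarity(u, v):
    """cos(u, v) = u . v / (|u| |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def caption_posteriors(image_vec, caption_vecs, gamma=10.0):
    """Return cosine scores and an assumed softmax posterior over candidates."""
    sims = [cosine_similarity(image_vec, c) for c in caption_vecs]
    m = max(sims)  # subtract the max for numerical stability
    exps = [math.exp(gamma * (s - m)) for s in sims]
    z = sum(exps)
    return sims, [e / z for e in exps]

# Hypothetical image vector y_Q and caption vectors y_D.
y_q = [0.9, 0.1, 0.4]
candidates = {
    "A purple camera with a woman.": [0.2, 0.9, 0.1],
    "A woman holding a camera in a crowd.": [0.8, 0.2, 0.5],
    "A woman holding a cat.": [0.5, 0.5, 0.0],
}

sims, posteriors = caption_posteriors(y_q, list(candidates.values()))
ranked = sorted(zip(candidates, sims), key=lambda pair: pair[1], reverse=True)
top_caption = ranked[0][0]
print(top_caption)
```

In the slides the cosine score is one input to MERT rather than the sole ranking criterion; ranking by it alone here is only to keep the example short.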

The embedding is trained to maximize the similarity between an image and its corresponding caption.

Slide credit: Saurabh Gupta

Results

Caption generation performance for seven variants of our system on the Microsoft COCO dataset

Slide credit: Saurabh Gupta

Thank You
