Seeing What You're Told: Sentence-Guided Activity Recognition In Video

Cited by: 12
Authors
Siddharth, N. [1]
Barbu, Andrei [2]
Siskind, Jeffrey Mark [3]
Affiliations
[1] Stanford Univ, Stanford, CA 94305 USA
[2] MIT, Cambridge, MA 02139 USA
[3] Purdue Univ, W Lafayette, IN 47907 USA
DOI
10.1109/CVPR.2014.99
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
We present a system that demonstrates how the compositional structure of events, in concert with the compositional structure of language, can interplay with the underlying focusing mechanisms in video action recognition, providing a medium for top-down and bottom-up integration as well as multi-modal integration between vision and language. We show how the roles played by participants (nouns), their characteristics (adjectives), the actions performed (verbs), the manner of such actions (adverbs), and changing spatial relations between participants (prepositions), in the form of whole-sentence descriptions mediated by a grammar, guide the activity-recognition process. Further, the utility and expressiveness of our framework are demonstrated by performing three separate tasks in the domain of multi-activity video: sentence-guided focus of attention, generation of sentential description, and query-based search, simply by leveraging the framework in different manners.
Pages: 732-739
Page count: 8