Arijit Ray, Jiafei Duan, Reuben Tan, Dina Bashkirova, Rose Hendrix, Kiana Ehsani, Aniruddha Kembhavi, Bryan A. Plummer, Ranjay Krishna*, Kuo-Hao Zeng*, Kate Saenko*, SAT: Spatial Aptitude Training for Multimodal Language Models, [webpage] [arxiv], [data], [code coming soon!]
Tl; dr: Simulated spatial aptitude data (SAT) can improve spatial reasoning in real images for MLMs while maintaining pretraining commonsense. When instruction-tuned on SAT, LLaVA-13B matches some larger MLMs like GPT4-V and Gemini-3-1.0 in spatial reasoning.
Jimuyang Zhang, Zanming Huang, Arijit Ray, Eshed Ohn-Bar, FED: Feedback-Guided Autonomous Driving, CVPR 2024 (Highlight), [paper]
Tl; dr: MLMs can benefit autonomous driving by understanding natural language feedback and refining the next waypoint prediction.
Dina Bashkirova, Arijit Ray, Rupayan Mallick, Sarah Adel Bargal, Jianming Zhang, Ranjay Krishna, Kate Saenko, Lasagna: Layered Score Distillation for Disentangled Object Relighting, [arxiv] [project page, data]
Tl; dr: Synthetically generated examples are effective in teaching physics-aware edits like relighting if we use score distillation to avoid overfitting.
Arijit Ray, Filip Radenovic, Abhimanyu Dubey, Bryan Plummer, Ranjay Krishna, Kate Saenko, Cola: A Benchmark for Compositional Text-to-image Retrieval, NeurIPS 2023, [arxiv] [project page, data]
Tl; dr: Tuning the multimodal layers improves unseen compositional reasoning in CLIP-style vision-language models more than tuning other parts of the model.
Katherine Deng, Arijit Ray, Reuben Tan, Saadia Gabriel, Bryan A. Plummer, Kate Saenko, Socratis: Are Large Multimodal Models Emotionally Aware?, ICCV Workshops 2023 (oral), Workshop on Emotionally and Culturally Intelligent AI, [arxiv] [project page, data]
Tl; dr: A preliminary study showing that MLMs seem to lack diverse perspectives on why different people may feel differently when viewing the same image-text content.
Reuben Tan, Arijit Ray, Andrea Burns, Bryan A. Plummer, Justin Salamon, Oriol Nieto, Bryan Russell, Kate Saenko, Language-Guided Audio-Visual Source Separation via Trimodal Consistency, CVPR 2023, [arxiv] [code]
Tl; dr: We can use language as a bridge to teach models to perform audio-visual sound separation.
Ajay Divakaran, Karan Sikka, Arijit Ray, Xiao Lin, Yi Yao, User-targeted content generation using multimodal embeddings, US Patent App. 17/191,698 [webpage]
Kamran Alipour, Arijit Ray, Xiao Lin, Michael Cogswell, Jurgen Schulze, Yi Yao, Giedrius Burachas, Improving Users' Mental Model with Attention-directed Counterfactual Edits, 2021 Applied AI Letters (Wiley) [pdf]
Arijit Ray, Michael Cogswell, Xiao Lin, Kamran Alipour, Ajay Divakaran, Yi Yao, Giedrius Burachas, Knowing What VQA Does Not: Pointing to Error-Inducing Regions to Improve Explanation Helpfulness, 2021 Applied AI Letters (Wiley), [pdf] [arXiv] [Project Page]
Arijit Ray, Karan Sikka, Ajay Divakaran, Stefan Lee, Giedrius Burachas, Sunny and Dark Outside?! Improving Answer Consistency in VQA through Entailed Question Generation, (EMNLP 2019), also at CVPR-W 2019 VQA and Visual Dialog Workshop, [arXiv], [bibTex] [Data]
Arijit Ray, Yi Yao, Rakesh Kumar, Ajay Divakaran, Giedrius Burachas, Can You Explain That: Lucid Explanations Help Human-AI Collaborative Image Retrieval, (AAAI-HCOMP 2019), [arXiv], [bibTex] [press coverage]
Arijit Ray, Gordon Christie, Mohit Bansal, Dhruv Batra, Devi Parikh, "Question Relevance in VQA: Identifying Non-Visual And False-Premise Questions.", (EMNLP 2016). [pdf] [code] [Video]
Prashant Chandrasekar, Xuan Zhang, Saurabh Chakravarty, Arijit Ray, John Krulick, and Alla Rozovskaya, "The Virginia Tech System at CoNLL-2016 Shared Task on Shallow Discourse Parsing", CoNLL Shared Task (2016).
The Art of Deep Connection - Towards Natural and Pragmatic Conversational Agent Interactions. [Master's Thesis], Virginia Tech E-Library, 2017
Make RBF Networks Fast Again- Exploiting Multi-Threaded Computing to Speed Up RBF Networks, Multiprocessor Programming Class Project, Fall 2016, [draft paper] [code]
Object Prediction using Image Context: Predict next object in an image reasoned on present image context in a sequential manner, Computer Vision Class Project Fall 2015
Online Demo for Predicting Plausibility of Common Sense Assertions: Enter a three-phrase tuple to assess the plausibility score based on a joint language-vision common-sense reasoning, Class Project, Fall 2015
Learning to Listen: Matching Cover songs with Original Productions: Match Original Songs to Cover Songs using an Ensemble of Supervised and Unsupervised Approaches, Machine Learning Class Project, Fall 2015.
Ray, Arijit, Kishan Prudhvi Guddanti, and N. Chellammal. "An Approach to Intelligent Traction Control Using Regression Networks and Anomaly Detection.", Junior (3rd Year) Semester Project, Fall 2013, published in Springer Applied Artificial Intelligence 29.6 (2015): 597-616.
Tutorial code on how to run CARLA without a display on an Ubuntu server and get image frames/sensor data
Some of the amazing people I have been fortunate to work with:
Prof. Dhruv Batra (at Virginia Tech), Prof. Stefan Lee (at Virginia Tech and SRI Intl.), Dr. Dhruv Mahajan (at FAIR), Dr. Filip Radenovic (at FAIR), Dr. Abhimanyu Dubey (at FAIR), Dr. Ajay Divakaran (at SRI Intl), Dr. Yi Yao (at SRI Intl), Dr. Giedrius Burachas (at SRI Intl), Dr. Kezhen Chen (at Google X, Mineral)
The best way to reach me is to drop an email to array at bu dot edu.