Modalities got latent

We present a novel approach to solving multi-modal problems that we call "latent processing". The approach was heavily inspired by the Multi-head Latent Attention paper.

Description

This approach not only makes the model easier to train, but also speeds up evaluation.

With further improvement, this model could be very useful in edge AI applications such as robotics. Given suitable hardware and a more diverse dataset, the approach could become a serious contender among Vision-Language-Action models.

The Visual Question Answering model that we built with this approach is only a demonstration of latent processing's capabilities.

Team IkAI members:
Srijito Ghosh (GitHub: https://www.github.com/Srijito354)
Muskan Kumari (GitHub: https://www.github.com/Muskan040399)

In this project we set out to build a Visual Question Answering (VQA) web app around a CLIP model written entirely from scratch and trained using the same compression technique, from the original space to a latent space, mentioned above. It was trained on the EasyVQA dataset (GitHub: https://github.com/vzhou842/easy-VQA.git).
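To make the idea of compressing each modality into a latent space more concrete, here is a minimal PyTorch sketch. It is not the project's actual model: the class name `LatentVQA`, the feature sizes, the element-wise fusion, and the value of `num_answers` are all assumptions chosen for illustration, and random tensors stand in for the outputs of the image and question encoders.

```python
import torch
import torch.nn as nn


class LatentVQA(nn.Module):
    """Hypothetical VQA head: compress each modality into a small latent space, then fuse."""

    def __init__(self, image_dim=512, text_dim=256, latent_dim=64, num_answers=13):
        super().__init__()
        # Down-projections: original feature space -> shared low-dimensional latent space.
        self.image_to_latent = nn.Linear(image_dim, latent_dim)
        self.text_to_latent = nn.Linear(text_dim, latent_dim)
        # All further processing operates on latent_dim-sized vectors only.
        self.classifier = nn.Sequential(
            nn.Linear(latent_dim, latent_dim),
            nn.ReLU(),
            nn.Linear(latent_dim, num_answers),
        )

    def forward(self, image_features, text_features):
        z_image = self.image_to_latent(image_features)
        z_text = self.text_to_latent(text_features)
        fused = z_image * z_text  # element-wise fusion in latent space (an assumption)
        return self.classifier(fused)


torch.manual_seed(0)
model = LatentVQA()
image_features = torch.randn(4, 512)  # stand-in for image-encoder output
text_features = torch.randn(4, 256)   # stand-in for question-encoder output
logits = model(image_features, text_features)
print(logits.shape)  # one score per candidate answer for each of the 4 samples
```

In a sketch like this, `num_answers` would be set to the number of answer classes in the dataset, and the predicted answer is the index of the largest logit.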

Libraries and frameworks used: PyTorch, Streamlit

Hackathon progress

During the hackathon we found that processing in latent space makes the model smaller and faster while working at a scale similar to larger models.
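One way to see why routing computation through a latent space can shrink a model is to count weights. The sketch below compares a single dense linear map with the same map factored through a narrow bottleneck. The sizes are illustrative and not taken from the project, and whether the project factors its layers in exactly this way is an assumption.

```python
def dense_params(d_in: int, d_out: int) -> int:
    """Weights in a single d_in x d_out linear map (bias ignored)."""
    return d_in * d_out


def latent_params(d_in: int, d_out: int, d_latent: int) -> int:
    """Weights when the map is factored through a d_latent bottleneck (bias ignored)."""
    return d_in * d_latent + d_latent * d_out


d_in, d_out, d_latent = 1024, 1024, 64  # illustrative sizes, not the project's
full = dense_params(d_in, d_out)
compressed = latent_params(d_in, d_out, d_latent)
print(full, compressed, full / compressed)  # 1048576 131072 8.0
```

With these example sizes the factored version needs one eighth of the weights, and the number of multiply-accumulate operations per input drops by the same factor.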

Tech stack

Python
PyTorch
Streamlit
AI
GenAI
Computer vision
C++
Web2

Fundraising status

NA

Team leader: Srijito Ghosh
Sector: AI, Other