Modalities got latent
We present a novel approach to solving multi-modal problems that we call "latent processing". The approach was heavily inspired by the Multi-head Latent Attention paper.
Description
This approach not only makes the model easier to train, but also speeds up evaluation.
With further improvement, this model could be of real use in edge AI applications such as robotics. Given the proper hardware and a more diverse dataset, this approach could well join the big league of Vision-Language-Action models!
The Visual Question Answering model that we built with this approach is only a demonstration of latent processing's capabilities.
Team IkAI members:
Srijito Ghosh, GitHub: https://www.github.com/Srijito354
Muskan Kumari, GitHub: https://www.github.com/Muskan040399
In this project we built a Visual Question Answering (VQA) web app around a CLIP model written entirely from scratch and trained with the same original-to-latent-space compression technique described above. It was trained on the EasyVQA dataset (GitHub link: https://github.com/vzhou842/easy-VQA.git).
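To make the original-to-latent-space idea more concrete, here is a minimal PyTorch sketch, not our actual model. It assumes the image and question encoders already produce fixed-size feature vectors, down-projects each into a small latent space, and answers from the fused latents. The class name LatentVQA, all dimensions, and the answer count are hypothetical placeholders chosen for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LatentVQA(nn.Module):
    """Illustrative only: compress each modality into a small latent, then fuse."""

    def __init__(self, image_dim=512, text_dim=256, latent_dim=64, num_answers=13):
        super().__init__()
        # Down-projections: original feature space -> compact latent space.
        # All sizes here are assumed placeholders, not the project's real values.
        self.image_down = nn.Linear(image_dim, latent_dim)
        self.text_down = nn.Linear(text_dim, latent_dim)
        # Answer classifier that only ever sees the fused latents.
        self.classifier = nn.Sequential(
            nn.Linear(2 * latent_dim, latent_dim),
            nn.ReLU(),
            nn.Linear(latent_dim, num_answers),
        )

    def encode(self, image_feats, text_feats):
        # L2-normalise both latents so they live on a comparable scale.
        z_img = F.normalize(self.image_down(image_feats), dim=-1)
        z_txt = F.normalize(self.text_down(text_feats), dim=-1)
        return z_img, z_txt

    def forward(self, image_feats, text_feats):
        z_img, z_txt = self.encode(image_feats, text_feats)
        return self.classifier(torch.cat([z_img, z_txt], dim=-1))


model = LatentVQA()
image_feats = torch.randn(4, 512)  # stand-in for image encoder outputs
text_feats = torch.randn(4, 256)   # stand-in for question encoder outputs
logits = model(image_feats, text_feats)
print(logits.shape)  # torch.Size([4, 13])
```

Because every layer after the down-projection works on 64-dimensional vectors instead of the full-width features, the downstream computation shrinks accordingly, which is the intuition behind the faster training and evaluation mentioned above.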
Libraries and frameworks used: PyTorch, Streamlit
Hackathon progress
Discovered that processing in latent spaces allows smaller, faster models to work at a scale similar to larger models.
Tech stack
Fundraising status
NA