TinyVQA, a compact multimodal visual question answering model



Multimodal machine learning models have been surging in popularity, marking a significant evolution in artificial intelligence (AI) research and development. These models, capable of processing and integrating data from multiple modalities such as text, images, and audio, are of great importance due to their ability to tackle complex real-world problems that traditional unimodal models struggle with. The fusion of diverse data types enables these models to extract richer insights, enhance decision-making processes, and ultimately drive innovation.

Among the burgeoning applications of multimodal machine learning, Visual Question Answering (VQA) models have emerged as particularly noteworthy. VQA models possess the capability to comprehend both images and accompanying textual queries, providing answers or relevant information based on the content of the visual input. This capability opens up avenues for interactive systems, enabling users to engage with AI in a more intuitive and natural manner.


However, despite their immense potential, the deployment of VQA models, especially in critical scenarios such as disaster recovery efforts, presents unique challenges. In situations where internet connectivity is unreliable or unavailable, deploying these models on tiny hardware platforms becomes essential. Yet the deep neural networks that power VQA models demand substantial computational resources, rendering traditional edge computing hardware solutions impractical.

Inspired by optimizations that have enabled powerful unimodal models to run on tinyML hardware, a team led by researchers at the University of Maryland has developed a novel multimodal model called TinyVQA that allows extremely resource-limited hardware to run VQA models. Using some clever techniques, the researchers were able to compress the model to the point that it could run inferences in a few tens of milliseconds on a common low-power processor found onboard a drone. In spite of this substantial compression, the model was able to maintain acceptable levels of accuracy.

To achieve this goal, the team first created a deep learning VQA model similar to previously described state-of-the-art algorithms. This model was far too large for tinyML applications, but it contained a wealth of knowledge. Accordingly, it was used as a teacher for a smaller student model. This practice, called knowledge distillation, captures many of the important associations found in the teacher model and encodes them in a more compact form in the student.
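The article does not spell out the distillation recipe, but the standard approach trains the student against a blend of the teacher's temperature-softened output distribution and the true labels. The sketch below illustrates that loss; the temperature, weighting, and logits are illustrative assumptions, not values from the TinyVQA paper:

```python
import numpy as np

def softmax(logits, temperature=1.0):
    # Temperature-scaled softmax; a higher temperature softens the distribution,
    # exposing the teacher's "dark knowledge" about near-miss classes
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, true_label,
                      temperature=4.0, alpha=0.7):
    """Weighted sum of a soft-target KL term and a hard-label cross-entropy term."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # KL(teacher || student), scaled by T^2 so its gradient magnitude
    # stays comparable across temperatures
    kl = np.sum(p_teacher * (np.log(p_teacher) - np.log(p_student)))
    soft_loss = (temperature ** 2) * kl
    hard_loss = -np.log(softmax(student_logits)[true_label])
    return alpha * soft_loss + (1 - alpha) * hard_loss
```

Minimizing this loss over the training set pulls the student's predictions toward the teacher's full output distribution, not just the argmax label, which is what lets a much smaller network inherit most of the teacher's behavior.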

image_6y7RQc59Eb.png?auto=compress,forma

In addition to having fewer layers and fewer parameters, the student model also made use of 8-bit quantization. This reduces both the memory footprint and the amount of computational resources that are required when running inferences. Another optimization involved swapping regular convolution layers out in favor of depthwise separable convolution layers — this further reduced model size while having a minimal impact on accuracy.
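A back-of-the-envelope sketch shows why these two optimizations shrink the model so much: symmetric int8 quantization cuts each weight from 32 bits to 8, and a depthwise separable convolution replaces the k x k x Cin x Cout parameters of a standard convolution with k x k x Cin depthwise plus Cin x Cout pointwise parameters. The layer sizes below are made up for illustration and are not taken from the TinyVQA architecture:

```python
import numpy as np

def conv_params(k, c_in, c_out):
    # Standard convolution: a k x k filter for every (input, output) channel pair
    return k * k * c_in * c_out

def separable_conv_params(k, c_in, c_out):
    # Depthwise step: one k x k filter per input channel;
    # pointwise step: a 1x1 convolution that mixes the channels back together
    return k * k * c_in + c_in * c_out

def quantize_int8(weights):
    # Symmetric per-tensor quantization: floats -> int8 plus one float scale
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

# Hypothetical 3x3 layer with 64 input and 128 output channels
standard = conv_params(3, 64, 128)             # 73,728 parameters
separable = separable_conv_params(3, 64, 128)  # 8,768 parameters
print(f"depthwise separable uses {standard / separable:.1f}x fewer parameters")
```

Stacking both tricks multiplies the savings: the separable layer holds roughly 8x fewer weights, and each surviving weight occupies a quarter of the memory.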

Having designed and trained TinyVQA, the researchers evaluated it on the FloodNet-VQA dataset, which contains thousands of images of flooded areas captured by a drone after a major storm. Questions were asked about the images to determine how well the model understood the scenes. The teacher model, which weighs in at 479 megabytes, was found to have an accuracy of 81 percent. The much smaller TinyVQA model, only 339 kilobytes in size, achieved a very impressive 79.5 percent accuracy. Despite being over 1,000 times smaller, TinyVQA gave up just 1.5 percentage points of accuracy, which is not a bad trade-off at all!
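The reported size ratio is easy to check from the figures given in the article:

```python
# Model sizes reported in the article: teacher 479 MB, TinyVQA 339 KB
teacher_kb = 479 * 1024   # megabytes -> kilobytes
student_kb = 339
ratio = teacher_kb / student_kb
print(f"TinyVQA is about {ratio:.0f}x smaller than the teacher")

accuracy_drop = 81.0 - 79.5   # percentage points given up by the student
```

The ratio comes out to well over 1,000, matching the article's claim, for only a 1.5-point drop in accuracy.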

In a practical trial of the system, the model was deployed on the GAP8 microprocessor onboard a Crazyflie 2.0 drone. With inference times averaging 56 milliseconds on this platform, it was demonstrated that TinyVQA could realistically be used to assist first responders in emergency situations. And of course, many other opportunities to build autonomous, intelligent systems could also be enabled by this technology.

Source: hackster.io
