What do multimodal AI and smaller language models mean for enterprises?
People have learned about the world through multiple inputs, so it makes sense that as communication between humans and machines continues to advance, it will also be multimodal.
In the rapidly changing field of multimodal AI, systems are trained on and process multiple types of data, including video, audio, speech, images and text, to provide insight or make predictions. This approach offers a way to gain the benefits of generative artificial intelligence without the complexity associated with building large language models.
"In looking at the use of this particular technology set, we may be able to solve problems that enterprises have without engaging larger generative AI systems or building LLMs, which are going to be very significant and complex to build and also very expensive to train," said David Linthicum, principal analyst for theCUBE Research. "In some cases, multimodal AI will be just fine for the purposes that you need to use it for as it's embedded in a business application."
The AI Insights and Innovation series from theCUBE, SiliconANGLE Media's livestreaming studio, is the go-to podcast for the latest news, trends and insights in artificial intelligence, including generative AI. In this segment, Linthicum provides an overview of multimodal AI and how it offers businesses a potentially attractive set of options versus the cost and complexity required to train LLMs.
Data compiled by KBV Research showed that the market for multimodal AI was on pace to reach $8.4 billion by 2030, reflecting an average annual growth rate of 32% over a seven-year period. Interest in the field is being driven by the ability to recognize an object using factors such as visual appearance and sound, which allows AI to make more informed decisions, according to Linthicum.
"Multimodal AI and smaller language models are really a trend in the field of AI, and it's about the ability to accept other formats other than text," he explained. "It can take images, it can take video, it can take audio, and then understand and convert them and make sense of it within an AI model. It allows you to better understand what that content is and generate responses to the content, which is handy."
Interest in multimodal AI is leading to its adoption for a number of use cases, including context-based image recognition. For example, a camera-equipped doorbell with image recognition can identify a familiar face or alert the owner that a package has been delivered, according to Linthicum.
"It will tell you what's actually going on versus just showing you the video," he said. "We're able to perhaps even submit images of a problem that we're having, such as a plumbing misconfiguration, and it's able to diagnose a problem just via the image [which] is an example of multimodal AI. The use cases really get into the versatility of this technology."
Multimodal AI offers a less complex process than generative AI for deriving insights from data. It can also provide cost savings, according to Linthicum.
"The common question is ... how does it relate to generative AI?" he asked. "The reality is they are complementary, and you can use them by themselves, and you can use them together. In many cases, enterprises that are using AI technology or want to use AI technology are finding that multimodal AI is just fine for the particular use case that they have without having to employ a generative AI system, without having to employ an LLM. It does not require the same size of processors [and] it does not require normally a [graphics processing unit], which are very expensive."
Another question is how businesses can acquire the tools they need to use multimodal AI. If generative AI is already being employed, the tools for multimodal AI are already available, according to Linthicum.
"The good news is it's the same stuff that we're using to build generative AI systems," he noted. "I understand there's a big complex tool stack that comes along with a ride [that] can be used for creating multimodal AI systems as well. If you understand how to work and build generative AI models, LLMs, you're also able to build small language models and leverage multimodal AI as a specific component of that particular toolset."
Interest in multimodal AI is being driven by a desire among businesses to deploy specific tools that can help with tactical use cases in the enterprise. Organizations are also prioritizing greater use of automation, Linthicum noted.
"The ability to read invoices, the ability to look at [the] productivity of a factory floor, the ability to look at the productivity of people in general, and the ability to look at images in different ways are going to be pragmatic use cases for this technology," he said. "Businesses need to keep that in mind; that's the core message here."