Multimodal AI is now moving beyond specialized domains. In April 2025, Meta Platforms announced Llama 4, the latest generation of its AI models. Llama 4 is a natively multimodal system that can process and integrate different types of data, such as text and images, within a single model. With advances like this, LLMs are expanding their capabilities to become practical and versatile assistants in everyday life.
Next, we will examine the main aspects of multimodal annotation, how it affects the creation of multimodal AI systems, and its use in different areas.
Definition of multimodal annotation
Multimodal annotation is the process of annotating data from several different modalities to create a consistent and connected representation for training AI models. Modalities include text, images, audio, video, and sensor data. This type of annotation is what makes creating and developing multimodal AI systems possible.
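As a rough illustration of what such annotation can look like in practice, the hypothetical record below links the image, audio, and text annotations of a single driving-scene sample. The structure and field names are illustrative only, not a standard format.

```python
# Hypothetical annotation record linking several modalities of one sample.
# Structure and field names are illustrative, not a standard format.
sample_annotation = {
    "sample_id": "scene_0001",
    "image": {
        "file": "frames/scene_0001.jpg",
        "objects": [  # bounding boxes as [x_min, y_min, x_max, y_max] in pixels
            {"label": "ambulance", "bbox": [412, 230, 598, 360]},
        ],
    },
    "audio": {
        "file": "audio/scene_0001.wav",
        "events": [  # labeled sound events with start/end times in seconds
            {"label": "siren", "start": 2.4, "end": 5.1},
        ],
    },
    "text": {
        "transcript": "Emergency vehicle approaching from the left.",
    },
}
```

Keeping all modalities under one sample ID is what lets a model learn that the siren on the audio track and the ambulance in the frame refer to the same event.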
Multimodal AI systems are models or technologies that simultaneously perceive, process, and analyze information from several modalities.
In a classic computer vision system, images of a car are used to detect its presence or recognize its brand. A multimodal system simultaneously analyzes video from cameras (road footage), text labels (GPS data or road signs), and audio (sirens, horns).
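To make the idea concrete, here is a minimal late-fusion sketch in PyTorch: it assumes each modality has already been encoded into an embedding and combines them with a small classification head. The dimensions, class count, and task are illustrative, not the architecture of any real driving system.

```python
import torch
import torch.nn as nn

class LateFusionClassifier(nn.Module):
    """Toy late-fusion model: project each modality's embedding into a shared
    size, concatenate, and classify (e.g. 'brake' vs. 'continue')."""

    def __init__(self, vision_dim=512, text_dim=256, audio_dim=128, num_classes=2):
        super().__init__()
        self.vision_proj = nn.Linear(vision_dim, 128)
        self.text_proj = nn.Linear(text_dim, 128)
        self.audio_proj = nn.Linear(audio_dim, 128)
        self.head = nn.Sequential(nn.ReLU(), nn.Linear(128 * 3, num_classes))

    def forward(self, vision_emb, text_emb, audio_emb):
        fused = torch.cat(
            [self.vision_proj(vision_emb),
             self.text_proj(text_emb),
             self.audio_proj(audio_emb)],
            dim=-1,
        )
        return self.head(fused)

# Random stand-in embeddings for a batch of 4 samples.
model = LateFusionClassifier()
logits = model(torch.randn(4, 512), torch.randn(4, 256), torch.randn(4, 128))
print(logits.shape)  # torch.Size([4, 2])
```

Production systems typically fuse modalities earlier and more tightly (for example, with cross-attention), but the principle of mapping every signal into a shared representation is the same.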
This approach is similar to how humans perceive information: we naturally receive and process data through different senses. Thanks to the multimodal approach, AI models can imitate this kind of understanding, which reduces errors and contributes to better decision-making. AI systems become more reliable and more personalized in their responses and predictions.

Now that we know what multimodal annotation is and how multimodal systems operate, let’s consider where they are used and how they help in different areas.
Healthcare
In healthcare, multimodal AI improves diagnosis and patient care. It combines medical images (X-rays, MRIs) with text data (treatment records, diagnoses, etc.) so that AI systems can analyze the information comprehensively and suggest accurate diagnoses and treatment plans.
In August 2024, SoundHound acquired Amelia AI, which implemented an AI agent to manage patient records at MUSC Health. This provided 24/7 support for patients, allowing them to self-manage appointments and get answers to non-clinical questions. Also, the system processes initial inquiries without involving human staff, reducing the load on call centers and patient waiting times.
Google Health is developing multimodal AI models that integrate medical images (such as mammograms) with clinical records to improve disease prediction and diagnosis. The Med-Gemini project is a family of multimodal medical models built on the Gemini architecture. They can interpret complex 3D scans, answer clinical questions, and generate radiology reports. One example is report generation for chest X-rays, where Med-Gemini outperforms previous state-of-the-art results by up to 12% on normal and abnormal scans from two separate datasets.
Automotive
Multimodal models analyze visual (cameras), spatial (lidar, radar), and speech (driver commands) data simultaneously to recognize objects, assess the road situation, and make decisions (braking, lane changes). They are also used in advanced driver-assistance systems (ADAS), where the system warns the driver of danger based on the analysis of images, sound, and context (speed, vehicle position).
Waymo has developed EMMA (End-to-End Multimodal Model for Autonomous Driving), which processes raw camera inputs and textual data to generate various driving outputs, including planner trajectories, perception objects, and road graph elements. EMMA is built on Google’s Gemini model and uses its world knowledge to improve the car’s ability to navigate complex scenarios.
Multimodal models are also used in voice assistants that control car functions and in driver monitoring systems that detect fatigue or inattention.
Toyota Connected North America (TCNA) has created a prototype of a digital user manual that uses LLMs to understand user queries. It combines natural language processing with computer vision and processes both the text and the visual content of the manual. For example, if a user asks about a specific feature of the car, the system can find the relevant sections of the manual and provide an answer, supplemented with images for better understanding.
Manufacturing
In manufacturing, multimodal AI supports quality control, process automation, and model training. Image analysis detects visual defects in products (scratches, cracks, dents), while audio recordings capture equipment sounds (knocking, creaking) that can indicate a malfunction. Annotating this data together makes it possible to train multimodal AI models that not only detect a defect but also point to its likely cause.
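A simple way to picture how jointly annotated image and audio data pay off at inference time is decision-level fusion, sketched below; the weights and threshold are made-up values, not tuned parameters from any real production line.

```python
def flag_defect(visual_score: float, audio_score: float,
                threshold: float = 0.5) -> bool:
    """Toy decision-level fusion for quality control: combine a visual defect
    score (e.g. an image model spotting scratches) with an audio anomaly score
    (e.g. a model listening for knocking). Weights are illustrative only."""
    combined = 0.6 * visual_score + 0.4 * audio_score
    return combined >= threshold

print(flag_defect(0.9, 0.1))  # True: strong visual evidence of a defect
print(flag_defect(0.2, 0.4))  # False: neither modality is alarming enough
```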
Festo presented its own AX Industrial Intelligence solution at the SPS exhibition; it integrates multimodal data for quality control and predictive maintenance in manufacturing. The solution has reduced equipment downtime by up to 25% and defective output by 20%.
Education
Multimodal AI systems create an interactive, personalized learning process. They make obtaining and assimilating information easier for people with visual, hearing, or speech impairments: multimodal systems convert text to speech, speech to text, or add subtitles to videos.
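For instance, generating subtitles from a recorded lecture can be sketched with an off-the-shelf speech recognition model. The example below uses the Hugging Face transformers ASR pipeline with a Whisper checkpoint; the model choice and file name are placeholders.

```python
from transformers import pipeline

# Transcribe a lecture recording and print timestamped, subtitle-style chunks.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")
result = asr("lecture.wav", return_timestamps=True)

for i, chunk in enumerate(result["chunks"], start=1):
    start, end = chunk["timestamp"]
    end = end if end is not None else start  # the last chunk may lack an end time
    print(f"{i}\n{start:.2f} --> {end:.2f}\n{chunk['text'].strip()}\n")
```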
The popular Duolingo app uses a multimodal approach. Each lesson combines text, audio, and images: the user sees a word, hears its pronunciation, and memorizes it with the help of a visual representation of the translation. The system recognizes the user’s speech and corrects their pronunciation. By analyzing the learner’s full path and mistakes, it selects a personally effective language learning method.
Training such models and deploying them in the real world requires large volumes of high-quality annotated data, which remains a challenge in the field of artificial intelligence. Companies like Keymakr provide image and video annotation services and also create, collect, and verify data. How well an AI model learns to react and adapt to real-world situations and tasks depends largely on the quality of this work.

Challenges and limitations of multimodal annotation
Multimodal AI has many advantages, but it is still evolving and faces many obstacles. One of the main challenges is the large volume and complexity of data required to train these systems: collecting, storing, and annotating information in different formats is costly and time-consuming. The volume and variety of data also raise ethical issues; data privacy and reducing bias in multimodal AI systems are essential for their responsible implementation.
Maintaining continuous communication between different modalities without losing context or compromising performance also remains challenging. Complex algorithms are required to properly combine various data sources, each with its own noise and inconsistencies, and developing new fusion methods is an ongoing area of research.
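One classical building block for this kind of fusion is precision weighting, where noisier sources get proportionally less influence. Below is a minimal sketch, assuming each source reports its own variance; the sensor values are invented for illustration.

```python
def fuse_estimates(estimates, variances):
    """Inverse-variance (precision-weighted) fusion of noisy estimates of the
    same quantity, e.g. an object's distance measured by camera and radar.
    Sources with higher variance (more noise) receive less weight."""
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    fused = sum(w * e for w, e in zip(weights, estimates)) / total
    return fused, 1.0 / total  # fused value and its (reduced) variance

# A noisy camera estimate (10.2 m) and a precise radar estimate (9.8 m):
# the fused value lands much closer to the radar reading.
print(fuse_estimates([10.2, 9.8], [1.0, 0.1]))  # (~9.84, ~0.09)
```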
Future of multimodal models
Institutions, independent researchers, and technology companies are actively exploring next-generation multimodal models. The McKinsey Technology Trends Outlook 2024 report highlights that multimodal models are becoming a key direction for the development of AI.
Key aspects influencing future developments:
- Deeper integration of modalities. Integrating different modalities allows for the creation of flexible and adaptive systems that can understand and interact with the world around them.
The study Towards Deployment-Centric Multimodal AI Beyond Vision and Language proposes a deployment-oriented approach to developing multimodal AI systems. The authors consider three real-world use cases: responding to pandemics, designing autonomous vehicles, and adapting to climate change. They demonstrate how multimodal systems can be applied across industries, combining data from different sources to improve understanding, prediction, and decision-making.
- Better human-machine interaction. Multimodal interfaces provide intuitive interaction, which is important for user applications and robotics.
Amazon introduced the Vulcan robot, which can handle approximately 75% of the items in its warehouses through sensory perception. It combines visual, tactile, and speech cues, allowing it to navigate its physical environment better. Vulcan places items on different levels of shelves, reducing the physical strain on workers. Amazon emphasizes that robots complement, not replace, human roles. Machines perform repetitive tasks, while workers monitor and ensure the system is safe.
- Improved adaptability and learning. Multimodal systems can better adapt to new tasks and environments, reducing the need for extensive data for training.
The AdaptAgent study shows that multimodal web agents can adapt to new tasks with just one or two demonstrations from a human. This allows agents to perform tasks on new websites and in new domains without needing extensive training data, as in the generic sketch below.
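As a generic illustration of the idea (not a reproduction of AdaptAgent's actual pipeline), the sketch below prepends a single human demonstration to a new task so a multimodal agent can imitate the action pattern on an unseen website. The agent call, demo format, and helper names are hypothetical.

```python
# One recorded human demonstration on a known website (hypothetical format).
demonstration = {
    "screenshot": "demo_flight_site.png",
    "instruction": "Book the cheapest ticket to Berlin for Friday.",
    "actions": ["click(search_box)", "type('Berlin')",
                "click(sort_by_price)", "click(book_first_result)"],
}

# A new task on a website the agent has never seen.
new_task = {
    "screenshot": "unseen_site.png",
    "instruction": "Book the cheapest ticket to Oslo for Monday.",
}

def build_prompt(demo, task):
    """Build a few-shot prompt: the demonstration shows the action pattern,
    the new task asks the agent to repeat it in a different context."""
    return (
        f"Example screenshot: {demo['screenshot']}\n"
        f"Instruction: {demo['instruction']}\n"
        f"Actions: {demo['actions']}\n\n"
        f"New screenshot: {task['screenshot']}\n"
        f"Instruction: {task['instruction']}\n"
        f"Actions:"
    )

prompt = build_prompt(demonstration, new_task)
# actions = call_multimodal_agent(prompt, images=[...])  # hypothetical API call
```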
These research and development efforts lay the foundation for creating more versatile, adaptive, and intelligent AI systems that can effectively interact with the real world in different contexts.