According to the AI Index 2025 report, the computational efficiency of large language models (LLMs) has increased hundreds of times in just two years. Microsoft’s Phi-3-mini model, with only 3.8 billion parameters, surpassed 60% accuracy on the MMLU benchmark in 2024, a threshold that in 2022 was reached only by models such as PaLM with 540 billion parameters. Meanwhile, the cost of running queries at the GPT-3.5 level has dropped from $20 to $0.07 per million tokens.
This article analyzes why LLMs do not work “out of the box”, which approaches turn them into capable and helpful assistants, what innovations are shaping the new wave of LLM technologies, and where the industry is heading in the coming years.
Why LLMs need additional training
Despite rapid progress in LLM development, these models are still not full-fledged agents capable of logical reasoning, coherent decision-making, and ethical behavior. After base pre-training, they are good at mimicking human writing style, but that alone does not guarantee deep analysis, consistent argumentation, or the ability to maintain context and logic. To get there, LLMs need additional, targeted training.
Main approaches to training LLM reasoning
- Instruction tuning teaches the model to follow human instructions precisely and respond in a user-friendly format.
- Reinforcement learning from human feedback (RLHF) uses human preference judgments as a reward signal, steering the model toward responses people rate as helpful, accurate, and safe.
- Chain-of-thought prompting lets the model break a complex task into logical steps, mimicking human reasoning processes (see the prompt sketch after this list).
- External memory integrates tools or systems that store long-term context and let the model refer back to it as needed (a toy retrieval sketch also follows below).
- Planners help the model structure actions, formulate intermediate goals, and progress step-by-step toward a result.
- Agent architectures train the model to interact with tools, environments, and its history, transforming it from a passive responder into an active cognitive agent.
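As a rough illustration of chain-of-thought prompting, here is a minimal Python sketch that wraps a question in a step-by-step instruction. The template wording and the `call_llm` placeholder are illustrative assumptions, not any particular provider’s API.

```python
# Minimal chain-of-thought prompting sketch (illustrative only).
# `call_llm` is a placeholder for whatever client your provider exposes.

COT_TEMPLATE = (
    "Answer the question below. Think through the problem step by step, "
    "numbering each step, and only then state the final answer on a line "
    "starting with 'Answer:'.\n\nQuestion: {question}"
)

def build_cot_prompt(question: str) -> str:
    """Wrap a raw question in a chain-of-thought instruction."""
    return COT_TEMPLATE.format(question=question)

def call_llm(prompt: str) -> str:
    """Placeholder for a real model call (hosted API or local model)."""
    raise NotImplementedError("Plug in your provider's client here.")

if __name__ == "__main__":
    prompt = build_cot_prompt(
        "A warehouse ships 120 boxes per hour. How many boxes does it ship "
        "in a 7.5-hour shift?"
    )
    print(prompt)               # Inspect the prompt that would be sent to the model
    # answer = call_llm(prompt) # Uncomment once a real client is wired in
```

The same pattern extends to few-shot chain-of-thought, where worked reasoning examples are prepended before the new question.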
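The external-memory idea can be sketched just as briefly: keep notes from earlier interactions and retrieve the most relevant ones to prepend to the next prompt. Production systems typically use vector embeddings and a dedicated vector store; the keyword-overlap scoring below is a deliberately simplified stand-in.

```python
import re

def tokenize(text: str) -> set[str]:
    """Lowercase and split into word tokens, ignoring punctuation."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

class SimpleMemory:
    """Toy long-term memory: keyword-overlap retrieval stands in for the
    embedding search a production system would use."""

    def __init__(self) -> None:
        self.notes: list[str] = []

    def add(self, note: str) -> None:
        self.notes.append(note)

    def retrieve(self, query: str, k: int = 2) -> list[str]:
        q = tokenize(query)
        scored = [(len(q & tokenize(n)), n) for n in self.notes]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [note for score, note in scored[:k] if score > 0]

memory = SimpleMemory()
memory.add("User prefers answers in metric units.")
memory.add("User is building a delivery-routing app.")
memory.add("User's deadline for the MVP is the end of Q3.")

query = "Which units should I use in the routing app?"
context = memory.retrieve(query)
print("Relevant context:\n" + "\n".join(context) + f"\n\nUser: {query}")
```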
Innovations in LLM Training Technologies
The current wave of LLM development is accompanied by growing computational power and the introduction of cutting-edge training methods that dramatically expand model capabilities.
Self-learning
This approach allows models to independently adjust their responses based on accumulated experience, without constant human intervention. As a result, models become more flexible, adapt quickly to new tasks and contexts, and need far less labeled data.
An example is Google DeepMind’s Gemini-Pro-Vision 1.5 model, which demonstrated zero-shot capability, that is, it performed a new task without any additional task-specific training: using only raw images and contextual data, it accurately detected critical traffic events, such as determining the direction of vehicle movement and recognizing dangerous situations.
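To make “zero-shot” concrete, the sketch below defines a traffic-event classification task entirely in the prompt, with no examples and no fine-tuning. The label set, prompt wording, and the commented-out model call are assumptions for illustration; the actual system described above works on raw camera frames rather than a text description.

```python
# Zero-shot prompting sketch: the task is defined entirely in the prompt,
# with no examples and no task-specific fine-tuning. The model call itself
# is left as a placeholder.

LABELS = ["normal traffic", "wrong-way vehicle", "collision", "pedestrian on roadway"]

def build_zero_shot_prompt(scene_description: str) -> str:
    return (
        "You are monitoring road traffic. Classify the scene below into exactly "
        f"one of these categories: {', '.join(LABELS)}.\n"
        "Respond with the category name only.\n\n"
        f"Scene: {scene_description}"
    )

prompt = build_zero_shot_prompt(
    "A sedan is travelling against the flow of traffic in the left lane "
    "while other vehicles brake and change lanes."
)
print(prompt)
# response = call_multimodal_model(prompt, image=frame)  # placeholder for a real client
```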
Multimodality
This direction involves the simultaneous processing and integration of different types of data: text, images, audio, and sometimes video. It helps build more capable systems by enabling richer contextual understanding, for instance, analyzing text alongside accompanying graphics, describing images, or recognizing emotions in audio.
Meta’s LLaMA 3.2 model, presented at the Meta Connect 2024 conference, gained the ability to process images and is optimized for mobile devices, enabling the integration of powerful AI features across the company’s products.
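In practice, multimodal models are usually invoked by interleaving text and image parts in a single request. The payload below is a generic, hypothetical shape meant only to illustrate that interleaving; real field names and schemas differ from provider to provider.

```python
import base64
from pathlib import Path

def build_multimodal_request(image_path: str, question: str) -> dict:
    """Assemble a generic text+image request. The exact schema is
    provider-specific; this dictionary only illustrates the idea."""
    image_b64 = base64.b64encode(Path(image_path).read_bytes()).decode("ascii")
    return {
        "model": "some-multimodal-model",  # placeholder model name
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {"type": "image", "data_base64": image_b64, "mime_type": "image/jpeg"},
                ],
            }
        ],
    }

# Example usage (requires a local chart.jpg to exist):
# request = build_multimodal_request("chart.jpg", "Summarize the trend shown in this chart.")
```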

Integration with the world through APIs and robotics
One advanced example is PaLM-E, Google’s embodied large language model, which combines text, images, and sensor data from robotic devices. PaLM-E can fuse information from sensors such as cameras and LiDAR to build action plans, answer questions about the surrounding environment, and perform motion-planning tasks.
This model acts as a “brain” for robots, enabling them to understand complex instructions and the surrounding context. PaLM-E opens new opportunities in industrial automation, logistics, service robotics, and even assistive healthcare devices.
Behind such integration lies the need for annotated sensor-fusion data, and preparing such datasets is one of Keymakr’s specialized services. The company supports robotics developers with training datasets that include synchronized inputs from multiple modalities, such as vision and depth sensors.
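To make “synchronized inputs from multiple modalities” concrete, here is a minimal sketch of what one annotated sensor-fusion training sample might look like. The field names and file layout are illustrative assumptions, not Keymakr’s actual data schema.

```python
from dataclasses import dataclass, field

@dataclass
class BoundingBox3D:
    """A labeled 3D box in the vehicle's coordinate frame (metres)."""
    label: str                        # e.g. "pedestrian", "truck"
    center: tuple[float, float, float]
    size: tuple[float, float, float]  # width, length, height
    yaw: float                        # heading angle in radians

@dataclass
class FusionSample:
    """One time-synchronized, annotated multi-sensor training sample."""
    timestamp_us: int                 # shared capture timestamp in microseconds
    camera_frame: str                 # path to the RGB image
    lidar_scan: str                   # path to the point cloud file
    depth_map: str | None = None      # optional depth-sensor output
    annotations: list[BoundingBox3D] = field(default_factory=list)

sample = FusionSample(
    timestamp_us=1_718_000_123_456,
    camera_frame="frames/000042.jpg",
    lidar_scan="lidar/000042.bin",
    annotations=[
        BoundingBox3D("pedestrian", (4.2, -1.1, 0.0), (0.6, 0.6, 1.7), yaw=1.57),
    ],
)
print(f"{len(sample.annotations)} labeled object(s) at t={sample.timestamp_us}")
```

Keying every modality to a single capture timestamp is what makes such a sample usable for sensor fusion: the model sees one coherent snapshot of the scene across cameras and LiDAR.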
What awaits LLMs in the coming years
In the next two to three years, large language models will transform radically, evolving into true cognitive agents with advanced reasoning, planning, and decision-making skills. Significant growth is also expected in the deployment of LLMs that work closely with users and take their needs and priorities into account.
Such cognitive agents will greatly enhance productivity by automating complex analytical tasks and routine processes that previously required expert involvement. In medicine, this will mean more accurate diagnostics and personalized treatment; in business, more efficient resource management and strategic planning; and in education, customized approaches and deep explanations of complex topics.