Generative AI tools are capable of performing tasks that once seemed like the stuff of science fiction, but most of them still struggle with many basic skills, including reading analog clocks and calendars. A new study finds that overall, artificial intelligence systems read clock faces correctly less than a quarter of the time.
A research team at the University of Edinburgh tested some of the top multimodal large language models to see how well they could answer questions based on images of clocks and calendars.
The systems tested included Google DeepMind's Gemini 2.0, Anthropic's Claude 3.5 Sonnet, Meta's Llama 3.2-11B-Vision-Instruct, Alibaba's Qwen2-VL-7B-Instruct, ModelBest's MiniCPM-V-2.6, and OpenAI's GPT-4o and GPT-o1.
The images showed various types of clocks: some with Roman numerals, some with and some without second hands, some with dials of different colors, and so on.
The systems read the clocks correctly less than 25% of the time. They struggled even more with clocks that used Roman numerals or stylized hands.
The AI's performance didn't improve when the second hand was removed, leading the researchers to believe that the problem came from detecting the clock's hands and interpreting the angles on the clock face.
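To make the angle-interpretation step concrete, here is a minimal Python sketch of the arithmetic a reader performs once the hands have been located. It is not the researchers' method; the function names `time_to_angles` and `hands_to_time` are hypothetical, and the sketch assumes the hand angles are already known in degrees clockwise from the 12 o'clock position.

```python
def time_to_angles(hour, minute):
    """Return (hour_hand_deg, minute_hand_deg), measured clockwise from 12."""
    # The hour hand moves 30 degrees per hour plus 0.5 degrees per minute.
    hour_angle = (hour % 12) * 30 + minute * 0.5
    # The minute hand moves 6 degrees per minute.
    minute_angle = minute * 6
    return hour_angle, minute_angle


def hands_to_time(hour_angle_deg, minute_angle_deg):
    """Recover (hour, minute) from hand angles; the hour is reported as 1-12."""
    minute = round(minute_angle_deg / 6) % 60
    # Subtract the drift the minutes add to the hour hand before
    # snapping to the nearest hour mark.
    hour = round((hour_angle_deg - minute * 0.5) / 30) % 12
    return (12 if hour == 0 else hour), minute


print(hands_to_time(105, 180))  # hour hand at 105 degrees, minute hand at 180
```

The arithmetic itself is simple; the study suggests the difficulty for the models lies earlier, in detecting the hands and estimating their angles from the image.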
Using images of calendars spanning 10 years, the researchers asked questions such as "What day of the week is New Year's Day?" Even the most successful AI model got the calendar questions wrong 20% of the time.
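For comparison, the calendar question has a deterministic answer that ordinary software computes directly. The short Python sketch below uses the standard library's `datetime` module; the helper name `new_years_weekday` is hypothetical and is not taken from the study.

```python
from datetime import date

# Fixed English names, so the result does not depend on the system locale.
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def new_years_weekday(year):
    """Return the weekday name of January 1 in the given year."""
    # date.weekday() returns 0 for Monday through 6 for Sunday.
    return WEEKDAYS[date(year, 1, 1).weekday()]


print(new_years_weekday(2025))  # Wednesday
```

The models in the study were asked to answer from a calendar image rather than by computation, which is what makes the task a test of visual reading.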
Success rates varied depending on the AI system used. Gemini 2.0 scored highest on the clock test, while GPT-o1 was 80% accurate on the calendar questions.
"Most people have grown up telling time and using calendars," said study leader Rohit Saxena of the University of Edinburgh's School of Informatics. "The findings highlight the huge gaps in AI's ability to perform basic human skills. These shortcomings must be addressed if AI systems are to be successfully integrated into time-sensitive real-world applications such as scheduling, automation and assistive technology."
Aryo Gema, another researcher at the University of Edinburgh's School of Informatics, said: "Today's artificial intelligence research often emphasizes complex reasoning tasks, but ironically many systems still struggle to handle simpler daily tasks."
The findings will be reported in a peer-reviewed paper to be presented at the Reasoning and Planning for Large Language Models workshop at the 13th International Conference on Learning Representations (ICLR) in Singapore on April 28. The results are currently available on the preprint server arXiv.
This isn't the first study this month to suggest that AI systems still make a lot of mistakes. The Tow Center for Digital Journalism conducted a study of eight artificial intelligence search engines and found that they were inaccurate 60% of the time. The worst performer was Grok 3, which answered incorrectly 94% of the time.