Calligraphic Text Recognition by Gemini, Ernie ViLG and Google Translate: A Comparative Study of Arabic, Japanese and Chinese
Abstract
This study investigates the ability of Gemini, Ernie ViLG, and Google Translate (GT) to recognize Arabic, Japanese, and Chinese calligraphic text images. Analysis of 15 Arabic, 7 Japanese, and 10 Chinese calligraphic samples shows that Gemini matched 12/15 Arabic calligraphic texts with their correct Qur’anic verses, produced accurate translations for all 7 Japanese samples, and correctly interpreted 9/10 Chinese texts. Ernie ViLG generated incorrect or random Qur’anic matches for Arabic, correctly translated 4/7 Japanese and 3/10 Chinese samples, and frequently defaulted to culturally common themes when unable to decode strokes. GT failed to produce any translations for Arabic calligraphy and rendered partial, fragmented, or incoherent translations for Japanese and Chinese images. Across Arabic, Japanese, and Chinese calligraphic texts, the three systems exhibited distinct strengths and weaknesses that stem from their differing approaches to calligraphic text recognition. Gemini proved the most reliable, leveraging multilingual training, pattern matching, and semantic retrieval to associate stylized characters with canonical works, which enabled coherent and culturally grounded interpretations. In contrast, Ernie ViLG struggled with literal recognition, especially in Arabic, and relied heavily on cultural priors when visual cues were ambiguous. GT was the weakest because its OCR pipeline is optimized for printed or clean handwritten text and breaks down when confronted with stylized or artistically distorted calligraphy. Modern AI models process calligraphic text through a combination of visual feature extraction and linguistic prediction. Contemporary neural architectures employ convolutional and transformer-based encoders to interpret strokes, curves, and spatial patterns holistically, allowing them to infer characters even when calligraphy departs from standard rules of spacing, baseline alignment, or shape consistency.
Arabic, Chinese, and Japanese calligraphic scripts challenge these AI systems because they intentionally distort or stylize characters, requiring both visual recognition and the ability to match ambiguous forms with known verses, idioms, or poetic structures. Overall, the current results highlight how multimodal AI models diverge when confronted with stylization, cultural priors, and incomplete visual information.
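For readers who want to compare the reported counts on a common scale, the figures quoted in the abstract can be turned into percentages. The sketch below is only an illustration and not part of the study's method: the names `REPORTED`, `success_rate`, and `summarize` are hypothetical, and it includes only the cells for which the abstract gives an explicit count (Ernie ViLG on Arabic and all GT results are described qualitatively, so they are left out rather than guessed).

```python
# Counts quoted in the abstract as (correct, total) per system and script.
# Cells the abstract does not quantify are deliberately omitted.
REPORTED = {
    "Gemini": {"Arabic": (12, 15), "Japanese": (7, 7), "Chinese": (9, 10)},
    "Ernie ViLG": {"Japanese": (4, 7), "Chinese": (3, 10)},
}


def success_rate(correct, total):
    """Return the share of correctly handled samples as a percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100 * correct / total, 1)


def summarize(reported):
    """Map each system and script to its success rate in percent."""
    return {
        system: {
            script: success_rate(correct, total)
            for script, (correct, total) in scripts.items()
        }
        for system, scripts in reported.items()
    }


if __name__ == "__main__":
    for system, rates in summarize(REPORTED).items():
        for script, pct in rates.items():
            print(f"{system:<11} {script:<9} {pct:5.1f}%")
```

With such small samples (7 to 15 images per script), these percentages are best read as rough indicators rather than precise accuracy estimates.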
Article information
Journal
Journal of Computer Science and Technology Studies
Volume (Issue)
7 (12)
Pages
474-494
Published
Copyright
Copyright (c) 2025 https://creativecommons.org/licenses/by/4.0/
Open access

This work is licensed under a Creative Commons Attribution 4.0 International License.
