ChatGPT-4o Shows Promising Potential in Translation Post-Editing, But Lags Behind Humans in Key Metrics

A new preprint posted on SocArXiv (hosted on osf.io) has investigated the effectiveness of ChatGPT-4o, a large language model (LLM), in post-editing Arabic translations across various domains. The research found that while ChatGPT-4o was more efficient than human post-editors, it lagged behind them in most quality metrics. Although it struggled with grammatical and syntactic nuances, the model was competitive in producing fluent, coherent, and stylistically consistent text. The study highlights the potential of ChatGPT-4o as a supportive tool in translation post-editing workflows, complementing human translators by enhancing productivity while maintaining acceptable quality standards.

Key Takeaways:

The study found that human post-editors outperformed ChatGPT-4o in most quality metrics, including accuracy, coherence, and consistency.
ChatGPT-4o was more efficient at post-editing than human post-editors, and the difference was statistically significant (t-statistic = 8.00, p-value = 0.015).
The model faced challenges in handling grammatical and syntactic nuances, domain-specific idioms, and complex terminology, especially in medical and sports contexts.
ChatGPT-4o showed competitive performance in producing fluent, coherent, and stylistically consistent text, particularly in English-to-Arabic post-editing.
The study suggests that ChatGPT-4o can be a supportive tool in translation post-editing workflows, complementing human translators by enhancing productivity and maintaining acceptable quality standards.

Statistics:

The study used a paired t-test to assess whether the differences in quality between human and ChatGPT-4o post-edits were statistically significant.
The results showed a significant difference in efficiency between human and ChatGPT-4o post-edits (t-statistic = 8.00, p-value = 0.015).
No significant difference was observed between human and ChatGPT-4o post-edits in fluency metrics (t-statistic = -3.5, p-value = 0.074).
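
For readers unfamiliar with the test, the following is a minimal sketch of how a paired t-test produces a t-statistic and a two-sided p-value. It is not the study's own analysis: the score lists are hypothetical illustrative values, the function names are invented for this example, and the p-value is obtained by simple numerical integration of the t density rather than by a statistics library.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(a, b):
    """Return (t, df) for a paired t-test on two equal-length samples."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    sd = stdev(diffs)  # sample standard deviation of the pairwise differences
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    return mean(diffs) / (sd / math.sqrt(n)), n - 1


def t_pdf(x, df):
    """Probability density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def two_sided_p_value(t, df, steps=10_000):
    """Two-sided p-value via Simpson's rule on the t density over [0, |t|]."""
    upper = abs(t)
    if upper == 0:
        return 1.0
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3  # probability mass between 0 and |t|
    return max(0.0, 1 - 2 * area)


# Hypothetical per-text quality scores (NOT data from the study).
human_scores = [4.5, 4.0, 4.8, 4.2, 4.6]
model_scores = [4.1, 3.9, 4.3, 4.0, 4.1]

t, df = paired_t_statistic(human_scores, model_scores)
p = two_sided_p_value(t, df)
print(f"t = {t:.2f}, df = {df}, p = {p:.3f}")
```

The sign of the t-statistic depends only on the order of subtraction, so a negative value such as the one reported for fluency indicates the direction of the difference, not its significance. In practice a library routine such as SciPy's `scipy.stats.ttest_rel` would normally replace the hand-written integration.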

Sources:

osf.io/preprints/socarxiv/fyq42_v1/ (Preprint abstract)