Near-expert transcription: Google AI's core error rate on historical manuscripts is just 0.56%

An unnamed AI model being tested on Google's AI Studio platform has made significant progress in deciphering hard-to-read historical manuscripts. Its error rate on core character recognition is only 0.56%, an accuracy close to that of professional researchers in the field.

Historian Mark Humphries evaluated the model systematically using a purpose-built benchmark data set. Across the five difficult manuscripts from the 18th and 19th centuries covered in the test, the model's overall character error rate was about 1.7%. Most of the errors involved non-core matters such as punctuation and capitalization conventions, and did not affect recognition of the words themselves.

If these non-critical errors are excluded, the model's character error rate falls to 0.56%, roughly one substantive error for every 180 characters transcribed. That performance is comparable to that of professional transcribers who specialize in historical documents.
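The two figures above can be illustrated with a minimal sketch of how a character error rate (CER) is commonly computed: the edit distance between a reference transcription and the model's output, divided by the length of the reference. The normalization step (lowercasing and stripping ASCII punctuation) is an assumption made here to mimic excluding punctuation and capitalization errors; it is not necessarily the procedure used in the benchmark, and the function names are hypothetical.

```python
import string


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r from the reference
                            curr[j - 1] + 1,      # insert h into the reference
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate relative to the reference length."""
    if not ref:
        raise ValueError("reference transcription must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def normalize(text: str) -> str:
    """Assumed normalization: ignore case, ASCII punctuation and extra spaces."""
    stripped = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(stripped.split())


def core_cer(ref: str, hyp: str) -> float:
    """CER after normalization, so only word-level character errors count."""
    return cer(normalize(ref), normalize(hyp))


if __name__ == "__main__":
    reference = "Hello, World."
    transcript = "hello world"
    print(f"raw CER:  {cer(reference, transcript):.3f}")
    print(f"core CER: {core_cer(reference, transcript):.3f}")
```

In this toy pair every difference is a matter of case or punctuation, so the raw rate is non-zero while the normalized rate is zero, which is the same kind of gap as the one between the 1.7% and 0.56% figures.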

The test manuscripts cover a diverse range of writing styles, including difficult cases such as illegible handwriting, non-standard spelling, and inconsistent grammar, which demonstrates the model's adaptability. More notably, the model not only transcribes text but also shows some capacity for contextual reasoning.

For example, while processing an 18th-century merchant's ledger, the model encountered a sugar purchase recorded as "145" with no units given. By working backward from the account total and applying the British currency and weight systems of the period, it correctly deduced that the figure meant "14 pounds 5 ounces."
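The kind of cross-check described here can be sketched in a few lines: enumerate the plausible readings of the unlabelled digits as pounds and ounces, price each one, and keep the readings that reproduce the recorded total. The unit price (16 pence per pound of sugar) and the total (19 shillings 1 penny) are illustrative assumptions chosen so that the arithmetic works out; they are not given in the text above. The function names are hypothetical, and this is not a description of how the model itself reasons.

```python
from fractions import Fraction

PENCE_PER_SHILLING = 12  # pre-decimal British currency
OUNCES_PER_POUND = 16    # avoirdupois weight


def candidate_weights(digits: str):
    """Yield (pounds, ounces) readings of an unlabelled digit string."""
    yield int(digits), 0  # reading 1: all digits are whole pounds
    for split in range(1, len(digits)):
        pounds, ounces = int(digits[:split]), int(digits[split:])
        if ounces < OUNCES_PER_POUND:  # an ounce count must stay below one pound
            yield pounds, ounces


def cost_in_pence(pounds: int, ounces: int, pence_per_pound: int) -> Fraction:
    """Exact cost of a weight at a given price per pound."""
    weight = Fraction(pounds) + Fraction(ounces, OUNCES_PER_POUND)
    return weight * pence_per_pound


def consistent_readings(digits: str, pence_per_pound: int, total_pence: int):
    """Keep only the readings whose cost matches the recorded total."""
    return [(lb, oz) for lb, oz in candidate_weights(digits)
            if cost_in_pence(lb, oz, pence_per_pound) == total_pence]


if __name__ == "__main__":
    assumed_price = 16                            # assumption: pence per pound
    assumed_total = 19 * PENCE_PER_SHILLING + 1   # assumption: 19s 1d
    print(consistent_readings("145", assumed_price, assumed_total))
```

Under these assumed figures, reading "145" as 145 whole pounds gives a cost far above the total, while 14 pounds 5 ounces matches it exactly, so only that reading survives the check.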

Humphries also noted that the current assessment has limitations. Because the model appears only sporadically through A/B testing, large-scale systematic verification is difficult. So far, only about 10% of the samples in the benchmark data set have been evaluated.