Recognizing Toponyms in a Diachronic Alpine Corpus of Texts
How can geographical names (toponyms) be recognized and georeferenced efficiently and reliably in a large collection of texts with the help of volunteers?
(1) Diachronic texts contain historical geographic names whose recognition and georeferencing require detective skills and intellectual work.
(2) When aided by the user interface, amateurs can verify and link geographic names in a way that is efficient and reliable.
The Institute of Computational Linguistics has digitized the annual publications of the Swiss Alpine Club (SAC), issued regularly since 1864. In the previous Citizen Science project SACKokos, volunteers corrected optical character recognition (OCR) errors. The GeoKokos project aims to enrich the texts with semantic information through two tasks: (1) recognizing all text segments that represent geographical names (toponyms) and (2) linking these toponyms to geographic databases (entity linking). Citizen Scientists can then correct the automatically pre-annotated text passages on the GeoKokos website.
The computer faces three problems: (1) It does not recognize all toponyms, especially names or spellings that are no longer common today (Viesch has become Fiesch). (2) Some toponyms are also ordinary nouns (Jungfrau "virgin", Mönch "monk") or are part of other names (Hotel Arosa). (3) The same geographical name can refer to entirely different places (Switzerland has more than 12 peaks named Schwarzhorn).
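The three problems above can be made concrete with a small sketch. It is not the GeoKokos system: it assumes a naive dictionary lookup, and the gazetteer entries, candidate labels, prefix list, and the function name annotate are hypothetical placeholders chosen for illustration. Only the examples Viesch/Fiesch, Hotel Arosa, and Schwarzhorn come from the text. The sketch covers old spellings, names embedded in other names, and ambiguous names; deciding whether a word is used as an ordinary noun requires context that a lookup like this cannot provide.

```python
# Toy gazetteer. The candidate labels are illustrative placeholders,
# not records from a real geographic database.
GAZETTEER = {
    "Fiesch": ["Fiesch (village)"],
    "Arosa": ["Arosa (village)"],
    "Schwarzhorn": ["Schwarzhorn (candidate A)", "Schwarzhorn (candidate B)"],
}

# Historical spelling -> modern spelling (example taken from the text).
HISTORICAL_VARIANTS = {"Viesch": "Fiesch"}

# Hypothetical list of words signalling that the following name
# belongs to another name, as in "Hotel Arosa".
NON_TOPONYM_PREFIXES = {"Hotel"}


def annotate(tokens):
    """Return one pre-annotation per token found in the toy gazetteer."""
    annotations = []
    for i, token in enumerate(tokens):
        # Problem 1: map an outdated spelling to its modern form first.
        name = HISTORICAL_VARIANTS.get(token, token)
        candidates = GAZETTEER.get(name)
        if candidates is None:
            continue  # unknown word: not pre-annotated at all
        if i > 0 and tokens[i - 1] in NON_TOPONYM_PREFIXES:
            status = "part_of_other_name"  # problem 2
        elif len(candidates) > 1:
            status = "ambiguous"  # problem 3: a volunteer has to choose
        else:
            status = "linked"
        annotations.append(
            {
                "token": token,
                "normalized": name,
                "candidates": candidates,
                "status": status,
            }
        )
    return annotations


if __name__ == "__main__":
    sentence = "Von Viesch zum Schwarzhorn , dann ins Hotel Arosa"
    for annotation in annotate(sentence.split()):
        print(annotation)
```

Every annotation that is not marked as linked is a natural candidate for review by a volunteer on the correction website.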
The computer makes mistakes in detecting toponyms because it cannot really grasp the text. By contrast, a volunteer who reads carefully and with interest usually has no problem understanding the meaning of words in context and sees the computer’s errors immediately. In this project, ‘human computation’ and machine learning complement each other. Once human beings have corrected a number of texts, we can create better systems for recognizing toponyms automatically. This in turn reduces the correction work for our volunteers. Ultimately, this synergy produces texts with precisely annotated toponyms that will help answer linguistic and geographic questions.
It is only with the knowledge of our enthusiastic volunteers that the toponyms of these historical texts can be linked to real geographical references.
We expect a level of participation similar to that of SACKokos, where volunteers corrected nearly 200,000 errors on 21,000 pages in 6 months.
Prof. Dr. Martin Volk, Dr. Simon Clematide, Institute of Computational Linguistics, University of Zurich
Prof. Dr. Ross Purves, Department of Geography, University of Zurich