PeGazUs: A knowledge graph based approach to build urban perpetual gazetteers
In International conference on knowledge engineering and knowledge management (EKAW 2024)
In Document analysis and recognition - ICDAR 2024
Text on digitized historical maps contains valuable information, e.g., providing georeferenced political and cultural context. The goal of the ICDAR 2024 MapText Competition is to benchmark methods that automatically extract textual content on historical maps (e.g., place names) and connect words to form location phrases. The competition features two primary tasks, text detection and end-to-end text recognition, each with a secondary task of linking words into phrase blocks. Submissions are evaluated on two data sets: 1) the David Rumsey Historical Map Collection, which contains 936 map images covering 80 regions and 183 distinct publication years (from 1623 to 2012); 2) the French Land Registers (created during the 19th century), which contain 145 map images of 50 French cities and towns. The competition received 44 submissions across all tasks. This report presents the motivation for the competition, the tasks, the evaluation metrics, and the submission analysis.
In 34es journées francophones d’ingénierie des connaissances (IC 2023) @ plate-forme intelligence artificielle (PFIA 2023)
Historical trade directories, published at a steady pace in many European cities throughout the 19th and 20th centuries, form a corpus of sources that is unique in its volume and in the possibility it offers of following urban transformations through the lens of inhabitants' professional activities, from the scale of the individual up to that of the entire city. However, the spatio-temporal analysis of a type of business through directory entries requires considerable manual work of inventorying, transcribing, and cross-checking. To overcome this difficulty, this article proposes an automatic approach to build and visualise a geohistorical knowledge graph of the businesses listed in historical directories. The approach is tested on 19th-century Paris trade directories spanning 1799 to 1908, using the photography trades as a case study.
In Proceedings of the international conference on document analysis and recognition (ICDAR 2023)
Named Entity Recognition (NER) is a key step in the creation of structured data from digitised historical documents. Traditional NER approaches deal with flat named entities, whereas entities are often nested. For example, a postal address might contain a street name and a number. This work compares three nested NER approaches, including two state-of-the-art approaches using Transformer-based architectures. We introduce a new Transformer-based approach based on joint labelling and semantic weighting of errors, evaluated on a collection of 19th-century Paris trade directories. We evaluate approaches regarding the impact of supervised fine-tuning, unsupervised pre-training with noisy texts, and variation of IOB tagging formats. Our results show that while nested NER approaches enable extracting structured data directly, they do not benefit from the extra knowledge provided during training and reach a performance similar to the base approach on flat entities. Even though all 3 approaches perform well in terms of F1-scores, joint labelling is most suitable for hierarchically structured data. Finally, our experiments reveal the superiority of the IO tagging format on such data.
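To make the tagging vocabulary in this abstract concrete, here is a minimal Python sketch of two of the ideas it mentions: converting IOB tags to the simpler IO format, and joint labelling of nested entities by merging one tag per nesting level into a single label per token. The entity names (`PER`, `ADDR`, `STREET`, `NUM`), the `+` separator, and the sample directory entry are assumptions made for illustration and are not taken from the paper.

```python
def iob_to_io(tags):
    """Collapse IOB tags (B-X / I-X / O) into IO tags (I-X / O)."""
    return ["I-" + tag[2:] if tag.startswith("B-") else tag for tag in tags]


def joint_labels(outer, inner, sep="+"):
    """Merge two nesting levels into one label per token (hypothetical format)."""
    if len(outer) != len(inner):
        raise ValueError("both tag layers must cover the same tokens")
    return [f"{o}{sep}{i}" for o, i in zip(outer, inner)]


# Hypothetical directory entry: a person name followed by a postal address.
tokens = ["Dupont", ",", "rue", "de", "Rivoli", "12"]
# Level 1: top-level entities, in IOB format.
outer = ["B-PER", "O", "B-ADDR", "I-ADDR", "I-ADDR", "I-ADDR"]
# Level 2: entities nested inside the address, in IOB format.
inner = ["O", "O", "B-STREET", "I-STREET", "I-STREET", "B-NUM"]

# One joint IO label per token, usable by a flat token classifier.
joint = joint_labels(iob_to_io(outer), iob_to_io(inner))
for token, label in zip(tokens, joint):
    print(f"{token}\t{label}")
```

With joint labels, a single flat token classifier can predict nested structure, at the cost of a larger label set. The IO format drops the begin/inside distinction, which shrinks that label set but can no longer separate two adjacent entities of the same type.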
In Proceedings of the 15th IAPR international workshop on document analysis systems
Named entity recognition (NER) is a necessary step in many pipelines targeting historical documents. Indeed, such natural language processing techniques identify which class each text token belongs to, e.g. “person name”, “location”, “number”. Introducing a new public dataset built from 19th-century French directories, we first assess how noisy modern, off-the-shelf OCR systems are. Then, we compare modern CNN- and Transformer-based NER techniques which can be reasonably used in the context of historical document analysis. We measure their requirements in terms of training data, the effects of OCR noise on their performance, and show how Transformer-based NER can benefit from unsupervised pre-training and supervised fine-tuning on noisy data. Results can be reproduced using resources available at https://github.com/soduco/paper-ner-bench-das22 and https://zenodo.org/record/6394464
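One common way to quantify OCR noise is the character error rate (CER): the edit distance between the OCR output and a reference transcription, divided by the reference length. The sketch below is a generic illustration of that metric. It is an assumption that a measure of this kind is relevant here, and it is not the evaluation code from the linked repository.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free if characters match
            ))
        prev = cur
    return prev[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance normalised by the length of the reference transcription."""
    if not reference:
        raise ValueError("reference must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


# Hypothetical reference transcription and noisy OCR output.
reference = "Dupont, rue de Rivoli, 12"
ocr_output = "Dupont, rne de Rivoli. 12"
print(character_error_rate(reference, ocr_output))
```

Because CER is normalised by the reference length, it can exceed 1.0 when the OCR output contains many spurious characters.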