Publications

A benchmark of nested named entity recognition approaches in historical structured documents

By Solenn Tual, Nathalie Abadie, Joseph Chazalon, Bertrand Duménieu, Edwin Carlinet

2023-06-01

In Proceedings of the international conference on document analysis and recognition (ICDAR 2023)

Abstract

Named Entity Recognition (NER) is a key step in the creation of structured data from digitised historical documents. Traditional NER approaches deal with flat named entities, whereas entities are often nested. For example, a postal address might contain a street name and a number. This work compares three nested NER approaches, including two state-of-the-art approaches using Transformer-based architectures. We introduce a new Transformer-based approach based on joint labelling and semantic weighting of errors, evaluated on a collection of 19th-century Paris trade directories. We evaluate approaches regarding the impact of supervised fine-tuning, unsupervised pre-training with noisy texts, and variation of IOB tagging formats. Our results show that while nested NER approaches enable extracting structured data directly, they do not benefit from the extra knowledge provided during training and reach a performance similar to the base approach on flat entities. Even though all 3 approaches perform well in terms of F1-scores, joint labelling is most suitable for hierarchically structured data. Finally, our experiments reveal the superiority of the IO tagging format on such data.
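
To make the tagging formats and the idea of joint labelling concrete, here is a minimal sketch. It assumes the IOB2 variant of IOB, a hypothetical directory entry, hypothetical label names, and a `+` separator for joined labels. None of these details come from the paper, whose exact labelling scheme the abstract does not specify.

```python
from typing import List, Tuple

# A span is (start_token, end_token_exclusive, label); labels are hypothetical.
Span = Tuple[int, int, str]


def tag_level(n_tokens: int, spans: List[Span], scheme: str = "IOB2") -> List[str]:
    """Tag one nesting level in IOB2 or IO format."""
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        for i in range(start, end):
            if scheme == "IO":
                # IO: every token inside an entity gets the same I- prefix
                tags[i] = "I-" + label
            else:
                # IOB2: the first token of an entity is marked with B-
                tags[i] = ("B-" if i == start else "I-") + label
    return tags


def joint_labels(levels: List[List[str]]) -> List[str]:
    """Join the per-level tags into a single label per token."""
    return ["+".join(tags) for tags in zip(*levels)]


# Hypothetical entry: a person followed by an address containing a street and a number
tokens = ["Dupont", ",", "rue", "Vivienne", "12"]
level1 = [(0, 1, "PERSON"), (2, 5, "ADDRESS")]  # top-level entities
level2 = [(2, 4, "STREET"), (4, 5, "NUMBER")]   # entities nested in the address

iob = joint_labels([tag_level(len(tokens), level1, "IOB2"),
                    tag_level(len(tokens), level2, "IOB2")])
io = joint_labels([tag_level(len(tokens), level1, "IO"),
                   tag_level(len(tokens), level2, "IO")])

for token, a, b in zip(tokens, iob, io):
    print(f"{token:10} {a:22} {b}")
```

With joined labels, a standard flat token classifier can predict both nesting levels at once; the IO format simply yields a smaller label set than IOB2.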

Clustering en chémoinformatique pour le raffinement de l’activité des molécules [Clustering in chemoinformatics for refining the activity of molecules]

By Maroua Lejmi, Ilef Ben Slima, Bertrand Cuissart, Nida Meddouri, Ronan Bureau, Alban Lepailleur, Jean-Luc Lamotte, Amel Borgi

2023-06-01

In Proceedings of the second computer science UTM PhD symposium

Abstract

In the field of drug design, chemoinformatics uses computational and mathematical methods to analyse chemical and biological data and to try to identify promising molecules at a very early stage. In our context, we transform the molecules to keep only their pharmacophoric features (the active part of the molecule). The objective of this work is to refine the activity of the molecules that will be used in the drug design process into activity classes. This will give chemists and pharmacists a better visualisation and understanding of molecular activity, and will provide finer-grained data for the later development of a model for predicting molecules of therapeutic interest.

Could the topology of virtual processors affect the performance of a BSD-family OS running in a VM?

By David Beserra, Marc Espie, Jean Araujo, Léo Tomasimo, Hector Poncins, Hadrien-Samrek Lacombe, Thomas Vondracek

2023-06-01

In 18th iberian conference on information systems and technologies (CISTI’2023)

Abstract

Virtual machines are an essential technology in distributed and pervasive systems. One of their configurable parameters is the topology of the virtual processing system, which can potentially impact their performance. In this work, we verify how different virtual processing topologies affect the performance of VMs running BSD OSes. We conclude that in some types of applications the topology does not affect VM performance, while in others it does, and that the performance impact also depends on the OS adopted by the VM.

CRACS: Compaction of rules in anticipatory classifier systems

By Romain Orhand, Pierre Collet, Pierre Parrend, Anne Jeannin-Girardon

2023-06-01

In Proceedings of the companion conference on genetic and evolutionary computation

Abstract

Rule Compaction of populations of Learning Classifier Systems (LCS) has always been a topic of interest to get more insights into the discovered underlying patterns from the data or to remove useless classifiers from the populations. However, these techniques have neither been used nor adapted to Anticipatory Learning Classifier Systems (ALCS). ALCS differ from other LCS in that they build models of their environments from which decision policies to solve their learning tasks are learned. We thus propose CRACS (Compaction of Rules in Anticipatory Classifier Systems), a compaction algorithm for ALCS that aims to reduce the size of their environmental models without impairing these models or the ability of these systems to solve their tasks. CRACS relies on filters applied to classifiers and subsumption principles. The capabilities of our compaction algorithm have been studied with three different ALCS on a thorough benchmark of 23 mazes of various levels of environmental uncertainty. The results show that CRACS reduces the size of populations of classifiers while the learned models of environments and the ability of ALCS to solve their tasks are preserved.

Explorer les débats parlementaires français de la troisième république par leurs sujets [Exploring the French parliamentary debates of the Third Republic by topic]

By Marie Puren, Aurélien Pellet

2023-06-01

In Humanistica 2023

Abstract

This article compares three methods for exploring large corpora of historical documents by topic. We work here on the French parliamentary debates of the Third Republic, which lend themselves particularly well to this type of analysis. After presenting the context of this study, we report the results obtained with three methods from natural language processing, applied to texts published between 1876 and 1914: latent Dirichlet allocation, word embeddings and transfer learning.

L’identification des projets de logiciel libre accessibles aux nouveaux contributeurs [Identifying free software projects accessible to new contributors]

By Paul Hervot, Benoît Crespin

2023-06-01

In EIAH2023 : 11ème conférence sur les environnements informatiques pour l’apprentissage humain

Abstract

FOSS makes up a growing share of the public and industrial software landscape, notably because of its transparency and democratic governance. However, simply publishing the source code of a piece of software does not automatically make it accessible, and many barriers impede new contributors approaching these projects. Through large-scale software mining of the Software Heritage archive, we test the pertinence of three signals for identifying FOSS projects that are accessible to new contributors. Our results show a positive correlation between the number of new contributors who successfully bring their contribution to completion in a project and the presence of contributing guidelines, as well as between that same number and the number of recent unique contributors to the project. Such signals could be useful in the teaching of FOSS practices, helping teachers select accessible projects for their students.

Linear object detection in document images using multiple object tracking

By Philippe Bernet, Joseph Chazalon, Edwin Carlinet, Alexandre Bourquelot, Élodie Puybareau

2023-06-01

In Proceedings of the international conference on document analysis and recognition (ICDAR 2023)

Abstract

Linear objects convey substantial information about document structure, but are challenging to detect accurately because of degradation (curved, erased) or decoration (doubled, dashed). Many approaches can recover some vector representation, but only one closed-source technique, introduced in 1994 and based on Kalman filters (a particular case of Multiple Object Tracking algorithms), can perform a pixel-accurate instance segmentation of linear objects and make it possible to selectively remove them from the original image. We aim to re-popularize this approach and propose: 1. a framework for accurate instance segmentation of linear objects in document images using Multiple Object Tracking (MOT); 2. document image datasets and metrics which enable both vector- and pixel-based evaluation of linear object detection; 3. performance measures of MOT approaches against modern segment detectors; 4. performance measures of various tracking strategies, exhibiting alternatives to the original Kalman filters approach; and 5. an open-source implementation of a detector which can discriminate instances of curved, erased, dashed, intersecting and/or overlapping linear objects.
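
As a rough illustration of how a Kalman filter can follow a line across the columns of an image, here is a minimal single-object sketch. It is not the 1994 technique nor the authors' implementation: the constant-position model, the noise variances and the `track_line` function are assumptions made for this example, and a full MOT framework would manage many such trackers at once.

```python
from typing import List, Optional


def track_line(observations: List[Optional[float]],
               process_var: float = 0.01,
               meas_var: float = 1.0) -> List[float]:
    """Scalar Kalman filter with a constant-position model (an assumption).

    observations[i] is the measured vertical position of the line in column i,
    or None when the line is not visible there (erased or dashed part).
    Returns one filtered position per column.
    """
    seen = [z for z in observations if z is not None]
    if not seen:
        raise ValueError("at least one observation is required")
    x = seen[0]    # state estimate, initialised from the first measurement
    p = meas_var   # variance of the estimate
    estimates = []
    for z in observations:
        # Predict: the position is assumed unchanged, the uncertainty grows
        p += process_var
        if z is not None:
            # Update: blend prediction and measurement using the Kalman gain
            k = p / (p + meas_var)
            x += k * (z - x)
            p *= 1.0 - k
        # With no measurement, the prediction alone carries the track forward
        estimates.append(x)
    return estimates


# Hypothetical dashed line at height 10 with two columns where nothing is seen
dashed = track_line([10.0, 10.0, None, None, 10.0])
# Hypothetical noisy step: the estimate moves only part of the way to the outlier
step = track_line([0.0, 2.0])
print(dashed, step)
```

The prediction step is what lets a tracker bridge gaps in dashed or erased lines, which a purely local detector would split into separate segments.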

Metrics for community dynamics applied to unsupervised attacks detection

By Julien Michel, Pierre Parrend

2023-06-01

In Rencontres des jeunes chercheurs en intelligence artificielle

Abstract

Attack detection in large networks has become a necessity. Yet, with the ever-changing threat landscape and the massive amount of data to handle, network intrusion detection systems (NIDS) end up becoming obsolete. Different machine-learning-based solutions have been developed to answer the detection problem for data with evolving statistical distributions. However, no approach has proved to be both scalable and robust to the passage of time. In this paper, we propose a scalable and unsupervised approach to detect behavioral patterns without prior knowledge of the nature of attacks. For this purpose, we define novel metrics for graph community dynamics and use them as features with an unsupervised detection algorithm on the UGR’16 dataset. The proposed approach improves existing detection algorithms by 285.56% in precision and 222.82% in recall when compared to usual feature extraction (FE) using isolation forest.

Learning sentinel-2 reflectance dynamics for data-driven assimilation and forecasting

By Anthony Frion, Lucas Drumetz, Guillaume Tochon, Mauro Dalla Mura, Abdeldjalil Aïssa El Bey

2023-05-29

In Proceedings of the 31st european signal processing conference (EUSIPCO)

Abstract

Over the last few years, massive amounts of satellite multispectral and hyperspectral images covering the Earth’s surface have been made publicly available for scientific purposes, for example through the European Copernicus project. Simultaneously, the development of self-supervised learning (SSL) methods has sparked great interest in the remote sensing community, making it possible to learn latent representations from unlabeled data to help address downstream tasks for which there are few annotated examples, such as interpolation, forecasting or unmixing. Following this line, we train a deep learning model inspired by Koopman operator theory to model long-term reflectance dynamics in an unsupervised way. We show that this trained model, being differentiable, can be used as a prior for data assimilation in a straightforward way. Our datasets, which are composed of Sentinel-2 multispectral image time series, are publicly released with several levels of treatment.
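
The core idea behind Koopman-inspired models is that dynamics become linear in a suitable latent space, so forecasting reduces to repeated multiplication by one matrix followed by decoding. The sketch below illustrates only that idea. In the paper the mappings are learned by a neural network, whereas here the matrix `K`, the toy `decode` function, the 12-step period and the reflectance values are hypothetical choices made for illustration.

```python
import math
from typing import Callable, List, Sequence


def matvec(m: Sequence[Sequence[float]], v: Sequence[float]) -> List[float]:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def forecast(z0: Sequence[float],
             K: Sequence[Sequence[float]],
             decode: Callable[[Sequence[float]], float],
             steps: int) -> List[float]:
    """Advance the latent state linearly (z <- K z) and decode every step."""
    z = list(z0)
    out = []
    for _ in range(steps):
        z = matvec(K, z)
        out.append(decode(z))
    return out


# Hypothetical latent dynamics: a rotation with a period of 12 time steps,
# standing in for a seasonal cycle.
theta = 2 * math.pi / 12
K = [[math.cos(theta), -math.sin(theta)],
     [math.sin(theta), math.cos(theta)]]


def decode(z: Sequence[float]) -> float:
    """Toy decoder mapping the latent state to a reflectance-like value."""
    return 0.3 + 0.1 * z[0]


series = forecast([1.0, 0.0], K, decode, 12)
print([round(v, 3) for v in series])
```

Because every operation in such a rollout is differentiable, a trained model of this shape can be placed inside an optimisation loop, which is what makes its use as a prior for data assimilation straightforward.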
