Abstract
The objective of this thesis is twofold. On the one hand, it proposes a more accurate evaluation protocol for text detection systems that solves some of the existing problems in this area. On the other hand, it focuses on the design of a text rectification procedure for correcting highly deformed texts.
Text detection systems have gained significant importance in recent years. The growing number of approaches proposed in the literature calls for rigorous performance evaluation and ranking. In the context of text detection, an evaluation protocol relies on three elements: a reliable text reference, a set of matching rules that decide the relationship between the ground truth and the detections, and a set of metrics that produce intuitive scores. The few existing evaluation protocols often lack accuracy, either because inconsistent matching procedures yield unfair scores or because the metrics are unrepresentative. Despite these issues, researchers continue to use these protocols to evaluate their work. In this Ph.D. thesis, we propose a new evaluation protocol for text detection algorithms that tackles most of the drawbacks of currently used evaluation methods. This work makes three main contributions: first, we introduce a complex text reference representation that does not constrain text detectors to adopt a specific detection granularity level or annotation representation; second, we propose a set of matching rules capable of evaluating any type of scenario that can occur between a text reference and a detection; and finally, we show how a set of detection results can be analyzed not only through a set of metrics but also through an intuitive visual representation. We use this protocol to evaluate different text detectors and then compare the results with those provided by alternative evaluation methods.
A frequent challenge for many Text Understanding Systems is handling the variety of text characteristics in born-digital and natural scene images, to which current OCR systems are not well adapted. For example, text in perspective is frequent in real-world images because the camera's viewing direction is not normal to the plane containing the text regions. Although some detectors can accurately localize such text objects, the recognition stage fails most of the time. Indeed, most OCR systems are not designed to handle text strings in perspective; they expect horizontal text in a fronto-parallel plane in order to provide a correct transcription. All these aspects, together with the introduction of a very challenging dataset, motivated us to propose a rectification procedure capable of correcting highly distorted texts.
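To make the notion of perspective correction concrete, the following is a minimal sketch under stated assumptions, not the rectification procedure proposed in this thesis. It assumes that the text lies on a plane and that the four corners of its bounding quadrilateral are already known, so that the distortion can be modeled by a homography. The function names (`solve_linear_system`, `estimate_homography`, `apply_homography`) and the example coordinates are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def solve_linear_system(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix, so that row operations act on both sides at once.
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("degenerate point configuration")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def estimate_homography(src: Sequence[Point],
                        dst: Sequence[Point]) -> List[List[float]]:
    """Return the 3x3 homography (bottom-right entry fixed to 1) that maps
    four source points onto four destination points."""
    a: List[List[float]] = []
    b: List[float] = []
    for (x, y), (u, v) in zip(src, dst):
        # Each correspondence gives two linear equations in 8 unknowns.
        a.append([x, y, 1.0, 0.0, 0.0, 0.0, -u * x, -u * y])
        b.append(u)
        a.append([0.0, 0.0, 0.0, x, y, 1.0, -v * x, -v * y])
        b.append(v)
    h = solve_linear_system(a, b)
    return [h[0:3], h[3:6], [h[6], h[7], 1.0]]


def apply_homography(h: List[List[float]], p: Point) -> Point:
    """Map a point through h, including the projective division."""
    x, y = p
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return ((h[0][0] * x + h[0][1] * y + h[0][2]) / w,
            (h[1][0] * x + h[1][1] * y + h[1][2]) / w)


# Hypothetical corners of a text region seen in perspective, listed as
# top-left, top-right, bottom-right, bottom-left.
quad = [(12.0, 30.0), (210.0, 8.0), (220.0, 95.0), (5.0, 70.0)]
# Fronto-parallel target rectangle of 200 x 50 pixels.
rect = [(0.0, 0.0), (200.0, 0.0), (200.0, 50.0), (0.0, 50.0)]
H = estimate_homography(quad, rect)
```

Under these assumptions, mapping every pixel of the quadrilateral through `H` (or, in practice, sampling the source image through the inverse mapping) would yield a fronto-parallel view of the text that a standard OCR system can process. Highly deformed texts may violate the planarity assumption, in which case a single homography is not sufficient.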