Funded PhD thesis: urban chapbooks (colportage), 16th-17th centuries

A research team at Sorbonne Université is offering the doctoral project "COLPORTAL"; applications are open until 31 January 2024.
The project studies urban chapbooks (peddled print) from the late 16th to the late 17th century from a linguistic and discursive point of view. Does the language they use have specific features? Does it show characteristics of a 'popular' language? Or is it a language written by the literate for the less literate in order to connote the 'popular'? We would like to examine how reading publics were formed, starting from the linguistic forms of the written texts.
The project will rely on a digitized corpus that will need to be prepared, annotated, and mined. Some background in digital humanities and an interest in Natural Language Processing (NLP; TAL in French) methods are required.
The candidate must have spent fewer than 12 months in France during the last 36 months. An interview will be held. The deadline is 31 January 2024. The full eligibility conditions are here: https://soundai.sorbonne-universite.fr/dl/for_students

 

Extended description

COLPORTAL - Detailed Doctoral Research Project 
SUMMARY

Chapbooks, distributed by peddlers, are known to have been the first means of disseminating books outside the elite, in both town and country, from the 16th century onwards. Their effect has been compared to that of the mass media (Botrel 1993). For the French-speaking area, they have already been studied as a channel for the circulation of norms and ideas, but not yet sufficiently as a resource for linguistic knowledge. The aim of this thesis is to propose an automatic approach to these data which, from a computer science point of view, tests the ability of systems to remain effective on discursive forms that evolve diachronically.

TITLE

Distant reading of chapbooks: questioning the varieties of French in the 16th and 17th centuries

SUPERVISORS

Gaël Lejeune and Karine Abiven (Sorbonne University)

Eligibility criteria for the candidates

https://soundai.sorbonne-universite.fr/dl/for_students

Applications: until 31 January 2024.

DESCRIPTION OF THE PROJECT

    Chapbooks are known to have been the first way of disseminating books outside the elite, in towns from the 16th century onwards, and later in the countryside. Their effect has been compared to that of mass media (Botrel 1993). They are thus a source of "pop culture", a culture that can be called popular in the sense that it circulates widely across society (and not because it relates to "the people", an overly essentializing notion).
    Studies on the subject have so far focused on the uses of reading (Chartier), the history of print (Chartier and Lüsebrink), the social history of the peddlers' guild (Fontaine 1993) or the mediation of literature by these writings (in particular tales, Andries and Bollème). But no study to date has embraced the many genres present in these documents: news ("faits divers"), fun facts, songs, broadside ballads, almanacs, fiction. Indeed, it is difficult for a human reader to study such a wide range of textual genres together. While chapbooks have so far been used as a source of knowledge about past societies, we would like to make them contribute to the history of language.
    This is why the contribution of automatic analysis methods, from Natural Language Processing and Computer Vision, is necessary. First of all, we need a general view of these data, which is out of reach for unaided human reading, but we also need to observe the internal dynamics of these sets of documents, particularly at the linguistic and discursive level. More concretely, the aim is to challenge one of the prejudices about chapbooks: their inertia (always the same tales, the same repeated almanacs ...). The texts are said to be reproduced more or less identically, an impression that lends credence to the thesis of the "conservative" mind of the "people" (Mandrou 1964, repr. 1981). In the case of songs, for example, Keilhauer (2002) has shown that the repertoire is much more mobile than previously thought. The same is true of news items ("canards"), which are certainly based on well-known themes, but also adapt to current events. What about the evolution of forms of discourse? Is street literature fundamentally conservative (as "mass literature" is said to be in the capitalist regime of the contemporary era: Dubois 1983: 131), or can we find formal, discursive and linguistic innovations?
    Widely circulated writings are an exceptional source of linguistic variation: diatopic variation (regional languages or local variations of language); diastratic variation (i.e. according to the social situation of the speakers); diachronic variation (according to the evolution of discursive forms over time). Computational methods (in particular classical similarity calculations or the use of diachronic embeddings) should be able to reveal the modifications that affect writings as they are republished and as society evolves. This project will take a more textual approach than the usual treatments this corpus has undergone, in order to determine what the content of the writings themselves can tell us about the uses and appropriations of these widely circulated objects.
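    As an illustration of what a "classical similarity calculation" between two printings of a text might look like, here is a minimal sketch using Jaccard similarity over character trigrams. It is not the project's method: the function names `char_ngrams` and `jaccard_similarity` are hypothetical, the choice of trigrams is an assumption, and the three sample strings are invented for the example rather than taken from the corpus.

```python
from typing import Set


def char_ngrams(text: str, n: int = 3) -> Set[str]:
    """Return the set of character n-grams of a lowercased, whitespace-normalized text."""
    normalized = " ".join(text.lower().split())
    if len(normalized) < n:
        # Texts shorter than n yield themselves as a single gram (or nothing if empty).
        return {normalized} if normalized else set()
    return {normalized[i:i + n] for i in range(len(normalized) - n + 1)}


def jaccard_similarity(a: str, b: str, n: int = 3) -> float:
    """Jaccard similarity (intersection over union) of two texts' n-gram sets."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a and not grams_b:
        return 1.0  # two empty texts are treated as identical
    return len(grams_a & grams_b) / len(grams_a | grams_b)


# Invented sample strings (assumption: NOT quotations from the corpus).
edition_a = "Histoire veritable et espouventable"
edition_b = "Histoire véritable et épouvantable"   # same title, modernized spelling
unrelated = "Chanson nouvelle sur le siege"

score_reprint = jaccard_similarity(edition_a, edition_b)
score_unrelated = jaccard_similarity(edition_a, unrelated)
print(f"reprint pair:   {score_reprint:.2f}")
print(f"unrelated pair: {score_unrelated:.2f}")
```

    Character n-grams are used here rather than word tokens because early modern spelling is unstable, so two editions can share most of their character sequences while sharing few identical word forms. Tracking how such a score changes across successive reprints would be one simple way to test the "inertia" hypothesis discussed above.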
    The variations exhibited by these documents raise the question, little addressed on the Natural Language Processing side, of the ability of language models (including LLMs, Large Language Models) to represent plural languages. What can an LLM trained on contemporary data reveal about 16th-century French, and about all the internal variations of this language? If there is not one French but various forms of French, what kind of French do we model if we fine-tune an existing LLM on these data? Does a model of the French language at some point cease to be a model of a single French language, and thus become a multilingual model? How much can this bias the representation of language variations? This work would enrich our knowledge of the behavior of LLMs on varied and under-resourced language states (Dufter and Schütze 2020, Imani Googhari et al. 2023).
    Another interesting aspect of this corpus is the material form of the documents: formatting, visual structure of the leaflets, presence or absence of illustration (woodcuts in particular), typographical formatting (bold, italics). This materiality can provide information on how the document was cognitively apprehended: which text sequences are highlighted (summaries of texts, slogan-type forms or elements of language)? What hypotheses can we infer from the interactions between text and image, particularly regarding strategies of dissemination and memorization? The visual structure of documents will be another angle of study. Indeed, the cross-fertilization between Computer Vision and NLP is seldom exploited in Digital Humanities. In particular, the typographic and layout features of corpora appear to be relatively little studied in the field.
    This angle of study, which adopts the point of view of the sciences of the text, requires working on precise and structured corpora. For the study to be feasible, the corpus needs to be narrowed down: we have chosen to do this by focusing on productions from the early days of the peddlers' guild in Europe (late 16th to late 17th century). On the one hand, this allows us to focus on the penetration of written culture in a population that was largely non-reading at that time (later periods are already better studied). On the other hand, it allows us to focus on an urban language (since at the time, peddling was more concentrated in towns and not yet widespread in the countryside). The other limitation concerns linguistic observables: we will focus on fixed forms of the language (in particular proverbs) and markers of orality (apostrophe, stimulus connectors, etc.). In return for these limitations, the corpus will take into account the generic heterogeneity of the content (tales, but also songs, news items, almanacs). In addition, and as far as possible given the profile of the doctoral student, the study would be multilingual: English chapbooks, German Volksbuch, Spanish papeles públicos and Portuguese literatura de cordel. As research on peddling is more developed in these other linguistic fields, the resources would be more readily exploitable.

The aim is to work jointly on textual data and metadata (dates, language, text genre, format, presence or absence of illustrations). Many resources already exist, in several languages (see the numerous digital libraries and other online databases listed infra under website references). The data produced as part of the thesis may be released in open formats to enrich existing websites devoted to digitized chapbook collections. Data obtained via OCR with trained models could be contributed, possibly enriched by applying Named Entity Recognition (useful for detecting referent variability).

Field:
Literature – Linguistics – Digital Humanities – NLP – History of print

REFERENCES

-    Andries, Lise, and Bollème, Geneviève (eds.), La Bibliothèque bleue, Paris, Laffont, 2003.
-    Botrel, Jean-François, "Les Aveugles colporteurs d'imprimés en Espagne", Mélanges de la Casa de Velázquez, 1993, n°9, p. 417-482.
-    Chartier, Roger, Lectures et lecteurs dans la France d'Ancien Régime, Paris, Seuil, Univers historique, 1987.
-    Chartier, Roger, and Lüsebrink, Hans-Jürgen (eds.), Colportage et lecture populaire. Imprimés de grande circulation en Europe, XVIᵉ-XIXᵉ siècles, Paris, IMEC Éditions et Éditions de la Maison des Sciences de l'Homme, 1996.
-    Dubois, Jacques, L'institution de la littérature. Introduction à une sociologie, Bruxelles, Éditions Labor/Fernand Nathan, 1983.
-    Dufter, Philipp, and Schütze, Hinrich, "Identifying Elements Essential for BERT's Multilinguality", in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, 2020, p. 4423-4437.
-    Fontaine, Laurence, Histoire du colportage en Europe (XVᵉ-XIXᵉ siècle), Paris, Albin Michel, 1993.
-    Imani Googhari, Ayyoob, Lin, Peiqin, Kargaran, Amir Hossein, Severini, Silvia, Sabet, Masoud Jalili, Kassner, Nora, Ma, Chunlan, Schmid, Helmut, Martins, André, Yvon, François, and Schütze, Hinrich, "Glot500: Scaling Multilingual Corpora and Language Models to 500 Languages", in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Toronto, Canada, Association for Computational Linguistics, 2023, p. 1082-1117.
-    Keilhauer, Annette, "La compétence vendue : Aspects de la relation entre éditeurs, chanteurs et public dans le colportage des chansons à Paris au XVIIIe s.", in Chansons de colportage, Reims, ÉPURE - Éditions et Presses universitaires de Reims, 2002, p. 171-190.
-    Mandrou, Robert, De la culture populaire aux XVIIᵉ-XVIIIᵉ siècles, La Bibliothèque bleue de Troyes, Paris, Stock, 1964.
 
