Publication: Preserving Humanities Research Data: Data Depositing in the TextGrid Repository aka The Fluffy Import
Date
Authors
Journal Title
Journal ISSN
Volume Title
Publisher
Abstract
If preservation is use, what are the implications for a humanities research data infrastructure? Preserving research data long-term, making it accessible and reusable for the scientific community is a fundamental concern for research infrastructures. This boils down to an alignment of expectations between data depositor and data recipient. For example, the research data infrastructure has requirements relating to format, metadata, responsibilities and licenses. When research data has been successfully deposited/published in a repository, the next pitfall looms: the often unappealing presentation or constrained findability of data. Applying the famous words of John Cotton Dana: If "Preservation is use", the research data infrastructure has to emphasize potential and re-usability of data. The paper introduces the data depositing workflow of the TextGrid Repository (TGRep). The TextGrid Repository TGRep is a pioneer of the Digital Humanities in the German-speaking area. Today, TGRep is part of the Text+ portfolio, the NFDI consortium for language and text-based research data in Germany. Each Text+ data center offers a workflow for incorporating research data that complies to its scope, making it available for reuse. ELTeC ELTeC in TGRep, the European Literary Text Collection, serves as an example for the identification, consultation, ingest, transformation, enrichment, publication and integration in the portfolio of Text+, spelling re-usability and interoperability. ELTeC is a state-of-the-art, open access multilingual collection of corpora containing novels from several European traditions developed for several reasons, among them the development of tools and methods in Computational Literary Studies. Currently, ELTeC contains more than 2000 full-text novels in XML-TEI in 21 languages. They are distributed via multiple platforms (such as GitHub and Zenodo). 1365 full-texts in 15 languages are also published in the TGRep. The Data Depositing Workflow in TGRep As TGRep is of great relevance for Text+ and its community, so is the task of minimizing the effort spent when publishing data there. The solution implemented in Text+ consists of a workflow that automates the creation of the technical files required when importing into the repository, while allowing for as much manual intervention as needed. To use the system, the user interacts with a web-based user interface running inside a Jupyter notebook. After specifying the location of the TEI files to be...