Cross-Linguistic Data Formats: Using Standards in Digitalization to Contribute to the Creation and Curation of Language Data
Standardization and retro-standardization can help to make existing datasets in linguistics comparable and share them with a broader public. The "Cross-Linguistic Data Formats" initiative develops new standards for multilingual language data and applies them to linguistic datasets in order to increase their reusability and transparency.
The "Cross-Linguistic Data Formats" initiative (CLDF) was founded in 2014 and has since been extended along different dimensions in various projects. The goal of the initiative is to provide standards for cross-linguistic data and to apply them to the large body of digitally available language data, creating a pool of research data for historical and typological language comparison that can be analyzed with unified methods.
At the Chair of Multilingual Computational Linguistics, we plan to extend the CLDF initiative by concentrating on areas that CLDF has not yet targeted. Specifically, we target the modeling of texts in various forms (example sentences in grammars, poems, larger corpora) and plan to address additional linguistic constructs (morphology, lexicon, syntax). Additionally, we want to provide server structures that help colleagues deploy their own data online in the CLLD framework in order to make their data available to a larger circle of users.
|Principal Investigator(s) at the University||Prof. Dr. Johann-Mattis List (Lehrstuhl für Multilinguale Computerlinguistik)|
|Project period||01.04.2023 - 31.03.2028|
The CLDF initiative was originally funded by the Max Planck Society. Over the years, parts of the CLDF specification and their application were funded through other research projects and funders. These include, among others, the European Research Council, as part of the project "Computer-Assisted Language Comparison", led by Johann-Mattis List from 2017 to 2022. With List's move to Passau, additional funding will be provided by the University of Passau via the Chair of Multilingual Computational Linguistics.