Datasets

LLM-based FAQ Rewrites

We introduce a German-language dataset of FAQ (Frequently Asked Questions) question-answer pairs: raw FAQ drafts, their revisions by professional editors, and LLM-generated revisions. The data was used to investigate how large language models (LLMs) can support the editorial process of rewriting customer help pages. The corpus comprises 56 question-answer pairs addressing potential customer inquiries across various topics. For each FAQ pair, a raw input is provided by specialized departments, and a rewritten gold output is crafted by a professional editor at Deutsche Telekom. The final dataset also includes LLM-generated FAQ pairs. Please see our [paper](https://aclanthology.org/2024.inlg-main.13/) accepted at INLG 2024, Tokyo, Japan. You can find the GitHub repo containing the dataset here: [https://github.com/DFKI-NLP/faq-rewrites-llms](https://github.com/DFKI-NLP/faq-rewrites-llms).

The MultiTACRED dataset

MultiTACRED is a multilingual version of the large-scale [TAC Relation Extraction Dataset](https://nlp.stanford.edu/projects/tacred). It covers 12 typologically diverse languages from 9 language families, and was created by machine-translating the instances of the original TACRED dataset and automatically projecting their entity annotations. For details of the original TACRED's data collection and annotation process, see the [Stanford paper](https://aclanthology.org/D17-1004/). Translations are syntactically validated by checking the correctness of the XML tag markup. Any translations with an invalid tag structure, e.g. missing or invalid head or tail tag pairs, are discarded (on average, 2.3% of the instances). The languages covered are Arabic, Chinese, Finnish, French, German, Hindi, Hungarian, Japanese, Polish, Russian, Spanish, and Turkish. The intended use is supervised relation classification, and the intended audience is researchers. The dataset will be released via the LDC (link will follow). Please see [our ACL paper](https://arxiv.org/abs/2305.04582) for full details. You can find the GitHub repo containing the translation and experiment code here: [https://github.com/DFKI-NLP/MultiTACRED](https://github.com/DFKI-NLP/MultiTACRED).
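The tag-structure check described above can be illustrated with a minimal sketch. This is not the MultiTACRED implementation: the tag names `<H>` and `<T>` for head and tail entities, and the helper names `has_valid_markup` and `filter_valid`, are assumptions made for illustration only. The sketch accepts a translation only if it contains exactly one properly closed head span and one properly closed tail span.

```python
import re

# Assumed tag names for illustration: <H>...</H> marks the head entity,
# <T>...</T> marks the tail entity. The real markup may differ.
TAG_RE = re.compile(r"</?(H|T)>")


def has_valid_markup(text: str) -> bool:
    """Return True if `text` has exactly one well-formed head span
    and exactly one well-formed tail span."""
    stack = []                      # currently open tags
    opened = {"H": 0, "T": 0}       # how often each tag was opened
    for match in TAG_RE.finditer(text):
        name = match.group(1)
        if match.group(0).startswith("</"):
            # A closing tag must match the most recently opened tag.
            if not stack or stack[-1] != name:
                return False
            stack.pop()
        else:
            stack.append(name)
            opened[name] += 1
    # No tag may remain open, and each entity must appear exactly once.
    return not stack and opened == {"H": 1, "T": 1}


def filter_valid(translations):
    """Keep only translations whose tag structure is valid."""
    return [t for t in translations if has_valid_markup(t)]
```

For example, `filter_valid` would keep a sentence such as `<H>Anna</H> arbeitet bei <T>DFKI</T>.` and drop one in which the translation system lost the closing `</H>` tag.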

Ex4CDS - Textual Explanations for Clinical Decision Support

Ex4CDS is a collection of explanations (or, more precisely, justifications) written by physicians in the context of clinical decision support. In the course of a larger study, physicians estimated the probability of different clinical outcomes in nephrology, namely rejection, graft loss and infections, within the next 90 days. Each estimate had to be justified in a short text; these texts are our explanations. The explanations were provided in German and have strong similarities to general clinical notes. You can find a description and the data here: https://github.com/DFKI-NLP/Ex4CDS

German Adverse Drug Reaction (ADR) detection in patient-generated content

In this work, we present the first corpus for German Adverse Drug Reaction (ADR) detection in patient-generated content. The data consists of 4,169 binary annotated documents from a German patient forum, where users talk about health issues and get advice from medical doctors. As is common in social media data in this domain, the class labels of the corpus are very imbalanced. This, together with a high topic imbalance, makes it a very challenging dataset, since the same symptom can often have several causes and is not always related to medication intake. We aim to encourage further multilingual efforts in the domain of ADR detection. More info: https://aclanthology.org/2022.lrec-1.388/

MobASA Corpus

This repository contains the MobASA corpus: a novel German-language corpus of tweets annotated with their relevance for public transportation, and with sentiment towards aspects related to barrier-free travel. We identified and labeled topics important for passengers limited in their mobility due to disability, age, or when travelling with young children. The data can be used as a training or test corpus for aspect-oriented sentiment analysis. Moreover, the corpus can support building inclusive public transportation systems. You can find the corpus here: https://github.com/DFKI-NLP/sim3s-corpus, and the description of the corpus here: https://aclanthology.org/2022.csrnlp-1.5.pdf

MobIE Corpus

This repository contains the DFKI MobIE Corpus (formerly "DAYSTREAM Corpus"), a dataset of 3,232 German-language documents collected between May 2015 and April 2019 that have been annotated with fine-grained geo-entities, such as location-street, location-stop and location-route, as well as standard named entity types (organization, date, number, etc.). All location-related entities have been linked to either Open Street Map identifiers or database ids of Deutsche Bahn / Rhein-Main-Verkehrsverbund. The corpus has also been annotated with a set of 7 traffic-related n-ary relations and events, such as Accidents, Traffic jams, and Canceled Routes. It consists of Twitter messages and traffic reports from, e.g., radio stations, police and public transport providers. It allows for training and evaluating named entity recognition algorithms that aim for fine-grained typing of geo-entities, entity linking of these entities, and n-ary relation extraction systems. You can find the description of the corpus here: https://www.dfki.de/web/forschung/projekte-publikationen/publikationen-uebersicht/publikation/11741/

Product Corpus

The Product Corpus is a dataset of 174 English web pages and social media posts annotated for product and company named entities, and the relation CompanyProvidesProduct. The goal is to make extraction of non-standard, B2B products and relations from unstructured text easier and more reliable. The corpus is also annotated for coreference chains of companies and products.

SmartData Corpus

The SmartData Corpus is a dataset of 2,598 German-language documents that have been annotated with fine-grained geo-entities, such as streets, stops and routes, as well as standard named entity types. It has also been annotated with a set of 15 traffic- and industry-related n-ary relations and events, such as Accidents, Traffic jams, Acquisitions, and Strikes. The corpus consists of newswire texts, Twitter messages, and traffic reports from radio stations, police and railway companies. It allows for training and evaluating both named entity recognition algorithms that aim for fine-grained typing of geo-entities and n-ary relation extraction systems.