DFKI-NLP is a Natural Language Processing group of researchers, software engineers and students at the Berlin office of the German Research Center for Artificial Intelligence (DFKI) working on basic and applied research in areas covering, among others, information extraction, knowledge base population, dialogue, sentiment analysis, and summarization. We are particularly interested in core research on learning in low-resource settings, reasoning over larger contexts, and continual learning. We strive for a deeper understanding of human language and thinking, with the goal of developing novel methods for processing and generating human language text, speech, and knowledge. An important part of our work is the creation of corpora, the evaluation of NLP datasets and tasks, and the explainability of (neural) models.

Key topics:

  • Applied / domain-specific information extraction
  • Learning in low-resource settings and over large contexts
  • Construction and analysis of IE datasets, linguistic annotation
  • Multilingual information extraction
  • Evaluation methodology research
  • Explainability

Our group forms a part of DFKI’s Speech and Language Technology department led by Prof. Sebastian Möller, and closely collaborates with e.g. the Technische Universität Berlin, DFKI’s Language Technology and Multilinguality department and DFKI’s Intelligent Analytics for Massive Data group.

Latest News

Recent Publications

FitCF: A Framework for Automatic Feature Importance-guided Counterfactual Example Generation

Counterfactual examples are widely used in natural language processing (NLP) as valuable data to improve models, and in explainable artificial intelligence (XAI) to understand model behavior. The automated generation of counterfactual examples remains a challenging task even for large language models (LLMs), despite their impressive performance on many tasks. In this paper, we first introduce ZeroCF, a faithful approach for leveraging important words derived from feature attribution methods to generate counterfactual examples in a zero-shot setting. Second, we present a new framework, FitCF, which further verifies aforementioned counterfactuals by label flip verification and then inserts them as demonstrations for few-shot prompting, outperforming two state-of-the-art baselines. Through ablation studies, we identify the importance of each of FitCF’s core components in improving the quality of counterfactuals, as assessed through flip rate, perplexity, and similarity measures. Furthermore, we show the effectiveness of LIME and Integrated Gradients as backbone attribution methods for FitCF and find that the number of demonstrations has the largest effect on performance. Finally, we reveal a strong correlation between the faithfulness of feature attribution scores and the quality of generated counterfactuals.

Reverse Probing: Evaluating Knowledge Transfer via Finetuned Task Embeddings for Coreference Resolution

In this work, we reimagine classical probing to evaluate knowledge transfer from simple source to more complex target tasks. Instead of probing frozen representations from a complex source task on diverse simple target probing tasks (as usually done in probing), we explore the effectiveness of embeddings from multiple simple source tasks on a single target task. We select coreference resolution, a linguistically complex problem requiring contextual understanding, as focus target task, and test the usefulness of embeddings from comparably simpler tasks tasks such as paraphrase detection, named entity recognition, and relation extraction. Through systematic experiments, we evaluate the impact of individual and combined task embeddings. Our findings reveal that task embeddings vary significantly in utility for coreference resolution, with semantic similarity tasks (e.g., paraphrase detection) proving most beneficial. Additionally, representations from intermediate layers of fine-tuned models often outperform those from final layers. Combining embeddings from multiple tasks consistently improves performance, with attention-based aggregation yielding substantial gains. These insights shed light on relationships between task-specific representations and their adaptability to complex downstream tasks, encouraging further exploration of embedding-level task transfer.

Cross-Refine: Improving Natural Language Explanation Generation by Learning in Tandem

Natural language explanations (NLEs) are vital for elucidating the reasoning behind large language model (LLM) decisions. Many techniques have been developed to generate NLEs using LLMs. However, like humans, LLMs might not always produce optimal NLEs on first attempt. Inspired by human learning processes, we introduce Cross-Refine, which employs role modeling by deploying two LLMs as generator and critic, respectively. The generator outputs a first NLE and then refines this initial explanation using feedback and suggestions provided by the critic. Cross-Refine does not require any supervised training data or additional training. We validate Cross-Refine across three NLP tasks using three state-of-the-art open-source LLMs through automatic and human evaluation. We select Self-Refine (Madaan et al., 2023) as the baseline, which only utilizes self-feedback to refine the explanations. Our findings from automatic evaluation and a user study indicate that Cross-Refine outperforms Self-Refine. Meanwhile, Cross-Refine can perform effectively with less powerful LLMs, whereas Self-Refine only yields strong results with ChatGPT. Additionally, we conduct an ablation study to assess the importance of feedback and suggestions. Both of them play an important role in refining explanations. We further evaluate Cross-Refine on a bilingual dataset in English and German.

Projects

*
TRAILS - Trustworthy and Inclusive Machines
Natural language processing (NLP) has demonstrated impressive performance in some human tasks. To achieve such performance, current neural models need to be pre-trained on huge amounts of raw text data. This dependence on uncurated data has at least four indirect and unintended consequences: 1) Uncurated data tends to be linguistically and culturally non-diverse due to the statistical dominance of major languages and dialects in online texts (English vs. North Frisian, US English vs. UK English, etc.). 2) Pre-trained neural models such as the ubiquitous pre-trained language models (PLM) reproduce the features present in the data, including human biases. 3) Rare phenomena (or languages) in the ‘long tail’ are often not sufficiently taken into account in model evaluation, leading to an underestimation of model performance, especially in real-world application scenarios. 4) The focus on achieving state-of-the-art results through the use of transfer learning with giant PLMs such as GPT4 or mT5 often underestimates alternative methods that are more accessible, efficient and sustainable. As inclusion and trust are undermined by these problems, in TRAILS we focus on three main research directions to address such problems: (i) inclusion of underrepresented languages and cultures through multilingual and culturally sensitive NLP, (ii) robustness and fairness with respect to long-tail phenomena and classes and ‘trustworthy content’, and (iii) robust and efficient NLP models that enable training and deployment of models for (i) and (ii). We also partially address economic inequality by aiming for more efficient models (objective (iii)), which directly translates into a lower resource/cost footprint.
TRAILS - Trustworthy and Inclusive Machines
BIFOLD
BIFOLD conducts foundational research in big data management and machine learning, as well as its intersection, to educate future talent, and create high-impact knowledge exchange. The Berlin Institute for the Foundations of Learning and Data (BIFOLD), has evolved in 2019 from the merger of two national Artificial Intelligence Competence Centers: the Berlin Big Data Center (BBDC) and the Berlin Center for Machine Learning (BZML). Embedded in the vibrant Berlin metropolitan area, BIFOLD provides an outstanding scientific environment and numerous collaboration opportunities for national and international researchers. BIFOLD offers a broad range of research topics as well as a platform for interdisciplinary research and knowledge exchange with the sciences and humanities, industry, startups and society. Within BIFOLD, DFKI SLT conducts research in Clinical AI, specifically addressing the task of Pharmacovigilance. Pharmacovigilance is concerned with the assessment and prevention of adverse drug reactions (ADR) in pharmaceutical products. As the level of medication is generally raising all over the world, the potential risk of unwanted side effects, such as ADRs, is constantly increasing. Patients exchange views in their own language as ‘experts in their own right,’ in social media and disease-specific forums. Our project addresses the detection and extraction of ADR from medical forums and social media across different languages using cross-lingual transfer learning in combination with external knowledge sources.
BIFOLD
BBDC2
In order to optimally prepare industry, science and the society in Germany and Europe for the global Big Data trend, highly coordinated activities in research, teaching, and technology transfer regarding the integration of data analysis methods and scalable data processing are required. To achieve this, the Berlin Big Data Center is pursuing the following seven objectives: 1) Pooling expertise in scalable data management, data analytics, and big data application 2) Conducting fundamental research to develop novel and automatically scalable technologies capable of performing ‘Deep Analysis’ of ‘Big Data’. 3) Developing an integrated, declarative, highly scalable open-source system that enables the specification, automatic optimization, parallelization and hardware adaptation, and fault-tolerant, efficient execution of advanced data analysis problems, using varying methods (e.g., drawn from machine learning, linear algebra, statistics and probability theory, computational linguistics, or signal processing), leveraging our work on Apache Flink 4) Transfering technology and know-how to support innovation in companies and startups. 5) Educating data scientists with respect to the five big data dimensions (i.e., applications, economic, legal, social, and technological) via leading educational programs. 6) Empowering people to leverage ‘Smart Data’, i.e., to discover newfound information based on their massive data sets. 7)Enabling the general public to conduct sound data-driven decision-making.
BBDC2

Datasets

The MultiTACRED dataset
MultiTACRED is a multilingual version of the large-scale TAC Relation Extraction Dataset. It covers 12 typologically diverse languages from 9 language families, and was created by machine-translating the instances of the original TACRED dataset and automatically projecting their entity annotations. For details of the original TACRED’s data collection and annotation process, see the Stanford paper. Translations are syntactically validated by checking the correctness of the XML tag markup. Any translations with an invalid tag structure, e.g. missing or invalid head or tail tag pairs, are discarded (on average, 2.3% of the instances). Languages covered are: Arabic, Chinese, Finnish, French, German, Hindi, Hungarian, Japanese, Polish, Russian, Spanish, Turkish. Intended use is supervised relation classification. Audience - researchers. The dataset will be released via the LDC (link will follow). Please see our ACL paper for full details. You can find the Github repo containing the translation and experiment code here https://github.com/DFKI-NLP/MultiTACRED.

Contact

  • Alt-Moabit 91c
    10559 Berlin
  • Enter Alt-Moabit 91c and take the elevator to Reception on Floor 4
  • 9:00 to 17:00 Monday to Friday