Information Extraction


The goal of the Text2Tech project is the research and development of automated methods for information extraction from unstructured text sources in order to be able to provide companies with decision-relevant knowledge about technological developments quickly and efficiently. AI-based methods for information extraction (IE) already make it possible to extract selected information, e.g. B. to people, companies and places automatically from text sources. In the Text2Tech project, such approaches are to be further developed in order to extract machine-readable knowledge about technologies, technology categories, companies and their relationships with each other from German and English-language, domain-specific text sources, using the example of the automotive industry. The most important research goals are the modeling and filling of domain-specific knowledge graphs (Knowledge Base Population), the development of methods for cross-lingual proper name recognition and linking (Named Entity Recognition or Entity Linking), relation extraction (Relation Extraction), as well as the development of Model compression methods so that models run efficiently even on small hardware.


BIFOLD conducts foundational research in big data management and machine learning, as well as its intersection, to educate future talent, and create high-impact knowledge exchange. The Berlin Institute for the Foundations of Learning and Data (BIFOLD), has evolved in 2019 from the merger of two national Artificial Intelligence Competence Centers: the Berlin Big Data Center (BBDC) and the Berlin Center for Machine Learning (BZML). Embedded in the vibrant Berlin metropolitan area, BIFOLD provides an outstanding scientific environment and numerous collaboration opportunities for national and international researchers. BIFOLD offers a broad range of research topics as well as a platform for interdisciplinary research and knowledge exchange with the sciences and humanities, industry, startups and society. Within BIFOLD, DFKI SLT conducts research in Clinical AI, specifically addressing the task of Pharmacovigilance. Pharmacovigilance is concerned with the assessment and prevention of adverse drug reactions (ADR) in pharmaceutical products. As the level of medication is generally raising all over the world, the potential risk of unwanted side effects, such as ADRs, is constantly increasing. Patients exchange views in their own language as 'experts in their own right,' in social media and disease-specific forums. Our project addresses the detection and extraction of ADR from medical forums and social media across different languages using cross-lingual transfer learning in combination with external knowledge sources.


Language is implicit - it omits information. Filling this information gap requires contextual inference, background- and commonsense knowledge, and reasoning over situational context. Language also evolves, i.e., it specializes and changes over time. For example, many different languages and domains exist, new domains arise, and both evolve constantly. Thus, language understanding also requires continuous and efficient adaptation to new languages and domains, and transfer to, and between, both. Current language understanding technology, however, focuses on high resource languages and domains, uses little to no context, and assumes static data, task, and target distributions. The research in Cora4NLP aims to address these challenges. It builds on the expertise and results of the predecessor project DEEPLEE and is carried out jointly between the language technology research departments in Berlin and Saarbrücken.


In order to optimally prepare industry, science and the society in Germany and Europe for the global Big Data trend, highly coordinated activities in research, teaching, and technology transfer regarding the integration of data analysis methods and scalable data processing are required. To achieve this, the Berlin Big Data Center is pursuing the following seven objectives: 1) Pooling expertise in scalable data management, data analytics, and big data application 2) Conducting fundamental research to develop novel and automatically scalable technologies capable of performing 'Deep Analysis' of 'Big Data'. 3) Developing an integrated, declarative, highly scalable open-source system that enables the specification, automatic optimization, parallelization and hardware adaptation, and fault-tolerant, efficient execution of advanced data analysis problems, using varying methods (e.g., drawn from machine learning, linear algebra, statistics and probability theory, computational linguistics, or signal processing), leveraging our work on Apache Flink 4) Transfering technology and know-how to support innovation in companies and startups. 5) Educating data scientists with respect to the five big data dimensions (i.e., applications, economic, legal, social, and technological) via leading educational programs. 6) Empowering people to leverage 'Smart Data', i.e., to discover newfound information based on their massive data sets. 7)Enabling the general public to conduct sound data-driven decision-making.


The research work in DEEPLEE, which is carried out in the Language Technology research departments in Saabrücken and Berlin, builds on DFKI's expertise in the areas of deep learning (DL) and language technology (LT) and develops it further. They aim for profound improvements of DL approaches in LT by focusing on four central, open research topics: Modularity in DNN architectures, Use of external knowledge, DNNs with explanation functionality, Machine Teaching Strategies for DNNs


The aim of the PLASS project is to develop a prototypical B2B platform for AI-based decision support for supply chain management. The focus is on the automatic recognition of decision-relevant information and the acquisition of structured knowledge from global and multilingual text sources. These sources provide a large database for SCM information, especially for the early detection of critical events and risks, but also of opportunities, e.g. through new technologies, at suppliers and supply chains. PLASS enables SMEs and large companies to continuously monitor their suppliers and supply chains, and supports supply chain managers in risk assessment and decision-making.


In the SIM3S project, data from the BMVI data offerings mCloud and MDM will be linked, refined and jointly analysed with other open data, user-generated content and data from individual modes of transport and other mobility-relevant companies in order to remove barriers and barriers to discrimination in everyday mobility. For the implementation of the project, state-of-the-art technologies and methods from the areas of Big Data Intelligent Analysis of mass data and artificial intelligence, in particular Natural Language Processing (NLP), are used.

MobIE Corpus

This repository contains the DFKI MobIE Corpus (formerly "DAYSTREAM Corpus"), a dataset of 3,232 German-language documents collected between May 2015 - Apr 2019 that have been annotated with fine-grained geo-entities, such as location-street, location-stop and location-route, as well as standard named entity types (organization, date, number, etc). All location-related entities have been linked to either Open Street Map identifiers or database ids of Deutsche Bahn / Rhein-Main-Verkehrsverbund. The corpus has also been annotated with a set of 7 traffic-related n-ary relations and events, such as Accidents, Traffic jams, and Canceled Routes. It consists of Twitter messages, and traffic reports from e.g. radio stations, police and public transport providers. It allows for training and evaluating both named entity recognition algorithms that aim for fine-grained typing of geo-entities, entity linking of these entities, as well as n-ary relation extraction systems. You can find the description of the corpus here:

Product Corpus

The Product Corpus is a dataset of 174 English web pages and social media posts annotated for product and company named entities, and the relation CompanyProvidesProduct. The goal is to make extraction of non-standard, B2B products and relations from unstructured text easier and more reliable. The corpus is also annotated for coreference chains of companies and products.

SmartData Corpus

The SmartData Corpus is a dataset of 2598 German-language documents which has been annotated with fine-grained geo-entities, such as streets, stops and routes, as well as standard named entity types. It has also been annotated with a set of 15 traffic- and industry-related n-ary relations and events, such as Accidents, Traffic jams, Acquisitions, and Strikes. The corpus consists of newswire texts, Twitter messages, and traffic reports from radio stations, police and railway companies. It allows for training and evaluating both named entity recognition algorithms that aim for fine-grained typing of geo-entities, as well as n-ary relation extraction systems.