Language is implicit - it omits information. Filling this information gap requires contextual inference, background- and commonsense knowledge, and reasoning over situational context. Language also evolves, i.e., it specializes and changes over time. For example, many different languages and domains exist, new domains arise, and both evolve constantly. Thus, language understanding also requires continuous and efficient adaptation to new languages and domains, and transfer to, and between, both. Current language understanding technology, however, focuses on high resource languages and domains, uses little to no context, and assumes static data, task, and target distributions. The research in Cora4NLP aims to address these challenges. It builds on the expertise and results of the predecessor project DEEPLEE and is carried out jointly between the language technology research departments in Berlin and Saarbrücken.
The research work in DEEPLEE, which is carried out in the Language Technology research departments in Saabrücken and Berlin, builds on DFKI's expertise in the areas of deep learning (DL) and language technology (LT) and develops it further. They aim for profound improvements of DL approaches in LT by focusing on four central, open research topics: Modularity in DNN architectures, Use of external knowledge, DNNs with explanation functionality, Machine Teaching Strategies for DNNs
This repository contains the DFKI MobIE Corpus (formerly "DAYSTREAM Corpus"), a dataset of 3,232 German-language documents collected between May 2015 - Apr 2019 that have been annotated with fine-grained geo-entities, such as location-street, location-stop and location-route, as well as standard named entity types (organization, date, number, etc). All location-related entities have been linked to either Open Street Map identifiers or database ids of Deutsche Bahn / Rhein-Main-Verkehrsverbund. The corpus has also been annotated with a set of 7 traffic-related n-ary relations and events, such as Accidents, Traffic jams, and Canceled Routes. It consists of Twitter messages, and traffic reports from e.g. radio stations, police and public transport providers. It allows for training and evaluating both named entity recognition algorithms that aim for fine-grained typing of geo-entities, entity linking of these entities, as well as n-ary relation extraction systems. You can find the description of the corpus here: https://www.dfki.de/web/forschung/projekte-publikationen/publikationen-uebersicht/publikation/11741/
The Product Corpus is a dataset of 174 English web pages and social media posts annotated for product and company named entities, and the relation CompanyProvidesProduct. The goal is to make extraction of non-standard, B2B products and relations from unstructured text easier and more reliable. The corpus is also annotated for coreference chains of companies and products.
The SmartData Corpus is a dataset of 2598 German-language documents which has been annotated with fine-grained geo-entities, such as streets, stops and routes, as well as standard named entity types. It has also been annotated with a set of 15 traffic- and industry-related n-ary relations and events, such as Accidents, Traffic jams, Acquisitions, and Strikes. The corpus consists of newswire texts, Twitter messages, and traffic reports from radio stations, police and railway companies. It allows for training and evaluating both named entity recognition algorithms that aim for fine-grained typing of geo-entities, as well as n-ary relation extraction systems.