Textual data are available in rapidly growing volumes and can be processed with ever more elaborate methods. Integrated qualitative and quantitative analysis of these data holds great potential for many disciplines to address new research questions. However, new textual data are rarely linked to other data widely used in the social sciences, such as social media or survey data. We are developing a service that will enable this linkage, for example by using unique identifiers at the level of persons or institutions. In doing so, it is important not to focus too narrowly on technical implementation but also to consider the role of user-friendliness when developing such tools.
By linking textual data with other types of data, this work package aims to better exploit the potential of social science data, to open up new research paths, and to document opportunities for linkage and make them available for later re-use. The service builds on the experience gathered in the PolMine project (https://polmine.github.io), which prepares textual data for social science research, develops workflows and tools for the preparation and analysis of such data, and makes these resources available to the community. Within KonsortSWD, data and procedures will be systematically made available to the social, behavioural, educational, and economic sciences for secondary use.
The GermaParl corpus of plenary debates in the German Bundestag from 1949 to 2021 contains initial annotations for data linkage. Access to the beta version can be requested via Zenodo (https://doi.org/10.5281/zenodo.6539967). Everyone is invited to work with the data and to help improve its user-friendliness by providing feedback. Open Access publication of the corpus is planned for autumn 2022. In addition to the data, generic tools implemented in R will be provided for the linguistic annotation of large corpora (R package ‘bignlp’) and for assigning unique identifiers (R package ‘linktools’).
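To illustrate the general idea of linkage via unique identifiers, the following is a minimal sketch in Python. It does not reproduce the interface of ‘linktools’ or ‘bignlp’. The records, the field name `person_id`, and the function `link_records` are hypothetical and serve only to show how a shared identifier lets text records be joined with an external data set.

```python
# Hypothetical speech records from a text corpus. "person_id" stands in for
# whatever shared unique identifier both data sets carry (an assumption).
speeches = [
    {"speaker": "A. Example", "person_id": "P001", "text": "First speech."},
    {"speaker": "A. Example", "person_id": "P001", "text": "Second speech."},
    {"speaker": "B. Sample", "person_id": "P002", "text": "Another speech."},
    {"speaker": "C. Unknown", "person_id": None, "text": "Unlinked speech."},
]

# Hypothetical external records (e.g. survey or biographical data),
# keyed by the same identifier.
external = {
    "P001": {"party": "Party X", "year_of_birth": 1960},
    "P002": {"party": "Party Y", "year_of_birth": 1975},
}


def link_records(text_records, external_records, key="person_id"):
    """Join text records with external records on a shared identifier.

    Returns two lists: records that could be linked (with the external
    fields merged in) and records for which no match was found.
    """
    linked, unlinked = [], []
    for record in text_records:
        match = external_records.get(record.get(key))
        if match is None:
            # No identifier, or identifier absent from the external data.
            unlinked.append(record)
        else:
            linked.append({**record, **match})
    return linked, unlinked


linked, unlinked = link_records(speeches, external)
print(f"{len(linked)} linked, {len(unlinked)} unlinked")
```

Keeping the unlinked records in a separate list rather than dropping them makes the coverage of a linkage visible, which matters when documenting linkage opportunities for later re-use.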