Textual data are available in rapidly growing volumes and can be processed with increasingly elaborate methods. Integrated qualitative and quantitative analysis of these data holds great potential for many disciplines to address new research questions. However, new textual data are rarely linked to other data widely used in the social sciences, such as social media or survey data. We are developing a service that will enable this linkage, for example by using unique identifiers at the level of persons or institutions. In doing so, it is important not to focus too narrowly on technical implementation but also to consider user-friendliness when developing such tools.
By linking textual data with other types of data, this work package aims to better exploit the potential of social science data, open up new research paths, and document opportunities for linkage and make them available for later re-use. The service builds on the experience gathered in the PolMine project (https://polmine.github.io), which prepares textual data for social science research, develops workflows and tools for the preparation and analysis of such data, and makes these resources available to the community. Within KonsortSWD, data and procedures will be systematically made available to the social, behavioural, educational, and economic sciences for secondary use.
Corpora with initial annotations for data linkage will be available by mid-2022. In addition to the data, generic tools implemented in R will be provided for the linguistic annotation of large corpora (R package ‘bignlp’) and for assigning unique identifiers (R package ‘linktools’).
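
The idea of linking textual data to other data through a shared unique identifier can be illustrated with a minimal sketch. It is written in plain Python rather than R and does not use or reproduce the interface of ‘linktools’ or ‘bignlp’. All records, field names (such as `person_id` and `party`), and the helper `link_by_id` are hypothetical and serve only to show the principle: once each text carries a person-level identifier, it can be joined to any other data set that uses the same identifier.

```python
from collections import defaultdict

# Hypothetical text records from an annotated corpus. "person_id" is an
# assumed shared identifier, not a field name taken from any real corpus.
speeches = [
    {"person_id": "P001", "date": "2021-03-04", "text": "We need to invest in schools."},
    {"person_id": "P002", "date": "2021-03-04", "text": "The budget is already stretched."},
    {"person_id": "P001", "date": "2021-05-20", "text": "Teachers deserve better pay."},
    {"person_id": "P999", "date": "2021-05-20", "text": "No matching record exists."},
]

# Hypothetical person-level records from a second data source.
persons = [
    {"person_id": "P001", "party": "A", "birth_year": 1970},
    {"person_id": "P002", "party": "B", "birth_year": 1982},
]


def link_by_id(left, right, key):
    """Left-join `left` onto `right` by `key`; unmatched rows are kept and flagged."""
    lookup = {row[key]: row for row in right}
    linked = []
    for row in left:
        match = lookup.get(row[key])
        # Copy the text record, add the matched attributes (if any),
        # and record whether a match was found.
        merged = {**row, **(match or {}), "matched": match is not None}
        linked.append(merged)
    return linked


linked = link_by_id(speeches, persons, "person_id")

# Example of a question that only becomes answerable after linkage:
# how many texts are there per party?
speeches_per_party = defaultdict(int)
for row in linked:
    if row["matched"]:
        speeches_per_party[row["party"]] += 1
```

Keeping unmatched texts and flagging them, rather than dropping them silently, is a deliberate choice in this sketch: it makes the coverage of the linkage visible, which matters when documenting linkage opportunities for later re-use.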