Chat about SOEP Research Data: A RAG System for Interactive SOEP Data Exploration

Publication details

Authors: Paylag Torossian, Jan Goebel, Knut Wenzig
Publication Date: 07.12.2025
Number: 14/2025
DOI: 10.5281/zenodo.17643487
Suggested citation:
Torossian, P., Goebel, J., & Wenzig, K. (2025). Chat about SOEP Research Data: A RAG System for Interactive SOEP Data Exploration. Konsortium für die Sozial-, Verhaltens-, Bildungs- und Wirtschaftswissenschaften (KonsortSWD). https://doi.org/10.5281/zenodo.17643487

ABSTRACT
The German Socio-Economic Panel (SOEP) is one of the world’s longest-running household panel studies, containing rich longitudinal data spanning over four decades. However, the complexity and scale of SOEP data present significant challenges for researchers in data discovery and exploration. This paper presents a prototype Retrieval-Augmented Generation (RAG) system designed to provide conversational access to SOEP research data through natural language queries. Built on OpenWebUI with PostgreSQL and Qdrant vector databases, while leveraging open-weight LLMs, our system enables researchers to interactively explore dataset descriptions, variable definitions, survey questions, and metadata. While currently covering a substantial portion of SOEP datasets, this prototype demonstrates the potential for conversational AI to enhance research data accessibility and streamline the data discovery process for panel data researchers.

Keywords: LLM, AI, RAG, variable search, metadata, DDI