AISOC Team

Chair of Artificial Intelligence and Society · Ruhr University Bochum · UA Ruhr RC Trust


About

The AI and Society (AISOC) team is headed by Prof. Bilal Zafar. We are part of the Faculty of Computer Science at Ruhr University Bochum and affiliated with the Research Center for Trustworthy Data Science and Security (RC Trust) and the Cluster of Excellence CASA.

Our work aims to advance the development of trustworthy Artificial Intelligence (AI) and Machine Learning (ML) systems. By integrating multi-disciplinary perspectives, we aim to establish a comprehensive and practical framework for AI/ML trustworthiness. The key questions guiding our current research are:

  • Quantifying LLMs’ ability to understand the world: The performance of LLMs regularly matches or even exceeds that of humans on popular benchmarks. However, proficiency in solving benchmarks does not equate to human-level reasoning. For instance, a high score on a graduate-level math benchmark rarely implies the same conceptual understanding as that of a graduate-level math student. Since LLMs understand the world very differently than humans do, our work aims to quantify these discrepancies.
  • Measuring and calibrating trust in AI: Despite their impressive capabilities, AI models continue to exhibit vulnerabilities such as bias, unfaithful explanations, and hallucinations. At the same time, it is not always clear what it means for an LLM to be biased or for an explanation to be satisfactory. Our work draws insights from the social sciences to operationalize these concepts and mitigate problematic behaviors.
  • Deployment challenges: Present-day AI models are so compute-intensive that efficient deployment requires specially designed acceleration techniques such as weight quantization and token pruning. While these strategies enhance efficiency, they can adversely affect predictive performance. Our work aims to understand the impact of acceleration techniques and to balance efficiency against performance.

Latest News


Jan 2026 Work on Memories in ChatGPT, jointly led by Abhisek, Soumi, Qinyuan, and Elisabeth, got accepted at WWW'26.
Jan 2026 Debtanu's summer internship work on hallucination detection in low-resource languages got accepted at EACL'26.
Jan 2026 Qinyuan's paper on Generalizing over Memorized Data in LLMs got accepted at ICLR'26.
Jan 2026 Leon joined the team as a PhD student. Welcome Leon!
Dec 2025 Our team co-organized the workshop on Metacognition in Generative AI (co-located with EurIPS) with Marcel Binz, Hamed Hassani, Nastaran Okati and Isabel Valera.
Nov 2025 Elisabeth's work on generative search was covered by Ars Technica, Tagesschau, and Heise. Read the paper here.

Recent Publications and Preprints

  1. Characterizing Web Search in The Age of Generative AI
    Elisabeth Kirsten, Jost Grosse Perdekamp, Mihir Upadhyay, Krishna P. Gummadi, and Muhammad Bilal Zafar
    2025
  2. The Algorithmic Self-Portrait: Deconstructing Memory in ChatGPT
    Abhisek Dash, Soumi Das, Elisabeth Kirsten, Qinyuan Wu, Sai Keerthana Karnam, Krishna P. Gummadi, Thorsten Holz, Muhammad Bilal Zafar, and Savvas Zannettou
    In The Web Conference, 2026
  3. Do LLM hallucination detectors suffer from low-resource effect?
    Debtanu Datta, Mohan Kishore Chilukuri, Yash Kumar, Saptarshi Ghosh, and Muhammad Bilal Zafar
    In European Chapter of the Association for Computational Linguistics, 2026
  4. Rote Learning Considered Useful: Generalizing over Memorized Data in LLMs
    Qinyuan Wu, Soumi Das, Mahsa Amani, Bishwamittra Ghosh, Mohammad Aflah Khan, Krishna P. Gummadi, and Muhammad Bilal Zafar
    In International Conference on Learning Representations, 2026
  5. Can LLMs Explain Themselves Counterfactually?
    Zahra Dehghanighobadi, Asja Fischer, and Muhammad Bilal Zafar
    In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025
  6. Position is Power: System Prompts as a Mechanism of Bias in Large Language Models (LLMs)
    Anna Neumann, Elisabeth Kirsten, Muhammad Bilal Zafar, and Jatinder Singh
    In Proceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency, 2025
  7. The Impact of Inference Acceleration on Bias of LLMs
    Elisabeth Kirsten, Ivan Habernal, Vedant Nanda, and Muhammad Bilal Zafar
    In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics, 2025
  8. On Early Detection of Hallucinations in Factual Question Answering
    Ben Snyder, Marius Moisescu, and Muhammad Bilal Zafar
    In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2024