Reasoning About the Non-Existent in Large Language Models: Benchmark Success, Negation Failure, and Exploratory Creativity

Abdullah Önden; İsmail Önden

doi:10.59543/avhexs22

Authors

Abdullah Önden, Department of Computer Engineering, Faculty of Computer and Information Technologies, Istanbul University, Istanbul, Türkiye. https://orcid.org/0000-0003-3769-8193
İsmail Önden, Department of AI and Data Engineering, Faculty of Computer and Information Technologies, Istanbul University, Istanbul, Türkiye. https://orcid.org/0000-0003-2807-9454

DOI:

https://doi.org/10.59543/avhexs22

Keywords:

Frontier Large Language Models, Counterfactual Reasoning, Negation Blindness, Simulation-Reality Gap, Benchmark Saturation, CRASS, Live Evaluation

Abstract

Can artificial intelligence reason about things that do not exist, are not true, or are merely hypothetical? This paper addresses that question with a two-part analysis of non-existence reasoning in state-of-the-art large language models. The analysis covers four components: counterfactual inference, negation handling, creative novelty, and multi-turn persistence of imaginary knowledge. Part 1 uses a benchmark-calibrated forecasting model to predict future performance for 12 models on 15,300 simulated datapoints. Part 2 revisits the question with live API evaluations of 5 state-of-the-art models on over 654 items covering CRASS-style counterfactual questions, negation tasks, creativity prompts, and multi-turn dialogues about imaginary topics. The outcome is a differentiated capability profile rather than a simple yes-or-no answer to whether AI can reason about non-existence. The sampled CRASS items produced near-ceiling performance relative to the published human baseline, suggesting that some portions of the benchmark may no longer discriminate strongly among state-of-the-art models. By contrast, negation was associated with statistically significant performance drops in the four complete model runs, with the same directional tendency visible in the partial fifth run. The conclusions about creative novelty and multi-turn imaginary persistence are more provisional because they rest on proxy measures and heuristic contradiction detection. In substance, the results imply that top models can execute certain types of structured counterfactual reasoning effectively but display a robust and replicable deficit with respect to negation; indications of difficulty with creative novelty and multi-turn imaginative persistence are present but should be treated as provisional. In method, the results underscore the importance of treating benchmark-calibrated forecasts as perishable hypotheses that must be revalidated with live model evaluations in a rapidly evolving LLM environment.
