Reasoning-Grounded Natural Language Explanations for Language Models

Faculty of Information Technology, CTU in Prague
World Conference on Explainable Artificial Intelligence 2025

Abstract

We propose a large language model explainability technique for obtaining faithful natural language explanations by grounding the explanations in a reasoning process. When converted to a sequence of tokens, the outputs of the reasoning process can become part of the model context and later be decoded to natural language as the model produces either the final answer or the explanation. To improve the faithfulness of the explanations, we propose to use a joint predict-explain approach, in which the answers and explanations are inferred directly from the reasoning sequence, without the explanations being dependent on the answers and vice versa. We demonstrate the plausibility of the proposed technique by achieving a high alignment between answers and explanations in several problem domains, observing that language models often simply copy the partial decisions from the reasoning sequence into the final answers or explanations. Furthermore, we show that the proposed use of reasoning can also improve the quality of the answers.

Natural language explanations can potentially be easy to follow and unlimited in expressivity, but their faithfulness is typically questionable: for instance, the simple answer-then-explain setting tends to lead models to confabulate their explanations. Moreover, it is questionable whether LLMs produce their outputs through a thought process that is in any way related to human reasoning, as they are in essence mere enhancements of traditional n-gram models. Chain-of-thought reasoning is one notable improvement of the decision process, but it is too computationally intensive for ubiquitous use.

In this paper, we propose to ground natural language explanations, as well as the answers themselves, in a suitable resource-efficient LLM reasoning process. When converted to a sequence of tokens, the result of the reasoning process can then become part of the context observed by the model when it produces its final answer or explanation.

The grounding reasoning sequence does not have to be directly human-readable, as it merely has to encode the explanation together with the answer. This information can then be simply decoded from the reasoning sequence to natural language when the model generates the final answer or explanation. In order for the explanations to be credible, a joint predict-explain setting can be used, in which the answer and explanation are inferred independently of each other.
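A minimal sketch of this joint predict-explain inference flow is shown below, assuming a Hugging Face chat model fine-tuned on reasoning-answer-explanation data; the "reasoning:", "answer:" and "explanation:" cues and the checkpoint path are illustrative placeholders, not the paper's exact prompt format.

# Minimal sketch of the joint predict-explain inference flow, assuming a
# Hugging Face chat model fine-tuned on reasoning-answer-explanation data.
# The "reasoning:" / "answer:" / "explanation:" cues and the checkpoint path
# are illustrative placeholders, not the paper's exact prompt format.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "path/to/fine-tuned-model"  # hypothetical fine-tuned checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)


def generate(messages, max_new_tokens=128):
    """Run one generation pass over a chat-formatted conversation."""
    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(
        outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )


user_input = "Classify the sample: age=42, income=high"  # toy query

# Step 1: obtain the reasoning sequence (not necessarily human-readable).
reasoning_turns = [{"role": "user", "content": user_input + "\n\nreasoning:"}]
reasoning = generate(reasoning_turns)

# Step 2: decode the answer and the explanation from the same reasoning
# sequence in two independent passes, so neither depends on the other.
grounded = reasoning_turns + [{"role": "assistant", "content": reasoning}]
answer = generate(grounded + [{"role": "user", "content": "answer:"}])
explanation = generate(grounded + [{"role": "user", "content": "explanation:"}])

Because the answer and explanation generations never see each other, any agreement between them has to come through the shared reasoning sequence.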

Overview of our methodology

Overview of our methodology. In the first step, we gather a conversational dataset with a triplet of reasoning-answer-explanation ground truths for each user input. In the second step, we use this dataset to fine-tune a conversational GPT model. In the last step, we perform inference by first computing a reasoning sequence and then including it in the conversation to produce either the final answer or the explanation, which are obtained independently of each other.
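As a rough illustration of the first step, the sketch below shows how one reasoning-answer-explanation triplet could be serialized into fine-tuning conversations; the field contents, cue strings, and role layout are assumptions for illustration, not the paper's exact data format.

# Hedged sketch of turning one reasoning-answer-explanation triplet into
# fine-tuning records. To keep the answer and the explanation independent of
# each other, the triplet is split into two conversations that share the same
# reasoning turn. All contents below are hypothetical placeholders.
triplet = {
    "user_input": "Classify the sample: age=42, income=high",
    "reasoning": "node 0: income=high -> right; node 2: age>40 -> class A",
    "answer": "Class A",
    "explanation": "The sample is class A because income is high and age exceeds 40.",
}

shared_prefix = [
    {"role": "user", "content": triplet["user_input"] + "\n\nreasoning:"},
    {"role": "assistant", "content": triplet["reasoning"]},
]

# Two training conversations: one ends in the answer, the other in the
# explanation, so neither target is conditioned on the other.
answer_record = shared_prefix + [
    {"role": "user", "content": "answer:"},
    {"role": "assistant", "content": triplet["answer"]},
]
explanation_record = shared_prefix + [
    {"role": "user", "content": "explanation:"},
    {"role": "assistant", "content": triplet["explanation"]},
]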

We evaluate our explainability framework in an LLM-as-a-classifier setting, in which we train LLMs to mimic the behavior of simple machine learning classifiers such as decision trees (a sketch of how such triplets can be derived from a decision tree follows the contribution list below). Our paper makes the following contributions to the field of LLM explainability:

  1. We propose an LLM explainability technique for producing faithful natural language explanations grounded in reasoning.
  2. We observe that when a suitable reasoning process is included in LLM training, and the outputs of the reasoning process are placed in the LLM input contexts, LLMs will often copy the partial decisions from the reasoning sequence into their answers or explanations.
  3. We demonstrate the plausibility of our proposed explainability technique by achieving a high alignment between answers and explanations in several problem domains.
  4. We show that besides enabling faithful natural language explanations, the inclusion of the reasoning process can also improve the quality of answers.
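The sketch referenced above shows one way the reasoning-answer-explanation triplets could be derived from a scikit-learn decision tree in the LLM-as-a-classifier setting; the textual templates and the Iris toy task are illustrative assumptions rather than the paper's exact serialization.

# Rough sketch of building LLM-as-a-classifier triplets from a decision tree,
# assuming the tree's decision path is serialized as the reasoning sequence.
# The textual templates below are illustrative, not the paper's exact phrasing.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

data = load_iris()
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(data.data, data.target)


def make_triplet(x):
    """Turn one feature vector into a reasoning-answer-explanation triplet."""
    node_ids = tree.decision_path([x]).indices  # nodes visited from root to leaf
    steps = []
    for node in node_ids:
        feat = tree.tree_.feature[node]
        if feat < 0:  # leaf node, no split to describe
            continue
        name = data.feature_names[feat]
        threshold = tree.tree_.threshold[node]
        comparator = "<=" if x[feat] <= threshold else ">"
        steps.append(f"{name}={x[feat]:.1f} {comparator} {threshold:.1f}")
    answer = data.target_names[tree.predict([x])[0]]
    return {
        "user_input": "Classify the flower: " + ", ".join(
            f"{n}={v:.1f}" for n, v in zip(data.feature_names, x)
        ),
        "reasoning": "; ".join(steps),
        "answer": str(answer),
        "explanation": f"The flower is {answer} because " + " and ".join(steps) + ".",
    }


print(make_triplet(data.data[0]))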

Experiments on a decision tree dataset

Experiments with joint training of answers and explanations on a decision tree dataset. The colored regions correspond to ground-truth classes. When reasoning is used, answer and explanation classification errors are typically near-perfectly aligned.
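One possible way to quantify the reported answer-explanation alignment is sketched below: extract the class named in each explanation and compare it with the class given as the answer. The parsing rule and the alignment_score helper are assumptions for illustration; the paper's metric may be computed differently.

# Hedged sketch of an answer-explanation alignment measure for a
# classification task: count how often the explanation names the same class
# as the independently generated answer. Parsing rule is illustrative only.
def alignment_score(answers, explanations, class_names):
    """Fraction of examples whose explanation names the same class as the answer."""
    def extract_class(text):
        # Return the first known class name mentioned in the text, if any.
        for name in class_names:
            if name.lower() in text.lower():
                return name
        return None

    matches = sum(
        1 for ans, expl in zip(answers, explanations)
        if extract_class(expl) == ans
    )
    return matches / len(answers)


# Toy usage:
print(alignment_score(
    answers=["Class A", "Class B"],
    explanations=["... therefore the sample is Class A.", "... hence Class A."],
    class_names=["Class A", "Class B"],
))  # -> 0.5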

BibTeX

@misc{cahlik2025reasoninggroundednaturallanguageexplanations,
      title={Reasoning-Grounded Natural Language Explanations for Language Models}, 
      author={Vojtech Cahlik and Rodrigo Alves and Pavel Kordik},
      year={2025},
      eprint={2503.11248},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2503.11248}, 
}