Scoring
ConversationScorer
Bases: ABC
Pluggable quality scorer for AI responses.
Implement this ABC to evaluate AI response quality. Scorers are
invoked by :class:`~roomkit.scoring.ScoringHook` after each AI
response via the ON_AI_RESPONSE hook.
Implementations can be:
- LLM-as-judge — call a separate model to rate the response
- Rule-based — regex/keyword checks, length validation
- Heuristic — latency thresholds, tool usage patterns
- Human feedback — bridge to user rating collection
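As an illustration of the rule-based option above, here is a minimal sketch of a keyword-check scorer. The `Score` and `ConversationScorer` classes below are local stand-ins that mirror the documented interface so the example runs on its own; in a real project they would come from RoomKit. `BannedPhraseScorer`, the `"safety"` dimension choice, and the field defaults are assumptions made for this example.

```python
from __future__ import annotations

import asyncio
import re
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any


# Stand-ins mirroring the documented interface (assumed defaults for
# reason/metadata). Replace with the real RoomKit classes in practice.
@dataclass
class Score:
    value: float
    dimension: str
    reason: str = ""
    metadata: dict[str, Any] = field(default_factory=dict)


class ConversationScorer(ABC):
    @abstractmethod
    async def score(
        self,
        response_content: str,
        query: str,
        room_id: str,
        channel_id: str,
        usage: dict[str, Any] | None = None,
        thinking: str = "",
    ) -> list[Score]: ...


class BannedPhraseScorer(ConversationScorer):
    """Hypothetical rule-based scorer: flags responses containing banned phrases."""

    def __init__(self, banned: list[str]) -> None:
        if not banned:
            raise ValueError("banned must contain at least one phrase")
        # One case-insensitive alternation over the escaped phrases.
        self._pattern = re.compile(
            "|".join(re.escape(p) for p in banned), re.IGNORECASE
        )

    async def score(
        self,
        response_content: str,
        query: str,
        room_id: str,
        channel_id: str,
        usage: dict[str, Any] | None = None,
        thinking: str = "",
    ) -> list[Score]:
        if not response_content.strip():
            return []  # nothing to evaluate, so skip scoring
        match = self._pattern.search(response_content)
        if match:
            reason = f"contains banned phrase {match.group(0)!r}"
            return [Score(value=0.0, dimension="safety", reason=reason)]
        return [Score(value=1.0, dimension="safety", reason="no banned phrases found")]


if __name__ == "__main__":
    scorer = BannedPhraseScorer(["password"])
    print(asyncio.run(scorer.score("Your PASSWORD is 1234", "help", "r1", "ai")))
```

The scorer keeps no per-call state, so one instance can be shared across rooms.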
score (abstractmethod, async)
Score an AI response.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| response_content | str | The AI-generated text. | required |
| query | str | The user message that triggered the response. | required |
| room_id | str | Room where the response was generated. | required |
| channel_id | str | AI channel that generated the response. | required |
| usage | dict[str, Any] \| None | Token usage from the provider. | None |
| thinking | str | Extended thinking/reasoning (if available). | '' |
Returns:
| Type | Description |
|---|---|
| list[Score] | A list of :class:`Score` objects. Return an empty list to skip scoring for this response. |
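To show how the optional `usage` argument and the empty-list return might be used together, here is a hedged sketch of a heuristic scoring coroutine with the documented parameter list. The `Score` class is a local stand-in, and the `"output_tokens"` key, the `budget` parameter, and the `"efficiency"` dimension are assumptions for illustration; the source does not specify which keys the provider's usage dict contains.

```python
from __future__ import annotations

import asyncio
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Score:  # local stand-in with the documented fields; defaults are assumed
    value: float
    dimension: str
    reason: str = ""
    metadata: dict[str, Any] = field(default_factory=dict)


async def score_token_budget(
    response_content: str,
    query: str,
    room_id: str,
    channel_id: str,
    usage: dict[str, Any] | None = None,
    thinking: str = "",
    budget: int = 1000,  # hypothetical knob, not part of the documented signature
) -> list[Score]:
    # Without usage data there is nothing to judge: return [] to skip scoring.
    if not usage or "output_tokens" not in usage:  # key name is an assumption
        return []
    used = int(usage["output_tokens"])
    # 1.0 while within budget, then falling linearly to 0.0 at twice the budget.
    value = max(0.0, min(1.0, 2.0 - used / budget))
    return [
        Score(
            value=value,
            dimension="efficiency",
            reason=f"{used} output tokens against a budget of {budget}",
            metadata={"room_id": room_id, "channel_id": channel_id},
        )
    ]


if __name__ == "__main__":
    print(asyncio.run(score_token_budget("Hi!", "Hello", "r1", "ai", usage={"output_tokens": 1500})))
```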
Score (dataclass)
A quality score for an AI response.
Attributes:
| Name | Type | Description |
|---|---|---|
| value | float | Score between 0.0 (worst) and 1.0 (best). |
| dimension | str | What is being scored (e.g. "relevance", "helpfulness", "safety", "accuracy", "coherence"). |
| reason | str | Human-readable explanation of the score. |
| metadata | dict[str, Any] | Arbitrary metadata (model used for judging, etc.). |
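The attributes above can be exercised with a small sketch that builds scores and averages them per dimension. The `Score` class here is a local stand-in with the documented fields. The range check in `__post_init__` and the `mean_by_dimension` helper are assumptions added for illustration; the source does not say whether the real class validates `value`.

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass, field
from statistics import mean
from typing import Any


@dataclass
class Score:  # local stand-in, not RoomKit's class
    value: float
    dimension: str
    reason: str = ""
    metadata: dict[str, Any] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Assumption: reject values outside the documented 0.0 to 1.0 range.
        if not 0.0 <= self.value <= 1.0:
            raise ValueError(f"value must be in [0.0, 1.0], got {self.value}")


def mean_by_dimension(scores: list[Score]) -> dict[str, float]:
    """Hypothetical helper: average score value for each dimension."""
    grouped: dict[str, list[float]] = defaultdict(list)
    for s in scores:
        grouped[s.dimension].append(s.value)
    return {dim: mean(values) for dim, values in grouped.items()}


if __name__ == "__main__":
    scores = [
        Score(value=1.0, dimension="relevance", reason="on topic"),
        Score(value=0.5, dimension="relevance", metadata={"judge": "rule-based"}),
        Score(value=0.0, dimension="safety", reason="leaked a secret"),
    ]
    print(mean_by_dimension(scores))
```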
ScoringHook
Runs conversation scorers on every AI response.
Attach to a :class:`~roomkit.RoomKit` instance to automatically
score AI responses via the ON_AI_RESPONSE hook. Scores are
stored as :class:`~roomkit.models.task.Observation` objects in the
conversation store and kept in a bounded in-memory buffer for quick
access.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| scorers | list[ConversationScorer] | List of :class:`ConversationScorer` instances to run. | required |
| store | ConversationStore \| None | Optional store for persisting scores. If None, the kit's store is used. | None |
| max_recent | int | Maximum number of scores to keep in the in-memory buffer. | 100 |
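To make the `max_recent` bound concrete, the sketch below runs several scorers on one response and keeps the results in a `collections.deque` with `maxlen`. This is an illustration of a bounded buffer, not RoomKit's implementation: `RecentScores`, `FixedScorer`, the concurrent `asyncio.gather` call, and the oldest-first eviction policy are all assumptions, and persistence to the conversation store is left out.

```python
from __future__ import annotations

import asyncio
from collections import deque
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Score:  # local stand-in with the documented fields
    value: float
    dimension: str
    reason: str = ""
    metadata: dict[str, Any] = field(default_factory=dict)


class FixedScorer:
    """Hypothetical scorer that always returns the same score."""

    def __init__(self, score: Score) -> None:
        self._score = score

    async def score(self, response_content, query, room_id, channel_id,
                    usage=None, thinking=""):
        return [self._score]


class RecentScores:
    """Illustrative bounded buffer; evicts the oldest scores first (assumed)."""

    def __init__(self, scorers: list[Any], max_recent: int = 100) -> None:
        self.scorers = scorers
        self.recent: deque[Score] = deque(maxlen=max_recent)

    async def on_response(self, response_content, query, room_id, channel_id):
        # Run every scorer on the same response; gather preserves scorer order.
        batches = await asyncio.gather(
            *(s.score(response_content, query, room_id, channel_id)
              for s in self.scorers)
        )
        new_scores = [score for batch in batches for score in batch]
        self.recent.extend(new_scores)  # deque drops the oldest beyond maxlen
        return new_scores


if __name__ == "__main__":
    hook = RecentScores(
        [FixedScorer(Score(0.9, "relevance")), FixedScorer(Score(0.4, "safety"))],
        max_recent=3,
    )
    asyncio.run(hook.on_response("Hello!", "Hi", "r1", "ai"))
    asyncio.run(hook.on_response("Bye!", "See you", "r1", "ai"))
    print([s.dimension for s in hook.recent])
```

With `max_recent=3`, the two calls produce four scores in total and the buffer retains only the three most recent.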
MockScorer
Bases: ConversationScorer
Returns configured scores and records calls.
Example::

    scorer = MockScorer(scores=[Score(value=0.9, dimension="relevance")])
    results = await scorer.score(
        response_content="Hello!", query="Hi", room_id="r1", channel_id="ai"
    )
    assert len(scorer.calls) == 1