SDS-011: PET Prompt Judge Service


---
spec_id: SDS-011
title: PET Prompt Judge Service
bounded_context: cognitive-extension
status: Draft
version: 1.1.0
date_created: 2025-12-21
last_updated: 2026-01-03
implements:
---

Purpose

Defines the Prompt Judge Service, the core backend component of the PET App responsible for evaluating user prompts, inferring intent, detecting weaknesses, and recommending improvements.


1. Architecture Overview

The Prompt Judge Service is a stateless processing module (likely a microservice or serverless function) that accepts a prompt and its context and returns a structured evaluation.

It implements the Modular Pipeline pattern defined in ADR-018.

```mermaid
graph TD
    A[Client Request] --> B[Judge Service]
    B --> C{Pipeline Orchestrator}
    C --> D[Intent Detector]
    C --> E[Structure Evaluator]
    C --> F[Agentic Viability]
    C --> G[Constraint Checker]
    D & E & F & G --> H[Aggregator]
    H --> I[Suggestion Engine]
    I --> J[Final Evaluation JSON]
```
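The orchestrator fans sub-judge work out in parallel and hands the results to the Aggregator. A minimal Python sketch of that fan-out/fan-in flow, using illustrative stub judges (none of these function names are mandated by this spec):

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

# Stub sub-judges for illustration only; real implementations call rules/LLMs.
def detect_intent(prompt: str, context: dict) -> dict:
    return {"inferred": "...", "score": 5, "feedback": "Intent is clear."}

def evaluate_structure(prompt: str, context: dict) -> dict:
    return {"score": 3, "feedback": "Lacks specific tool parameters."}

SUB_JUDGES: dict[str, Callable[[str, dict], dict]] = {
    "intent": detect_intent,
    "structure": evaluate_structure,
}

def run_pipeline(prompt: str, context: dict) -> dict:
    """Fan sub-judges out in parallel, then collect their section results."""
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(judge, prompt, context)
                   for name, judge in SUB_JUDGES.items()}
        sections = {name: f.result() for name, f in futures.items()}
    # The Aggregator (3.3) and Suggestion Engine (3.2) consume `sections`.
    return sections
```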

2. API Specification

2.1 Endpoint: POST /v1/judge/evaluate

> [!NOTE]
> API versioning follows semantic versioning. Breaking changes require a version bump.

Request:

```json
{
  "prompt": "Summarize this article and send it to Slack.",
  "context": {
    "userLevel": "intermediate",
    "mode": "agentic",
    "domainRules": ["strict-no-pii"]
  }
}
```

Response:

```json
{
  "evaluationId": "eval-123456",
  "score": 85,
  "summary": "Good clear intent, but missing specific constraints and tool parameters.",
  "sections": {
    "intent": {
      "inferred": "User wants a summary of text and a Slack notification.",
      "score": 5,
      "feedback": "Intent is clear."
    },
    "structure": {
      "score": 3,
      "feedback": "Lacks specific tool parameters (channel, length)."
    },
    "agentic": {
      "score": 3,
      "feedback": "Implies tool use but doesn't define error handling or format."
    }
  },
  "suggestions": [
    {
      "type": "add_constraint",
      "text": "Specify summary length (e.g., '3 sentences')."
    },
    {
      "type": "add_parameter",
      "text": "Specify Slack channel (e.g., '#general')."
    }
  ],
  "improvedPrompt": "Summarize this article in 3 sentences. Then, send the summary to the #general Slack channel. usage: slack.post(channel='#general', text=summary). Handle errors if Slack is down.",
  "flags": ["#missing_constraints", "#agentic_ambiguity"]
}
```
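For reference, a client-side call sketch using Python's requests library; the base URL is a placeholder, since this spec does not define the deployment host:

```python
import requests

# Placeholder host; the real deployment URL is not part of this spec.
BASE_URL = "https://pet.example.com"

payload = {
    "prompt": "Summarize this article and send it to Slack.",
    "context": {
        "userLevel": "intermediate",
        "mode": "agentic",
        "domainRules": ["strict-no-pii"],
    },
}

# timeout=5 mirrors the latency bound in section 4.
resp = requests.post(f"{BASE_URL}/v1/judge/evaluate", json=payload, timeout=5)
resp.raise_for_status()
evaluation = resp.json()
print(evaluation["score"], evaluation["flags"])
```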

3. Component Logic

3.1 Sub-Judges (Logic Pipeline)

The pipeline runs the following sub-judges (a shared interface sketch follows this list):

  1. Language Detector: identifies the prompt's language so downstream judges evaluate it in the right linguistic context.
  2. Intent Detector: infers the user's underlying goal (e.g., "a summary of text and a Slack notification").
  3. Structure Evaluator: checks for clarity and required specifics such as tool parameters and output constraints.
  4. Agentic Viability Evaluator (active if mode=agentic): assesses whether implied tool use defines error handling and output format.
  5. Constraint Checker: verifies the prompt against context.domainRules (e.g., strict-no-pii).
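A minimal sketch of such a shared interface; the class names and the crude PII heuristic are illustrative only, not mandated by this spec:

```python
from abc import ABC, abstractmethod

class SubJudge(ABC):
    """Common interface so the orchestrator treats all sub-judges uniformly."""

    #: Key under which this judge's result appears in the response "sections".
    name: str

    @abstractmethod
    def evaluate(self, prompt: str, context: dict) -> dict:
        """Return {"score": int (0-5), "feedback": str, ...}."""

class ConstraintChecker(SubJudge):
    name = "constraints"

    def evaluate(self, prompt: str, context: dict) -> dict:
        # Illustrative rule-based pass: a toy check that flags possible
        # email addresses when the strict-no-pii rule is active.
        violations = [
            rule for rule in context.get("domainRules", [])
            if rule == "strict-no-pii" and "@" in prompt
        ]
        score = 5 if not violations else 1
        return {"score": score, "feedback": f"Violations: {violations or 'none'}"}
```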

3.2 Suggestion Engine

Transforms aggregated sub-judge feedback into typed suggestions (e.g., add_constraint, add_parameter) and a rewritten improvedPrompt, as shown in the response example in section 2.1.

3.3 Hybrid Scoring

The Judge uses a hybrid scoring model:

| Component | Method            | Weight |
|-----------|-------------------|--------|
| Intent    | LLM extraction    | 0.2    |
| Structure | Rule-based + LLM  | 0.3    |
| Agentic   | LLM evaluation    | 0.5    |

Rule-based checks run first (fast and deterministic); the LLM then evaluates the remaining aspects.
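A minimal sketch of the weighted aggregation. The mapping from the 0-5 sub-judge scale to the 0-100 top-level score is not specified above, so the normalization here is an assumption:

```python
# Weights from the default rubric (section 5); sub-judge scores are 0-5.
WEIGHTS = {"intent": 0.2, "structure": 0.3, "agentic": 0.5}

def aggregate(sections: dict) -> int:
    """Weighted average of sub-judge scores, normalized to 0-100."""
    weighted = sum(WEIGHTS[name] * sections[name]["score"] for name in WEIGHTS)
    return round(weighted / 5 * 100)

sections = {
    "intent": {"score": 5},
    "structure": {"score": 3},
    "agentic": {"score": 3},
}
print(aggregate(sections))  # 68 under this assumed normalization
```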


4. Invariants

The following invariants MUST be maintained:

  1. Determinism in Test Mode: Given identical inputs and mode=test, evaluation results MUST be identical.
  2. Privacy First: User prompts MUST NOT be persisted without explicit consent.
  3. Latency Bound: Evaluation MUST complete within 5 seconds (streaming allowed).
  4. Schema Compliance: All responses MUST validate against the evaluation JSON Schema.
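Invariant 4 can be enforced at the service boundary. A sketch using the jsonschema package against a pared-down schema (the full evaluation JSON Schema is defined elsewhere):

```python
from jsonschema import validate, ValidationError

# Pared-down illustration; the real evaluation JSON Schema is broader.
EVALUATION_SCHEMA = {
    "type": "object",
    "required": ["evaluationId", "score", "sections", "suggestions"],
    "properties": {
        "evaluationId": {"type": "string"},
        "score": {"type": "integer", "minimum": 0, "maximum": 100},
    },
}

def ensure_schema_compliance(response: dict) -> dict:
    """Reject any response that would violate the schema invariant."""
    try:
        validate(instance=response, schema=EVALUATION_SCHEMA)
    except ValidationError as err:
        raise RuntimeError(f"Invariant violated: {err.message}") from err
    return response
```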

5. Configuration & Rubrics

The service loads Rubric Configurations based on the user’s Organization ID. Enterprises can publish versioned “Best Practice” documents that become part of the system prompt context.

Example rubric.yaml:

```yaml
org_id: "default"
weights:
  intent: 0.2
  structure: 0.3
  agentic: 0.5
priorities:
  - no_jailbreaks
  - clear_tool_use
```
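A sketch of per-org rubric resolution, assuming rubrics live on disk as rubrics/<org_id>.yaml with default.yaml as the fallback (a layout this spec does not mandate):

```python
from pathlib import Path
import yaml

RUBRIC_DIR = Path("rubrics")  # Assumed layout: rubrics/<org_id>.yaml

def load_rubric(org_id: str) -> dict:
    """Load the org's rubric, falling back to the default rubric."""
    path = RUBRIC_DIR / f"{org_id}.yaml"
    if not path.exists():
        path = RUBRIC_DIR / "default.yaml"
    rubric = yaml.safe_load(path.read_text())
    # Guard: section 3.3 relies on the weights forming a weighted average.
    assert abs(sum(rubric["weights"].values()) - 1.0) < 1e-9, "weights must sum to 1"
    return rubric
```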

6. Integration


7. Implementation Strategy (v1)