A Bidirectional LLM Firewall: Next Level X1 - help wanted!

sookoothaii · January 29, 2026, 12:26am

Dear John6666 and community,

Thank you for the collaborative work on v5.3 “Sentinel” with Layer 11 RepE. This update reports on our recent experiments with a Rust gateway to address specific deployment constraints in our environment. This is not a replacement of v5.3, but rather a complementary approach for certain use cases.

---

## Context

The v5.3 Python architecture with RepE Layer 11 represents important research in neural introspection. However, in our specific deployment scenario (high-throughput edge environment with limited memory), we encountered resource constraints:

- Memory budget: ~4GB total (Python stack + RepE Llama-1B exceeded this)

- Latency requirement: <50ms P99 for 95% of traffic

- GIL contention under concurrent load

We explored whether offloading pattern matching to native code could help, while **preserving** the v5.3 ML capabilities for semantic cases.

---

## Technical Approach: Rust Sidecar (Experimental)

We implemented a minimal Rust gateway (Project Wick v7.0) that handles:

- Fast Path: 41 regex patterns (SIMD compilation)

- Basic embedding: `all-MiniLM-L6-v2` via Candle (HuggingFace Rust)

- Session state: In-memory HashMap (no Redis dependency)

- Protocol: gRPC (reduces serialization overhead vs JSON)

**Python microservices unchanged**: The v5.3 stack (Orchestrator, Code Intent, Content Safety, RepE Layer 11) handles semantic escalation.

```
Request → Rust (95% fast decisions) → escalate 5% → Python v5.3 stack
```
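The routing above can be sketched roughly as follows. This is written in Python for readability even though the gateway itself is Rust, and it is only an illustration under assumptions: the two patterns are hypothetical stand-ins for the 41 production patterns, `attack_similarity` is assumed to be a precomputed cosine similarity against known attack embeddings, and the way the 0.90 and 0.85 thresholds mentioned later in this post combine (block at 0.90, escalate above 0.85) is a guess, not the confirmed Wick logic.

```python
import re
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    BLOCK = "block"
    ESCALATE = "escalate"


# Hypothetical stand-ins for the production regex set.
FAST_PATH_PATTERNS = [
    re.compile(r"ignore\s+(all\s+)?previous\s+instructions", re.IGNORECASE),
    re.compile(r"\bunion\s+select\b", re.IGNORECASE),
]

BLOCK_THRESHOLD = 0.90     # hand-tuned embedding threshold quoted in the post
ESCALATE_THRESHOLD = 0.85  # escalation threshold quoted in the post


def fast_path(text: str, attack_similarity: float) -> Decision:
    """Decide locally when possible, otherwise hand off to the Python stack."""
    # Deterministic tier: any pattern hit blocks immediately.
    if any(p.search(text) for p in FAST_PATH_PATTERNS):
        return Decision.BLOCK
    # Embedding tier: assumed interplay of the two thresholds.
    if attack_similarity >= BLOCK_THRESHOLD:
        return Decision.BLOCK
    if attack_similarity > ESCALATE_THRESHOLD:
        return Decision.ESCALATE
    return Decision.ALLOW
```

Only the `ESCALATE` branch would cross the gRPC boundary to the v5.3 services in this sketch.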

---

## Preliminary Results (N=1985, 2026-01-28)

| Metric | Value | Interval | Scope |
|--------|-------|--------|------|
| TPR | 100.0% | [99.74%, 100.00%] | Pattern-based attacks |
| FPR | 0.0% | [0.00%, 0.79%] | Benign corpus |
| P99 Latency | 28ms | - | Rust Fast Path only |

**Important caveats:**

1. Test suite limited to pattern-based attacks (no novel semantic attacks)

2. Rust embedding (MiniLM-L6) less capable than Python’s full transformer stack

3. RepE Layer 11 not tested in Rust tier (deferred to Python)

4. Statistical significance requires larger benign corpus (N=1000 insufficient for production)
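To make caveat 4 concrete, here is a minimal sketch of how an interval for a 0% or 100% observed rate can be computed. It assumes a Wilson score interval at roughly 95% confidence (`z = 1.959964`); the post does not state which interval method or which per-class sample sizes produced the table, so treat the function and its inputs as illustrative only.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.959964) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (assumed method)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point noise at the boundaries.
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical corpus size: zero false positives out of 1000 benign samples.
lower, upper = wilson_interval(0, 1000)
```

With zero observed events the upper bound shrinks roughly in proportion to 1/n, which is why a larger benign corpus is needed before a 0.0% FPR says much about production behavior.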

---

## What We Learned

### Advantages (Narrow Use Case)

- Reduced memory footprint: 350MB (Rust MiniLM) vs 3.2GB (Python Llama-1B)

- Lower latency for pattern-matched attacks: ~28ms vs ~150ms

- gRPC binary protocol: ~40% less serialization overhead

### Disadvantages (Broader Context)

- **Loss of RepE neural introspection** for 95% of traffic (only invoked on escalation)

- Dual-language maintenance burden (Rust + Python)

- Compilation required for updates (vs Python hot-reload)

- MiniLM-L6 embedding less expressive than full transformer analysis

---

## Architectural Philosophy: Evolution, Not Replacement

The v5.3 “Sentinel” architecture with Layer 11 RepE represents **fundamental research** in neural introspection. Our Rust integration does not replace this - it **complements** it for resource-constrained scenarios.

**Analogy:** Fast Path regex is like a guard at the door (fast, deterministic). RepE Layer 11 is like a psychologist analyzing subtle manipulation (deep, nuanced). Both are necessary; which one to prioritize depends on deployment constraints.

In our case:

- **Edge deployment**: Rust Fast Path handles obvious cases

- **Semantic ambiguity**: Escalate to Python v5.3 stack for RepE analysis

Other deployments may prefer:

- **Research environment**: Pure Python for flexibility

- **Cloud deployment**: v5.3 architecture with more resources

---

## Open Questions for Community

We would appreciate feedback on:

1. **Threshold calibration**: Our Rust embedding threshold (0.90) is hand-tuned, not data-driven. How can we optimize it systematically?

2. **Escalation policy**: When should Fast Path escalate to RepE? We use similarity >0.85, but is there a better signal?

3. **Evaluation methodology**: Our test suite (N=1985) may not cover semantic attacks that RepE excels at. What benchmarks would demonstrate this gap?

4. **Integration overhead**: Does the Rust→Python escalation (5% traffic) add enough latency to negate the Fast Path gains?
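As a starting point for question 1, one common data-driven approach is to pick the threshold from labeled validation scores under an explicit false-positive budget. The sketch below is a generic illustration, not the project's method: `pick_threshold`, the `max_fpr` parameter, and the toy scores are all hypothetical, and it assumes a sample is flagged when its similarity score is at or above the threshold.

```python
from typing import Optional


def pick_threshold(
    benign_scores: list[float],
    attack_scores: list[float],
    max_fpr: float = 0.01,
) -> Optional[tuple[float, float, float]]:
    """Return (threshold, tpr, fpr) with the highest TPR such that FPR <= max_fpr.

    Returns None if no observed score satisfies the FPR budget.
    """
    if not benign_scores or not attack_scores:
        raise ValueError("both score lists must be non-empty")
    best = None
    # Candidate thresholds are the observed scores, scanned in ascending order.
    for t in sorted(set(benign_scores) | set(attack_scores)):
        fpr = sum(s >= t for s in benign_scores) / len(benign_scores)
        tpr = sum(s >= t for s in attack_scores) / len(attack_scores)
        # Strict '>' keeps the lowest threshold among ties on TPR.
        if fpr <= max_fpr and (best is None or tpr > best[1]):
            best = (t, tpr, fpr)
    return best


# Toy validation scores, invented for illustration only.
benign = [0.10, 0.20, 0.30, 0.88]
attacks = [0.86, 0.91, 0.95, 0.97]
choice = pick_threshold(benign, attacks, max_fpr=0.0)
```

The chosen threshold is only as good as the validation set, so it should be reported together with an interval on the resulting FPR rather than as a single number.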

---

## Known Limitations (Honest Assessment)

1. **Developer text FPR**: 6.39% upper bound (SQL keywords trigger false positives)

2. **Base64 legitimate FPR**: 10.80% upper bound (certificates, auth headers)

3. **RepE coverage**: Only 5% of traffic analyzed (vs 100% in v5.3)

4. **Rust ML maturity**: Candle ecosystem less mature than PyTorch

5. **Validation scope**: Test suite biased toward pattern attacks (RepE advantage not measured)

---

## Respectful Acknowledgment

The v5.3 “Sentinel” architecture demonstrates that **neural introspection at the latent level** is possible in production. Layer 11 RepE’s ability to detect manipulation before it reaches the response layer is conceptually important.

Our Rust work is a **resource optimization** for specific constraints, not a theoretical advancement. We are grateful for the collaborative foundation established in v5.3.

---

## Technical Documentation

For those interested in replication:

- Full validation report: `docs/TEST_REPORT_v9_4_3_20260129.md`

- Architecture comparison: `docs/ARCHITECTURE_UPDATE_v9_4_3_20260129.md`

- Project Wick certification: `hak_gal_v6/PROJECT_WICK_v7_0_PRODUCTION_CERTIFICATION.md`

---

## Conclusion

This update reports on **experimental work** with Rust integration for resource-constrained deployments. The v5.3 Python architecture with RepE Layer 11 remains the **research reference** for neural introspection.

We present these results humbly, acknowledging:

- Limited test coverage (pattern-based only)

- Trade-offs (speed vs semantic depth)

- Open questions (escalation policy, threshold calibration)

We look forward to community feedback on whether this hybrid approach has merit, or if pure Python remains preferable for most use cases.

Thank you for the continued collaboration.

---

**Respectfully submitted,**

HAK_GAL Development Team

2026-01-29

**Special thanks:** John6666 for his research foundation

BalancedTiger · April 13, 2026, 6:55pm

Thanks for sharing this, y’all. There’s a lot here I found useful.

A few things I especially appreciated:

- the clear separation between research-reference architecture and deployment-constrained optimization

- the honest naming of tradeoffs around latency, memory, and semantic depth

- the explicit attention to escalation policy, calibration, and known limitations

That kind of architectural discipline feels very relevant beyond this specific firewall context.

In my own systems work, I’ve been thinking a lot about how layered systems preserve distinctions across boundaries without collapsing them, so your framing around routing, escalation, and layer-specific roles was genuinely helpful to read.

One question this raises for me is: are the layers preserving not just content, but also the status of what they are handling?

For example, how does the system distinguish between:

- deterministic pattern hits vs probabilistic semantic suspicion

- local/session-specific signals vs globally portable rules

- calibrated findings vs provisional heuristics

- escalated uncertainty vs actual policy-level conclusions

Are those distinctions explicitly encoded at the interfaces between layers, or are some of them still mostly implicit in logs or orchestration logic?

And if some of that status remains implicit, have you found that it affects calibration, replay, false-positive analysis, or subtle leakage of one layer’s assumptions into another?

-Tiger

John6666 · April 14, 2026, 3:05am

Seems a valid point:

Yes. Some status is already preserved explicitly. But some of the most important status still seems only partly explicit.

That is the clean answer.

The architecture being discussed is not just moving text through filters. It already separates fast deterministic checks from slower semantic or latent analysis, and it also introduces stateful risk profiling for repeated probing and multi-turn hardening. So it is already preserving more than raw content. It is preserving at least some notion of what kind of signal this is and how seriously it should be treated. (Hugging Face Forums)

The simple way to think about it

There are two different things a layered system can pass forward:

1. The content itself
2. The status of that content or finding

The second one is what you are really asking about.

For example:

“This regex matched” is not the same kind of thing as “this model is 0.84 suspicious.”
“This session has been probing us for five turns” is not the same kind of thing as “this rule always applies globally.”
“This score is calibrated” is not the same as “this is a useful but provisional heuristic.”
“Escalate for more scrutiny” is not the same as “policy says block.”

If those distinctions are not preserved clearly, later layers start flattening unlike things into one risk score or one decision bit. That is where subtle architectural damage begins. Zero Trust architecture exists partly to stop exactly this kind of silent trust inheritance across boundaries. (The NIST technical series.)

What seems explicit already

1) Deterministic hit vs probabilistic suspicion

This looks fairly explicit already.

The discussion clearly distinguishes:

pattern-based or hard-gate logic,
semantic analysis,
latent-space intent analysis,
and stateful risk accumulation. (Hugging Face Forums)

That means the system is not treating all findings as the same class of evidence. A hard rule hit is closer to “this crossed a line.” A semantic score is closer to “this raises concern.” That is a healthy distinction.

2) Escalation vs final conclusion

This also seems at least partly explicit.

A major theme in the discussion is that ambiguous or difficult cases should not be reduced to one crude yes/no judgment. That aligns with context-aware safety work like CASE-Bench, which found that context significantly changes human safety judgments. In other words, systems need room for ambiguity, clarification, escalation, or deferred judgment rather than pretending all cases are immediately classifiable. (arXiv)

3) Stateful vs stateless risk

This one also seems explicit.

The later architecture description does not stay purely stateless. It adds a session-based risk engine and repeated-probing penalties. So at least some part of the system already knows that a signal may be local to a session trajectory, not just a property of one message in isolation. (Hugging Face Forums)

What still seems partly implicit

This is the more important half.

1) Local signal vs portable rule

This is where I think the architecture is still less explicit than it should be.

It clearly has session state, routing state, and deployment tradeoff awareness. But that is not the same as explicitly labeling a finding as:

global,
tenant-specific,
surface-specific,
session-local,
turn-local,
or artifact-local.

Those are different scopes. If scope is not first-class, later layers may accidentally treat a session-specific warning as if it were a generally portable truth. (Hugging Face Forums)

2) Calibrated finding vs provisional heuristic

This also seems only partly explicit.

The discussion pays real attention to calibration and limitations, which is good. But in a system like this, not every signal is the same sort of object:

some are deterministic,
some are probabilistic,
some are experimental,
some are rollout-specific,
some are useful only as escalation hints.

That difference matters because calibration only makes sense for signals that genuinely behave like probabilistic estimates. OpenAI’s agent safety guidance and OWASP’s prompt injection guidance both push toward constrained workflows and system design, not blind faith in any one classifier score. (OWASP Gen AI Security Project)

3) Escalation trigger vs policy basis

This is the subtlest gap.

A signal can play different jobs:

“look deeper”
“add supporting evidence”
“hard veto”
“this is the actual reason for the decision”

Those are not the same.

If the architecture does not explicitly preserve that distinction, then later readers and later components can mistake “this appeared in the trace” for “this was the real basis of the decision.” That is exactly the sort of boundary confusion recent work on prompt injection calls out. The “role confusion” paper makes the stronger mechanistic claim that models infer authority from how text looks, not reliably from where it came from. That means the surrounding architecture has to preserve role and standing very clearly, or the model will not. (arXiv)

Why this matters in practice

Calibration

If different kinds of evidence get blended too early, calibration becomes muddied.

A calibrated probability, a hard rule hit, and a session-local anomaly are not the same kind of thing. If they all get mixed into one “risk score,” the number may still look precise while no longer having one clean meaning. That makes thresholding feel more scientific than it really is. (OWASP Gen AI Security Project)

Replay

Replay is not only about replaying inputs.

It is about replaying the meaning of findings under the same assumptions. If the logs tell you that something fired, but do not clearly say whether it was deterministic, local, calibrated, advisory, or final, then replay can reproduce the event while still failing to reproduce its real standing. That is why reproducibility and traceability are such a strong theme in the discussion. (Hugging Face Forums)

False-positive analysis

This is where implicit status hurts a lot.

If a block happened, you want to know whether it came from:

a hard rule,
an over-sensitive model,
session carry-over,
a provisional heuristic,
or a deeper escalation policy.

If those categories are not explicit, FP analysis becomes interpretive instead of diagnostic. You can still investigate it, but with more guesswork than you want in a serious system. (Hugging Face Forums)

Cross-layer leakage

This is the deepest systems risk.

A weak signal from one layer can silently become a strong assumption in another layer. That creates:

double counting,
hidden hardening,
confidence inflation,
and brittle behavior.

OpenAI’s recent agent-safety guidance is very relevant here: it argues that in real agent systems, the right goal is not perfect input detection, but constraining the impact of manipulation even when some attacks succeed. Anthropic says something similar for browser agents: even a 1% attack success rate is still meaningful risk, so you cannot rely on one layer’s judgment alone. (OpenAI)

The easiest way to fix it

The architecture would become much clearer if every finding carried a small, explicit status envelope.

Something like:

- evidence kind: deterministic rule | calibrated model | provisional heuristic | stateful/session signal | latent probe
- scope: global | tenant | surface | session | turn | artifact
- decision role: advisory | escalation trigger | supporting evidence | hard veto | final policy basis
- calibration state: calibrated | uncalibrated | not applicable | shadow-only
- replay stability: deterministic | version-stable | session-dependent | nondeterministic

That would make the system preserve not just findings, but also the standing of findings.
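One way to make such an envelope a real interface type rather than a logging convention is to use closed enumerations, so an unknown value fails loudly at the layer boundary. This is a minimal sketch: the class names, the `_enum` helper, and the snake_case spellings of the values are my own assumptions, not an existing schema from the project.

```python
from dataclasses import dataclass, fields
from enum import Enum


def _enum(name: str, *values: str) -> type:
    """Build an Enum whose member values are the given strings (helper, hypothetical)."""
    return Enum(name, {v.upper(): v for v in values})


EvidenceKind = _enum("EvidenceKind", "deterministic_rule", "calibrated_model",
                     "provisional_heuristic", "session_signal", "latent_probe")
Scope = _enum("Scope", "global", "tenant", "surface", "session", "turn", "artifact")
DecisionRole = _enum("DecisionRole", "advisory", "escalation_trigger",
                     "supporting_evidence", "hard_veto", "final_policy_basis")
CalibrationState = _enum("CalibrationState", "calibrated", "uncalibrated",
                         "not_applicable", "shadow_only")
ReplayStability = _enum("ReplayStability", "deterministic", "version_stable",
                        "session_dependent", "nondeterministic")


@dataclass(frozen=True)
class FindingStatus:
    evidence_kind: EvidenceKind
    scope: Scope
    decision_role: DecisionRole
    calibration_state: CalibrationState
    replay_stability: ReplayStability

    @classmethod
    def from_dict(cls, raw: dict) -> "FindingStatus":
        # Enum lookup raises ValueError on unknown values; a missing key raises KeyError.
        return cls(
            evidence_kind=EvidenceKind(raw["evidence_kind"]),
            scope=Scope(raw["scope"]),
            decision_role=DecisionRole(raw["decision_role"]),
            calibration_state=CalibrationState(raw["calibration_state"]),
            replay_stability=ReplayStability(raw["replay_stability"]),
        )

    def to_dict(self) -> dict:
        # Serialize back to plain strings for traces or gRPC/JSON payloads.
        return {f.name: getattr(self, f.name).value for f in fields(self)}
```

Freezing the dataclass means a later layer cannot quietly rewrite a finding's role or scope in place; it has to emit a new finding.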

The short conclusion

So the answer is:

Yes, the layers seem to preserve some status, not just content.
No, they do not yet seem to preserve all the status distinctions you named as fully explicit interface semantics.
The clearest explicit distinctions are:
- deterministic vs semantic/latent paths,
- stateless vs stateful risk,
- escalation vs simple one-shot filtering. (Hugging Face Forums)
The distinctions that still seem partly implicit are:
- local/session-specific vs globally portable,
- calibrated finding vs provisional heuristic,
- escalation trigger vs actual policy basis. (Hugging Face Forums)

And yes, if those remain implicit, they can absolutely affect calibration, replay, FP analysis, and subtle leakage of one layer’s assumptions into another. The clean next step is not necessarily more layers. It is making the status of each signal as explicit as the signal itself. (The NIST technical series.)

The easiest way to fix it

Make every layer output two things:

the finding
a small status card about the finding

Right now, many systems only pass forward something like:

score = 0.84
fired = true
reason = prompt_injection

That is not enough.

The next layer still does not know:

Is this a hard rule or a soft suspicion?
Is it local to this session or globally valid?
Is it calibrated or experimental?
Is it only a reason to escalate, or is it enough to block?

That missing information is what causes confusion later.

The 5 fields to add

1) `evidence_kind`

What kind of finding is this?

Examples:

deterministic_rule
calibrated_model
heuristic
integrity_violation
session_signal
experimental_probe

Why it matters:
A regex hit is not the same as a model score.

2) `scope`

How far does this finding apply?

Examples:

global
tenant
surface
session
turn
artifact

Why it matters:
A session-local warning should not quietly become a global rule.

3) `decision_role`

What is this finding allowed to do?

Examples:

advisory
escalation_trigger
supporting_evidence
hard_veto
final_policy_basis

Why it matters:
Some signals should only say “look deeper.” Others are strong enough to say “stop.”

4) `calibration_state`

How should the score be interpreted?

Examples:

calibrated
uncalibrated
not_applicable
shadow_only
drifted

Why it matters:
A calibrated probability and an experimental score should not look identical.

5) `replay_stability`

How stable should this be in replay?

Examples:

deterministic
version_stable
session_dependent
nondeterministic

Why it matters:
Replay should tell you what must match and what may vary.

Very simple example

{
  "layer": "semantic_gate",
  "score": 0.84,
  "reason_code": "INDIRECT_INJECTION_SUSPECTED",
  "evidence_kind": "calibrated_model",
  "scope": "turn",
  "decision_role": "escalation_trigger",
  "calibration_state": "calibrated",
  "replay_stability": "version_stable"
}

And a very different one:

{
  "layer": "tool_input_parser",
  "reason_code": "DUPLICATE_JSON_KEYS",
  "evidence_kind": "integrity_violation",
  "scope": "artifact",
  "decision_role": "hard_veto",
  "calibration_state": "not_applicable",
  "replay_stability": "deterministic"
}

Both are findings.
But they are not the same kind of finding.
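To show where a finding like the second one could come from, here is a small sketch of a parser that rejects duplicate JSON keys and emits that status card. It relies on the standard `object_pairs_hook` parameter of `json.loads`; the function name `parse_tool_input`, the `DuplicateKeyError` class, and the return shape are hypothetical.

```python
import json


class DuplicateKeyError(ValueError):
    """Raised when a JSON object repeats a key."""


def _reject_duplicates(pairs):
    # Called by json.loads for every object, innermost objects first.
    seen = {}
    for key, value in pairs:
        if key in seen:
            raise DuplicateKeyError(key)
        seen[key] = value
    return seen


def parse_tool_input(raw: str) -> dict:
    """Return the parsed payload, or a hard-veto finding if keys are duplicated."""
    try:
        payload = json.loads(raw, object_pairs_hook=_reject_duplicates)
    except DuplicateKeyError:
        finding = {
            "layer": "tool_input_parser",
            "reason_code": "DUPLICATE_JSON_KEYS",
            "evidence_kind": "integrity_violation",
            "scope": "artifact",
            "decision_role": "hard_veto",
            "calibration_state": "not_applicable",
            "replay_stability": "deterministic",
        }
        return {"payload": None, "finding": finding}
    return {"payload": payload, "finding": None}
```

By default Python's `json.loads` silently keeps the last value for a repeated key, which is exactly the kind of ambiguity a `hard_veto` integrity check is meant to remove.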

Why this helps immediately

It improves four things fast:

Calibration: not every score gets treated like the same kind of probability.
Replay: you know what should match exactly and what may differ.
False-positive analysis: you can tell whether a bad decision came from a hard rule, a model, a session signal, or an experiment.
Cross-layer leakage: a weak signal stops silently turning into a strong one just because it moved deeper into the system.

The practical rule

A good rule is:

No finding should be allowed to do more decision work than its status card says it can do.

That means:

an advisory signal cannot block by itself
an escalation_trigger can only deepen routing
a hard_veto can stop execution immediately
a shadow_only signal cannot affect production decisions
a session-scoped signal cannot quietly become global
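A fusion step that respects those limits could look roughly like this. It is a sketch under assumptions: findings are plain dicts carrying the five fields, `fuse` is a hypothetical name, and because the thread does not say how `supporting_evidence` or `final_policy_basis` should combine, the sketch only lets them pass through without changing the outcome. The scope rule is not covered here.

```python
KNOWN_ROLES = {
    "advisory",
    "escalation_trigger",
    "supporting_evidence",
    "hard_veto",
    "final_policy_basis",
}


def fuse(findings: list[dict]) -> str:
    """Return 'block', 'escalate', or 'allow' without exceeding any finding's role."""
    decision = "allow"
    for finding in findings:
        role = finding["decision_role"]
        if role not in KNOWN_ROLES:
            raise ValueError(f"unknown decision_role: {role!r}")
        # Shadow-only signals are never allowed to affect production decisions.
        if finding.get("calibration_state") == "shadow_only":
            continue
        if role == "hard_veto":
            return "block"  # a hard veto stops execution immediately
        if role == "escalation_trigger":
            decision = "escalate"  # can only deepen routing, never block
        # advisory / supporting_evidence / final_policy_basis: recorded only
        # in this sketch, they cannot block by themselves.
    return decision
```

Note that a high `score` never appears in the control flow here: only the declared role decides how much work a finding may do.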

The easiest rollout

Do this in two steps:

First, keep the current logic the same and just add the 5 fields to every layer output.

Then, once those fields are present in traces, update fusion so it respects them.

That is the easiest fix because it does not require a new model, a new detector, or a new architecture. It only requires making the meaning of each finding explicit.

BalancedTiger · April 14, 2026, 6:30pm

This is a thoughtful response — thank you. I found the distinction between the finding itself and the standing of the finding especially useful.

What I’m taking from it into my own architecture work is the idea that extracted material may need to carry more than content. It may also need to carry its scope, authority level, decision role, and transfer stability so that local or advisory signals do not quietly harden into governing ones when they move across layers or contexts.

That part feels very relevant to the continuity / migration problems I’ve been thinking about in my own systems.

Your distinction between an escalation trigger and an actual policy basis also stood out to me. That seems like one of the easiest places for architectural confusion to creep in if the system preserves that something fired, but not what kind of work that signal is actually allowed to do.

One follow-up question I’d be curious about:

At what point in the pipeline do you think that status card becomes mandatory?

My instinct is that if those distinctions only become explicit late in the system, some of the most important boundary confusion has already happened upstream.

Thanks again — this gave me something real to think with.

Tiger

John6666 · April 15, 2026, 12:56pm

Seems like your intuition is right:

Yes. It becomes mandatory earlier than most systems want to admit.

The clean answer is:

A status card becomes mandatory at the first point where the system stops treating material as raw input and starts treating it as something that can change control flow, memory, or authority.

That usually means two different moments, not one:

1. At ingress / normalization, every piece of incoming material needs a source-status card.
2. At the first detector, parser, or scorer that emits a conclusion, every conclusion needs a finding-status card.

If you wait until fusion, policy, or final action selection, you are already late. By then, some of the most important boundary confusion may already have happened upstream. That is exactly the lesson behind zero-trust policy/enforcement separation and current agent-safety guidance: attach trust and authority early, not after the system has already reasoned with the material. (The NIST Tech series.)

The shortest practical rule

Use this rule:

Before any signal is allowed to influence routing, memory, tool planning, or execution, it must carry explicit status.

That is the rule that keeps advisory signals from quietly hardening into governing ones.

Why “late” is too late

A lot of systems think the right place for a status card is near the final decision. That is good for auditing, but weak for control.

Why weak? Because the dangerous upgrade often happens much earlier.

A system may already have:

routed down a more privileged path,
written something into session state,
treated retrieved text as instruction-like,
let a weak heuristic influence a tool plan,
or blended a provisional score with stronger evidence.

By the time a late-stage status card appears, the system may already be documenting a confusion it failed to prevent. This is why current prompt-injection guidance keeps emphasizing workflow design and constrained action boundaries rather than relying on end-of-pipeline judgment alone. (OpenAI)

The easiest way to think about it

There are really two different status cards.

1) Source-status card

This is attached to raw material as soon as it enters the system.

Examples of raw material:

user text
retrieved chunks
tool output
OCR text
browser page text
model output that will be reused downstream

This card answers:

What is this material allowed to count as?

It should include things like:

origin
trust class
scope
authority role
transfer policy

A simple version looks like this:

{
  "origin": "retrieved_context",
  "trust_class": "untrusted",
  "scope": "artifact",
  "authority_role": "data",
  "transfer_policy": ["may_route", "may_summarize", "may_not_execute"]
}

This card becomes mandatory at ingress normalization. Not later. Because once the system starts normalizing, segmenting, or source-tagging input, it is already interpreting it. If source and authority are not attached there, later layers are working on material whose standing is already blurred. That fits the core point in recent role-confusion work: models often infer “who is speaking” from style rather than source, so the system has to preserve source/role explicitly at the boundary. (arXiv)
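A source-status card only helps if something checks it. The sketch below shows one possible deny-by-default check against `transfer_policy`, reusing the `may_<action>` / `may_not_<action>` naming from the JSON above. The `SourceStatus` class and `is_permitted` function are hypothetical names, and treating unlisted actions as forbidden is my assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceStatus:
    origin: str
    trust_class: str
    scope: str
    authority_role: str
    transfer_policy: frozenset


def is_permitted(status: SourceStatus, action: str) -> bool:
    """Allow an action only if it is explicitly granted and not explicitly denied."""
    granted = f"may_{action}" in status.transfer_policy
    denied = f"may_not_{action}" in status.transfer_policy
    return granted and not denied


retrieved = SourceStatus(
    origin="retrieved_context",
    trust_class="untrusted",
    scope="artifact",
    authority_role="data",
    transfer_policy=frozenset({"may_route", "may_summarize", "may_not_execute"}),
)
```

With this card, `is_permitted(retrieved, "route")` is true, while `"execute"` and any action the card never mentions are refused.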

2) Finding-status card

This is attached when a component emits a claim about the material.

Examples:

regex matched
parser detected duplicate keys
semantic model returned a risk score
entropy heuristic fired
session probe increased risk
latent probe signaled role confusion

This card answers:

What kind of claim is this, and how much decision work is it allowed to do?

A simple version looks like this:

{
  "evidence_kind": "calibrated_model",
  "scope": "turn",
  "decision_role": "escalation_trigger",
  "calibration_state": "calibrated",
  "replay_stability": "version_stable",
  "score": 0.84,
  "reason_code": "INDIRECT_INJECTION_SUSPECTED"
}

This card becomes mandatory at the first emitted conclusion. That means: the moment a parser, rule engine, detector, or probe says anything stronger than “here is raw content,” the status card should exist. From that point on, the system is no longer moving content alone. It is moving claims about content. (RFC Editor)

Where the “mandatory line” sits in a real pipeline

The cleanest pipeline version looks like this.

Stage A: Raw ingress

The system receives input.

At this point, you attach a source-status card immediately.

Why here? Because this is the first boundary where trust, origin, and authority can be lost. OpenAI’s agent guidance is explicit that untrusted instructions can arrive through external sources and influence tools or planning. Anthropic says the same for browser agents that constantly consume hostile or mixed-trust content. (OpenAI)

Stage B: Normalization and parsing

The system normalizes Unicode, strips obfuscation, parses JSON, segments content, or canonicalizes a tool payload.

The source-status card must already exist here.

Why? Because normalization is not neutral. It transforms the object. If you canonicalize or parse something, you are deciding what “the same object” means. RFC 8785 matters here because it creates a deterministic, hashable JSON representation for cryptographic uses. That is exactly the kind of boundary where identity and standing must stop drifting. (RFC Editor)
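As a rough illustration of "deterministic, hashable representation", the sketch below hashes a payload after serializing it with sorted keys and no insignificant whitespace. This only approximates RFC 8785 for simple payloads (strings, integers, booleans, null, ASCII keys): the RFC additionally specifies ECMAScript-style number serialization and key ordering by UTF-16 code units, which Python's `json.dumps` does not guarantee. `canonical_digest` is a hypothetical helper name.

```python
import hashlib
import json


def canonical_digest(obj) -> str:
    """SHA-256 over a stable JSON serialization (approximation of JCS, see above)."""
    canonical = json.dumps(
        obj,
        sort_keys=True,          # stable key order
        separators=(",", ":"),   # no insignificant whitespace
        ensure_ascii=False,      # emit UTF-8 text instead of \u escapes
        allow_nan=False,         # NaN/Infinity are not valid JSON
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Two payloads that differ only in key order hash to the same digest.
a = canonical_digest({"tool": "search", "args": {"q": "x", "k": 3}})
b = canonical_digest({"args": {"k": 3, "q": "x"}, "tool": "search"})
```

Storing a digest like this next to the source-status card gives later layers a way to confirm they are still reasoning about the same object that was tagged at ingress.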

Stage C: First conclusion

A layer says:

this matched,
this is suspicious,
this is malformed,
this is high-risk,
this session pattern is escalating.

Now the finding-status card becomes mandatory.

Not optional. Not deferred.

This is the first moment when a signal can start doing control work inside the system. If the system still does not know whether the signal is deterministic, local, calibrated, advisory, or final, later layers will guess. That is where architecture starts quietly hardening without saying so.

Stage D: Routing and fusion

At this point, routing and fusion should consume only findings that already have status cards.

This is important. The routing layer should not have to infer:

whether a signal is advisory,
whether it is session-local,
whether it is calibrated,
or whether it is allowed to block.

If routing has to infer those things from score shape or log conventions, the architecture is already too implicit.

Stage E: Memory, tool planning, action

By the time a system updates memory, plans a tool, or approves execution, both source status and finding status must already be present and enforced.

This is where zero-trust logic becomes concrete. NIST’s model separates policy decision and policy enforcement precisely so control decisions are based on governed inputs, not on vague downstream assumptions. (The NIST Tech series.)

The key distinction: “mandatory” means different things at different stages

This is important.

When I say “mandatory,” I do not mean the same schema must exist in full from the first byte onward.

I mean:

source status is mandatory from first interpretation,
finding status is mandatory from first conclusion,
and both are mandatory before any control consequence.

That is the simplest clean rule.

Why your instinct is correct

You said:

if those distinctions only become explicit late in the system, some of the most important boundary confusion has already happened upstream.

That is right.

The upstream confusion usually happens when a system silently upgrades:

data into instruction,
local context into portable rule,
heuristic suspicion into policy basis,
advisory evidence into execution authority.

Recent work on role confusion is basically a mechanistic explanation of that same failure: the model can assign authority based on how text looks rather than where it came from. So if the architecture waits too long to attach authority and provenance, later components are already reasoning over blurred material. (arXiv)

The easiest implementation pattern

The smallest practical rollout is this.

First

Add a source-status card at ingress with just five fields:

origin
trust_class
scope
authority_role
transfer_policy

That alone prevents a lot of early confusion.

Second

Add a finding-status card to every emitted conclusion with:

evidence_kind
scope
decision_role
calibration_state
replay_stability

That prevents later layers from guessing what kind of signal they are seeing.

Third

Make this enforcement rule:

No component may use a signal to change routing, memory, or execution authority unless that signal already carries explicit status.

That is the whole fix in one line.
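In code, that one-line rule can be a guard that every control-affecting function calls first. A minimal sketch, assuming findings are dicts and using hypothetical names (`require_status`, `MissingStatusError`, `route`, and the route labels):

```python
REQUIRED_FIELDS = (
    "evidence_kind",
    "scope",
    "decision_role",
    "calibration_state",
    "replay_stability",
)


class MissingStatusError(Exception):
    """Raised when a signal without explicit status tries to influence control flow."""


def require_status(finding: dict) -> dict:
    missing = [name for name in REQUIRED_FIELDS if not finding.get(name)]
    if missing:
        raise MissingStatusError(f"finding lacks status fields: {missing}")
    return finding


def route(finding: dict) -> str:
    """Pick a route; refuses to act on findings that carry no status card."""
    require_status(finding)
    if finding["decision_role"] == "escalation_trigger":
        return "deep_analysis"
    return "default"
```

A bare `{"score": 0.84, "fired": True, "reason": "prompt_injection"}` record raises here instead of being interpreted by guesswork.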

What this prevents

It prevents four common failures.

1) Advisory drift

A weak signal stops quietly becoming a stronger one just because it moved deeper into the stack.

2) Scope drift

A session-local signal stops silently turning into a generally portable rule.

3) Calibration drift

An experimental or heuristic score stops masquerading as a calibrated probability.

4) Provenance drift

Untrusted or derived content stops quietly inheriting the authority of trusted instructions.

That last point is especially important in agent systems. Prompt injection is hard partly because the model itself does not reliably preserve instruction/data boundaries. That is why external system structure matters so much. (OpenAI)

Why this also helps replay and auditing

Once status is attached early, replay becomes much more meaningful.

You can tell:

what was deterministic,
what was calibrated,
what was session-dependent,
what was advisory only,
and what was allowed to become a policy basis.

That is very similar in spirit to provenance systems and attestation systems. The in-toto attestation framework describes attestations as authenticated metadata intended for automated policy consumers. The same pattern applies here: not just “what happened,” but “what standing did this claim have?” (GitHub)

It also fits with OpenTelemetry’s direction for GenAI agent spans: richer structured semantics are needed once systems include tools, agents, and multi-stage execution, because plain logs are too lossy. (OpenTelemetry)
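Here is a small sketch of how `replay_stability` could drive a replay check. The interpretation is my assumption, not something the thread defines: `deterministic` findings must always reproduce, `version_stable` findings must reproduce only when component versions are unchanged, and the other two classes are allowed to vary. Findings are matched by `(layer, reason_code)`, and `replay_mismatches` is a hypothetical name.

```python
def replay_mismatches(
    original: list[dict],
    replayed: list[dict],
    same_version: bool = True,
) -> list[tuple[str, str]]:
    """Return keys of findings that were required to reproduce but did not."""
    replay_index = {(f["layer"], f["reason_code"]): f for f in replayed}
    mismatches = []
    for finding in original:
        stability = finding["replay_stability"]
        must_match = stability == "deterministic" or (
            stability == "version_stable" and same_version
        )
        if not must_match:
            continue  # session-dependent / nondeterministic findings may vary
        key = (finding["layer"], finding["reason_code"])
        other = replay_index.get(key)
        # A required finding must be present and carry the same score (if any).
        if other is None or other.get("score") != finding.get("score"):
            mismatches.append(key)
    return mismatches
```

The useful property is that a replay diff stops being a wall of differences: only the findings that were supposed to be stable are reported.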

The one-sentence answer

If you want the answer in one sentence:

The status card becomes mandatory at the first point where the system interprets, transforms, or concludes anything in a way that can affect routing, memory, or action.

In practice:

source-status at ingress/normalization,
finding-status at first conclusion,
both enforced before fusion, memory, or execution.

Final view

So yes, I agree with your instinct.

If status only becomes explicit late in the system, it is already too late for the most important part of the job. Late status cards are still useful for replay and forensics. But early status cards are what prevent boundary confusion in the first place. That is the difference between “good logging” and “good architecture.”
