Custom semantic representation ("bryła") beats raw text in 24/27 configs — built solo on an RTX 2060, looking for feedback

Hi everyone,

I’m a self-taught builder (no CS degree — I learned this after night shifts, with AI as my tutor). I’ve been working on a custom semantic representation I call a “bryła” (Polish for “solid”), and I’d like to share results and get honest feedback.

The problem I started from: my hardware (RTX 2060, 12GB) couldn’t handle big models. LLMs kept recomputing the same meaning from scratch. So instead of waiting for better hardware, I tried a different path: what if I precompute meaning and pack it into the input?

The idea: a parser decomposes each sentence into ~28 “walls” (polarity, certainty, causality, negation, register, etc.), so the model reads structured meaning instead of guessing it from raw tokens.

What I measured (and where I was wrong first): my early results looked good, but I later found that my corpus was hurting the representation. It had 9080 unique texts (87%) but only 483 unique structured inputs (5%), so the representation saw roughly 20× less variety than raw text. So I built a controlled, balanced test set and ran a full grid: groups {1,3,8} × model size {32,64,128} × parser noise {0,10,20%} = 27 configs.

Result: the structured input won 24/27 against raw text. At 0% noise: 100% vs 95%. Even at 20% parser noise it stayed ahead, except when the model was too small (dm=32) to learn to ignore noisy walls.
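
For readers who want to reproduce the shape of that grid, here is a minimal sketch of how the 27 configurations and the parser-noise corruption could be set up. The wall names, their value lists, and the `corrupt_walls` helper are assumptions for illustration; the actual parser and walls in the repository may look different.

```python
import itertools
import random

# Hypothetical wall vocabularies; the real walls and values may differ.
WALL_VALUES = {
    "POL": ["positive", "neutral", "negative"],
    "CERT": ["low", "medium", "high"],
    "REG": ["formal", "informal"],
}

def corrupt_walls(walls, noise_rate, rng):
    """With probability noise_rate, swap a wall's value for a different one."""
    noisy = {}
    for wall, value in walls.items():
        if rng.random() < noise_rate:
            alternatives = [v for v in WALL_VALUES[wall] if v != value]
            noisy[wall] = rng.choice(alternatives)
        else:
            noisy[wall] = value
    return noisy

# groups x model size x parser noise, as listed in the post.
GRID = list(itertools.product([1, 3, 8], [32, 64, 128], [0.0, 0.10, 0.20]))

rng = random.Random(0)
clean = {"POL": "negative", "CERT": "high", "REG": "formal"}
print(len(GRID))                         # 27
print(corrupt_walls(clean, 0.20, rng))   # some walls may be replaced
```

Seeding the generator per configuration keeps the corrupted inputs reproducible across runs.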

Honest caveat: this test set is synthetic — the walls are a clean signal. It’s a proof of concept that the architecture is faithful and noise-robust, NOT yet proof on natural language. That’s my next step.

My question to you: has anyone here worked on structured inputs or neurosymbolic approaches like this? How did you measure whether the structure actually adds information vs just summarizing the text? I’m especially curious how others separated “the structure helps” from “the corpus was just easy”.

Full write-up, code and measurements here: krzysiekpl/bryla-kris · Hugging Face

Thanks for reading — genuinely looking for people working on similar problems, not promotion.

This is probably a big step forward. Here is the procedure I would recommend next:

This looks like a much stronger proof-of-concept than the previous stage.

The important part is not only “Bryła wins in 24/27 configurations.” The more important part is that you found a concrete failure mode — low structured-input diversity — and then rebuilt the test around a controlled grid. That makes the result more credible.

I would still be careful with the claim, but the direction is good.

My direct answer would be:

Do not scale the synthetic setup much further yet.
The next high-value step is a small natural-data test with DOMAIN, shuffled-structure, and random-structure controls.

1. How I would read the current result

From what you describe, the result is now approximately:

| Stage | What it shows | What it does not yet show |
| --- | --- | --- |
| Earlier technical QA result | Bryła can improve a tiny matched setup | Maybe fragile / domain-specific |
| Field ablation | compact fields are better than default-heavy FULL | not yet general |
| Clean PPL / masked loss | tags must be treated as context, not target | PPL alone is still incomplete |
| Current 24/27 grid | Bryła can transmit useful structured signal in a controlled setup | not yet proven on messy natural Polish data |

So the current claim I would make is:

In a controlled synthetic setting, Bryła appears to be a real conditioning signal rather than just noise. The next question is whether the same advantage survives natural data, real parser errors, and stronger controls.

That is already a good research position.

2. The next decisive experiment

I would run a very small natural-data test, not a larger synthetic one.

Use four or five conditions:

RAW
DOMAIN + RAW
BRYLA + RAW
SHUFFLED-BRYLA + RAW
RANDOM-BRYLA + RAW

The most important comparisons:

| Comparison | Meaning |
| --- | --- |
| BRYLA > RAW | Bryła still helps outside the synthetic setup |
| BRYLA > DOMAIN | Bryła adds more than a simple domain label |
| BRYLA > SHUFFLED-BRYLA | field-value alignment matters |
| BRYLA > RANDOM-BRYLA | result is not just prefix-format regularization |
| DOMAIN ≈ BRYLA | current Bryła may mostly encode domain/topic |
| SHUFFLED-BRYLA ≈ BRYLA | structure labels may not be semantically used |
| RANDOM-BRYLA helps | possible regularization / format artifact |

The core next question is:

Does Bryła beat DOMAIN-only and shuffled-structure controls on natural data?

If yes, the claim becomes much stronger.
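
A minimal sketch of how the five conditions could be generated from one parsed example. The field names, the value vocabulary, and the bracketed prefix format are assumptions for illustration, not the project's actual format. SHUFFLED keeps the same bag of values but reassigns them to field names; RANDOM keeps the format but draws each value at random.

```python
import random

# Hypothetical field vocabulary; names and values are illustrative only.
FIELD_VALUES = {
    "TYPE": ["fact", "opinion", "question"],
    "POL": ["positive", "neutral", "negative"],
    "INTENT": ["inform", "request", "warn"],
}

def render(fields):
    return " ".join(f"[{name}:{value}]" for name, value in fields.items())

def shuffled_fields(fields, rng):
    """Same bag of values, reassigned to field names in random order."""
    names = list(fields)
    values = list(fields.values())
    rng.shuffle(values)  # note: may occasionally return the original order
    return dict(zip(names, values))

def random_fields(fields, rng):
    """Same prefix format, every value drawn at random from its field."""
    return {name: rng.choice(FIELD_VALUES[name]) for name in fields}

def build_conditions(text, domain, fields, rng):
    return {
        "RAW": text,
        "DOMAIN + RAW": f"[DOMAIN:{domain}] {text}",
        "BRYLA + RAW": f"{render(fields)} {text}",
        "SHUFFLED-BRYLA + RAW": f"{render(shuffled_fields(fields, rng))} {text}",
        "RANDOM-BRYLA + RAW": f"{render(random_fields(fields, rng))} {text}",
    }

rng = random.Random(0)
fields = {"TYPE": "fact", "POL": "neutral", "INTENT": "inform"}
conditions = build_conditions("Stainless steel contains chromium.", "technical", fields, rng)
for name, model_input in conditions.items():
    print(f"{name:22s} {model_input}")
```

Keeping all five inputs derived from the same example makes the comparison matched: only the prefix differs between conditions.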

3. Natural-data mini-benchmark

I would start small:

6 domains × 50 examples = 300 examples

Suggested domains:

| Domain | Why useful |
| --- | --- |
| technical / welding / materials | original strongest area |
| geography / places | tests templatic factual data |
| biographies | tests people, dates, roles, events |
| science explanations | tests definitions and causal relations |
| daily-life / practical QA | tests intent, urgency, user-facing pragmatics |
| sports / events | tests event structure and temporal facts |

Report results by domain, not only aggregate.

Example table:

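| Domain | RAW | DOMAIN | BRYLA | SHUFFLED | Best | Note |
| --- | --- | --- | --- | --- | --- | --- |
| technical | | | | | | |
| geography | | | | | | |
| biography | | | | | | |
| science | | | | | | |
| daily life | | | | | | |
| sports | | | | | | |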

This matters because Bryła may help one domain and hurt another. That would still be useful information.
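
The per-domain breakdown could be computed along these lines. This is only a sketch: the record layout `(domain, condition, correct)` and the toy records are made up for illustration and are not measurements.

```python
from collections import defaultdict

def accuracy_by_domain(records):
    """records: iterable of (domain, condition, correct) with correct as bool."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for domain, condition, correct in records:
        totals[(domain, condition)] += 1
        hits[(domain, condition)] += int(correct)
    table = defaultdict(dict)
    for (domain, condition), n in totals.items():
        table[domain][condition] = hits[(domain, condition)] / n
    return dict(table)

def best_condition(row):
    """Name of the condition with the highest accuracy in one domain row."""
    return max(row, key=row.get)

# Made-up toy records, only to show the shape of the output.
TOY_RECORDS = [
    ("technical", "RAW", True), ("technical", "RAW", False),
    ("technical", "BRYLA", True), ("technical", "BRYLA", True),
    ("sports", "RAW", True), ("sports", "BRYLA", False),
]

for domain, row in accuracy_by_domain(TOY_RECORDS).items():
    print(domain, row, "best:", best_condition(row))
```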

4. Replace synthetic parser noise with real parser error

The 0/10/20% parser-noise grid is useful, but the next step should be real parser failure.

A good progression would be:

synthetic parser noise
→ real parser errors
→ real domain shift
→ real QA / generation metric

Synthetic noise tells you the model is robust to artificial corruption. Real parser errors tell you whether the full system works.

I would report:

% parsed
% partial
% OTHER
field default rate
field entropy
field/domain correlation

Example parser dashboard:

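| Domain | Parsed % | Partial % | OTHER % | Main failure mode |
| --- | --- | --- | --- | --- |
| technical | | | | |
| geography | | | | |
| biography | | | | |
| science | | | | |
| daily life | | | | |
| sports | | | | |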

The parser is now part of the research object, not just preprocessing.
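
A small sketch of how the parsed / partial / OTHER percentages might be tallied per domain. The expected field list and the convention that a missing or failed field is marked `"OTHER"` are assumptions; the real parser output may need a different definition of "partial".

```python
from collections import Counter, defaultdict

EXPECTED_FIELDS = ("TYPE", "POL", "INTENT")  # hypothetical field set

def parse_status(fields):
    """Classify one parse by how many expected fields carry a real value."""
    filled = [f for f in EXPECTED_FIELDS if fields.get(f, "OTHER") != "OTHER"]
    if len(filled) == len(EXPECTED_FIELDS):
        return "parsed"
    if filled:
        return "partial"
    return "other"

def parser_dashboard(examples):
    """examples: iterable of (domain, fields_dict). Returns percentages per domain."""
    counts = defaultdict(Counter)
    for domain, fields in examples:
        counts[domain][parse_status(fields)] += 1
    report = {}
    for domain, c in counts.items():
        total = sum(c.values())
        report[domain] = {s: 100.0 * c[s] / total for s in ("parsed", "partial", "other")}
    return report
```

The "main failure mode" column still needs a manual look at the partial and OTHER examples; counting alone will not name it.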

5. Keep structure-diversity metrics permanently

The best methodological insight in the new result may be the structured-diversity issue.

I would report these in every experiment:

unique raw texts
unique Bryła strings
Bryła/raw diversity ratio
field entropy
default-field ratio
parser OTHER%
average input tokens

Example:

| Metric | RAW | BRYLA |
| --- | --- | --- |
| unique text strings | | |
| unique structured strings | | |
| average source tokens | | |
| field entropy | | |
| default-field ratio | | |
| parser OTHER% | | |

This helps distinguish:

Bryła is useful

from:

Bryła collapsed many examples into the same structure

or:

Bryła mostly encoded domain/template identity
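
Most of these diversity metrics are plain counting. Here is a rough sketch, assuming each structured input is available as a dict of field to value and that a table of per-field default values exists (both are assumptions about the data layout). It expects non-empty inputs.

```python
import math
from collections import Counter

def shannon_entropy(items):
    """Entropy in bits of the empirical distribution over items."""
    counts = Counter(items)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def diversity_report(raw_texts, structured, defaults):
    """structured: list of dicts field -> value; defaults: dict field -> default value."""
    strings = [" ".join(f"[{k}:{v}]" for k, v in sorted(s.items())) for s in structured]
    fields = sorted({k for s in structured for k in s})
    n_values = sum(len(s) for s in structured)
    n_default = sum(1 for s in structured for k, v in s.items() if defaults.get(k) == v)
    return {
        "unique_raw": len(set(raw_texts)),
        "unique_structured": len(set(strings)),
        "diversity_ratio": len(set(strings)) / len(set(raw_texts)),
        "field_entropy_bits": {f: shannon_entropy([s.get(f) for s in structured]) for f in fields},
        "default_field_ratio": n_default / n_values,
    }
```

With the figures from the first post, the diversity ratio would be 483 / 9080 ≈ 0.053, which is the collapse that prompted the rebuilt test set.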

6. Keep clean PPL / masked loss

For prefix-style experiments, full-sequence PPL can be misleading because the model may get rewarded for predicting easy deterministic tags.

So I would keep:

val_ppl_clean = only natural Polish target text
val_ppl_tags  = only Bryła tags
val_ppl_std   = full sequence, diagnostic only

Primary metric:

val_ppl_clean
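
To make the three numbers concrete, here is a sketch that splits perplexity by token type. It assumes you already have a per-token negative log-likelihood (natural log) for each evaluated token and a flag saying whether the token belongs to the Bryła prefix; how those are extracted depends on your training loop.

```python
import math

def perplexity(nlls):
    """exp of the mean negative log-likelihood over the given tokens."""
    return math.exp(sum(nlls) / len(nlls))

def split_perplexities(token_nlls, is_tag):
    """token_nlls[i]: NLL of token i; is_tag[i]: True if token i is a prefix tag."""
    clean = [n for n, t in zip(token_nlls, is_tag) if not t]
    tags = [n for n, t in zip(token_nlls, is_tag) if t]
    return {
        "val_ppl_clean": perplexity(clean),
        "val_ppl_tags": perplexity(tags),
        "val_ppl_std": perplexity(token_nlls),
    }

# Made-up NLLs: two easy tag tokens followed by two harder text tokens.
result = split_perplexities([0.1, 0.1, 3.0, 3.0], [True, True, False, False])
print(result)
```

In this toy case the full-sequence number lands well below the clean one purely because the tags are easy to predict, which is the artifact the clean metric is meant to exclude.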

The Hugging Face docs on fixed-length-model perplexity are useful here because they emphasize that PPL depends on the exact likelihood/evaluation setup.

For decoder-only prefix conditioning, I would use masked loss:

input:
  [BRYLA PREFIX] [SEP_BRYLA] [POLISH TEXT]

labels:
  [-100 ... -100] [-100]      [POLISH TEXT LABELS]

That matches the conceptual setup:

Bryła = context
Polish text = target
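
A minimal sketch of building those labels from token ids. The function name and the example ids are hypothetical. The value -100 is the default `ignore_index` of PyTorch's cross-entropy loss, so positions carrying it contribute nothing to the loss.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss

def build_masked_example(prefix_ids, sep_id, text_ids):
    """Return (input_ids, labels) with the prefix and separator masked out."""
    input_ids = list(prefix_ids) + [sep_id] + list(text_ids)
    labels = [IGNORE_INDEX] * (len(prefix_ids) + 1) + list(text_ids)
    return input_ids, labels

# Hypothetical token ids: three prefix tokens, one separator, two text tokens.
input_ids, labels = build_masked_example([11, 12, 13], 2, [21, 22])
print(input_ids)  # [11, 12, 13, 2, 21, 22]
print(labels)     # [-100, -100, -100, -100, 21, 22]
```

Whether the labels need to be shifted by one position depends on the training code: some model classes shift internally, while a hand-written loop usually has to do it itself.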

7. Try cooldown

The most interesting next experiment after the control ladder is cooldown.

This is close to the idea in MeCo: train with metadata, then cool down on raw text so the model can function without metadata at inference time.

For Bryła:

Phase 1:
  train on BRYLA + text

Phase 2:
  short cooldown on RAW-only text

Eval:
  RAW-only

Controls:

RAW baseline
DOMAIN + text -> RAW cooldown
BRYLA + text -> RAW cooldown
RANDOM-BRYLA + text -> RAW cooldown

Interpretation:

| Result | Meaning |
| --- | --- |
| Bryła cooldown > RAW | Bryła may work as a training scaffold |
| Bryła cooldown ≈ RAW | no retained scaffold effect |
| DOMAIN cooldown ≈ Bryła cooldown | domain metadata may explain much of the gain |
| random-prefix cooldown helps | possible curriculum/regularization effect |
| Bryła requires Bryła at inference | useful, but deployment depends on parser |

If cooldown works, the story becomes stronger:

Bryła is not only an inference-time representation.
It may be a training scaffold for small models.
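
The two-phase schedule could be sketched as a generator that switches the input format near the end of training. The `cooldown_fraction` parameter and its default are assumptions to be tuned, not values from this thread.

```python
def training_stream(examples, total_steps, cooldown_fraction=0.1):
    """Yield (step, phase, model_input); examples is a list of (prefix, text)."""
    cooldown_start = int(total_steps * (1.0 - cooldown_fraction))
    for step in range(total_steps):
        prefix, text = examples[step % len(examples)]
        if step < cooldown_start:
            # Phase 1: the structured prefix conditions the text.
            yield step, "conditioned", f"{prefix} {text}"
        else:
            # Phase 2: raw text only, matching the RAW-only evaluation.
            yield step, "cooldown", text
```

For the DOMAIN and RANDOM-BRYLA controls, only the `prefix` strings passed in change; the schedule stays identical, so any difference after cooldown comes from the prefix content.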

8. Test serialization format

Current Bryła looks like a compact symbolic representation. That may be best for tiny models, but it should be tested.

Structured-representation work suggests that code-like formats may be less model-friendly than natural-language descriptions in some settings.

I would test:

BRYLA-symbolic
BRYLA-verbalized
BRYLA-hybrid
BRYLA-no-defaults

Example:

Symbolic:
[TYPE:fact] [POL:neutral] [SCOPE:general] [INTENT:inform] [CORE:yes]

Verbalized:
This is a neutral factual statement with general scope. The intent is to inform. The main content is central.

Hybrid:
[type: factual statement] [polarity: neutral] [scope: general] [intent: inform] [core: yes]

Also test field order, because sequence order can matter a lot for structured inputs.
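
The format variants can come from one set of fields, so the information stays constant and only the serialization changes. This is a sketch: the long field names, the default values, and the sentence template are assumptions, and the verbalized output here is more mechanical than the hand-written example above.

```python
# Hypothetical mappings; adjust to the real field set.
FIELD_NAMES = {"TYPE": "type", "POL": "polarity", "SCOPE": "scope",
               "INTENT": "intent", "CORE": "core"}
DEFAULTS = {"POL": "neutral", "SCOPE": "general"}

def symbolic(fields):
    return " ".join(f"[{k}:{v}]" for k, v in fields.items())

def hybrid(fields):
    return " ".join(f"[{FIELD_NAMES[k]}: {v}]" for k, v in fields.items())

def verbalized(fields):
    return " ".join(f"The {FIELD_NAMES[k]} is {v}." for k, v in fields.items())

def no_defaults(fields):
    """Drop every field that only carries its default value."""
    return symbolic({k: v for k, v in fields.items() if DEFAULTS.get(k) != v})

def reordered(fields, order):
    """Serialize the listed fields in the given order, for field-order tests."""
    return symbolic({k: fields[k] for k in order})

fields = {"TYPE": "fact", "POL": "neutral", "SCOPE": "general",
          "INTENT": "inform", "CORE": "yes"}
print(symbolic(fields))
print(hybrid(fields))
print(verbalized(fields))
print(no_defaults(fields))
```

Reporting the token count of each variant next to its score keeps the comparison honest, since the verbalized form costs more tokens.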

9. Polish datasets and resources

For natural Polish QA / MRC testing, I would look at these.

| Resource | Use |
| --- | --- |
| PolQA | Polish OpenQA; useful for question/answer type analysis and evidence passages |
| PolQA dataset | practical HF dataset |
| PoQuAD | Polish SQuAD-like QA, includes impossible questions and generative answer layer |
| PoQuAD paper | dataset background |
| PolEval 2024 QA task | Polish reading-comprehension evaluation style |
| PolEval 2024 QA GitHub | task data/code |
| PUGG | Polish KBQA/MRC/IR construction pipeline |
| PUGG GitHub | implementation |
| PUGG dataset | HF dataset |

I would not mix all of these into one training soup immediately.

Better:

small clean natural benchmark
+ controlled ablations
+ separate larger-data experiments later

10. Suggested next reporting table

A compact table like this would be very clear:

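| Setup | Data | Control type | Clean PPL | Task metric | Tokens | Wins/seeds | Comment |
| --- | --- | --- | --- | --- | --- | --- | --- |
| RAW | natural | baseline | | | | | |
| DOMAIN | natural | simple metadata | | | | | |
| BRYLA | natural | real structure | | | | | |
| SHUFFLED | natural | broken alignment | | | | | |
| RANDOM | natural | format control | | | | | |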

And by domain:

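| Domain | BRYLA > RAW? | BRYLA > DOMAIN? | BRYLA > SHUFFLED? | Parser OTHER% | Note |
| --- | --- | --- | --- | --- | --- |
| technical | | | | | |
| geography | | | | | |
| biography | | | | | |
| science | | | | | |
| daily life | | | | | |
| sports | | | | | |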

11. What would make the claim much stronger

The result would become much harder to dismiss if the next stage shows:

BRYLA > RAW
BRYLA > DOMAIN
BRYLA > SHUFFLED-BRYLA
BRYLA > RANDOM-BRYLA

on small natural Polish data, with:

clean target-only loss
parser coverage reported
field entropy reported
token cost reported
domain-level breakdown

That would support the claim:

Bryła adds useful structure beyond domain conditioning and prefix-format effects.

12. What would weaken the claim

These would not kill the project, but they would change the interpretation:

| Observation | Interpretation |
| --- | --- |
| DOMAIN ≈ BRYLA | Bryła may mostly encode domain/topic |
| SHUFFLED ≈ BRYLA | field-value alignment may not matter |
| RANDOM helps | prefix format may act as regularization |
| gains vanish on natural data | synthetic setup may be too clean |
| gains only appear in full PPL | tag-prediction artifact |
| parser outputs mostly [OTHER] | structure is not reaching the model |
| Bryła works only in one domain | still useful, but domain-specific |

Short version

This is good progress.

The next step is not “make it bigger.”
The next step is:

small natural data
+ DOMAIN control
+ shuffled-structure control
+ random-prefix control
+ clean PPL
+ parser diagnostics

If Bryła still wins there, the result becomes much stronger.

Hi krzysiekpl,

Before anything else: the most valuable thing in your post isn’t the 24/27 — it’s that you caught your own confound (9,080 unique texts vs 483 unique structures, a 20× diversity collapse), reported it, and rebuilt the test around it. Most people bury that. That habit is worth more than any single result, and it’s the exact discipline that will carry this project wherever it goes.

Your diagnosis is correct, and I want to say that plainly because we started from the same sentence you did: LLMs keep recomputing the same meaning from scratch. A transformer re-derives polarity, negation, scope, and causality it has already derived — every token, every layer. Precompute-once should win. Your controlled grid showing structure transmits real signal is consistent with the metadata-conditioning literature (John6666’s control ladder above is excellent — I’d run exactly that next, especially SHUFFLED and RANDOM).

Where I think the implementation will fight you is not the idea — it’s the carrier. By routing structure through the token stream, you pay twice: every wall re-enters the O(N²) attention you were trying to relieve (your own earlier ablation thread showed that bill — heavy FULL prefixes costing far more tokens than they earned back), and the rule parser becomes a single point of failure standing between meaning and the model. Natural language will always be messier than the parser.

The thing we found building our system: the structure you’re extracting already exists in the model’s latent geometry — it doesn’t have to ride the token stream at all. Three concrete mechanisms from our project, each with a measured receipt rather than a claim:

1. Structure as sign-bits, not tags. Instead of parsing text into ~28 wall tokens, we take a frozen ±1 random projection of the latent vector and keep only the signs: 64 bits, one uint64 per position. This is SimHash, an angle-preserving hash that rests on the same random-projection argument as the Johnson–Lindenstrauss lemma behind our retrieval router. Cost: 8 bytes per position, zero extra tokens, zero parser. Comparing two positions’ structure is one XOR + one POPCNT, which is hardware-native on your 2060.

2. The honesty gate, because the estimator is lossy. We adversarially tested this: at 32 bits the signature retrieved a planted needle at 2× context compression at every depth — at one cell the decode was identical to the full-precision router — but it failed the hardest cell (4× budget, deep needle) where full precision still succeeded. A real resolution boundary, mapped, not hand-waved. 64 bits — the same 8 bytes — restored every cell. The principle transfers directly to your setup: the claim is only as strong as the control that could have killed it.

3. Meaning computed once, stored exactly, never recomputed. Our context cache stores attention keys as discrete integer residue blocks that come back byte-exact from an NVMe drive or a network socket — the stored object is the compute operand, so “precomputed meaning” never re-enters the model as extra input; it’s recalled, not re-derived. As I write this, the box behind me is mid-way through a 32,000-token run streaming ~1 GB per token off a small Optane drive to prove that end-to-end, with the entire routing index for 32k context occupying ~59 MB of RAM.
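
For anyone who wants to try the sign-bit signature from point 1 without the rest of the stack, here is a plain-Python sketch. It is not the implementation described above, only an illustration of the general SimHash recipe; the function names, the seed, and the vector dimension are assumptions. For Gaussian random hyperplanes the expected fraction of differing bits between two vectors is their angle divided by π; a ±1 projection behaves similarly but only approximately.

```python
import random

BITS = 64

def make_projection(dim, bits=BITS, seed=0):
    """Frozen random +/-1 projection, one row per output bit."""
    rng = random.Random(seed)
    return [[rng.choice((-1.0, 1.0)) for _ in range(dim)] for _ in range(bits)]

def signature(vector, projection):
    """Pack the signs of the projected vector into one integer."""
    sig = 0
    for bit, row in enumerate(projection):
        dot = sum(w * x for w, x in zip(row, vector))
        if dot >= 0.0:
            sig |= 1 << bit
    return sig

def hamming(a, b):
    """XOR followed by a population count."""
    return bin(a ^ b).count("1")

projection = make_projection(dim=16)
rng = random.Random(1)
v = [rng.gauss(0.0, 1.0) for _ in range(16)]
nearby = [x + 0.05 * rng.gauss(0.0, 1.0) for x in v]
print(hamming(signature(v, projection), signature(nearby, projection)))
```

A slightly perturbed vector should flip only a few bits, while an unrelated vector should differ in roughly half of them.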

Why I’m telling you this on your thread: you’re building solo on an RTX 2060 12GB, and that exact card is the named next stage of our roadmap (we call it Stage Beta) — porting the whole discrete stack from CPU to those CUDA warps. The data layouts we built for AVX-512 map directly onto 32-thread warps, and a 0.6B-class model plus the full 32k context structure fits in a fraction of your VRAM. The repro will be public when it lands; if you get there first, I’d genuinely like to compare notes.

Everything is open: papers, gates, and closure notes with the failures left in.

One concrete experiment for your next round, alongside John6666’s ladder: add a condition where the Bryła text prefix is replaced by a signature-conditioned prefix — the sign-bits of a small encoder’s hidden state (or a learned projection of your parser’s fields) injected as one or two soft vectors instead of ~28 wall tokens. If it matches or beats the text tags, you’ve eliminated the token tax and the parser dependency in a single move, and your representation becomes 8 bytes instead of a sentence.

You diagnosed the right disease on the right hardware. The cure isn’t more tokens — it’s fewer floats. Let the physics do the work.

Edit:

I just wanted to add that with SimHash the 64-bit signature actually improves perplexity (-0.97% and -0.12%) at the 2× and 4× bounds. That suggests the SimHash isn’t just a lossy compression hack; it acts as a regularization filter, stripping low-magnitude noise from the attention matrix and forcing the model to focus on the dominant semantic geometry. At 8× (+6.08%), though, the resolution collapses, the noise floor eats the signal, and we drop back to f32 on this machine.

The machine is an Intel NUC Beast Canyon:

CPU: Core i9-11900KB, 8 cores / 16 threads, 3.3 GHz base, up to 4.9 GHz Turbo (5.3 GHz Thermal Velocity Boost), 10 MB L2, 24 MB L3 (SVM) cache, 65 W TDP
Memory: 2× DDR4-3200 SODIMM slots, currently holding 32 GB (2×16 GB) of 2666 MHz memory because of a mix-up when I purchased it
Storage: 1× CPU-attached slot (PCIe 4.0 x4) with a 32 GB Optane drive; 3× PCH-attached slots (PCIe 3.0 x4, 2242/2280) holding a 16 GB Optane drive and 2× 1 TB NVMe drives
GPU: RTX 2060 12GB

Thank you both — genuinely.

@John6666 — your control ladder wasn’t just useful, it changed how I
test. This isn’t the first time your suggestions pushed me to a sharper
experiment, and each time the result got more trustworthy because of it.
The SHUFFLED / RANDOM / DOMAIN idea in particular saved me from publishing
a wrong conclusion. Thank you.

And thank you to the second reply too — the “structure already lives in
the model’s latent geometry, the cure is fewer floats not more tokens”
framing, plus the concrete signature-vector experiment, gave me a real
variant to measure. I added it (more below).

Here’s what the controls actually showed — including where I was wrong.

I ran the controls. First result was uncomfortable.

On my balanced synthetic set, with the model predicting the SAME walls it
received as input:

RAW       96.8%
DOMAIN    96.3%
BRYLA    100.0%
SHUFFLED  99.9%
RANDOM    96.2%

BRYLA beats RAW, DOMAIN and RANDOM — good. But BRYLA ≈ SHUFFLED. That was
the important signal: shuffling which value belongs to which wall barely
hurt. So the structure (the wall assignment) wasn’t being used — the model
was essentially copying the value set from input to output. The test was
measuring reconstruction, not meaning. Exactly the “predicting easy tags”
trap John warned about.

Diagnosis: the values were self-identifying

In that setup each value was unique to one wall (“formal” only ever in
REG, “negative” only in POL). So the model never needed the wall labels —
the value itself revealed its wall. A bag of values was enough. Shuffling
labels didn’t change the bag, so it didn’t hurt.

Redesigned test: hidden target + overlapping values

Two changes:

  1. The model predicts a HIDDEN wall that is NOT in the input — so it must
    infer, not copy.
  2. All walls share the same value space (A/B/C), so a value no longer
    reveals its wall. Now the model MUST use the assignment.

Target rule: cel = (S1 == S2), where cel is Polish for “target”. It requires distinguishing S1 from S2.
Text carries only a neutral topic, never the target.
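
A sketch of what such a hidden-target generator could look like. The third wall S3 is a hypothetical distractor, and uniform sampling of values is an assumption. Under uniform sampling, always predicting the more common label gives 2/3 accuracy, which is consistent with the 66.7% chance column in the results.

```python
import itertools
import random

VALUES = ("A", "B", "C")
INPUT_WALLS = ("S1", "S2", "S3")  # S3 is a hypothetical distractor wall

def make_example(rng):
    """Sample one input and its hidden target cel = (S1 == S2)."""
    walls = {w: rng.choice(VALUES) for w in INPUT_WALLS}
    target = walls["S1"] == walls["S2"]
    return walls, target

def majority_baseline():
    """Accuracy of always predicting the more common label (uniform values)."""
    outcomes = [s1 == s2 for s1, s2 in itertools.product(VALUES, repeat=2)]
    p_equal = sum(outcomes) / len(outcomes)
    return max(p_equal, 1.0 - p_equal)

rng = random.Random(0)
print(make_example(rng))
print(round(majority_baseline(), 3))  # 0.667
```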

Result (2700 pairs, balanced, with and without text — nearly identical):

          chance   RAW    BRYLA   SHUFFLED   RANDOM
w/ text   66.7%   68.6%   100%    79.0%      66.4%
no text   66.7%   68.6%   100%    81.0%      68.6%

Now SHUFFLED collapses ~20 points below BRYLA. RANDOM sits at chance.
RAW sits at chance. So: the gain comes from real values in their correct
wall positions — not from prefix format, not from text.

The actual conclusion (conditional, not triumphant)

Putting both tests together:

unique values per wall   -> SHUFFLED = BRYLA  (structure redundant)
shared values per wall   -> SHUFFLED < BRYLA  (structure necessary)

So wall structure carries information exactly when values are NOT self-
identifying. And overlapping values are precisely the natural-language
case — “high” can be certainty, urgency, or intensity; “not” can be
negation or something else. My first synthetic set was simply too clean,
which hid the value of structure. That’s the honest result.

On the signature-vector idea

I also added a variant where the walls enter as ONE projected feature
vector instead of ~28 tokens (the “fewer floats, no token tax” direction).
On the clean set it matched the token form. I want to re-run it on the
overlapping-values task before claiming anything — that’s the harder case
and the fair test. Will report back.

What I’m NOT claiming

Still synthetic. This shows structure CAN matter and WHEN. It does not yet
show that my parser, on real Polish text, produces enough value-overlap
for structure to help in practice. That’s the next step you both pointed
at: small natural dataset, clean target-only loss, parser coverage
reported. Now I have a reason to invest in it — I know structure isn’t
redundant a priori.

Thanks again. This thread moved the project further than the previous
month did.

This is the best kind of update — you ran the control that could have killed the claim, published the uncomfortable number, diagnosed why it happened, and rebuilt the test so the claim was falsifiable again. The two-test pair you ended up with is honestly more valuable than the original 24/27 result, because it isn’t a score anymore — it’s a boundary map:

values self-identifying  ->  structure redundant   (SHUFFLED ≈ BRYLA)
values overlapping       ->  structure necessary   (SHUFFLED << BRYLA)

That conditional is a real finding, and your reading of it is exactly right: natural language lives on the overlapping side (“high” can be certainty, urgency, or intensity), so your first clean corpus was hiding the value of structure rather than disproving it.

Three things that might sharpen the next round:

1. The SHUFFLED 79–81% is not noise. It’s probably the multiset leak, and you can confirm it cheaply. Your target is cel = (S1 == S2). Shuffling destroys which wall holds which value, but it can’t destroy the bag of values, and in some fraction of your pairs the bag alone determines the answer (e.g. if every wall holds the same value, equality is forced; if all the values are distinct, it is ruled out). Prediction: if you split SHUFFLED’s results by “answer recoverable from the multiset alone: yes/no,” you’ll find it near-perfect on the first bucket and near-chance on the second. That turns a mysterious 79% into a confirmed mechanism, and it gives you a stronger headline: BRYLA = 100% on exactly the examples where binding is the only available signal.

2. You can predict whether natural data will reward structure before training anything. Your boundary map says structure matters iff values overlap across walls. So measure the overlap directly on a natural corpus through your parser: for each value string, count how many distinct walls it appears in (value→wall entropy / collision rate). That’s a few hundred lines of counting, no GPU — and it tells you in advance whether the natural-data experiment can possibly show a BRYLA > SHUFFLED gap. If real Polish through your parser produces mostly wall-unique values, you’ll know to expect SHUFFLED ≈ BRYLA and why, instead of being ambushed by it. It also folds John’s diversity metrics into a causal story rather than a checklist.

3. One warning on the signature-vector variant before you run it on the hard task — make sure the projection preserves binding. The hidden-target task is solvable only by knowing which value sits in which wall. A projection of the slot-ordered concatenation (wall 1’s value embedding, then wall 2’s, …) preserves that assignment, because position in the input vector is the wall label. But anything that pools first — summing or averaging the value embeddings before projecting — produces a bag-of-values vector, structurally identical to SHUFFLED, and the variant will fail for a reason that has nothing to do with the idea. (This is an old and deep problem — “role-filler binding” in the vector-symbolic literature; in our system the equivalent property comes free because signatures are computed per-position, never pooled.) If the slot-faithful projection matches the ~28-token form on the overlapping-values task, you’ll have shown the structure can ride a single dense vector with zero token cost — which would be the strongest result in the thread so far.
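
Points 1 and 3 are two sides of the same fact: a bag of values forgets the wall. A small sketch, assuming three input walls ordered (S1, S2, S3) over the shared values A/B/C and one-hot value embeddings (all assumptions for illustration):

```python
from collections import Counter

VALUES = ("A", "B", "C")

def answer_from_multiset(values):
    """Return True/False if the bag alone forces S1 == S2, else None.

    Assumes S1 and S2 are two of the walls whose values are listed.
    """
    counts = Counter(values)
    if len(counts) == 1:
        return True           # every wall holds the same value
    if max(counts.values()) == 1:
        return False          # all values distinct, so S1 != S2
    return None               # the bag is compatible with both answers

def one_hot(value):
    return [1.0 if value == v else 0.0 for v in VALUES]

def slot_ordered(values):
    """Concatenate per-wall one-hots: position in the vector is the wall."""
    return [x for value in values for x in one_hot(value)]

def pooled(values):
    """Sum per-wall one-hots: a bag-of-values vector with no wall identity."""
    return [sum(col) for col in zip(*(one_hot(v) for v in values))]

a = ("A", "A", "B")  # S1 == S2
b = ("A", "B", "A")  # S1 != S2
print(pooled(a) == pooled(b))              # True
print(slot_ordered(a) == slot_ordered(b))  # False
print(answer_from_multiset(a))             # None
```

The two inputs have opposite targets but identical pooled vectors, so no model reading only the pooled vector can separate them. The slot-ordered vectors differ, so a projection applied to them can still carry the binding as long as it does not collapse that difference.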

The trajectory here — claim, control, refutation, diagnosis, sharper claim — is the whole method. Most people never make it past step two. Looking forward to the overlap statistics from real Polish; that number now decides everything downstream.
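
As a starting point for those overlap statistics, here is a counting sketch. It assumes the parser output is available as one dict of wall to value per sentence; the wall names in the usage example are made up.

```python
import math
from collections import Counter, defaultdict

def value_wall_overlap(parses):
    """parses: iterable of dicts wall -> value, one per sentence."""
    walls_per_value = defaultdict(Counter)
    for parse in parses:
        for wall, value in parse.items():
            walls_per_value[value][wall] += 1
    total = sum(sum(c.values()) for c in walls_per_value.values())
    # Occurrences of values that show up under more than one wall.
    ambiguous = sum(sum(c.values()) for c in walls_per_value.values() if len(c) > 1)
    entropy = {}
    for value, c in walls_per_value.items():
        n = sum(c.values())
        entropy[value] = -sum((k / n) * math.log2(k / n) for k in c.values())
    return {"collision_rate": ambiguous / total, "wall_entropy_bits": entropy}

# Made-up parses: "high" appears under two walls, "negative" under one.
stats = value_wall_overlap([
    {"CERT": "high", "POL": "negative"},
    {"URG": "high", "POL": "negative"},
])
print(stats["collision_rate"])   # 0.5
```

A collision rate near zero would predict SHUFFLED ≈ BRYLA on that corpus; a high rate would predict a gap.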