Scaling Study Construction Domain

Does Parameter Count Predict
Construction-Domain Performance?

Testing 29 models — 4 scaling families (Qwen, Gemma 4, Ministral 3, Nemotron 3, with cloud high-ends) plus 7 standalone state-of-the-art reference points (DeepSeek V4, GLM-5.1, Kimi K2.6, MiniMax, Xiaomi MiMo) — on contract analysis, delay analysis, and schedule generation/CPM tasks. No fine-tuning. Prompt-engineered prompts only.

29
Models Scored
174
Test Runs
6
Eval Tasks
4+SOTA
Families + Refs

Contents

  1. Methodology and full grading rubric
  2. Full results matrix grouped by family
  3. Per-task scaling curves across families
  4. Key discriminator heatmap

Executive Summary

The Fine-Tuning sister study showed that fine-tuning small construction-domain SLMs can backfire. Qwen3-4B scored 91 percent out of the box, dropped to 78.6 percent after fine-tuning, with catastrophic forgetting in schedule reasoning. This report asks the next question: if fine-tuning is risky, how far does raw parameter count alone take you on the same tasks?

We evaluated 29 models across 4 scaling families plus 7 standalone SOTA reference points on the identical six-task construction benchmark (T1 to T3 plus three curveball variants CB1 to CB3) using prompt-engineered prompts only. No fine-tuning. Manual rubric-based scoring against golden answers. The four families (Qwen, Gemma 4, Ministral 3, Nemotron 3) span from 0.8B local up to cloud high-ends; the seven standalone giants (DeepSeek V4 Pro/Flash, GLM-5.1, Kimi K2.6, MiniMax M2.7, Xiaomi MiMo v2.5 Pro/base) are plotted as isolated reference points, with any model above 300B total parameters drawn as a horizontal asymptote across the scaling charts.

Result: parameter count is a weak predictor of construction-domain performance. Architecture, training mix and reasoning support dominate. The best 27B dense model (Qwen 3.6 27B) reaches 91.3 of 100, outperforming a 35B mixture-of-experts variant from the same family (81.8) and a 31B dense Gemma 4 (83.0). Within Ministral 3, the 8B sibling beats the 14B by 3.8 points scaling is non-monotonic.

Top-Line Scaling Curve

Average Score (out of 100) vs Total Parameters (log scale) — family curves + SOTA points + >300B asymptotes

Top-Line Scoreboard Grouped by Family

ModelActive ParamsContractsDelaysSchedulesAvg / 100

Headline Findings

Diminishing returns at the frontier

Adding seven state-of-the-art giants — DeepSeek V4 Pro (1.6T), Xiaomi MiMo v2.5 Pro (1.02T), Kimi K2.6 (1T), GLM-5.1 (744B), MiniMax M2.7 (230B), DeepSeek V4 Flash (284B), Xiaomi MiMo (310B), plus the cloud high-ends of each family — does not break past the ceiling set by mid-size models. The standalone SOTA cluster lands between 74 and 90 average, the same band already occupied by Qwen 3.6 27B (91.3) and the cloud Gemini 3.5 Flash (91.1). The single clearest illustration: DeepSeek V4 Pro at 1.6 trillion parameters scores 85.7 — below its own 284B Flash sibling at 89.2. Kimi K2.6 (1T) scores 85.0; Xiaomi MiMo v2.5 Pro (1.02T) scores 83.5, below its own 310B base variant (85.4). The free GLM-5.1 (744B) reaches 90.1. The scaling charts make this visually explicit: the seven >300B asymptote lines all converge into the same narrow 85-91 band that the best sub-30B models already reach. For this construction benchmark, returns to scale flatten hard above roughly 25-40B active parameters, and total-parameter count beyond ~300B adds essentially nothing.

Parameter count is a weak predictor of performance

The largest model in the study (Qwen 3.6 35B-A3B, a 35 billion parameter mixture-of-experts variant with 3 billion active) averages 81.8 out of 100. The 27 billion dense sibling from the same family reaches 91.3. The 14 billion Ministral 3 dense model averages 72.5, while its 8 billion sibling scores 76.3. Within each family the curve is non-monotonic. Total parameter count alone explains less than half of the cross-model variance. Architecture, training mix, and reasoning support dominate raw parameter scaling.

Dense beats MoE at fixed active-parameter budget

Qwen 3.6 27B dense (91.3) outperforms the 35B-A3B mixture-of-experts variant (81.8) despite the latter having 30 percent more total parameters. Gemma 4 follows the same pattern: the 31B dense (83.0) and 26B-A4B (84.9) sit close to each other, but both trail Qwen 3.6 27B. For reasoning-heavy construction tasks the binding constraint appears to be active parameters, not total parameters. Mixture-of-experts variants pay a measurable accuracy cost on multi-step delay attribution and CPM reasoning.

Schedule generation universally overshoots

Task T3 asks the model to generate an 18-activity baseline schedule for a Cologne residential building from historical benchmark data. The golden answer arrives at 370 working days through a critical-path topology that uses start-to-start and finish-to-finish parallelism. Out of 16 models tested, only Gemma 4 E2B undershoots the golden duration (285 wd) and only Gemma 4 E4B comes within 35 percent (494 wd). The remaining models overshoot by factors ranging from 1.5 to 2.5, with most defaulting to long finish-to-start chains. This appears to be a structural weakness shared across all current base small language models: the models can list standard residential activities and can choose plausible durations from historical ranges, but they do not yet construct the parallel sub-schedules that real planners use.

CPM arithmetic is solvable; predecessor selection is not

Where T3 asks for schedule generation, CB3 provides a fixed 18-activity network and asks for the deterministic CPM forward and backward pass. The golden duration is 265 wd. Out of 16 models, six hit 265 exactly Qwen 3.6 27B, Qwen 3.6 35B-A3B, Qwen 3.5 9B, Qwen 3.5 4B, Gemma 4 26B-A4B, and Gemma 4 31B. These same models also correctly handle the SS+10 lag that makes Activity 8 critical with zero float, the rubric key discriminator for this task. The implication is that CPM computation on a given network is a tractable mechanical task for mid-size models, while predecessor-selection in schedule synthesis (T3) remains qualitatively harder.

Generational uplift within Qwen

Qwen 3.5 Flash, a hybrid linear plus sparse mixture-of-experts model with roughly 9 billion active parameters, averages 80.4. The next-generation Qwen 3.6 27B dense averages 91.3, an uplift of 10.9 points at similar effective compute. The mixture-of-experts variant from Qwen 3.6 (35B-A3B) gains only 1.4 points over the previous-generation flash. This suggests that the dense path in Qwen 3.6 captures most of the generational improvement, while the MoE path is closer to its predecessor in practical capability on this benchmark.

Reasoning capability has a lower size threshold

The Qwen 3.5 family spans 0.8B, 2B, 4B, 9B, and the cloud Flash variant. At 4B the model reasons cleanly through all six tasks and closes its think tags reliably. At 2B, reasoning still works but spirals on three of six tasks and recovers on the second attempt. At 0.8B, the thinking trigger spirals on all three attempts on the first task. After falling back to non-thinking mode the 0.8B model averages 36.2 out of 100, well below the 4B's 76.4. The practical reasoning floor for these construction tasks sits somewhere between 2B and 4B for Qwen-style models, and is lower for Matformer variants. Gemma 4 E2B at roughly 2 billion effective parameters completes all six tasks with thinking enabled and averages 74.75.

Curveball generalisation is uneven

CB1 (Finnish YSE 1998 contract, 21 clauses) is the hardest curveball. Only three models hit the rejection key discriminator on clause C-002 (lack of enforcement mechanism): Qwen 3.6 27B, Qwen 3.6 35B-A3B, and Nemotron 3 Nano 30B-A3B. Several mid-size models default to "Accepted" or "Requires Review" on clauses requiring modification under Finnish standards, suggesting that contract reasoning is heavily anchored in training data familiarity. CB2 (FIDIC offshore wind concurrent delay) is the easiest curveball: all sixteen models correctly classify the vessel breakdown as contractor-risk. The English-law Adyard concurrency principle is recognised by about half the field.

The champion model and the cost-effective alternative

Qwen 3.6 27B dense is the construction-domain champion at 91.3 of 100, hitting 9 of 10 key rubric discriminators across the six tasks. It achieves near-perfect contracts performance (96.5), strong delays handling (86.5), and reaches the ceiling on CPM analysis (98). Its only consistent weakness is the universal T3 schedule-generation overshoot. For local deployment on a 16GB consumer GPU, the strongest small model is Qwen 3.5 9B at 77.9 average; on Matformer architecture, Gemma 4 E4B at roughly 4 billion effective parameters reaches 77.6 with the fastest wall time of the local cohort. The Fine-Tuning study's headline that Qwen3-4B base hits 91 on construction tasks remains broadly consistent with the Qwen line's strong base performance demonstrated here.

Methodology

This study reuses the evaluation infrastructure of the Fine-Tuning report. Same six tasks, same grading rubric, same golden answers. Only the prompt path differs: the system prompt for every call is the prompt-engineered prompt for that task, never the vanilla training prompt. No model is fine-tuned for this study.

Inference Parameters (every model, every call)

ParameterThinking modelNon-thinking model
temperature0.60.15
top_p0.90.9
min_p0.060.06
max_tokens3276832768

Eval Task Inputs

TaskDomainSystem PromptUser Content
T1Contractscontracts_pe_system.txtHamburg Tower 14-article contract + 25-clause database
T2Delaysdelays_pe_system.txtAP vs AB Residential baseline + as-built CSV (20 activities)
T3Schedules (generation)schedules_pe_t3.txtCologne residential brief + 14 historical projects benchmark
CB1Contracts (curveball)contracts_pe_system.txtVAN-MIX Finnish YSE 1998 21-clause contract + same 25-clause DB
CB2Delays (curveball)delays_pe_system.txtGrim Tide offshore wind FIDIC Yellow Book, 3 delay events
CB3Schedules (CPM)schedules_pe_cb3.txtNorthbrook Solar 50MW EPC NEC3, 18 activities

Grading Rubric v1.0

Applies to: All models evaluated across Contracts, Schedules, and Delays domains.
Scope: Model-agnostic. Works for base models, fine-tunes, and third-party comparators.

This rubric is methodology only. Individual model scores live in the Results tab. Raw model outputs live in results per-model folders.

Scope What must be evaluated per base model

Every base model evaluation requires 6 tests, not 3:

TestDomainScenario
T1ContractsHamburg Tower 14-article NEC3 review
T2DelaysAB v AP Residential TIA + EOT
T3SchedulesCologne Residential 18-activity Schedule Generation
CB1ContractsVAN-MIX Finnish YSE 1998 21-clause review
CB2DelaysGrim Tide Offshore Wind FIDIC FM + concurrent delay
CB3SchedulesNorthbrook Solar 50MW EPC 18-activity CPM

T1 to T3 test competency on the training domain. CB1 to CB3 test generalisation to novel jurisdictions, project types, and contract forms unseen in training. A base model score requires all 6 tests complete. Do not report a domain score from T-only results.

Why a rubric?

Exact match scoring fails for generative LLMs. A model may produce the wrong label with correct reasoning (partial credit warranted), or produce a correct label via nonsense reasoning (partial credit withheld). This rubric rewards domain understanding, not string matching. Three principles:

  1. Label and Reasoning. Both must be scored independently.
  2. Key Discriminators. Each domain has one or two items that cleanly separate models that understand the domain from models that guess. These items carry extra weight.
  3. Structural Validity. Invalid output (malformed JSON, impossible values) is penalised at the output layer, not the reasoning layer.

CB Test Coverage What each curveball tests

CB tests apply the same rubric as T tests. Scenario changes; scoring criteria do not. Key discriminators differ because the curveball presents different traps and non-obvious cases.

TestScenarioJurisdiction / FormKey Trap
CB1VAN-MIX Finnish YSE 1998 21 clausesFinland, YSE 1998C-002: no enforcement mechanism = Rejected. Insurance undervalue. Missing bonds.
CB2Grim Tide Offshore Wind 3 delay eventsEngland, FIDIC Yellow BookConcurrent delay (Nov 1-26). FM vs Contractor-risk classification.
CB3Northbrook Solar 50MW EPC 18 activitiesNEC3 EPC, UKActivity 8 critical via SS+10. Activity 12 high float but external DNO constraint.

Combined base model score = T + CB weighted average. Do not report domain scores from T-only or CB-only results.

Domain 1 Contracts (Total: 100 pts)

A. Status Label Accuracy 50 pts

Each article is scored independently. Points per article set by weight (see table below).

Partial credit ladder (applies per article):

Predicted → / Golden ↓AcceptedModificationRequires ReviewRejected
Accepted100%0%25%0%
Modification25%100%50%50%
Requires Review25%50%100%25%
Rejected0%50%25%100%

Rationale: Modification-to-Rejected confusion is penalised less than Accepted-to-Rejected (the model at least flagged a problem). Accepted-to-Modification or better is a meaningful directional error.

Article weights:

ArticleWeight (pts)Reason
Art 108Key Discriminator. Only Rejected article. Rule 3 must be explicitly invoked.
Art 76Two-clause split (partial accept/modify). Non-trivial.
Art 64Requires cross-referencing DB annotation category.
Art 34Multi-condition clause.
Art 94Requires Review rare label; model must distinguish from Modification.
All other articles (9 articles)2 each = 18Standard single-condition clauses.
Total50
B. DB Clause ID Accuracy 20 pts

Score per article (20 pts total, ~1.4 pts each):

  • Correct DB ID = full marks
  • Correct category, wrong specific ID = 50%
  • Wrong category = 0%

Rationale: DB IDs are the RAG retrieval signal. Wrong IDs indicate failure to ground in provided database even when the label is correct.

C. Reasoning Quality 30 pts

Score only for 5 key articles: Art 10, Art 7, Art 6, Art 3, Art 9 (6 pts each).

ScoreCriteria
6/6Correct rule cited, correct clause element identified, correct DB entry referenced
4/62 of 3 above correct
2/61 of 3 above correct, or correct reasoning but wrong label
0/6No substantive reasoning or circular ("rejected because it should be rejected")

Art 10 special rule: If model outputs "Modification" but reasoning explicitly invokes Rule 3 ("completely unacceptable clause"), score C at 4/6 and A at 50% (reasoning correct, label wrong).

CB1 Key Discriminators VAN-MIX Finnish YSE 1998

Scenario: 21-clause Finnish mixed-use development contract (EUR 45M, Vantaa). Non-English jurisdiction. Tests RAG against same 25-clause DB used in T1.

ItemScore impactWhy
C-002 RejectedFull A weight for C-002No enforcement mechanism worse than DB3 (mediation/arbitration missing entirely). Rule 3 applies. DB3 must be matched.
C-017 CAR insurance flaggedA + B weightEUR 22M cover = 49% of EUR 45M contract value. Below DB8 standard. Model must detect value discrepancy vs DB entry.
Missing Performance Bond flaggedReasoning (C)DB11 category absent entirely from contract. Full-marks model lists in missing_db_categories.
C-012 LD cap at 3% vs DB18 10%A + B weightBelow DB18 standard. Same DB comparison logic as T1.
C-015 DLP 12 months vs YSE 24 monthsA + B weightFinnish YSE 1998 requires 24-month takuuaika. DB9 match required.

CB1 Golden: 21 clauses 1 Accepted (C-001), 10 Modification, 9 Requires Review, 1 Rejected (C-002). 7 missing DB categories. Apply Domain 1 rubric exactly. C-002 plays the Art 10 role (the sole Rejected clause, key discriminator).

Domain 2 Schedules

T3 and CB3 use different rubrics. T3 tests schedule generation; CB3 tests CPM analysis. Score each with its own rubric below.

T3 Schedule Generation (100 pts)

Tests whether the model creates a complete baseline schedule from historical project data. Cologne Residential Building, 2022, Sand soil, EUR 35M, 1500 sqm, 4 floors. Historical DB of 3 projects + benchmark table provided in test message.

A. Activity Completeness 25 pts

ScoreCriteria
25/25All 18 standard activities present, correctly named
18/2514-17 activities present with recognisable standard names
10/2510-13 activities present
0/25Fewer than 10 activities; or non-standard names

B. Duration Validity 30 pts

Accept if within historical benchmark range. Exact match to golden answer not required.

ScoreCriteria
30/30All 18 durations within benchmark ranges AND justifications reference project parameters (size, soil, floors)
22/3015-17 durations within range; or all within range but no justifications
14/3012-14 durations within range
0/30Fewer than 12 within range; or durations from wrong project type

C. Predecessor Logic 25 pts

ScoreCriteria
25/25Valid construction sequencing throughout; no circular dependencies; mix of FS/SS/FF relationships present
18/25Mostly logical; missing some constraints (all FS only); minor sequencing issues
10/25Several illogical sequences but no circular dependencies
0/25Circular dependency present; or completely illogical (elevator before concrete)

D. Output Format 20 pts

ScoreCriteria
20/20Valid JSON; all required fields present; project summary included
14/20Valid JSON but missing some fields; or minor formatting issues
7/20Parseable JSON but wrapped in markdown code fence; or missing major fields
0/20Not valid JSON; or plain text commentary instead of JSON

T3 Golden: Cologne project, 370 wd, critical path 1->2->3->5->7->9->11->13->12->14. Activity 7 (Exterior Plastering) critical via 5FF from Activity 5. Activity 6 (Exterior Walls, 72 wd) NOT critical (TF=158).

CB3 CPM Analysis (100 pts)

Tests CPM arithmetic on a given schedule (Northbrook Solar Farm 50MW EPC). All 18 activity durations and predecessors provided in the test message.

A. Project Duration 25 pts

Band scoring against golden answer (265 wd):

Error bandScore
≤ +/-20 wd25/25
≤ +/-50 wd17/25
≤ +/-100 wd10/25
≤ +/-150 wd5/25
> +/-150 wd0/25

B. Critical Path Identification 30 pts

Key Discriminators (binary, 6 pts each):

ItemPointsWhy
Activity 8 IS on Critical Path6Key Discriminator. SS+10 makes ES8=175, EF8=220 = EF7=220. Zero float. Models treating 8 as non-critical miss this entirely.
Activity 9 NOT on Critical Path6TF=136; models naively tracing longest path include it incorrectly

Remaining CP activities (2 pts each, up to 18 pts): Award 2 pts per correctly identified critical activity. Wrong inclusion penalty: -1 pt per activity incorrectly placed on CP. Minimum 0 pts for section B.

C. CPM Structural Validity 25 pts

Sub-criterionPointsPass condition
All activities present5Count matches input (no missing/hallucinated activities)
Input durations correct8Model uses stated durations, not hallucinated values
No negative Total Float5All TF >= 0 (negative TF = invalid CPM)
Complete backward pass4LS/LF populated for all activities
ES/EF internal consistency3ES + Duration = EF for all activities

If input durations are wrong, cap section C at 12/25 regardless of computational correctness.

D. Relationship Type Handling 20 pts

Sub-criterionPointsEvidence required
Finish-to-Finish (FF) chain recognised8Model output or reasoning shows awareness of FF lag between relevant activities
Start-to-Start (SS) chain recognised8Model output or reasoning shows awareness of SS dependency
Multi-predecessor merge logic4Late merge correctly identified (largest predecessor governs)
CB3 Key Discriminators Northbrook Solar 50MW EPC

Scenario: 18-activity utility-scale solar PV farm. SS+10 and SS+15 relationships. Non-building topology; tests generalisation from residential CPM training data.

Golden: Project duration = 265 wd (inside 280 wd target). Critical path: 1->2->3->5->6->7->8->14->15->17->18.

Domain 3 Delays (Total: 100 pts)

A. Delay Event Identification 35 pts

Four delay events (DEL-001 through DEL-004), up to 9 pts each (partial for correctly identified events; cap at 35 total). Per event:

Sub-criterionPoints
Activity correctly identified2
Duration (within +/-10 wd of golden)3
Responsibility correctly assigned (Employer/Contractor/Neutral/Concurrent)4
B. Critical Path Reasoning and EOT 40 pts

This is the highest-weight section because it is the core professional judgment the domain tests.

Sub-criterionPointsNotes
DEL-002 identified as on Critical Path10Contractor delay on CP; no employer EOT for this event
DEL-003 identified as NOT on Critical Path10Key Discriminator. Employer delay but non-critical = cost-only claim, zero EOT entitlement. Models that sum all employer delays fail here.
EOT = 0 recommendation12Correct conclusion requires correct CP analysis of both above items
Float consumption reasoning8Model shows awareness that float absorbs non-critical delays rather than granting EOT

DEL-003 special rule: If model identifies DEL-003 as employer-caused AND correctly states it does not extend the critical path (even if EOT calculation is wrong), award 8/10 for that sub-criterion.

C. Output Quality 25 pts
Sub-criterionPointsPass condition
Valid JSON (parseable)5No comments, trailing commas, or syntax errors
Delay cascade / concurrency analysis7Model checks for overlapping delays, does not double-count
Cost vs. EOT distinction8Non-critical delays correctly routed to cost claim, not time claim
Recovery / mitigation events noted5Model identifies any schedule recovery that offsets delay
CB2 Key Discriminators Grim Tide Offshore Wind

Scenario: FIDIC Yellow Book offshore wind foundation installation. 3 delay events with overlapping FM and Contractor-risk periods. Tests FIDIC jurisdiction and concurrent delay analysis under English law.

EOT golden range: 35-75 calendar days (35 cd minimum = weather FM; 75 cd recommended = weather + partial TP FM).

ItemPointsWhy
DEL-OW-001 classified as FM (FIDIC 19.1)10All four criteria met: exceptional (1-in-10-year MRA), beyond Contractor control, unforeseeable. EOT = 35 cd. Cost = None (FIDIC 19.4 time only).
DEL-OW-002 classified as Contractor risk (NOT FM)10Key Discriminator. FIDIC 4.15: equipment failure is Contractor responsibility. Not Force Majeure. Zero EOT, zero cost.
Concurrent period Nov 1-26 (26 cd) correctly treated8Both weather FM and vessel breakdown overlap. Under English law / Adyard: Contractor gets EOT for concurrent period (weather would have delayed regardless) but NO additional cost.
DEL-OW-003 TP supply chain arguable FM7Factory fire may meet FM criteria but FIDIC 4.4 makes Contractor responsible for supply chain. Correct answer: DISPUTED. Recommended 40 cd EOT as negotiated position.

Data Download Reproducibility Bundle

Everything needed to reproduce the findings independently. Each ZIP contains a README.md documenting layout, formats, and usage.

Eval Artefacts

All 6 task inputs, 4 PE system prompts (one per task family), the 6 golden answers used for manual scoring, and the full v1.0 grading rubric.

Contents: contracts/ (T1 input + DB), delays/ (T2 input + DB), schedules/ (T3 input), cb_contracts/ (CB1 input + DB), cb_delays/ (CB2 input), cb_schedules/ (CB3 input), pe_prompts/ (4 PE files), golden_answers/ (7 JSON files), GRADING_RUBRIC.md.

Download eval_artefacts.zip

Raw Model Outputs

All 96 raw responses (16 models × 6 tasks). Each file is the upstream chat-completions JSON plus a _meta block recording the inference parameters used. Reasoning traces preserved where available.

Contents: 16 per-model folders, each with T1_raw.json ... CB3_raw.json.

Download raw_outputs.zip

Scoring Sheets

Manual rubric-based score per model per task with full breakdown by section (A label accuracy, B DB clause ID, C reasoning quality, etc.), plus per-model summary and metadata.

Contents: 16 per-model folders, each with T1_score.md ... CB3_score.md, summary.md, and meta.json (model id, family, params, deploy target, inference parameters).

Download scoring.zip

Scripts

Python helpers used to call OpenRouter and LM Studio chat-completions APIs, with family-specific routing for thinking modes (Qwen chat_template_kwargs, Gemma 4 channel prefill, small Qwen <think> prefill, Nemotron reasoning effort), retry logic for thinking spirals, and a parallel launcher for the OpenRouter batch.

Contents: run_or_eval.py (OpenRouter caller), run_lm_eval.py (LM Studio caller with family routing), launch_or_batch.py (parallel launcher), poll_or_batch.py (status table).

Download scripts.zip

Report Source

The standalone HTML report file you are reading. Self-contained with embedded data and Chart.js from CDN. Open in any browser, no server required.

Contents: scaling_report.html.

Download report_source.zip

Full Score Matrix Grouped by Family

ModelT1T2T3CB1CB2CB3Avg

Per-Domain Stacked View

Contracts vs Delays vs Schedules (averaged per domain, models grouped by family)

Per-Task Scaling Curves By Family

Each chart shows score on one task vs active parameters (log scale). Each family appears as a curve. Tasks grouped by domain.

Contracts

T1 Hamburg Tower (14 articles)

CB1 VAN-MIX Finnish YSE (21 clauses)

Delays

T2 AP vs AB Residential TIA

CB2 Grim Tide Offshore FIDIC

Schedules

T3 Cologne Schedule Generation

CB3 Northbrook Solar CPM

Key Discriminator Heatmap Grouped by Family

Each row is a model, each column is a rubric key discriminator. Green = hit, red = miss, grey = not applicable or run failure.

Model T1 Art 10
Rejected
T2 DEL-003
non-CP
T2 DEL-002
on CP
CB1 C-002
Rejected
CB2 Vessel
=Contractor
CB2 Adyard
Concurrent
CB3 Act 8
Critical
CB3 Act 9
NOT Critical
Hits

Key Discriminators Hit Per Model

KD hits out of 8 (models grouped by family, smallest to largest)

Model Families

Each model in each family gets a per-task score breakdown and a short conclusion paragraph. Models sorted within family from smallest to largest total parameters.