Testing 29 models — 4 scaling families (Qwen, Gemma 4, Ministral 3, Nemotron 3, with cloud high-ends) plus 7 standalone state-of-the-art reference points (DeepSeek V4, GLM-5.1, Kimi K2.6, MiniMax, Xiaomi MiMo) — on contract analysis, delay analysis, and schedule generation/CPM tasks. No fine-tuning. Prompt-engineered prompts only.
The Fine-Tuning sister study showed that fine-tuning construction-domain SLMs can backfire. Qwen3-4B scored 91 percent out of the box and dropped to 78.6 percent after fine-tuning, with catastrophic forgetting in schedule reasoning. This report asks the next question: if fine-tuning is risky, how far does raw parameter count alone take you on the same tasks?
We evaluated 29 models across 4 scaling families plus 7 standalone SOTA reference points on the identical six-task construction benchmark (T1 to T3 plus three curveball variants CB1 to CB3) using prompt-engineered prompts only. No fine-tuning. Manual rubric-based scoring against golden answers. The four families (Qwen, Gemma 4, Ministral 3, Nemotron 3) span from 0.8B local up to cloud high-ends; the seven standalone giants (DeepSeek V4 Pro/Flash, GLM-5.1, Kimi K2.6, MiniMax M2.7, Xiaomi MiMo v2.5 Pro/base) are plotted as isolated reference points, with any model above 300B total parameters drawn as a horizontal asymptote across the scaling charts.
Result: parameter count is a weak predictor of construction-domain performance. Architecture, training mix and reasoning support dominate. The best 27B dense model (Qwen 3.6 27B) reaches 91.3 of 100, outperforming a 35B mixture-of-experts variant from the same family (81.8) and a 31B dense Gemma 4 (83.0). Within Ministral 3, the 8B sibling beats the 14B by 3.8 points, so scaling is non-monotonic.
| Model | Active Params | Contracts | Delays | Schedules | Avg / 100 |
|---|---|---|---|---|---|
Adding seven state-of-the-art giants, plus the cloud high-ends of each family, does not break past the ceiling set by mid-size models. The seven are DeepSeek V4 Pro (1.6T), Xiaomi MiMo v2.5 Pro (1.02T), Kimi K2.6 (1T), GLM-5.1 (744B), Xiaomi MiMo (310B), DeepSeek V4 Flash (284B), and MiniMax M2.7 (230B). The standalone SOTA cluster lands between 74 and 90 average, no higher than the level already reached by Qwen 3.6 27B (91.3) and the cloud Gemini 3.5 Flash (91.1). The single clearest illustration: DeepSeek V4 Pro at 1.6 trillion parameters scores 85.7, below its own 284B Flash sibling at 89.2. Kimi K2.6 (1T) scores 85.0; Xiaomi MiMo v2.5 Pro (1.02T) scores 83.5, below its own 310B base variant (85.4). The free GLM-5.1 (744B) reaches 90.1. The scaling charts make this visually explicit: the >300B asymptote lines converge into the same narrow 85-91 band that the best sub-30B models already reach. For this construction benchmark, returns to scale flatten hard above roughly 25-40B active parameters, and total-parameter count beyond ~300B adds essentially nothing.
The largest model in the study (Qwen 3.6 35B-A3B, a 35 billion parameter mixture-of-experts variant with 3 billion active) averages 81.8 out of 100. The 27 billion dense sibling from the same family reaches 91.3. The 14 billion Ministral 3 dense model averages 72.5, while its 8 billion sibling scores 76.3. Within each family the curve is non-monotonic. Total parameter count alone explains less than half of the cross-model variance. Architecture, training mix, and reasoning support dominate raw parameter scaling.
Qwen 3.6 27B dense (91.3) outperforms the 35B-A3B mixture-of-experts variant (81.8) despite the latter having 30 percent more total parameters. Gemma 4 follows the same pattern: the 31B dense (83.0) and 26B-A4B (84.9) sit close to each other, but both trail Qwen 3.6 27B. For reasoning-heavy construction tasks the binding constraint appears to be active parameters, not total parameters. Mixture-of-experts variants pay a measurable accuracy cost on multi-step delay attribution and CPM reasoning.
Task T3 asks the model to generate an 18-activity baseline schedule for a Cologne residential building from historical benchmark data. The golden answer arrives at 370 working days through a critical-path topology that uses start-to-start and finish-to-finish parallelism. Out of 16 models tested, only Gemma 4 E2B undershoots the golden duration (285 wd), and only Gemma 4 E4B overshoots by less than 35 percent (494 wd). The remaining models overshoot by factors ranging from 1.5 to 2.5, with most defaulting to long finish-to-start chains. This appears to be a structural weakness shared across all current base small language models: the models can list standard residential activities and can choose plausible durations from historical ranges, but they do not yet construct the parallel sub-schedules that real planners use.
Where T3 asks for schedule generation, CB3 provides a fixed 18-activity network and asks for the deterministic CPM forward and backward pass. The golden duration is 265 wd. Out of 16 models, six hit 265 exactly: Qwen 3.6 27B, Qwen 3.6 35B-A3B, Qwen 3.5 9B, Qwen 3.5 4B, Gemma 4 26B-A4B, and Gemma 4 31B. These same models also correctly handle the SS+10 lag that makes Activity 8 critical with zero float, the rubric's key discriminator for this task. The implication is that CPM computation on a given network is a tractable mechanical task for mid-size models, while predecessor selection in schedule synthesis (T3) remains qualitatively harder.
Qwen 3.5 Flash, a hybrid linear plus sparse mixture-of-experts model with roughly 9 billion active parameters, averages 80.4. The next-generation Qwen 3.6 27B dense averages 91.3, an uplift of 10.9 points at similar effective compute. The mixture-of-experts variant from Qwen 3.6 (35B-A3B) gains only 1.4 points over the previous-generation flash. This suggests that the dense path in Qwen 3.6 captures most of the generational improvement, while the MoE path is closer to its predecessor in practical capability on this benchmark.
The Qwen 3.5 family spans 0.8B, 2B, 4B, 9B, and the cloud Flash variant. At 4B the model reasons cleanly through all six tasks and closes its think tags reliably. At 2B, reasoning still works but spirals on three of six tasks and recovers on the second attempt. At 0.8B, the thinking trigger spirals on all three attempts on the first task. After falling back to non-thinking mode the 0.8B model averages 36.2 out of 100, well below the 4B's 76.4. The practical reasoning floor for these construction tasks sits somewhere between 2B and 4B for Qwen-style models, and is lower for Matformer variants. Gemma 4 E2B at roughly 2 billion effective parameters completes all six tasks with thinking enabled and averages 74.75.
CB1 (Finnish YSE 1998 contract, 21 clauses) is the hardest curveball. Only three models hit the rejection key discriminator on clause C-002 (lack of enforcement mechanism): Qwen 3.6 27B, Qwen 3.6 35B-A3B, and Nemotron 3 Nano 30B-A3B. Several mid-size models default to "Accepted" or "Requires Review" on clauses requiring modification under Finnish standards, suggesting that contract reasoning is heavily anchored in training data familiarity. CB2 (FIDIC offshore wind concurrent delay) is the easiest curveball: all sixteen models correctly classify the vessel breakdown as contractor-risk. The English-law Adyard concurrency principle is recognised by about half the field.
Qwen 3.6 27B dense is the construction-domain champion at 91.3 of 100, hitting 9 of 10 key rubric discriminators across the six tasks. It achieves near-perfect contracts performance (96.5), strong delays handling (86.5), and reaches the ceiling on CPM analysis (98). Its only consistent weakness is the universal T3 schedule-generation overshoot. For local deployment on a 16GB consumer GPU, the strongest small model is Qwen 3.5 9B at 77.9 average; on Matformer architecture, Gemma 4 E4B at roughly 4 billion effective parameters reaches 77.6 with the fastest wall time of the local cohort. The Fine-Tuning study's headline that Qwen3-4B base hits 91 on construction tasks remains broadly consistent with the Qwen line's strong base performance demonstrated here.
This study reuses the evaluation infrastructure of the Fine-Tuning report. Same six tasks, same grading rubric, same golden answers. Only the prompt path differs: the system prompt for every call is the prompt-engineered prompt for that task, never the vanilla training prompt. No model is fine-tuned for this study.
| Parameter | Thinking model | Non-thinking model |
|---|---|---|
| temperature | 0.6 | 0.15 |
| top_p | 0.9 | 0.9 |
| min_p | 0.06 | 0.06 |
| max_tokens | 32768 | 32768 |
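As a minimal sketch, the two columns above can be kept as presets keyed by thinking mode. The names `SAMPLING_PRESETS` and `sampling_params` are illustrative assumptions rather than part of the study's scripts, and whether a given backend accepts `min_p` depends on the server.

```python
from typing import Dict, Union

Number = Union[int, float]

# Values copied from the inference-parameter table above.
SAMPLING_PRESETS: Dict[str, Dict[str, Number]] = {
    "thinking": {"temperature": 0.6, "top_p": 0.9, "min_p": 0.06, "max_tokens": 32768},
    "non_thinking": {"temperature": 0.15, "top_p": 0.9, "min_p": 0.06, "max_tokens": 32768},
}


def sampling_params(thinking: bool) -> Dict[str, Number]:
    """Return a fresh copy of the preset so callers can mutate it safely."""
    key = "thinking" if thinking else "non_thinking"
    return dict(SAMPLING_PRESETS[key])
```

Only `temperature` differs between the two modes; the other three values are shared.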
| Task | Domain | System Prompt | User Content |
|---|---|---|---|
| T1 | Contracts | contracts_pe_system.txt | Hamburg Tower 14-article contract + 25-clause database |
| T2 | Delays | delays_pe_system.txt | AP vs AB Residential baseline + as-built CSV (20 activities) |
| T3 | Schedules (generation) | schedules_pe_t3.txt | Cologne residential brief + 14 historical projects benchmark |
| CB1 | Contracts (curveball) | contracts_pe_system.txt | VAN-MIX Finnish YSE 1998 21-clause contract + same 25-clause DB |
| CB2 | Delays (curveball) | delays_pe_system.txt | Grim Tide offshore wind FIDIC Yellow Book, 3 delay events |
| CB3 | Schedules (CPM) | schedules_pe_cb3.txt | Northbrook Solar 50MW EPC NEC3, 18 activities |
Applies to: All models evaluated across Contracts, Schedules, and Delays domains.
Scope: Model-agnostic. Works for base models, fine-tunes, and third-party comparators.
This rubric is methodology only. Individual model scores live in the Results tab. Raw model outputs live in the per-model results folders.
Every base model evaluation requires 6 tests, not 3:
| Test | Domain | Scenario |
|---|---|---|
| T1 | Contracts | Hamburg Tower 14-article NEC3 review |
| T2 | Delays | AB v AP Residential TIA + EOT |
| T3 | Schedules | Cologne Residential 18-activity Schedule Generation |
| CB1 | Contracts | VAN-MIX Finnish YSE 1998 21-clause review |
| CB2 | Delays | Grim Tide Offshore Wind FIDIC FM + concurrent delay |
| CB3 | Schedules | Northbrook Solar 50MW EPC 18-activity CPM |
T1 to T3 test competency on the training domain. CB1 to CB3 test generalisation to novel jurisdictions, project types, and contract forms unseen in training. A base model score requires all 6 tests complete. Do not report a domain score from T-only results.
Exact match scoring fails for generative LLMs. A model may produce the wrong label with correct reasoning (partial credit warranted), or produce a correct label via nonsense reasoning (partial credit withheld). This rubric rewards domain understanding, not string matching. Three principles:
CB tests apply the same rubric as T tests. Scenario changes; scoring criteria do not. Key discriminators differ because the curveball presents different traps and non-obvious cases.
| Test | Scenario | Jurisdiction / Form | Key Trap |
|---|---|---|---|
| CB1 | VAN-MIX Finnish YSE 1998 21 clauses | Finland, YSE 1998 | C-002: no enforcement mechanism = Rejected. Insurance undervalue. Missing bonds. |
| CB2 | Grim Tide Offshore Wind 3 delay events | England, FIDIC Yellow Book | Concurrent delay (Nov 1-26). FM vs Contractor-risk classification. |
| CB3 | Northbrook Solar 50MW EPC 18 activities | NEC3 EPC, UK | Activity 8 critical via SS+10. Activity 12 high float but external DNO constraint. |
Combined base model score = T + CB weighted average. Do not report domain scores from T-only or CB-only results.
Each article is scored independently. Points per article set by weight (see table below).
Partial credit ladder (applies per article):
| Predicted → / Golden ↓ | Accepted | Modification | Requires Review | Rejected |
|---|---|---|---|---|
| Accepted | 100% | 0% | 25% | 0% |
| Modification | 25% | 100% | 50% | 50% |
| Requires Review | 25% | 50% | 100% | 25% |
| Rejected | 0% | 50% | 25% | 100% |
Rationale: Modification-to-Rejected confusion is penalised less than Accepted-to-Rejected (the model at least flagged a problem). Accepted-to-Modification or better is a meaningful directional error.
Article weights:
| Article | Weight (pts) | Reason |
|---|---|---|
| Art 10 | 8 | Key Discriminator. Only Rejected article. Rule 3 must be explicitly invoked. |
| Art 7 | 6 | Two-clause split (partial accept/modify). Non-trivial. |
| Art 6 | 4 | Requires cross-referencing DB annotation category. |
| Art 3 | 4 | Multi-condition clause. |
| Art 9 | 4 | Requires Review rare label; model must distinguish from Modification. |
| All other articles (9 articles) | 2 each = 18 | Standard single-condition clauses. |
| Total | 50 | |
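The partial credit ladder and the article weights combine by simple multiplication. The sketch below shows one way to encode that lookup; the dictionary layout and the name `label_points` are assumptions made for illustration, not the study's own scoring code.

```python
from typing import Dict

# PARTIAL_CREDIT[golden][predicted] -> fraction of the article weight awarded.
# Rows and columns follow the partial credit ladder above.
PARTIAL_CREDIT: Dict[str, Dict[str, float]] = {
    "Accepted": {"Accepted": 1.0, "Modification": 0.0,
                 "Requires Review": 0.25, "Rejected": 0.0},
    "Modification": {"Accepted": 0.25, "Modification": 1.0,
                     "Requires Review": 0.5, "Rejected": 0.5},
    "Requires Review": {"Accepted": 0.25, "Modification": 0.5,
                        "Requires Review": 1.0, "Rejected": 0.25},
    "Rejected": {"Accepted": 0.0, "Modification": 0.5,
                 "Requires Review": 0.25, "Rejected": 1.0},
}


def label_points(golden: str, predicted: str, weight: float) -> float:
    """Points earned for one article: weight times the ladder fraction."""
    return weight * PARTIAL_CREDIT[golden][predicted]
```

For example, Art 10 (weight 8, golden label Rejected) predicted as Modification earns `label_points("Rejected", "Modification", 8)`, which is 4.0 points.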
Score per article (20 pts total, ~1.4 pts each):
Rationale: DB IDs are the RAG retrieval signal. Wrong IDs indicate failure to ground in provided database even when the label is correct.
Score only for 5 key articles: Art 10, Art 7, Art 6, Art 3, Art 9 (6 pts each).
| Score | Criteria |
|---|---|
| 6/6 | Correct rule cited, correct clause element identified, correct DB entry referenced |
| 4/6 | 2 of 3 above correct |
| 2/6 | 1 of 3 above correct, or correct reasoning but wrong label |
| 0/6 | No substantive reasoning or circular ("rejected because it should be rejected") |
Art 10 special rule: If model outputs "Modification" but reasoning explicitly invokes Rule 3 ("completely unacceptable clause"), score C at 4/6 and A at 50% (reasoning correct, label wrong).
Scenario: 21-clause Finnish mixed-use development contract (EUR 45M, Vantaa). Non-English jurisdiction. Tests RAG against same 25-clause DB used in T1.
| Item | Score impact | Why |
|---|---|---|
| C-002 Rejected | Full A weight for C-002 | No enforcement mechanism, which is worse than DB3 (mediation/arbitration missing entirely). Rule 3 applies. DB3 must be matched. |
| C-017 CAR insurance flagged | A + B weight | EUR 22M cover = 49% of EUR 45M contract value. Below DB8 standard. Model must detect value discrepancy vs DB entry. |
| Missing Performance Bond flagged | Reasoning (C) | DB11 category absent entirely from contract. Full-marks model lists in missing_db_categories. |
| C-012 LD cap at 3% vs DB18 10% | A + B weight | Below DB18 standard. Same DB comparison logic as T1. |
| C-015 DLP 12 months vs YSE 24 months | A + B weight | Finnish YSE 1998 requires 24-month takuuaika. DB9 match required. |
CB1 Golden: 21 clauses: 1 Accepted (C-001), 10 Modification, 9 Requires Review, 1 Rejected (C-002). 7 missing DB categories. Apply Domain 1 rubric exactly. C-002 plays the Art 10 role (the sole Rejected clause, key discriminator).
T3 and CB3 use different rubrics. T3 tests schedule generation; CB3 tests CPM analysis. Score each with its own rubric below.
Tests whether the model creates a complete baseline schedule from historical project data. Cologne Residential Building, 2022, Sand soil, EUR 35M, 1500 sqm, 4 floors. Historical DB of 3 projects + benchmark table provided in test message.
A. Activity Completeness 25 pts
| Score | Criteria |
|---|---|
| 25/25 | All 18 standard activities present, correctly named |
| 18/25 | 14-17 activities present with recognisable standard names |
| 10/25 | 10-13 activities present |
| 0/25 | Fewer than 10 activities; or non-standard names |
B. Duration Validity 30 pts
Accept if within historical benchmark range. Exact match to golden answer not required.
| Score | Criteria |
|---|---|
| 30/30 | All 18 durations within benchmark ranges AND justifications reference project parameters (size, soil, floors) |
| 22/30 | 15-17 durations within range; or all within range but no justifications |
| 14/30 | 12-14 durations within range |
| 0/30 | Fewer than 12 within range; or durations from wrong project type |
C. Predecessor Logic 25 pts
| Score | Criteria |
|---|---|
| 25/25 | Valid construction sequencing throughout; no circular dependencies; mix of FS/SS/FF relationships present |
| 18/25 | Mostly logical; missing some constraints (all FS only); minor sequencing issues |
| 10/25 | Several illogical sequences but no circular dependencies |
| 0/25 | Circular dependency present; or completely illogical (elevator before concrete) |
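The circular-dependency condition in the bottom band can be checked mechanically before any manual review. The sketch below uses Kahn's topological sort; the input shape (activity id mapped to a list of predecessor ids) and the function name are assumptions for illustration, since the study scored this section manually.

```python
from collections import deque
from typing import Dict, List


def has_circular_dependency(predecessors: Dict[int, List[int]]) -> bool:
    """Return True if the predecessor graph contains a cycle (Kahn's algorithm)."""
    nodes = set(predecessors)
    for preds in predecessors.values():
        nodes.update(preds)

    indegree = {n: 0 for n in nodes}
    successors: Dict[int, List[int]] = {n: [] for n in nodes}
    for activity, preds in predecessors.items():
        for pred in preds:
            successors[pred].append(activity)
            indegree[activity] += 1

    # Repeatedly remove activities whose predecessors are all resolved.
    queue = deque(n for n in nodes if indegree[n] == 0)
    visited = 0
    while queue:
        node = queue.popleft()
        visited += 1
        for succ in successors[node]:
            indegree[succ] -= 1
            if indegree[succ] == 0:
                queue.append(succ)

    # Any activity never reached sits on a cycle.
    return visited != len(nodes)
```

A schedule that fails this check falls into the 0/25 band regardless of how plausible the individual links look.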
D. Output Format 20 pts
| Score | Criteria |
|---|---|
| 20/20 | Valid JSON; all required fields present; project summary included |
| 14/20 | Valid JSON but missing some fields; or minor formatting issues |
| 7/20 | Parseable JSON but wrapped in markdown code fence; or missing major fields |
| 0/20 | Not valid JSON; or plain text commentary instead of JSON |
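The first cut between these bands is whether the response parses as JSON directly, only after stripping a markdown code fence, or not at all. The following sketch shows that triage; the function names are hypothetical, and the returned value is only a ceiling, because the field-completeness checks that separate 20/20 from 14/20 still need the task schema.

```python
import json
import re
from typing import Any, Optional, Tuple

FENCE = "`" * 3  # markdown code fence marker, built indirectly for readability
FENCE_RE = re.compile(
    r"^\s*" + FENCE + r"(?:json)?\s*\n(.*?)\n?" + FENCE + r"\s*$", re.DOTALL
)


def parse_model_json(text: str) -> Tuple[Optional[Any], bool]:
    """Return (parsed, was_fenced); parsed is None when no valid JSON is found."""
    try:
        return json.loads(text), False
    except json.JSONDecodeError:
        pass
    match = FENCE_RE.match(text)
    if match:
        try:
            return json.loads(match.group(1)), True
        except json.JSONDecodeError:
            pass
    return None, False


def format_score_ceiling(text: str) -> int:
    """Upper bound on the section D score implied by parseability alone."""
    parsed, fenced = parse_model_json(text)
    if parsed is None:
        return 0
    return 7 if fenced else 20
```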
T3 Golden: Cologne project, 370 wd, critical path 1->2->3->5->7->9->11->13->12->14. Activity 7 (Exterior Plastering) critical via 5FF from Activity 5. Activity 6 (Exterior Walls, 72 wd) NOT critical (TF=158).
Tests CPM arithmetic on a given schedule (Northbrook Solar Farm 50MW EPC). All 18 activity durations and predecessors provided in the test message.
A. Project Duration 25 pts
Band scoring against golden answer (265 wd):
| Error band | Score |
|---|---|
| within ±20 wd | 25/25 |
| within ±50 wd | 17/25 |
| within ±100 wd | 10/25 |
| within ±150 wd | 5/25 |
| beyond ±150 wd | 0/25 |
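The band table translates directly into a threshold lookup on the absolute error. A minimal sketch, with the function name chosen here for illustration:

```python
def duration_band_score(predicted_wd: int, golden_wd: int = 265) -> int:
    """Section A score for CB3: band lookup on absolute error in working days."""
    error = abs(predicted_wd - golden_wd)
    for limit, points in ((20, 25), (50, 17), (100, 10), (150, 5)):
        if error <= limit:
            return points
    return 0
```

A model reporting 300 wd is 35 wd off the golden 265 wd and lands in the second band, scoring 17/25.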
B. Critical Path Identification 30 pts
Key Discriminators (binary, 6 pts each):
| Item | Points | Why |
|---|---|---|
| Activity 8 IS on Critical Path | 6 | Key Discriminator. SS+10 gives ES8 = 175 and EF8 = 220, equal to EF7 = 220, so Activity 8 has zero float. Models treating 8 as non-critical miss this entirely. |
| Activity 9 NOT on Critical Path | 6 | TF=136; models naively tracing longest path include it incorrectly |
Remaining CP activities (2 pts each, up to 18 pts): Award 2 pts per correctly identified critical activity. Wrong inclusion penalty: -1 pt per activity incorrectly placed on CP. Minimum 0 pts for section B.
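One possible reading of the section B rules is sketched below, using the golden critical path stated for CB3 (1->2->3->5->6->7->8->14->15->17->18). Two interpretation choices are assumptions, not rubric text: Activity 9 is penalised only through its key discriminator rather than also through the wrong-inclusion penalty, and the zero floor is applied to the section total.

```python
from typing import Iterable

GOLDEN_CP = {1, 2, 3, 5, 6, 7, 8, 14, 15, 17, 18}


def critical_path_score(predicted_cp: Iterable[int]) -> int:
    """Section B score (max 30) under the assumptions stated in the lead-in."""
    predicted = set(predicted_cp)
    score = 0

    # Key discriminators, 6 pts each.
    if 8 in predicted:
        score += 6
    if 9 not in predicted:
        score += 6

    # Remaining critical activities: 2 pts each, capped at 18.
    remaining = GOLDEN_CP - {8}
    score += min(18, 2 * len(predicted & remaining))

    # Wrong inclusions: -1 each (Activity 9 is already handled above).
    wrong = predicted - GOLDEN_CP - {9}
    score -= len(wrong)

    return max(0, score)
```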
C. CPM Structural Validity 25 pts
| Sub-criterion | Points | Pass condition |
|---|---|---|
| All activities present | 5 | Count matches input (no missing/hallucinated activities) |
| Input durations correct | 8 | Model uses stated durations, not hallucinated values |
| No negative Total Float | 5 | All TF >= 0 (negative TF = invalid CPM) |
| Complete backward pass | 4 | LS/LF populated for all activities |
| ES/EF internal consistency | 3 | ES + Duration = EF for all activities |
If input durations are wrong, cap section C at 12/25 regardless of computational correctness.
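The five sub-criteria and the duration cap can be combined as follows. This is a sketch with hypothetical parameter names; each flag is assumed to have been decided already by the grader.

```python
def structural_validity_score(
    all_present: bool,
    durations_correct: bool,
    no_negative_float: bool,
    backward_complete: bool,
    es_ef_consistent: bool,
) -> int:
    """Section C score (max 25), capped at 12 when input durations are wrong."""
    score = (
        5 * all_present
        + 8 * durations_correct
        + 5 * no_negative_float
        + 4 * backward_complete
        + 3 * es_ef_consistent
    )
    if not durations_correct:
        score = min(score, 12)
    return score
```

With wrong durations the remaining four checks could sum to 17, so the cap is what actually binds in that case.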
D. Relationship Type Handling 20 pts
| Sub-criterion | Points | Evidence required |
|---|---|---|
| Finish-to-Finish (FF) chain recognised | 8 | Model output or reasoning shows awareness of FF lag between relevant activities |
| Start-to-Start (SS) chain recognised | 8 | Model output or reasoning shows awareness of SS dependency |
| Multi-predecessor merge logic | 4 | Late merge correctly identified (largest predecessor governs) |
Scenario: 18-activity utility-scale solar PV farm. SS+10 and SS+15 relationships. Non-building topology; tests generalisation from residential CPM training data.
Golden: Project duration = 265 wd (inside 280 wd target). Critical path: 1->2->3->5->6->7->8->14->15->17->18.
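To illustrate how an SS lag can pull an activity onto the critical path, the sketch below runs a forward and backward pass with FS, SS and FF links. The five-activity network in the usage note is a hypothetical toy example, not the Northbrook network, and the code assumes activities are listed in topological order and run without interruption.

```python
from typing import Dict, List, Tuple

Link = Tuple[str, str, int]  # (predecessor id, "FS" | "SS" | "FF", lag in wd)


def cpm(durations: Dict[str, int], links: Dict[str, List[Link]]) -> Dict[str, Dict[str, int]]:
    """Forward/backward CPM pass; `durations` must be in topological order."""
    order = list(durations)
    es: Dict[str, int] = {}
    ef: Dict[str, int] = {}
    for act in order:
        start = 0
        for pred, kind, lag in links.get(act, []):
            if kind == "FS":
                start = max(start, ef[pred] + lag)
            elif kind == "SS":
                start = max(start, es[pred] + lag)
            elif kind == "FF":
                start = max(start, ef[pred] + lag - durations[act])
            else:
                raise ValueError(f"unknown link type: {kind}")
        es[act] = start
        ef[act] = start + durations[act]

    project = max(ef.values())
    lf = {act: project for act in order}
    for act in reversed(order):
        ls_act = lf[act] - durations[act]
        for pred, kind, lag in links.get(act, []):
            if kind == "FS":
                bound = ls_act - lag
            elif kind == "SS":
                bound = ls_act - lag + durations[pred]
            else:  # FF (other kinds were rejected in the forward pass)
                bound = lf[act] - lag
            lf[pred] = min(lf[pred], bound)

    return {
        act: {"ES": es[act], "EF": ef[act], "LF": lf[act],
              "TF": lf[act] - durations[act] - es[act]}
        for act in order
    }


# Hypothetical toy network: C starts 5 wd after B starts (SS+5).
TOY_DURATIONS = {"A": 10, "B": 20, "C": 15, "E": 3, "D": 5}
TOY_LINKS = {
    "B": [("A", "FS", 0)],
    "C": [("B", "SS", 5)],
    "E": [("A", "FS", 0)],
    "D": [("B", "FS", 0), ("C", "FS", 0), ("E", "FS", 0)],
}
TOY_RESULT = cpm(TOY_DURATIONS, TOY_LINKS)
```

In the toy network, C finishes on the same day as B and therefore has zero float, the same mechanism that makes Activity 8 critical in CB3, while E carries 17 wd of float.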
Four delay events (DEL-001 through DEL-004), up to 9 pts each (partial for correctly identified events; cap at 35 total). Per event:
| Sub-criterion | Points |
|---|---|
| Activity correctly identified | 2 |
| Duration (within ±10 wd of golden) | 3 |
| Responsibility correctly assigned (Employer/Contractor/Neutral/Concurrent) | 4 |
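A sketch of the per-event scoring and the section cap follows. The field names (`activity`, `duration_wd`, `responsibility`) are assumptions about how a grader might normalise model output, not the benchmark's actual JSON schema.

```python
from typing import Dict, Mapping

Event = Mapping[str, object]


def delay_event_score(predicted: Event, golden: Event) -> int:
    """Score one delay event: 2 (activity) + 3 (duration) + 4 (responsibility)."""
    points = 0
    if predicted["activity"] == golden["activity"]:
        points += 2
    if abs(int(predicted["duration_wd"]) - int(golden["duration_wd"])) <= 10:
        points += 3
    if predicted["responsibility"] == golden["responsibility"]:
        points += 4
    return points


def delay_section_score(predicted: Dict[str, Event], golden: Dict[str, Event]) -> int:
    """Sum per-event scores over events the model identified, capped at 35."""
    total = sum(
        delay_event_score(predicted[event_id], gold)
        for event_id, gold in golden.items()
        if event_id in predicted
    )
    return min(35, total)
```

Four perfect events would sum to 36, which is why the cap at 35 is needed.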
This is the highest-weight section because it is the core professional judgment the domain tests.
| Sub-criterion | Points | Notes |
|---|---|---|
| DEL-002 identified as on Critical Path | 10 | Contractor delay on CP; no employer EOT for this event |
| DEL-003 identified as NOT on Critical Path | 10 | Key Discriminator. Employer delay but non-critical = cost-only claim, zero EOT entitlement. Models that sum all employer delays fail here. |
| EOT = 0 recommendation | 12 | Correct conclusion requires correct CP analysis of both above items |
| Float consumption reasoning | 8 | Model shows awareness that float absorbs non-critical delays rather than granting EOT |
DEL-003 special rule: If model identifies DEL-003 as employer-caused AND correctly states it does not extend the critical path (even if EOT calculation is wrong), award 8/10 for that sub-criterion.
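The routing logic this section rewards can be written as a small decision function. The sketch below covers only the Employer and Contractor cases spelled out above; the field names and the delay durations in the usage example are hypothetical placeholders, not values from the T2 golden answer.

```python
from typing import Iterable, Mapping


def entitlement(responsibility: str, on_critical_path: bool) -> str:
    """Route a delay to 'time' (EOT), 'cost_only', or 'none'."""
    if responsibility == "Contractor":
        return "none"  # contractor delay earns no employer EOT, even on the CP
    if responsibility == "Employer":
        # Float absorbs non-critical employer delay: cost claim only.
        return "time" if on_critical_path else "cost_only"
    raise ValueError(f"category not covered by this sketch: {responsibility}")


def eot_days(events: Iterable[Mapping[str, object]]) -> int:
    """Sum only the delays that qualify for a time claim."""
    return sum(
        int(e["delay_wd"])
        for e in events
        if entitlement(str(e["responsibility"]), bool(e["on_critical_path"])) == "time"
    )


# Durations are hypothetical; responsibility and CP status follow the rubric.
EXAMPLE_EVENTS = [
    {"id": "DEL-002", "responsibility": "Contractor", "on_critical_path": True, "delay_wd": 12},
    {"id": "DEL-003", "responsibility": "Employer", "on_critical_path": False, "delay_wd": 20},
]
```

With these two events `eot_days(EXAMPLE_EVENTS)` is 0, whereas a model that sums every employer delay would report 20.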
| Sub-criterion | Points | Pass condition |
|---|---|---|
| Valid JSON (parseable) | 5 | No comments, trailing commas, or syntax errors |
| Delay cascade / concurrency analysis | 7 | Model checks for overlapping delays, does not double-count |
| Cost vs. EOT distinction | 8 | Non-critical delays correctly routed to cost claim, not time claim |
| Recovery / mitigation events noted | 5 | Model identifies any schedule recovery that offsets delay |
Scenario: FIDIC Yellow Book offshore wind foundation installation. 3 delay events with overlapping FM and Contractor-risk periods. Tests FIDIC jurisdiction and concurrent delay analysis under English law.
EOT golden range: 35-75 calendar days (35 cd minimum = weather FM; 75 cd recommended = weather + partial TP FM).
| Item | Points | Why |
|---|---|---|
| DEL-OW-001 classified as FM (FIDIC 19.1) | 10 | All four criteria met: exceptional (1-in-10-year MRA), beyond Contractor control, unforeseeable. EOT = 35 cd. Cost = None (FIDIC 19.4 time only). |
| DEL-OW-002 classified as Contractor risk (NOT FM) | 10 | Key Discriminator. FIDIC 4.15: equipment failure is Contractor responsibility. Not Force Majeure. Zero EOT, zero cost. |
| Concurrent period Nov 1-26 (26 cd) correctly treated | 8 | Both weather FM and vessel breakdown overlap. Under English law / Adyard: Contractor gets EOT for concurrent period (weather would have delayed regardless) but NO additional cost. |
| DEL-OW-003 TP supply chain arguable FM | 7 | Factory fire may meet FM criteria but FIDIC 4.4 makes Contractor responsible for supply chain. Correct answer: DISPUTED. Recommended 40 cd EOT as negotiated position. |
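The day counts in this section are inclusive calendar-day counts, which is easy to get wrong by one. The sketch below reproduces the 26 cd concurrent window and the 35 to 75 cd golden EOT range; the year is a hypothetical placeholder because the source gives only the month and days.

```python
from datetime import date


def inclusive_days(start: date, end: date) -> int:
    """Calendar days from start to end, counting both endpoints."""
    return (end - start).days + 1


YEAR = 2024  # hypothetical; the scenario year is not stated in the rubric
CONCURRENT_CD = inclusive_days(date(YEAR, 11, 1), date(YEAR, 11, 26))

EOT_MIN_CD = 35               # weather FM only
EOT_RECOMMENDED_CD = 35 + 40  # weather FM plus negotiated TP supply-chain position


def eot_in_golden_range(eot_cd: int) -> bool:
    """True if a recommended EOT falls inside the golden range."""
    return EOT_MIN_CD <= eot_cd <= EOT_RECOMMENDED_CD
```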
Everything needed to reproduce the findings independently. Each ZIP contains a README.md documenting layout, formats, and usage.
All 6 task inputs, 4 PE system prompts (one per task family), the 6 golden answers used for manual scoring, and the full v1.0 grading rubric.
Contents: contracts/ (T1 input + DB), delays/ (T2 input + DB), schedules/ (T3 input), cb_contracts/ (CB1 input + DB), cb_delays/ (CB2 input), cb_schedules/ (CB3 input), pe_prompts/ (4 PE files), golden_answers/ (7 JSON files), GRADING_RUBRIC.md.
All 96 raw responses (16 models × 6 tasks). Each file is the upstream chat-completions JSON plus a _meta block recording the inference parameters used. Reasoning traces preserved where available.
Contents: 16 per-model folders, each with T1_raw.json ... CB3_raw.json.
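Since a base model score requires all six tests, a quick completeness check over the per-model folders is useful before scoring. This is a minimal sketch assuming the folder layout described above; the function name and the `results_root` argument are illustrative.

```python
from pathlib import Path
from typing import Dict, List, Union

TASKS = ("T1", "T2", "T3", "CB1", "CB2", "CB3")


def missing_raw_files(results_root: Union[str, Path]) -> Dict[str, List[str]]:
    """Map each model folder to the tasks whose <task>_raw.json is absent."""
    missing: Dict[str, List[str]] = {}
    for model_dir in sorted(p for p in Path(results_root).iterdir() if p.is_dir()):
        absent = [t for t in TASKS if not (model_dir / f"{t}_raw.json").is_file()]
        if absent:
            missing[model_dir.name] = absent
    return missing
```

An empty result means every model folder holds all six raw responses.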
Manual rubric-based score per model per task with full breakdown by section (A label accuracy, B DB clause ID, C reasoning quality, etc.), plus per-model summary and metadata.
Contents: 16 per-model folders, each with T1_score.md ... CB3_score.md, summary.md, and meta.json (model id, family, params, deploy target, inference parameters).
Python helpers used to call OpenRouter and LM Studio chat-completions APIs, with family-specific routing for thinking modes (Qwen chat_template_kwargs, Gemma 4 channel prefill, small Qwen <think> prefill, Nemotron reasoning effort), retry logic for thinking spirals, and a parallel launcher for the OpenRouter batch.
Contents: run_or_eval.py (OpenRouter caller), run_lm_eval.py (LM Studio caller with family routing), launch_or_batch.py (parallel launcher), poll_or_batch.py (status table).
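The retry behaviour for thinking spirals can be sketched independently of any particular API. In the example below `call_model` is a hypothetical callable standing in for the real OpenRouter or LM Studio call, and a spiral is detected by an opening `<think>` tag with no closing tag, which assumes the opening tag appears in the response text rather than in a prefill.

```python
from typing import Callable, Tuple


def is_spiral(text: str) -> bool:
    """Heuristic: reasoning was opened but never closed."""
    return "<think>" in text and "</think>" not in text


def run_with_fallback(
    call_model: Callable[[bool], str], max_attempts: int = 3
) -> Tuple[str, bool]:
    """Try thinking mode up to max_attempts times, then fall back.

    Returns (response_text, used_thinking).
    """
    for _ in range(max_attempts):
        response = call_model(True)
        if not is_spiral(response):
            return response, True
    # Every thinking attempt spiralled: one final non-thinking call.
    return call_model(False), False
```

The default of three attempts mirrors the three tries described for the 0.8B model before it was scored in non-thinking mode.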
The standalone HTML report file you are reading. Self-contained with embedded data and Chart.js from CDN. Open in any browser, no server required.
Contents: scaling_report.html.
| Model | T1 | T2 | T3 | CB1 | CB2 | CB3 | Avg |
|---|---|---|---|---|---|---|---|
Each chart shows score on one task vs active parameters (log scale). Each family appears as a curve. Tasks grouped by domain.
Each row is a model, each column is a rubric key discriminator. Green = hit, red = miss, grey = not applicable or run failure.
| Model | T1 Art 10 Rejected | T2 DEL-003 non-CP | T2 DEL-002 on CP | CB1 C-002 Rejected | CB2 Vessel = Contractor | CB2 Adyard Concurrent | CB3 Act 8 Critical | CB3 Act 9 NOT Critical | Hits |
|---|---|---|---|---|---|---|---|---|---|
Each model in each family gets a per-task score breakdown and a short conclusion paragraph. Models sorted within family from smallest to largest total parameters.