Competition Learnings Applied

This product is built on concrete learnings from NM i AI 2026 — Norway's National AI Championship. We built a working Tripletex agent (ranked 33rd of 369) and studied 22 competing teams' implementations. Every architectural decision traces back to evidence.

Architecture Decisions From Competition

1. LLM Agent with Function-Calling > Hardcoded Handlers

Evidence: 8 of 12 teams with working code used LLM agent loops. KreativKI (664 lines, Gemini function-calling) outperformed our 3800-line handler approach.

Applied: The Accounting API uses function-calling agent loop. The LLM decides which Tripletex/Fiken API calls to make, sees errors, and recovers. No hardcoded handler per task type.

Impact: ~800 lines instead of ~3800. Automatically handles new task types without code changes.
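The agent loop above can be sketched as follows. This is a minimal illustration, not the product's implementation: the real Accounting API wires `llm_step` to an actual LLM function-calling API and `tools` to the Tripletex/Fiken client; here both are placeholders.

```python
# Minimal function-calling agent loop: the model picks a tool, sees the
# result (including errors), and iterates until done or out of turns.
from dataclasses import dataclass, field

@dataclass
class ToolCall:
    name: str
    args: dict

@dataclass
class AgentResult:
    calls: list = field(default_factory=list)
    done: bool = False

def run_agent(llm_step, tools, max_turns=10):
    """Let the model drive tool selection; no per-task hardcoded handlers."""
    history = []
    for _ in range(max_turns):
        decision = llm_step(history)       # model returns a ToolCall, or None when finished
        if decision is None:
            return AgentResult(calls=history, done=True)
        try:
            output = tools[decision.name](**decision.args)
        except Exception as exc:           # feed errors back so the model can recover
            output = {"error": str(exc)}
        history.append((decision, output))
    return AgentResult(calls=history, done=False)
```

Because errors are appended to the history rather than raised, the model gets a chance to adjust its next call — which is exactly why new task types need no code changes.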

2. Conversation Layer Separate From Execution

Evidence: NikolaiPoverud (68.98, highest documented score) used pre-fetch + agent loop. hansjto used a two-pass approach (Opus executes, Sonnet verifies). Oddonline used a "think" tool before executing.

Applied: OpenClaw handles conversation (multi-turn, context, clarification). Accounting API handles execution (stateless, one task). Clean separation.

3. Pre-Fetch Common Data During LLM Call

Evidence: We implemented this during the competition — saved 3-5 seconds per complex task. NikolaiPoverud used ThreadPoolExecutor with 10 workers.

Applied: On each request, the Accounting API pre-fetches accounts, departments, VAT types, divisions in parallel while the LLM processes. Results cached per request.
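A sketch of the parallel pre-fetch, in the spirit of NikolaiPoverud's ThreadPoolExecutor approach. The `fetch` callable is a stand-in for the real API client; the resource names mirror the ones listed above.

```python
# Pre-fetch common reference data in parallel; the result acts as a
# per-request cache so later steps skip redundant lookups.
from concurrent.futures import ThreadPoolExecutor

PREFETCH_RESOURCES = ["accounts", "departments", "vatTypes", "divisions"]

def prefetch(fetch, resources=PREFETCH_RESOURCES, workers=10):
    """Fire all lookups concurrently and return {resource: result}."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(resources, pool.map(fetch, resources)))
```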

4. Auto-Fix API Errors Between Turns

Evidence: NikolaiPoverud auto-fixed 422 errors by removing invalid fields and retrying. alexrajo used OpenAPI preflight validation to prevent errors. Oddonline blocked endpoints after 3 failures.

Applied: The Accounting API's agent loop includes error recovery: strip invalid fields, retry with corrections, block repeated failures. The LLM also sees the error and adjusts.

Tripletex API Learnings

Supplier Invoice — What Actually Works

Our experience: amountCurrency on POST body → 500 from proxy. voucherDate → 422.

hansjto's approach: Use /incomingInvoice?sendTo=ledger instead of /supplierInvoice.

NikolaiPoverud's approach: 2-posting with vatType (net on amount, gross on amountGross). Auto-fix net/gross math errors.

Applied: The TripletexProvider tries multiple approaches in order:

1. /incomingInvoice?sendTo=ledger (hansjto's method)
2. /supplierInvoice with inline voucher postings
3. Raw voucher fallback
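The try-in-order chain can be sketched like this. The strategy callables stand in for the real API calls; only the ordering logic is the point.

```python
# Multi-strategy fallback: attempt each posting strategy in order and
# return the first success, collecting errors for diagnostics.
def create_supplier_invoice(strategies, invoice):
    """strategies: list of (name, callable) tried in priority order."""
    errors = []
    for name, strategy in strategies:
        try:
            return {"via": name, "result": strategy(invoice)}
        except Exception as exc:
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all strategies failed: " + "; ".join(errors))
```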

Payroll — Skip the Salary API

Our experience: /salary/transaction always fails with "Arbeidsforholdet ikke knyttet mot virksomhet" (employment not linked to division).

NikolaiPoverud (68.98): System prompt says "STOP after voucher. Do NOT try /salary/transaction — it always fails." Uses manual voucher: debit 5000/5001, credit 2930.

Applied: TripletexProvider uses manual voucher for payroll. More reliable than the salary API.
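A minimal sketch of the manual payroll voucher: debit the salary expense account (5000/5001), credit net salary payable (2930), as in NikolaiPoverud's approach. The payload shape is a simplification, not the Tripletex voucher schema.

```python
# Balanced two-posting payroll voucher; positive = debit, negative = credit.
# Accounts follow the Norwegian standard chart (5000 salary, 2930 payable).
def payroll_voucher(gross, expense_account=5000, payable_account=2930):
    return {
        "description": "Salary",
        "postings": [
            {"account": expense_account, "amount": gross},    # debit: salary expense
            {"account": payable_account, "amount": -gross},   # credit: salary payable
        ],
    }
```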

Bank Reconciliation — No Reconciliation Endpoint

Our experience: We built POST /bank/reconciliation + PUT /bank/reconciliation/match/:suggest. Scored 2/10.

All other teams: None use the reconciliation endpoint. They parse CSV and register individual payments.

Applied: Parse bank statement, match transactions to invoices, register payments individually.
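The matching step can be sketched as a simple amount-based pairing; each matched pair is then registered as an individual payment. This first-match-by-amount rule is an illustrative assumption, not the product's exact matching heuristic.

```python
# Pair bank-statement lines with open invoices of the same amount
# (within a tolerance); each match becomes one registered payment.
def match_payments(statement_lines, open_invoices, tolerance=0.01):
    matches, remaining = [], list(open_invoices)
    for line in statement_lines:
        for inv in remaining:
            if abs(line["amount"] - inv["amount"]) <= tolerance:
                matches.append((line, inv))
                remaining.remove(inv)   # an invoice is consumed by its first match
                break
    return matches
```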

Year-End Closing — Separate Vouchers Per Asset

Our experience: One combined depreciation voucher → scored 4.5/10. Separate per asset → improvement.

hansjto: GET /balanceSheet BEFORE postings. Calculate tax from actual balances + new amounts.

Applied: One voucher per depreciation asset. Read actual P&L from ledger for tax calculation.
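The per-asset split can be sketched as below. Accounts 6000 (depreciation expense) and the 1200-range balance accounts are typical Norwegian chart values used here as assumptions; the voucher shape is simplified.

```python
# One balanced depreciation voucher per asset, instead of one combined
# voucher for all assets.
def depreciation_vouchers(assets):
    return [
        {
            "description": f"Depreciation {a['name']}",
            "postings": [
                {"account": 6000, "amount": a["depreciation"]},              # debit: expense
                {"account": a["balance_account"], "amount": -a["depreciation"]},  # credit: asset
            ],
        }
        for a in assets
    ]
```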

Travel Expense Per Diem

NikolaiPoverud: Does NOT set rateCategory. Just sets rate, count, location.

hansjto: Key trick: rateType.id = rateCategory.id (same value). count = overnight stays, NOT days.

Applied: Per diem with explicit rate from prompt. Count = overnight stays.

Stable IDs Across Sandboxes

Confirmed by natti1399 and Oddonline: VAT type IDs are stable:

| Rate | Direction | VAT type ID |
| --- | --- | --- |
| 25% | output | 3 |
| 15% | output | 31 |
| 12% | output | 33 |
| 0% | output | 5 |
| 25% | input | 1 |

Applied: Hardcoded in TripletexProvider. Saves API lookup per request.
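The static map can look like this (values taken from the table above; the function name is illustrative):

```python
# Stable VAT type IDs, hardcoded instead of looked up per request.
VAT_TYPE_IDS = {
    ("25%", "output"): 3,
    ("15%", "output"): 31,
    ("12%", "output"): 33,
    ("0%", "output"): 5,
    ("25%", "input"): 1,
}

def vat_type_id(rate, direction):
    return VAT_TYPE_IDS[(rate, direction)]
```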

Fields That Break Things

From our testing and other teams:

| Field | Endpoint | Problem |
| --- | --- | --- |
| amountCurrency | POST /supplierInvoice | 500 from proxy |
| voucherDate | POST /supplierInvoice | 422 "field doesn't exist" |
| isCustomer | POST /customer | readOnly |
| isSupplier | POST /supplier | readOnly |
| amount | POST /supplierInvoice | readOnly |
| departmentNumber | various | Often causes validation errors |

Applied: Provider strips known-bad fields before sending. OpenAPI preflight validation (from alexrajo's approach).
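A sketch of the OpenAPI preflight step in the spirit of alexrajo's approach: strip `readOnly` fields and check `required` ones before the request leaves the provider. The schema shape here is the standard OpenAPI property layout, simplified.

```python
# Preflight validation against a (simplified) OpenAPI object schema:
# drop readOnly fields, then verify required fields are present.
def preflight(schema, payload):
    props = schema.get("properties", {})
    cleaned = {
        k: v for k, v in payload.items()
        if not props.get(k, {}).get("readOnly", False)
    }
    missing = [r for r in schema.get("required", []) if r not in cleaned]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    return cleaned
```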

LLM Strategy Learnings

Model Choice

| Model | Teams using | Strengths |
| --- | --- | --- |
| Gemini 2.5 Flash | KreativKI, alexrajo, natti1399 | Fast (~2s), cheap, native function-calling |
| Gemini 2.5 Pro | maksdunajski, sth1712 | Better reasoning, slower |
| Claude Opus 4.6 | NikolaiPoverud, hansjto | Best reasoning, code execution |
| Claude Haiku | us, kleiven | Fast, cheap, good enough for parsing |

Applied: Gemini 2.5 Flash for conversation (fast, cheap). Claude for complex tasks if needed (fallback).

System Prompt Design

Common across all teams:

- Full API reference embedded (~5K chars)
- Hardcoded stable IDs (VAT types, currency)
- "Fresh sandbox" instruction
- "Minimize write calls" instruction
- Norwegian accounting conventions

Oddonline's innovation: Per-task prompt injection — only include the relevant task pattern, not all patterns. Keeps context focused.

Applied: Base system prompt + task-specific recipes injected dynamically.

Error Prevention

alexrajo: OpenAPI preflight validation — parse Tripletex schema, strip readOnly fields, validate required fields BEFORE hitting the API.

Oddonline: Invalid field blocklist in prompt — explicitly tell the LLM which field names DON'T exist.

Applied: Both approaches. Preflight validation in provider. Blocklist in system prompt.

Testing Learnings

Save Every Request

Our approach: Saved all 294 competition requests to disk. Audited every one against regex parser. Found 25 misclassifications.

Applied: Every request to the Accounting API is logged with full input/output. Replay testing against saved requests.
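The log-and-replay pattern can be sketched as below; the on-disk format and function names are assumptions for illustration.

```python
# Save each request/response pair to disk, then replay saved requests
# through a handler and report any diverging responses.
import json
import pathlib

def log_request(directory, request, response):
    path = pathlib.Path(directory) / f"{request['id']}.json"
    path.write_text(json.dumps({"request": request, "response": response}))
    return path

def replay(directory, handler):
    """Re-run the handler on every saved request; return mismatched files."""
    mismatches = []
    for path in sorted(pathlib.Path(directory).glob("*.json")):
        saved = json.loads(path.read_text())
        if handler(saved["request"]) != saved["response"]:
            mismatches.append(path.name)
    return mismatches
```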

Test Before Deploy

Our mistake: Deployed a --max-tokens CLI flag that broke ALL LLM calls. It took a full submission to discover.

Applied: CLI/SDK validation test in CI. Integration test against real Tripletex sandbox before deploy.

Mutation Fuzzer (Pling-AS)

Innovation: When a task scores low, auto-generate mutation candidates (different VAT IDs, remove fields) and resubmit.

Applied: Not directly, but the auto-fix between agent turns achieves a similar effect dynamically.

Speed Learnings

Token TTL Is Not a Timer

Our discovery: 237-second task scored 7.5/10. Token TTL is >300s or not time-based.

Applied: No artificial timeout. Let tasks run to completion.

Reduce API Round-Trips

Our implementation: Account cache per request saved 10-20 duplicate lookups.

NikolaiPoverud: Pre-fetch with 10 parallel workers.

Applied: Pre-fetch + cache. Common accounts, departments, VAT types loaded once.

CLI Subprocess Is Slow

Our measurement: CLI = ~4.5s (0.9s overhead + 3.7s API). Direct SDK = ~2-3s.

Applied: Always use SDK directly. Never CLI subprocess.

What No Team Solved

Even the best teams had these unsolved:

  1. Bank reconciliation entity — no team uses /bank/reconciliation. All just register payments.
  2. Receipt expense — consistently low scores across teams. PDF OCR + correct account mapping is hard.
  3. Year-end tax with empty sandbox — if no pre-existing income, tax = 0. Hard to know if sandbox has data.

Applied: These are known limitations documented for human accountant review.

Summary: Competition → Product

| Competition learning | Product feature |
| --- | --- |
| Agent > handlers | Function-calling agent loop |
| Conversation needs context | OpenClaw with structured facts |
| Pre-fetch saves time | Parallel data loading |
| Auto-fix 422s | Error recovery between turns |
| /incomingInvoice for SI | Multi-strategy provider |
| Skip salary API | Manual voucher for payroll |
| Hardcode VAT IDs | Static config, not API lookup |
| Save all requests | Full audit trail + replay testing |
| Test before deploy | CI with sandbox integration tests |
| 22 teams studied | Best practices from each applied |