Seven Principles I Learned Working with Claude: Spec-Driven, Boundary-First, Decisions That Stick
Three months of pair-building Joseph with Claude Code gave me seven principles that let the AI actually run the development loop on its own — not babysitting-style collaboration where you hover over every step.
TL;DR if you're in a hurry
Getting AI to develop independently comes down to one thing: write specs as contracts, not as wishes. That unpacks into seven principles: spec before code, Given/When/Then to kill ambiguity, start from the end, boundary cases first, don't mix questions with instructions, decide when the answer arrives, self-verify before closing. This post pairs each principle with a real moment from the Joseph build.
Series overview
- [Part 1] Identity: why three projects had to die before Joseph survived
- [Part 2 — this post] Methodology: the seven principles
- [Part 3] Build log: nine DRY RUN failures, T+2 triple lock, 458 tests, 18 crons
Why "babysitter collaboration" doesn't scale
A lot of people use AI to code like this: throw a vague request → AI writes a blob → something feels off → tweak the prompt → AI writes another blob → it runs but breaks → tweak → repeat forever.
The problem isn't that the AI isn't strong enough. It's that no collaboration contract got established. You haven't told it what "done" looks like or how "correct" gets verified. It can only keep guessing.
One of my favorite docs in the Joseph repo is docs/AI協作到AI全自動開發攻略.md (roughly, "a playbook for going from AI collaboration to fully automated AI development"). Its opening says:
"A spec isn't a document. It's a contract. A spec written well enough lets AI run the whole dev loop on its own — coding, testing, debugging, reporting."
This post unpacks that sentence into seven executable principles.
Principle 1: Spec before code
How: Don't start by writing code. Write SPEC_*.md first. At minimum: purpose, acceptance criteria (Given/When/Then), things that must not happen.
Joseph example: Before going live, I wrote docs/DRY_RUN_FIX_SPEC.md, defining what "fixed" meant — every fix had an executable verify command, e.g., `grep "BUY" logs/trading.log | tail -1` should return a specific format. With a verify command, "done" is a real thing you can discuss.
Why it works: If you can't write the spec, you haven't thought it through. Letting the AI code at that point just accelerates the mess. Make the spec force you to be clear. Then the AI can move.
Principle 2: Given / When / Then to kill ambiguity
How: Every acceptance criterion in this rigid format. Given = preconditions; When = trigger; Then = expected outcome. No natural-language prose.
Joseph example: The live Gate 10 T+2 guard acceptance was written like this:
- Given: sold 10 lots yesterday, not yet settled
- When: today's allocator computes `available_cash`
- Then: exclude unsettled amounts still in `pending_settlements`
- Verify: `python -m core.capital_manager --dry-check`
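To show how rigid the mapping is, here's a minimal pytest sketch of that acceptance criterion as an executable test. The `Ledger` class is a hypothetical stand-in, not Joseph's real capital manager API; the point is that Given becomes setup, When becomes the call, and Then becomes the assertion.

```python
# Hypothetical sketch: Given/When/Then mapped 1:1 onto a pytest test.
# `Ledger` is a stand-in, not Joseph's real capital manager.
class Ledger:
    def __init__(self, cash: int, pending_settlements: int):
        self.cash = cash  # broker-reported cash, assumed to include unsettled proceeds
        self.pending_settlements = pending_settlements  # unsettled T+2 amounts

    def available_cash(self) -> int:
        # T+2 guard: money that hasn't settled isn't spendable yet.
        return self.cash - self.pending_settlements

def test_t2_guard_excludes_unsettled_proceeds():
    # Given: sold 10 lots yesterday, proceeds not yet settled
    ledger = Ledger(cash=1_000_000, pending_settlements=350_000)
    # When: today's allocator computes available cash
    available = ledger.available_cash()
    # Then: unsettled amounts in pending_settlements are excluded
    assert available == 650_000
```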
Why it works: Natural language says "the system should correctly handle T+2" — what does "correctly" mean? What does "handle" mean? The AI can interpret it a hundred ways. Given/When/Then has one interpretation.
Common failure: Writing a weak Then like "returns a result." Returns what? Format? Error handling? Write Then at the level where a shell command can verify it.
Principle 3: Start from the end (test-first thinking)
How: Before starting, ask "how will I prove this is right?" Write the verification method down first, then start coding.
Joseph example: Task #7 was to add _call_shioaji_place_order() real-order capability. Before touching code, I defined "pass" as three conditions: 72 unit tests green, Shioaji login succeeding on the paper account, and polling plus fallback covering all terminal states. Those three conditions decided which tests to write and which code to write.
Why it works: The verification method decides the implementation direction. If you don't know how to verify something's correct, you can't write something truly correct — you can only write something that looks correct. Code that looks correct teaches you humility in production.
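A sketch of what that looks like in practice, with hypothetical names (`OrderResult` and `place_order_not_yet_written` are illustrations, not Joseph's API): the test is written while the function body is still a stub, so the verification method exists before the implementation does.

```python
from dataclasses import dataclass

TERMINAL_STATES = {"FILLED", "CANCELLED", "REJECTED", "FAILED"}

@dataclass
class OrderResult:
    status: str

def place_order_not_yet_written(symbol: str, qty: int) -> OrderResult:
    # Deliberately unimplemented: Principle 3 says the test below exists first.
    raise NotImplementedError

def test_order_reaches_a_terminal_state():
    # This fails today, and the task is "done" exactly when it passes.
    result = place_order_not_yet_written("2330", qty=1)
    assert result.status in TERMINAL_STATES
```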
Principle 4: Boundary cases first
How: The happy path goes last. List every boundary case first — no network, API error, empty data, missing permissions, timezone shift, integer overflow, concurrent request, partial failure.
Joseph example: The 4/10 DRY RUN blew up on three boundary bugs:
- `equity=0` made all allocations zero → added `equity = cash + marked_value` fallback
- Scanner judged 08:30 as "closed" → boundary timestamp wrong
- ShadowExecutor missing async `get_balance()`/`get_positions()` → no async fallback
All three were "can't happen in normal operation" — and all three happened in the first minute of the cron's first real start.
Why it works: Anyone can code the happy path. What actually kills systems is the boundary. Listing boundaries first is buying insurance.
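Here's what "boundary first" looks like as tests, using the 08:30 bug as the template. The `is_market_open` function and the 08:30 to 13:30 window are assumptions for this sketch, not Joseph's actual scanner logic:

```python
from datetime import time
import pytest

def is_market_open(now: time) -> bool:
    # Assumed scan window: 08:30 pre-open through the 13:30 close.
    # The exact boundary is where the 4/10 bug lived.
    return time(8, 30) <= now <= time(13, 30)

@pytest.mark.parametrize("t, expected", [
    (time(8, 29), False),   # one minute early: still closed
    (time(8, 30), True),    # the boundary itself: the bug that fired on 4/10
    (time(13, 30), True),   # closing boundary is inclusive
    (time(13, 31), False),  # one minute late: closed again
])
def test_market_open_boundaries(t, expected):
    assert is_market_open(t) == expected
```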
Principle 5: Don't mix questions with instructions
How: When asking the AI a question, don't sneak in a command. When telling it to do something, don't sneak in a vague question.
Wrong: "Take a look at trader_live.py and see if there's anything to optimize, and also finish the polling logic."
Right, as two separate turns:
- Question: "In trader_live.py, what order statuses does
_call_shioaji_place_order()currently support? Is anything missing?" - Instruction: "Please add FILLED / CANCELLED / REJECTED / FAILED as exhaustive match on
_call_shioaji_place_order(); missing any status should raiseUnknownOrderStatusError."
Why it works: In question mode the AI gives information. In instruction mode it modifies code. The mental mode is different. Mix them and it oscillates between "I should explain" and "I should edit," and does neither well.
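For reference, here's roughly what that instruction should produce. This is a sketch: only `UnknownOrderStatusError` and the four statuses come from the actual instruction, the function name and return values are illustrative, and `match` needs Python 3.10+.

```python
class UnknownOrderStatusError(Exception):
    """A status outside the exhaustive list: fail loudly, never guess."""

def resolve_terminal_status(status: str) -> str:
    # Exhaustive match over the four terminal states named in the instruction.
    match status:
        case "FILLED":
            return "position_updated"
        case "CANCELLED" | "REJECTED" | "FAILED":
            return "no_fill"
        case _:
            # Anything unlisted is a contract violation, not a default branch.
            raise UnknownOrderStatusError(status)
```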
Principle 6: The answer arrives, the decision is now
How: After the AI gives you an executable answer, decide on the spot — take it, reject it, change this parameter. No "let me think about it later" or opening a new chat tomorrow. Decide while the context is hot.
Joseph example: On 4/17 during the T+2 diagnosis, Claude gave me six alternative designs: A (extend shadow_ledger), B (new pending_settlements module), C (hybrid), D (defer until after go-live)... I picked B on the spot, wrote the reason into PRELIVE_T2_DIAGNOSIS_20260417.md, and went straight to Task #6 implementation. No "sleep on it."
Why it works: The AI's time is cheap. Your attention is expensive. Every extra round of discussion on the same question raises cost and dilutes the answer. The moment the answer lands is the optimal decision window — all context is still warm. Decide later and you'll forget why those six options were ordered the way they were.
Important: "Decide now" doesn't mean "can never change the decision." It means don't stall. If it turns out wrong later, flip it openly — but don't do hindsight-driven second-guessing ("I knew it all along...").
Principle 7: Self-verify before closing
How: The AI finishing a task ≠ the task is closed. Require it to self-verify — run tests, check logs, compare against spec — and update progress docs. A task without self-verification is an unclosed task.
Joseph example: Every task closes with Claude running `pytest tests/ -v`, confirming all 458 tests green, then updating docs/DEV_PROGRESS.md. If tests didn't run or DEV_PROGRESS wasn't updated, I don't accept the task.
Why it works: The AI's "I'm done" isn't the same as "I got it right." It honestly reports what it changed, but doesn't always confirm what it changed is correct. Self-verification turns claims into evidence.
Specific file: The Joseph repo has docs/DEV_PROGRESS.md, which Claude maintains — a timeline of "what I did / verification result / next step." That doc has saved me several times: before go-live, whenever I forgot where a fix stood, checking it told me.
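As a sketch, the close-out gate can be a script this small (illustrative, not a file from the Joseph repo):

```python
import subprocess
import sys

def self_verify() -> None:
    """Close-out gate: a task isn't closed until the suite says so."""
    result = subprocess.run(
        ["pytest", "tests/", "-q"], capture_output=True, text=True
    )
    if result.returncode != 0:
        # Refuse to close; surface the tail of the output as evidence.
        sys.exit("Self-verify FAILED:\n" + result.stdout[-2000:])
    print("Suite green. Update docs/DEV_PROGRESS.md, then close.")

if __name__ == "__main__":
    self_verify()
```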
All seven in a single session
A real example:
- I write the spec (P1): TASK7_DIAGNOSIS_20260418.md, describing the gap in `_call_shioaji_place_order`
- Given/When/Then acceptance (P2): one line per terminal state
- Verify command upfront (P3): `pytest tests/test_trader_live_real.py -v`, expecting 72 tests green
- Boundary cases listed (P4): network drop, Shioaji login fail, order in "unknown state" — each a test
- Instruction to Claude is clean (P5): pure instruction, no vague question bundled in
- Pick on the spot (P6): Claude returns option B, I pick it, write the reason into IMPL_LOG
- Close with tests + DEV_PROGRESS update (P7): 72 tests green, IMPL_LOG written, then close
I wasn't hovering. I went to work. Got food. Slept. Came back to check the commit log, read IMPL_LOG, ran the tests myself. Confirmed results matched spec. That's what "independent development" looks like.
A principle that doesn't exist: "three-role split"
Some articles say "let the AI role-play as strategist / engineer / UX judge." I've tried that prompt style for Joseph-type systems. It doesn't help, for two reasons:
- The AI doesn't really switch roles — it switches tone
- What actually separates roles is phases, not personas — spec phase (I write), execution phase (Claude executes), verification phase (tests run)
My actual division of labor:
- Human (me) = writes spec, makes decisions, reviews PRs
- Claude Code = reads spec, writes code, runs verification, reports
- Test suite = the referee; listens to no one, only checks results
Boundaries between these three are hard. Role play is romantic. Hard boundaries are useful.
Why these matter more for non-engineers
Engineers can code without specs because they carry the model in their head. Non-engineers don't have that model — a fuzzy spec drags the AI somewhere unknown. A spec is a non-engineer's emergency brake, not bureaucracy.
Joseph surviving is largely because I accepted I'm not an engineer, so spec writing became the top priority. That's not humility. That's pragmatism.
FAQ
Q1: Do these apply to any AI tool or just Claude Code? The skeleton transfers. I used Claude Code to build Joseph, but Given/When/Then, spec-first, start-from-the-end are tool-agnostic. Claude Code has a nice advantage: it runs commands, runs tests, reads logs itself — so "self-verify" is seamless. With another tool you might have to run tests yourself and paste results back.
Q2: How long should a spec be? Length isn't the point. Verifiability is. A three-line Given/When/Then with a real verify command beats a 30-line natural-language description by 100x.
Q3: What if the AI's proposal doesn't match what I had in mind? Discuss on the spot. Don't open a new chat. Say "why did I want A, and what am I missing about your B?" The AI is good at explaining its own reasoning. One round and you'll know whether to go A or B.
Q4: How far does self-verification go? Far enough to answer "what is this code now actually correct at?" Minimum: all tests green. Then: run one real scenario (DRY RUN). Then: scan logs for errors. Three layers green = actually done.
Q5: Which of the seven is hardest? Principle 6 (decide on the spot). Decisions are tiring; you want to punt to tomorrow. But tomorrow you've lost 60% of the context; decision quality drops. Training myself to "decide when the answer lands" took the longest to stick.
Q6: When do these principles fail? When you don't know what you want. Spec-driven requires knowing the shape of "done" — even fuzzily. If you haven't even figured out "what I'm trying to solve," don't write a spec. Open a journal and write down the real problem. These principles can't help there.
Q7: Can Claude Code really run the whole loop alone? Yes, if the spec is clear, verify commands concrete, boundaries complete. Joseph's final two weeks — Task #6 and Task #7 — were me writing the spec and Claude independently implementing + testing + writing IMPL_LOG, with me only reviewing final results. Each task's spec doc ran over 200 lines. AI independence ≈ spec completeness.
Q8: Is there a template I can copy? Joseph's public docs/JOSEPH_FOUNDATIONAL_RULES.md and docs/AI協作到AI全自動開發攻略.md are the template. Format matters more than content.
Next
Part 3: From nine DRY RUN failures to go-live →
Methodology done. Part 3 is pure data — nine DRY RUN failure modes, T+2 triple-lock design, Shioaji pitfalls, 18 crons. For anyone who wants to see what a real automated investment system actually looks like.