Stage 1 — Foundational Rebuild Plan

Status: archived — Stage 1 is delivered

Sub-stages A, B, C, D all landed. Kept as a record of the decisions and the order they were made in. For the current forward plan see Stage 2.

Date: 2026-04-12 (session 8)

Supersedes: plan-forward-roadmap.md Stage 1 onwards (Stage 0 already landed in commit f529798)

Why this exists

Five parallel research sessions in session 8 (Q1-Q5) revealed that multiple foundational layers of our inject are built on wrong models. Fixing the surface symptoms (Stage 0 and earlier) only exposed the deeper model errors. A summary of what is wrong:

Wrong properties source (Q1). We run a Go relay (iosmux-relay) on the macOS VM and have CDS HTTP-fetch device properties from 127.0.0.1:62078/device-info. The properties needed are already in our process memory as g_rsd_handshake_properties (46 keys parsed from the RSD Handshake during connection setup). The HTTP fetch was architectural debt from before we had the handshake parser. When the relay isn't running, the fetch fails, the property dict is empty, MDRemoteServiceDeviceLoadBaseProperties fails, and the wrapper is created in a half-broken state ("Failed to load remote device properties" → "Failed to allocate RSD device" → developer mode becomes Unknown).
Wrong action-dispatch hook target (Q3). Step 19 (mov al, 1; ret at CDS+0xB896) was added in commit 6501219 to mask a SIGSEGV in PairAction's NULL self handler. The real root cause of that SIGSEGV was the SDR+104 read overflow that we have since fixed (59ca5cd). Step 19 is now masking a bug that no longer exists, while simultaneously blocking DeviceManagerCheckInRequest from reaching ServiceDeviceManager. Removing it (or replacing it with a passthrough) should restore check-in.
Wrong device-list intercept (Q5). We hook serviceDeviceRepresentations(forDeviceIdentifiedBy:) and assume devicectl list devices reads it. It does not. devicectl list devices is a pure XPC flow:

client → CDS:  DeviceManagerCheckInRequest { identifier }
CDS → client:  DeviceManagerCheckInCompleteEvent {
                   checkInRequestIdentifier,
                   initialDeviceSnapshots: [DeviceStateSnapshot],
                   serviceFullyInitialized: Bool
               }

The list comes from initialDeviceSnapshots, which is populated by ClientManager.publish(event:) driven by entries in managedServiceDevices. Our handleDiscoveredSDR call publishes an event but does NOT add anything to managedServiceDevices (see iosmux_inject.m:982 comment). Our hook on the serviceDeviceRepresentations getter is on the PairAction lookup path, which is a completely separate code path from list-devices.

The unifying lesson: shortcuts compound. Each layer we built on top of a wrong assumption made the next layer harder to verify. We are unwinding them in foundational order so each fix can be validated against a clean baseline before adding the next.

Operating principle

Always choose the correct, grounded, reliable solution. Never take shortcuts unless explicitly approved.

This means in particular:

- Prefer "make the natural code path work" over "intercept the result"
- Prefer "feed in-memory data we already parsed" over "spin up an out-of-process service"
- Prefer "find and fix the root cause" over "mask the symptom"

Stage order rationale

Stages run strictly sequentially with a verification gate after each. We do not advance until the previous stage's exit criteria pass AND the new behavior matches the prediction. If reality diverges from prediction, STOP and run a diagnostic before adding more changes.

The order goes from foundational to surface:

S1.A — Properties source (Q1). Foundational because every later stage depends on the wrapper having a valid property dict.
S1.B — Action-dispatch hook (Q3). Restores check-in path before we touch list-devices, so we have an observable signal at the end of S1.C.
S1.C — Device registration in managedServiceDevices (Q5). The actual fix that makes devicectl list devices show the device.
Verification gate — devicectl list devices returns our iPhone end-to-end. From here we re-evaluate and write Stage 2.

Q4 (developerModeStatus) is deferred — research shows no runtime gate on it for the current target functionality. Will revisit when needed.

Stage S1.A — Build properties from in-memory RSD handshake

Goal: RSDDeviceWrapper init and _AMDeviceCreateWith... see a fully-populated property dict without any HTTP fetch. The Go relay is no longer required for property delivery.

S1.A.1 Analysis

Read inject/iosmux_xpc_proxy.m:557-602 (iosmux_build_device_properties) and inject/iosmux_md_proxy.m:30-44 (fetch_device_properties). Both call +[NSData dataWithContentsOfURL:] against http://127.0.0.1:62078/device-info. Both run during inject init.

Read inject/iosmux_xpc_proxy.m to find where g_rsd_handshake_properties is populated (during the RSD Handshake parse). Confirm it is in process memory by the time iosmux_build_device_properties runs.

Cross-reference the keys the consumer needs (per docs/research/rsd-wrapper-init-analysis.md and the inject log: 46 keys including ProductType, SerialNumber, UniqueDeviceID, OSVersion, BuildVersion, ChipID, etc.) against what the handshake dict contains. Note any deltas — the handshake dict may use slightly different key names than what MobileDevice expects.

S1.A.2 Implementation

Replace the HTTP fetch in iosmux_build_device_properties with:

1. Take g_rsd_handshake_properties (already an xpc_object_t dict)
2. Optionally translate / rename keys to match what MDRemoteServiceDeviceLoadBaseProperties expects (the audit in S1.A.1 will tell us if any translation is needed)
3. Synthesize LocationID (locally generated, currently set to 0)
4. Return the assembled dict
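As a rough illustration of the translation step and of the S1.A.1 delta audit, the sketch below models both on plain C string arrays rather than on the real xpc_object_t dict. Everything named here is hypothetical: the helper names, the array representation, and in particular the single rename entry (`HandshakeSerial` to `SerialNumber`), which is a placeholder and not an observed delta. The real table is whatever the audit finds.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical rename table: handshake key -> key the consumer expects.
 * The single entry is a placeholder, not a real observed delta. */
struct key_rename { const char *from; const char *to; };
static const struct key_rename k_renames[] = {
    { "HandshakeSerial", "SerialNumber" },
};

/* Return the consumer-side name for a handshake key (identity if unmapped). */
static const char *translate_key(const char *handshake_key)
{
    assert(handshake_key != NULL);
    for (size_t i = 0; i < sizeof k_renames / sizeof k_renames[0]; i++) {
        if (strcmp(handshake_key, k_renames[i].from) == 0)
            return k_renames[i].to;
    }
    return handshake_key;
}

/* Count required keys that are absent after translation and store up to
 * max_missing of them in `missing`. Returns the total number missing. */
static size_t audit_missing_keys(const char *const *required, size_t n_required,
                                 const char *const *handshake, size_t n_handshake,
                                 const char **missing, size_t max_missing)
{
    size_t n_missing = 0;
    for (size_t r = 0; r < n_required; r++) {
        int found = 0;
        for (size_t h = 0; h < n_handshake && !found; h++)
            found = strcmp(required[r], translate_key(handshake[h])) == 0;
        if (!found) {
            if (n_missing < max_missing)
                missing[n_missing] = required[r];
            n_missing++;
        }
    }
    return n_missing;
}
```

Running the audit before and after adding a rename entry gives a quick check that the entry actually closes the delta it was added for.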

Same change in iosmux_md_proxy.m fetch_device_properties if it is still on the live code path. (Audit whether md_proxy is even invoked in the current flow — if it is dead code from an earlier architecture, flag it and consider #if 0 per S0.3 pattern.)

S1.A.3 Build, deploy, test

scp source to havoc, make iosmux_inject.dylib
Deploy to /Library/Developer/CoreDevice/iosmux_inject.dylib (gated approval)
Trigger CDS (devicectl list devices is fine as the trigger, irrespective of what it returns at this stage)
Read /tmp/iosmux_inject.log and the system log filtered to CoreDeviceService

S1.A.4 Exit criteria

Inject log shows the property dict assembled from handshake (add a log line listing key count and a sample of keys)
System log NO LONGER shows Failed to load remote device properties
System log NO LONGER shows Failed to allocate RSD device
System log shows Successfully resolved developer mode status from device: <Enabled|Disabled> (a real value, not Unknown)
No new errors introduced
CDS still alive after the test

S1.A.5 Fallback

If MobileDevice is still unhappy after the fix:

- Log the property dict at the point of return to see what's actually in it
- Compare against pymobiledevice3's view of the same iPhone properties (read-only via pymobiledevice3 lockdown info)
- The deltas are the missing keys; add translation as needed

If the dict is populated but _AMDeviceCreateWith... still fails for a different reason, STOP and investigate before continuing to S1.B.

S1.A.6 Commit

Single commit, gated. Title pattern: Stage 1.A: build wrapper properties from in-memory RSD handshake

Stage S1.B — Replace Step 19 with passthrough trampoline

Goal: Restore the natural Mercury.XPCMessageDispatcher flow so DeviceManagerCheckInRequest reaches ServiceDeviceManager. Keep the hook infrastructure in place so we can later add envelope-based filtering, but make it functionally a no-op.

S1.B.1 Analysis

Read inject/iosmux_inject.m:1515-1576 (Step 19, commit 6501219). Current implementation: 3-byte patch at CDS+0xB896 = b0 01 c3 (mov al, 1; ret).

Original target is the function entry of CoreDeviceUtilities.invoke(anyOf:usingContentsOf:). We want to replace those 3 bytes with a 14-byte trampoline that jumps to the original function. Since we are replacing the function entry, we need the original function's first few bytes preserved somewhere or we need to compute the original entry address from another source.

Per Stage 0 finding: read the rel32 of any callsite that calls this function before we patch (we already do something similar at CDS+0x5E2D0). The call from CDS+0xB871 (or wherever Mercury invokes it) gives us the rel32 → original target address. Save that address as a global and use it in the trampoline.

Alternative: hook a callsite instead of the function entry. The function is small and called from one place per inject/iosmux_inject.m comments. Hooking the callsite is cleaner because we don't have to preserve any of the original entry bytes.
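The rel32 read described above can be sketched as follows. This is a minimal illustration that assumes the callsite is a 5-byte near call (opcode E8 followed by a signed 32-bit displacement) and a little-endian host, which holds on x86-64. The function name call_rel32_target is hypothetical and not taken from the inject source.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Decode the target of an x86-64 near call (opcode E8 + rel32).
 * The displacement is relative to the address of the next instruction,
 * i.e. callsite + 5. Returns 0 if the bytes are not an E8 call. */
static uintptr_t call_rel32_target(const uint8_t *callsite)
{
    assert(callsite != NULL);
    if (callsite[0] != 0xE8)
        return 0;
    int32_t rel32;
    memcpy(&rel32, callsite + 1, sizeof rel32);  /* unaligned-safe read */
    return (uintptr_t)(callsite + 5) + (uintptr_t)(intptr_t)rel32;
}
```

The decode has to run before any patch is written, otherwise the bytes read back are the patch rather than the original displacement.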

S1.B.2 Implementation

Two options, gated by the analysis in S1.B.1:

Option I: function-entry passthrough:

- Compute g_invoke_orig_target from a known callsite's rel32
- Allocate a hook page near CDS (within ±2GB of CDS+0xB896)
- Write trampoline: 48 b8 <orig> ff e0 (movabs rax, orig; jmp rax), 12 bytes
- Patch CDS+0xB896 with e9 <rel32-to-hookpage> (5 bytes), pad remaining bytes with 90 if we replaced more
- Caveat: CDS+0xB896 is the function's own entry, not a callsite, so the patch overwrites the first instructions of the very function we want to reach. The hook page cannot simply jump back to the entry, because that lands on the patch again. The right pattern is: patch the entry with a jmp to our hook page, run the saved prologue bytes there, then jump to original+<saved-prologue-len>. This is more delicate.

Option II: callsite passthrough (PREFERRED if a single callsite exists):

- Find the callsite from disassembly (commit 6501219 history / inject comments may already say where)
- Treat it the same way we treat CDS+0x5E2D0 (Stage 0 S0.1 pattern): read rel32, compute orig_target, write a hook page that calls our C decision function (currently always returns 1) and THEN passes through to orig_target when the decision is to pass through
- For Stage 1.B the C function always passes through (no logging, no decision). This is functionally equivalent to no hook at all, which is what we want.

The choice between Option I and II depends on whether invoke(anyOf:) has exactly one callsite. We already believe it does (per action-interception-full-picture.md E+F section, "single callsite"). Option II is the correct/grounded choice. Use it.
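For reference, the two byte sequences named in Option I can be encoded as in the sketch below: the 12-byte absolute jump (48 b8 imm64, ff e0) and the 5-byte relative jump (e9 rel32) with its ±2GB range check. This only builds the bytes in a caller-supplied buffer; allocating the hook page, changing page protections, and writing into CDS are out of scope. The helper names are hypothetical, and a little-endian host is assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Emit `movabs rax, target; jmp rax` (48 B8 imm64, FF E0) into out[12]. */
static void emit_abs_jump(uint8_t out[12], uint64_t target)
{
    assert(out != NULL);
    out[0] = 0x48;
    out[1] = 0xB8;
    memcpy(out + 2, &target, 8);   /* imm64, little-endian on x86-64 */
    out[10] = 0xFF;
    out[11] = 0xE0;
}

/* Emit `jmp rel32` (E9 rel32) as it would sit at patch_addr, aimed at dest.
 * Returns 0 on success, -1 if dest is outside the signed 32-bit range. */
static int emit_rel_jump(uint8_t out[5], uint64_t patch_addr, uint64_t dest)
{
    assert(out != NULL);
    /* Displacement is measured from the end of the 5-byte instruction. */
    int64_t disp = (int64_t)(dest - (patch_addr + 5));
    if (disp < INT32_MIN || disp > INT32_MAX)
        return -1;
    int32_t rel32 = (int32_t)disp;
    out[0] = 0xE9;
    memcpy(out + 1, &rel32, 4);
    return 0;
}
```

A failed range check in emit_rel_jump is the signal that the hook page was not allocated close enough to the patch site.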

S1.B.3 Build, deploy, test

Same procedure as S1.A.3.

S1.B.4 Exit criteria

System log shows Handling DeviceManagerCheckInRequest (was absent in session 7+8 logs)
System log shows Published DeviceManagerCheckInCompleteEvent
devicectl list devices returns within ~2s (not 15s timeout)
Result may still be "No devices found." because S1.C hasn't run yet
CDS still alive
No SIGSEGV in CDS log (this is the canary for the masked Session-5 bug — if the crash returns, it means SDR fix wasn't enough or we have another latent bug. STOP and investigate before S1.C.)

S1.B.5 Fallback

If check-in still fails to reach ServiceDeviceManager:

- Verify the trampoline is installed (read patched bytes back, log them)
- Verify g_invoke_orig_target is correct (compare to disassembly)
- Look for other interpose / hook that might be in the way

If SIGSEGV reappears:

- That's a bigger finding than this stage. Document the crash and back out S1.B.
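The "read patched bytes back, log them" check could look roughly like the sketch below. It is a generic helper under stated assumptions: the names hex_dump and patch_matches are hypothetical, and the caller is assumed to pass a readable address and the byte sequence it expects to find there.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Format `len` bytes at `addr` as space-separated lowercase hex into `out`.
 * Stops early rather than truncating a byte if `out` is too small. */
static void hex_dump(const uint8_t *addr, size_t len, char *out, size_t out_size)
{
    assert(addr != NULL && out != NULL && out_size > 0);
    size_t pos = 0;
    out[0] = '\0';
    for (size_t i = 0; i < len && pos + 3 < out_size; i++)
        pos += (size_t)snprintf(out + pos, out_size - pos,
                                i ? " %02x" : "%02x", addr[i]);
}

/* Return 1 if the live bytes equal the expected patch, 0 otherwise. */
static int patch_matches(const uint8_t *addr, const uint8_t *expected, size_t len)
{
    assert(addr != NULL && expected != NULL);
    return memcmp(addr, expected, len) == 0;
}
```

Logging the hex dump alongside the match result makes a mismatch diagnosable from the inject log alone, without attaching a debugger.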

S1.B.6 Commit

Single commit. Stage 1.B: replace Step 19 mov-al-1 with passthrough

Stage S1.C — Fix SDR identity (LANDED session 9)

Status: DONE. devicectl list devices now returns our iPhone with the correct UUID E8A190DD-64F5-44A4-8D57-28E99E316D60, state connected, model iPhone SE (3rd generation). The Q-C enum tag fix landed and the whole identity chain is consistent end-to-end.

Known issue: a race on the very first devicectl after a CDS relaunch, because the inject's registration runs inside a dispatch_after(3s) block. See the "Known issue after landing" section below.

Plan revised in session 8 after six clarifying research questions (Q-A..Q-F) overturned the original assumption that handleDiscoveredSDR doesn't register the SDR.

S1.C.1 — Original premise: WRONG

The original S1.C.1 assumed handleDiscoveredSDR did not result in our SDR being added to managedServiceDevices, so we planned to find the canonical registration function and call it ourselves (or hook install(browser:) to capture the discovery callback closure).

Runtime verification with the current S1.A + S1.B build proved this wrong. Decisive log line from a fresh CDS run:

ServiceDeviceManager - New device representation added to ecid_11836855534199284200:
  <ServiceDeviceRepresentation 0x...>
  { id = (ecid_11836855534199284200, uuid: B9BE8F31-6FD1-5ED4-83B0-4DD1CD9B0265),
    name = Optional("iPhone (iosmux)") }

Per Q3 research (docs/research/s1c-static-browser-disasm.md Q3 section), this log message is emitted from inside the async ServiceDeviceManager._offer(discoveredDeviceRepresentation:to:) body, after the dict insert at CoreDevice + 0x2829e0. So our SDR is in managedServiceDevices. The chain handleDiscoveredSDR → 0x27e850 → 0x507f0 → async _offer → dict insert already works. We are not missing a registration call.

S1.C.1' — Six clarifying questions (Q-A..Q-F)

After verifying the SDR was registered, six parallel research agents ran on havoc to find the actual root cause of devicectl list devices returning empty. Full findings appended to docs/research/s1c-static-browser-disasm.md under sections Q-A..Q-F.

Q-A — initialDeviceSnapshots construction: the snapshot builder is 0x289060, called from handle(clientCheckInRequest:from:) at CoreDevice + 0x286470. It iterates the same dict the managedServiceDevices getter projects (no separate cache). The per-entry builder is CoreDevice + 0x284030 and is the most likely site of any rejection / filter on a per-SDR basis.

Q-B — external identity management: the updateIdentifier(devId, sdr, sdm) call at inject/iosmux_inject.m:1011 is dead code with no side effects on failure. Identity management is disjoint from list visibility. There is no standalone "register external identity" function; entry is only via DeviceRepresentationProvider.consider(offer:). Recommendation: delete the updateIdentifier call.

Q-C — SDR UUID assignment: smoking gun. Our dev_id_buf[32] = 0 is mislabeled in inject as "enum tag = UUID variant", but per CoreDevice.DeviceIdentifier enum layout:

enum DeviceIdentifier {
    case ecid(UInt64)                       // tag 0, 8-byte payload
    case uuid(Foundation.UUID, Swift.String) // tag 1, 32-byte payload
}

With tag = 0, CDS interprets the first 8 bytes of our UUID buffer as a UInt64 ECID. Verified by arithmetic:

Our UUID E8A190DD-64F5-44A4-8D57-28E99E316D60
First 8 bytes little-endian: 0xA444F564DD90A1E8
= 11836855534199284200
= the ECID in the runtime log, exact match

B9BE8F31-6FD1-5ED4-83B0-4DD1CD9B0265 then comes from DeviceIdentifier.uuidRepresentation.getter at CoreDevice + 0x25b650, which for the .ecid case calls into an AMDevice keypath lookup synthesizing a deterministic UUID from the ECID.

Cascade of consequences from this single byte:

SDR enters managedServiceDevices as .ecid(...) instead of .uuid(...)
The snapshot per-entry builder (0x284030) likely rejects ECID entries that have no real backing AMDevice — explaining why initialDeviceSnapshots is empty even though the dict is non-empty
PairAction lookups query by B9BE8F31-... (the AMDevice-derived UUID), but our serviceDeviceRepresentations(forDeviceIdentifiedBy:) hook only matches E8A190DD-... — mismatch
Hostname-manager keeps using E8A190DD-... because it reads deviceInfo.serviceDeviceIdentifier, a separate field on DeviceInfo (one we DO set correctly) — explaining why two different UUIDs appear in our logs simultaneously

Q-D — serviceFullyInitialized: time-based (asyncAfter), not predicate-based. Does NOT gate the snapshot list. Snapshot reads the same dict as managedServiceDevices getter, no separate cache. Race hypothesis is dead in the latest test run (SDR added at +0.371s, check-in handled at +0.476s — 105 ms margin) but was alive in an earlier test where dispatch_after(3s) deferred handleDiscoveredSDR until after check-in. This is a separate symptom.

Q-E — reading managedServiceDevices from inject: getter symbol confirmed at CoreDevice + 0x27e5c0 with mangled name $s10CoreDevice07ServiceB7ManagerC07managedC7DevicesSDyAA0B10IdentifierOSayAA0cB14RepresentationCGGvg. Type: [DeviceIdentifier : [ServiceDeviceRepresentation]]. Key is a DeviceIdentifier enum (NOT UUID/String — note this for the lookup side). ABI: self in %r13, return in %rax. C-helper recipe in the disasm doc.

Q-F — closure body validation: none. CDS+0x286b70 is pure "log and forward" with zero SDR field reads or rejection branches. The only branch is os_log_type_enabled(info) for log gating. 0x27e850 also has no rejection of our SDR — only an early exit on a single byte-flag we are not setting. Validation is NOT the bug.

S1.C.2 (REVISED) — minimal fix, two changes in one file

Effort: ~10 lines of code, single file inject/iosmux_inject.m.

Q-C fix — switch DeviceIdentifier enum tag from .ecid to .uuid. At inject/iosmux_inject.m:857-861 (the dev_id_buf construction):

// BEFORE
memset(dev_id_buf, 0, 33);
memcpy(dev_id_buf, uuid, 16);
dev_id_buf[32] = 0;  // wrong: tag 0 is .ecid(UInt64)

// AFTER
memset(dev_id_buf, 0, 33);
memcpy(dev_id_buf, uuid, 16);                          // .uuid payload[0]: Foundation.UUID
*(uint64_t *)(dev_id_buf + 16) = 0;                    // .uuid payload[1]: Swift.String _countAndFlagsBits = 0
*(uint64_t *)(dev_id_buf + 24) = 0xE000000000000000ULL; // tagged-small-string empty marker
dev_id_buf[32] = 1;                                     // tag = .uuid

The 0xE000000000000000 empty-string pattern is verified canonical per the disasm doc — it's the value the SDR's own description.getter emits for empty Swift.String literals.
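For illustration only, the AFTER layout can also be written as a standalone helper that is easy to unit-test off-device. This is a sketch, not the code that landed: the helper and constant names are hypothetical, it uses memcpy for the 64-bit stores so it does not depend on the alignment of the buffer, and it assumes a little-endian host as on x86-64.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { DEV_ID_SIZE = 33, DEV_ID_TAG_ECID = 0, DEV_ID_TAG_UUID = 1 };

/* Pack a .uuid(UUID, "") DeviceIdentifier into a 33-byte buffer:
 *   [0..15]  Foundation.UUID bytes
 *   [16..23] Swift.String _countAndFlagsBits = 0
 *   [24..31] empty small-string marker 0xE000000000000000
 *   [32]     enum tag, 1 = .uuid */
static void pack_uuid_device_identifier(uint8_t buf[DEV_ID_SIZE],
                                        const uint8_t uuid[16])
{
    assert(buf != NULL && uuid != NULL);
    const uint64_t count_and_flags = 0;
    const uint64_t empty_marker = 0xE000000000000000ULL; /* value per the disasm doc */
    memset(buf, 0, DEV_ID_SIZE);
    memcpy(buf, uuid, 16);
    memcpy(buf + 16, &count_and_flags, 8);
    memcpy(buf + 24, &empty_marker, 8);
    buf[32] = DEV_ID_TAG_UUID;
}
```

Keeping the tag values in a named enum makes the original mistake (writing 0 while the comment said "UUID variant") harder to repeat.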

Q-B fix — delete the dead updateIdentifier block. Remove the asm-emit + call site at inject/iosmux_inject.m:986-1054. Also remove the surrounding logging that references it. Update the comment block at lines 974-985 to note that handleDiscoveredSDR alone is the canonical entry and updateIdentifier was a no-op (per Q-B disasm).

S1.C.3 — Build, deploy, test

scp + build on havoc
Deploy + force CDS reload
Run devicectl list devices
Read CDS system log + inject log

S1.C.4 — Exit criteria

System log shows New device representation added to <our UUID E8A190DD-...> (NOT ecid_...) — i.e. the id line uses .uuid(E8A190DD-..., "") formatting
devicectl list devices returns our iPhone in the list with name "iPhone (iosmux)" and UDID matching our config
No Received identity update request ERROR (we deleted the call)
CDS still alive
Cross-check with the Q-E helper: managedServiceDevices getter count > 0 (optional, only if we add the helper for verification)

S1.C.5 — Fallback if test fails

If devicectl still returns empty after the Q-C+Q-B fix:

Most likely cause: the per-entry snapshot builder at CoreDevice + 0x284030 has ANOTHER rejection predicate beyond .ecid vs .uuid discrimination. Disassemble it (Q-A's lead).
Second-likely: race — handleDiscoveredSDR is called from inside the existing dispatch_after(3 * NSEC_PER_SEC) block, so on the FIRST devicectl invocation after CDS startup, the check-in fires before our SDR is added. Mitigations:
Move the registration to the synchronous part of the inject ctor, OR
Hook handle(clientCheckInRequest:from:) at CoreDevice + 0x286470 and run our registration synchronously on entry, then tail-call the original. This guarantees ordering and auto-repeats per request.

S1.C.6 — Commit

Single commit if S1.C.2 is enough. Title: Stage 1.C: fix DeviceIdentifier enum tag (.ecid → .uuid) and drop dead updateIdentifier

Followup commits if S1.C.5 fallbacks are needed.

Why this is a single change instead of the original multi-stage plan

The original S1.C plan had three sub-stages (research → ABI work → implementation) because we believed we needed to build a new path into managedServiceDevices. Q3 + verification + Q-A..F showed that the path already exists and works — we were corrupting the SDR's identity at the entry point. Once the identity is correct, the canonical chain handles everything else. This is exactly the "correct/grounded solution" pattern: find the actual root cause and fix it at its source, not layer more workarounds on top.

Known issue after landing: first-call race on CDS relaunch

On the very first devicectl list devices after a CoreDeviceService relaunch (fresh process, e.g. right after killall CoreDeviceService), the result is still "No devices found." The second and all subsequent calls correctly list the iPhone.

Root cause: our inject's full registration flow — build DeviceInfo, build SDR, call handleDiscoveredSDR — lives inside a dispatch_after(3 * NSEC_PER_SEC) block in iosmux_register_device(). By the time our SDR is added to managedServiceDevices, the first DeviceManagerCheckInRequest from the fresh devicectl has already been served from an empty dict.

This is cosmetic — once the script has run once, the dict has our SDR and every subsequent call works. But for a clean-boot UX we should fix it. Two options:

Move the SDR construction out of dispatch_after into the synchronous ctor body. The 3s delay exists for a reason we need to re-audit first (probably waiting for CDS's own init to finish) — removing it blindly risks racing against something else.
Hook ServiceDeviceManager.handle(clientCheckInRequest:from:) at CoreDevice + 0x286470 and run our registration synchronously on entry before tail-calling the original. Guarantees ordering and works regardless of what dispatch_after was protecting against.

Scheduled as S1.D — separate commit/stage, not a blocker for S1.C acceptance.
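The second option (register synchronously on entry to the check-in handler, then forward) can be sketched in plain C as below. This is a model of the ordering guarantee only, under several assumptions: the handler signature is simplified to a single pointer argument and ignores the real Swift calling convention, the two stub functions stand in for the real registration body and the original handler, and registration is assumed to be needed only once per process, which pthread_once enforces while making concurrent callers wait for it to finish.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Stand-ins for the real pieces, with counters so the ordering is observable. */
static int g_registration_runs;
static int g_forwarded_calls;

static void register_device_stub(void)   { g_registration_runs++; }
static void orig_checkin_stub(void *req) { (void)req; g_forwarded_calls++; }

typedef void (*checkin_handler_fn)(void *request);
static checkin_handler_fn g_orig_checkin_handler = orig_checkin_stub;
static pthread_once_t g_register_once = PTHREAD_ONCE_INIT;

/* Hook body: make sure registration has completed, then forward the
 * request to the original handler unchanged. */
static void hooked_checkin_handler(void *request)
{
    assert(g_orig_checkin_handler != NULL);
    pthread_once(&g_register_once, register_device_stub);
    g_orig_checkin_handler(request);
}
```

Because the registration runs before the original handler on the very first request, the first check-in can no longer be served from an empty dict, regardless of what the 3s delay was protecting against.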

S1.C.6 commit

Single commit for the Q-C + Q-B fix landed as part of the session 9 commit that also includes the architecture doc and the rewritten restore script.

Verification gate

After S1.C exits successfully:

Run devicectl list devices 5 times in a row from a clean state. All 5 must show our iPhone consistently.
Open Xcode → Window → Devices and Simulators. Confirm the device row appears.
Capture the Devices window's view of our device — what's its state? (connecting, connected, ready, paired, etc.)
Do NOT click Pair yet. Capture system log during step 2-3.
Document findings in docs/research/session-8-stage-1-results.md.

After this gate we re-plan Stage 2 based on what Xcode does. Likely candidates:

- Stage 2: address whatever Xcode complains about in step 3
- Stage 2: handle developerModeStatus if it actually matters now
- Stage 2: capture Mercury envelope catalog as old Stage 3 planned

Risk register (Stage 1)

| Risk | Impact | Mitigation |
|---|---|---|
| S1.A: handshake dict has different keys than MobileDevice expects | Wrapper init still fails | S1.A.1 catalogs the deltas before code change; translation table added if needed |
| S1.B: SIGSEGV from masked Session-5 crash returns | CDS dies | The SDR fix should have addressed it; if not we learn that during S1.B.4 with no further code added |
| S1.C.1: discovers there is no natural path and we have to hook | Forces an "intercept result" approach | Pause and confirm with user before falling back to that pattern |
| Stage order coupling: S1.A failure blocks S1.B test, S1.B failure blocks S1.C test | Single point of failure | Strictly sequential gating means we'll catch each independently |

What we're NOT doing in Stage 1

developerModeStatus setter (Q4): deferred, no observed runtime gate
Mercury envelope catalog: deferred, only needed for actions not list-devices
Code-audit medium findings (M1-M10): Stage 5
DYLD_INTERPOSE rework: not needed for Stage 1 scope
Pair button work: not until Xcode shows the device

Stopping rule (unchanged from old roadmap)

At any stage, if reality diverges from prediction:

1. STOP. Do not layer more changes.
2. Run a minimal diagnostic.
3. Document in a research doc.
4. Update this plan.
5. Only then resume.