Stage 1 — Foundational Rebuild Plan
Status: archived — Stage 1 is delivered
Sub-stages A, B, C, D all landed. Kept as a record of the decisions and the order they were made in. For the current forward plan see Stage 2.
Date: 2026-04-12 (session 8)
Supersedes: plan-forward-roadmap.md Stage 1 onwards (Stage 0 already
landed in commit f529798)
Why this exists
Five parallel research sessions in session 8 (Q1-Q5) revealed that multiple foundational layers of our inject are built on wrong models. Fixing the surface symptoms (Stage 0 and earlier) only exposed the deeper model errors. A summary of what is wrong:
- Wrong properties source (Q1). We run a Go relay (iosmux-relay) on the macOS VM and have CDS HTTP-fetch device properties from 127.0.0.1:62078/device-info. The properties needed are already in our process memory as g_rsd_handshake_properties (46 keys parsed from the RSD Handshake during connection setup). The HTTP fetch was architectural debt from before we had the handshake parser. When the relay isn't running, the fetch fails, the property dict is empty, MDRemoteServiceDeviceLoadBaseProperties fails, and the wrapper is created in a half-broken state ("Failed to load remote device properties" → "Failed to allocate RSD device" → developer mode becomes Unknown).
- Wrong action-dispatch hook target (Q3). Step 19 (mov al, 1; ret at CDS+0xB896) was added in commit 6501219 to mask a SIGSEGV in PairAction's NULL self handler. The real root cause of that SIGSEGV was the SDR+104 read overflow that we have since fixed (59ca5cd). Step 19 is now masking a bug that no longer exists, while simultaneously blocking DeviceManagerCheckInRequest from reaching ServiceDeviceManager. Removing it (or replacing with passthrough) should restore check-in.
- Wrong device-list intercept (Q5). We hook serviceDeviceRepresentations(forDeviceIdentifiedBy:) and assume devicectl list devices reads it. It does not. devicectl list devices is a pure XPC flow:
client → CDS: DeviceManagerCheckInRequest { identifier }
CDS → client: DeviceManagerCheckInCompleteEvent {
checkInRequestIdentifier,
initialDeviceSnapshots: [DeviceStateSnapshot],
serviceFullyInitialized: Bool
}
The list comes from initialDeviceSnapshots, which is populated by
ClientManager.publish(event:) driven by entries in
managedServiceDevices. Our handleDiscoveredSDR call publishes an
event but does NOT add anything to managedServiceDevices (see
iosmux_inject.m:982 comment). Our hook on the
serviceDeviceRepresentations getter is on the PairAction lookup
path, which is a completely separate code path from list-devices.
The unifying lesson: shortcuts compound. Each layer we built on top of a wrong assumption made the next layer harder to verify. We are unwinding them in foundational order so each fix can be validated against a clean baseline before adding the next.
Operating principle
Always choose the correct, grounded, reliable solution. Never take shortcuts unless explicitly approved.
This means in particular:
- Prefer "make the natural code path work" over "intercept the result"
- Prefer "feed in-memory data we already parsed" over "spin up an out-of-process service"
- Prefer "find and fix the root cause" over "mask the symptom"
Stage order rationale
Stages run strictly sequentially with a verification gate after each. We do not advance until the previous stage's exit criteria pass AND the new behavior matches the prediction. If reality diverges from prediction, STOP and run a diagnostic before adding more changes.
The order goes from foundational to surface:
- S1.A — Properties source (Q1). Foundational because every later stage depends on the wrapper having a valid property dict.
- S1.B — Action-dispatch hook (Q3). Restores check-in path before we touch list-devices, so we have an observable signal at the end of S1.C.
- S1.C — Device registration in managedServiceDevices (Q5). The actual fix that makes devicectl list devices show the device.
- Verification gate — devicectl list devices returns our iPhone end-to-end. From here we re-evaluate and write Stage 2.
Q4 (developerModeStatus) is deferred — research shows no runtime gate on it for the current target functionality. Will revisit when needed.
Stage S1.A — Build properties from in-memory RSD handshake
Goal: RSDDeviceWrapper init and _AMDeviceCreateWith... see a
fully-populated property dict without any HTTP fetch. The Go relay is
no longer required for property delivery.
S1.A.1 Analysis
Read inject/iosmux_xpc_proxy.m:557-602 (iosmux_build_device_properties)
and inject/iosmux_md_proxy.m:30-44 (fetch_device_properties). Both
issue NSData dataWithContentsOfURL: against
http://127.0.0.1:62078/device-info. Both run during inject init.
Read inject/iosmux_xpc_proxy.m to find where
g_rsd_handshake_properties is populated (during the RSD Handshake
parse). Confirm it is in process memory by the time
iosmux_build_device_properties runs.
Cross-reference the keys the consumer needs (per
docs/research/rsd-wrapper-init-analysis.md and the inject log: 46
keys including ProductType, SerialNumber, UniqueDeviceID,
OSVersion, BuildVersion, ChipID, etc.) against what the handshake
dict contains. Note any deltas — the handshake dict may use slightly
different key names than what MobileDevice expects.
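The cross-reference step amounts to a set difference between the keys the consumer needs and the keys the handshake dict carries. The sketch below is illustrative only: it works on plain C string arrays rather than the real xpc_object_t dict, and the helper names has_key and audit_missing_keys are hypothetical, not existing inject functions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns 1 if `key` appears in keys[0..n), else 0. */
static int has_key(const char *const *keys, size_t n, const char *key) {
    for (size_t i = 0; i < n; i++) {
        if (strcmp(keys[i], key) == 0) {
            return 1;
        }
    }
    return 0;
}

/*
 * Collects every entry of `required` that is absent from `have`.
 * At most `cap` pointers are stored in `missing`; the return value is the
 * total number of missing keys, which may exceed `cap`.
 */
static size_t audit_missing_keys(const char *const *required, size_t n_required,
                                 const char *const *have, size_t n_have,
                                 const char **missing, size_t cap) {
    assert(required != NULL && have != NULL);
    size_t count = 0;
    for (size_t i = 0; i < n_required; i++) {
        if (!has_key(have, n_have, required[i])) {
            if (missing != NULL && count < cap) {
                missing[count] = required[i];
            }
            count++;
        }
    }
    return count;
}
```

Feeding it the required list from the research doc and the key names dumped from the handshake would give the delta list directly. A non-zero result would mean either a rename or a genuinely absent key.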
S1.A.2 Implementation
Replace the HTTP fetch in iosmux_build_device_properties with:
1. Take g_rsd_handshake_properties (already an xpc_object_t dict)
2. Optionally translate / rename keys to match what
MDRemoteServiceDeviceLoadBaseProperties expects (the audit
in S1.A.1 will tell us if any translation is needed)
3. Synthesize LocationID (locally generated, currently set to 0)
4. Return the assembled dict
Same change in iosmux_md_proxy.m fetch_device_properties if it is
still on the live code path. (Audit whether md_proxy is even invoked
in the current flow — if it is dead code from an earlier architecture,
flag it and consider #if 0 per S0.3 pattern.)
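If the S1.A.1 audit does find renamed keys, step 2 could be a small lookup table applied while copying entries out of the handshake dict. This is a sketch under that assumption: the single table entry is a placeholder (no real rename is known at this point), and translate_key is a hypothetical helper name.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct key_rename {
    const char *handshake_key;     /* name as it appears in the RSD handshake */
    const char *mobiledevice_key;  /* name the consumer is assumed to expect */
};

/* PLACEHOLDER entry: replace with the real deltas found by the audit. */
static const struct key_rename k_renames[] = {
    { "HypotheticalHandshakeKey", "HypotheticalMobileDeviceKey" },
};

/* Maps a handshake key to the consumer's name; unknown keys pass through. */
static const char *translate_key(const char *handshake_key) {
    assert(handshake_key != NULL);
    for (size_t i = 0; i < sizeof k_renames / sizeof k_renames[0]; i++) {
        if (strcmp(k_renames[i].handshake_key, handshake_key) == 0) {
            return k_renames[i].mobiledevice_key;
        }
    }
    return handshake_key;
}
```

Keeping pass-through as the default means an empty table degrades to a plain copy, which matches the case where the audit finds no deltas.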
S1.A.3 Build, deploy, test
- scp source to havoc, make iosmux_inject.dylib
- Deploy to /Library/Developer/CoreDevice/iosmux_inject.dylib (gated approval)
- Trigger CDS (devicectl list devices is fine as the trigger, irrespective of what it returns at this stage)
- Read /tmp/iosmux_inject.log and the system log filtered to CoreDeviceService
S1.A.4 Exit criteria
- Inject log shows the property dict assembled from handshake (add a log line listing key count and a sample of keys)
- System log NO LONGER shows Failed to load remote device properties
- System log NO LONGER shows Failed to allocate RSD device
- System log shows Successfully resolved developer mode status from device: <Enabled|Disabled> (a real value, not Unknown)
- No new errors introduced
- CDS still alive after the test
S1.A.5 Fallback
If MobileDevice is still unhappy after the fix:
- Log the property dict at the point of return to see what's actually
in it
- Compare against pymobiledevice3's view of the same iPhone properties
(read-only via pymobiledevice3 lockdown info)
- The deltas are the missing keys; add translation as needed
If the dict is populated but _AMDeviceCreateWith... still fails for
a different reason, STOP and investigate before continuing to S1.B.
S1.A.6 Commit
Single commit, gated. Title pattern:
Stage 1.A: build wrapper properties from in-memory RSD handshake
Stage S1.B — Replace Step 19 with passthrough trampoline
Goal: Restore the natural Mercury.XPCMessageDispatcher flow so
DeviceManagerCheckInRequest reaches ServiceDeviceManager. Keep the
hook infrastructure in place so we can later add envelope-based
filtering, but make it functionally a no-op.
S1.B.1 Analysis
Read inject/iosmux_inject.m:1515-1576 (Step 19, commit 6501219).
Current implementation: 3-byte patch at CDS+0xB896 = b0 01 c3
(mov al, 1; ret).
Original target is the function entry of
CoreDeviceUtilities.invoke(anyOf:usingContentsOf:). We want to
replace those 3 bytes with a 14-byte trampoline that jumps to the
original function. Since we are replacing the function entry, we need
the original function's first few bytes preserved somewhere or we need
to compute the original entry address from another source.
Per Stage 0 finding: read the rel32 of any callsite that calls this function before we patch (we already do something similar at CDS+0x5E2D0). The call from CDS+0xB871 (or wherever Mercury invokes it) gives us the rel32 → original target address. Save that address as a global and use it in the trampoline.
Alternative: hook a callsite instead of the function entry. The
function is small and called from one place per inject/iosmux_inject.m
comments. Hooking the callsite is cleaner because we don't have to
preserve any of the original entry bytes.
S1.B.2 Implementation
Two options, gated by the analysis in S1.B.1:
Option I — Function-entry passthrough:
- Compute g_invoke_orig_target from a known callsite's rel32
- Allocate a hook page near CDS (within ±2GB of CDS+0xB896)
- Write trampoline: 48 b8 <orig> ff e0 (movabs rax, orig; jmp rax)
— 12 bytes
- Patch CDS+0xB896 with e9 <rel32-to-hookpage> (5 bytes), pad
remaining bytes with 90 if we replaced more
- Caveat: CDS+0xB896 is the function entry itself, not a callsite, so
the hook page cannot simply jump back to the original entry (it
would land on our own patch again). The right pattern is: patch
the entry with a jmp to our hook page, which runs the saved
prologue bytes and then jumps to original+<saved-prologue-len>.
This is more delicate.
Option II — Callsite passthrough (PREFERRED if a single callsite
exists):
- Find the callsite from disassembly (commit 6501219 history /
inject comments may already say where)
- Treat it the same way we treat CDS+0x5E2D0 (Stage 0 S0.1 pattern):
read rel32, compute orig_target, write a hook page that calls our
C decision function (currently always returns 1) THEN passes through
to orig_target if we want to passthrough
- For Stage 1.B the C function just always-passthroughs (no logging,
no decision). This is functionally equivalent to no hook at all,
which is what we want.
The choice between Option I and II depends on whether invoke(anyOf:)
has exactly one callsite. We already believe it does (per
action-interception-full-picture.md E+F section, "single callsite").
Option II is the correct/grounded choice. Use it.
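Both byte sequences named above (the 12-byte movabs/jmp pair and the 5-byte e9 rel32 patch) can be produced by small encoders, sketched here on plain buffers. The function names are hypothetical. The ±2GB constraint on the hook page shows up as the range check in encode_rel_jmp.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ABS_JMP_LEN 12  /* 48 b8 <imm64> ff e0 */
#define REL_JMP_LEN 5   /* e9 <rel32> */

/* Emits: movabs rax, target ; jmp rax */
static void encode_abs_jmp(uint8_t out[ABS_JMP_LEN], uint64_t target) {
    assert(out != NULL);
    out[0] = 0x48;
    out[1] = 0xB8;
    for (int i = 0; i < 8; i++) {
        out[2 + i] = (uint8_t)(target >> (8 * i));   /* little-endian imm64 */
    }
    out[10] = 0xFF;
    out[11] = 0xE0;
}

/*
 * Emits: jmp rel32, for an instruction located at `from` that should land on
 * `to`. Returns 1 on success. Returns 0 and leaves `out` untouched when the
 * displacement does not fit in a signed 32-bit value (target beyond +-2GB).
 */
static int encode_rel_jmp(uint8_t out[REL_JMP_LEN], uint64_t from, uint64_t to) {
    assert(out != NULL);
    int64_t disp = (int64_t)(to - (from + REL_JMP_LEN));
    if (disp < INT32_MIN || disp > INT32_MAX) {
        return 0;
    }
    uint32_t raw = (uint32_t)(int32_t)disp;
    out[0] = 0xE9;
    for (int i = 0; i < 4; i++) {
        out[1 + i] = (uint8_t)(raw >> (8 * i));      /* little-endian rel32 */
    }
    return 1;
}
```

For Option II the same movabs/jmp pair would plausibly serve as the passthrough tail of the hook page, with g_invoke_orig_target as the target. Writing the bytes into executable memory (page protection changes and so on) is outside this sketch.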
S1.B.3 Build, deploy, test
Same procedure as S1.A.3.
S1.B.4 Exit criteria
- System log shows Handling DeviceManagerCheckInRequest (was absent in session 7+8 logs)
- System log shows Published DeviceManagerCheckInCompleteEvent
- devicectl list devices returns within ~2s (not 15s timeout)
- Result may still be "No devices found." because S1.C hasn't run yet
- CDS still alive
- No SIGSEGV in CDS log (this is the canary for the masked Session-5 bug — if the crash returns, it means SDR fix wasn't enough or we have another latent bug. STOP and investigate before S1.C.)
S1.B.5 Fallback
If check-in still fails to reach ServiceDeviceManager:
- Verify the trampoline is installed (read patched bytes back, log
them)
- Verify g_invoke_orig_target is correct (compare to disassembly)
- Look for other interpose / hook that might be in the way
If SIGSEGV reappears:
- That's a bigger finding than this stage. Document the crash and back out S1.B.
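The "read patched bytes back, log them" check could look roughly like the following: format the live bytes as hex for the inject log and compare them against what was meant to be written. Both helper names are hypothetical, and the sketch works on an ordinary buffer rather than reading CDS memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Writes `bytes` as space-separated lowercase hex into `out` (capacity `cap`,
 * always NUL-terminated). Stops early, without emitting a partial byte, when
 * the buffer is too small. Returns the number of characters written.
 */
static size_t format_hex(const uint8_t *bytes, size_t n, char *out, size_t cap) {
    assert(bytes != NULL && out != NULL && cap > 0);
    size_t used = 0;
    out[0] = '\0';
    for (size_t i = 0; i < n; i++) {
        int w = snprintf(out + used, cap - used, i ? " %02x" : "%02x",
                         (unsigned)bytes[i]);
        if (w < 0 || (size_t)w >= cap - used) {
            out[used] = '\0';   /* drop the truncated byte and stop */
            break;
        }
        used += (size_t)w;
    }
    return used;
}

/* Returns 1 when the bytes read back equal the bytes we intended to write. */
static int patch_matches(const uint8_t *live, const uint8_t *expected, size_t n) {
    return memcmp(live, expected, n) == 0;
}
```

Logging the hex string even on success leaves a record in the inject log that can be compared against the disassembly later.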
S1.B.6 Commit
Single commit. Stage 1.B: replace Step 19 mov-al-1 with passthrough
Stage S1.C — Fix SDR identity (LANDED session 9)
Status: DONE. devicectl list devices now returns our iPhone with
the correct UUID E8A190DD-64F5-44A4-8D57-28E99E316D60, state
connected, model iPhone SE (3rd generation). The Q-C enum tag fix
landed and the whole identity chain is consistent end-to-end.
Known issue: a race on the very first devicectl after a CDS
relaunch because the inject's registration runs inside a
dispatch_after(3s) block. See Known issues section below.
Plan revised in session 8 after six clarifying research questions
(Q-A..Q-F) overturned the original assumption that
handleDiscoveredSDR doesn't register the SDR.
S1.C.1 — Original premise: WRONG
The original S1.C.1 assumed handleDiscoveredSDR did not result in
our SDR being added to managedServiceDevices, so we planned to find
the canonical registration function and call it ourselves (or hook
install(browser:) to capture the discovery callback closure).
Runtime verification with the current S1.A + S1.B build proved this wrong. Decisive log line from a fresh CDS run:
ServiceDeviceManager - New device representation added to ecid_11836855534199284200:
<ServiceDeviceRepresentation 0x...>
{ id = (ecid_11836855534199284200, uuid: B9BE8F31-6FD1-5ED4-83B0-4DD1CD9B0265),
name = Optional("iPhone (iosmux)") }
Per Q3 research (docs/research/s1c-static-browser-disasm.md Q3
section), this log message is emitted from inside the async
ServiceDeviceManager._offer(discoveredDeviceRepresentation:to:)
body, after the dict insert at CoreDevice + 0x2829e0. So our SDR
is in managedServiceDevices. The chain
handleDiscoveredSDR → 0x27e850 → 0x507f0 → async _offer → dict insert
already works. We are not missing a registration call.
S1.C.1' — Six clarifying questions (Q-A..Q-F)
After verifying the SDR was registered, six parallel research agents
ran on havoc to find the actual root cause of devicectl list devices
returning empty. Full findings appended to
docs/research/s1c-static-browser-disasm.md under sections Q-A..Q-F.
Q-A — initialDeviceSnapshots construction: the snapshot builder
is 0x289060, called from handle(clientCheckInRequest:from:) at
CoreDevice + 0x286470. It iterates the same dict the
managedServiceDevices getter projects (no separate cache). The
per-entry builder is CoreDevice + 0x284030 and is the most likely
site of any rejection / filter on a per-SDR basis.
Q-B — external identity management: the updateIdentifier(devId,
sdr, sdm) call at inject/iosmux_inject.m:1011 is dead code with
no side effects on failure. Identity management is disjoint from list
visibility. There is no standalone "register external identity"
function — entry is only via DeviceRepresentationProvider.consider(offer:).
Recommendation: delete the updateIdentifier call.
Q-C — SDR UUID assignment: smoking gun. Our dev_id_buf[32] = 0
is mislabeled in inject as "enum tag = UUID variant", but per
CoreDevice.DeviceIdentifier enum layout:
enum DeviceIdentifier {
case ecid(UInt64) // tag 0, 8-byte payload
case uuid(Foundation.UUID, Swift.String) // tag 1, 32-byte payload
}
With tag = 0, CDS interprets the first 8 bytes of our UUID buffer
as a UInt64 ECID. Verified by arithmetic:
- Our UUID E8A190DD-64F5-44A4-8D57-28E99E316D60
- First 8 bytes little-endian: 0xA444F564DD90A1E8
- = 11836855534199284200
- = the ECID in the runtime log, exact match
B9BE8F31-6FD1-5ED4-83B0-4DD1CD9B0265 then comes from
DeviceIdentifier.uuidRepresentation.getter at CoreDevice + 0x25b650,
which for the .ecid case calls into an AMDevice keypath lookup
synthesizing a deterministic UUID from the ECID.
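The arithmetic can be reproduced mechanically. The sketch below lays the UUID out in its textual byte order and reads the first 8 bytes as a little-endian 64-bit integer, which is how a .ecid payload would be read on x86-64. The names load_le64 and k_our_uuid are introduced here for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reads 8 bytes at `p` as a little-endian unsigned 64-bit integer. */
static uint64_t load_le64(const uint8_t *p) {
    assert(p != NULL);
    uint64_t v = 0;
    for (int i = 7; i >= 0; i--) {
        v = (v << 8) | p[i];
    }
    return v;
}

/* E8A190DD-64F5-44A4-8D57-28E99E316D60, bytes in textual order. */
static const uint8_t k_our_uuid[16] = {
    0xE8, 0xA1, 0x90, 0xDD, 0x64, 0xF5, 0x44, 0xA4,
    0x8D, 0x57, 0x28, 0xE9, 0x9E, 0x31, 0x6D, 0x60,
};
```

load_le64(k_our_uuid) evaluates to 0xA444F564DD90A1E8, which is 11836855534199284200 in decimal, the same number that appears after ecid_ in the runtime log.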
Cascade of consequences from this single byte:
- SDR enters managedServiceDevices as .ecid(...) instead of .uuid(...)
- The snapshot per-entry builder (0x284030) likely rejects ECID entries that have no real backing AMDevice — explaining why initialDeviceSnapshots is empty even though the dict is non-empty
- PairAction lookups query by B9BE8F31-... (the AMDevice-derived UUID), but our serviceDeviceRepresentations(forDeviceIdentifiedBy:) hook only matches E8A190DD-... — mismatch
- Hostname-manager keeps using E8A190DD-... because it reads deviceInfo.serviceDeviceIdentifier, a separate field on DeviceInfo (one we DO set correctly) — explaining why two different UUIDs appear in our logs simultaneously
Q-D — serviceFullyInitialized: time-based (asyncAfter), not
predicate-based. Does NOT gate the snapshot list. Snapshot reads the
same dict as managedServiceDevices getter, no separate cache. Race
hypothesis is dead in the latest test run (SDR added at +0.371s,
check-in handled at +0.476s — 105 ms margin) but was alive in an
earlier test where dispatch_after(3s) deferred handleDiscoveredSDR
until after check-in. This is a separate symptom.
Q-E — reading managedServiceDevices from inject: getter symbol
confirmed at CoreDevice + 0x27e5c0 with mangled name
$s10CoreDevice07ServiceB7ManagerC07managedC7DevicesSDyAA0B10IdentifierOSayAA0cB14RepresentationCGGvg.
Type: [DeviceIdentifier : [ServiceDeviceRepresentation]]. Key is a
DeviceIdentifier enum (NOT UUID/String — note this for the lookup
side). ABI: self in %r13, return in %rax. C-helper recipe in
the disasm doc.
Q-F — closure body validation: none. CDS+0x286b70 is pure
"log and forward" with zero SDR field reads or rejection branches.
The only branch is os_log_type_enabled(info) for log gating.
0x27e850 also has no rejection of our SDR — only an early exit on a
single byte-flag we are not setting. Validation is NOT the bug.
S1.C.2 (REVISED) — minimal fix, two changes in one file
Effort: ~10 lines of code, single file inject/iosmux_inject.m.
- Q-C fix — switch DeviceIdentifier enum tag from .ecid to .uuid. At inject/iosmux_inject.m:857-861 (the dev_id_buf construction):
// BEFORE
memset(dev_id_buf, 0, 33);
memcpy(dev_id_buf, uuid, 16);
dev_id_buf[32] = 0; // wrong: tag 0 is .ecid(UInt64)
// AFTER
memset(dev_id_buf, 0, 33);
memcpy(dev_id_buf, uuid, 16); // .uuid payload[0]: Foundation.UUID
*(uint64_t *)(dev_id_buf + 16) = 0; // .uuid payload[1]: Swift.String _countAndFlagsBits = 0
*(uint64_t *)(dev_id_buf + 24) = 0xE000000000000000ULL; // tagged-small-string empty marker
dev_id_buf[32] = 1; // tag = .uuid
The 0xE000000000000000 empty-string pattern is verified canonical
per the disasm doc — it's the value the SDR's own
description.getter emits for empty Swift.String literals.
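As a self-contained illustration of the AFTER layout, the sketch below fills a 33-byte buffer with bytewise little-endian stores instead of uint64_t pointer casts, which sidesteps any alignment or aliasing questions about dev_id_buf. The helper and constant names are hypothetical. The layout itself (UUID at offset 0, two String words at 16 and 24, tag at 32) is taken from the snippet above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum {
    DEV_ID_LEN      = 33,
    DEV_ID_TAG_ECID = 0,   /* .ecid(UInt64) */
    DEV_ID_TAG_UUID = 1    /* .uuid(Foundation.UUID, Swift.String) */
};

/* Stores `v` at `p` as 8 little-endian bytes, with no alignment requirement. */
static void store_le64(uint8_t *p, uint64_t v) {
    for (int i = 0; i < 8; i++) {
        p[i] = (uint8_t)(v >> (8 * i));
    }
}

/* Builds a .uuid(uuid, "") DeviceIdentifier image in `buf`. */
static void build_uuid_device_identifier(uint8_t buf[DEV_ID_LEN],
                                         const uint8_t uuid[16]) {
    assert(buf != NULL && uuid != NULL);
    memset(buf, 0, DEV_ID_LEN);
    memcpy(buf, uuid, 16);                          /* payload[0]: UUID bytes */
    store_le64(buf + 16, 0);                        /* String word 0 */
    store_le64(buf + 24, 0xE000000000000000ULL);    /* String word 1: empty marker */
    buf[32] = DEV_ID_TAG_UUID;                      /* enum tag */
}
```

On a little-endian host this produces the same bytes as the pointer-cast version, with 0xE0 ending up in byte 31 and the tag in byte 32.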
- Q-B fix — delete the dead updateIdentifier block. Remove the asm-emit + call site at inject/iosmux_inject.m:986-1054. Also remove the surrounding logging that references it. Update the comment block at line 974-985 to note that handleDiscoveredSDR alone is the canonical entry and updateIdentifier was a no-op (per Q-B disasm).
S1.C.3 — Build, deploy, test
- scp + build on havoc
- Deploy + force CDS reload
- Run devicectl list devices
- Read CDS system log + inject log
S1.C.4 — Exit criteria
- System log shows New device representation added to <our UUID E8A190DD-...> (NOT ecid_...) — i.e. the id line uses .uuid(E8A190DD-..., "") formatting
- devicectl list devices returns our iPhone in the list with name "iPhone (iosmux)" and UDID matching our config
- No Received identity update request ERROR (we deleted the call)
- CDS still alive
- Cross-check with the Q-E helper: managedServiceDevices getter count > 0 (optional, only if we add the helper for verification)
S1.C.5 — Fallback if test fails
If devicectl still returns empty after the Q-C+Q-B fix:
- Most likely cause: the per-entry snapshot builder at CoreDevice + 0x284030 has ANOTHER rejection predicate beyond .ecid vs .uuid discrimination. Disassemble it (Q-A's lead).
- Second-likely: race — handleDiscoveredSDR is called from inside the existing dispatch_after(3 * NSEC_PER_SEC) block, so on the FIRST devicectl invocation after CDS startup, the check-in fires before our SDR is added. Mitigations:
  - Move the registration to the synchronous part of the inject ctor, OR
  - Hook handle(clientCheckInRequest:from:) at CoreDevice + 0x286470 and run our registration synchronously on entry, then tail-call the original. This guarantees ordering and auto-repeats per request.
S1.C.6 — Commit
Single commit if S1.C.2 is enough. Title:
Stage 1.C: fix DeviceIdentifier enum tag (.ecid → .uuid) and drop dead updateIdentifier
Followup commits if S1.C.5 fallbacks are needed.
Why this is a single change instead of the original multi-stage plan
The original S1.C plan had three sub-stages (research → ABI work →
implementation) because we believed we needed to build a new path
into managedServiceDevices. Q3 + verification + Q-A..F showed that
the path already exists and works — we were corrupting the SDR's
identity at the entry point. Once the identity is correct, the
canonical chain handles everything else. This is exactly the
"correct/grounded solution" pattern: find the actual root cause and
fix it at its source, not layer more workarounds on top.
Known issue after landing: first-call race on CDS relaunch
On the very first devicectl list devices after a CoreDeviceService
relaunch (fresh process, e.g. right after killall CoreDeviceService),
the result is still "No devices found." The second and all subsequent
calls correctly list the iPhone.
Root cause: our inject's full registration flow — build DeviceInfo,
build SDR, call handleDiscoveredSDR — lives inside a
dispatch_after(3 * NSEC_PER_SEC) block in iosmux_register_device().
By the time our SDR is added to managedServiceDevices, the first
DeviceManagerCheckInRequest from the fresh devicectl has already
been served from an empty dict.
This is cosmetic — once the script has run once, the dict has our SDR and every subsequent call works. But for a clean-boot UX we should fix it. Two options:
- Move the SDR construction out of dispatch_after into the synchronous ctor body. The 3s delay exists for a reason we need to re-audit first (probably waiting for CDS's own init to finish) — removing it blindly risks racing against something else.
- Hook ServiceDeviceManager.handle(clientCheckInRequest:from:) at CoreDevice + 0x286470 and run our registration synchronously on entry before tail-calling the original. Guarantees ordering and works regardless of what dispatch_after was protecting against.
Scheduled as S1.D — separate commit/stage, not a blocker for S1.C acceptance.
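The ordering guarantee of the hook option can be shown in miniature: make sure registration has completed, then forward to the original handler. This is a heavily simplified sketch. The real handler is a Swift method with its own calling convention, so the plain C signature here is an assumption, every name is hypothetical, and running registration only once (rather than per request) is a choice made for the sketch. On macOS dispatch_once would be the more usual way to get the same once-and-wait behavior.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

typedef void (*checkin_handler_fn)(void *request);
typedef void (*register_device_fn)(void);

static checkin_handler_fn g_orig_checkin;     /* saved original handler */
static register_device_fn g_register_device;  /* builds SDR, calls handleDiscoveredSDR */
static atomic_int g_reg_state;                /* 0 = not started, 1 = running, 2 = done */

/* Runs registration exactly once; later or concurrent callers wait for it. */
static void ensure_registered(void) {
    int expected = 0;
    if (atomic_compare_exchange_strong(&g_reg_state, &expected, 1)) {
        g_register_device();
        atomic_store(&g_reg_state, 2);
    } else {
        while (atomic_load(&g_reg_state) != 2) {
            /* another caller is registering: spin so ordering still holds */
        }
    }
}

/* Hook body: registration is complete before the original handler runs. */
static void checkin_hook(void *request) {
    assert(g_orig_checkin != NULL && g_register_device != NULL);
    ensure_registered();
    g_orig_checkin(request);
}

/* Stand-ins used only to exercise the sketch. */
static int g_register_calls;
static int g_checkin_calls;
static void fake_register_device(void) { g_register_calls++; }
static void fake_orig_checkin(void *request) { (void)request; g_checkin_calls++; }
```

With the stand-ins installed, two calls to checkin_hook run the registration once and the original handler twice, so the first check-in can no longer be served from an empty dict.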
S1.C.6 commit
Single commit for the Q-C + Q-B fix landed as part of the session 9 commit that also includes the architecture doc and the rewritten restore script.
Verification gate
After S1.C exits successfully:
1. Run devicectl list devices 5 times in a row from a clean state. All 5 must show our iPhone consistently.
2. Open Xcode → Window → Devices and Simulators. Confirm the device row appears.
3. Capture the Devices window's view of our device — what's its state? (connecting, connected, ready, paired, etc.)
4. Do NOT click Pair yet. Capture system log during step 2-3.
5. Document findings in docs/research/session-8-stage-1-results.md.
After this gate we re-plan Stage 2 based on what Xcode does. Likely candidates:
- Stage 2: address whatever Xcode complains about in step 3
- Stage 2: handle developerModeStatus if it actually matters now
- Stage 2: capture Mercury envelope catalog as old Stage 3 planned
Risk register (Stage 1)
| Risk | Impact | Mitigation |
|---|---|---|
| S1.A: handshake dict has different keys than MobileDevice expects | Wrapper init still fails | S1.A.1 catalogs the deltas before code change; translation table added if needed |
| S1.B: SIGSEGV from masked Session-5 crash returns | CDS dies | The SDR fix should have addressed it; if not we learn that during S1.B.4 with no further code added |
| S1.C.1: discovers there is no natural path and we have to hook | Forces an "intercept result" approach | Pause and confirm with user before falling back to that pattern |
| Stage order coupling: S1.A failure blocks S1.B test, S1.B failure blocks S1.C test | Single point of failure | Strictly sequential gating means we'll catch each independently |
What we're NOT doing in Stage 1
- developerModeStatus setter (Q4): deferred, no observed runtime gate
- Mercury envelope catalog: deferred, only needed for actions not list-devices
- Code-audit medium findings (M1-M10): Stage 5
- DYLD_INTERPOSE rework: not needed for Stage 1 scope
- Pair button work: not until Xcode shows the device
Stopping rule (unchanged from old roadmap)
At any stage, if reality diverges from prediction:
1. STOP. Do not layer more changes.
2. Run a minimal diagnostic.
3. Document in a research doc.
4. Update this plan.
5. Only then resume.