Part 12 of RF Front End. We’ve spent eleven posts inside a single dongle — USB transport, the RTL2832U register dance, tuners, sample conversion. Now we zoom out to the fleet: how the daemon holds several radios at once, gives each a job, and keeps them alive across the USB drops that real hardware throws at you.
TL;DR — The SDR pool owns every opened dongle behind one interface, assigning each a role (control, voice, wideband) and keeping the fleet alive through USB hotplug via a 30-second watchdog that reacquires devices under new bus addresses. A scanner retune that raced stream teardown (issue #686) is fixed by serializing re-open behind an idempotent stop.
In this post
- The pool (internal/sdr/pool.go): a fleet of opened devices, each assigned one of three roles (control, voice, or wideband).
- Strict / allowlist mode and serial-alias matching, so an operator’s config.yaml selects exactly the dongles they named.
- The USB watchdog that re-enumerates every ~30 s, publishes KindSDRAttached/KindSDRDetached, and reacquires a device after the kernel re-enumerates it with a new address.
- The re-open race (issue #686): a scanner retune raced USB stream teardown, and serializing behind an idempotent stop fixed it.
What the pool does
A single dongle is one Device. A trunked system needs more than one: a radio
camped on the control channel decoding grants, and one or more radios that follow
those grants onto voice frequencies. A wideband Airspy might cover a whole site’s
worth of channels at once. The pool is the thing that owns all of them.
Its job is narrow but load-bearing. At boot it enumerates every registered driver, opens the devices the operator selected, programs a known-good sample rate on each, and assigns each one a role. After boot it answers a single question for the rest of the engine — “give me the device with role X” — and it keeps that fleet healthy while USB does what USB does: drop a stick mid-stream, re-enumerate it under a new device number, and expect the software to cope.
Roles matter because the engine never reaches for a specific dongle. The
control-channel decoder asks for RoleControl; the voice composer asks the pool
to find a RoleVoice device by serial when the engine binds a call. That
indirection is what lets the same code run on a one-stick hobby setup and a
four-stick site without a branch anywhere in the engine.
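A minimal sketch of that indirection, assuming a hypothetical ByRole lookup and string-valued roles (the real pool’s method names and types may differ): the engine asks for a role and gets whichever entries hold it, so adding a second voice stick changes configuration, not engine code.

```go
package main

import (
	"fmt"
	"sync"
)

// Role mirrors the roles described in the post; the string values are illustrative.
type Role string

const (
	RoleControl  Role = "control"
	RoleVoice    Role = "voice"
	RoleWideband Role = "wideband"
)

// PoolEntry is a pared-down stand-in holding only a serial and a role.
type PoolEntry struct {
	Serial string
	Role   Role
}

// Pool guards its entries with a read/write mutex, as the real pool does.
type Pool struct {
	mu      sync.RWMutex
	entries []*PoolEntry
}

// ByRole (hypothetical name) returns every entry holding the given role,
// in the order the devices were opened.
func (p *Pool) ByRole(r Role) []*PoolEntry {
	p.mu.RLock()
	defer p.mu.RUnlock()
	var out []*PoolEntry
	for _, e := range p.entries {
		if e.Role == r {
			out = append(out, e)
		}
	}
	return out
}

func main() {
	p := &Pool{entries: []*PoolEntry{
		{Serial: "00000001", Role: RoleControl},
		{Serial: "00000002", Role: RoleVoice},
		{Serial: "00000003", Role: RoleVoice},
	}}
	fmt.Println(len(p.ByRole(RoleControl)), len(p.ByRole(RoleVoice)))
}
```

The same call works whether the pool holds one stick or four; only the roster changes.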
How GopherTrunk implements it in Go
A Pool is a slice of opened entries behind a mutex, plus an optional event bus:
```go
// internal/sdr/pool.go
type Pool struct {
	mu      sync.RWMutex
	entries []*PoolEntry
	log     *slog.Logger
	bus     *events.Bus
}

type PoolEntry struct {
	Driver Driver
	Device Device
	Info   Info
	Role   Role
	Hint   Hint
}
```
OpenWith is the heart of bring-up. It sweeps every registered driver, opens the
selected devices, programs the IQ rate, and assigns roles. Role assignment is one
simple rule: the first opened device that isn’t otherwise claimed takes
RoleControl; everything after it defaults to RoleVoice. A Hint can override
that per serial.
```go
// internal/sdr/pool.go (shape)
role := RoleAuto
if hinted {
	role = hint.Role
}
if role == RoleAuto {
	if !controlClaimed {
		role = RoleControl
		controlClaimed = true
	} else {
		role = RoleVoice
	}
}
```
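The rule is easy to exercise in isolation. Below is a self-contained sketch that mirrors the shape above, assuming a hypothetical assignRoles helper that takes the opened serials in order plus a map of per-serial role hints. The real OpenWith interleaves this with opening devices and programming rates, and may treat hints differently in ways the shape doesn’t show.

```go
package main

import "fmt"

type Role int

const (
	RoleAuto Role = iota
	RoleControl
	RoleVoice
	RoleWideband
)

func (r Role) String() string {
	return [...]string{"auto", "control", "voice", "wideband"}[r]
}

// assignRoles (hypothetical) applies the pool's rule: an explicit hint wins;
// otherwise the first unclaimed device takes control and the rest take voice.
func assignRoles(serials []string, hints map[string]Role) []Role {
	roles := make([]Role, len(serials))
	controlClaimed := false
	for i, s := range serials {
		role := RoleAuto
		if h, hinted := hints[s]; hinted {
			role = h // a hint may still say RoleAuto (e.g. it only sets PPM)
		}
		if role == RoleAuto {
			if !controlClaimed {
				role = RoleControl
				controlClaimed = true
			} else {
				role = RoleVoice
			}
		}
		roles[i] = role
	}
	return roles
}

func main() {
	serials := []string{"A", "B", "C"}
	hints := map[string]Role{"C": RoleWideband}
	fmt.Println(assignRoles(serials, hints))
}
```

A hint that carries only tuning settings (role left at RoleAuto) falls through to the same first-come rule as an unhinted device.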
Programming the sample rate at open time isn’t optional housekeeping — it’s a
fix for issue #275. Without an explicit SetSampleRate, the chip streams at
whatever rate its resampler powered up at, while the decoder runs its
symbol-timing math against the configured rate. The result is a silent failure
to lock, the worst kind of bug in a radio. So a device whose SetSampleRate
fails is closed and skipped: a wrong-rate radio is worse than no radio at all.
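The skip-on-failure policy can be sketched against a minimal device interface. The Device interface, openAll helper, and fakeDevice below are hypothetical stand-ins, not GopherTrunk’s driver API; what matters is the ordering: program the rate, and close and skip the device if programming fails.

```go
package main

import (
	"errors"
	"fmt"
)

// Device is a hypothetical minimal slice of a dongle's API.
type Device interface {
	SetSampleRate(hz uint32) error
	Close() error
}

type fakeDevice struct {
	serial  string
	failSet bool
	closed  bool
	rate    uint32
}

func (f *fakeDevice) SetSampleRate(hz uint32) error {
	if f.failSet {
		return errors.New("set sample rate: control transfer failed")
	}
	f.rate = hz
	return nil
}

func (f *fakeDevice) Close() error {
	f.closed = true
	return nil
}

// openAll keeps only devices that accepted the configured rate. A device
// left at its power-up rate would still stream, but the decoder would never lock.
func openAll(devs []Device, rateHz uint32) []Device {
	var kept []Device
	for _, d := range devs {
		if err := d.SetSampleRate(rateHz); err != nil {
			_ = d.Close() // best-effort: a wrong-rate radio is worse than none
			continue
		}
		kept = append(kept, d)
	}
	return kept
}

func main() {
	good := &fakeDevice{serial: "good"}
	bad := &fakeDevice{serial: "bad", failSet: true}
	// 2.4 MS/s is an illustrative rate, not one taken from GopherTrunk's config.
	kept := openAll([]Device{good, bad}, 2_400_000)
	fmt.Println(len(kept), good.rate, bad.closed)
}
```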
Strict mode and serial aliases
By default the pool opens every dongle it finds. The moment an operator lists
specific devices in config, that’s their signal that they want only those,
so the daemon engages strict mode, where the Hints list becomes an allowlist:
```go
// internal/sdr/pool.go (shape)
if opts.Strict && !hinted {
	p.log.Info("skipping non-configured SDR; add its serial to sdr.devices to use it",
		"driver", d.drv.Name(), "serial", d.info.Serial)
	continue
}
```
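Putting the two halves together, strict mode can be modelled as: engage it when the configured device list is non-empty, then treat that list as an allowlist. This is a sketch with hypothetical names (selectDevices, configured) and exact serial matching; the real pool matches through serialKey, described next.

```go
package main

import "fmt"

// selectDevices (hypothetical) returns the enumerated serials the pool would
// open and the ones it would skip. An empty configured list means "open
// everything"; a non-empty one engages strict mode and acts as an allowlist.
func selectDevices(enumerated, configured []string) (open, skipped []string) {
	strict := len(configured) > 0
	allow := make(map[string]bool, len(configured))
	for _, s := range configured {
		allow[s] = true
	}
	for _, s := range enumerated {
		if strict && !allow[s] {
			skipped = append(skipped, s)
			continue
		}
		open = append(open, s)
	}
	return open, skipped
}

func main() {
	found := []string{"00000001", "00000002", "00000003"}
	open, skipped := selectDevices(found, nil)
	fmt.Println("no config:", open, skipped)
	open, skipped = selectDevices(found, []string{"00000002"})
	fmt.Println("strict:", open, skipped)
}
```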
Matching a hint to a device means matching serials, and serials aren’t always
clean. Airspy reports a legacy form — AIRSPY SN:35ac63dc2d701c4f — that an
operator might write a dozen ways. serialKey normalizes them so the config and
the wire agree:
```go
// internal/sdr/pool.go
func serialKey(s string) string {
	s = strings.TrimSpace(s)
	s = strings.ToLower(s)
	switch {
	case strings.HasPrefix(s, "airspy sn:"):
		return strings.TrimPrefix(s, "airspy sn:")
	case strings.HasPrefix(s, "airspy_sn:"):
		return strings.TrimPrefix(s, "airspy_sn:")
	default:
		return s
	}
}
```
TestPoolMatchesAirspySerialAliases pins this: a hint written
AIRSPY SN:35ac63dc2d701c4f opens the device whose raw serial is
35AC63DC2D701C4F, and FindBySerial resolves all three spellings serialKey
accepts (AIRSPY SN:, airspy_sn:, and the bare serial) to the same entry.
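To see the normalization at work, here is serialKey exactly as shown above, paired with a hypothetical findBySerial over a plain slice of raw serials. The real FindBySerial searches pool entries and may have a different signature.

```go
package main

import (
	"fmt"
	"strings"
)

// serialKey is copied from the pool: trim, lowercase, strip Airspy's legacy prefixes.
func serialKey(s string) string {
	s = strings.TrimSpace(s)
	s = strings.ToLower(s)
	switch {
	case strings.HasPrefix(s, "airspy sn:"):
		return strings.TrimPrefix(s, "airspy sn:")
	case strings.HasPrefix(s, "airspy_sn:"):
		return strings.TrimPrefix(s, "airspy_sn:")
	default:
		return s
	}
}

// findBySerial (hypothetical) returns the index of the raw serial whose
// normalized key matches the query's, or -1 if none does.
func findBySerial(raw []string, query string) int {
	want := serialKey(query)
	for i, r := range raw {
		if serialKey(r) == want {
			return i
		}
	}
	return -1
}

func main() {
	raw := []string{"00000001", "35AC63DC2D701C4F"}
	for _, q := range []string{
		"AIRSPY SN:35ac63dc2d701c4f",
		"airspy_sn:35AC63DC2D701C4F",
		"35ac63dc2d701c4f",
	} {
		fmt.Println(q, "->", findBySerial(raw, q))
	}
}
```

Normalizing both sides, the configured string and the serial reported on the wire, is what lets a case or prefix mismatch still match.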
The USB watchdog
The pool also runs a supervisor loop. RunWatchdog ticks every interval — 30
seconds by default — re-enumerates every driver, and acts only on transitions:
```go
// internal/sdr/watchdog.go
const DefaultWatchdogInterval = 30 * time.Second

func (p *Pool) RunWatchdog(ctx context.Context, interval time.Duration, sampleRateHz uint32) error {
	// A non-positive interval disables the watchdog: just block until shutdown.
	if interval <= 0 {
		<-ctx.Done()
		return ctx.Err()
	}
	tick := time.NewTicker(interval)
	defer tick.Stop()
	// Serials currently known to be absent; owned by this goroutine alone.
	missing := map[string]bool{}
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-tick.C:
			p.watchdogTick(missing, sampleRateHz)
		}
	}
}
```
The missing map is the state machine. A pool serial that enumeration stops
seeing flips to missing and emits one KindSDRDetached, so the API, TUI, and web
snapshot all show the gap. When that same serial reappears in a later enumeration,
the watchdog deletes it from missing and calls Reacquire:
```go
// internal/sdr/watchdog.go (shape)
if missing[serial] {
	delete(missing, serial)
	p.log.Info("sdr: watchdog: device reappeared; reacquiring", "serial", serial)
	if _, err := p.Reacquire(serial, sampleRateHz); err != nil {
		p.log.Warn("sdr: watchdog: reacquire failed", "serial", serial, "err", err)
	}
}
```
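Stripped of drivers and logging, one tick is a set comparison. The sketch below assumes a hypothetical tick function that takes the pool’s serials, the serials seen in the latest enumeration, and the persistent missing map, and returns the actions it would take. It models only the two transitions described above: detach once, reacquire on reappearance.

```go
package main

import "fmt"

// action is a hypothetical record of what one watchdog tick decided.
type action struct {
	Kind   string // "detached" or "reacquire"
	Serial string
}

// tick compares the pool's serials against the latest enumeration. The
// missing map persists across ticks, so each transition fires only once.
func tick(poolSerials []string, seen, missing map[string]bool) []action {
	var acts []action
	for _, s := range poolSerials {
		switch {
		case !seen[s] && !missing[s]:
			missing[s] = true
			acts = append(acts, action{"detached", s})
		case seen[s] && missing[s]:
			delete(missing, s)
			acts = append(acts, action{"reacquire", s})
		}
	}
	return acts
}

func main() {
	pool := []string{"A", "B"}
	missing := map[string]bool{}
	fmt.Println(tick(pool, map[string]bool{"A": true, "B": true}, missing)) // steady state
	fmt.Println(tick(pool, map[string]bool{"A": true}, missing))            // B unplugged
	fmt.Println(tick(pool, map[string]bool{"A": true}, missing))            // still gone: no repeat
	fmt.Println(tick(pool, map[string]bool{"A": true, "B": true}, missing)) // B is back
}
```

Acting only on transitions is what keeps a stick that stays unplugged for an hour from flooding subscribers with a detach event every 30 seconds.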
Reacquire is where the hotplug story gets real. When a dongle browns out and
comes back, the kernel assigns it a new device number — but it reports the same
serial. So Reacquire closes the (likely dead) handle best-effort, re-enumerates
the driver, finds the serial under its new index, opens a fresh handle,
re-programs the rate, and re-applies the original Hint (PPM, gain, bias-tee).
Crucially, it swaps the Device in place on the existing PoolEntry:
Role, serial identity, and any pointer a consumer is holding all survive; only
Info.Index updates to the new enumeration. TestPoolReacquireSwapsDeviceHandleInPlace
asserts exactly that: same PoolEntry, new *fakeDevice, stale handle closed,
bias-tee re-applied, index refreshed to 7.
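Why swapping in place matters is easiest to see with pointers. This sketch uses hypothetical minimal types and a reacquire helper whose open callback stands in for "re-enumerate, find the serial, open it"; the real Reacquire also re-programs the rate and re-applies the Hint. A consumer holding the *PoolEntry sees the new handle on its next read, and the role never changes.

```go
package main

import (
	"fmt"
	"sync"
)

type fakeDevice struct {
	index  int
	closed bool
}

func (d *fakeDevice) Close() error {
	d.closed = true
	return nil
}

type PoolEntry struct {
	Serial string
	Role   string
	Index  int
	Device *fakeDevice
}

type Pool struct {
	mu      sync.Mutex
	entries []*PoolEntry
}

// reacquire (hypothetical) swaps a fresh handle into the existing entry.
// Real consumers would read entry fields under the pool's lock.
func (p *Pool) reacquire(serial string, open func(serial string) (*fakeDevice, error)) (*PoolEntry, error) {
	p.mu.Lock()
	defer p.mu.Unlock()
	for _, e := range p.entries {
		if e.Serial != serial {
			continue
		}
		_ = e.Device.Close() // best-effort: the old handle is probably dead
		d, err := open(serial)
		if err != nil {
			return nil, err
		}
		e.Device = d      // new handle...
		e.Index = d.index // ...and the new enumeration index
		return e, nil     // same *PoolEntry: Role and Serial untouched
	}
	return nil, fmt.Errorf("serial %q not in pool", serial)
}

func main() {
	old := &fakeDevice{index: 3}
	p := &Pool{entries: []*PoolEntry{{Serial: "00000002", Role: "voice", Index: 3, Device: old}}}
	held := p.entries[0] // a consumer's cached pointer
	e, err := p.reacquire("00000002", func(string) (*fakeDevice, error) {
		return &fakeDevice{index: 7}, nil
	})
	fmt.Println(err, e == held, held.Index, held.Role, old.closed)
}
```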
The problem we hit: the retune-vs-teardown re-open race (issue #686)
The watchdog handles the idle case — a device nobody is streaming. The in-use case is harder, and it bit us in scanner mode.
Symptom. In scanner mode a fast retune cancels the IQ stream’s context and immediately
re-opens it on the new frequency. But USB drivers don’t tear a stream down
synchronously — the bulk-IN reaper goroutine runs cancelStream asynchronously,
draining URBs and closing the consumer channel on its own schedule. So the
sequence that should have been “stop, then start” became “start while the
previous stop is still in flight.” The second StreamIQ found the bulk-IN
endpoint still claimed and failed with stream already active — surfacing to the
operator as conv: StreamIQ failed and a dead capture.
Root cause. The race was structural, not a missing lock. The teardown path is idempotent via
a sync.Once:
```go
// internal/sdr/rtlsdr/purego/device.go
func (d *Device) cancelStream() {
	d.stopOnce.Do(func() {
		_ = d.transport.StopBulkIn()
		// ...close the consumer channel
	})
}
```
stopOnce guarantees teardown runs exactly once — but it didn’t guarantee the
next StreamIQ waited for it. The fix was to make re-open serialize behind the
in-flight teardown: a new stream resets stopOnce only after the previous stop
has actually completed, so a retune can never out-run the reaper.
```go
// internal/sdr/rtlsdr/purego/stream.go (shape)
out := make(chan []complex64, streamChanDepth)
d.out = out
d.stopOnce = sync.Once{} // only reachable once the prior teardown finished
```
The lesson is a recurring one in this series: with USB, “stop” is a request, not an event. Anything that re-opens has to wait on the teardown completing, not on having asked for it.
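The stream.go shape shows the reset but not the wait. One way to express "wait on completion, not on the request" in Go is a done channel closed by the teardown itself. This is a hedged sketch with a hypothetical stream type, not the purego driver’s code: stop() may be called any number of times, and start() blocks until a requested teardown has really finished before resetting the sync.Once. It assumes, as a retune does, that stop and start are driven from one goroutine.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

type stream struct {
	mu       sync.Mutex
	active   bool          // stands in for "bulk-IN endpoint claimed"
	stopping bool          // a teardown has been requested
	stopped  chan struct{} // closed when the current teardown completes
	stopOnce sync.Once
}

// stop requests teardown. Like cancelStream it is idempotent, and the real
// work runs asynchronously, the way the bulk-IN reaper does.
func (s *stream) stop() {
	s.stopOnce.Do(func() {
		s.mu.Lock()
		if !s.active {
			s.mu.Unlock()
			return // nothing to tear down
		}
		s.stopping = true
		done := s.stopped
		s.mu.Unlock()
		go func() {
			time.Sleep(10 * time.Millisecond) // simulated URB drain
			s.mu.Lock()
			s.active = false
			s.mu.Unlock()
			close(done)
		}()
	})
}

// start waits for any requested teardown to finish before claiming the endpoint.
func (s *stream) start() error {
	s.mu.Lock()
	if s.active && s.stopping {
		done := s.stopped
		s.mu.Unlock()
		<-done // wait for teardown to complete, not merely to be requested
		s.mu.Lock()
	}
	defer s.mu.Unlock()
	if s.active {
		return errors.New("stream already active")
	}
	s.active = true
	s.stopping = false
	s.stopped = make(chan struct{})
	s.stopOnce = sync.Once{} // safe: the previous Once has finished its work
	return nil
}

func main() {
	s := &stream{}
	fmt.Println(s.start()) // initial tune
	s.stop()               // retune: cancel...
	fmt.Println(s.start()) // ...and immediately re-open
	s.stop()
	s.stop() // idempotent: the second call is a no-op
}
```

Without the `<-done` wait, the second start() would find active still true and fail with "stream already active", which is the #686 symptom in miniature.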
The design principle: supervisor + observer
Two patterns share the load here. The pool is a supervisor (a fleet manager):
it owns the lifecycle of every device, restarts the ones that fail, and presents
the survivors as a roster the engine can query by role. The watchdog is the
supervisor’s health check, and Reacquire is its restart strategy.
The second pattern is the observer. The pool never calls into the daemon, the
API, or the TUI. It publishes KindSDRAttached / KindSDRDetached to an
optional bus and lets whoever cares subscribe:
```go
// internal/sdr/pool.go
func (p *Pool) publish(kind events.Kind, payload any) {
	if p.bus == nil {
		return
	}
	p.bus.Publish(events.Event{Kind: kind, Payload: payload})
}
```
How that principle shaped the Go code
- The bus is optional. NewPool takes only a logger; SetBus is a separate, idempotent step. The gophertrunk sdr list CLI and every unit test run the pool with bus == nil, and publish short-circuits: the same fleet code, no daemon required.
- State lives in one goroutine. The watchdog’s missing map is created in RunWatchdog, passed to each watchdogTick call, and touched by no other goroutine, so attach/detach transitions need no extra lock. Only the pool’s entries slice is shared, and that’s behind a sync.RWMutex.
- Identity is stable across reacquisition. Because Reacquire swaps the Device inside an existing PoolEntry rather than replacing the entry, consumers that cached a *PoolEntry keep working across a USB cycle. Role and serial are the identity; the handle is just an attribute.
- Recovery is best-effort and idempotent. Closing a dead handle may error; re-enumeration may miss the serial; the in-stream retry loop may beat the watchdog to it. Every path logs and moves on, because the next tick (or the next consumer) will try again.
Where this goes next
The pool assigns roles and keeps devices alive, but we’ve leaned on tests
throughout this post (TestPoolReacquireSwapsDeviceHandleInPlace,
TestPoolMatchesAirspySerialAliases) without explaining how you test a fleet of
radios in CI where there are no radios at all. That’s Part 13: replaying
captured USB control-transfer sequences, bit-identical conversion golden
masters, and an opt-in real-hardware tier.
FAQ
Why poll every 30 seconds instead of listening for kernel hotplug events? Polling is portable. The same re-enumerate loop works on Linux USBDEVFS, Windows WinUSB, and macOS IOKit without three platform-specific hotplug listeners. A 30 s interval is short enough to recover a transient drop within a single failure cycle and long enough not to load a slow hub.
What happens to an in-use device that drops? The watchdog owns the idle
case. A device that’s actively streaming surfaces its death through the stream
itself — the reaper closes the channel, the consumer (ccdecoder retry loop,
VoicePool.Bind) sees EOF and drives its own Reacquire. The watchdog is the
backstop for radios nobody is currently touching.
Why does strict mode skip a device that’s physically present? Because an
allowlist is an allowlist, not a preference. If you named your control stick in
config and an unrelated dongle is on the bus, opening that dongle could let it
win RoleControl and bind the decoder to a radio that never got your PPM
correction — the original issue #264 failure. Strict mode refuses to guess.
Series navigation
Part 12 of 14 · ← Part 11 · Next → Part 13: Testing radios without radios