Architecture Refactor: Move device state to registry, treat Proton as OLAP warehouse #639
Closed
opened 2026-03-28 04:26:53 +00:00 by mfreeman451
11 comments
Imported from GitHub.
Original GitHub issue: #1924
Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/issues/1924
Original created: 2025-11-05T03:32:25Z
Problem Statement
ServiceRadar currently treats Proton (a stream processing database) as the primary source of truth for device state, which causes performance problems and does not scale beyond tens of thousands of devices. The tactical CTE query fix (#1921) reduced Proton CPU from 3986m to ~1000m, but the underlying design is still wrong: every device lookup, stats query, and inventory search hits Proton.
Current issues:
- […] `count()` queries on 50k devices
Vision
Establish a proper layered data architecture:
Proton should only answer questions like:
Proton should not answer questions like:
Detailed Plan
See `newarch_plan.md` for comprehensive implementation details including:
Implementation Phases
Phase 1: Device Registry Service (Week 1-2)
Goal: Canonical in-memory device graph
Success: Registry hydrates from Proton, stays in sync with new updates
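To make the Phase 1 goal concrete, here is a minimal sketch of an in-memory device store with ID, IP, and MAC indexes (the indexes are mentioned in the Phase 1 progress recap in this thread). The `DeviceRecord` fields and the `DeviceStore` method names are assumptions for illustration; the real types in `pkg/registry` are not shown in this issue.

```go
package main

import (
	"fmt"
	"sync"
)

// DeviceRecord is a hypothetical, trimmed-down record used only for this sketch.
type DeviceRecord struct {
	DeviceID string
	IP       string
	MAC      string
}

// DeviceStore keeps records in memory with secondary indexes by IP and MAC.
type DeviceStore struct {
	mu    sync.RWMutex
	byID  map[string]DeviceRecord
	byIP  map[string]string // IP -> device ID
	byMAC map[string]string // MAC -> device ID
}

func NewDeviceStore() *DeviceStore {
	return &DeviceStore{
		byID:  make(map[string]DeviceRecord),
		byIP:  make(map[string]string),
		byMAC: make(map[string]string),
	}
}

// Upsert inserts or replaces a record and refreshes the secondary indexes.
func (s *DeviceStore) Upsert(rec DeviceRecord) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.removeLocked(rec.DeviceID) // drop stale IP/MAC entries first
	s.byID[rec.DeviceID] = rec
	if rec.IP != "" {
		s.byIP[rec.IP] = rec.DeviceID
	}
	if rec.MAC != "" {
		s.byMAC[rec.MAC] = rec.DeviceID
	}
}

// Delete applies a tombstone by removing the record and its index entries.
func (s *DeviceStore) Delete(id string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.removeLocked(id)
}

// removeLocked must be called with the write lock held.
func (s *DeviceStore) removeLocked(id string) {
	old, ok := s.byID[id]
	if !ok {
		return
	}
	if s.byIP[old.IP] == id {
		delete(s.byIP, old.IP)
	}
	if s.byMAC[old.MAC] == id {
		delete(s.byMAC, old.MAC)
	}
	delete(s.byID, id)
}

// GetByIP resolves a device through the IP index without a database read.
func (s *DeviceStore) GetByIP(ip string) (DeviceRecord, bool) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	id, ok := s.byIP[ip]
	if !ok {
		return DeviceRecord{}, false
	}
	rec, ok := s.byID[id]
	return rec, ok
}

// Len reports how many devices are cached.
func (s *DeviceStore) Len() int {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return len(s.byID)
}

func main() {
	store := NewDeviceStore()
	store.Upsert(DeviceRecord{DeviceID: "dev-1", IP: "10.0.0.5", MAC: "aa:bb:cc:dd:ee:ff"})
	store.Upsert(DeviceRecord{DeviceID: "dev-1", IP: "10.0.0.9", MAC: "aa:bb:cc:dd:ee:ff"}) // IP changed
	_, stale := store.GetByIP("10.0.0.5")
	rec, ok := store.GetByIP("10.0.0.9")
	fmt.Println(stale, ok, rec.DeviceID) // prints "false true dev-1"
}
```

Removing the old index entries before each upsert is what keeps an IP change from leaving a stale lookup path behind; hydration from Proton would simply call `Upsert` for every row it reads.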
Phase 2: First-Class Collector Capabilities (Week 3-4)
Goal: Stop deriving capability from metadata
Success: Collector status from explicit records, not metadata inference
Phase 3: Stats Aggregator (Week 5)
Goal: Pre-aggregate dashboard metrics
Success: Dashboard loads in <10ms, no Proton queries for stats
Phase 4: Search Index (Week 6-7)
Goal: Fast inventory search without table scans
Success: Inventory search returns in <50ms for any query
Phase 5: Capability Matrix (Week 8-9)
Goal: Model Device ⇄ Service ⇄ Capability explicitly
Success: Can answer "when did device X last have successful ICMP?" without manual queries
Phase 6: Proton Boundary Enforcement (Week 10)
Goal: Ensure all state queries hit registry, not Proton
Success: Proton CPU <200m under normal load
Success Metrics
Performance Targets
Data Quality
Developer Experience
Rollback Plan
Each phase is independently deployable with feature flags:
```go
const (
	UseRegistry        = true // Phase 1
	UseCapabilityIndex = true // Phase 2
	UseStatsCache      = true // Phase 3
	UseSearchIndex     = true // Phase 4
)
```
If any phase has issues, disable the flag and fall back to Proton queries (slower but functional).
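A rough sketch of how one flag could gate a read path is shown below. It assumes the flags are carried as runtime values rather than compile-time constants, and the `DeviceSource` interface, `Flags` struct, `mapSource` stand-in, and `Lookup` helper are all hypothetical names introduced for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// Device is a hypothetical minimal device shape for this sketch.
type Device struct {
	ID string
	IP string
}

// DeviceSource is a hypothetical interface satisfied by both the registry
// cache and the Proton-backed query path.
type DeviceSource interface {
	GetDevice(id string) (Device, error)
}

// ErrNotFound is returned when a source does not know the device.
var ErrNotFound = errors.New("device not found")

// mapSource is an in-memory stand-in that makes the sketch runnable.
type mapSource map[string]Device

func (m mapSource) GetDevice(id string) (Device, error) {
	d, ok := m[id]
	if !ok {
		return Device{}, ErrNotFound
	}
	return d, nil
}

// Flags carries the rollout switches as runtime values (an assumption of
// this sketch; the issue shows them as constants).
type Flags struct {
	UseRegistry bool
}

// Lookup reads from the registry when the flag is on and from Proton when
// it is off. It also reports which source answered.
func Lookup(f Flags, registry, proton DeviceSource, id string) (Device, string, error) {
	if f.UseRegistry {
		d, err := registry.GetDevice(id)
		return d, "registry", err
	}
	d, err := proton.GetDevice(id)
	return d, "proton", err
}

func main() {
	registry := mapSource{"dev-1": {ID: "dev-1", IP: "10.0.0.5"}}
	proton := mapSource{"dev-1": {ID: "dev-1", IP: "10.0.0.5"}}
	for _, on := range []bool{true, false} {
		_, src, err := Lookup(Flags{UseRegistry: on}, registry, proton, "dev-1")
		fmt.Println(on, src, err)
	}
}
```

Keeping both paths behind one interface means disabling a phase changes only which implementation is consulted, not the callers.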
Related Issues
Open Questions
References
- `85733a09` - Tactical CTE query fix
- `65e5d947` - Architecture plan documentation
Imported GitHub comment.
Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/issues/1924#issuecomment-3489467643
Original created: 2025-11-05T05:41:42Z
Phase 1 progress recap
- `DeviceRecord` and in-memory store (`pkg/registry/device.go`, `device_store.go`) with ID/IP/MAC indexes and cache snapshot helpers.
- […] (`pkg/registry/hydrate.go`, `pkg/core/server.go`).
- `ProcessBatchDeviceUpdates` keeps the hot cache in sync on every update/tombstone and exposes cache-backed getters used by API/device manager code paths.
- […] (`pkg/core/api/server.go`).
- […] (`pkg/registry/trigram_index.go`, `pkg/registry/registry.go`).
What’s left (next engineer hand-off)
- `web/src/components/Devices/DeviceList` (and any inventory/search routes) should call the new registry search endpoint instead of fan-out SRQL queries. Surfacing `metrics_summary`, `alias_history`, and collector capability blobs that the API now attaches will require mapping the new fields in the React data loader and updating `DeviceRow` renderers.
- […] `score`), display an inline badge for exact hostname/IP hits, and preserve the existing status filters.
- […] `web/src/lib/api.ts` to use the registry-backed `/api/devices` list/search endpoints.
- […] (`pkg/core/api/server.go`) capturing query length, match count, and latency so we can validate the <50 ms target under load.
- […] `db.GetUnifiedDevices...` (e.g., identity lookup, mapper publisher) to rely on `DeviceRegistry.SearchDevices`/`GetDeviceRecord` to avoid Proton reads.
- […] `pkg/core/server.go` + `pkg/core/api/server.go` and note it in `docs/docs/agents.md`.
Once those are in place, we can iterate on Phase 2 (capability index) with a warmed-up UI and telemetry to prove the search latency/success metrics.
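For readers unfamiliar with trigram search, the sketch below shows the general technique: index every 3-character window of a device's searchable text, intersect the posting sets for the query's trigrams, then confirm each candidate with a substring check. It is not the code in `pkg/registry/trigram_index.go`; the `TrigramIndex` type and its methods are assumptions for illustration, and removal/re-indexing is left out.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// TrigramIndex maps each 3-character window to the set of device IDs whose
// searchable text contains it.
type TrigramIndex struct {
	postings map[string]map[string]struct{}
	texts    map[string]string // device ID -> lowercased searchable text
}

func NewTrigramIndex() *TrigramIndex {
	return &TrigramIndex{
		postings: make(map[string]map[string]struct{}),
		texts:    make(map[string]string),
	}
}

// trigrams returns every 3-rune window of the lowercased input.
func trigrams(s string) []string {
	r := []rune(strings.ToLower(s))
	if len(r) < 3 {
		return nil
	}
	out := make([]string, 0, len(r)-2)
	for i := 0; i+3 <= len(r); i++ {
		out = append(out, string(r[i:i+3]))
	}
	return out
}

// Add indexes text (for example hostname plus IP) under the device ID.
func (t *TrigramIndex) Add(id, text string) {
	t.texts[id] = strings.ToLower(text)
	for _, g := range trigrams(text) {
		if t.postings[g] == nil {
			t.postings[g] = make(map[string]struct{})
		}
		t.postings[g][id] = struct{}{}
	}
}

// Search intersects the posting sets of the query trigrams, then verifies
// each candidate with a substring match. Results are sorted by device ID.
func (t *TrigramIndex) Search(query string) []string {
	q := strings.ToLower(query)
	grams := trigrams(q)
	candidates := make(map[string]struct{})
	if len(grams) == 0 {
		// Query too short for trigrams: every device is a candidate.
		for id := range t.texts {
			candidates[id] = struct{}{}
		}
	}
	for i, g := range grams {
		posting := t.postings[g]
		if i == 0 {
			for id := range posting {
				candidates[id] = struct{}{}
			}
			continue
		}
		for id := range candidates {
			if _, ok := posting[id]; !ok {
				delete(candidates, id)
			}
		}
	}
	var hits []string
	for id := range candidates {
		if strings.Contains(t.texts[id], q) {
			hits = append(hits, id)
		}
	}
	sort.Strings(hits)
	return hits
}

func main() {
	idx := NewTrigramIndex()
	idx.Add("dev-1", "core-router-01 10.0.0.1")
	idx.Add("dev-2", "edge-switch-07 10.0.1.7")
	idx.Add("dev-3", "core-switch-02 10.0.0.2")
	fmt.Println(idx.Search("switch")) // prints "[dev-2 dev-3]"
	fmt.Println(idx.Search("10.0.0")) // prints "[dev-1 dev-3]"
}
```

The intersection step narrows the candidate set without scanning every record, which is the property the <50 ms search target depends on; the final substring check removes false positives where the trigrams occur but not contiguously.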
Imported GitHub comment.
Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/issues/1924#issuecomment-3489640042
Original created: 2025-11-05T06:52:40Z
Update after Phase 1/2 rollout:
- Proton returned to ~4 cores consumed within 10 minutes of redeploy (`serviceradar-proton-654fbcbcbf-bqdxs`: CPU 3993m, memory 3018Mi).
- […] shows the dominant query is […], issued once per minute by the SRQL/Proton OCaml client. Each run scans ~15.6M rows / 5.9GB, so the Observability dashboard still hammers Proton.
- The CTE-based device lookups introduced in Phase 1 are still invoked hundreds of times per half hour (totalling ~3.7e8 rows read). They're better than the old pattern but remain an expensive fallback because SRQL routes keep hitting Proton instead of the registry cache.
- There are still Code 210 exceptions for giant clauses generated from SRQL filters (e.g. 100+ IPs or Armis IDs), which cause retries and more table scans.
To address the remaining load we extended […] with Phase 3b (Critical Log Rollups) so the web dashboards consume a dedicated log digest instead of the raw scan, and we tightened Sprint 6 tasks to force SRQL/device lookups through the registry and search index. That should eliminate the hot queries once Phases 3-6 are complete.
Imported GitHub comment.
Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/issues/1924#issuecomment-3489724548
Original created: 2025-11-05T07:26:04Z
Status update from the architecture refactor work:
• Log digest cache landed. Added pkg/core/log_digest.go with a capped ring buffer + 1h/24h counters, hydrated every 30s from Proton via the new DBLogDigestSource helper. Core start-up now wires the aggregator and keeps it refreshed until shutdown.
• New critical log APIs. Exposed `/api/logs/critical` and `/api/logs/critical/counters` (protected routes); the handlers serve the in-memory digest so fatal/error widgets no longer hit SRQL.
• Frontend wired to cache. web/src/services/dataService.ts fetches the new endpoints and supplies CriticalLogsWidget with typed data + counters; accompanying unit coverage mocks the API responses.
• Plan/doc cleanup. Phase 3b items are checked off in newarch_plan.md to reflect the cache + API + UI work.
• Validation. `go test ./pkg/core/...` and `npm run lint` are green.
Remaining for Phase 3b: stream-driven hydration (instead of snapshots) and feature-flag plumbing once we’re ready to roll this out broadly.
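The capped ring buffer with 1h/24h counters described above could look roughly like the sketch below. It is not the contents of `pkg/core/log_digest.go`: the `LogEntry` and `LogDigest` names and fields are assumptions, and here the window counts are derived from the buffered entries only, whereas the real aggregator may track counters separately.

```go
package main

import (
	"fmt"
	"time"
)

// LogEntry is a hypothetical minimal shape for a fatal/error log line.
type LogEntry struct {
	Timestamp time.Time
	Severity  string
	Message   string
}

// LogDigest keeps the newest entries in a fixed-size ring buffer.
type LogDigest struct {
	entries []LogEntry
	next    int // slot the next Add overwrites
	size    int // number of valid entries
}

// NewLogDigest creates a digest that retains at most capacity entries.
func NewLogDigest(capacity int) *LogDigest {
	if capacity < 1 {
		capacity = 1
	}
	return &LogDigest{entries: make([]LogEntry, capacity)}
}

// Add stores an entry, overwriting the oldest one once the buffer is full.
func (d *LogDigest) Add(e LogEntry) {
	d.entries[d.next] = e
	d.next = (d.next + 1) % len(d.entries)
	if d.size < len(d.entries) {
		d.size++
	}
}

// Latest returns up to n entries, newest first.
func (d *LogDigest) Latest(n int) []LogEntry {
	if n > d.size {
		n = d.size
	}
	out := make([]LogEntry, 0, n)
	for i := 1; i <= n; i++ {
		idx := (d.next - i + len(d.entries)) % len(d.entries)
		out = append(out, d.entries[idx])
	}
	return out
}

// CountSince counts buffered entries whose timestamp falls within the
// window ending at now.
func (d *LogDigest) CountSince(now time.Time, window time.Duration) int {
	cutoff := now.Add(-window)
	count := 0
	for _, e := range d.Latest(d.size) {
		if !e.Timestamp.Before(cutoff) {
			count++
		}
	}
	return count
}

func main() {
	now := time.Date(2025, 11, 5, 12, 0, 0, 0, time.UTC)
	d := NewLogDigest(3)
	ages := []time.Duration{30 * time.Hour, 2 * time.Hour, 30 * time.Minute, 5 * time.Minute}
	for i, age := range ages {
		d.Add(LogEntry{Timestamp: now.Add(-age), Severity: "fatal", Message: fmt.Sprintf("event-%d", i)})
	}
	// The 30h-old entry was evicted by the fourth Add.
	fmt.Println(d.CountSince(now, time.Hour), d.CountSince(now, 24*time.Hour), d.Latest(1)[0].Message)
	// prints "2 3 event-3"
}
```

Because the buffer is capped, memory stays bounded regardless of log volume, and API handlers can serve `Latest` and the window counts without issuing a query.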
Imported GitHub comment.
Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/issues/1924#issuecomment-3492120109
Original created: 2025-11-05T16:17:11Z
Phase 3b status update:
- Landed the log digest aggregator, tailer, and persistence plumbing; feature flag ([…]) is now available in config.
- Built/published ghcr.io/carverauto/serviceradar-core@sha256:4124f3f298f13c1d2425725bbca80c8bc2e902a93074e2e3849a24103b6e1be9 and rolled the demo deployment to that image.
- During the rollout, enabling […] in the demo cluster prevented the HTTP listener from ever becoming ready (readiness probe stayed red). For now the flag is set to false in the runtime config so the new build can serve traffic.
- Follow-up: debug why enabling the log digest stream blocks readiness before we flip the flag back on.
Imported GitHub comment.
Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/issues/1924#issuecomment-3492120711
Original created: 2025-11-05T16:17:20Z
Phase 3b status update:
- Landed the log digest aggregator, tailer, and persistence plumbing; feature flag (`features.use_log_digest`) is now exposed in config.
- During the rollout, enabling `UseLogDigest` in the demo cluster prevented the HTTP listener from ever becoming ready (readiness probe stayed red). For now the flag is set to false in the runtime config so the new build can serve traffic.
Imported GitHub comment.
Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/issues/1924#issuecomment-3492222622
Original created: 2025-11-05T16:36:04Z
Re-enabled the log digest path in demo and rolled the cluster:
- […] `serviceradar-config` to set `features.use_log_digest=true`, then rebuilt/pushed `ghcr.io/carverauto/serviceradar-core` (`sha256:ab992d84af2ad9500ce0c4d37c2f7b3231eb76a145c267acdb0a205388c0bb9b`, tag `sha-057b69fdcc8cb45a3d1e46ffb395d910474d897a`).
- […] (`serviceradar-core-c8cf58f59-dcgvb` reached 1/1 ready in ~70s).
Follow-up: Proton is rejecting the streaming tail with `code: 62 ... Syntax error ... EMIT CHANGES`; the aggregator is retrying with exponential backoff. We’ll need to adjust the tail query so the digest stays up to date once the flag stays on.
Imported GitHub comment.
Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/issues/1924#issuecomment-3492446609
Original created: 2025-11-05T17:20:38Z
Validated the streaming log tailer end-to-end:
- […] `ghcr.io/carverauto/serviceradar-core@sha256:c587c6cadf6b1e26182ae93641c42d75d236e93a3c0d76b41267140cee379355` and rolled the demo core deployment.
- […] (`Phase3b log-digest test`) to exercise the digest path.
- […] `/api/logs/critical` and `/api/logs/critical/counters` with an admin JWT; the API served the new entry directly from the in-memory digest, confirming the stream keeps up without relapsing to Proton.
- […] `EMIT CHANGES` syntax errors.
Follow-up: none for Phase 3b tailer; next we can look at trimming that bootstrap timeout if it shows up in SLOs.
Imported GitHub comment.
Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/issues/1924#issuecomment-3492559591
Original created: 2025-11-05T17:46:51Z
Follow-up cleanup from the streaming rollout:
- […] `COUNT(*)` scan type in the service registry (`uint64` instead of `int`), rebuilt/pushed `ghcr.io/carverauto/serviceradar-core@sha256:8170567691819242005bddd711f6c7635ed49b2f02ce66704ead70b8d210f278`, and rolled the demo core deployment.
- […] (`converting UInt64 to *int is unsupported`) are gone; `/api/logs/critical` still returns the latest fatal log from the digest stream.
With the log digest tailer feeding cleanly and the poller cache check fixed, Phase 3b is fully green. Next up is only ongoing monitoring.
Imported GitHub comment.
Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/issues/1924#issuecomment-3492579691
Original created: 2025-11-05T17:51:45Z
Proton connection pressure is cleared up:
- […] `sha256:85a9f7f4860f99b1ce0bd182a44880af4505c712f28a63e8c89eb1a60363c78a` and rolled `serviceradar-core` in demo.
- `proton: acquire conn timeout` errors during edge onboarding / poller cache refresh are no longer appearing after the redeploy; log tailer and registry operations now run without starving the pool.
Remaining noisy log is the legacy poller DELETE syntax (tracked separately). Otherwise the new connection ceiling keeps the registry + onboarding flows happy.
Imported GitHub comment.
Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/issues/1924#issuecomment-3493548590
Original created: 2025-11-05T21:24:17Z
Updates from today:
- […] `table(unified_devices)` we now log both counts plus a sample of missing device_ids (`pkg/registry/hydrate.go`, `pkg/registry/diagnostics.go`, `pkg/core/stats_aggregator.go`). No mismatches yet; hydration is reporting 50,007 devices while Proton currently reports 50,009.
- […] (`ALTER STREAM` instead of `ALTER TABLE`) so the new diagnostics would not spam with Proton errors.
- […] `/api/stats` for its top-line device counts and only falls back to SRQL if the cache is empty. The tile is still bouncing between ~49.5k and 50k because Kong is rejecting internal SRQL calls with 401 (“Unauthorized”), so the fallback path only succeeds intermittently. That explains the eventual consistency we were seeing earlier.
- […] `serviceradar-web` → `serviceradar-kong` rather than the registry cache itself.
Next steps:
- […] `/api/query` via `serviceradar-kong:8000` is unauthorised and either fix the auth headers or point the internal client straight at the OCaml SRQL service.
Imported GitHub comment.
Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/issues/1924#issuecomment-3493900823
Original created: 2025-11-05T22:39:57Z
Observed another skew in the analytics "Total Devices" tile after today's core deploy. The value climbed to ~72k even though Proton and the registry both still report ~50k devices.
What we have already done:
- […] `pkg/core/stats_aggregator.go` and rolled the new core image across the demo namespace.
- `/api/stats` is live and the analytics dashboard queries it first, only falling back to SRQL when the cache turns up empty or zero.
Current working theories:
- […] `0`, and the SRQL query (`in:devices time:last_7d stats:"count() as total"`) over-counts versioned rows.
- […] (`registry.SnapshotRecords()` length vs. Proton again).
Next actions before another roll-out:
- […] `/api/stats` responses alongside the fallback SRQL payload when the UI shows the inflated number (e.g. log both in the browser console or add telemetry in `dataService.fetchAllAnalyticsData`).
- […] `/api/stats` is GA, or change the SRQL to respect `_merged_into`/`_deleted` so the count matches Proton.
I updated `newarch_plan.md` to capture these investigations so we do not repeat the same fixes.
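To illustrate the over-counting theory, the sketch below contrasts a raw row count with a count that keeps only the newest row per device and skips devices marked merged or deleted. The `DeviceRow` shape, the `Version` field, the placement of `_merged_into`/`_deleted` in a metadata map, and the `"true"` value for `_deleted` are all assumptions; the issue only names the two markers.

```go
package main

import "fmt"

// DeviceRow is a hypothetical versioned row: each update to a device
// appends another row with a higher Version.
type DeviceRow struct {
	DeviceID string
	Version  int64
	Metadata map[string]string // may carry "_merged_into" or "_deleted"
}

// CountCanonical keeps only the newest row per device and skips devices
// whose newest row is merged into another device or deleted.
func CountCanonical(rows []DeviceRow) int {
	latest := make(map[string]DeviceRow, len(rows))
	for _, r := range rows {
		if cur, ok := latest[r.DeviceID]; !ok || r.Version > cur.Version {
			latest[r.DeviceID] = r
		}
	}
	total := 0
	for _, r := range latest {
		if r.Metadata["_merged_into"] != "" {
			continue // folded into another canonical device
		}
		if r.Metadata["_deleted"] == "true" {
			continue // tombstoned
		}
		total++
	}
	return total
}

func main() {
	rows := []DeviceRow{
		{DeviceID: "a", Version: 1},
		{DeviceID: "a", Version: 2},
		{DeviceID: "b", Version: 1},
		{DeviceID: "b", Version: 2, Metadata: map[string]string{"_merged_into": "a"}},
		{DeviceID: "c", Version: 1},
		{DeviceID: "c", Version: 2, Metadata: map[string]string{"_deleted": "true"}},
	}
	fmt.Println(len(rows), CountCanonical(rows)) // prints "6 1"
}
```

A plain `count()` over such rows returns 6 here while only one canonical device exists, which is the kind of gap that could inflate a ~50k inventory to ~72k.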