feat(flow-templates): inspect NetFlow/IPFIX templates in web UI #3243

Open
mikemiles-dev wants to merge 5 commits from mikemiles-dev/flow-templates-ui into staging AGit
Collaborator

Adds an operator-facing view of NetFlow v9 / IPFIX templates the
flow-collector replicas have learned, decoded from the shared
flow_templates NATS KV bucket and grouped per exporter (router) with
per-template field lists.

Why: confirms multi-replica template propagation through the KV bucket
is working in deployed clusters and gives ops a way to see what each
router is sending without shelling into a collector pod.

Surfaces:

  • Standalone page: /observability/flow-templates
  • Inline tab card: /observability?tab=netflows&view=templates (5th
    card alongside Traffic Analysis, Talkers, Topology, Flow Explorer)
  • JSON API: GET /api/flow-templates and /api/flow-templates/:scope

Plumbing:

  • proto/kv.proto: optional bucket field on Get/BatchGet/ListKeys/Info
    so datasvc clients can read non-default KV buckets.
  • go/pkg/datasvc: new GetEntryFromBucket / ListKeysInBucket interface
    methods, read-only and require the bucket to already exist (datasvc
    does not auto-create non-default buckets). nats_security mode "none"
    no longer requires a creds file (was rejecting plain unauth NATS dev
    setups).
  • elixir/datasvc: bucket: opt on KV.list_keys/2, get/2, info/1.
  • elixir/web-ng: pure-Elixir wire-format decoder for the netflow_parser
    template payloads (V9 / V9 options / IPFIX / IPFIX options + V9-style
    IPFIX), curated v9 + IPFIX field-name registry copied from the rust
    netflow_parser IANA enums (~158 v9 + ~500 IPFIX entries with
    field_NNN / enterprise_EEE_field_NNN fallbacks).
  • rust/kvutil: GetRequest literal updated for the new bucket field.
  • rust/flow-collector: dev config flips on metrics_addr + template_store
    by default; Dockerfile pinned to rust:1.88-bookworm so produced
    binaries match the bookworm-slim runtime GLIBC.
  • docker/compose: simple-compose dev stack lands as tracked files
    (datasvc.simple.json + the relaxed validation), and the web-ng
    Dockerfile copies elixir_uuid which was a previously-missing path
    dep.

Tests: parser, decoder (round-trips rust encoder layouts incl
enterprise + scope fields), API controller error paths.

Adds an operator-facing view of NetFlow v9 / IPFIX templates the flow-collector replicas have learned, decoded from the shared flow_templates NATS KV bucket and grouped per exporter (router) with per-template field lists. Why: confirms multi-replica template propagation through the KV bucket is working in deployed clusters and gives ops a way to see what each router is sending without shelling into a collector pod. Surfaces: - Standalone page: /observability/flow-templates - Inline tab card: /observability?tab=netflows&view=templates (5th card alongside Traffic Analysis, Talkers, Topology, Flow Explorer) - JSON API: GET /api/flow-templates and /api/flow-templates/:scope Plumbing: - proto/kv.proto: optional bucket field on Get/BatchGet/ListKeys/Info so datasvc clients can read non-default KV buckets. - go/pkg/datasvc: new GetEntryFromBucket / ListKeysInBucket interface methods, read-only and require the bucket to already exist (datasvc does not auto-create non-default buckets). nats_security mode "none" no longer requires a creds file (was rejecting plain unauth NATS dev setups). - elixir/datasvc: bucket: opt on KV.list_keys/2, get/2, info/1. - elixir/web-ng: pure-Elixir wire-format decoder for the netflow_parser template payloads (V9 / V9 options / IPFIX / IPFIX options + V9-style IPFIX), curated v9 + IPFIX field-name registry copied from the rust netflow_parser IANA enums (~158 v9 + ~500 IPFIX entries with field_NNN / enterprise_EEE_field_NNN fallbacks). - rust/kvutil: GetRequest literal updated for the new bucket field. - rust/flow-collector: dev config flips on metrics_addr + template_store by default; Dockerfile pinned to rust:1.88-bookworm so produced binaries match the bookworm-slim runtime GLIBC. - docker/compose: simple-compose dev stack lands as tracked files (datasvc.simple.json + the relaxed validation), and the web-ng Dockerfile copies elixir_uuid which was a previously-missing path dep. Tests: parser, decoder (round-trips rust encoder layouts incl enterprise + scope fields), API controller error paths.
Wire the netflow_parser 1.0.3 TemplateStore extension point into the
flow collector and switch the K8s deployment from a single Deployment
to a per-node DaemonSet behind `Service: LoadBalancer` with
`externalTrafficPolicy: Local`. Together these let the collector scale
horizontally with node count while preserving exporter source IP and
sharing learned NetFlow / IPFIX templates across pods via NATS
JetStream KV.

Why
---
A single flow-collector Deployment was a singleton bottleneck for UDP
ingest. Scaling required either source-IP-affinity routing (so every
exporter always lands on the same pod and templates stay local) or a
shared template store so any pod could decode any flow. We picked the
shared-store path because it doesn't depend on the LB layer.

The original plan was to front this with Envoy Gateway UDPRoute, but a
research spike found UDPRoute does not preserve client source IP today
— upstreams see the gateway pod's IP, which would collapse
AutoScopedParser's per-(source_ip,source_id) scoping. Service +
DaemonSet + `externalTrafficPolicy: Local` is the workable alternative
on MetalLB (L2 mode preserves source IP through to the pod). Gateway
API stays available for the rest of the stack (HTTP/gRPC/mTLS).

Code
----
* `Cargo.toml`: bump netflow_parser 1.0.0 -> 1.0.3 (TemplateStore API);
  add `bytes` for the NATS KV payload type.

* `src/nats_client.rs` (new): Shared `connect_once` and
  `connect_with_retry` helpers. Both publisher and template-store
  bootstrap go through these, applying TLS root CA + client cert +
  creds file from `Config.security` and `Config.nats_creds_file`.
  Eliminates the easy-to-miss bug where one path used bare
  `async_nats::connect(url)` and silently broke mTLS deployments.
  Bounded backoff (60 attempts, 0.5s -> 30s) so neither path
  crash-loops the pod when NATS is slow to come up.

* `src/template_store.rs` (new): `NatsKvTemplateStore` implements the
  parser's sync `TemplateStore` trait. async_nats is async-only so we
  bridge via `tokio::task::block_in_place` + `Handle::block_on`. NATS
  KV keys are restricted to `[A-Za-z0-9._=/-]`; the scope produced by
  AutoScopedParser (`v9:1.2.3.4:2055/0`, IPv6 with brackets, etc.)
  is sanitized via underscore replacement. `kind_tag` wildcard arm
  warns once via `std::sync::Once` if a future netflow_parser version
  ships a new `TemplateKind` variant.

* `src/netflow/mod.rs`: `NetflowHandler::new` accepts an optional
  `Arc<dyn TemplateStore>` and threads it through
  `NetflowParserBuilder::with_template_store`. New `TemplateEvent::
  Restored` arm in the event-callback (logs at info). When a store
  is configured, spawns a 1Hz background ticker that aggregates
  per-source `CacheMetrics` into the listener-level Prometheus
  counters via a `DeltaState` that tracks per-source last-known
  values plus a `retired` total — so listener counters never decrease
  when sources are LRU-evicted. Off the parse_datagram hot path.

* `src/listener.rs`: `build_handler` takes
  `Option<Arc<dyn TemplateStore>>`. sFlow ignores it (template-less).

* `src/config.rs`: new optional `TemplateStoreConfig` with
  `kv_bucket`, `kv_history` (u8, validated `1..=64`), `kv_ttl_secs`,
  and optional `nats_url` override for split-fault-domain setups.

* `src/main.rs`: `bootstrap_template_store(&Config, &TemplateStoreConfig)`
  uses `nats_client::connect_with_retry` (so TLS/creds/retry match
  the publisher), then `create_or_update_key_value` for genuinely
  idempotent KV bucket bootstrap across config drift. Independent
  NATS connection from the publisher's so KV failures cannot stall
  publishing and vice versa.

* `src/metrics.rs`: `ListenerMetrics` gains `template_store_restored`,
  `template_store_codec_errors`, `template_store_backend_errors`,
  `source_count` atomics. New `render_prometheus()` produces text
  exposition format with HELP/TYPE headers and properly escaped
  label values (`\` -> `\\`, `"` -> `\"`, `\n` -> `\n`). sFlow
  listeners are filtered out of `template_store_*` and
  `flow_collector_sources` rows since those are structurally zero
  for sFlow.

  New `run_prometheus_server()` is a hand-rolled HTTP/1.1 server on
  tokio's TcpListener — `GET /metrics` returns the exposition;
  anything else 404s. No new deps. Defensive measures against
  slowloris-style attacks: per-line read timeout (3s), response
  write timeout (5s), 8 KiB max-request-bytes cap via
  `AsyncReadExt::take`, and a `Semaphore`-bounded concurrency cap
  (64) that fast-fails excess connections rather than queueing them
  so an attacker cannot exhaust file descriptors.

  Spawned from main.rs when `metrics_addr` is set; runs
  independently of the log reporter and the publisher so a scrape
  failure can never stall ingestion.

K8s manifests
-------------
* `k8s/demo/base/serviceradar-flow-collector.yaml`: `Deployment` ->
  `DaemonSet`, replicas dropped, unused PVC removed (the Rust binary
  writes nothing to disk). ConfigMap JSON gains the `template_store`
  block. Service was already `externalTrafficPolicy: Local` — kept.
  Added `RollingUpdate` with `maxUnavailable: 1`. HTTP probes on the
  `/metrics` endpoint replace the old `pgrep` exec probes (catches
  deadlocked-but-running scenarios).

* `helm/serviceradar/templates/flow-collector.yaml`: same kind flip,
  PVC removal, plus `nodeSelector` / `tolerations` knobs.
  `templates/NOTES.txt` warns operators about the `replicaCount` and
  `data.*` value removals and prints the explicit
  `kubectl delete deployment / pvc` commands needed to clean up
  orphaned objects from a pre-DaemonSet install.

* `helm/serviceradar/values.yaml`: drop `replicaCount` and `data.*`,
  add `flowCollector.config.template_store` block, default
  `service.type: LoadBalancer` and
  `service.externalTrafficPolicy: Local`, add the IPFIX (4739) and
  metrics (50046) ports.

Tests
-----
55 tests pass; `clippy --all-targets -D warnings` clean. New tests:

* 5 `NatsKvTemplateStore` key-render tests (IPv4, IPv6, empty scope,
  distinct scopes, distinct kinds).
* 2 `TemplateStoreConfig` deserialization tests.
* 2 `DeltaState` monotonic-counter tests covering source eviction and
  re-emergence and multiple-sources / multiple-kinds aggregation.
* `escape_label_handles_quotes_backslashes_newlines`,
  `template_store_metrics_only_emit_for_netflow_listeners`.
* `http_server_serves_metrics_and_404` — end-to-end including a
  deliberate split-read of the request line that the old
  single-`read()` code would have failed.
* `http_server_drops_idle_connection_within_read_timeout` — slow
  client closed within 2x READ_TIMEOUT (was: would hang forever).
* `http_server_concurrency_cap_fast_fails_excess_connections` —
  saturate with 64 idle conns, verify a fresh request doesn't hang
  past 2s (was: would queue and starve real scrapers).

Docs
----
`docs/docs/flow-collector-scaling.md`: deployment model with diagram
and rationale for `externalTrafficPolicy: Local`; configuration knobs
table; small / medium / large / very-large sizing rules of thumb;
Prometheus metrics with healthy-vs-trouble interpretations and
alerting suggestions; signals you've outgrown the design plus the
sharding strategy when you do; "Upgrading from a Pre-DaemonSet
Install" section with explicit `kubectl delete deployment / pvc`
commands; "Rolling Upgrade Behavior" section explaining how
`externalTrafficPolicy: Local` interacts with DaemonSet rolling
updates and why `maxUnavailable: 1` is the floor; deploy verification
checklist (LB IP, source-IP smoke test, cross-pod template share
check, Prometheus scrape).

Deliberately avoids speculative throughput numbers — flagged as
"benchmark on your hardware" — until real load tests fill them in.
- Add a default netflow listener on 0.0.0.0:4739 in Helm values and the
  demo manifest so the advertised IPFIX Service port has a real UDP
  socket behind it instead of silently dropping datagrams.
- Gate the flow-collector liveness/readiness probes on
  service.ports.metrics.enabled AND config.metrics_addr; fall back to
  a pgrep exec probe when metrics are off so disabling /metrics no
  longer crashloops the pod. Document the implication near the demo
  values' metrics.enabled toggle.
- Replace non-ASCII characters (em/en dashes, arrows, box drawing,
  less-than-or-equal) in docs/docs/flow-collector-scaling.md with
  ASCII equivalents to satisfy the docs ASCII-only constraint.
feat(flow-templates): inspect NetFlow/IPFIX templates in web UI
Some checks are pending
lint / lint (pull_request) Blocked by required conditions
CI / build (pull_request) Blocked by required conditions
Secret Scan / gitleaks (pull_request) Blocked by required conditions
7089194ad1
Adds an operator-facing view of NetFlow v9 / IPFIX templates the
flow-collector replicas have learned, decoded from the shared
flow_templates NATS KV bucket and grouped per exporter (router) with
per-template field lists.

Why: confirms multi-replica template propagation through the KV bucket
is working in deployed clusters and gives ops a way to see what each
router is sending without shelling into a collector pod.

Surfaces:
- Standalone page: /observability/flow-templates
- Inline tab card: /observability?tab=netflows&view=templates (5th
  card alongside Traffic Analysis, Talkers, Topology, Flow Explorer)
- JSON API: GET /api/flow-templates and /api/flow-templates/:scope

Plumbing:
- proto/kv.proto: optional bucket field on Get/BatchGet/ListKeys/Info
  so datasvc clients can read non-default KV buckets.
- go/pkg/datasvc: new GetEntryFromBucket / ListKeysInBucket interface
  methods, read-only and require the bucket to already exist (datasvc
  does not auto-create non-default buckets). nats_security mode "none"
  no longer requires a creds file (was rejecting plain unauth NATS dev
  setups).
- elixir/datasvc: bucket: opt on KV.list_keys/2, get/2, info/1.
- elixir/web-ng: pure-Elixir wire-format decoder for the netflow_parser
  template payloads (V9 / V9 options / IPFIX / IPFIX options + V9-style
  IPFIX), curated v9 + IPFIX field-name registry copied from the rust
  netflow_parser IANA enums (~158 v9 + ~500 IPFIX entries with
  field_NNN / enterprise_EEE_field_NNN fallbacks).
- rust/kvutil: GetRequest literal updated for the new bucket field.
- rust/flow-collector: dev config flips on metrics_addr + template_store
  by default; Dockerfile pinned to rust:1.88-bookworm so produced
  binaries match the bookworm-slim runtime GLIBC.
- docker/compose: simple-compose dev stack lands as tracked files
  (datasvc.simple.json + the relaxed validation), and the web-ng
  Dockerfile copies elixir_uuid which was a previously-missing path
  dep.

Tests: parser, decoder (round-trips rust encoder layouts incl
enterprise + scope fields), API controller error paths.
mikemiles-dev force-pushed mikemiles-dev/flow-templates-ui from 7089194ad1
Some checks are pending
lint / lint (pull_request) Blocked by required conditions
CI / build (pull_request) Blocked by required conditions
Secret Scan / gitleaks (pull_request) Blocked by required conditions
to 1c47a0eaff
Some checks failed
Secret Scan / gitleaks (pull_request) Successful in 21s
lint / lint (pull_request) Failing after 1m26s
CI / build (pull_request) Has been cancelled
2026-05-08 04:15:33 +00:00
Compare
docs(architecture): document scaling philosophy
Some checks failed
Secret Scan / gitleaks (pull_request) Successful in 21s
lint / lint (pull_request) Failing after 46s
CI / build (pull_request) Failing after 5m5s
3ab87a22e9
Captures the rationale for ServiceRadar not shipping HPA/VPA resources
and instead using manual replicaCount + cluster autoscaler at the
infra layer.

The question keeps coming up: most services are JetStream-driven (not
request-driven), so HPA doesn't speed work up — JetStream consumer
fan-out is the natural concurrency control. Per-node services run as
DaemonSets where source-IP preservation rules out HPA outright. This
section formalises the convention so future services know when an HPA
is/isn't appropriate.

Slots in under the existing "Kubernetes HA Profile" section.
fix(flow-templates): address PR review comments
Some checks failed
Secret Scan / gitleaks (pull_request) Successful in 21s
lint / lint (pull_request) Failing after 1m25s
CI / build (pull_request) Failing after 5m50s
3cbbdc9f5a
- LiveView load is now async via start_async/3 so a slow datasvc can
  never block the WebSocket. Both the standalone page and the inline
  netflows view show a "loading…" state while the round-trip is in
  flight, and an explicit "Loaded HH:MM:SS UTC" stamp afterwards so
  staleness is visible to the operator.

- Hoisted the netflow_view enum to a single @netflow_views module
  attribute. Was duplicated in three places (parse, normalize, params)
  and a missed update in this PR caused view=templates to be silently
  rewritten to "overview" mid-review.

- datasvc kvByDomainBucket cache now caps at 32 buckets per domain.
  Real workloads use a handful; this is a guardrail against unbounded
  growth from a misbehaving caller spamming bucket names.

- Added a regen recipe to fields.ex so future drift against
  netflow_parser's IANA enums is fixable in minutes (with the grep/sed
  pipelines that produced the current 158 v9 + 500 IPFIX entries).

- datasvc logs a WARN at startup when nats_security.mode="none" so a
  misconfigured production cluster is visible in standard log scrapers
  instead of silently running unauth.

Note: per-template KV.get is still sequential. Adding BatchGet to the
elixir datasvc client (it already exists in the proto) would collapse
N+1 round-trips to 2 — left as a follow-up to keep this PR focused.
mikemiles-dev force-pushed mikemiles-dev/flow-templates-ui from 3cbbdc9f5a
Some checks failed
Secret Scan / gitleaks (pull_request) Successful in 21s
lint / lint (pull_request) Failing after 1m25s
CI / build (pull_request) Failing after 5m50s
to d8ebafe923
Some checks failed
Secret Scan / gitleaks (pull_request) Successful in 23s
lint / lint (pull_request) Successful in 1m9s
CI / build (pull_request) Failing after 5m6s
2026-05-09 15:11:59 +00:00
Compare
mfreeman451 left a comment

lgtm

lgtm
Some checks failed
Secret Scan / gitleaks (pull_request) Successful in 23s
lint / lint (pull_request) Successful in 1m9s
CI / build (pull_request) Failing after 5m6s
This pull request has changes conflicting with the target branch.
  • Cargo.lock
  • rust/flow-collector/Cargo.toml
View command line instructions

Manual merge helper

Use this merge commit message when completing the merge manually.

Checkout

From your project repository, check out a new branch and test the changes.
git fetch -u origin +refs/pull/3243/head:mikemiles-dev/flow-templates-ui
git switch mikemiles-dev/flow-templates-ui

Merge

Merge the changes and update on Forgejo.

Warning: The "Autodetect manual merge" setting is not enabled for this repository, you will have to mark this pull request as manually merged afterwards.

git switch staging
git merge --no-ff mikemiles-dev/flow-templates-ui
git switch mikemiles-dev/flow-templates-ui
git rebase staging
git switch staging
git merge --ff-only mikemiles-dev/flow-templates-ui
git switch mikemiles-dev/flow-templates-ui
git rebase staging
git switch staging
git merge --no-ff mikemiles-dev/flow-templates-ui
git switch staging
git merge --squash mikemiles-dev/flow-templates-ui
git switch staging
git merge --ff-only mikemiles-dev/flow-templates-ui
git switch staging
git merge mikemiles-dev/flow-templates-ui
git push origin staging
Sign in to join this conversation.
No reviewers
No milestone
No project
No assignees
2 participants
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
carverauto/serviceradar!3243
No description provided.