feat: NG visualizations / deepcausality integration #1028
Closed
opened 2026-03-28 04:30:57 +00:00 by mfreeman451
2 comments
Imported from GitHub.
Original GitHub issue: #2834
Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/issues/2834
Original created: 2026-02-14T04:30:53Z
This is the Definitive Product Requirements Document (PRD) for the ServiceRadar "God-View" Topology Platform. This edition integrates the Hybrid Filter Strategy, ensuring a clean decoupling between the high-performance backend and the GPU-accelerated frontend.
PRD: ServiceRadar "God-View" Visualization Engine (Integrated Edition)
1. Vision & Executive Summary
To transform "Network Monitoring" into a Cyber-Physical Radar experience. ServiceRadar visualizes massive-scale global infrastructure (100k+ nodes) as a living, breathing organism. By combining Zero-Copy Data Streaming, GPU-Native Rendering, and Deep Causal Inference, we eliminate "Alert Fatigue" and provide an instant, visual "Blast Radius" for every incident.
2. The High-Performance Technical Stack (The "Three Pillars")
To achieve 60fps performance and sub-second data updates at a scale of 100k nodes/250k edges, we bypass the "JSON/REST Bottleneck" entirely.
Pillar 1: The Vehicle (Apache Arrow IPC)
deck.gl.
Pillar 2: The Filter (Hybrid Roaring Bitmaps)
is_cisco, is_critical).
Pillar 3: The Brain (Deep Causality & Rustler)
deep_causality Rust crate) evaluates telemetry (SNMP, Flow, BGP, Security) to distinguish between a Root Cause and an Inferred Symptom.
3. The Data Pipeline: "Telemetry to Vision"
deck.gl (WebGL2/WebGPU) receives the buffer and updates the 100k nodes in a single draw call.
4. The Hybrid Filtering & Ghosting Engine (Architectural Core)
To maintain backend/frontend decoupling while ensuring 60fps performance, we utilize a Hybrid Filter Strategy. This ensures the backend remains GPU-agnostic while the frontend remains logic-light.
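As a rough illustration of the hand-off described here, the sketch below intersects two attribute bitsets and expands the result into one value per node, which is the kind of flat array a frontend could upload as a per-node attribute. It is only a sketch under assumptions: a plain `Vec<u64>` bitset stands in for a Roaring bitmap, and the function names `intersect` and `expand_filter_mask` are hypothetical, not part of ServiceRadar.

```rust
/// Intersects two attribute bitsets word by word, e.g. is_cisco AND is_critical.
/// Assumption: both bitsets use one bit per node, least-significant bit first.
pub fn intersect(a: &[u64], b: &[u64]) -> Vec<u64> {
    a.iter().zip(b.iter()).map(|(x, y)| x & y).collect()
}

/// Expands a packed bitset into one byte per node:
/// 1 = node matches the filter, 0 = node should be ghosted.
/// Nodes beyond the end of `words` are treated as not matching.
pub fn expand_filter_mask(words: &[u64], node_count: usize) -> Vec<u8> {
    (0..node_count)
        .map(|i| {
            let word = words.get(i / 64).copied().unwrap_or(0);
            ((word >> (i % 64)) & 1) as u8
        })
        .collect()
}
```

The backend side only deals in set membership, while the expanded array carries no filter logic at all, which is the decoupling this section asks for.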
4.1 Separation of Concerns
deck.gl receives the bitmap and passes it to the GPU as a Vertex Attribute via the DataFilterExtension.
4.2 The "Reshape" vs. "Visual" Logic
5. Multi-Layer Visualization Architecture
We use a "Layered Projection" model to maintain clarity across physical and logical planes.
Layer 1: The Mantle (Physical Infrastructure)
Layer 2: The Crust (Logical Topology)
Layer 3: The Atmosphere (Telemetry Flow)
rperf throughput.
Layer 4: The Security & Causal Perimeter
6. Advanced UI Features
6.1 Semantic Zoom & Fractal Navigation
6.2 Radial Subnet Layouts
To prevent "Sprawl," Leaf/Access nodes are arranged in Compact Radial Clusters around Distribution switches, reducing the visual footprint of large subnets.
7. Aesthetic Specification ("Cyber-Punk Nocturne")
8. Success Metrics
9. Critical Use Case: The "Security-Exfiltration" Incident
deep_causality identifies a Server IP as the source, notes an unauthorized Falco process, and matches the destination to a known C2 server.
Imported GitHub comment.
Original author: @mfreeman451
Original URL: https://github.com/carverauto/serviceradar/issues/2834#issuecomment-3901207836
Original created: 2026-02-14T06:10:42Z
v2:
This updated Definitive Product Requirements Document (PRD) integrates the Wasm-Arrow Bridge into the ServiceRadar "God-View" architecture. This addition elevates the platform from a high-performance web app to a "computationally elite" visualization engine, eliminating the "JavaScript Tax" to ensure a locked 60fps at 100k+ nodes.
PRD: ServiceRadar "God-View" Visualization Engine (Wasm-Arrow Edition)
1. Vision & Executive Summary
To transform "Network Monitoring" into a Cyber-Physical Radar experience. ServiceRadar visualizes massive-scale global infrastructure (100k+ nodes) as a living, breathing organism. By combining Zero-Copy Data Streaming, Wasm-Native Logic, and GPU-Accelerated Rendering, we eliminate "Alert Fatigue" and provide an instant, visual "Blast Radius" for every incident.
2. The High-Performance Technical Stack (The "Four Pillars")
To achieve 60fps performance and sub-second data updates at a scale of 100k nodes/250k edges, we bypass the "JSON/REST Bottleneck" and JavaScript Garbage Collection stutters entirely.
Pillar 1: The Vehicle (Apache Arrow IPC)
Pillar 2: The Engine (Wasm-Arrow Bridge)
Pillar 3: The Filter (Hybrid Roaring Bitmaps)
Pillar 4: The Brain (Deep Causality & Rustler)
deep_causality Rust crate) evaluates telemetry (SNMP, Flow, BGP, Security) to distinguish between a Root Cause and an Inferred Symptom.
3. The Data Pipeline: "Telemetry to Vision"
deck.gl (via GeoArrow patterns) reads coordinates directly from the Wasm heap to update 100k nodes in a single draw call.
4. The Hybrid Filtering & Ghosting Engine
4.1 Separation of Concerns
4.2 The "3-Hop" Rule (Local Traversal)
When a user clicks a node, the Wasm Engine traverses the graph adjacency list in memory. It identifies neighbors within N hops and updates the "Ghosting Mask" in < 1ms, providing instantaneous visual isolation.
5. Multi-Layer Visualization Architecture
Layer 1: The Mantle (Physical Infrastructure)
Layer 2: The Crust (Logical Topology)
Layer 3: The Atmosphere (Telemetry Flow)
rperf throughput. Animated Particle Shaders calculated in Wasm for 100k+ particles.
Layer 4: The Security & Causal Perimeter
6. Advanced UI Features
6.1 Semantic Zoom & Wasm Interpolation
6.2 Radial Subnet Layouts
To prevent "Sprawl," Leaf/Access nodes are arranged in Compact Radial Clusters. Wasm handles the radial coordinate math locally to keep the layout snappy.
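The radial coordinate math mentioned here can be sketched in a few lines. This is a minimal illustration, assuming leaves are spaced evenly on a single ring around their distribution switch; the function name `radial_cluster` and its parameters are hypothetical.

```rust
use std::f64::consts::TAU;

/// Places `leaf_count` leaf nodes evenly on a circle of `radius`
/// around a hub at (`cx`, `cy`). Returns (x, y) pairs in leaf order,
/// starting at angle 0 and going counter-clockwise.
pub fn radial_cluster(cx: f64, cy: f64, radius: f64, leaf_count: usize) -> Vec<(f64, f64)> {
    (0..leaf_count)
        .map(|i| {
            // Fraction of a full turn assigned to leaf i.
            let angle = TAU * (i as f64) / (leaf_count as f64);
            (cx + radius * angle.cos(), cy + radius * angle.sin())
        })
        .collect()
}
```

Because each leaf position depends only on its index and the hub position, a cluster can be recomputed independently of the rest of the layout when a subnet changes.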
7. Aesthetic Specification ("Cyber-Punk Nocturne")
8. Success Metrics
9. Critical Use Case: The "Security-Exfiltration" Incident
deep_causality identifies a Server IP as the source and a malicious destination.
Imported GitHub comment.
Original author: @marvin-hansen
Original URL: https://github.com/carverauto/serviceradar/issues/2834#issuecomment-3904074197
Original created: 2026-02-15T09:57:17Z
Okay, I took some time to write up my thoughts on the DC integration. No AI, just my humble brain dump:
ServiceRadar DeepCausality integration
Big idea:
Constructing and updating a context hypergraph in real time as the various devices in the network are discovered.
DeepCausality enables multi-contextual reasoning across arbitrarily complex hypergraphs. Moreover, because the model abstraction that wraps a causal model and its context defines the context as an Arc<RwLock<…>>, it’s also possible to share a global context across different models: Arc is Clone, and the RwLock (a finer-grained Mutex) ensures read/write protection. Thus one can experiment with various causal models reasoning over the same shared global network context graph.
In practice, it is advisable to build and update the graph in tandem with persistence, e.g. the database upsert operation, to keep the data synchronized.
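The sharing pattern can be shown with the standard library alone. The sketch below is not DeepCausality's API: `SharedContext`, `Model`, `new_context` and `upsert_device` are hypothetical stand-ins, and a plain map of device id to neighbor ids stands in for the context hypergraph.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

/// Hypothetical stand-in for the shared network context: device id -> neighbor ids.
pub type SharedContext = Arc<RwLock<HashMap<u32, Vec<u32>>>>;

/// Creates an empty shared context.
pub fn new_context() -> SharedContext {
    Arc::new(RwLock::new(HashMap::new()))
}

/// Hypothetical "model" that only holds a handle to the shared context.
pub struct Model {
    pub name: &'static str,
    pub context: SharedContext,
}

impl Model {
    /// Read-only query; several models may hold read locks at the same time.
    pub fn device_count(&self) -> usize {
        self.context.read().expect("context lock poisoned").len()
    }
}

/// Inserts or replaces a device entry. Intended to be called next to the
/// database upsert so that graph and storage stay in step.
pub fn upsert_device(ctx: &SharedContext, id: u32, neighbors: Vec<u32>) {
    ctx.write().expect("context lock poisoned").insert(id, neighbors);
}
```

Cloning the `Arc` copies only the handle, so every model sees the same data, and the write lock is held only for the duration of each upsert.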
Why?
A handful of use cases become trivial to solve with the context graph:
Network diagnostic and reliability detection
A) Detecting mission critical choke points.
Problem:
In large networks, it’s rarely fully known where all the non-obvious bottlenecks are buried. However, if just one of those highly centralized routers or gateways were to fail, the bulk of the network would instantly be disconnected.
Solution: This is actually trivial because one only needs to create a deep copy (clone) of the current network context graph, freeze it, run the betweenness_centrality() algorithm, sort the results by score, and highlight the top N nodes. Betweenness centrality measures how many of the shortest paths between other nodes pass through a given node, so a high score means a large share of network paths go through this node and it is therefore implicitly mission critical. In the UI it’s advisable to set N to a sensible default, e.g. N=5 to identify the top 5 mission-critical network nodes. However, the user should also be able to set N to a custom value.
Value:
If just one unmitigated choke point is upgraded to HA failover, a complete network takedown can be prevented.
The most important incident is always the one that never occurred because of effective mitigation.
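To make the choke-point ranking concrete, here is a hand-rolled sketch of betweenness centrality (Brandes' algorithm) on an unweighted adjacency list, followed by a top-N selection. It does not call the library's betweenness_centrality(); the names `betweenness` and `top_choke_points` and the adjacency-list input are assumptions made for illustration.

```rust
use std::collections::VecDeque;

/// Betweenness centrality for an unweighted graph (Brandes' algorithm).
/// `adj[v]` lists the neighbors of node `v`. For an undirected graph stored
/// with both edge directions, every node pair is counted twice.
pub fn betweenness(adj: &[Vec<usize>]) -> Vec<f64> {
    let n = adj.len();
    let mut score = vec![0.0; n];
    for s in 0..n {
        let mut order = Vec::with_capacity(n); // nodes in BFS visit order
        let mut preds: Vec<Vec<usize>> = vec![Vec::new(); n];
        let mut paths = vec![0.0f64; n]; // number of shortest paths from s
        let mut dist = vec![usize::MAX; n];
        paths[s] = 1.0;
        dist[s] = 0;
        let mut queue = VecDeque::from([s]);
        while let Some(v) = queue.pop_front() {
            order.push(v);
            for &w in &adj[v] {
                if dist[w] == usize::MAX {
                    dist[w] = dist[v] + 1;
                    queue.push_back(w);
                }
                if dist[w] == dist[v] + 1 {
                    paths[w] += paths[v];
                    preds[w].push(v);
                }
            }
        }
        // Accumulate dependencies from the farthest nodes back toward s.
        let mut delta = vec![0.0f64; n];
        for &w in order.iter().rev() {
            for &v in &preds[w] {
                delta[v] += paths[v] / paths[w] * (1.0 + delta[w]);
            }
            if w != s {
                score[w] += delta[w];
            }
        }
    }
    score
}

/// Indices of the `top_n` highest-scoring nodes, best first.
pub fn top_choke_points(adj: &[Vec<usize>], top_n: usize) -> Vec<usize> {
    let score = betweenness(adj);
    let mut idx: Vec<usize> = (0..adj.len()).collect();
    idx.sort_by(|&a, &b| score[b].total_cmp(&score[a]));
    idx.truncate(top_n);
    idx
}
```

In a star-shaped network the hub is the only node that lies on any path between two other nodes, so it is the one returned for N=1.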
B) Identifying over-centralized nodes
Problem:
As networks grow large, it’s possible that certain services become over-centralized and, because of that, a structural risk.
Solution:
Trivial, just clone the graph, freeze and run
strongly_connected_components()
This returns sets of nodes, where each set is one strongly connected component. These would be central routers, DNS servers and core network services. Once these are identified, a network security audit can begin with mitigation.
Value:
If just one of those core services is made redundant through proper HA failover, another potential network takedown has been mitigated before it could happen.
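For readers unfamiliar with the idea, the sketch below computes strongly connected components with Tarjan's algorithm. It is an illustration rather than the library's strongly_connected_components(): the function name `strongly_connected` is hypothetical, and it assumes a directed graph given as adjacency lists.

```rust
/// Strongly connected components of a directed graph (Tarjan's algorithm).
/// `adj[v]` lists the nodes that `v` has an edge to. Each returned component
/// is sorted by node index. Recursive, so very deep graphs may need an
/// iterative variant instead.
pub fn strongly_connected(adj: &[Vec<usize>]) -> Vec<Vec<usize>> {
    struct State<'a> {
        adj: &'a [Vec<usize>],
        next_index: usize,
        index: Vec<Option<usize>>,
        low: Vec<usize>,
        on_stack: Vec<bool>,
        stack: Vec<usize>,
        out: Vec<Vec<usize>>,
    }

    fn visit(st: &mut State, v: usize) {
        st.index[v] = Some(st.next_index);
        st.low[v] = st.next_index;
        st.next_index += 1;
        st.stack.push(v);
        st.on_stack[v] = true;
        let adj = st.adj; // copy the shared reference so `st` can be mutated below
        for &w in &adj[v] {
            match st.index[w] {
                None => {
                    visit(st, w);
                    st.low[v] = st.low[v].min(st.low[w]);
                }
                Some(iw) if st.on_stack[w] => st.low[v] = st.low[v].min(iw),
                Some(_) => {}
            }
        }
        // v is the root of a component: pop everything above it on the stack.
        if st.index[v] == Some(st.low[v]) {
            let mut comp = Vec::new();
            while let Some(w) = st.stack.pop() {
                st.on_stack[w] = false;
                comp.push(w);
                if w == v {
                    break;
                }
            }
            comp.sort_unstable();
            st.out.push(comp);
        }
    }

    let n = adj.len();
    let mut st = State {
        adj,
        next_index: 0,
        index: vec![None; n],
        low: vec![0; n],
        on_stack: vec![false; n],
        stack: Vec::new(),
        out: Vec::new(),
    };
    for v in 0..n {
        if st.index[v].is_none() {
            visit(&mut st, v);
        }
    }
    st.out
}
```

Components with more than one member are the tightly coupled groups worth auditing first; single-node components are nodes that are not part of any cycle.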
C) Testing network pathways
Problem:
As networks grow large, debugging connectivity issues becomes increasingly complex. Also, for security reasons some network nodes should not be reachable from some network segments.
Solution:
Trivial, just clone the graph, freeze and run
is_reachable(start_index, stop_index)
This shows instantly whether the target node is reachable from the start node. It verifies that security policies have been enforced correctly or, equally valid, answers the question of why a certain service is not reachable.
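What such a reachability check does under the hood can be sketched with a breadth-first search. This is a stand-alone illustration, not the library call; the name `path_exists` and the adjacency-list representation are assumptions.

```rust
use std::collections::VecDeque;

/// Returns true if `stop` can be reached from `start` by following directed
/// edges. `adj[v]` lists the nodes that `v` has an edge to; every listed
/// neighbor is assumed to be a valid index into `adj`.
pub fn path_exists(adj: &[Vec<usize>], start: usize, stop: usize) -> bool {
    if start >= adj.len() || stop >= adj.len() {
        return false;
    }
    let mut seen = vec![false; adj.len()];
    let mut queue = VecDeque::from([start]);
    seen[start] = true;
    while let Some(v) = queue.pop_front() {
        if v == stop {
            return true;
        }
        for &w in &adj[v] {
            if !seen[w] {
                seen[w] = true;
                queue.push_back(w);
            }
        }
    }
    false
}
```

A policy check is then a matter of asserting that the function returns false for every pair of segments that must stay isolated.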
Value:
Network security and intrusion detection
Problem:
Advanced Persistent Threats (APT) pose a significant challenge because adversaries spread out network infiltration over time and camouflage their activities as regular traffic that would normally remain undetected.
Here, larger network size becomes a major complication because it’s impractical to deploy an in-depth IDS on every single device, mainly because of the heterogeneous platforms and systems across all connected devices.
Note: The WiFi scan and monitor capabilities would massively help here to capture the network 360 degrees by keeping an eye on all wired and wireless devices. That way, one can block wireless devices the moment they try to do anything stupid long before anything else breaks deep down in the network.
Approach:
Because the network graph represents all discovered network devices and captures traffic between devices, one can deploy multiple causal models to watch for multiple anomalies in the network hypergraph.
For one, one can deploy certain rules, e.g. workstations of network segment X are only allowed to connect to printers and SMB shares of the same segment, but not to a certain number of other core services. If that rule were to be violated, an alert and/or silent mitigation could be triggered.
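A rule of this kind could be expressed roughly as follows. The sketch is deliberately stricter than the text (it flags any workstation traffic that is not to a same-segment printer or SMB share), and every type and function name in it (`Kind`, `Device`, `violates_rule`, `violations`) is hypothetical.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Kind {
    Workstation,
    Printer,
    SmbShare,
    CoreService,
}

#[derive(Clone, Copy, Debug)]
pub struct Device {
    pub kind: Kind,
    pub segment: u16,
}

/// Hypothetical rule: a workstation may only talk to printers and SMB shares
/// in its own segment. Traffic from other device kinds is not covered here.
pub fn violates_rule(src: Device, dst: Device) -> bool {
    if src.kind != Kind::Workstation {
        return false;
    }
    let allowed_kind = matches!(dst.kind, Kind::Printer | Kind::SmbShare);
    !(allowed_kind && dst.segment == src.segment)
}

/// Returns the (source, destination) index pairs of observed flows that
/// break the rule. Indices refer to positions in `devices`.
pub fn violations(devices: &[Device], flows: &[(usize, usize)]) -> Vec<(usize, usize)> {
    flows
        .iter()
        .copied()
        .filter(|&(s, d)| violates_rule(devices[s], devices[d]))
        .collect()
}
```

Each returned pair is a candidate for an alert or a silent mitigation, depending on how the rule is configured.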
Then, once an anomaly has been detected, one can instantly capture the blast radius of a compromised machine by simply querying for all edges of that node. How “dangerous” a compromised node is can be determined by testing whether there is any pathway from the compromised node to a number of mission-critical nodes. The more pathways exist, the higher the danger and the swifter the countermeasure.
A central challenge is anomaly detection itself because, as stated before, an APT tends to camouflage itself as regular network traffic. Meaning, a compromised SMB server would try to send out some kind of SMB traffic to adjacent SMB servers to obtain access to other file servers. One key distinction between “normal” and “anomalous” network traffic is in the details of the handshake or network header. For example, there were historic CVEs where SMB was compromised by a buffer overflow caused by an oversized network header. Likewise, a classic Denial of Service usually aborts TCP handshakes in an attempt to exhaust the host’s open connection limit.
Therefore, the causal rules can only be effective when combined with deep packet inspection that scans each protocol for standard conformance. This one is easier to implement because it’s relatively hard to trigger a buffer overflow on the receiving host when network packets with non-standard headers simply never arrive.
Disallowing certain hosts from connecting to certain network nodes should, in theory, be handled on an internal router, but in practice it’s useful to have the functionality in place for those networks that don’t secure internal routes.
Preliminary solution:
Priority 1:
Implement WiFi network and WiFi device discovery to enable 360 visibility across all network types.
Implement a real-time network graph for both wired and wireless networking.
Priority 2:
Priority 3:
Once 360 wired and wireless visibility is in place and the network hypergraph runs stably in the background, it’s time to design an end-to-end advanced APT detection system.