Initial infrastructure documentation for pve1 and pve2.

Contains host docs, MQTT/HA, Git setup, power monitoring, and GPU idle (pve2).

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-06-27 19:53:55 +02:00
commit 6f52d46267
24 changed files with 1549 additions and 0 deletions
@@ -0,0 +1,44 @@
+# pve2: Documentation
+
+**Host:** pve2 · **IP:** 192.168.10.4 · **Role:** Production Proxmox, router (OPNsense), GPU, Docker/Frigate
+
+## Table of Contents
+
+| No. | File | Topic |
+|-----|------|-------|
+| - | [00_README.md](00_README.md) | This overview |
+| - | [infrastructure-host.md](infrastructure-host.md) | Hardware, CT/VM, GPU, storage |
+| - | [power-mqtt-agent.md](power-mqtt-agent.md) | CPU+GPU power → MQTT/HA |
+| 01 | [01_System-und-Speicher-Uebersicht.md](01_System-und-Speicher-Uebersicht.md) | Disks, pools |
+| 02 | [02_Ansible-Playbooks.md](02_Ansible-Playbooks.md) | Ansible |
+| 03 | [03_VM-Analyse-und-Bereinigung.md](03_VM-Analyse-und-Bereinigung.md) | VM cleanup |
+| 04 | [04_Backup-Strategie.md](04_Backup-Strategie.md) | Backups |
+| 05 | [05_Wartung-und-Monitoring.md](05_Wartung-und-Monitoring.md) | Maintenance |
+| 06 | [06_Quick-Reference.md](06_Quick-Reference.md) | Quick commands |
+| 07 | [07_Storage-Migration-docker.md](07_Storage-Migration-docker.md) | Docker storage |
+| 08 | [08_GPU-Idle-und-Power-Monitoring.md](08_GPU-Idle-und-Power-Monitoring.md) | GPU idle (short) |
+| 09 | [09_GPU-Idle-vollstaendig.md](09_GPU-Idle-vollstaendig.md) | GPU idle (full) |
+
+## Shared
+
+- [MQTT & HA](../shared/mqtt-homeassistant.md)
+- [Git & Repos](../shared/git-und-repos.md)
+- [Network](../shared/infrastruktur-netzwerk.md)
+
+## pve2 Specifics
+
+- **2× GTX 1080**: headless, persistence mode, ~8 W/GPU idle (P8)
+- **Power agent:** CPU + GPU0 + GPU1 + estimated_total
+- **CT 101** Docker/Frigate: **Intel VAAPI**, no NVIDIA mounts
+- **CT 110** AIDEV: NVIDIA for ML/Jupyter
+
+## Quick Commands
+
+```bash
+nvidia-smi
+systemctl status nvidia-persistenced pve-power-mqtt
+grep nvidia /etc/pve/lxc/*.conf
+cd /root/docu-repo && git pull
+```
+
+As of: June 2026
@@ -0,0 +1,43 @@
+# 01: System and Storage Overview
+
+## Host
+
+- **System:** Proxmox VE on **pve2**
+- **Node storage:** configured in `/etc/pve/storage.cfg`
+
+## Physical Drives
+
+| Device | Size | Usage |
+|--------|------|-------|
+| **nvme0n1** | ~477 GB | System (`/`), Proxmox host, thin pool `local-lvm` |
+| **nvme1n1** | ~477 GB | Thin pool `nvme_second` (e.g. CT AIDEV) |
+| **sda** | 1.8 TB | HDD, storage `records` (recordings, backups, Docker data) |
+
+## Proxmox Storage Pools
+
+| Storage | Type | Mount / Path | Content |
+|---------|------|--------------|---------|
+| **local** | dir | `/var/lib/vz` | ISO, templates (no longer used for backups) |
+| **local-lvm** | lvmthin | nvme0n1 | VM/CT images, rootfs |
+| **nvme_second** | lvmthin | nvme1n1 | VM/CT images, rootfs |
+| **records** | dir | `/mnt/pve/records` | Backups, images, rootdir, ISO |
+
+## Initial Situation (Before Cleanup)
+
+At the first check, **`local-lvm` was ~93 % full**, which is critical: only ~26 GB were free in the thin pool.
+
+The cause was **not** just "full VMs" but a combination of:
+
+1. **Thin provisioning without `fstrim`**: data deleted inside the guest was never released back to the pool
+2. **Large VM/CT disks** on `local-lvm` (docker, media, …)
+3. **Unused VMs** (kali, dev2) with allocated space
+4. **Backups on `local`** instead of on the HDD
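
A fill level like the one above can be checked mechanically. The following is a minimal sketch, assuming `lvs` is called with `--noheadings -o lv_name,data_percent`; the helper name `pools_over_threshold` and the 80 % default are illustrative assumptions, not part of the existing setup.

```bash
# Print thin pools whose data usage exceeds a threshold (percent).
# Reads lines of "<lv_name> <data_percent>" on stdin, for example from:
#   lvs --noheadings -o lv_name,data_percent pve/data nvme_second/nvme_second
# Hypothetical helper: name and default threshold are assumptions.
pools_over_threshold() {
    threshold="${1:-80}"
    awk -v t="$threshold" 'NF >= 2 && $2 + 0 > t + 0 { print $1 " " $2 }'
}
```

Piping the `lvs` call from the comment into `pools_over_threshold 80` would list only the pools that need attention.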
+
+## Mounted File Systems (Host)
+
+| Mount | Type | Size | Typical Use |
+|-------|------|------|-------------|
+| `/` | ext4 (pve-root) | ~94 GB | Proxmox system |
+| `/boot/efi` | vfat | 1 GB | EFI |
+| `/mnt/pve/records` | xfs | 1.8 TB | HDD storage |
+| `/etc/pve` | fuse | 128 MB | Proxmox cluster config |
@@ -0,0 +1,60 @@
+# 02: VM and Container Analysis
+
+## Running / Existing Guests (After Cleanup)
+
+### QEMU VMs
+
+| VMID | Name | Storage | Boot Disk | Status |
+|------|------|---------|-----------|--------|
+| 100 | windows | nvme_second | 100 GB | stopped |
+| 102 | klipper | local-lvm | 18 GB | running |
+| 104 | opnsense | local-lvm | 32 GB | running |
+| 106 | homeassistant | local-lvm | 32 GB | running |
+
+### LXC Containers
+
+| VMID | Name | Rootfs Storage | Size | Status |
+|------|------|----------------|------|--------|
+| 101 | docker | local-lvm | 104 GB | running |
+| 103 | pve-scripts-local | nvme_second | 4 GB | running |
+| 109 | media | local-lvm | 80 GB | running |
+| 110 | AIDEV | nvme_second | 230 GB | running |
+
+## Top Consumers (Analysis)
+
+| Rank | ID | Name | Storage | Allocated | Main Content | Note |
+|------|-----|------|---------|-----------|--------------|------|
+| 1 | 110 | AIDEV | nvme_second | 230 GB | Dev code, Docker | CT |
+| 2 | 101 | docker (data) | records | 200 GB | Frigate recordings | mp0 on HDD |
+| 3 | 101 | docker (root) | local-lvm | 104 GB | Docker stack | CT rootfs |
+| 4 | 109 | media | local-lvm | 80 GB | Jellyfin metadata | CT rootfs |
+
+### CT 101 docker: Details
+
+- **Rootfs:** `local-lvm:vm-101-disk-0` (104 GB)
+- **Data volume:** `records:101/vm-101-disk-0.raw` → `/mnt/records` (200 GB, Frigate)
+- **Services:** Frigate, Overleaf/ShareLaTeX, Mongo, Affine, NPMplus, Portainer, Dockge, …
+- **Frigate retention:** 30 days in `config.yaml`; HDD capacity is sufficient, no adjustment needed
+
+### CT 109 media: Details
+
+- **Jellyfin config:** ~18 GB under `/opt/stacks/jellyfin/config`
+  - ~13 GB metadata (posters, artwork): **intentional**, do not delete
+  - NFS mounts for movies/series (no local space used on the CT disk)
+
+### CT 110 AIDEV: Details
+
+- **Code:** `/root/code/` (~69 GB), including `agentic_quant`, `trading_tools`
+- **Docker:** dev images, build cache (worth cleaning up regularly)
+- **Caches:** `.npm`, `.cache`, IDE servers (VS Code, Cursor, …)
+
+## Recommendations (Partially Implemented)
+
+| Measure | Status |
+|---------|--------|
+| Delete unused VMs (kali, dev2) | ✅ done |
+| Backups on `records` | ✅ done |
+| `fstrim` + Docker cleanup | ✅ done |
+| Weekly Ansible maintenance | ✅ set up |
+| Migrate media → nvme_second | ⏳ optional |
+| Remove windows (100) if unused | ⏳ optional |
@@ -0,0 +1,49 @@
+# 03: Deleted VMs
+
+## VM 107: kali
+
+- **Status:** Deleted (`qm destroy 107 --purge`)
+- **Reason:** No longer in use, confirmed by the user
+- **Disks removed:** `vm-107-disk-0` (EFI), `vm-107-disk-1` (32 GB)
+- **Storage:** local-lvm
+- **Freed:** ~19 GB actually released in the thin pool (more after a later `fstrim`)
+
+## VM 105: dev2
+
+### Check Before Deletion
+
+The disk was mounted read-only via `qemu-nbd`. Contents:
+
+| Area | Contents |
+|------|----------|
+| `/home/dev/projects/` | Azure DevOps repos: tornau-mono, buko-mono, traunstein-mono, munich-mono, ng-next |
+| `/home/dev/projects/sveltewelt` | Small local Svelte project (~88 KB), no Git |
+| `/opt/kimai`, `/opt/install` | Docker Compose for Kimai & Cattr (time tracking) |
+| Docker | Kimai + Cattr stack with volumes (~4 GB) |
+
+**Git remotes:** All mono repos are on `ssh.dev.azure.com` (arva-digital). The code is backed up remotely.
+
+**Local peculiarities:**
+- `munich-mono`: uncommitted change in `package-lock.json`
+- `traunstein-mono`: detached HEAD
+- `sveltewelt`: local only
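
The exact commands used for the read-only inspection are not recorded here. As a hedged sketch, such a check could follow the plan printed by the helper below. The function name `nbd_inspect_plan`, the device `/dev/nbd0`, the raw format, and the choice of partition 1 are all assumptions for illustration; the helper only prints commands and does not run them.

```bash
# Print the commands for a read-only inspection of a VM disk via qemu-nbd.
# Usage: nbd_inspect_plan <disk-path> <mountpoint>
# Assumptions: /dev/nbd0 is free, the disk is raw, the root file system is partition 1.
nbd_inspect_plan() {
    cat <<EOF
modprobe nbd max_part=8
qemu-nbd --read-only --format=raw --connect=/dev/nbd0 $1
mount -o ro /dev/nbd0p1 $2
# ... inspect files under $2 ...
umount $2
qemu-nbd --disconnect /dev/nbd0
EOF
}
```

For example, `nbd_inspect_plan /dev/pve/vm-105-disk-0 /mnt/inspect` prints the plan for the dev2 disk, assuming the volume group is named `pve`.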
+
+### Deletion
+
+- **Command:** `qm destroy 105 --purge`
+- **Disk:** `vm-105-disk-0` (50 GB on local-lvm)
+- **Decision:** The user said "get rid of it" after the check
+
+## Reference Commands
+
+```bash
+# Stop the VM (if necessary)
+qm stop <vmid>
+
+# Permanently delete the VM including its disks
+qm destroy <vmid> --purge
+```
+
+## Note
+
+None of the deleted VMs had snapshots.
@@ -0,0 +1,62 @@
+# 04: Backup Configuration
+
+## Goal
+
+Backups should **no longer land on `local`** (system NVMe) but on the **HDD `records`** (~1.6 TB free).
+
+## Changes Made
+
+### 1. Default Storage for vzdump
+
+File: `/etc/vzdump.conf`
+
+```ini
+storage: records
+```
+
+New manual and scheduled backups therefore use `records` by default.
+
+### 2. Backup Removed from `local`
+
+File: `/etc/pve/storage.cfg`
+
+```ini
+dir: local
+    path /var/lib/vz
+    content iso,vztmpl
+```
+
+`backup` was removed from `content`, which makes an accidental backup to the system partition harder.
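
Whether a storage still accepts backups can be verified by parsing `storage.cfg`. This is a minimal sketch that assumes the section layout shown in this document (`<type>: <id>` followed by indented `content` lines); the helper name `storage_has_content` is hypothetical.

```bash
# Succeeds (exit 0) if the given storage lists the given content type.
# Usage: storage_has_content <storage.cfg> <storage-id> <content-type>
# Hypothetical helper for illustration; it only understands the simple layout shown above.
storage_has_content() {
    awk -v id="$2" -v want="$3" '
        /^[a-z]+:[ \t]/ { current = $2 }          # section header, e.g. "dir: local"
        current == id && $1 == "content" {
            n = split($2, types, ",")
            for (i = 1; i <= n; i++) if (types[i] == want) found = 1
        }
        END { exit (found ? 0 : 1) }
    ' "$1"
}
```

With the configuration above, `storage_has_content /etc/pve/storage.cfg local backup` should fail and the same call for `records` should succeed.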
+
+### 3. Storage `records` (Unchanged, Already Correct)
+
+```ini
+dir: records
+    path /mnt/pve/records
+    content iso,vztmpl,backup,images,rootdir
+```
+
+## Existing Backups on local
+
+- Among others, there was an **OPNsense backup (~15 GB)** under `/var/lib/vz/dump/`
+- It was **deleted manually** by the user (a backup is still needed; recreate it if necessary)
+
+### Back Up OPNsense Again
+
+```bash
+vzdump 104 --storage records --mode snapshot --compress zstd
+```
+
+## Scheduled Backups (Cron)
+
+The file `/etc/pve/vzdump.cron` is currently **empty** (no cluster-wide schedule). Backups currently run manually or have to be set up separately in the Proxmox UI or via cron.
+
+## Verify
+
+```bash
+pvesm status
+grep storage /etc/vzdump.conf
+cat /etc/pve/storage.cfg
+ls -lh /mnt/pve/records/dump/
+ls -lh /var/lib/vz/dump/
+```
@@ -0,0 +1,87 @@
+# 05: Cleaning Up LXC Storage
+
+## Key Insight: Thin Pool vs. Guest Usage
+
+Proxmox **thin LVM** (`local-lvm`, `nvme_second`) can report **much higher usage** than the file system **inside** the container.
+
+| CT | Pool usage (before) | `df` in guest (before) |
+|----|---------------------|------------------------|
+| 109 media | ~99 % | ~43 % |
+| 101 docker | ~97 % | ~45 % |
+| 110 AIDEV | ~99 % | ~74 % |
+
+**Cause:** Files/blocks deleted in the guest are only released in the thin pool after **`fstrim`**.
+
+### fstrim: The Biggest Lever
+
+```bash
+pct exec 101 -- fstrim -v /
+pct exec 109 -- fstrim -v /
+pct exec 110 -- fstrim -v /
+```
+
+Result of the session:
+
+| Pool | Before | After |
+|------|--------|-------|
+| local-lvm | ~93 % | ~43 % |
+| nvme_second | ~59 % | ~40 % |
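
The three `fstrim` calls above can be generalized to all running containers. The following is a sketch, assuming the usual `pct list` column layout (VMID, Status, Lock, Name); the function names `running_cts` and `fstrim_running_cts` are illustrative and not part of the existing tooling.

```bash
# Print the VMIDs of all running containers.
# Reads `pct list` output on stdin (header line, then VMID Status [Lock] Name).
running_cts() {
    awk 'NR > 1 && $2 == "running" { print $1 }'
}

# Trim the root file system of every running container.
# stdin of `pct exec` is redirected so it cannot swallow the VMID list.
fstrim_running_cts() {
    pct list | running_cts | while read -r vmid; do
        pct exec "$vmid" -- fstrim -v / </dev/null
    done
}
```

Only `running_cts` is side-effect free; `fstrim_running_cts` needs to run as root on the Proxmox host.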
+
+---
+
+## Docker Cleanup (Done Manually)
+
+On CT **101**, **109**, **110**:
+
+- Remove stopped containers (>7 days)
+- Delete dangling images
+- Clear the build cache (especially AIDEV: ~14 GB)
+- Remove unused images (>14 days) on AIDEV
+- Truncate container logs >50 MB to 10 MB (AIDEV: one log was **2.5 GB**)
+- Limit the journal to ~200 MB (`journalctl --vacuum-size=200M`)
+- `apt-get clean`
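
How the logs were shortened is not documented here. One possible approach, shown as a hedged sketch, keeps only the newest bytes of an oversized file. The helper name `shrink_log` and its byte-based arguments are assumptions, and the first kept line may be cut in half.

```bash
# Shrink a log file to its last <keep> bytes if it is larger than <limit> bytes.
# Usage: shrink_log <file> <limit-bytes> <keep-bytes>
# Sketch only: keeps the newest data; not safe against concurrent writers.
shrink_log() {
    file="$1"; limit="$2"; keep="$3"
    size=$(( $(wc -c < "$file") ))
    if [ "$size" -gt "$limit" ]; then
        tmp="$(mktemp)"
        tail -c "$keep" "$file" > "$tmp"
        cat "$tmp" > "$file"      # rewrite in place so the inode stays the same
        rm -f "$tmp"
    fi
}
```

For the thresholds in the list above, the call would be `shrink_log <file> 52428800 10485760` (50 MB and 10 MB in bytes).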
+
+### CT-Specific Findings
+
+**101 docker**
+- ~38 GB Frigate recordings on `/mnt/records`: **normal**, 30-day retention
+- ~48 GB Mongo (Overleaf): **production data**
+- ~3 GB of old Docker images removed
+
+**109 media**
+- ~4 GB of old Jellyfin/tvheadend images removed
+- 18 GB Jellyfin config, mostly **metadata**; do not touch
+
+**110 AIDEV**
+- ~69 GB `/root/code`: active projects
+- ~14 GB build cache + old images removed
+- 17 unused Docker volumes (~1.2 GB): optionally review manually
+
+---
+
+## What Is Deliberately Not Deleted
+
+| Path / Area | Reason |
+|-------------|--------|
+| `/mnt/records/recordings` | Frigate recordings, HDD has enough space |
+| Jellyfin `metadata/` | Library artwork |
+| Mongo / Overleaf data | Production |
+| `/root/code` on AIDEV | Development projects |
+
+---
+
+## Useful Commands
+
+```bash
+# Disk usage inside the container
+pct exec <vmid> -- df -hT /
+
+# Docker overview
+pct exec <vmid> -- docker system df -v
+
+# Largest directories
+pct exec <vmid> -- du -xh / --max-depth=2 2>/dev/null | sort -hr | head -20
+
+# Find large Docker logs
+pct exec <vmid> -- find /var/lib/docker/containers -name '*-json.log' -size +50M -exec ls -lh {} \;
+```
@@ -0,0 +1,129 @@
+# 06: Ansible Automation
+
+## Concept
+
+**Ansible does not create any cron jobs inside the containers.**
+
+Instead:
+
+1. A weekly **cron job** exists on the **Proxmox host**
+2. The cron job starts a **shell script**
+3. The script runs **`ansible-playbook`**
+4. Ansible connects to the CTs via **SSH** and runs the maintenance tasks
+
+```
+/etc/cron.weekly/pve-lxc-disk-maintenance
+        ↓ (Symlink)
+/root/ansible/run-disk-maintenance.sh
+        ↓
+ansible-playbook playbooks/disk-maintenance.yml
+        ↓ SSH
+   docker (101) · media (109) · AIDEV (110)
+```
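
The contents of `run-disk-maintenance.sh` are not part of this commit. As a rough sketch of what such a wrapper might look like, the function below changes into the Ansible directory, runs the playbook, and appends everything to the log. The variables `ANSIBLE_DIR`, `LOG_FILE`, and `PLAYBOOK_CMD` are overridable assumptions introduced for illustration.

```bash
# Hypothetical sketch of a wrapper like run-disk-maintenance.sh (real script not shown here).
# ANSIBLE_DIR, LOG_FILE and PLAYBOOK_CMD are assumptions to make the sketch testable.
run_disk_maintenance() {
    ansible_dir="${ANSIBLE_DIR:-/root/ansible}"
    log_file="${LOG_FILE:-/var/log/pve-lxc-disk-maintenance.log}"
    rc=0
    {
        echo "=== $(date '+%Y-%m-%d %H:%M:%S') disk maintenance start ==="
        # PLAYBOOK_CMD is intentionally unquoted so it may carry extra arguments.
        ( cd "$ansible_dir" && ${PLAYBOOK_CMD:-ansible-playbook} playbooks/disk-maintenance.yml ) || rc=$?
        echo "=== finished with exit code $rc ==="
    } >> "$log_file" 2>&1
    return "$rc"
}
```

Returning the playbook's exit code lets cron or a monitoring check notice failed runs instead of relying on the log alone.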
+
+## Directory Structure
+
+```
+/root/ansible/
+├── ansible.cfg
+├── run-disk-maintenance.sh      → called by cron.weekly
+├── inventory/
+│   ├── hosts.yml                → hosts + CT-specific variables
+│   └── group_vars/all.yml       → global thresholds
+├── playbooks/
+│   └── disk-maintenance.yml
+└── roles/
+    └── disk_cleanup/
+        ├── defaults/main.yml
+        ├── tasks/main.yml
+        └── handlers/main.yml
+```
+
+## Managed Hosts
+
+| Ansible Host | VMID | IP | Notes |
+|--------------|------|-----|-------|
+| docker | 101 | 192.168.10.101 | Frigate paths on `/mnt/records` |
+| media | 109 | 192.168.20.6 | Jellyfin cache path |
+| aidev | 110 | 10.100.2.13 | Dev tooling optional |
+
+SSH as `root` from the Proxmox host; key auth was already set up.
+
+## What the Playbook Does
+
+| Task | Description |
+|------|-------------|
+| Journal | `journalctl --vacuum-size=200M` |
+| apt | `autoclean` / `autoremove` / `clean` |
+| Docker logs | Truncate files >50 MB to 10 MB |
+| Docker | Stopped containers (>7 d), dangling images, build cache (>14 d) |
+| Docker volumes | Only **dangling** volumes |
+| daemon.json | Log limits `10m` × `3`, only if the file does not exist yet |
+| fstrim | `/` in the container (**important for the thin pool**) |
+| Frigate | Delete recording folders older than 30 days |
+| Jellyfin | Delete cache files older than 30 days |
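
The `daemon.json` rule from the table can be illustrated in shell. This is a sketch of the "only if the file does not exist yet" behavior, not the role's actual task; the function name `ensure_docker_log_limits` is hypothetical. The keys `log-driver`, `log-opts`, `max-size`, and `max-file` are standard Docker daemon settings.

```bash
# Write Docker log limits (10 MB x 3 files) unless a daemon.json already exists.
# Usage: ensure_docker_log_limits [path]   (default: /etc/docker/daemon.json)
# Hypothetical helper; an existing file is never modified.
ensure_docker_log_limits() {
    target="${1:-/etc/docker/daemon.json}"
    if [ -e "$target" ]; then
        echo "skipped: $target already exists"
        return 0
    fi
    cat > "$target" <<'EOF'
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  }
}
EOF
    echo "created: $target"
}
```

These limits only apply to containers created after the Docker daemon has been restarted; existing containers keep their old log settings until they are recreated.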
+
+### Tags (Optional)
+
+```bash
+# Everything (default)
+ansible-playbook playbooks/disk-maintenance.yml
+
+# Only the additional, aggressive image cleanup
+ansible-playbook playbooks/disk-maintenance.yml --tags aggressive
+
+# Only Frigate or Jellyfin
+ansible-playbook playbooks/disk-maintenance.yml --tags frigate
+ansible-playbook playbooks/disk-maintenance.yml --tags jellyfin
+```
+
+## Cron
+
+```bash
+ls -la /etc/cron.weekly/pve-lxc-disk-maintenance
+# → symlink to /root/ansible/run-disk-maintenance.sh
+```
+
+- **Interval:** `cron.weekly` (typically Sunday morning)
+- **Log:** `/var/log/pve-lxc-disk-maintenance.log`
+
+### Disable Cron
+
+```bash
+rm /etc/cron.weekly/pve-lxc-disk-maintenance
+```
+
+### Switch Cron to Daily (Example)
+
+```bash
+echo '0 3 * * * root /root/ansible/run-disk-maintenance.sh' > /etc/cron.d/pve-lxc-disk-maintenance
+```
+
+## Adjusting the Configuration
+
+Global values: `/root/ansible/inventory/group_vars/all.yml`
+
+```yaml
+journal_max_size: 200M
+docker_prune_stopped_containers_older_than: 168h   # 7 days
+docker_prune_unused_images_older_than: 336h        # 14 days (tag: aggressive)
+frigate_recordings_retain_days: 30
+jellyfin_cache_max_age_days: 30
+fstrim_enabled: true
+```
+
+Host-specific settings go in `inventory/hosts.yml` (e.g. Frigate paths only on `docker`).
+
+## Prerequisites
+
+- **Ansible** installed on the Proxmox host (`apt install ansible`)
+- **SSH** from the host to the CTs as root
+- CTs must be running (for SSH)
+
+## Manual Test
+
+```bash
+/root/ansible/run-disk-maintenance.sh
+# or
+cd /root/ansible && ansible-playbook playbooks/disk-maintenance.yml
+```
@@ -0,0 +1,66 @@
+# 07: Current State
+
+State after completing the cleanup and setting up the automation.
+
+## Storage Pools
+
+| Storage | Usage | Free (approx.) | Note |
+|---------|-------|----------------|------|
+| local | ~32 % | ~60 GB | System partition |
+| **local-lvm** | **~43 %** | **~200 GB** | was ~93 % |
+| **nvme_second** | **~40 %** | **~280 GB** | was ~59 % |
+| records | ~12 % | ~1.6 TB | HDD, plenty of headroom |
+
+## VMs / Containers
+
+### VMs
+100 windows · 102 klipper · 104 opnsense · 106 homeassistant
+
+### Containers
+101 docker · 103 pve-scripts-local · 109 media · 110 AIDEV
+
+**Removed:** 107 kali · 105 dev2
+
+## Active Automation
+
+| Component | Path |
+|-----------|------|
+| Ansible project | `/root/ansible/` |
+| Weekly cron | `/etc/cron.weekly/pve-lxc-disk-maintenance` |
+| Maintenance log | `/var/log/pve-lxc-disk-maintenance.log` |
+| Documentation | `/root/docu/` |
+
+## Open / Optional Items
+
+1. **Recreate the OPNsense backup** on `records` (the old backup was deleted by mistake)
+   ```bash
+   vzdump 104 --storage records --mode snapshot --compress zstd
+   ```
+
+2. **VM 100 windows**: stopped, ~100 GB on nvme_second; delete if unused
+
+3. **Migrate CT 109 media** to `nvme_second`, which relieves local-lvm in the long run
+
+4. **AIDEV Docker volumes**: ~17 unused volumes (~1.2 GB), review manually:
+   ```bash
+   pct exec 110 -- docker volume ls -f dangling=true
+   ```
+
+5. **Frigate retention**: with plenty of HDD space, leave unchanged (30 days)
+
+## Key Lessons Learned
+
+1. **Run `fstrim` regularly** on LXC guests; otherwise the thin pool appears to fill up rapidly
+2. **Limit Docker logs**: missing `max-size`/`max-file` settings led to multi-GB logs
+3. **Backups on HDD**: do not use the system NVMe for vzdump
+4. **Thin pool ≠ guest `df`**: always check both
+
+## Checks for the Next Maintenance
+
+```bash
+# Thin pools
+lvs pve/data nvme_second/nvme_second -o data_percent
+
+# Last Ansible run
+tail -50 /var/log/pve-lxc-disk-maintenance.log
+```
@@ -0,0 +1,32 @@
+# GPU Idle & Power Monitoring: pve2
+
+Detailed docs in the repo: [`/root/code/pve-power-mqtt/docs/gpu-idle-pve2.md`](/root/code/pve-power-mqtt/docs/gpu-idle-pve2.md)
+
+## Summary
+
+- **2× GTX 1080** on the host driver (no VM passthrough)
+- Idle target: **~8 W/GPU, P8**, which requires **persistence mode**
+- **Frigate (CT 101):** Intel iGPU/VAAPI only → no NVIDIA mounts in the LXC
+- **AIDEV (CT 110):** NVIDIA mounts for ML as needed
+- **Power agent:** `pve-power-mqtt` → MQTT/Home Assistant
+
+## Quick Check
+
+```bash
+nvidia-smi --query-gpu=index,power.draw,pstate,persistence_mode --format=csv
+systemctl status nvidia-persistenced pve-power-mqtt
+grep nvidia /etc/pve/lxc/*.conf
+```
+
+## Install persistenced (from the Repo)
+
+```bash
+cd /root/code/pve-power-mqtt
+./deploy/nvidia-persistenced/install.sh
+```
+
+## Changes in June 2026
+
+- CT 101: NVIDIA device mounts commented out
+- `nvidia-persistenced`: broken `--user-persistence-mode` flag removed, service is running
+- Docs in the `server-power` repo under `docs/`
@@ -0,0 +1,168 @@
+# GPU Idle & Power Monitoring: pve2
+
+As of: June 2026 · Host: **pve2** · 2× NVIDIA GeForce GTX 1080
+
+## Goal
+
+In headless operation, the GTX 1080 cards should draw **~6–8 W per GPU (P8)** at idle and remain available for LXC/compute on demand, **without** GPU passthrough to VMs.
+
+Typical idle measurement with a correct configuration:
+
+```
+GPU0: ~8–9 W, P8, Core 139 MHz
+GPU1: ~8–9 W, P8, Core 139 MHz
+```
+
+---
+
+## Causes of High Idle Consumption (P0/P5, 40–70 W Total)
+
+| Cause | Symptom | Fix |
+|-------|---------|-----|
+| **Persistence mode off / daemon dead** | P0 or P5, `Idle: Not Active` | `nvidia-persistenced` + `-pm 1` (see below) |
+| **NVIDIA devices in an LXC without use** | Driver keeps GPUs awake, clock switches 139↔1657 MHz | Remove mounts (CT 101) |
+| **Real GPU load** | Processes in `nvidia-smi`, encoder/decoder > 0 | Find and stop the process |
+| **Frequent `nvidia-smi` polling** | Brief wake-ups | Measure the GPU less often (e.g. every 60 s) |
+| **New driver (580.x) without PM** | Pascal stays in P5 | Persistence mode is mandatory |
+
+**Not applicable to pve2:** VFIO GPU passthrough (the GPUs are bound to the host driver, not to VMs).
+
+---
+
+## 1. NVIDIA Persistence Mode (Mandatory for Headless P8)
+
+On headless Linux systems without a display, the driver often does **not** reliably switch GTX 1080 cards back to **P8** after load when persistence mode is off.
+
+### Check
+
+```bash
+nvidia-smi --query-gpu=index,power.draw,pstate,persistence_mode,clocks.gr --format=csv
+```
+
+Expected at idle: `P8`, `Enabled`, ~139 MHz core.
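
This expectation can be turned into a small check. The sketch below assumes the query is run with `--format=csv,noheader` and the four fields `index,power.draw,pstate,persistence_mode`; the helper name `gpus_not_idle` is an assumption for illustration.

```bash
# Print the index of every GPU that is not in P8 or has persistence mode disabled.
# Reads CSV lines "index, power.draw, pstate, persistence_mode" on stdin, e.g. from:
#   nvidia-smi --query-gpu=index,power.draw,pstate,persistence_mode --format=csv,noheader
gpus_not_idle() {
    awk -F', *' '$3 != "P8" || $4 != "Enabled" { print $1 }'
}
```

An empty result means both cards are in the expected idle state; any printed index points at the GPU to investigate.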
+
+### Permanent Setup (pve2)
+
+The service files are in the repo under `deploy/nvidia-persistenced/`:
+
+```bash
+cp deploy/nvidia-persistenced/nvidia-persistenced.service /etc/systemd/system/
+mkdir -p /etc/systemd/system/nvidia-persistenced.service.d
+cp deploy/nvidia-persistenced/override.conf /etc/systemd/system/nvidia-persistenced.service.d/
+systemctl daemon-reload
+systemctl enable --now nvidia-persistenced
+```
+
+**Important:** The option `--user-persistence-mode` is invalid and made the service exit immediately, hence the fix in the repo.
+
+### Manually (Until Reboot)
+
+```bash
+nvidia-smi -pm 1
+```
+
+### Check Status
+
+```bash
+systemctl status nvidia-persistenced
+nvidia-smi -q | grep -E 'Performance State|Idle|Persistence'
+fuser -v /dev/nvidia*    # should show no clients when idle (nvidia-persistenced itself may be listed)
+```
+
+---
+
+## 2. LXC: Who Needs `/dev/nvidia*`?
+
+| CT | Name | GPU Mounts | Reason |
+|----|------|------------|--------|
+| **101** | docker | **No** (removed) | Frigate uses the **Intel iGPU (VAAPI)**; NVIDIA is commented out in `compose.yml` |
+| **110** | AIDEV | **Yes** | Jupyter/ML as needed |
+| **109** | media | Only when actively used | Stopped → no mount needed |
+
+### Frigate (CT 101)
+
+- Detector: **OpenVINO** (CPU/iGPU)
+- Record streams: `hwaccel_args: preset-vaapi`
+- Docker devices: `/dev/dri/renderD128`, `/dev/dri/card0` (Intel)
+- **No** NVIDIA `deploy.resources.devices` entry in `compose.yml`
+
+The NVIDIA bind mounts in `/etc/pve/lxc/101.conf` are commented out (June 2026). After restarting the CT:
+
+```bash
+pct reboot 101
+pct exec 101 -- ls /dev/nvidia*    # should be missing
+pct exec 101 -- docker ps          # frigate healthy
+```
+
+---
+
+## 3. Power Monitoring (`pve-power-mqtt`)
+
+The agent reads GPU power via `nvidia-smi`. This is **not** a sustained compute load, but it can briefly wake the GPUs from P8.
+
+Recommendation:
+
+- CPU/RAPL: every **5 s** (MQTT → Home Assistant)
+- GPU: measure less often, or only when `estimated_total` needs the GPU share
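
The idea of sampling the GPU less often than the CPU can be illustrated with a simple tick counter. This is a shell sketch of the scheduling logic only, not the Go agent's code; the function name `should_sample_gpu` and the tick convention are assumptions.

```bash
# Decide whether the GPU should be sampled on a given CPU tick.
# Usage: should_sample_gpu <tick> <cpu_interval_s> <gpu_interval_s>
# Ticks count CPU samples starting at 0; succeeds (exit 0) on every N-th tick,
# where N = gpu_interval / cpu_interval (at least 1).
should_sample_gpu() {
    every=$(( $3 / $2 ))
    [ "$every" -lt 1 ] && every=1
    [ $(( $1 % every )) -eq 0 ]
}
```

With a 5 s CPU interval and a 60 s GPU interval, the GPU is queried on ticks 0, 12, 24, and so on, which reduces `nvidia-smi` calls by a factor of twelve.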
+
+Verification without agent influence:
+
+```bash
+systemctl stop pve-power-mqtt
+sleep 60
+nvidia-smi --query-gpu=power.draw,pstate --format=csv
+systemctl start pve-power-mqtt
+```
+
+---
+
+## 4. Troubleshooting
+
+### GPUs Stuck in P0/P5 (~15–45 W)
+
+1. `systemctl status nvidia-persistenced`: is it running?
+2. `nvidia-smi -pm 1`
+3. Identify LXCs with GPU mounts: `grep nvidia /etc/pve/lxc/*.conf`
+4. Processes: `nvidia-smi`, `fuser -v /dev/nvidia*`
+5. Reboot the host (last resort)
+
+### Switching Between 139 MHz and 1657 MHz (One GPU at ~44 W)
+
+Typical when **multiple consumers** access the driver (LXC mounts + monitoring). Running CT 101 without NVIDIA mounts fixes a large part of it.
+
+### Dummy HDMI Plug
+
+Only consider this if P8 is still unreachable **with** persistence mode. Currently **not needed** on pve2 (P8 is stable with persistenced).
+
+### Driver
+
+Installed: **580.95.05** (Pascal). For idle, persistence mode matters more than a downgrade; test 550.x LTS if needed.
+
+---
+
+## 5. Reference Commands
+
+```bash
+# Quick check
+nvidia-smi
+
+# Power over time (15 s interval)
+watch -n 15 'nvidia-smi --query-gpu=index,power.draw,pstate,clocks.gr --format=csv'
+
+# LXC GPU configuration
+grep -H nvidia /etc/pve/lxc/*.conf
+
+# MQTT power agent
+systemctl status pve-power-mqtt
+journalctl -u pve-power-mqtt -f
+```
+
+---
+
+## Change History
+
+| Date | Change |
+|------|--------|
+| 2026-06-27 | CT 101: NVIDIA mounts removed (Frigate = VAAPI) |
+| 2026-06-27 | `nvidia-persistenced` repaired (invalid CLI flag removed) |
+| 2026-06-27 | Docs created, P8 idle ~8 W per GPU verified |
@@ -0,0 +1,91 @@
+# pve2: Host Infrastructure
+
+**IP:** 192.168.10.4 · **GPU:** 2× NVIDIA GeForce GTX 1080 · **Driver:** 580.95.05
+
+## Physical Disks
+
+| Device | Size | Usage |
+|--------|------|-------|
+| nvme0n1 | ~477 GB | System, `local-lvm` |
+| nvme1n1 | ~477 GB | `nvme_second` (e.g. AIDEV) |
+| sda | 1.8 TB | `records`: recordings, backups, Docker data |
+
+Details: [01_System-und-Speicher-Uebersicht.md](01_System-und-Speicher-Uebersicht.md)
+
+## Storage Pools
+
+| Pool | Type | Content |
+|------|------|---------|
+| local | dir | ISO, templates |
+| local-lvm | lvmthin | VM/CT on nvme0 |
+| nvme_second | lvmthin | VM/CT on nvme1 |
+| records | dir | HDD: backups, Frigate, large data |
+
+## VMs (Selection)
+
+| VMID | Name | Role |
+|------|------|------|
+| 104 | opnsense | **Router/firewall**, production |
+| 106 | homeassistant | Home Assistant + Mosquitto |
+
+## Containers (Selection)
+
+| CTID | Name | GPU | Role |
+|------|------|-----|------|
+| 101 | docker | **No** (NVIDIA removed) | Frigate, Compose stack |
+| 109 | media | optional | Media (often stopped) |
+| 110 | AIDEV | **Yes** | Jupyter/ML |
+
+### GPU Mount Policy for LXC
+
+| CT | `/dev/nvidia*` | Reason |
+|----|----------------|--------|
+| 101 | **No** | Frigate = OpenVINO + Intel VAAPI |
+| 110 | **Yes** | ML as needed |
+| 109 | only when active | Stopped → no mount |
+
+Configuration: `/etc/pve/lxc/101.conf`, with the NVIDIA lines commented out (`#lxc.mount.entry%3A ...`).
+
+Frigate in CT 101:
+- Detector: OpenVINO (CPU/iGPU)
+- `hwaccel_args: preset-vaapi`
+- Devices: `/dev/dri/renderD128`, `/dev/dri/card0`
+- NVIDIA commented out in `compose.yml`
+
+## NVIDIA on the Host
+
+```bash
+nvidia-smi
+systemctl status nvidia-persistenced
+```
+
+Persistence mode is **mandatory** for P8 idle (~8 W/GPU). The service files are also in the server-power repo: `deploy/nvidia-persistenced/`.
+
+Full GPU docs: [09_GPU-Idle-vollstaendig.md](09_GPU-Idle-vollstaendig.md)
+
+## Host Services
+
+| Service | Purpose |
+|---------|---------|
+| `nvidia-persistenced` | GPU persistence mode |
+| `pve-power-mqtt` | RAPL + nvidia-smi → MQTT |
+| Proxmox | Web :8006 |
+
+## Git / Docs on This Host
+
+| Path | Content |
+|------|---------|
+| `/root/docu-repo` | docu repo |
+| `/root/code/pve-power-mqtt` | Go agent + GPU docs |
+| `/root/docu/` | Legacy local copy (optionally replace with docu-repo) |
+| `/root/.git-credentials-jeanavril` | Gitea token |
+
+## Ansible
+
+Playbooks: see [02_Ansible-Playbooks.md](02_Ansible-Playbooks.md)
+
+## Maintenance
+
+- `fstrim` in VMs/CTs for thin pools
+- Backups go to `records`, not `local`
+- Storage monitoring: [05_Wartung-und-Monitoring.md](05_Wartung-und-Monitoring.md)
@@ -0,0 +1,95 @@
+# pve2: Power MQTT Agent
+
+CPU (Intel RAPL) + GPU (`nvidia-smi`) → MQTT → Home Assistant.
+
+## Installation
+
+| Component | Path |
+|-----------|------|
+| Binary | `/usr/local/bin/pve-power-mqtt` |
+| systemd | `/etc/systemd/system/pve-power-mqtt.service` |
+| Env | `/etc/pve-power-mqtt.env` |
+| Source code | `/root/code/pve-power-mqtt` |
+| Repo | https://git.jeanavril.com/jean/server-power.git |
+
+## Configuration `/etc/pve-power-mqtt.env`
+
+```ini
+POWER_MQTT_BROKER=tcp://homeassistant.iot:1883
+POWER_MQTT_USER=server
+POWER_MQTT_PASSWORD="<redacted>"
+POWER_MQTT_HOSTNAME=pve2
+POWER_MQTT_DISCOVERY=true
+```
+
+Client ID: **`pve-power-mqtt-pve2`**
+
+## HA Sensors
+
+| Entity | Source |
+|--------|--------|
+| sensor.pve2_cpu_power | RAPL |
+| sensor.pve2_gpu0_power | GPU 0 |
+| sensor.pve2_gpu1_power | GPU 1 |
+| sensor.pve2_estimated_total | CPU + GPU0 + GPU1 |
+
+`estimated_total` is **not** a PSU/PDU reading.
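
As a worked example of how the sum is formed, the sketch below adds three readings in watts. The function name and the sample values are illustrative assumptions; the real agent computes the value in Go.

```bash
# Sum CPU and GPU power readings (watts) into one estimate with one decimal place.
# Usage: estimated_total <cpu_w> <gpu0_w> <gpu1_w>
# Illustrative only: fans, drives, RAM and PSU losses are not included.
estimated_total() {
    awk -v c="$1" -v g0="$2" -v g1="$3" 'BEGIN { printf "%.1f\n", c + g0 + g1 }'
}
```

For instance, `estimated_total 35.0 8.5 8.7` yields 52.2, so the value at the wall socket will always be higher than this estimate.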
+
+Broker details: [../shared/mqtt-homeassistant.md](../shared/mqtt-homeassistant.md)
+
+## Build & Deploy
+
+```bash
+cd /root/code/pve-power-mqtt
+git pull
+export PATH="/usr/local/go/bin:$PATH"
+go build -o pve-power-mqtt ./cmd/pve-power-mqtt
+install -m 755 pve-power-mqtt /usr/local/bin/pve-power-mqtt
+systemctl restart pve-power-mqtt
+```
+
+## NVIDIA Persistence (Prerequisite for Meaningful GPU Idle Values)
+
+```bash
+systemctl status nvidia-persistenced
+nvidia-smi --query-gpu=power.draw,pstate,persistence_mode --format=csv
+```
+
+Expected at idle: **P8**, ~8–9 W per GTX 1080.
+
+See [09_GPU-Idle-vollstaendig.md](09_GPU-Idle-vollstaendig.md)
+
+## Agent vs. GPU Idle
+
+Running `nvidia-smi` every **5 s** can briefly wake the GPUs from P8. For a pure idle measurement:
+
+```bash
+systemctl stop pve-power-mqtt
+sleep 60
+nvidia-smi --query-gpu=power.draw,pstate --format=csv
+systemctl start pve-power-mqtt
+```
+
+Optional: increase the GPU interval in the code (e.g. 60 s); see the server-power repo.
+
+## Operation
+
+```bash
+systemctl status pve-power-mqtt
+journalctl -u pve-power-mqtt -f
+```
+
+## Fixes (History)
+
+- `expire_after` / `availability_topic` removed from discovery (HA "unavailable")
+- Unique client IDs per host
+- Keepalive 120 s, ping timeout 30 s
+- MQTT reconnect logging
+
+## Troubleshooting
+
+| Problem | Solution |
+|---------|----------|
+| GPU unavailable in HA | Is the agent running? Does `nvidia-smi` work on the host? |
+| High GPU idle values | Check persistence + LXC mounts (CT 101 without NVIDIA) |
+| MQTT timeout | VLAN 10→40, is the broker homeassistant.iot reachable? |