Initial infrastructure documentation for pve1 and pve2.
Covers host docs, MQTT/HA, Git setup, power monitoring, and GPU idle (pve2). Co-authored-by: Cursor <cursoragent@cursor.com>
@@ -0,0 +1,59 @@
# Infrastructure Documentation (private)

Central documentation for the Proxmox environment **jeanavril**.

**Git:** https://git.jeanavril.com/jean/docu.git

## Hosts

| Host | IP (management) | Role | Docs |
|------|-----------------|------|------|
| **pve1** | 192.168.10.3 | Primary Proxmox, fallback OPNsense | [pve1/](pve1/) |
| **pve2** | 192.168.10.4 | Production Proxmox, router, GPU compute | [pve2/](pve2/) |

Internal DNS: `*.iot` → VLAN 40 (e.g. `homeassistant.iot` → 192.168.40.254)

## Directory layout

```
docu/
├── README.md ← this file
├── shared/ ← cross-host (MQTT, Git, network)
├── pve1/ ← pve1 only
└── pve2/ ← pve2 only
```

## Shared (both hosts)

| File | Contents |
|------|----------|
| [shared/infrastruktur-netzwerk.md](shared/infrastruktur-netzwerk.md) | VLANs, IPs, bridges |
| [shared/mqtt-homeassistant.md](shared/mqtt-homeassistant.md) | MQTT broker, HA discovery, credentials |
| [shared/git-und-repos.md](shared/git-und-repos.md) | Gitea, tokens, clone paths |

## Code repos (separate from this documentation)

| Repo | URL | Contents |
|------|-----|----------|
| **server-power** | https://git.jeanavril.com/jean/server-power.git | Go agent `pve-power-mqtt` |
| **docu** | https://git.jeanavril.com/jean/docu.git | This documentation |

## Editing and pushing from a host

```bash
cd /root/docu-repo
git pull
# edit files under pve1/ or pve2/
git add -A && git commit -m "Description" && git push
```

Clone path on both nodes: **`/root/docu-repo`**

## Quick reference: power monitoring

```bash
systemctl status pve-power-mqtt nvidia-persistenced # nvidia-persistenced: pve2 only (GPU)
journalctl -u pve-power-mqtt -f
```

As of: June 2026
@@ -0,0 +1,38 @@
# pve1: Documentation

**Host:** pve1 · **IP:** 192.168.10.3 · **Role:** Primary Proxmox, fallback router (OPNsense)

## Table of contents

| No. | File | Topic |
|-----|------|-------|
| - | [00_README.md](00_README.md) | This overview |
| - | [infrastructure-host.md](infrastructure-host.md) | Hardware, CT/VM, storage |
| - | [power-mqtt-agent.md](power-mqtt-agent.md) | CPU power → MQTT/HA |
| 01 | [01_uebersicht.md](01_uebersicht.md) | System overview |
| 02 | [02_netzwerk.md](02_netzwerk.md) | Bridges, VLANs |
| 03 | [03_backup_restore.md](03_backup_restore.md) | Backup & restore |
| 04 | [04_fallback_aktivierung.md](04_fallback_aktivierung.md) | OPNsense fallback |
| 05 | [05_speicher_wartung.md](05_speicher_wartung.md) | Storage & maintenance |

## Shared

- [MQTT & HA](../shared/mqtt-homeassistant.md)
- [Git & repos](../shared/git-und-repos.md)
- [Network](../shared/infrastruktur-netzwerk.md)

## pve1 specifics

- **No NVIDIA GPU** → the power agent measures CPU only (Intel RAPL)
- **Fallback OPNsense** VM 104 (clone from pve2): start it only if pve2 fails
- CT **100** (files): file server

## Quick commands

```bash
systemctl status pve-power-mqtt
qm list && pct list
cd /root/docu-repo && git pull
```

As of: June 2026
@@ -0,0 +1,20 @@
# Overview

## Infrastructure

| Host | IP | Role |
|------|-----|------|
| pve1 | 192.168.10.3 | Primary Proxmox host, fallback router |
| pve2 | 192.168.10.4 | Production Proxmox host, active router |

## Fallback router

- **VMID:** 104
- **Name:** `opnsense-fallback`
- **Source:** OPNsense (VM 104) on pve2
- **Status:** stopped, `onboot: 0`
- **Purpose:** replacement router in case pve2 or the original OPNsense fails

## Important

The fallback VM must **not run in parallel** with the original on pve2: both carry the same configuration and IPs and would collide.
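
This rule can be enforced with a small guard before the fallback is started. The following is a minimal sketch, assuming that `qm status <vmid>` prints `status: running` for a running VM and that pve2 is reachable over SSH at 192.168.10.4; the function names are hypothetical and not part of any repo described here.

```bash
#!/usr/bin/env bash
# Hypothetical guard: refuse to start the fallback while the original runs.

# vm_is_running STATUS_OUTPUT -> exit 0 if the `qm status` output reports "running".
vm_is_running() {
    case "$1" in
        *"status: running"*) return 0 ;;
        *) return 1 ;;
    esac
}

# fallback_decision STATUS_OUTPUT -> prints "abort" or "start".
# An empty string (pve2 unreachable) counts as "not running".
fallback_decision() {
    if vm_is_running "$1"; then
        echo "abort"
    else
        echo "start"
    fi
}

# Example wiring on pve1 (not executed here):
#   orig="$(ssh -o ConnectTimeout=5 root@192.168.10.4 'qm status 104' 2>/dev/null || true)"
#   [ "$(fallback_decision "$orig")" = "start" ] && qm start 104
```

Treating an unreachable pve2 as "not running" matches the failover case, but it cannot tell a dead host from a broken management link, so a manual check is still advisable.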
@@ -0,0 +1,31 @@
# Network

## pve1 (192.168.10.3)

| Bridge | Type | Usage |
|--------|------|-------|
| vmbr0 | VLAN-aware, attached to `nic0` | WAN / management (192.168.10.3/24) |
| vmbr1 | Internal, no physical ports | LAN side of the router |

`vmbr1` was created for the fallback VM (mirroring pve2):

```
auto vmbr1
iface vmbr1 inet manual
    bridge-ports none
    bridge-stp off
    bridge-fd 0
```

## OPNsense fallback (VM 104)

| Interface | Bridge | MAC |
|-----------|--------|-----|
| net0 | vmbr0 | BC:24:11:A1:B2:C3 |
| net1 | vmbr1 | BC:24:11:D4:E5:F6 |

The MAC addresses were deliberately set to differ from those on pve2 to avoid conflicts.

## Note

pve1 has only **one physical NIC**. During a failover, cables may need to be replugged or the VLAN configuration checked.
@@ -0,0 +1,35 @@
# Backup & Restore

## Procedure as performed (2026-06-27)

### 1. Backup on pve2

```bash
ssh root@192.168.10.4
vzdump 104 --mode snapshot --compress zstd --storage records
```

- Snapshot mode: OPNsense keeps running during the backup
- Storage `records` (because `local` no longer has `backup` content)
- Result: ~15 GB `.vma.zst`

### 2. Transfer to pve1

```bash
scp root@192.168.10.4:/mnt/pve/records/dump/vzdump-qemu-104-*.vma.zst /var/lib/vz/dump/
```

### 3. Restore on pve1

```bash
qmrestore /var/lib/vz/dump/vzdump-qemu-104-*.vma.zst 104 --storage local-lvm
```

### 4. Cleanup

The backup file was deleted after the successful restore (~15 GB freed on `/`).

## Known pitfalls

- An interrupted `qmrestore` leaves orphaned LVs (`vm-104-disk-*`) → remove them manually with `lvremove -f`
- Check for enough space before restoring: ~15 GB backup + ~32 GB VM disk
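
The space check from the last pitfall can be scripted. This is a rough sketch, assuming the archive lands on `/` (as in step 2) and that 15 GB is the amount to reserve; `has_room` and `free_kb` are hypothetical helpers, `free_kb` relies on GNU `df`, and free space in the `local-lvm` thin pool for the ~32 GB disk has to be checked separately (for example with `lvs`).

```bash
#!/usr/bin/env bash
# Hypothetical helpers: is there room for the backup archive?

# has_room AVAILABLE_KB REQUIRED_GB -> exit 0 if the available space suffices.
has_room() {
    local available_kb="$1" required_gb="$2"
    local required_kb=$(( required_gb * 1024 * 1024 ))
    [ "$available_kb" -ge "$required_kb" ]
}

# free_kb MOUNTPOINT -> free space in KiB (GNU df).
free_kb() {
    df --output=avail -k "$1" | tail -n 1 | tr -d ' '
}

# Example (not executed here):
#   has_room "$(free_kb /)" 15 || echo "not enough space on / for the backup" >&2
```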
@@ -0,0 +1,47 @@
# Activating the Fallback

## Prerequisites

- OPNsense on pve2 is stopped, or pve2 is down
- Network cables/VLANs are prepared for pve1

## Steps

### 1. Stop the original (if still reachable)

```bash
ssh root@192.168.10.4 "qm stop 104"
```

### 2. Start the fallback

```bash
qm start 104
```

### 3. Check the console

```bash
qm terminal 104
```

### 4. Test the network

- Open the OPNsense web UI (default: LAN interface)
- Check gateway/DHCP/DNS

## Ending the fallback

```bash
qm stop 104
```

Then start the original on pve2 again:

```bash
ssh root@192.168.10.4 "qm start 104"
```

## Updating the fallback copy

After major OPNsense changes, repeat the backup/restore (see `03_backup_restore.md`).
@@ -0,0 +1,36 @@
# Storage & Maintenance

## Storage on pve1 (state after setup)

| Storage | Usage |
|---------|-------|
| `/` (local) | ISOs, templates, backups |
| local-lvm | VM disks (opnsense-fallback: 32 GB) |

## Checking storage

```bash
df -h /
pvesm status
lvs pve
```

## Removing orphaned disks

If a restore fails:

```bash
qm destroy 104 --purge 1
lvremove -f pve/vm-104-disk-0 pve/vm-104-disk-1
```

## Showing the VM configuration

```bash
qm config 104
qm list
```

## Thin-pool warning

During a restore, LVM may warn that the sum of the thin volumes exceeds the pool size. If needed, extend the thin pool or remove unused VMs/disks.
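
Whether that warning matters in practice depends on how full the pool actually is. A minimal sketch that compares the pool's data usage against a threshold, assuming the Proxmox default thin pool name `pve/data` and an arbitrary limit of 80 %; both values are assumptions, and `pool_over_limit` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Hypothetical check: is the thin pool fuller than a given limit?

# pool_over_limit USED_PERCENT LIMIT_PERCENT -> exit 0 if used >= limit.
# USED_PERCENT may carry padding and decimals as printed by lvs (e.g. "  43.21").
pool_over_limit() {
    local used limit="$2"
    used="$(printf '%s' "$1" | tr -d ' ')"
    used="${used%%[.,]*}"       # keep the integer part only
    [ "${used:-0}" -ge "$limit" ]
}

# Example (not executed here; the pool name pve/data is an assumption):
#   used="$(lvs --noheadings -o data_percent pve/data)"
#   pool_over_limit "$used" 80 && echo "thin pool above 80 %" >&2
```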
@@ -0,0 +1,66 @@
# pve1: Host Infrastructure

**IP:** 192.168.10.3 · **Gateway:** 192.168.10.1 (when OPNsense runs on pve2)

## Hardware / storage

| Device | Size | Usage |
|--------|------|-------|
| nvme0n1 | ~477 GB | Proxmox system, local-lvm |
| sda | varies | records / backups (if configured) |

Details: [05_speicher_wartung.md](05_speicher_wartung.md)

## VMs

| VMID | Name | Status | Role |
|------|------|--------|------|
| 104 | opnsense-fallback | **stopped** | OPNsense clone for emergencies; **never run in parallel with pve2:104** |

Configuration: `/etc/pve/qemu-server/104.conf`

Network: vmbr0 (WAN), vmbr1 (LAN); see [02_netzwerk.md](02_netzwerk.md)

## Containers

| CTID | Name | Status | Role |
|------|------|--------|------|
| 100 | files | running | File/storage CT |

## Services on the host

| Service | Purpose |
|---------|---------|
| `pve-power-mqtt` | CPU power → MQTT (no GPU) |
| `pveproxy`, `pvedaemon` | Proxmox web UI on :8006 |

## Power monitoring

See [power-mqtt-agent.md](power-mqtt-agent.md)

- Binary: `/usr/local/bin/pve-power-mqtt`
- Env: `/etc/pve-power-mqtt.env`
- Source code: `/root/code/pve-power-mqtt` (repo server-power)

## Git / docs on this host

| Path | Contents |
|------|----------|
| `/root/docu-repo` | This docu repo (clone) |
| `/root/code/pve-power-mqtt` | Go agent |
| `/root/.git-credentials-jeanavril` | Gitea HTTPS token |

## SSH from pve2

```bash
ssh root@192.168.10.3
```

## Failover checklist (short)

1. Is pve2 / OPNsense down?
2. Start VM 104 on **pve1**: `qm start 104`
3. Check the physical WAN cables / bridge
4. Do **not** let OPNsense run on pve2 at the same time

Full procedure: [04_fallback_aktivierung.md](04_fallback_aktivierung.md)
@@ -0,0 +1,79 @@
# pve1: Power MQTT Agent

CPU power measurement (Intel RAPL) → MQTT → Home Assistant auto-discovery.

## Installation (current state)

| Component | Path |
|-----------|------|
| Binary | `/usr/local/bin/pve-power-mqtt` |
| systemd | `/etc/systemd/system/pve-power-mqtt.service` |
| Configuration | `/etc/pve-power-mqtt.env` |
| Source code | `/root/code/pve-power-mqtt` |
| Repo | https://git.jeanavril.com/jean/server-power.git |

## Configuration `/etc/pve-power-mqtt.env`

```ini
POWER_MQTT_BROKER=tcp://homeassistant.iot:1883
POWER_MQTT_USER=server
POWER_MQTT_PASSWORD="F0x84rAOW#q@LX"
POWER_MQTT_HOSTNAME=
POWER_MQTT_CLIENT_ID=
POWER_MQTT_DISCOVERY=true
```

An empty hostname or client ID defaults to **`pve1`** / **`pve-power-mqtt-pve1`**.
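
The defaulting rule can be pictured in shell terms. This is only an illustration of the documented behavior, not the agent's Go code; `resolve_hostname` and `resolve_client_id` are hypothetical names, and falling back to the system's short hostname is an assumption.

```bash
#!/usr/bin/env bash
# Illustration only: how empty settings could fall back to host-based defaults.

# resolve_hostname CONFIGURED SYSTEM_HOSTNAME
# -> the configured value, or the system hostname if the setting is empty.
resolve_hostname() {
    printf '%s\n' "${1:-$2}"
}

# resolve_client_id CONFIGURED HOSTNAME
# -> the configured value, or "pve-power-mqtt-<hostname>" if the setting is empty.
resolve_client_id() {
    printf '%s\n' "${1:-pve-power-mqtt-$2}"
}

# Example (not executed here):
#   host="$(resolve_hostname "$POWER_MQTT_HOSTNAME" "$(hostname -s)")"
#   resolve_client_id "$POWER_MQTT_CLIENT_ID" "$host"
```

Deriving the client ID from the hostname keeps the IDs of pve1 and pve2 distinct, which matters because an MQTT broker disconnects the older session when two clients connect with the same ID.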

## MQTT sensors in HA

| Entity (typical) | Source |
|------------------|--------|
| sensor.pve1_cpu_power | RAPL package-0 |
| sensor.pve1_estimated_total | = CPU (no GPU on pve1) |

Topics: `homeassistant/sensor/pve1/cpu_power/state` etc.
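
To watch these values outside Home Assistant, the state topic can be assembled from host and sensor name. A small sketch, assuming the topic scheme shown here holds for all sensors; `state_topic` is a hypothetical helper, and the `mosquitto_sub` call requires the Mosquitto client tools to be installed.

```bash
#!/usr/bin/env bash
# Hypothetical helper: build the state topic for a sensor of a given host.

# state_topic HOST SENSOR -> homeassistant/sensor/<host>/<sensor>/state
state_topic() {
    printf 'homeassistant/sensor/%s/%s/state\n' "$1" "$2"
}

# Example (not executed here; the password is read from the environment):
#   mosquitto_sub -h homeassistant.iot -p 1883 -u server -P "$POWER_MQTT_PASSWORD" \
#       -t "$(state_topic pve1 cpu_power)" -v
```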

Broker details: [../shared/mqtt-homeassistant.md](../shared/mqtt-homeassistant.md)

## Build & deploy (update)

```bash
cd /root/code/pve-power-mqtt
git pull
export PATH="/usr/local/go/bin:$PATH"
go build -o pve-power-mqtt ./cmd/pve-power-mqtt
install -m 755 pve-power-mqtt /usr/local/bin/pve-power-mqtt
systemctl restart pve-power-mqtt
```

Or via the install script in the repo:

```bash
cd /root/code/pve-power-mqtt
git pull && ./deploy/install.sh
systemctl restart pve-power-mqtt
```

## Operation

```bash
systemctl status pve-power-mqtt
journalctl -u pve-power-mqtt -f
```

Interval: CPU every **5 s**.

## Differences from pve2

- **No** `nvidia-smi`: RAPL only
- No `nvidia-persistenced`
- No GPU topics

## Troubleshooting

| Problem | Solution |
|---------|----------|
| HA shows "unavailable" | Reload MQTT; check `journalctl -u pve-power-mqtt` for connection errors |
| session taken over | Check the client ID; it must be `pve-power-mqtt-pve1` |
| RAPL missing | `ls /sys/class/powercap/intel-rapl/`; an Intel CPU is required |
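
For a quick sanity check of the RAPL source without the agent, average power can be derived from two readings of the energy counter. This is a sketch under assumptions: the counter path `/sys/class/powercap/intel-rapl/intel-rapl:0/energy_uj` (package 0) and the helper name `rapl_milliwatts` are not taken from the agent's code, and the 5 s window merely mirrors the documented interval.

```bash
#!/usr/bin/env bash
# Hypothetical helper: average power from two RAPL energy readings.

# rapl_milliwatts E1_UJ E2_UJ SECONDS [MAX_RANGE_UJ]
# Energy values are in microjoules. If the counter wrapped (E2 < E1) and
# MAX_RANGE_UJ is given, the wrap is compensated.
rapl_milliwatts() {
    local e1="$1" e2="$2" secs="$3" max="${4:-0}"
    local delta=$(( e2 - e1 ))
    if [ "$delta" -lt 0 ] && [ "$max" -gt 0 ]; then
        delta=$(( delta + max ))
    fi
    # microjoules / seconds = microwatts; divide by 1000 for milliwatts
    echo $(( delta / secs / 1000 ))
}

# Example (not executed here; reading the counter usually requires root):
#   f=/sys/class/powercap/intel-rapl/intel-rapl:0/energy_uj
#   e1="$(cat "$f")"; sleep 5; e2="$(cat "$f")"
#   echo "$(rapl_milliwatts "$e1" "$e2" 5) mW"
```

The wrap-around limit can be read from `max_energy_range_uj` in the same sysfs directory and passed as the fourth argument.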
@@ -0,0 +1,44 @@
# pve2: Documentation

**Host:** pve2 · **IP:** 192.168.10.4 · **Role:** Production Proxmox, router (OPNsense), GPU, Docker/Frigate

## Table of contents

| No. | File | Topic |
|-----|------|-------|
| - | [00_README.md](00_README.md) | This overview |
| - | [infrastructure-host.md](infrastructure-host.md) | Hardware, CT/VM, GPU, storage |
| - | [power-mqtt-agent.md](power-mqtt-agent.md) | CPU+GPU power → MQTT/HA |
| 01 | [01_System-und-Speicher-Uebersicht.md](01_System-und-Speicher-Uebersicht.md) | Disks, pools |
| 02 | [02_Ansible-Playbooks.md](02_Ansible-Playbooks.md) | Ansible |
| 03 | [03_VM-Analyse-und-Bereinigung.md](03_VM-Analyse-und-Bereinigung.md) | VM cleanup |
| 04 | [04_Backup-Strategie.md](04_Backup-Strategie.md) | Backups |
| 05 | [05_Wartung-und-Monitoring.md](05_Wartung-und-Monitoring.md) | Maintenance |
| 06 | [06_Quick-Reference.md](06_Quick-Reference.md) | Quick commands |
| 07 | [07_Storage-Migration-docker.md](07_Storage-Migration-docker.md) | Docker storage |
| 08 | [08_GPU-Idle-und-Power-Monitoring.md](08_GPU-Idle-und-Power-Monitoring.md) | GPU idle (short) |
| 09 | [09_GPU-Idle-vollstaendig.md](09_GPU-Idle-vollstaendig.md) | GPU idle (full) |

## Shared

- [MQTT & HA](../shared/mqtt-homeassistant.md)
- [Git & repos](../shared/git-und-repos.md)
- [Network](../shared/infrastruktur-netzwerk.md)

## pve2 specifics

- **2× GTX 1080**: headless, persistence mode, ~8 W/GPU idle (P8)
- **Power agent:** CPU + GPU0 + GPU1 + estimated_total
- **CT 101** Docker/Frigate: **Intel VAAPI**, no NVIDIA mounts
- **CT 110** AIDEV: NVIDIA for ML/Jupyter

## Quick commands

```bash
nvidia-smi
systemctl status nvidia-persistenced pve-power-mqtt
grep nvidia /etc/pve/lxc/*.conf
cd /root/docu-repo && git pull
```

As of: June 2026
@@ -0,0 +1,43 @@
# 01: System and Storage Overview

## Host

- **System:** Proxmox VE on **pve2**
- **Node storage:** configured in `/etc/pve/storage.cfg`

## Physical drives

| Device | Size | Usage |
|--------|------|-------|
| **nvme0n1** | ~477 GB | System (`/`), Proxmox host, thin pool `local-lvm` |
| **nvme1n1** | ~477 GB | Thin pool `nvme_second` (e.g. CT AIDEV) |
| **sda** | 1.8 TB | HDD: storage `records` (recordings, backups, Docker data) |

## Proxmox storage pools

| Storage | Type | Mount / path | Contents |
|---------|------|--------------|----------|
| **local** | dir | `/var/lib/vz` | ISO, templates (no backups anymore) |
| **local-lvm** | lvmthin | nvme0n1 | VM/CT images, rootfs |
| **nvme_second** | lvmthin | nvme1n1 | VM/CT images, rootfs |
| **records** | dir | `/mnt/pve/records` | Backups, images, rootdir, ISO |

## Initial situation (before cleanup)

At the first check, **`local-lvm` was ~93 % full**: a critical state, with only ~26 GB free in the thin pool.

The cause was **not** just "full VMs" but a combination of:

1. **Thin provisioning without `fstrim`**: data deleted inside the guest was not released back to the pool
2. **Large VM/CT disks** on `local-lvm` (docker, media, …)
3. **Unused VMs** (kali, dev2) with allocated space
4. **Backups on `local`** instead of on the HDD

## Mounted filesystems (host)

| Mount | Type | Size | Typical usage |
|-------|------|------|---------------|
| `/` | ext4 (pve-root) | ~94 GB | Proxmox system |
| `/boot/efi` | vfat | 1 GB | EFI |
| `/mnt/pve/records` | xfs | 1.8 TB | HDD storage |
| `/etc/pve` | fuse | 128 MB | Proxmox cluster config |
@@ -0,0 +1,60 @@
# 02: VM and Container Analysis

## Running / existing guests (after cleanup)

### QEMU VMs

| VMID | Name | Storage | Boot disk | Status |
|------|------|---------|-----------|--------|
| 100 | windows | nvme_second | 100 GB | stopped |
| 102 | klipper | local-lvm | 18 GB | running |
| 104 | opnsense | local-lvm | 32 GB | running |
| 106 | homeassistant | local-lvm | 32 GB | running |

### LXC containers

| VMID | Name | Rootfs storage | Size | Status |
|------|------|----------------|------|--------|
| 101 | docker | local-lvm | 104 GB | running |
| 103 | pve-scripts-local | nvme_second | 4 GB | running |
| 109 | media | local-lvm | 80 GB | running |
| 110 | AIDEV | nvme_second | 230 GB | running |

## Top consumers (analysis)

| Rank | ID | Name | Storage | Allocated | Main contents | Note |
|------|-----|------|---------|-----------|---------------|------|
| 1 | 110 | AIDEV | nvme_second | 230 GB | Dev code, Docker | CT |
| 2 | 101 | docker (data) | records | 200 GB | Frigate recordings | mp0 on HDD |
| 3 | 101 | docker (root) | local-lvm | 104 GB | Docker stack | CT rootfs |
| 4 | 109 | media | local-lvm | 80 GB | Jellyfin metadata | CT rootfs |

### CT 101 docker: details

- **Rootfs:** `local-lvm:vm-101-disk-0` (104 GB)
- **Data volume:** `records:101/vm-101-disk-0.raw` → `/mnt/records` (200 GB, Frigate)
- **Services:** Frigate, Overleaf/ShareLaTeX, Mongo, Affine, NPMplus, Portainer, Dockge, …
- **Frigate retention:** 30 days in `config.yaml`; HDD capacity is sufficient, no adjustment needed

### CT 109 media: details

- **Jellyfin config:** ~18 GB under `/opt/stacks/jellyfin/config`
- ~13 GB metadata (posters, artwork): **intentional**, do not delete
- NFS mounts for movies/series (no local space used on the CT disk)

### CT 110 AIDEV: details

- **Code:** `/root/code/` (~69 GB), including `agentic_quant`, `trading_tools`
- **Docker:** dev images, build cache (regular cleanup is worthwhile)
- **Caches:** `.npm`, `.cache`, IDE servers (VS Code, Cursor, …)

## Recommendations (partly implemented)

| Measure | Status |
|---------|--------|
| Delete unused VMs (kali, dev2) | ✅ done |
| Backups on `records` | ✅ done |
| `fstrim` + Docker cleanup | ✅ done |
| Weekly Ansible maintenance | ✅ set up |
| Migrate media → nvme_second | ⏳ optional |
| Remove windows (100) if unused | ⏳ optional |
@@ -0,0 +1,49 @@
# 03: Deleted VMs

## VM 107: kali

- **Status:** deleted (`qm destroy 107 --purge`)
- **Reason:** no longer in use, confirmed by the user
- **Disks removed:** `vm-107-disk-0` (EFI), `vm-107-disk-1` (32 GB)
- **Storage:** local-lvm
- **Freed:** ~19 GB in the thin pool right away; more after a later `fstrim`

## VM 105: dev2

### Check before deletion

The disk was mounted read-only via `qemu-nbd`. Contents:

| Area | Contents |
|------|----------|
| `/home/dev/projects/` | Azure DevOps repos: tornau-mono, buko-mono, traunstein-mono, munich-mono, ng-next |
| `/home/dev/projects/sveltewelt` | Small local Svelte project (~88 KB), no Git |
| `/opt/kimai`, `/opt/install` | Docker Compose for Kimai & Cattr (time tracking) |
| Docker | Kimai + Cattr stack with volumes (~4 GB) |

**Git remotes:** all mono repos are hosted on `ssh.dev.azure.com` (arva-digital). The code is backed up remotely.

**Local peculiarities:**
- `munich-mono`: uncommitted change in `package-lock.json`
- `traunstein-mono`: detached HEAD
- `sveltewelt`: local only

### Deletion

- **Command:** `qm destroy 105 --purge`
- **Disk:** `vm-105-disk-0` (50 GB on local-lvm)
- **Decision:** the user said "get rid of it" after the check

## Commands for reference

```bash
# Stop the VM (if necessary)
qm stop <vmid>

# Permanently delete the VM including its disks
qm destroy <vmid> --purge
```

## Note

None of the deleted VMs had snapshots.
@@ -0,0 +1,62 @@
# 04: Backup Configuration

## Goal

Backups should **no longer go to `local`** (system NVMe) but to the **HDD `records`** (~1.6 TB free).

## Changes made

### 1. Default storage for vzdump

File: `/etc/vzdump.conf`

```ini
storage: records
```

New manual and scheduled backups therefore use `records` by default.

### 2. Backup removed from `local`

File: `/etc/pve/storage.cfg`

```ini
dir: local
    path /var/lib/vz
    content iso,vztmpl
```

`backup` was removed from `content`, which makes an accidental backup to the system partition harder.

### 3. Storage `records` (unchanged, already correct)

```ini
dir: records
    path /mnt/pve/records
    content iso,vztmpl,backup,images,rootdir
```

## Existing backups on local

- Among others, an **OPNsense backup (~15 GB)** was stored under `/var/lib/vz/dump/`
- It was **deleted manually** by the user; a backup is still needed, so create a new one if necessary

### Backing up OPNsense again

```bash
vzdump 104 --storage records --mode snapshot --compress zstd
```

## Scheduled backups (cron)

The file `/etc/pve/vzdump.cron` is currently **empty** (no cluster-wide schedule). Backups currently run manually, or have to be set up separately in the Proxmox UI or via cron.

## Verification

```bash
pvesm status
grep storage /etc/vzdump.conf
cat /etc/pve/storage.cfg
ls -lh /mnt/pve/records/dump/
ls -lh /var/lib/vz/dump/
```
@@ -0,0 +1,87 @@
# 05: Cleaning Up LXC Storage

## Key insight: thin pool vs. guest usage

Proxmox **thin LVM** (`local-lvm`, `nvme_second`) can report **far higher** usage than the filesystem **inside** the container.

| CT | Pool usage (before) | `df` in guest (before) |
|----|---------------------|------------------------|
| 109 media | ~99 % | ~43 % |
| 101 docker | ~97 % | ~45 % |
| 110 AIDEV | ~99 % | ~74 % |

**Cause:** files/blocks deleted in the guest are only released in the thin pool after **`fstrim`**.

### fstrim: the biggest lever

```bash
pct exec 101 -- fstrim -v /
pct exec 109 -- fstrim -v /
pct exec 110 -- fstrim -v /
```

Result of the session:

| Pool | Before | After |
|------|--------|-------|
| local-lvm | ~93 % | ~43 % |
| nvme_second | ~59 % | ~40 % |
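
The three `fstrim` calls generalize to a loop over all running containers. A minimal sketch, assuming `pct list` prints a header line followed by one row per container, with the VMID in the first column and the status in the second; `running_ct_ids` is a hypothetical helper. Proxmox also provides `pct fstrim <vmid>`, used here as an alternative to calling `fstrim` through `pct exec`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: select every running container for trimming.

# running_ct_ids reads `pct list` output on stdin and prints the VMIDs
# of all containers whose status column is "running".
running_ct_ids() {
    awk 'NR > 1 && $2 == "running" { print $1 }'
}

# Example (not executed here):
#   pct list | running_ct_ids | while read -r id; do
#       pct fstrim "$id"
#   done
```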

---

## Docker cleanup (performed manually)

On CT **101**, **109**, **110**:

- Remove stopped containers (>7 days)
- Delete dangling images
- Clear the build cache (mainly AIDEV: ~14 GB)
- Remove unused images (>14 days) on AIDEV
- Truncate container logs >50 MB to 10 MB (AIDEV: one log was **2.5 GB**)
- Limit the journal to ~200 MB (`journalctl --vacuum-size=200M`)
- `apt-get clean`
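
Most items in this list map onto standard `docker ... prune` commands. The sketch below only prints the commands for review instead of running them; the age filters follow the list (168 h = 7 days, 336 h = 14 days), `cleanup_plan` is a hypothetical helper, and log truncation is left out because the exact method used is not documented here.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the cleanup commands for one container.

# cleanup_plan VMID [aggressive] -> one command per line; nothing is executed.
cleanup_plan() {
    local id="$1" mode="${2:-}"
    echo "pct exec $id -- docker container prune --force --filter until=168h"
    echo "pct exec $id -- docker image prune --force"
    echo "pct exec $id -- docker builder prune --force"
    if [ "$mode" = "aggressive" ]; then
        # also removes unused (not only dangling) images older than 14 days
        echo "pct exec $id -- docker image prune --all --force --filter until=336h"
    fi
    echo "pct exec $id -- journalctl --vacuum-size=200M"
    echo "pct exec $id -- apt-get clean"
}

# Example (not executed here): review the plan, then pipe it to a shell.
#   cleanup_plan 110 aggressive
#   cleanup_plan 101 | sh
```

Printing first and executing second keeps a destructive step like `image prune --all` visible before it runs.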

### CT-specific findings

**101 docker**
- ~38 GB of Frigate recordings on `/mnt/records`: **normal**, 30-day retention
- ~48 GB Mongo (Overleaf): **production data**
- ~3 GB of old Docker images removed

**109 media**
- ~4 GB of old Jellyfin/tvheadend images removed
- 18 GB Jellyfin config: mostly **metadata**, leave untouched

**110 AIDEV**
- ~69 GB in `/root/code`: active projects
- ~14 GB of build cache and old images removed
- 17 unused Docker volumes (~1.2 GB): optionally review manually

---

## What is deliberately not deleted

| Path / area | Reason |
|-------------|--------|
| `/mnt/records/recordings` | Frigate recordings, HDD capacity is sufficient |
| Jellyfin `metadata/` | Library artwork |
| Mongo / Overleaf data | Production |
| `/root/code` on AIDEV | Development projects |

---

## Useful commands

```bash
# Disk usage in the container
pct exec <vmid> -- df -hT /

# Docker overview
pct exec <vmid> -- docker system df -v

# Largest directories
pct exec <vmid> -- du -xh / --max-depth=2 2>/dev/null | sort -hr | head -20

# Find large Docker logs
pct exec <vmid> -- find /var/lib/docker/containers -name '*-json.log' -size +50M -exec ls -lh {} \;
```
@@ -0,0 +1,129 @@
# 06: Ansible Automation

## Concept

**Ansible does not create any cron jobs inside the containers.**

Instead:

1. A (weekly) **cron job** exists on the **Proxmox host**
2. The cron job starts a **shell script**
3. The script runs **`ansible-playbook`**
4. Ansible connects to the CTs via **SSH** and runs the maintenance tasks

```
/etc/cron.weekly/pve-lxc-disk-maintenance
    ↓ (symlink)
/root/ansible/run-disk-maintenance.sh
    ↓
ansible-playbook playbooks/disk-maintenance.yml
    ↓ SSH
docker (101) · media (109) · AIDEV (110)
```
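
The wrapper script itself is not reproduced in this documentation. The following is a hypothetical sketch of what such a script could look like, assuming the `/root/ansible` directory and the log file `/var/log/pve-lxc-disk-maintenance.log`; the function names and the overall structure are illustrative and not the actual contents of `run-disk-maintenance.sh`.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper: run the playbook from cron and append to a log.

ANSIBLE_DIR="${ANSIBLE_DIR:-/root/ansible}"
LOG_FILE="${LOG_FILE:-/var/log/pve-lxc-disk-maintenance.log}"

# log_line MESSAGE -> "<date> <time> MESSAGE"
log_line() {
    printf '%s %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$1"
}

# run_maintenance -> runs the playbook, logging start, output, and exit code.
run_maintenance() {
    local rc=0
    log_line "disk maintenance started" >> "$LOG_FILE"
    ( cd "$ANSIBLE_DIR" && ansible-playbook playbooks/disk-maintenance.yml ) >> "$LOG_FILE" 2>&1 || rc=$?
    log_line "disk maintenance finished (exit code $rc)" >> "$LOG_FILE"
    return "$rc"
}

# As a cron script, the file would end with a call to: run_maintenance
```

Running the playbook in a subshell keeps the `cd` local, and capturing the exit code lets cron report a failed run.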
|
||||||
|
|
||||||
|
## Verzeichnisstruktur
|
||||||
|
|
||||||
|
```
|
||||||
|
/root/ansible/
|
||||||
|
├── ansible.cfg
|
||||||
|
├── run-disk-maintenance.sh → von cron.weekly aufgerufen
|
||||||
|
├── inventory/
|
||||||
|
│ ├── hosts.yml → Hosts + CT-spezifische Variablen
|
||||||
|
│ └── group_vars/all.yml → globale Schwellwerte
|
||||||
|
├── playbooks/
|
||||||
|
│ └── disk-maintenance.yml
|
||||||
|
└── roles/
|
||||||
|
└── disk_cleanup/
|
||||||
|
├── defaults/main.yml
|
||||||
|
├── tasks/main.yml
|
||||||
|
└── handlers/main.yml
|
||||||
|
```
|
||||||
|
|
||||||
|
## Verwaltete Hosts
|
||||||
|
|
||||||
|
| Ansible-Host | VMID | IP | Besonderheiten |
|
||||||
|
|--------------|------|-----|----------------|
|
||||||
|
| docker | 101 | 192.168.10.101 | Frigate-Pfade auf `/mnt/records` |
|
||||||
|
| media | 109 | 192.168.20.6 | Jellyfin-Cache-Pfad |
|
||||||
|
| aidev | 110 | 10.100.2.13 | Dev-Tooling optional |
|
||||||
|
|
||||||
|
SSH als `root` vom Proxmox-Host — Key-Auth war bereits eingerichtet.
|
||||||
|
|
||||||
|
## What the playbook does

| Task | Description |
|------|--------------|
| Journal | `journalctl --vacuum-size=200M` |
| apt | `autoclean` / `autoremove` / `clean` |
| Docker logs | Truncate files >50 MB to 10 MB |
| Docker | Stopped containers (>7 days), dangling images, build cache (>14 days) |
| Docker volumes | Only **dangling** volumes |
| daemon.json | Log limits `10m` × `3` (`max-size` × `max-file`), only if the file does not exist yet |
| fstrim | `/` inside the container (**important for the thin pool**) |
| Frigate | Delete recording folders older than 30 days |
| Jellyfin | Delete cache files older than 30 days |
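
The Docker log row can be illustrated with a small shell sketch. It is not taken from the playbook: the function name, the `*-json.log` pattern (the default location of the `json-file` driver under `/var/lib/docker/containers`) and the choice to keep the newest bytes are assumptions.

```bash
# Sketch only: shrink oversized Docker JSON logs in place.
# Assumptions (not from the playbook): function name, *-json.log pattern,
# keeping the newest bytes instead of the oldest.
# Usage: truncate_large_logs <dir> <max_bytes> <keep_bytes>
truncate_large_logs() {
    local dir="$1" max_bytes="$2" keep_bytes="$3"
    local log tmp
    # find -size +Nc matches files larger than N bytes
    find "$dir" -type f -name '*-json.log' -size +"${max_bytes}c" -print0 |
    while IFS= read -r -d '' log; do
        tmp="$(mktemp)"
        tail -c "$keep_bytes" "$log" > "$tmp"   # newest bytes; first line may be partial
        cat "$tmp" > "$log"                     # overwrite in place, inode stays the same
        rm -f "$tmp"
        echo "truncated: $log"
    done
}
```

With the thresholds from the table this would be called as `truncate_large_logs /var/lib/docker/containers $((50*1024*1024)) $((10*1024*1024))`. Overwriting in place keeps the file handle of the running Docker daemon valid.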

### Tags (optional)

```bash
# Everything (default)
ansible-playbook playbooks/disk-maintenance.yml

# Only the additional aggressive image cleanup
ansible-playbook playbooks/disk-maintenance.yml --tags aggressive

# Only Frigate or Jellyfin
ansible-playbook playbooks/disk-maintenance.yml --tags frigate
ansible-playbook playbooks/disk-maintenance.yml --tags jellyfin
```

## Cron

```bash
ls -la /etc/cron.weekly/pve-lxc-disk-maintenance
# → symlink to /root/ansible/run-disk-maintenance.sh
```

- **Interval:** `cron.weekly` (typically Sunday morning)
- **Log:** `/var/log/pve-lxc-disk-maintenance.log`
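
The content of `run-disk-maintenance.sh` is not shown in this documentation. A minimal sketch of what such a wrapper could look like follows; the variable names, the timestamp format and the `ANSIBLE_PLAYBOOK_BIN` override are assumptions made for illustration.

```bash
# Hypothetical wrapper in the spirit of run-disk-maintenance.sh.
# Assumptions: variable names, log line format, ANSIBLE_PLAYBOOK_BIN override.
ANSIBLE_DIR="${ANSIBLE_DIR:-/root/ansible}"
PLAYBOOK="${PLAYBOOK:-playbooks/disk-maintenance.yml}"
LOG_FILE="${LOG_FILE:-/var/log/pve-lxc-disk-maintenance.log}"
ANSIBLE_PLAYBOOK_BIN="${ANSIBLE_PLAYBOOK_BIN:-ansible-playbook}"

run_disk_maintenance() {
    local rc=0
    {
        echo "=== $(date '+%Y-%m-%d %H:%M:%S') start ==="
        # cd first so ansible.cfg and the relative playbook path resolve
        (cd "$ANSIBLE_DIR" && "$ANSIBLE_PLAYBOOK_BIN" "$PLAYBOOK" "$@") || rc=$?
        echo "=== $(date '+%Y-%m-%d %H:%M:%S') end (rc=$rc) ==="
    } >> "$LOG_FILE" 2>&1
    return "$rc"
}
```

A real script would end with `run_disk_maintenance "$@"`. Appending to one log file keeps the `tail` check on the last run simple, and returning the playbook's exit code lets cron report failures.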

### Disable cron

```bash
rm /etc/cron.weekly/pve-lxc-disk-maintenance
```

### Switch cron to daily (example)

```bash
echo '0 3 * * * root /root/ansible/run-disk-maintenance.sh' > /etc/cron.d/pve-lxc-disk-maintenance
```

## Adjusting the configuration

Global values: `/root/ansible/inventory/group_vars/all.yml`

```yaml
journal_max_size: 200M
docker_prune_stopped_containers_older_than: 168h   # 7 days
docker_prune_unused_images_older_than: 336h       # 14 days (tag: aggressive)
frigate_recordings_retain_days: 30
jellyfin_cache_max_age_days: 30
fstrim_enabled: true
```

Host-specific values go in `inventory/hosts.yml` (e.g. Frigate paths only on `docker`).

## Prerequisites

- **Ansible** installed on the Proxmox host (`apt install ansible`)
- **SSH** from the host to the CTs as root
- CTs must be running (for SSH)

## Manual test

```bash
/root/ansible/run-disk-maintenance.sh
# or
cd /root/ansible && ansible-playbook playbooks/disk-maintenance.yml
```
@@ -0,0 +1,66 @@
# 07 - Current state

State after completing the cleanup and setting up the automation.

## Storage pools

| Storage | Usage | Free (approx.) | Note |
|---------|----------|------------|-----------|
| local | ~32 % | ~60 GB | System partition |
| **local-lvm** | **~43 %** | **~200 GB** | was ~93 % |
| **nvme_second** | **~40 %** | **~280 GB** | was ~59 % |
| records | ~12 % | ~1.6 TB | HDD, plenty of headroom |

## VMs / Containers

### VMs
100 windows · 102 klipper · 104 opnsense · 106 homeassistant

### Containers
101 docker · 103 pve-scripts-local · 109 media · 110 AIDEV

**Removed:** 107 kali · 105 dev2

## Active automation

| Component | Path |
|------------|------|
| Ansible project | `/root/ansible/` |
| Weekly cron | `/etc/cron.weekly/pve-lxc-disk-maintenance` |
| Maintenance log | `/var/log/pve-lxc-disk-maintenance.log` |
| Documentation | `/root/docu/` |

## Open / optional items

1. **Recreate the OPNsense backup** on `records` (the old backup was deleted by accident)
   ```bash
   vzdump 104 --storage records --mode snapshot --compress zstd
   ```

2. **VM 100 windows**: stopped, ~100 GB on nvme_second; delete if unused

3. **CT 109 media**: migrate to `nvme_second`, which relieves local-lvm in the long run

4. **AIDEV Docker volumes**: ~17 unused volumes (~1.2 GB), check manually:
   ```bash
   pct exec 110 -- docker volume ls -f dangling=true
   ```

5. **Frigate retention**: leave unchanged (30 days) while HDD space is plentiful

## Key lessons learned

1. **Run `fstrim` regularly** on LXC guests, otherwise the thin pool appears to fill up rapidly (blocks freed inside the guest are not returned to the pool)
2. **Limit Docker logs**: missing `max-size`/`max-file` led to multi-GB logs
3. **Backups on HDD**: do not use the system NVMe for vzdump
4. **Thin pool ≠ guest `df`**: always check both

## Checks for the next maintenance

```bash
# Thin pools
lvs pve/data nvme_second/nvme_second -o data_percent

# Last Ansible run
tail -n 50 /var/log/pve-lxc-disk-maintenance.log
```
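
The thin-pool check can be turned into a threshold warning. The following is a sketch under assumptions: the function name and the 80 % default are illustrative, and the input is expected in the form `<lv_name>,<data_percent>` as printed by `lvs --noheadings --separator , -o lv_name,data_percent`.

```bash
# Sketch: warn when a thin pool's data usage reaches a threshold.
# Assumptions: function name, 80 % default, CSV input on stdin.
check_thin_pools() {
    local threshold="${1:-80}"
    awk -F',' -v limit="$threshold" '
        {
            gsub(/[[:space:]]/, "", $1); gsub(/[[:space:]]/, "", $2)
            if ($2 == "") next                      # no data_percent reported
            if ($2 + 0 >= limit) { printf "WARN %s %.1f%%\n", $1, $2; bad = 1 }
            else                 { printf "OK %s %.1f%%\n", $1, $2 }
        }
        END { exit (bad ? 1 : 0) }
    '
}
```

Example call: `lvs --noheadings --separator , -o lv_name,data_percent pve/data nvme_second/nvme_second | check_thin_pools 80`. The non-zero exit status on a warning makes it usable from cron or a monitoring hook.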
@@ -0,0 +1,32 @@
# GPU Idle & Power Monitoring - pve2

Detailed documentation in the repo: [`/root/code/pve-power-mqtt/docs/gpu-idle-pve2.md`](/root/code/pve-power-mqtt/docs/gpu-idle-pve2.md)

## Summary

- **2× GTX 1080** on the host driver (no VM passthrough)
- Idle target: **~8 W/GPU, P8**, which requires **Persistence Mode**
- **Frigate (CT 101):** Intel iGPU/VAAPI only → no NVIDIA mounts in the LXC
- **AIDEV (CT 110):** NVIDIA mounts for ML when needed
- **Power agent:** `pve-power-mqtt` → MQTT/Home Assistant

## Quick check

```bash
nvidia-smi --query-gpu=index,power.draw,pstate,persistence_mode --format=csv
systemctl status nvidia-persistenced pve-power-mqtt
grep nvidia /etc/pve/lxc/*.conf
```

## Install persistenced (from the repo)

```bash
cd /root/code/pve-power-mqtt
./deploy/nvidia-persistenced/install.sh
```

## Changes June 2026

- CT 101: NVIDIA device mounts commented out
- `nvidia-persistenced`: broken `--user-persistence-mode` flag removed, service is running
- Documentation lives in the `server-power` repo under `docs/`
@@ -0,0 +1,168 @@
# GPU Idle & Power Monitoring - pve2

As of: June 2026 · Host: **pve2** · 2× NVIDIA GeForce GTX 1080

## Goal

In headless operation, the GTX 1080s should draw **~6–8 W per GPU (P8)** when idle and remain available for LXC/compute on demand, **without** GPU passthrough to VMs.

Typical idle measurement with a correct configuration:

```
GPU0: ~8–9 W, P8, core 139 MHz
GPU1: ~8–9 W, P8, core 139 MHz
```

---

## Causes of high idle power draw (P0/P5, 40–70 W total)

| Cause | Symptom | Fix |
|---------|---------|-----|
| **Persistence Mode off / daemon dead** | P0 or P5, `Idle: Not Active` | `nvidia-persistenced` + `nvidia-smi -pm 1` (see below) |
| **NVIDIA devices in an LXC without use** | Driver keeps GPUs awake, clocks flip 139↔1657 MHz | Remove mounts (CT 101) |
| **Real GPU load** | Processes in `nvidia-smi`, encoder/decoder > 0 | Find and stop the process |
| **Frequent `nvidia-smi` polling** | Brief wake-ups | Measure the GPU less often (e.g. every 60 s) |
| **Newer driver (580.x) without PM** | Pascal stays in P5 | Persistence Mode is mandatory |

**Not applicable to pve2:** VFIO GPU passthrough (the GPUs are bound to the host driver, not to VMs).

---

## 1. NVIDIA Persistence Mode (mandatory for headless P8)

On headless Linux systems without a display, the driver often does **not** reliably drop GTX 1080 cards back to **P8** after load when Persistence Mode is off.

### Check

```bash
nvidia-smi --query-gpu=index,power.draw,pstate,persistence_mode,clocks.gr --format=csv
```

Expected when idle: `P8`, `Enabled`, ~139 MHz core.
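
The expectation can be checked mechanically. A minimal sketch, assuming the four-column output of `nvidia-smi --query-gpu=index,power.draw,pstate,persistence_mode --format=csv,noheader,nounits` on stdin; the function name and the 10 W limit are chosen for illustration only.

```bash
# Sketch: flag GPUs that are not idle as expected.
# Assumptions: function name, 10 W default limit, CSV input
# "index, power.draw, pstate, persistence_mode" on stdin.
check_gpu_idle() {
    local max_watts="${1:-10}"
    awk -F',' -v max="$max_watts" '
        NF < 4 { next }                              # skip blank or short lines
        {
            for (i = 1; i <= NF; i++) gsub(/^[ \t]+|[ \t]+$/, "", $i)
            ok = ($3 == "P8" && $4 == "Enabled" && $2 + 0 <= max)
            printf "GPU%s %s W %s persistence=%s -> %s\n", $1, $2, $3, $4, (ok ? "idle" : "CHECK")
            if (!ok) bad = 1
        }
        END { exit (bad ? 1 : 0) }
    '
}
```

Example call: `nvidia-smi --query-gpu=index,power.draw,pstate,persistence_mode --format=csv,noheader,nounits | check_gpu_idle 10`.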

### Permanent setup (pve2)

Service files are in the repo under `deploy/nvidia-persistenced/`:

```bash
cp deploy/nvidia-persistenced/nvidia-persistenced.service /etc/systemd/system/
mkdir -p /etc/systemd/system/nvidia-persistenced.service.d
cp deploy/nvidia-persistenced/override.conf /etc/systemd/system/nvidia-persistenced.service.d/
systemctl daemon-reload
systemctl enable --now nvidia-persistenced
```

**Important:** The option `--user-persistence-mode` is invalid and made the service exit immediately, hence the fix in the repo.

### Manual (until reboot)

```bash
nvidia-smi -pm 1
```

### Check status

```bash
systemctl status nvidia-persistenced
nvidia-smi -q | grep -E 'Performance State|Idle|Persistence'
fuser -v /dev/nvidia*   # when idle, should be empty apart from nvidia-persistenced itself
```

---

## 2. LXC: who needs `/dev/nvidia*`?

| CT | Name | GPU mounts | Reason |
|----|------|------------|-------|
| **101** | docker | **No** (removed) | Frigate uses the **Intel iGPU (VAAPI)**, NVIDIA commented out in `compose.yml` |
| **110** | AIDEV | **Yes** | Jupyter/ML when needed |
| **109** | media | Only when actively used | Stopped → no mount needed |

### Frigate (CT 101)

- Detector: **OpenVINO** (CPU/iGPU)
- Record streams: `hwaccel_args: preset-vaapi`
- Docker devices: `/dev/dri/renderD128`, `/dev/dri/card0` (Intel)
- **No** NVIDIA `deploy.resources.devices` in `compose.yml`

NVIDIA bind mounts in `/etc/pve/lxc/101.conf` are commented out (June 2026). After restarting the CT:

```bash
pct reboot 101
pct exec 101 -- sh -c 'ls /dev/nvidia*'   # should report that no such files exist
pct exec 101 -- docker ps                  # frigate healthy
```
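
To see at a glance which containers still have an active NVIDIA line, commented lines have to be ignored. A small sketch; the function name is an assumption, and the directory is a parameter so the check can be tried on a copy of `/etc/pve/lxc`.

```bash
# Sketch: list CT IDs whose config still has an active (not commented out)
# line mentioning nvidia. Assumption: function name; one <ctid>.conf per CT.
cts_with_nvidia() {
    local conf_dir="${1:-/etc/pve/lxc}"
    local conf
    for conf in "$conf_dir"/*.conf; do
        [ -e "$conf" ] || continue
        # drop lines starting with '#', then look for nvidia in the rest
        if grep -v '^[[:space:]]*#' "$conf" | grep -q 'nvidia'; then
            basename "$conf" .conf
        fi
    done
}
```

On pve2 the expected output after the change is only `110`.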

---

## 3. Power monitoring (`pve-power-mqtt`)

The agent reads GPU power via `nvidia-smi`. This is **not** sustained compute load, but it can briefly wake the GPUs from P8.

Recommendation:

- CPU/RAPL: every **5 s** (MQTT → Home Assistant)
- GPU: measure less often, or only when `estimated_total` needs the GPU share

Verification without agent influence:

```bash
systemctl stop pve-power-mqtt
sleep 60
nvidia-smi --query-gpu=power.draw,pstate --format=csv
systemctl start pve-power-mqtt
```

---

## 4. Troubleshooting

### GPUs stuck in P0/P5 (~15–45 W)

1. `systemctl status nvidia-persistenced`: is it running?
2. `nvidia-smi -pm 1`
3. Identify LXCs with GPU mounts: `grep nvidia /etc/pve/lxc/*.conf`
4. Processes: `nvidia-smi`, `fuser -v /dev/nvidia*`
5. Reboot the host (last resort)

### Flipping between 139 MHz and 1657 MHz (one GPU at ~44 W)

Typical when **several consumers** talk to the driver (LXC mounts + monitoring). Running CT 101 without NVIDIA mounts resolves a large part of this.

### Dummy HDMI plug

Only consider it if P8 is still unreachable **with** Persistence Mode. Currently **not needed** on pve2 (P8 is stable with persistenced).

### Driver

Installed: **580.95.05** (Pascal). For idle, Persistence Mode matters more than a downgrade; test 550.x LTS if needed.

---

## 5. Reference commands

```bash
# Quick check
nvidia-smi

# Power over time (15 s interval)
watch -n 15 'nvidia-smi --query-gpu=index,power.draw,pstate,clocks.gr --format=csv'

# LXC GPU configuration
grep -H nvidia /etc/pve/lxc/*.conf

# MQTT power agent
systemctl status pve-power-mqtt
journalctl -u pve-power-mqtt -f
```

---

## Change history

| Date | Change |
|-------|----------|
| 2026-06-27 | CT 101: NVIDIA mounts removed (Frigate = VAAPI) |
| 2026-06-27 | `nvidia-persistenced` repaired (invalid CLI flag removed) |
| 2026-06-27 | Documentation created, P8 idle ~8 W per GPU verified |
@@ -0,0 +1,91 @@
# pve2 - Host Infrastructure

**IP:** 192.168.10.4 · **GPU:** 2× NVIDIA GeForce GTX 1080 · **Driver:** 580.95.05

## Physical disks

| Device | Size | Use |
|--------|-------|---------|
| nvme0n1 | ~477 GB | System, `local-lvm` |
| nvme1n1 | ~477 GB | `nvme_second` (e.g. AIDEV) |
| sda | 1.8 TB | `records`: recordings, backups, Docker data |

Details: [01_System-und-Speicher-Uebersicht.md](01_System-und-Speicher-Uebersicht.md)

## Storage pools

| Pool | Type | Content |
|------|-----|--------|
| local | dir | ISO, templates |
| local-lvm | lvmthin | VM/CT on nvme0 |
| nvme_second | lvmthin | VM/CT on nvme1 |
| records | dir | HDD: backups, Frigate, large data |

## VMs (selection)

| VMID | Name | Role |
|------|------|-------|
| 104 | opnsense | **Router/firewall**, production |
| 106 | homeassistant | Home Assistant + Mosquitto |

## Containers (selection)

| CTID | Name | GPU | Role |
|------|------|-----|-------|
| 101 | docker | **No** (NVIDIA removed) | Frigate, Compose stack |
| 109 | media | optional | Media (often stopped) |
| 110 | AIDEV | **Yes** | Jupyter/ML |

### GPU mount policy for LXC

| CT | `/dev/nvidia*` | Reason |
|----|----------------|-------|
| 101 | **No** | Frigate = OpenVINO + Intel VAAPI |
| 110 | **Yes** | ML when needed |
| 109 | only when active | Stopped → no mount |

Configuration: `/etc/pve/lxc/101.conf`, NVIDIA lines commented out (`#lxc.mount.entry%3A ...`).

Frigate in CT 101:
- Detector: OpenVINO (CPU/iGPU)
- `hwaccel_args: preset-vaapi`
- Devices: `/dev/dri/renderD128`, `/dev/dri/card0`
- NVIDIA commented out in `compose.yml`

## NVIDIA on the host

```bash
nvidia-smi
systemctl status nvidia-persistenced
```

Persistence Mode is **mandatory** for P8 idle (~8 W/GPU). The service files are also in the server-power repo: `deploy/nvidia-persistenced/`.

Full GPU documentation: [09_GPU-Idle-vollstaendig.md](09_GPU-Idle-vollstaendig.md)

## Host services

| Service | Purpose |
|--------|-------|
| `nvidia-persistenced` | GPU Persistence Mode |
| `pve-power-mqtt` | RAPL + nvidia-smi → MQTT |
| Proxmox | Web UI :8006 |

## Git / docs on this host

| Path | Content |
|------|--------|
| `/root/docu-repo` | docu repo |
| `/root/code/pve-power-mqtt` | Go agent + GPU docs |
| `/root/docu/` | Legacy local copy (optionally replace with docu-repo) |
| `/root/.git-credentials-jeanavril` | Gitea token |

## Ansible

Playbooks: see [02_Ansible-Playbooks.md](02_Ansible-Playbooks.md)

## Maintenance

- `fstrim` in VMs/CTs for the thin pools
- Backups go to `records`, not `local-lvm`
- Storage monitoring: [05_Wartung-und-Monitoring.md](05_Wartung-und-Monitoring.md)
@@ -0,0 +1,95 @@
# pve2 - Power MQTT Agent

CPU (Intel RAPL) + GPU (`nvidia-smi`) → MQTT → Home Assistant.

## Installation

| Component | Path |
|------------|------|
| Binary | `/usr/local/bin/pve-power-mqtt` |
| systemd | `/etc/systemd/system/pve-power-mqtt.service` |
| Env | `/etc/pve-power-mqtt.env` |
| Source code | `/root/code/pve-power-mqtt` |
| Repo | https://git.jeanavril.com/jean/server-power.git |

## Configuration `/etc/pve-power-mqtt.env`

```ini
POWER_MQTT_BROKER=tcp://homeassistant.iot:1883
POWER_MQTT_USER=server
POWER_MQTT_PASSWORD="<PASSWORD>"
POWER_MQTT_HOSTNAME=pve2
POWER_MQTT_DISCOVERY=true
```

Client ID: **`pve-power-mqtt-pve2`**

## HA sensors

| Entity | Source |
|--------|--------|
| sensor.pve2_cpu_power | RAPL |
| sensor.pve2_gpu0_power | GPU 0 |
| sensor.pve2_gpu1_power | GPU 1 |
| sensor.pve2_estimated_total | CPU + GPU0 + GPU1 |

`estimated_total` is **not** a PSU/PDU reading.
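
The arithmetic behind these values can be sketched in shell. This shows only the calculation and may differ from the agent's Go code: the function names are assumptions, and RAPL energy is assumed to be read in microjoules at two points in time, for example from `/sys/class/powercap/intel-rapl:0/energy_uj`.

```bash
# Sketch of the arithmetic only; the agent's actual implementation may differ.
# Assumptions: function names; energy counters given in microjoules.

# rapl_watts <energy_uj_start> <energy_uj_end> <seconds> [max_energy_range_uj]
rapl_watts() {
    awk -v e1="$1" -v e2="$2" -v dt="$3" -v wrap="${4:-0}" 'BEGIN {
        delta = e2 - e1
        if (delta < 0 && wrap > 0) delta += wrap   # counter wrapped around
        printf "%.2f\n", delta / 1000000 / dt      # uJ -> J, then J/s = W
    }'
}

# estimated_total <cpu_w> <gpu0_w> <gpu1_w>
estimated_total() {
    awk -v c="$1" -v g0="$2" -v g1="$3" 'BEGIN { printf "%.2f\n", c + g0 + g1 }'
}
```

For example, a counter that advances by 5,000,000 µJ within one second corresponds to 5 W. Fans, drives, mainboard and PSU losses are not part of the sum, which is why the value stays below a wall measurement.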

Broker details: [../shared/mqtt-homeassistant.md](../shared/mqtt-homeassistant.md)

## Build & deploy

```bash
cd /root/code/pve-power-mqtt
git pull
export PATH="/usr/local/go/bin:$PATH"
go build -o pve-power-mqtt ./cmd/pve-power-mqtt
install -m 755 pve-power-mqtt /usr/local/bin/pve-power-mqtt
systemctl restart pve-power-mqtt
```

## NVIDIA persistence (prerequisite for meaningful GPU idle values)

```bash
systemctl status nvidia-persistenced
nvidia-smi --query-gpu=power.draw,pstate,persistence_mode --format=csv
```

Expected when idle: **P8**, ~8–9 W per GTX 1080.

See [09_GPU-Idle-vollstaendig.md](09_GPU-Idle-vollstaendig.md)

## Agent vs. GPU idle

Running `nvidia-smi` every **5 s** can briefly wake the GPUs from P8. For a pure idle measurement:

```bash
systemctl stop pve-power-mqtt
sleep 60
nvidia-smi --query-gpu=power.draw,pstate --format=csv
systemctl start pve-power-mqtt
```

Optional: increase the GPU interval in the code (e.g. to 60 s), see the server-power repo.

## Operation

```bash
systemctl status pve-power-mqtt
journalctl -u pve-power-mqtt -f
```

## Fixes (history)

- `expire_after` / `availability_topic` removed from discovery (HA showed "unavailable")
- Unique client IDs per host
- Keepalive 120 s, ping timeout 30 s
- MQTT reconnect logging

## Troubleshooting

| Problem | Solution |
|---------|--------|
| GPU unavailable in HA | Is the agent running? Does `nvidia-smi` work on the host? |
| High GPU idle values | Check persistence + LXC mounts (CT 101 without NVIDIA) |
| MQTT timeout | VLAN 10→40, is the broker homeassistant.iot reachable? |
@@ -0,0 +1,74 @@
# Git & Repositories

Gitea: **https://git.jeanavril.com** · User: **jean**

## Repositories

| Repo | URL | Clone path on hosts |
|------|-----|----------------------|
| docu | https://git.jeanavril.com/jean/docu.git | `/root/docu-repo` |
| server-power | https://git.jeanavril.com/jean/server-power.git | `/root/code/pve-power-mqtt` |

## Authentication (HTTPS)

SSH to Gitea is **not** set up through the reverse proxy → **HTTPS + token**.

### Token (Gitea → Settings → Applications)

User `jean`, token for automation on the Proxmox hosts.

Stored in: **`/root/.git-credentials-jeanavril`**

```
https://jean:<TOKEN>@git.jeanavril.com
```

(chmod 600)
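
Creating the file with restrictive permissions from the start avoids a short window in which the token is world-readable. A minimal sketch; the function name and argument order are assumptions.

```bash
# Sketch: write a git credential-store file with mode 600.
# Assumptions: function name and argument order are illustrative.
# Usage: write_git_credentials <file> <user> <token> <host>
write_git_credentials() {
    local file="$1" user="$2" token="$3" host="$4"
    (
        umask 077                                   # a new file is created as 600
        printf 'https://%s:%s@%s\n' "$user" "$token" "$host" > "$file"
    )
    chmod 600 "$file"                               # also tightens a pre-existing file
}
```

Possible use: `read -rs -p 'Gitea token: ' token; write_git_credentials /root/.git-credentials-jeanavril jean "$token" git.jeanavril.com`. Reading the token with `read -s` keeps it out of the shell history.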

### Git credential per repo (local)

In each repo's `.git/config` (shown here as printed by `git config --local --list`):

```
credential.helper=store --file /root/.git-credentials-jeanavril
```

To set it:

```bash
cd /root/docu-repo   # or /root/code/pve-power-mqtt
git config --local credential.helper 'store --file /root/.git-credentials-jeanavril'
```

## First-time setup on a new host

```bash
# Docs
git clone https://git.jeanavril.com/jean/docu.git /root/docu-repo
cd /root/docu-repo
git config --local credential.helper 'store --file /root/.git-credentials-jeanavril'
# Create the token file (copy its content from another host)

# Power agent
git clone https://git.jeanavril.com/jean/server-power.git /root/code/pve-power-mqtt
cd /root/code/pve-power-mqtt
git config --local credential.helper 'store --file /root/.git-credentials-jeanavril'
```

## Go installation

Path: `/usr/local/go/bin/go`, added in `~/.bashrc`:

```bash
export PATH="/usr/local/go/bin:$PATH"
```

## Workflow

```bash
cd /root/docu-repo && git pull
# edit
git add -A
git commit -m "Short description"
git push
```
@@ -0,0 +1,44 @@
# Infrastructure & Network

## Proxmox hosts

| Host | IP vmbr0 | Gateway | SSH |
|------|----------|---------|-----|
| pve1 | 192.168.10.3/24 | 192.168.10.1 | `ssh root@192.168.10.3` |
| pve2 | 192.168.10.4/24 | 192.168.10.1 | `ssh root@192.168.10.4` |

Management network: **192.168.10.0/24** (VLAN 10)

## DNS (internal)

| Name | IP | Service |
|------|-----|--------|
| homeassistant.iot | 192.168.40.254 | Home Assistant + Mosquitto MQTT |
| git.jeanavril.com | (Gitea) | Git repositories |

Scheme: the VLAN ID is often the third octet (`192.168.40.0/24` = VLAN 40)
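
As a worked example of the scheme, a hypothetical helper can derive the VLAN ID from an address. It only applies where the convention holds ("often", not always), and the function name is an assumption.

```bash
# Sketch: derive the VLAN ID from an IPv4 address under the
# "third octet = VLAN ID" convention. Hypothetical helper, bash only.
vlan_from_ip() {
    local ip="$1" o1 o2 o3 o4
    IFS=. read -r o1 o2 o3 o4 <<< "$ip"
    echo "$o3"
}
```

`vlan_from_ip 192.168.40.254` yields `40`, matching the `*.iot` network.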

## pve1 - Bridges

| Bridge | Attachment | Purpose |
|--------|-----------|-------|
| vmbr0 | nic0, VLAN-aware | WAN / management |
| vmbr1 | no physical ports | LAN side of the OPNsense fallback |

## pve2 - Bridges

| Bridge | Purpose |
|--------|-------|
| vmbr0 | VLAN-aware, VM/CT management |
| vmbr1 | Internal (OPNsense LAN, CT networks) |

Details on CT/VM networks: see the host docs under `pve1/` and `pve2/`.

## Roles

- **pve2:** production, OPNsense VM 104, Home Assistant VM 106, Docker/Frigate CT 101, GPU host
- **pve1:** fallback router (OPNsense clone VM 104, stopped), CT 100 files

## Failover note

The OPNsense fallback on pve1 (VM 104) and the original on pve2 **must not run in parallel** because they share the same IPs and configuration. See [pve1/04_fallback_aktivierung.md](../pve1/04_fallback_aktivierung.md).
@@ -0,0 +1,94 @@
# MQTT & Home Assistant

## Broker

| Parameter | Value |
|-----------|------|
| Hostname | `homeassistant.iot` |
| IP | 192.168.40.254 |
| Port | 1883 (TLS: not used) |
| User | `server` |
| Password | `<PASSWORD>` (stored in `/etc/pve-power-mqtt.env` on the hosts) |
| Protocol | MQTT v3.1.1, QoS 0, retained states |

The broker runs on the **Home Assistant** system (VM 106 on pve2).

## Power sensors (pve-power-mqtt)

Agent repo: https://git.jeanavril.com/jean/server-power.git

### Topics (example pve2)

```
homeassistant/sensor/pve2/cpu_power/state
homeassistant/sensor/pve2/gpu0_power/state
homeassistant/sensor/pve2/gpu1_power/state
homeassistant/sensor/pve2/estimated_total/state
```

Discovery (retained):

```
homeassistant/sensor/pve2_cpu_power/config
homeassistant/sensor/pve2_gpu0_power/config
...
```
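
The two naming patterns can be captured in small helpers, for instance to build a subscription for debugging. This is a sketch: the function names are assumptions, the patterns are copied from the topic lists above, and the agent may construct them differently internally.

```bash
# Sketch: build topic names from host and sensor.
# Assumptions: function names; patterns copied from the documented topics.
state_topic()     { printf 'homeassistant/sensor/%s/%s/state\n' "$1" "$2"; }
discovery_topic() { printf 'homeassistant/sensor/%s_%s/config\n' "$1" "$2"; }

# Print (not run) a mosquitto_sub command that watches all state topics of a host.
watch_command() {
    local host="$1" broker="${2:-homeassistant.iot}"
    printf "mosquitto_sub -h %s -p 1883 -u server -P '<PASSWORD>' -v -t '%s'\n" \
        "$broker" "$(state_topic "$host" '+')"
}
```

`watch_command pve2` prints a command using the single-level wildcard `+`, so all four pve2 state topics show up in one subscription.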

### HA devices

| Device | Host | Sensors |
|-------|------|----------|
| pve1 Power | pve1 | CPU, estimated_total |
| pve2 Power | pve2 | CPU, GPU0, GPU1, estimated_total |

`estimated_total` = CPU (RAPL) + GPU sum, **not** power draw at the wall.

### Env on the hosts

File: `/etc/pve-power-mqtt.env` (chmod 600)

**pve2:**

```ini
POWER_MQTT_BROKER=tcp://homeassistant.iot:1883
POWER_MQTT_USER=server
POWER_MQTT_PASSWORD="<PASSWORD>"
POWER_MQTT_HOSTNAME=pve2
POWER_MQTT_DISCOVERY=true
```

**pve1:**

```ini
POWER_MQTT_BROKER=tcp://homeassistant.iot:1883
POWER_MQTT_USER=server
POWER_MQTT_PASSWORD="<PASSWORD>"
POWER_MQTT_HOSTNAME=
POWER_MQTT_CLIENT_ID=
POWER_MQTT_DISCOVERY=true
```

Empty `HOSTNAME` / `CLIENT_ID` → automatically `pve1` and `pve-power-mqtt-pve1`.
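
The fallback rule can be written down in a few lines of shell. The agent itself is written in Go, so this only mirrors the documented behavior; the function names and the use of `uname -n` as the host name source are assumptions.

```bash
# Sketch of the documented fallback; not the agent's actual code.
# Assumptions: function names; host name taken from `uname -n` (short form).
effective_hostname() {
    local configured="${1:-}"
    if [ -n "$configured" ]; then
        echo "$configured"
    else
        uname -n | cut -d. -f1
    fi
}

# effective_client_id <configured_id_or_empty> <hostname>
effective_client_id() {
    local configured_id="${1:-}" host="$2"
    if [ -n "$configured_id" ]; then
        echo "$configured_id"
    else
        echo "pve-power-mqtt-$host"
    fi
}
```

With both variables empty on pve1, `effective_client_id '' "$(effective_hostname '')"` gives `pve-power-mqtt-pve1`, which keeps the client ID unique per host without extra configuration.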

### MQTT client IDs (important)

Every host needs a **unique** client ID, otherwise "session taken over" shows up in the Mosquitto log:

| Host | Client ID |
|------|-----------|
| pve1 | `pve-power-mqtt-pve1` |
| pve2 | `pve-power-mqtt-pve2` |

### Known Mosquitto log messages

| Message | Meaning |
|---------|-----------|
| `session taken over` | New connect with the same client ID; check for a duplicate |
| `exceeded timeout` | Keepalive missed; the agent reconnects |
| `pingresp not received` | Network/latency between VLAN 10↔40; keepalive in the agent set to 120 s |

### HA after an agent update

**Settings → Devices & Services → MQTT → Reload**

For old discovery entries with `expire_after` or `availability_topic`, delete the entity if necessary and let it be rediscovered.