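GPU Resource Sharing and Preemption

Overall Architecture of GPU Sharing

This project is a Kubernetes GPU-sharing scheduling system. It shares GPU resources through two types of pods: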

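1. Shadow Pod

Purpose: holds the physical GPU resource and acts as the actual owner of the GPU
Characteristics: runs a simple sleep container and performs no actual compute work
Creation: created automatically when the quota's resources are insufficient
Lifecycle: managed by the shadow controller, which creates and deletes shadow pods automatically based on usage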

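2. Share Pod

Purpose: the pod that actually runs the compute workload; it shares a shadow pod's GPU through the device plugin
Characteristics: uses a virtual GPU device and shares the physical GPU by time-slicing or spatial partitioning
Resource request: requests a shared GPU resource (for example, mizar.share.gpu.2 means 2-way sharing)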
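
As an illustration of the resource naming above, the sharing factor can be read off the resource name. This is a minimal sketch: the function name ParseShareWays and the constant sharePrefix are hypothetical, and the only fact taken from the text is that mizar.share.gpu.2 denotes 2-way sharing.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// sharePrefix is inferred from the single example in the text
// ("mizar.share.gpu.2"); the project's real constant is not shown.
const sharePrefix = "mizar.share.gpu."

// ParseShareWays extracts the sharing factor N from a resource name of the
// form "mizar.share.gpu.N". Hypothetical helper for illustration only.
func ParseShareWays(resource string) (int, error) {
	if !strings.HasPrefix(resource, sharePrefix) {
		return 0, fmt.Errorf("%q is not a shared-GPU resource", resource)
	}
	suffix := strings.TrimPrefix(resource, sharePrefix)
	n, err := strconv.Atoi(suffix)
	if err != nil || n < 1 {
		return 0, fmt.Errorf("invalid sharing factor in %q", resource)
	}
	return n, nil
}
```

Calling ParseShareWays("mizar.share.gpu.2") yields 2, while a name without the prefix is rejected with an error.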

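Core Logic

1. Quota Management (Quota Controller)

Monitors quota resource status and computes the shadow resource demand
Maintains the mapping between shadow pods and physical GPUs
Handles the creation and deletion logic for shadow pods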

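2. Shadow Pod Management (Shadow Controller)

Creation: when a quota's ready + pipelining count is negative, computes how many shadow pods need to be created
Deletion: decides whether to delete a shadow pod based on how long it has been idle
Condition check: tracks GPU usage state through the PodInUsing condition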

3. Device Plugin

Share Device Plugin: provides virtual GPU devices to share pods
Device ID format: {namespace}/{shadowpod-name}-{index}
Updates the device list dynamically to reflect the shadow pod resources currently available
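
To make the device ID format concrete, here is a minimal sketch of building and parsing such IDs. The function names FormatDeviceID and ParseDeviceID are hypothetical; only the {namespace}/{shadowpod-name}-{index} layout comes from the text. The parser splits on the first slash and the last hyphen, since pod names may themselves contain hyphens.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// FormatDeviceID builds an ID of the form {namespace}/{shadowpod-name}-{index}.
func FormatDeviceID(namespace, shadowPod string, index int) string {
	return fmt.Sprintf("%s/%s-%d", namespace, shadowPod, index)
}

// ParseDeviceID is the inverse of FormatDeviceID. It uses the first "/" as the
// namespace separator and the last "-" as the index separator.
func ParseDeviceID(id string) (namespace, shadowPod string, index int, err error) {
	slash := strings.Index(id, "/")
	dash := strings.LastIndex(id, "-")
	if slash <= 0 || dash <= slash+1 || dash == len(id)-1 {
		return "", "", 0, fmt.Errorf("malformed device ID %q", id)
	}
	index, err = strconv.Atoi(id[dash+1:])
	if err != nil {
		return "", "", 0, fmt.Errorf("malformed index in device ID %q: %w", id, err)
	}
	return id[:slash], id[slash+1 : dash], index, nil
}
```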

4. Scheduler (Solver)

ShareGPU Solver: handles share pod events and manages the bindings between share pods and shadow pods
SpotGPU Solver: handles spot pod events and supports low-power mode and preemption

5. Cache (Mapping Cache)

Maintains three layers of mappings:
1. Shadow pod → physical GPU device
2. Share pod → shadow pod
3. Device state tracking (healthy, in use, idle)
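
The three layers above could be held in a structure like the following. This is a sketch only: the type MappingCache, its fields and methods, and the DeviceState values are hypothetical, and the real cache (which the excerpts later call through methods such as GetOwnerUsers and UnUse) is not reproduced here.

```go
package main

import (
	"fmt"
	"sync"
)

// DeviceState is a simplified state model (an assumption for this sketch).
type DeviceState int

const (
	Idle DeviceState = iota
	InUse
	Unhealthy
)

// MappingCache keeps the three layers described in the text.
type MappingCache struct {
	mu            sync.Mutex
	shadowToGPUs  map[string][]string    // shadow pod key -> physical GPU IDs
	shareToShadow map[string]string      // share pod key -> shadow pod key
	deviceState   map[string]DeviceState // physical GPU ID -> state
}

func NewMappingCache() *MappingCache {
	return &MappingCache{
		shadowToGPUs:  map[string][]string{},
		shareToShadow: map[string]string{},
		deviceState:   map[string]DeviceState{},
	}
}

// AddShadow registers a shadow pod and marks its GPUs idle.
func (c *MappingCache) AddShadow(shadow string, gpus ...string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.shadowToGPUs[shadow] = gpus
	for _, g := range gpus {
		c.deviceState[g] = Idle
	}
}

// Bind attaches a share pod to a shadow pod and marks idle GPUs as in use.
func (c *MappingCache) Bind(share, shadow string) error {
	c.mu.Lock()
	defer c.mu.Unlock()
	gpus, ok := c.shadowToGPUs[shadow]
	if !ok {
		return fmt.Errorf("unknown shadow pod %q", shadow)
	}
	c.shareToShadow[share] = shadow
	for _, g := range gpus {
		if c.deviceState[g] == Idle {
			c.deviceState[g] = InUse
		}
	}
	return nil
}

// Unbind detaches a share pod; GPUs return to idle once no share pod is left.
func (c *MappingCache) Unbind(share string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	shadow, ok := c.shareToShadow[share]
	if !ok {
		return
	}
	delete(c.shareToShadow, share)
	for _, s := range c.shareToShadow {
		if s == shadow {
			return // still shared by another pod
		}
	}
	for _, g := range c.shadowToGPUs[shadow] {
		if c.deviceState[g] == InUse {
			c.deviceState[g] = Idle
		}
	}
}

// State returns the tracked state of a GPU and whether it is known.
func (c *MappingCache) State(gpu string) (DeviceState, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	st, ok := c.deviceState[gpu]
	return st, ok
}
```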

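Workflow

Resource request: a user creates a pod that requests a shared GPU (a share pod)
Quota check: the quota controller checks whether current resources are sufficient
Shadow pod creation: if they are not, the shadow controller creates a shadow pod to occupy a physical GPU
Device registration: the device plugin registers the shadow pod's GPU as virtual devices
Pod scheduling: the share pod is scheduled onto a node that has an available shadow pod
Resource binding: the ShareGPU solver binds the share pod to a specific shadow pod device
Usage monitoring: GPU usage is monitored through pod conditions and metrics
Resource reclamation: when the share pod finishes or the shadow pod's idle timeout expires, the resources are reclaimed automatically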

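Key Technical Points

GPU compute sharing: GPU compute is shared through CUDA MPS or time-slicing
Dynamic adjustment: shadow pods are created and deleted dynamically according to load
Priority management: low-power mode and preemption are supported
State synchronization: Kubernetes conditions and annotations keep state consistent

This architecture enables GPU oversubscription and elastic scaling, which improves GPU utilization while remaining compatible with standard Kubernetes scheduling.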

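Spot Resource Preemption

1. Preemption Triggers

Spot resources are preempted in the following situations:

a) Owner pod becomes active again (the main preemption scenario)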

// owner pod alive, evict spot
if !utils.IsLowPowerEnabled(pod) || !s.enableLowPower {
    ns, name, _ := cache.SplitMetaNamespaceKey(e.Target)
    s.logger.WithFields(logman.Fields{"devices": e.Devices}).Infof("device owner alive, evicting pod %s", name)
    if err := s.kubeClient.CoreV1().Pods(ns).Delete(context.Background(), name, metav1.DeleteOptions{}); err != nil {
        s.logger.Errorf("failed evicting pod %s: %v", name, err)
    }
    return
}

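Trigger: when the GPU's original owner pod becomes active again and low-power sharing does not apply (the pod has not enabled low-power mode, or the solver has it disabled), the spot pod is evicted immediately.

b) Low-power mode preemption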

func (s *spotGPUSolver) solveBusy(e detectors.Event) error {
    // update evict metrics before it happens
    evictedUsers := s.cache.GetOwnerUsers(e.Target)
    s.updateEvictCounter(evictedUsers, consts.EvictReasonOwnerBack)

    // do evict: kill the containers first, then delete the pods
    if err := s.evictProcessFor(e); err != nil {
        return err
    }
    if err := s.evictFor(e.Target); err != nil {
        return err
    }
    return nil
}

Trigger: when a low-power pod goes from idle to busy, every spot pod using its GPU resources is evicted.
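
The two triggers can be summarized as one decision function. This is a loose sketch that merges two separate excerpts, so it should be read as an interpretation rather than the solver's real control flow. The types OwnerEvent and Action and the function DecideOnOwnerActivity are hypothetical; the first branch mirrors the condition !utils.IsLowPowerEnabled(pod) || !s.enableLowPower from the excerpt above.

```go
package main

// Action is the preemption step chosen for an owner event (hypothetical type).
type Action int

const (
	NoAction        Action = iota
	DeleteSpotPod          // owner is alive and low-power sharing does not apply
	EvictOwnerUsers        // low-power owner turned busy: evict every spot user
)

// OwnerEvent captures the inputs the two excerpts depend on (hypothetical type).
type OwnerEvent struct {
	OwnerLowPower  bool // the owner pod has low-power mode enabled
	SolverLowPower bool // the solver has low-power support enabled
	OwnerBusy      bool // the owner went from idle to busy
}

// DecideOnOwnerActivity picks the preemption action for an owner event.
func DecideOnOwnerActivity(e OwnerEvent) Action {
	// Same condition as the first excerpt: without low-power sharing,
	// an active owner means the spot pod is deleted right away.
	if !e.OwnerLowPower || !e.SolverLowPower {
		return DeleteSpotPod
	}
	// A low-power owner only preempts once it becomes busy.
	if e.OwnerBusy {
		return EvictOwnerUsers
	}
	return NoAction
}
```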

2. How Preemption Is Carried Out

a) Container-level preemption (gentle)

func (s *spotGPUSolver) evictProcessFor(e detectors.Event) error {
    // NOTE: the lookup that resolves pod and user is elided in this excerpt
    for _, container := range pod.Status.ContainerStatuses {
        // strip the runtime prefix to get the bare container ID
        id := strings.TrimPrefix(container.ContainerID, "containerd://")
        // kill each container concurrently; failures are only logged
        go func(containerID string) {
            s.logger.Infof("evicting container %s for dev %s", containerID, e.Target)
            if err := s.taskClient.Kill(context.Background(), containerID); err != nil {
                s.logger.WithError(err).Errorf("evictProcessFor: failed killing user %s", user)
            }
        }(id)
    }
    return nil
}

Characteristics: kills the container process directly through the containerd API, so resources are reclaimed quickly.

b) Pod-level preemption (forced)

func (s *spotGPUSolver) evictFor(target string) error {
    users := s.cache.GetOwnerUsers(target)
    for _, user := range users {
        ns, name, _ := cache.SplitMetaNamespaceKey(user)
        s.logger.Infof("evicting pod %s for dev %s", name, target)
        // a failed delete is logged and the loop moves on to the next user
        if err := s.kubeClient.CoreV1().Pods(ns).Delete(context.Background(), name, metav1.DeleteOptions{}); err != nil {
            s.logger.Errorf("failed evicting pod %s: %v", user, err)
        }
    }
    return nil
}

Characteristics: deletes the whole pod through the Kubernetes API so that its resources are fully released.

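3. Preemption Priority Policy

a) Low-power pods first

Low-power pods have the highest priority and are never preempted by spot pods
Only low-power pods can share their GPUs with spot pods
When a low-power pod needs its resources, spot pods are evicted immediately

b) Resource reservation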

func (s *spotGPUSolver) evictByDevice(e detectors.Event) error {
    // NOTE: ds (the device state) and users (the set of pods to evict) come from code elided in this excerpt
    // remove the device from the idle space
    if err := s.cache.UnUse(consts.IdleSpace, ds.SharingPod, ds.ID); err != nil {
        s.logger.WithError(err).WithFields(logman.Fields{"user": ds.SharingPod, "device": ds.ID}).Errorf("failed unuse from idle space")
    }
    users[ds.SharingPod] = struct{}{}
    return nil
}

4. State Management After Preemption

a) Cache cleanup

// update the cache
s.cache.Delete(e.Target)
// remove the spot GPU (mark it unhealthy)
s.updateDP(e, pluginapi.Unhealthy)

b) Condition update

// update the active condition
if err := s.kubeClient.UpdateActiveCondition(e.Target, true); err != nil {
    s.logger.WithError(err).WithFields(logman.Fields{"active": true}).Errorf("updating active condition")
}

c) Metrics

// update the evict counter
s.updateEvictCounter(evictedUsers, consts.EvictReasonOwnerBack)
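
For readers who want to see what such a counter might track, here is a minimal in-memory sketch. It is not the project's implementation: the type EvictCounter and its methods are hypothetical, the reason string "owner_back" is an assumed value for consts.EvictReasonOwnerBack, and a real deployment would more likely export the counts through a metrics library.

```go
package main

import "sync"

// EvictCounter counts evictions per reason and per evicted user.
// Hypothetical stand-in for whatever updateEvictCounter writes to.
type EvictCounter struct {
	mu     sync.Mutex
	counts map[string]map[string]int // reason -> user key -> count
}

func NewEvictCounter() *EvictCounter {
	return &EvictCounter{counts: map[string]map[string]int{}}
}

// Update adds one eviction for each user under the given reason.
func (c *EvictCounter) Update(users []string, reason string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.counts[reason] == nil {
		c.counts[reason] = map[string]int{}
	}
	for _, u := range users {
		c.counts[reason][u]++
	}
}

// Count returns how many times a user was evicted for a reason.
func (c *EvictCounter) Count(user, reason string) int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.counts[reason][user] // reading a missing key yields 0
}
```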

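5. Characteristics of the Preemption Policy

Real-time: an event-driven mechanism responds to resource state changes as they happen
Tiered handling: container-level (gentle) preemption runs first, then pod-level (forced) preemption
State consistency: the cache, the device plugin, and Kubernetes state are kept consistent
Observability: detailed preemption logs and metrics are recorded for monitoring and debugging
Resource guarantee: the resource needs of high-priority workloads (low-power pods) are served first

This preemption mechanism keeps GPUs well utilized while allowing resources to be reclaimed quickly for high-priority workloads when needed.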