1. Preparation

Concepts:

A Ray job can be run in at least three ways:

First: start a Ray cluster, then submit jobs to the running cluster: https://docs.ray.io/en/latest/cluster/running-applications/job-submission/cli.html#
Second: deploy kuberay-operator, which provides the RayJob custom resource in Kubernetes, then submit a RayJob: https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayjob.md
Third: integrate Ray with Volcano (using queue and podgroup): https://github.com/ray-project/kuberay/blob/master/docs/guidance/volcano-integration.md
The main code change:
https://github.com/ray-project/kuberay/pull/755/files#diff-58d661db0d2307b4cd362b636f6e8753ac85b0dbc19c1d91df1f41c6ee5e826b
This post mainly tries out the third approach.
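
For comparison, the first approach (submitting to an already running cluster) can be sketched in a few lines. The snippet below only builds the HTTP request for the Ray Jobs REST endpoint (/api/jobs/ on the dashboard port) and does not send it. The dashboard address and the entrypoint string are assumptions for illustration, and the helper name build_submit_request is hypothetical.

```python
import json
import urllib.request


def build_submit_request(dashboard_url: str, entrypoint: str) -> urllib.request.Request:
    """Build (but do not send) a job submission request for a running Ray cluster."""
    payload = {"entrypoint": entrypoint, "runtime_env": {}}
    return urllib.request.Request(
        url=dashboard_url.rstrip("/") + "/api/jobs/",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # Assumed address: a Ray dashboard port-forwarded to localhost:8265.
    req = build_submit_request("http://127.0.0.1:8265", "python -c 'print(1)'")
    print(req.get_method(), req.full_url)
    # To actually submit, pass the request to urllib.request.urlopen(req).
```

With the third approach none of this is needed: the cluster itself is described by a RayCluster manifest and Volcano decides when its pods may start.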

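Volcano concepts:

Introduction: https://volcano.sh/zh/docs/podgroup/

  • A queue holds a group of podgroups and is also the basis on which that group of podgroups is allotted cluster resources.
  • A podgroup is a set of strongly related pods, used mainly for batch workloads, for example a group of ps and worker pods in TensorFlow. It is a Volcano custom resource type.
  • Volcano Job, vcjob for short, is Volcano's custom Job resource type. Unlike a Kubernetes Job, a vcjob offers more advanced features, such as choosing the scheduler, a minimum number of running pods, tasks, lifecycle management, choosing a queue, and priority scheduling. Volcano Job is better suited to high-performance computing scenarios such as machine learning, big data, and scientific computing.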
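
The "strongly related pods" idea behind a podgroup is all-or-nothing (gang) admission: either the minimum number of pods can all start, or none of them should. The following is a minimal sketch of that rule only, not Volcano's actual scheduler logic; the class and function names are hypothetical, and it assumes every pod requests the same amount of CPU.

```python
from dataclasses import dataclass


@dataclass
class PodGroup:
    name: str
    min_member: int   # minimum number of pods that must start together
    cpu_per_pod: int  # simplification: every pod requests the same CPU


def gang_admit(group: PodGroup, free_cpu: int) -> bool:
    """Admit the group only if all min_member pods fit at once."""
    return group.min_member * group.cpu_per_pod <= free_cpu


if __name__ == "__main__":
    group = PodGroup(name="ray-demo-pg", min_member=3, cpu_per_pod=1)
    print(gang_admit(group, free_cpu=4))  # True: 3 CPUs needed, 4 free
    print(gang_admit(group, free_cpu=2))  # False: starting only 2 of 3 pods is not allowed
```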

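Prerequisites:

A Kubernetes cluster is required. This post runs on Huawei Cloud CCE, which already provides Kubernetes with support for Helm add-ons.
Install kubectl, helm, and related tools locally.

Plan:

Install the kuberay-operator 0.5.1 Helm chart, image version 0.5.0
Install the kuberay-apiserver 0.5.1 Helm chart, image version 0.5.0
The job uses the rayproject/ray image, version 2.4.0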

2. Install kuberay-operator

Enable the batch scheduler:

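helm install kuberay-operator kuberay/kuberay-operator --version 0.5.1 --set batchScheduler.enabled=true

This assumes the KubeRay Helm repository has been added locally under the name kuberay.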

3. Create the queue

Command:

 kubectl apply -f createQueue.yaml

File contents:

apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: kuberay-test-queue
spec:
  weight: 1
  capability:
    cpu: 4
    memory: 6Gi

4. Create the Ray cluster

root@DESKTOP-3813A3M:/mnt/d/all/app/Ray/rayjob-vocalno# kubectl apply -f createRayCluster.yaml
raycluster.ray.io/test-cluster-0 created

File contents:

Changes needed for your environment:

  • replace the image,
  • set serviceType,
  • set imagePullSecrets

apiVersion: ray.io/v1alpha1
kind: RayCluster
metadata:
  name: test-cluster-0
  labels:
    ray.io/scheduler-name: volcano
    volcano.sh/queue-name: kuberay-test-queue
spec:
  rayVersion: '2.4.0'
  headGroupSpec:
    rayStartParams: {}
    replicas: 1
    serviceType: "ClusterIP"
    template:
      spec:
        imagePullSecrets:
        - name: default-secret
        containers:
        - name: ray-head
          image: swr.cn-north-7.myhuaweicloud.com/modelarts-idm-auto/rayproject/ray:2.4.0
          resources:
            limits:
              cpu: "1"
              memory: "2Gi"
            requests:
              cpu: "1"
              memory: "2Gi"
  workerGroupSpecs:
  - groupName: worker
    rayStartParams: {}
    replicas: 2
    minReplicas: 2
    maxReplicas: 2
    template:
      spec:
        imagePullSecrets:
        - name: default-secret
        containers:
        - name: ray-head
          image: swr.cn-north-7.myhuaweicloud.com/modelarts-idm-auto/rayproject/ray:2.4.0
          resources:
            limits:
              cpu: "1"
              memory: "1Gi"
            requests:
              cpu: "1"
              memory: "1Gi"

View the podGroup:

root@DESKTOP-3813A3M:/mnt/d/all/app/Ray/rayjob-vocalno# kubectl get podgroup ray-test-cluster-1-pg -o yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  creationTimestamp: "2023-05-11T07:05:58Z"
  generation: 285
  name: ray-test-cluster-1-pg
  namespace: default
  ownerReferences:
  - apiVersion: ray.io/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: RayCluster
    name: test-cluster-1
    uid: 1fa1e9e8-0e9c-4f36-a10d-9a88abb32853
  resourceVersion: "54132769"
  selfLink: /apis/scheduling.volcano.sh/v1beta1/namespaces/default/podgroups/ray-test-cluster-1-pg
  uid: c12a3abd-1dec-41f2-9a9a-2c85e5a74f7a
spec:
  minMember: 3
  minResources:
    cpu: "3"
    memory: 4Gi
  queue: kuberay-test-queue
status:
  conditions:
  - lastTransitionTime: "2023-05-11T12:15:26Z"
    reason: tasks in gang are ready to be scheduled
    status: "True"
    transitionID: f6dd2e6b-b98c-46be-8b0b-57862c2dcd90
    type: Scheduled
  phase: Running
  running: 3
root@DESKTOP-3813A3M:/mnt/d/all/app/Ray/rayjob-vocalno# kubectl get podgroup ray-test-cluster-3-pg -o yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  creationTimestamp: "2023-05-11T12:14:39Z"
  generation: 4
  name: ray-test-cluster-3-pg
  namespace: default
  ownerReferences:
  - apiVersion: ray.io/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: RayCluster
    name: test-cluster-3
    uid: d4878879-5635-459e-8678-ab668abfbd2b
  resourceVersion: "54132373"
  selfLink: /apis/scheduling.volcano.sh/v1beta1/namespaces/default/podgroups/ray-test-cluster-3-pg
  uid: 139d23a7-0260-4eac-9ba9-b8599aab6eab
spec:
  minMember: 3
  minResources:
    cpu: "3"
    memory: 4Gi
  queue: kuberay-test-queue
status:
  conditions:
  - lastTransitionTime: "2023-05-11T12:14:50Z"
    reason: tasks in gang are ready to be scheduled
    status: "True"
    transitionID: f30b3387-e00f-4481-995e-b175a424ea47
    type: Scheduled
  phase: Running
  running: 3
root@DESKTOP-3813A3M:/mnt/d/all/app/Ray/rayjob-vocalno# kubectl get podgroup ray-test-cluster-0-pg -o yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  creationTimestamp: "2023-05-11T04:08:53Z"
  generation: 465
  name: ray-test-cluster-0-pg
  namespace: default
  ownerReferences:
  - apiVersion: ray.io/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: RayCluster
    name: test-cluster-0
    uid: a0233819-6e8e-4555-9d3f-13d5d1b2301a
  resourceVersion: "54145929"
  selfLink: /apis/scheduling.volcano.sh/v1beta1/namespaces/default/podgroups/ray-test-cluster-0-pg
  uid: 29585663-7352-44c5-8ae8-7575f1cb6937
spec:
  minMember: 3
  minResources:
    cpu: "3"
    memory: 4Gi
  queue: kuberay-test-queue
status:
  conditions:
  - lastTransitionTime: "2023-05-11T12:34:17Z"
    reason: tasks in gang are ready to be scheduled
    status: "True"
    transitionID: 8d0c920b-f503-4f51-afd2-4cd9dca751ae
    type: Scheduled
  phase: Running
  running: 3
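
The minMember: 3 and minResources (cpu "3", memory 4Gi) values in each PodGroup match what you get by summing the resource requests in the RayCluster manifest: one head pod (1 CPU, 2Gi) plus two workers (1 CPU, 1Gi each). The sketch below reproduces that arithmetic. It is not taken from the KubeRay source; the helper names are hypothetical and the memory parser only handles the Ki/Mi/Gi suffixes.

```python
from dataclasses import dataclass

_UNITS = {"Ki": 1024, "Mi": 1024 ** 2, "Gi": 1024 ** 3}


def parse_memory(quantity: str) -> int:
    """Convert a Kubernetes binary-suffix quantity such as '2Gi' to bytes."""
    for suffix, factor in _UNITS.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # no suffix: plain byte count


@dataclass
class Group:
    replicas: int
    cpu: int      # CPU request per pod
    memory: str   # memory request per pod, e.g. "1Gi"


def min_requirements(groups):
    """Sum pod count, CPU, and memory (bytes) over all groups."""
    members = sum(g.replicas for g in groups)
    cpu = sum(g.replicas * g.cpu for g in groups)
    memory = sum(g.replicas * parse_memory(g.memory) for g in groups)
    return members, cpu, memory


if __name__ == "__main__":
    head = Group(replicas=1, cpu=1, memory="2Gi")
    workers = Group(replicas=2, cpu=1, memory="1Gi")
    members, cpu, memory = min_requirements([head, workers])
    print(members, cpu, memory // _UNITS["Gi"])  # 3 3 4
```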

5. Inspect

View the Queue

Three Ray clusters are already running:

root@DESKTOP-3813A3M:/mnt/d/all/app/Ray/rayjob-vocalno#  kubectl get queue kuberay-test-queue -o yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"scheduling.volcano.sh/v1beta1","kind":"Queue","metadata":{"annotations":{},"name":"kuberay-test-queue"},"spec":{"capability":{"cpu":4,"memory":"6Gi"},"weight":1}}
  creationTimestamp: "2023-05-11T04:04:44Z"
  generation: 1
  name: kuberay-test-queue
  resourceVersion: "54132374"
  selfLink: /apis/scheduling.volcano.sh/v1beta1/queues/kuberay-test-queue
  uid: dfb0d4af-f899-4c48-ac8c-7bcd5d0016b7
spec:
  capability:
    cpu: 4
    memory: 6Gi
  reclaimable: true
  weight: 1
status:
  allocated:
    cpu: "9"
    memory: 12Gi
  reservation: {}
  running: 3
  state: Open

View a pod

root@DESKTOP-3813A3M:/mnt/d/all/app/Ray/rayjob-vocalno# kubectl describe pod test-cluster-4-worker-worker-kvrzt
Name:           test-cluster-4-worker-worker-kvrzt
Namespace:      default
Priority:       0
Node:           172.18.154.132/172.18.154.132
Start Time:     Thu, 11 May 2023 20:37:09 +0800
Labels:         app.kubernetes.io/created-by=kuberay-operator
                app.kubernetes.io/name=kuberay
                ray.io/cluster=test-cluster-4
                ray.io/cluster-dashboard=test-cluster-4-dashboard
                ray.io/group=worker
                ray.io/identifier=test-cluster-4-worker
                ray.io/is-ray-node=yes
                ray.io/node-type=worker
                volcano.sh/queue-name=kuberay-test-queue
Annotations:    kubernetes.io/psp: psp-global
                ray.io/ft-enabled: false
                ray.io/health-state:
                scheduling.k8s.io/group-name: ray-test-cluster-4-pg
Status:         Pending
IP:
IPs:            <none>
Controlled By:  RayCluster/test-cluster-4
Init Containers:
  wait-gcs-ready:
    Container ID:
    Image:         swr.cn-north-7.myhuaweicloud.com/modelarts-idm-auto/rayproject/ray:2.4.0
    Image ID:
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/bash
      -lc
      --
    Args:
      until ray health-check --address test-cluster-4-head-svc.default.svc.cluster.local:6379 > /dev/null 2>&1; do echo wait for GCS to be ready; sleep 5; done
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Environment:
      FQ_RAY_IP:  test-cluster-4-head-svc.default.svc.cluster.local
      RAY_IP:     test-cluster-4-head-svc
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-spw4v (ro)
Containers:
  ray-head:
    Container ID:
    Image:         swr.cn-north-7.myhuaweicloud.com/modelarts-idm-auto/rayproject/ray:2.4.0
    Image ID:
    Port:          8080/TCP
    Host Port:     0/TCP
    Command:
      /bin/bash
      -lc
      --
    Args:
      ulimit -n 65536; ray start  --address=test-cluster-4-head-svc.default.svc.cluster.local:6379  --metrics-export-port=8080  --block  --num-cpus=1  --memory=1073741824
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     1
      memory:  1Gi
    Requests:
      cpu:     1
      memory:  1Gi
    Environment:
      FQ_RAY_IP:                       test-cluster-4-head-svc.default.svc.cluster.local
      RAY_IP:                          test-cluster-4-head-svc
      RAY_CLUSTER_NAME:                 (v1:metadata.labels['ray.io/cluster'])
      RAY_PORT:                        6379
      RAY_ADDRESS:                     test-cluster-4-head-svc.default.svc.cluster.local:6379
      RAY_USAGE_STATS_KUBERAY_IN_USE:  1
      REDIS_PASSWORD:
    Mounts:
      /dev/shm from shared-mem (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-spw4v (ro)
Conditions:
  Type              Status
  Initialized       False
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  shared-mem:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     Memory
    SizeLimit:  1Gi
  default-token-spw4v:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-spw4v
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason                 Age   From     Message
  ----    ------                 ----  ----     -------
  Normal  Scheduled              42s   volcano  Successfully assigned default/test-cluster-4-worker-worker-kvrzt to 172.18.154.132
  Normal  SuccessfulMountVolume  41s   kubelet  Successfully mounted volumes for pod "test-cluster-4-worker-worker-kvrzt_default(9ecf1c20-b145-4eae-978b-a6181b5a21c5)"
  Normal  Pulling                41s   kubelet  Pulling image "swr.cn-north-7.myhuaweicloud.com/modelarts-idm-auto/rayproject/ray:2.4.0"
  Normal  Pulled                 16s   kubelet  Successfully pulled image "swr.cn-north-7.myhuaweicloud.com/modelarts-idm-auto/rayproject/ray:2.4.0" in 25.037677108s

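The result does not seem to match expectations. In theory, the queue is capped at 4 CPUs and 6Gi of memory, and each Ray cluster needs 3 CPUs and 4Gi, so the second Ray cluster should have stayed Pending. Instead, all three clusters were created without any of them going Pending; all are Running, and the queue status reports 9 CPUs and 12Gi allocated. Could this be a Volcano version issue?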
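
To make the expectation concrete, here is a small model of a queue that treats its capability as a hard cap and admits clusters in order. It describes the behavior expected above, not what Volcano actually did in this test and not Volcano's real algorithm; the function name and the use of whole-Gi integers are simplifications for illustration.

```python
def simulate_queue(cap_cpu, cap_mem_gi, demands):
    """Admit each (cpu, mem_gi) demand in order if it still fits under the cap."""
    used_cpu = used_mem = 0
    phases = []
    for cpu, mem in demands:
        if used_cpu + cpu <= cap_cpu and used_mem + mem <= cap_mem_gi:
            used_cpu += cpu
            used_mem += mem
            phases.append("Running")
        else:
            phases.append("Pending")
    return phases, (used_cpu, used_mem)


if __name__ == "__main__":
    # Queue capability: 4 CPUs, 6Gi. Each Ray cluster: 3 CPUs, 4Gi.
    phases, used = simulate_queue(4, 6, [(3, 4)] * 3)
    print(phases, used)  # ['Running', 'Pending', 'Pending'] (3, 4)
```

Under this model only the first cluster would run, with 3 CPUs and 4Gi allocated, whereas the observed queue status shows all three running with 9 CPUs and 12Gi allocated.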

6. Summary

KubeRay's Volcano integration only demonstrates integration with queue and podGroup; it does not demonstrate integration with Volcano Job (vcjob).
