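Learning and Trying Out Ray on Volcano
1. Preparation
Concepts:
There are at least three ways to run a Ray job:
Option 1: start a Ray cluster first, then submit jobs to the running cluster: https://docs.ray.io/en/latest/cluster/running-applications/job-submission/cli.html#
Option 2: deploy kuberay-operator, which provides the RayJob Kubernetes custom resource (CR), then submit a RayJob: https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayjob.md
Option 3: integrate Ray with Volcano (using queue and PodGroup): https://github.com/ray-project/kuberay/blob/master/docs/guidance/volcano-integration.md
Main code change: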
https://github.com/ray-project/kuberay/pull/755/files#diff-58d661db0d2307b4cd362b636f6e8753ac85b0dbc19c1d91df1f41c6ee5e826b
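This post mainly tries out the third option.
Volcano concepts:
Introduction: https://volcano.sh/zh/docs/podgroup/
- A queue holds a group of PodGroups and is also the basis on which those PodGroups are allotted cluster resources.
- A PodGroup is a set of strongly related pods, mainly used for batch workloads, for example a group of ps and worker pods in TensorFlow. It is a Volcano custom resource type.
- Volcano Job, vcjob for short, is Volcano's custom Job resource type. Unlike a Kubernetes Job, a vcjob offers more advanced features, such as specifying the scheduler, a minimum number of running pods, tasks, lifecycle management, a target queue, and priority-based scheduling. Volcano Job is better suited to high-performance computing scenarios such as machine learning, big data, and scientific computing.
Prerequisites:
A Kubernetes cluster is required first. This post runs on Huawei Cloud CCE, which already provides Kubernetes and supports Helm add-ons.
Install kubectl, helm, and related tools locally.
Plan:
Install the kuberay-operator 0.5.1 Helm chart, image version 0.5.0
Install the kuberay-apiserver 0.5.1 Helm chart, image version 0.5.0
Jobs use the rayproject/ray:2.4.0 image
2. Install kuberay-operator
Enable the batch scheduler: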
helm install kuberay-operator kuberay/kuberay-operator --set batchScheduler.enabled=true
3. Create the queue
Command:
kubectl apply -f createQueue.yaml
File contents:
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: kuberay-test-queue
spec:
  weight: 1
  capability:
    cpu: 4
    memory: 6Gi
4. Create the Ray cluster
root@DESKTOP-3813A3M:/mnt/d/all/app/Ray/rayjob-vocalno# kubectl apply -f createRayCluster.yaml
raycluster.ray.io/test-cluster-0 created
File contents:
Compared with the upstream sample, you need to:
- replace the image,
- specify serviceType,
- specify imagePullSecrets
apiVersion: ray.io/v1alpha1
kind: RayCluster
metadata:
  name: test-cluster-0
  labels:
    ray.io/scheduler-name: volcano
    volcano.sh/queue-name: kuberay-test-queue
spec:
  rayVersion: '2.4.0'
  headGroupSpec:
    rayStartParams: {}
    replicas: 1
    serviceType: "ClusterIP"
    template:
      spec:
        imagePullSecrets:
        - name: default-secret
        containers:
        - name: ray-head
          image: swr.cn-north-7.myhuaweicloud.com/modelarts-idm-auto/rayproject/ray:2.4.0
          resources:
            limits:
              cpu: "1"
              memory: "2Gi"
            requests:
              cpu: "1"
              memory: "2Gi"
  workerGroupSpecs:
  - groupName: worker
    rayStartParams: {}
    replicas: 2
    minReplicas: 2
    maxReplicas: 2
    template:
      spec:
        imagePullSecrets:
        - name: default-secret
        containers:
        - name: ray-head
          image: swr.cn-north-7.myhuaweicloud.com/modelarts-idm-auto/rayproject/ray:2.4.0
          resources:
            limits:
              cpu: "1"
              memory: "1Gi"
            requests:
              cpu: "1"
              memory: "1Gi"
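The post applies the manifest for test-cluster-0 but later inspects test-cluster-1, -3 and -4 as well, without showing how those were created. Presumably the same manifest was reused with a different name. As a minimal sketch under that assumption, the manifest can be built as a Python dict and emitted as JSON, which `kubectl apply -f` also accepts. The function name `ray_cluster_manifest` and its parameters are hypothetical; the field values are copied from the manifest above.

```python
import json


def ray_cluster_manifest(name, image, pull_secret, queue):
    """Build a RayCluster manifest mirroring createRayCluster.yaml (hypothetical helper)."""

    def pod_template(cpu, memory):
        # Requests and limits are identical in the post's manifest.
        resources = {"cpu": cpu, "memory": memory}
        return {
            "spec": {
                "imagePullSecrets": [{"name": pull_secret}],
                "containers": [
                    {
                        # The post names the container "ray-head" in both groups.
                        "name": "ray-head",
                        "image": image,
                        "resources": {
                            "limits": dict(resources),
                            "requests": dict(resources),
                        },
                    }
                ],
            }
        }

    return {
        "apiVersion": "ray.io/v1alpha1",
        "kind": "RayCluster",
        "metadata": {
            "name": name,
            "labels": {
                "ray.io/scheduler-name": "volcano",
                "volcano.sh/queue-name": queue,
            },
        },
        "spec": {
            "rayVersion": "2.4.0",
            "headGroupSpec": {
                "rayStartParams": {},
                "replicas": 1,
                "serviceType": "ClusterIP",
                "template": pod_template("1", "2Gi"),
            },
            "workerGroupSpecs": [
                {
                    "groupName": "worker",
                    "rayStartParams": {},
                    "replicas": 2,
                    "minReplicas": 2,
                    "maxReplicas": 2,
                    "template": pod_template("1", "1Gi"),
                }
            ],
        },
    }


if __name__ == "__main__":
    image = "swr.cn-north-7.myhuaweicloud.com/modelarts-idm-auto/rayproject/ray:2.4.0"
    # Print one JSON manifest per cluster name seen later in the post.
    for suffix in (0, 1, 3, 4):
        manifest = ray_cluster_manifest(
            f"test-cluster-{suffix}", image, "default-secret", "kuberay-test-queue"
        )
        print(json.dumps(manifest, indent=2))
```

Each printed document could be saved to its own file and applied the same way as createRayCluster.yaml.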
Inspect the PodGroups:
root@DESKTOP-3813A3M:/mnt/d/all/app/Ray/rayjob-vocalno# kubectl get podgroup ray-test-cluster-1-pg -o yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  creationTimestamp: "2023-05-11T07:05:58Z"
  generation: 285
  name: ray-test-cluster-1-pg
  namespace: default
  ownerReferences:
  - apiVersion: ray.io/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: RayCluster
    name: test-cluster-1
    uid: 1fa1e9e8-0e9c-4f36-a10d-9a88abb32853
  resourceVersion: "54132769"
  selfLink: /apis/scheduling.volcano.sh/v1beta1/namespaces/default/podgroups/ray-test-cluster-1-pg
  uid: c12a3abd-1dec-41f2-9a9a-2c85e5a74f7a
spec:
  minMember: 3
  minResources:
    cpu: "3"
    memory: 4Gi
  queue: kuberay-test-queue
status:
  conditions:
  - lastTransitionTime: "2023-05-11T12:15:26Z"
    reason: tasks in gang are ready to be scheduled
    status: "True"
    transitionID: f6dd2e6b-b98c-46be-8b0b-57862c2dcd90
    type: Scheduled
  phase: Running
  running: 3
root@DESKTOP-3813A3M:/mnt/d/all/app/Ray/rayjob-vocalno# kubectl get podgroup ray-test-cluster-3-pg -o yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  creationTimestamp: "2023-05-11T12:14:39Z"
  generation: 4
  name: ray-test-cluster-3-pg
  namespace: default
  ownerReferences:
  - apiVersion: ray.io/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: RayCluster
    name: test-cluster-3
    uid: d4878879-5635-459e-8678-ab668abfbd2b
  resourceVersion: "54132373"
  selfLink: /apis/scheduling.volcano.sh/v1beta1/namespaces/default/podgroups/ray-test-cluster-3-pg
  uid: 139d23a7-0260-4eac-9ba9-b8599aab6eab
spec:
  minMember: 3
  minResources:
    cpu: "3"
    memory: 4Gi
  queue: kuberay-test-queue
status:
  conditions:
  - lastTransitionTime: "2023-05-11T12:14:50Z"
    reason: tasks in gang are ready to be scheduled
    status: "True"
    transitionID: f30b3387-e00f-4481-995e-b175a424ea47
    type: Scheduled
  phase: Running
  running: 3
root@DESKTOP-3813A3M:/mnt/d/all/app/Ray/rayjob-vocalno#
root@DESKTOP-3813A3M:/mnt/d/all/app/Ray/rayjob-vocalno#
root@DESKTOP-3813A3M:/mnt/d/all/app/Ray/rayjob-vocalno# kubectl get podgroup ray-test-cluster-0-pg -o yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  creationTimestamp: "2023-05-11T04:08:53Z"
  generation: 465
  name: ray-test-cluster-0-pg
  namespace: default
  ownerReferences:
  - apiVersion: ray.io/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: RayCluster
    name: test-cluster-0
    uid: a0233819-6e8e-4555-9d3f-13d5d1b2301a
  resourceVersion: "54145929"
  selfLink: /apis/scheduling.volcano.sh/v1beta1/namespaces/default/podgroups/ray-test-cluster-0-pg
  uid: 29585663-7352-44c5-8ae8-7575f1cb6937
spec:
  minMember: 3
  minResources:
    cpu: "3"
    memory: 4Gi
  queue: kuberay-test-queue
status:
  conditions:
  - lastTransitionTime: "2023-05-11T12:34:17Z"
    reason: tasks in gang are ready to be scheduled
    status: "True"
    transitionID: 8d0c920b-f503-4f51-afd2-4cd9dca751ae
    type: Scheduled
  phase: Running
  running: 3
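The minMember and minResources values in the PodGroups above are consistent with summing the head and worker groups of the RayCluster manifest: one head pod (1 CPU, 2Gi) plus two worker pods (1 CPU, 1Gi each). The following is a small sketch of that arithmetic only. It is not taken from the operator's code, and the helper names are hypothetical.

```python
def parse_gi(quantity):
    """Parse a memory quantity such as '2Gi' into a whole number of Gi.

    Only the 'Gi' suffix used in this post is handled (an assumption of this sketch).
    """
    if not quantity.endswith("Gi"):
        raise ValueError(f"unsupported quantity: {quantity}")
    return int(quantity[:-2])


def pod_group_minimums(groups):
    """Sum replicas, CPU and memory over all pod groups of one Ray cluster."""
    min_member = sum(g["replicas"] for g in groups)
    cpu = sum(g["replicas"] * g["cpu"] for g in groups)
    memory_gi = sum(g["replicas"] * parse_gi(g["memory"]) for g in groups)
    return min_member, cpu, memory_gi


# Values copied from the RayCluster manifest in this post.
groups = [
    {"name": "head", "replicas": 1, "cpu": 1, "memory": "2Gi"},
    {"name": "worker", "replicas": 2, "cpu": 1, "memory": "1Gi"},
]

min_member, cpu, memory_gi = pod_group_minimums(groups)
print(f"minMember={min_member} cpu={cpu} memory={memory_gi}Gi")
# prints: minMember=3 cpu=3 memory=4Gi
```

So every cluster created from this manifest asks the queue for 3 CPUs and 4Gi as a single gang.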
5. Inspect
Inspect the Queue
Three Ray clusters are already running at this point.
root@DESKTOP-3813A3M:/mnt/d/all/app/Ray/rayjob-vocalno# kubectl get queue kuberay-test-queue -o yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"scheduling.volcano.sh/v1beta1","kind":"Queue","metadata":{"annotations":{},"name":"kuberay-test-queue"},"spec":{"capability":{"cpu":4,"memory":"6Gi"},"weight":1}}
  creationTimestamp: "2023-05-11T04:04:44Z"
  generation: 1
  name: kuberay-test-queue
  resourceVersion: "54132374"
  selfLink: /apis/scheduling.volcano.sh/v1beta1/queues/kuberay-test-queue
  uid: dfb0d4af-f899-4c48-ac8c-7bcd5d0016b7
spec:
  capability:
    cpu: 4
    memory: 6Gi
  reclaimable: true
  weight: 1
status:
  allocated:
    cpu: "9"
    memory: 12Gi
  reservation: {}
  running: 3
  state: Open
Inspect a pod
root@DESKTOP-3813A3M:/mnt/d/all/app/Ray/rayjob-vocalno# kubectl describe pod test-cluster-4-worker-worker-kvrzt
Name: test-cluster-4-worker-worker-kvrzt
Namespace: default
Priority: 0
Node: 172.18.154.132/172.18.154.132
Start Time: Thu, 11 May 2023 20:37:09 +0800
Labels:       app.kubernetes.io/created-by=kuberay-operator
              app.kubernetes.io/name=kuberay
              ray.io/cluster=test-cluster-4
              ray.io/cluster-dashboard=test-cluster-4-dashboard
              ray.io/group=worker
              ray.io/identifier=test-cluster-4-worker
              ray.io/is-ray-node=yes
              ray.io/node-type=worker
              volcano.sh/queue-name=kuberay-test-queue
Annotations:  kubernetes.io/psp: psp-global
              ray.io/ft-enabled: false
              ray.io/health-state:
              scheduling.k8s.io/group-name: ray-test-cluster-4-pg
Status: Pending
IP:
IPs: <none>
Controlled By: RayCluster/test-cluster-4
Init Containers:
  wait-gcs-ready:
    Container ID:
    Image:         swr.cn-north-7.myhuaweicloud.com/modelarts-idm-auto/rayproject/ray:2.4.0
    Image ID:
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/bash
      -lc
      --
    Args:
      until ray health-check --address test-cluster-4-head-svc.default.svc.cluster.local:6379 > /dev/null 2>&1; do echo wait for GCS to be ready; sleep 5; done
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Environment:
      FQ_RAY_IP:  test-cluster-4-head-svc.default.svc.cluster.local
      RAY_IP:     test-cluster-4-head-svc
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-spw4v (ro)
Containers:
  ray-head:
    Container ID:
    Image:         swr.cn-north-7.myhuaweicloud.com/modelarts-idm-auto/rayproject/ray:2.4.0
    Image ID:
    Port:          8080/TCP
    Host Port:     0/TCP
    Command:
      /bin/bash
      -lc
      --
    Args:
      ulimit -n 65536; ray start --address=test-cluster-4-head-svc.default.svc.cluster.local:6379 --metrics-export-port=8080 --block --num-cpus=1 --memory=1073741824
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     1
      memory:  1Gi
    Requests:
      cpu:     1
      memory:  1Gi
    Environment:
      FQ_RAY_IP:                       test-cluster-4-head-svc.default.svc.cluster.local
      RAY_IP:                          test-cluster-4-head-svc
      RAY_CLUSTER_NAME:                 (v1:metadata.labels['ray.io/cluster'])
      RAY_PORT:                        6379
      RAY_ADDRESS:                     test-cluster-4-head-svc.default.svc.cluster.local:6379
      RAY_USAGE_STATS_KUBERAY_IN_USE:  1
      REDIS_PASSWORD:
    Mounts:
      /dev/shm from shared-mem (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-spw4v (ro)
Conditions:
  Type              Status
  Initialized       False
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  shared-mem:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     Memory
    SizeLimit:  1Gi
  default-token-spw4v:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-spw4v
    Optional:    false
QoS Class: Burstable
Node-Selectors: <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason                 Age   From     Message
  ----    ------                 ----  ----     -------
  Normal  Scheduled              42s   volcano  Successfully assigned default/test-cluster-4-worker-worker-kvrzt to 172.18.154.132
  Normal  SuccessfulMountVolume  41s   kubelet  Successfully mounted volumes for pod "test-cluster-4-worker-worker-kvrzt_default(9ecf1c20-b145-4eae-978b-a6181b5a21c5)"
  Normal  Pulling                41s   kubelet  Pulling image "swr.cn-north-7.myhuaweicloud.com/modelarts-idm-auto/rayproject/ray:2.4.0"
  Normal  Pulled                 16s   kubelet  Successfully pulled image "swr.cn-north-7.myhuaweicloud.com/modelarts-idm-auto/rayproject/ray:2.4.0" in 25.037677108s
root@DESKTOP-3813A3M:/mnt/d/all/app/Ray/rayjob-vocalno#
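This does not seem to be the expected result. In principle, since the queue's capability is 4 CPUs and 6Gi of memory and each cluster needs 3 CPUs and 4Gi, the second Ray cluster should have stayed pending. Instead, all three clusters were created without pending and are Running, and the queue status reports 9 CPUs and 12Gi allocated, which is above its capability. Could this be a Volcano version issue?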
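To make the expectation concrete, here is a deliberately simplified model of queue admission: a gang is admitted only if its minResources still fit under the queue's capability after what is already allocated. This encodes the behavior the post expected, not Volcano's actual scheduling algorithm. The `Resources` class and `admit` function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resources:
    cpu: int
    memory_gi: int

    def __add__(self, other):
        return Resources(self.cpu + other.cpu, self.memory_gi + other.memory_gi)

    def fits_within(self, capability):
        return self.cpu <= capability.cpu and self.memory_gi <= capability.memory_gi


def admit(capability, gang_requests):
    """Admit gangs in order while the running total stays within capability.

    Simplified model of the expected behavior; not Volcano's real algorithm.
    """
    allocated = Resources(0, 0)
    states = []
    for request in gang_requests:
        candidate = allocated + request
        if candidate.fits_within(capability):
            allocated = candidate
            states.append("Running")
        else:
            states.append("Pending")
    return states, allocated


# Queue capability and per-cluster minResources as shown in this post.
queue_capability = Resources(cpu=4, memory_gi=6)
per_cluster = Resources(cpu=3, memory_gi=4)

states, allocated = admit(queue_capability, [per_cluster] * 3)
print(states)     # ['Running', 'Pending', 'Pending']
print(allocated)  # Resources(cpu=3, memory_gi=4)
```

Under this model only the first cluster runs and the queue holds 3 CPUs and 4Gi, whereas the observed queue status shows three running PodGroups with 9 CPUs and 12Gi allocated.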
6. Summary
KubeRay's Volcano integration only demonstrates integration with queue and PodGroup; it does not demonstrate integration with Volcano Job.