Introduction
Work requirements mean I will soon be embracing Spark. I have studied the relevant material before, and I now plan to read through parts of the official documentation for the latest release, Spark 1.3, in detail: first, to review; second, to catch up on the latest developments; and third, to prepare for training my company's team.

Reposting is welcome; please credit the source:
http://blog.csdn.net/u010967382/article/details/45062407

Original URL: http://spark.apache.org/docs/latest/running-on-yarn.html
This document focuses on how to run Spark applications on YARN.
First, a quick review of the YARN architecture:
I will not go into the YARN architecture in detail here; please look it up yourself. The key part to understand is the grey portion of the diagram: the Containers, which are the unit of resource allocation in YARN. An application, such as the MR application in the figure, consists of one MR Application Master responsible for overall scheduling plus multiple Map/Reduce Tasks that execute the actual work.
Spark has supported YARN since version 0.6.0, with continued improvements in subsequent releases.
Running Spark on YARN requires a Spark distribution built with YARN support. You can download one from the official site (http://spark.apache.org/downloads.html) or build it yourself (http://spark.apache.org/docs/latest/building-spark.html).
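If you build it yourself, the YARN profile must be enabled. A rough sketch of the build command follows; the Hadoop profile and version shown here are assumptions, so check the building-spark page above for the flags that match your cluster:

# build a YARN-enabled Spark distribution (hadoop-2.4 / 2.4.0 are placeholder choices)
mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package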
As for configuring Spark on YARN, the basic configuration is the same as for the other deployment modes; in addition there is a set of parameters designed specifically for YARN, listed in the table below. I will not translate them; consult the table as needed:
Property Name | Default | Meaning
spark.yarn.am.memory | 512m | Amount of memory to use for the YARN Application Master in client mode, in the same format as JVM memory strings (e.g. 512m, 2g). In cluster mode, use spark.driver.memory instead.
spark.driver.cores | 1 | Number of cores used by the driver in YARN cluster mode. Since the driver is run in the same JVM as the YARN Application Master in cluster mode, this also controls the cores used by the YARN AM. In client mode, use spark.yarn.am.cores to control the number of cores used by the YARN AM instead.
spark.yarn.am.cores | 1 | Number of cores to use for the YARN Application Master in client mode. In cluster mode, use spark.driver.cores instead.
spark.yarn.am.waitTime | 100000 | In yarn-cluster mode, time in milliseconds for the application master to wait for the SparkContext to be initialized. In yarn-client mode, time for the application master to wait for the driver to connect to it.
spark.yarn.submit.file.replication | The default HDFS replication (usually 3) | HDFS replication level for the files uploaded into HDFS for the application. These include things like the Spark jar, the app jar, and any distributed cache files/archives.
spark.yarn.preserve.staging.files | false | Set to true to preserve the staged files (Spark jar, app jar, distributed cache files) at the end of the job rather than delete them.
spark.yarn.scheduler.heartbeat.interval-ms | 5000 | The interval in ms in which the Spark application master heartbeats into the YARN ResourceManager.
spark.yarn.max.executor.failures | numExecutors * 2, with minimum of 3 | The maximum number of executor failures before failing the application.
spark.yarn.historyServer.address | (none) | The address of the Spark history server (i.e. host.com:18080). The address should not contain a scheme (http://). Defaults to not being set since the history server is an optional service. This address is given to the YARN ResourceManager when the Spark application finishes to link the application from the ResourceManager UI to the Spark history server UI.
spark.yarn.dist.archives | (none) | Comma-separated list of archives to be extracted into the working directory of each executor.
spark.yarn.dist.files | (none) | Comma-separated list of files to be placed in the working directory of each executor.
spark.executor.instances | 2 | The number of executors. Note that this property is incompatible with spark.dynamicAllocation.enabled.
spark.yarn.executor.memoryOverhead | executorMemory * 0.07, with minimum of 384 | The amount of off-heap memory (in megabytes) to be allocated per executor. This is memory that accounts for things like VM overheads, interned strings, other native overheads, etc. This tends to grow with the executor size (typically 6-10%).
spark.yarn.driver.memoryOverhead | driverMemory * 0.07, with minimum of 384 | The amount of off-heap memory (in megabytes) to be allocated per driver in cluster mode. This is memory that accounts for things like VM overheads, interned strings, other native overheads, etc. This tends to grow with the container size (typically 6-10%).
spark.yarn.am.memoryOverhead | AM memory * 0.07, with minimum of 384 | Same as spark.yarn.driver.memoryOverhead, but for the Application Master in client mode.
spark.yarn.queue | default | The name of the YARN queue to which the application is submitted.
spark.yarn.jar | (none) | The location of the Spark jar file, in case overriding the default location is desired. By default, Spark on YARN will use a Spark jar installed locally, but the Spark jar can also be in a world-readable location on HDFS. This allows YARN to cache it on nodes so that it doesn't need to be distributed each time an application runs. To point to a jar on HDFS, for example, set this configuration to "hdfs:///some/path".
spark.yarn.access.namenodes | (none) | A list of secure HDFS namenodes your Spark application is going to access. For example, spark.yarn.access.namenodes=hdfs://nn1.com:8032,hdfs://nn2.com:8032. The Spark application must have access to the namenodes listed and Kerberos must be properly configured to be able to access them (either in the same realm or in a trusted realm). Spark acquires security tokens for each of the namenodes so that the Spark application can access those remote HDFS clusters.
spark.yarn.appMasterEnv.[EnvironmentVariableName] | (none) | Add the environment variable specified by EnvironmentVariableName to the Application Master process launched on YARN. The user can specify multiple of these to set multiple environment variables. In yarn-cluster mode this controls the environment of the Spark driver, and in yarn-client mode it only controls the environment of the executor launcher.
spark.yarn.containerLauncherMaxThreads | 25 | The maximum number of threads to use in the application master for launching executor containers.
spark.yarn.am.extraJavaOptions | (none) | A string of extra JVM options to pass to the YARN Application Master in client mode. In cluster mode, use spark.driver.extraJavaOptions instead.
spark.yarn.maxAppAttempts | yarn.resourcemanager.am.max-attempts in YARN | The maximum number of attempts that will be made to submit the application. It should be no larger than the global number of max attempts in the YARN configuration.
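For illustration, these YARN-specific properties can be set in conf/spark-defaults.conf; a minimal sketch follows, where the queue name, jar path, and overhead value are placeholder assumptions rather than recommended values:

# conf/spark-defaults.conf (placeholder values)
spark.yarn.queue                     thequeue
spark.yarn.jar                       hdfs:///some/path/spark-assembly.jar
spark.yarn.executor.memoryOverhead   512

The same properties can also be passed at submit time with --conf key=value on the spark-submit command line.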
To run Spark applications on YARN, first make sure that the HADOOP_CONF_DIR or YARN_CONF_DIR environment variable correctly points to the Hadoop configuration directory (e.g. export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop).
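For example, a minimal sketch assuming a standard Hadoop layout under $HADOOP_HOME:

export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
# YARN_CONF_DIR can point to the same client-side configuration directory
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop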
The next step is to choose the yarn-* deployment mode.
Depending on where the driver process runs, there are two modes for launching Spark on YARN:
  • yarn-client mode: the driver runs in the client process, and the Application Master (part of the YARN architecture, the per-application scheduler for applications running on YARN) is used only to request resources from YARN (the client can see the application's output in its console).

  • yarn-cluster mode: the Spark driver runs inside the Application Master process (the client console does not show the application's output).
Note: unlike Spark standalone and Mesos modes, where the master address is specified with the --master parameter, in YARN mode the ResourceManager address is read from the Hadoop configuration, so the --master parameter of the spark-submit script only needs to specify yarn-client or yarn-cluster.
A Spark on YARN application can be launched in yarn-cluster mode with the following command format:
./bin/spark-submit --class path.to.your.Class --master yarn-cluster [options] <app jar> [app options]
For example:
$ ./bin/spark-submit --class org.apache.spark.examples.SparkPi \
    --master yarn-cluster \
    --num-executors 3 \
    --driver-memory 4g \
    --executor-memory 2g \
    --executor-cores 1 \
    --queue thequeue \
    lib/spark-examples*.jar \
    10
The script above launches an application in YARN cluster mode; SparkPi runs as a child thread of the Application Master. The client (outside the cluster) periodically polls the Application Master for the latest application status and prints the status it receives to the console. The client exits once the application finishes running.
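To launch in yarn-client mode instead, the same command works with --master yarn-client in place of yarn-cluster; the interactive shell follows the same pattern, roughly:

$ ./bin/spark-shell --master yarn-client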
In yarn-cluster mode, the driver and the client run on different machines, so SparkContext.addJar in the driver does not work for files local to the client. To make files on the client machine available to SparkContext.addJar, include them with the --jars option of the spark-submit script, for example:
$ ./bin/spark-submit --class my.main.Class \
    --master yarn-cluster \
    --jars my-other-jar.jar,my-other-other-jar.jar \
    my-main-jar.jar \
    app_arg1 app_arg2
