Hadoop Ecosystem: Using Azkaban to Upload a File to HDFS and Run MR Data Cleansing

                                           Author: Yin Zhengjie

Copyright notice: This is an original work. Reposting is not permitted; violations will be pursued legally.

  If you don't have a Hadoop cluster, that's not a problem; here are my notes from deploying this one: https://www.cnblogs.com/yinzhengjie/p/9154265.html. For anything beyond that, please refer to the official deployment documentation; the environment I deployed is only a test/development setup.

 

1. Starting the Hadoop cluster

1>. Startup script details

[yinzhengjie@s101 ~]$ more /usr/local/bin/xzk.sh
#!/bin/bash
#@author :yinzhengjie
#blog:http://www.cnblogs.com/yinzhengjie
#EMAIL:y1053419035@qq.com

# Check whether the user passed an argument
if [ $# -ne 1 ];then
    echo "Invalid argument. Usage: $0 {start|stop|restart|status}"
    exit
fi

# Get the command entered by the user
cmd=$1

# Define the dispatch function
function zookeeperManger(){
    case $cmd in
        start)
            echo "Starting services"
            remoteExecution start
            ;;
        stop)
            echo "Stopping services"
            remoteExecution stop
            ;;
        restart)
            echo "Restarting services"
            remoteExecution restart
            ;;
        status)
            echo "Checking status"
            remoteExecution status
            ;;
        *)
            echo "Invalid argument. Usage: $0 {start|stop|restart|status}"
            ;;
    esac
}

# Define the function that runs the command on the remote hosts
function remoteExecution(){
    for (( i=102 ; i<=104 ; i++ )) ; do
        tput setaf 2
        echo ========== s$i zkServer.sh $1 ================
        tput setaf 9
        ssh s$i "source /etc/profile ; zkServer.sh $1"
    done
}

# Invoke the function
zookeeperManger
[yinzhengjie@s101 ~]$ 

ZooKeeper startup script ([yinzhengjie@s101 ~]$ more /usr/local/bin/xzk.sh)

[yinzhengjie@s101 ~]$ more /usr/local/bin/xcall.sh
#!/bin/bash
#@author :yinzhengjie
#blog:http://www.cnblogs.com/yinzhengjie
#EMAIL:y1053419035@qq.com

# Check whether the user passed arguments
if [ $# -lt 1 ];then
    echo "Please provide an argument"
    exit
fi

# Get the command entered by the user
cmd=$@

for (( i=101;i<=105;i++ ))
do
    # Turn the terminal text green
    tput setaf 2
    echo ============= s$i $cmd ============
    # Restore the terminal's original grayish-white color
    tput setaf 7
    # Run the command remotely
    ssh s$i $cmd
    # Check whether the command succeeded
    if [ $? == 0 ];then
        echo "Command executed successfully"
    fi
done
[yinzhengjie@s101 ~]$ 

Script for running a command on all nodes ([yinzhengjie@s101 ~]$ more /usr/local/bin/xcall.sh)

[yinzhengjie@s101 ~]$ more /soft/hadoop/sbin/start-dfs.sh
#!/usr/bin/env bash

# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


# Start hadoop dfs daemons.
# Optionally upgrade or rollback dfs state.
# Run this on master node.

usage="Usage: start-dfs.sh [-upgrade|-rollback] [other options such as -clusterId]"

bin=`dirname "${BASH_SOURCE-$0}"`
bin=`cd "$bin"; pwd`

DEFAULT_LIBEXEC_DIR="$bin"/../libexec
HADOOP_LIBEXEC_DIR=${HADOOP_LIBEXEC_DIR:-$DEFAULT_LIBEXEC_DIR}
. $HADOOP_LIBEXEC_DIR/hdfs-config.sh

# get arguments
if [[ $# -ge 1 ]]; then
  startOpt="$1"
  shift
  case "$startOpt" in
    -upgrade)
      nameStartOpt="$startOpt"
    ;;
    -rollback)
      dataStartOpt="$startOpt"
    ;;
    *)
      echo $usage
      exit 1
    ;;
  esac
fi

#Add other possible options
nameStartOpt="$nameStartOpt $@"

#---------------------------------------------------------
# namenodes

NAMENODES=$($HADOOP_PREFIX/bin/hdfs getconf -namenodes)

echo "Starting namenodes on [$NAMENODES]"

"$HADOOP_PREFIX/sbin/hadoop-daemons.sh" \
  --config "$HADOOP_CONF_DIR" \
  --hostnames "$NAMENODES" \
  --script "$bin/hdfs" start namenode $nameStartOpt

#---------------------------------------------------------
# datanodes (using default slaves file)

if [ -n "$HADOOP_SECURE_DN_USER" ]; then
  echo \
    "Attempting to start secure cluster, skipping datanodes. " \
    "Run start-secure-dns.sh as root to complete startup."
else
  "$HADOOP_PREFIX/sbin/hadoop-daemons.sh" \
    --config "$HADOOP_CONF_DIR" \
    --script "$bin/hdfs" start datanode $dataStartOpt
fi

#---------------------------------------------------------
# secondary namenodes (if any)

SECONDARY_NAMENODES=$($HADOOP_PREFIX/bin/hdfs getconf -secondarynamenodes 2>/dev/null)

if [ -n "$SECONDARY_NAMENODES" ]; then
  echo "Starting secondary namenodes [$SECONDARY_NAMENODES]"

  "$HADOOP_PREFIX/sbin/hadoop-daemons.sh" \
      --config "$HADOOP_CONF_DIR" \
      --hostnames "$SECONDARY_NAMENODES" \
      --script "$bin/hdfs" start secondarynamenode
fi

#---------------------------------------------------------
# quorumjournal nodes (if any)

SHARED_EDITS_DIR=$($HADOOP_PREFIX/bin/hdfs getconf -confKey dfs.namenode.shared.edits.dir 2>&-)

case "$SHARED_EDITS_DIR" in
qjournal://*)
  JOURNAL_NODES=$(echo "$SHARED_EDITS_DIR" | sed 's,qjournal://\([^/]*\)/.*,\1,g; s/;/ /g; s/:[0-9]*//g')
  echo "Starting journal nodes [$JOURNAL_NODES]"
  "$HADOOP_PREFIX/sbin/hadoop-daemons.sh" \
      --config "$HADOOP_CONF_DIR" \
      --hostnames "$JOURNAL_NODES" \
      --script "$bin/hdfs" start journalnode ;;
esac

#---------------------------------------------------------
# ZK Failover controllers, if auto-HA is enabled
AUTOHA_ENABLED=$($HADOOP_PREFIX/bin/hdfs getconf -confKey dfs.ha.automatic-failover.enabled)
if [ "$(echo "$AUTOHA_ENABLED" | tr A-Z a-z)" = "true" ]; then
  echo "Starting ZK Failover Controllers on NN hosts [$NAMENODES]"
  "$HADOOP_PREFIX/sbin/hadoop-daemons.sh" \
    --config "$HADOOP_CONF_DIR" \
    --hostnames "$NAMENODES" \
    --script "$bin/hdfs" start zkfc
fi

# eof
[yinzhengjie@s101 ~]$ 

HDFS startup script ([yinzhengjie@s101 ~]$ more /soft/hadoop/sbin/start-dfs.sh)

[yinzhengjie@s101 ~]$ more /soft/hadoop/sbin/start-yarn.sh
#!/usr/bin/env bash

# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


# Start all yarn daemons.  Run this on master node.

echo "starting yarn daemons"

bin=`dirname "${BASH_SOURCE-$0}"`
bin=`cd "$bin"; pwd`

DEFAULT_LIBEXEC_DIR="$bin"/../libexec
HADOOP_LIBEXEC_DIR=${HADOOP_LIBEXEC_DIR:-$DEFAULT_LIBEXEC_DIR}
. $HADOOP_LIBEXEC_DIR/yarn-config.sh

# start resourceManager
#"$bin"/yarn-daemon.sh --config $YARN_CONF_DIR  start resourcemanager
"$bin"/yarn-daemons.sh --config $YARN_CONF_DIR --hosts masters start resourcemanager
# start nodeManager
"$bin"/yarn-daemons.sh --config $YARN_CONF_DIR  start nodemanager
# start proxyserver
#"$bin"/yarn-daemon.sh --config $YARN_CONF_DIR  start proxyserver
[yinzhengjie@s101 ~]$ 

YARN resource scheduler startup script ([yinzhengjie@s101 ~]$ more /soft/hadoop/sbin/start-yarn.sh)

2>. Starting the ZooKeeper cluster

[yinzhengjie@s101 ~]$ xzk.sh start
Starting services
========== s102 zkServer.sh start ================
ZooKeeper JMX enabled by default
Using config: /soft/zk/bin/../conf/zoo.cfg
Starting zookeeper ... STARTED
========== s103 zkServer.sh start ================
ZooKeeper JMX enabled by default
Using config: /soft/zk/bin/../conf/zoo.cfg
Starting zookeeper ... STARTED
========== s104 zkServer.sh start ================
ZooKeeper JMX enabled by default
Using config: /soft/zk/bin/../conf/zoo.cfg
Starting zookeeper ... STARTED
[yinzhengjie@s101 ~]$
[yinzhengjie@s101 ~]$
[yinzhengjie@s101 ~]$ xcall.sh jps
============= s101 jps ============
2630 AzkabanWebServer
2666 AzkabanExecutorServer
3485 Jps
Command executed successfully
============= s102 jps ============
2354 Jps
2319 QuorumPeerMain
Command executed successfully
============= s103 jps ============
2332 Jps
2303 QuorumPeerMain
Command executed successfully
============= s104 jps ============
2337 Jps
2308 QuorumPeerMain
Command executed successfully
============= s105 jps ============
2310 Jps
Command executed successfully
[yinzhengjie@s101 ~]$ 

3>. Starting the HDFS distributed file system

[yinzhengjie@s101 ~]$ start-dfs.sh
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/soft/hadoop-2.7.3/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/soft/apache-hive-2.1.1-bin/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Starting namenodes on [s101 s105]
s101: starting namenode, logging to /soft/hadoop-2.7.3/logs/hadoop-yinzhengjie-namenode-s101.out
s105: starting namenode, logging to /soft/hadoop-2.7.3/logs/hadoop-yinzhengjie-namenode-s105.out
s105: starting datanode, logging to /soft/hadoop-2.7.3/logs/hadoop-yinzhengjie-datanode-s105.out
s103: starting datanode, logging to /soft/hadoop-2.7.3/logs/hadoop-yinzhengjie-datanode-s103.out
s104: starting datanode, logging to /soft/hadoop-2.7.3/logs/hadoop-yinzhengjie-datanode-s104.out
s102: starting datanode, logging to /soft/hadoop-2.7.3/logs/hadoop-yinzhengjie-datanode-s102.out
Starting journal nodes [s102 s103 s104]
s103: starting journalnode, logging to /soft/hadoop-2.7.3/logs/hadoop-yinzhengjie-journalnode-s103.out
s102: starting journalnode, logging to /soft/hadoop-2.7.3/logs/hadoop-yinzhengjie-journalnode-s102.out
s104: starting journalnode, logging to /soft/hadoop-2.7.3/logs/hadoop-yinzhengjie-journalnode-s104.out
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/soft/hadoop-2.7.3/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/soft/apache-hive-2.1.1-bin/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Starting ZK Failover Controllers on NN hosts [s101 s105]
s101: starting zkfc, logging to /soft/hadoop-2.7.3/logs/hadoop-yinzhengjie-zkfc-s101.out
s105: starting zkfc, logging to /soft/hadoop-2.7.3/logs/hadoop-yinzhengjie-zkfc-s105.out
[yinzhengjie@s101 ~]$
[yinzhengjie@s101 ~]$
[yinzhengjie@s101 ~]$ xcall.sh jps
============= s101 jps ============
4016 Jps
3939 DFSZKFailoverController
2630 AzkabanWebServer
2666 AzkabanExecutorServer
3629 NameNode
Command executed successfully
============= s102 jps ============
2480 JournalNode
2405 DataNode
2583 Jps
2319 QuorumPeerMain
Command executed successfully
============= s103 jps ============
2382 DataNode
2303 QuorumPeerMain
2463 JournalNode
2559 Jps
Command executed successfully
============= s104 jps ============
2465 JournalNode
2387 DataNode
2563 Jps
2308 QuorumPeerMain
Command executed successfully
============= s105 jps ============
2547 DFSZKFailoverController
2436 DataNode
2365 NameNode
2655 Jps
Command executed successfully
[yinzhengjie@s101 ~]$ 

4>. Starting the YARN resource scheduler

[yinzhengjie@s101 ~]$ start-yarn.sh
starting yarn daemons
s101: starting resourcemanager, logging to /soft/hadoop-2.7.3/logs/yarn-yinzhengjie-resourcemanager-s101.out
s105: starting resourcemanager, logging to /soft/hadoop-2.7.3/logs/yarn-yinzhengjie-resourcemanager-s105.out
s105: starting nodemanager, logging to /soft/hadoop-2.7.3/logs/yarn-yinzhengjie-nodemanager-s105.out
s103: starting nodemanager, logging to /soft/hadoop-2.7.3/logs/yarn-yinzhengjie-nodemanager-s103.out
s104: starting nodemanager, logging to /soft/hadoop-2.7.3/logs/yarn-yinzhengjie-nodemanager-s104.out
s102: starting nodemanager, logging to /soft/hadoop-2.7.3/logs/yarn-yinzhengjie-nodemanager-s102.out
[yinzhengjie@s101 ~]$
[yinzhengjie@s101 ~]$ xcall.sh jps
============= s101 jps ============
3939 DFSZKFailoverController
2630 AzkabanWebServer
4231 Jps
2666 AzkabanExecutorServer
4140 ResourceManager
3629 NameNode
Command executed successfully
============= s102 jps ============
2480 JournalNode
2675 Jps
2405 DataNode
2638 NodeManager
2319 QuorumPeerMain
Command executed successfully
============= s103 jps ============
2615 NodeManager
2698 Jps
2382 DataNode
2303 QuorumPeerMain
2463 JournalNode
Command executed successfully
============= s104 jps ============
2465 JournalNode
2689 Jps
2387 DataNode
2308 QuorumPeerMain
2618 NodeManager
Command executed successfully
============= s105 jps ============
2547 DFSZKFailoverController
2915 Jps
2436 DataNode
2365 NameNode
2782 NodeManager
Command executed successfully
[yinzhengjie@s101 ~]$ 

5>. Checking that the web UIs are accessible
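
The screenshots for this step are omitted. As a quick shell-side sanity check, the usual web UI ports can be probed with curl; a minimal sketch, assuming the Hadoop 2.x default ports (50070 for the NameNode UI, 8088 for the ResourceManager UI) and this cluster's NameNode hosts s101 and s105:

# Hedged sketch: probe the HDFS and YARN web UIs and print the HTTP status.
# Ports are the Hadoop 2.x defaults; adjust if your site configuration overrides them.
for host in s101 s105; do
    curl -s -o /dev/null -w "http://$host:50070 -> HTTP %{http_code}\n" "http://$host:50070"
done
curl -s -o /dev/null -w "http://s101:8088 -> HTTP %{http_code}\n" "http://s101:8088"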

2. Uploading a file to HDFS

1>. Create the job file and zip it
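
The screenshots for this section are not reproduced here. As a hedged sketch of what this step might look like from the shell (the file name mkdir.job and its command are assumptions, inferred from step 7> below, which checks that /azkaban was created under the HDFS root):

# Hedged sketch: write a one-job Azkaban flow and zip it for upload.
# The job name and command are assumptions, not taken from the original post.
cat > mkdir.job <<'EOF'
#mkdir.job Add by yinzhengjie
type=command
command=hdfs dfs -mkdir /azkaban
EOF
zip mkdir.zip mkdir.job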

2>. Upload the zipped file to the Azkaban web server

3>. Run the job

4>. Click Continue

5>. View the details

6>. View the log output

7>. Check the HDFS web UI to confirm that the azkaban directory was created under the root
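
The same check can be made from the shell instead of the web UI; a minimal sketch:

# Hedged sketch: verify from the command line that the job created /azkaban.
hdfs dfs -ls / | grep azkaban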

8>. View the execution ID

3. Running the MR data cleansing job

1>. Contents of the job configuration files

[yinzhengjie@s101 ~]$ more /home/yinzhengjie/yinzhengjie.txt
Security is off.
Safemode is off.
535 files and directories, 224 blocks = 759 total filesystem object(s).
Heap Memory used 74.06 MB of 261.5 MB Heap Memory. Max Heap Memory is 889 MB.
Non Heap Memory used 56.54 MB of 57.94 MB Commited Non Heap Memory. Max Non Heap Memory is -1 B.
[yinzhengjie@s101 ~]$ 

Source file contents ([yinzhengjie@s101 ~]$ more /home/yinzhengjie/yinzhengjie.txt)

#putFileToHdfs Add by yinzhengjie
type=command
command=hdfs dfs -put /home/yinzhengjie/yinzhengjie.txt /azkaban

Upload the data to be word-counted to HDFS (putFileToHdfs.job)

#mapreduce.job ADD by yinzhengjie
type=command
command=hadoop jar /soft/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar wordcount /azkaban /azkaban_out

Run a MapReduce job to perform a word count on the file uploaded to HDFS (mapreduce.job)

2>. Edit the job files and zip them
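
A minimal sketch of this step, assuming the two job files above are saved in the current directory under the names used in their captions (the archive name wordcount.zip is an assumption):

# Hedged sketch: package both job files into a single archive for Azkaban.
zip wordcount.zip putFileToHdfs.job mapreduce.job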

3>. Upload the job package through the Azkaban web UI

4>. Choose the execution order
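
The original post appears to set the order through the web UI. Alternatively, Azkaban can fix the order inside the flow itself with the standard dependencies property; a hedged sketch (adding the dependencies line is an assumption, not something shown in the original):

# Hedged sketch: make mapreduce.job depend on putFileToHdfs so Azkaban
# always runs the upload step first.
cat > mapreduce.job <<'EOF'
#mapreduce.job ADD by yinzhengjie
type=command
dependencies=putFileToHdfs
command=hadoop jar /soft/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar wordcount /azkaban /azkaban_out
EOF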

5>. Check the MapReduce job status

6>. Check the MapReduce job results
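
The output can also be inspected from the shell; a minimal sketch (part-r-00000 is the conventional single-reducer output file name and is an assumption here):

# Hedged sketch: list the wordcount output directory and print the results.
hdfs dfs -ls /azkaban_out
hdfs dfs -cat /azkaban_out/part-r-00000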

Reposted from: https://www.cnblogs.com/yinzhengjie/p/9233393.html
