Hadoop Deployment and Usage¶
Day 1¶
Installing Hadoop:
1. Copy hadoop-3.2.2.tar.gz into the ~/software directory on CentOS.
2. Extract the Hadoop archive into the bigdata directory under the user's home directory:
$ cd ~
$ mkdir bigdata
$ cd bigdata
$ tar -zxvf ~/software/hadoop-3.2.2.tar.gz
3. Open the "/etc/profile" configuration file and add the Hadoop environment variables:
export HADOOP_HOME=/home/hduser/bigdata/hadoop-3.2.2
export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH
export HADOOP_MAPRED_HOME=${HADOOP_HOME}
export HADOOP_COMMON_HOME=${HADOOP_HOME}
export HADOOP_HDFS_HOME=${HADOOP_HOME}
export YARN_HOME=${HADOOP_HOME}
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib:$HADOOP_HOME/lib/native"
# export CLASSPATH=$($HADOOP_HOME/bin/hadoop classpath):$CLASSPATH # causes jar conflicts that break Zeppelin startup
4. Save, then source /etc/profile so the settings take effect: $ source /etc/profile
5. Verify that the environment variables are configured correctly:
$ hadoop version
Configuring Hadoop:
1. Edit the following configuration files in the $HADOOP_HOME/etc/hadoop/ directory:
1) hadoop-env.sh
Set the value of the JAVA_HOME property: export JAVA_HOME=/usr/local/jdk1.8.0_251
2) core-site.xml
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://cda:8020/</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>file:///home/hduser/bigdata/hadoop-3.2.2/dfs/tmp</value>
</property>
</configuration>
3) hdfs-site.xml
<configuration>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///home/hduser/bigdata/hadoop-3.2.2/dfs/name</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///home/hduser/bigdata/hadoop-3.2.2/dfs/data</value>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
4) mapred-site.xml
Configure it as follows:
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.application.classpath</name>
<value>
$HADOOP_HOME/share/hadoop/hdfs/*,
$HADOOP_HOME/share/hadoop/hdfs/lib/*,
$HADOOP_HOME/share/hadoop/yarn/*,
$HADOOP_HOME/share/hadoop/yarn/lib/*,
$HADOOP_HOME/share/hadoop/mapreduce/*,
$HADOOP_HOME/share/hadoop/mapreduce/lib/*
</value>
</property>
</configuration>
5) yarn-site.xml
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>localhost</value>
</property>
</configuration>
2. Format HDFS (formatting is needed only once). In a terminal window, run: $ hdfs namenode -format
Setting the hostname:
- Requires root privileges.
- View the current hostname: hostnamectl status
- Change the hostname (note: case-sensitive; at most 63 characters total, the shorter the better): sudo hostnamectl set-hostname master1
- View the current IP address: ip addr list
- Then open the hosts file and set the fully qualified domain name (FQDN): vi /etc/hosts
- Add a new line mapping the hostname to the IP address: 192.168.190.145 master1
- Reboot: sudo reboot -f
Running an MR program on Hadoop (pseudo-distributed mode):
1. Start the HDFS cluster: $ start-dfs.sh
2. Use the jps command to list the services running on the current node: $ jps
3. View the block status report: $ hdfs dfsadmin -report
4. List the HDFS root directory: $ hdfs dfs -ls /
5. Once startup succeeds, NameNode and DataNode information and the HDFS file system can be browsed through the web UI. NameNode web UI: http://xueai8:9870/
6. Start YARN: $ start-yarn.sh $ jps
Once startup succeeds, ResourceManager cluster information is available at [http://xueai8:8088](http://xueai8:8088).
NodeManager node information is available at [http://xueai8:8042](http://xueai8:8042).
7. Start the history server and timeline server: $ mapred --daemon start historyserver $ yarn --daemon start timelineserver
8. Run the pi example: $ cd ~/bigdata/hadoop-3.2.2/share/hadoop/mapreduce $ hadoop jar hadoop-mapreduce-examples-3.2.2.jar pi 10 20
9. Progress can be checked in the web UI: open a browser and go to http://xueai8:8088 (job list at http://xueai8:8088/cluster); clicking the History link in the "Tracking UI" column shows the job's run details.
10. Shut down the cluster: $ stop-yarn.sh $ stop-dfs.sh $ mapred --daemon stop historyserver $ yarn --daemon stop timelineserver
1. Preparation:
1) At least three machines forming a cluster.
2) All three machines need:
2.1 JDK 8+ installed, with environment variables configured
2.2 The SSH service installed (CentOS ships with SSH, so this step can be skipped). On each machine run: sudo yum install ssh
3) Give each machine a unique IP address: 192.168.190.162 192.168.190.163 192.168.190.164
4) Give each machine a unique hostname:
master1
worker01
worker02
In a terminal window on the master machine, run the following command to set its hostname to master1:
$ sudo hostnamectl set-hostname master1
In a terminal window on the worker01 machine, run the following command to set its hostname to worker01:
$ sudo hostnamectl set-hostname worker01
In a terminal window on the worker02 machine, run the following command to set its hostname to worker02:
$ sudo hostnamectl set-hostname worker02
5) Map each machine's IP address to its hostname:
Note: delete the 127.0.1.1 line (the second line at the top) from the original file; do this on every machine.
$ sudo nano /etc/hosts
192.168.190.162 master1
192.168.190.163 worker01
192.168.190.164 worker02
6) Disable the firewall on every machine.
Passwordless SSH setup:
1. On each machine, run:
$ ssh localhost
$ exit # remember to exit the ssh session with this command
2. On the namenode (the master machine), generate a key pair:
$ cd ~/.ssh
$ ssh-keygen -t rsa
Press Enter through the prompts; the key pair is generated under .ssh.
3. Append the namenode's public key to the authorized keys on master1, worker01 and worker02:
$ ssh-copy-id hduser@master1
$ ssh-copy-id hduser@worker01
$ ssh-copy-id hduser@worker02
4. Test. From the master node, ssh into master1, worker01 and worker02 in turn:
$ ssh master1
$ exit
$ ssh worker01
$ exit
$ ssh worker02
$ exit
The connections should now succeed without prompting for a password.
For the three-node cluster, also:
- Change the dfs.replication value in hdfs-site.xml to 3 (vim ./hdfs-site.xml).
- Create a workers file listing the hostnames of all three machines.
Day 2¶
Prerequisites. Before installing Hive:
- Install JDK 8+
- Install Hadoop 3.x
- Install the MySQL database (on CentOS, install the MariaDB database)
1. Installation¶
1. Download the latest stable Hive release into the ~/software/ directory, then extract it into ~/bigdata/:
$ cd ~/bigdata
$ tar -xzvf ~/software/apache-hive-3.1.2-bin.tar.gz
Shorten the directory name:
$ mv apache-hive-3.1.2-bin hive-3.1.2
2. Set the environment variables:
$ sudo nano /etc/profile
At the end of the file, add the following two lines:
export HIVE_HOME=/home/hduser/bigdata/hive-3.1.2
export PATH=$HIVE_HOME/bin:$PATH
Then run:
$ source /etc/profile
2. Configuration¶
Three files need to be configured:
hive-env.sh
hive-log4j2.properties
hive-site.xml
1. Edit hive-env.sh. The file does not exist by default; copy it from the template:
$ cp hive-env.sh.template hive-env.sh
Open it:
$ nano hive-env.sh
At the end of the opened file, add the following lines:
export JAVA_HOME=/usr/local/jdk1.8.0_181
export HADOOP_HOME=/home/hduser/bigdata/hadoop-3.2.2
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HIVE_HOME=/home/hduser/bigdata/hive-3.1.2
export HIVE_CONF_DIR=$HIVE_HOME/conf
export HIVE_AUX_JARS_PATH=$HIVE_HOME/lib
2. Edit hive-log4j2.properties to configure Hive's log directory.
The file does not exist by default; copy it from the template:
$ cp hive-log4j2.properties.template hive-log4j2.properties
$ nano hive-log4j2.properties
Change the following parameter (the value to the right of the equals sign):
property.hive.log.dir=/home/hduser/bigdata/hive-3.1.2/logs
3. Create and configure hive-site.xml. The file does not exist by default, so create it first: $ nano hive-site.xml
Copy the following content into the opened hive-site.xml, then save and exit:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<!-- Set the following properties -->
<property>
<name>hive.exec.scratchdir</name>
<value>/tmp/hive</value>
</property>
<property>
<name>hive.querylog.location</name>
<value>/home/hduser/bigdata/hive-3.1.2/tmp/hduser</value>
</property>
<property>
<name>hive.aux.jars.path</name>
<value>/home/hduser/bigdata/hive-3.1.2/lib,/home/hduser/bigdata/hive-3.1.2/jdbc</value>
</property>
<property>
<name>hive.metastore.warehouse.dir</name>
<value>/user/hive/warehouse</value>
</property>
<property>
<name>hive.metastore.uris</name>
<value>thrift://master:9083</value>
</property>
<!-- Hive Metastore: MySQL connection settings -->
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://cda:3306/hive?createDatabaseIfNotExist=true&amp;characterEncoding=UTF-8</value>
<description>For MariaDB: jdbc:mariadb://cda:3306/hive</description>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
<description>For MariaDB: org.mariadb.jdbc.Driver</description>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>root</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>admin</value>
</property>
<!-- HiveServer2 host (preferably an IP address, for connecting from Windows) -->
<property>
<name>hive.server2.thrift.bind.host</name>
<value>master</value>
<description>Bind host on which to run the HiveServer2 Thrift service.</description>
</property>
<property>
<name>hive.server2.thrift.port</name>
<value>10000</value>
<description>TCP port to listen on; defaults to 10000</description>
</property>
<!-- Username and password for remote beeline clients. The username must also be configured in Hadoop's core-site.xml -->
<property>
<name>hive.server2.thrift.client.user</name>
<value>hduser</value>
</property>
<property>
<name>hive.server2.thrift.client.password</name>
<value>hduser</value>
</property>
<!-- Disable metadata validation -->
<property>
<name>datanucleus.metadata.validate</name>
<value>false</value>
</property>
<!-- Disable metastore schema verification -->
<property>
<name>hive.metastore.schema.verification</name>
<value>false</value>
</property>
<property>
<name>datanucleus.schema.autoCreateAll</name>
<value>true</value>
</property>
</configuration>
Adjust the values below to match your own environment:
:::warning
<configuration>
<!-- Set the following properties -->
<property>
<name>hive.exec.scratchdir</name>
<value>/tmp/hive</value>
</property>
<property>
<name>hive.querylog.location</name>
<value>/home/hadoopuser/bigdata/hive-3.1.3/tmp/hadoopuser</value>
</property>
<property>
<name>hive.aux.jars.path</name>
<value>/home/hadoopuser/bigdata/hive-3.1.3/lib,/home/hadoopuser/bigdata/hive-3.1.3/jdbc</value>
</property>
<property>
<name>hive.metastore.warehouse.dir</name>
<value>/user/hive/warehouse</value>
</property>
<property>
<name>hive.metastore.uris</name>
<value>thrift://master1:9083</value>
</property>
<!-- Hive Metastore: MySQL connection settings -->
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mariadb://master1:3306/hive?createDatabaseIfNotExist=true&amp;characterEncoding=UTF-8</value>
<description>For MariaDB: jdbc:mariadb://cda:3306/hive</description>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>org.mariadb.jdbc.Driver</value>
<description>For MariaDB: org.mariadb.jdbc.Driver</description>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>root</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>111123</value>
</property>
<!-- HiveServer2 host (preferably an IP address, for connecting from Windows) -->
<property>
<name>hive.server2.thrift.bind.host</name>
<value>master1</value>
<description>Bind host on which to run the HiveServer2 Thrift service.</description>
</property>
<property>
<name>hive.server2.thrift.port</name>
<value>10000</value>
<description>TCP port to listen on; defaults to 10000</description>
</property>
<!-- Username and password for remote beeline clients. The username must also be configured in Hadoop's core-site.xml -->
<property>
<name>hive.server2.thrift.client.user</name>
<value>hadoopuser</value>
</property>
<property>
<name>hive.server2.thrift.client.password</name>
<value>hadoopuser</value>
</property>
<!-- Disable metadata validation -->
<property>
<name>datanucleus.metadata.validate</name>
<value>false</value>
</property>
<!-- Disable metastore schema verification -->
<property>
<name>hive.metastore.schema.verification</name>
<value>false</value>
</property>
<property>
<name>datanucleus.schema.autoCreateAll</name>
<value>true</value>
</property>
</configuration>
4. Copy the JDBC driver. By default Hive keeps its metadata in the embedded Derby database, but production environments normally store Hive metadata in MySQL.
1) Put mysql-connector-java-5.1.40.jar (the MySQL JDBC driver) into $HIVE_HOME/lib.
Note: if using a MariaDB database, configure the driver as org.mariadb.jdbc.Driver and copy the MariaDB JDBC driver into Hive's lib directory, e.g. mariadb-java-client-2.2.6.jar.
3. Creating HDFS Directories for Hive¶
- Hadoop must already be installed, with the HADOOP_HOME environment variable configured.
- Before creating tables in Hive, create the HDFS directories /tmp/hive and /user/hive/warehouse and grant write permission on them. Set them to chmod 777 in HDFS; only then can tables be created in Hive:
$ hdfs dfs -mkdir -p /tmp/hive
$ hdfs dfs -mkdir -p /user/hive/warehouse
// grant write permission
$ hdfs dfs -chmod -R 777 /tmp
$ hdfs dfs -chmod -R 777 /user/hive/warehouse
4. Initialization¶
// run the initialization (this creates the hive database in MySQL)
$ schematool -dbType mysql -initSchema
Note: if using MySQL, start the MySQL server first:
$ cd /usr/local/mysql
$ sudo ./support-files/mysql.server start
After it succeeds, check in MySQL that the hive metastore database has been created.
Possible error: java.lang.NoSuchMethodError: com.google.common.base.Preconditions.checkArgument(ZLjava/lang/String;Ljava/lang/Object;). The cause is that two different versions of the same jar are present and the system cannot decide which one to use.
Fix: delete the older guava jar from Hive's lib directory and copy Hadoop's newer guava jar into it:
$ rm ~/bigdata/hive-3.1.2/lib/guava-19.0.jar
$ cp ~/bigdata/hadoop-3.2.2/share/hadoop/common/lib/guava-27.0-jre.jar ~/bigdata/hive-3.1.2/lib/
5. Running¶
1. Start the HDFS and YARN clusters:
$ start-dfs.sh
$ start-yarn.sh
(or use start-all.sh; shut down with stop-all.sh)
2. Start the Hive client. In a new session window, start the background metastore service:
$ hive --service metastore
In another window, run the foreground client:
$ hive
3. Create a table. List the current databases: show databases;
Create a new database:
create database cda;
Switch to the database:
use cda;
Create a table:
CREATE TABLE sales( id int, product string, age int) row format delimited fields terminated by '\t';
List the tables:
show tables;
Insert a few rows (the third-column values here are illustrative):
insert into sales values (1,'张三',20),(2,'李四',21),(3,'王老五',22);
4. Exit the Hive CLI: exit;
6. Remote connections:
- Edit the following configuration file in the $HADOOP_HOME/etc/hadoop/ directory:
In core-site.xml, add the following to allow the remote login account (here hadoopuser) to act as a proxy user:
<property> <name>hadoop.proxyuser.hadoopuser.hosts</name> <value>*</value> </property> <property> <name>hadoop.proxyuser.hadoopuser.groups</name> <value>*</value> </property>
- Configure $HIVE_HOME/conf/hive-site.xml:
<!-- Username and password for remote beeline clients. The username must also be configured in Hadoop's core-site.xml -->
<property> <name>hive.server2.thrift.client.user</name> <value>hadoopuser</value> </property> <property> <name>hive.server2.thrift.client.password</name> <value>hadoopuser</value> </property>
On Windows, install the JDK and SQL Developer.
Day 3¶
Use a price-monitoring dataset to practice HiveQL.
[hduser@cda hive_data]$ head -5 pricewatch.csv
RECORDID,JCLB,JCMC,BQ,SQ,TB,HB
537B9A6E0C836F36E0505BCB8D2E3A3B,,椰菜,2.27,2.261,-0.067,0.004
537B9A6E0C846F36E0505BCB8D2E3A3B,,东北米,2.863,2.843,-0.067,0.007
537B9A6E0C856F36E0505BCB8D2E3A3B,,早籼米,3.08,3.044,0.219,0.012
537B9A6E0C866F36E0505BCB8D2E3A3B,,晚籼米,3.217,3.22,0.081,-0.001
[hduser@cda hive_data]$ tail -5 pricewatch.csv
7A9CCF598585A4A6E0505BCB8D2E67FF,蔬菜,菜心,6.926,5.679,0.271,0.219
7A9CCF598586A4A6E0505BCB8D2E67FF,蔬菜,生菜,5.621,3.973,0.168,0.415
7A9CCF598587A4A6E0505BCB8D2E67FF,蔬菜,大白菜,3.233,2.673,0.27,0.209
7A9CCF598588A4A6E0505BCB8D2E67FF,蔬菜,菠菜,8.547,7.966,-0.025,0.073
7A9CCF598589A4A6E0505BCB8D2E67FF,蔬菜,椰菜,3.168,2.83,0.082,0.1
- Create the Hive table:
create table pricewatch( recordid string, jclb string, jcmc string, bq decimal(10, 3), sq decimal(10, 3), tb decimal(10, 3), hb decimal(10, 3) ) row format delimited fields terminated by ',' tblproperties('skip.header.line.count'='1');
- Load the data file into the table:
load data local inpath '/home/hduser/data/hive_data/pricewatch.csv' into table pricewatch;
- Test:
select * from pricewatch limit 5;
- Simple queries
- limit: restrict the number of rows returned.
select * from pricewatch limit 10;
select * from pricewatch limit 10,5;
- Conditional queries
- Single condition: select * from pricewatch where bq>5.00;
- Combined conditions:
select * from pricewatch where bq>5.00 and bq<8.00;
select * from pricewatch where bq<=5.00 or bq>=8.00;
- Sorting (ascending by default):
select * from pricewatch where bq>5.00 and bq<8.00 order by bq;
select * from pricewatch where bq>5.00 and bq<8.00 order by bq desc;
- Null comparisons: is null and is not null
select * from pricewatch where jclb is null;
select * from pricewatch where jclb is not null;
select * from pricewatch where jclb="";
- Query the monitoring categories (jclb), with deduplication via distinct:
select jclb from pricewatch;
select distinct jclb from pricewatch;
- Range conditions
- in and not in
- Find items whose name (jcmc) is 生菜, 芹菜 or 菠菜:
select * from pricewatch where jcmc="生菜" or jcmc="芹菜" or jcmc="菠菜";
select * from pricewatch where jcmc in ("生菜","芹菜","菠菜");
- Find items whose name (jcmc) is not 生菜, 芹菜 or 菠菜:
select * from pricewatch where jcmc<>"生菜" and jcmc<>"芹菜" and jcmc<>"菠菜";
select * from pricewatch where jcmc not in ("生菜","芹菜","菠菜");
- between ... and ... and not between ... and ...
- Find items whose current price (bq) is between 2.00 and 5.00 (are 2.00 and 5.00 themselves included? yes, between is inclusive at both ends):
select * from pricewatch where bq between 2.00 and 5.00;
- Find items whose current price is below 2.00 or above 5.00:
select * from pricewatch where bq not between 2.00 and 5.00;
- Operations between column values
- Find items whose price rose compared with the previous period:
select * from pricewatch where bq>sq;
select * from pricewatch where hb>0;
- Fuzzy matching: like, not like, rlike
- (1) like and not like: two wildcards, _ and %. _ matches any single character; % matches any number of any characters.
- (2) rlike: supports standard Java regular expression syntax.
- $: matches the end
- ^: matches the beginning
- .: matches any single character
- Quantifiers: * (zero or more of the preceding element), {m} (exactly m occurrences of the preceding element)
- Find items whose name ends with "米":
select * from pricewatch where jcmc like '%米';
select * from pricewatch where jcmc rlike '米$';
- Find items whose name starts with "泰":
select * from pricewatch where jcmc like '泰%';
select * from pricewatch where jcmc rlike '^泰';
- Find items whose name starts with "花" and ends with "油":
select * from pricewatch where jcmc like '花%油';
select * from pricewatch where jcmc rlike '^花.*油$';
- Find items whose name ends with "米" or "油":
select * from pricewatch where jcmc rlike '[米油]$';
- Find items whose name has the form "xx菜":
select * from pricewatch where jcmc like '__菜';
select * from pricewatch where jcmc rlike '^.{2}菜$';
- Find rice priced above 8 yuan:
select * from pricewatch where bq>8.00 and jcmc like '%米';
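The rlike patterns above use standard (Java) regular expressions, and rlike matches anywhere in the string unless anchored with ^/$. As a rough sanity check outside the cluster, the same matches can be sketched in plain Python with re.search (the sample names below are invented for illustration; this is not Hive itself):

```python
import re

# Hypothetical jcmc (item name) values, similar to the dataset above
names = ["东北米", "泰国米", "花生油", "大白菜", "菠菜", "芹菜"]

# jcmc rlike '米$'  -> names ending in 米
ends_mi = [n for n in names if re.search(r"米$", n)]

# jcmc rlike '^泰'  -> names starting with 泰
starts_tai = [n for n in names if re.search(r"^泰", n)]

# jcmc rlike '^花.*油$' -> starts with 花, ends with 油
hua_you = [n for n in names if re.search(r"^花.*油$", n)]

# jcmc rlike '^.{2}菜$' -> exactly two characters followed by 菜
xx_cai = [n for n in names if re.search(r"^.{2}菜$", n)]

print(ends_mi, starts_tai, hua_you, xx_cai)
```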
- Saving query results
- There are three options:
- 1) save to a local file;
- 2) save to an HDFS file;
- 3) save into another Hive table.
- 1) Save to a local file: insert overwrite local directory ... select ...
Save the query result to a local CSV file (the target directory must not already exist):
insert overwrite local directory '/home/hduser/data/result1' row format delimited fields terminated by ',' select jcmc, jclb, bq, sq from pricewatch;
- 2) Save to an HDFS file: insert overwrite directory ... select ...
insert overwrite directory '/data/result1' row format delimited fields terminated by ',' select jcmc, jclb, bq, sq from pricewatch;
- 3) Save into another Hive table. Two cases:
- When the result table does not exist yet: CTAS syntax (create table xxx as select ...)
create table procewatch_result as select jcmc, jclb, bq, sq from pricewatch where bq<5.00;
select * from procewatch_result;
- When the result table already exists: insert into ... select ...
insert into procewatch_result select jcmc, jclb, bq, sq from pricewatch where bq>15.00;
select * from procewatch_result;
- Hive built-in functions
- List the functions Hive supports:
show functions;
- View help and usage for a specific function (not available in SQL Developer):
desc function split;
desc function extended split;
desc function to_date;
- String functions
- length:
select length('cda_hive') as str_length;
select jcmc, length(jcmc) as jcmc_length from pricewatch;
- concat(c1, c2, c3, ...):
select concat('we ','love ','hive') str_concat;
- concat_ws(sep, c1, c2, c3, ...):
select concat_ws('-', 'we','love','hive') str_concat;
select concat_ws('-',jclb,jcmc) from pricewatch where jclb <> "" limit 5;
- substr(str, start), substr(str, start, len) (substring is a synonym):
select substr('abcde',3);
select substr('abcde',3,2);
select substr('abcde',-1);
- trim, ltrim, rtrim:
select trim(' abc ');
select ltrim(' abc ');
select rtrim(' abc ');
- split: split a string (the separator is treated as a regular expression):
select split('abtcdtef','t');
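The semantics of concat_ws, split and substr can be mirrored in plain Python for reference. Note that Hive's substr is 1-based and accepts negative start positions; the helper below (hive_substr is a hypothetical name, not a Hive API) is a minimal sketch of that behavior:

```python
import re

# concat_ws('-', 'we', 'love', 'hive')
joined = "-".join(["we", "love", "hive"])

# split('abtcdtef', 't') -- Hive's split takes a regex separator
parts = re.split("t", "abtcdtef")

# A small helper mimicking Hive's 1-based substr; negative start
# counts from the end of the string, as in Hive.
def hive_substr(s, start, length=None):
    i = start - 1 if start > 0 else len(s) + start
    return s[i:i + length] if length else s[i:]

print(joined, parts,
      hive_substr("abcde", 3),       # substr('abcde', 3)
      hive_substr("abcde", 3, 2),    # substr('abcde', 3, 2)
      hive_substr("abcde", -1))      # substr('abcde', -1)
```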
- regexp_replace: regular-expression replacement
- regexp_replace(string A, pattern B, string C):
select regexp_replace('我喜欢和平,反对暴力和枪支','暴力|枪支','**');
select regexp_replace('135a-123b4-c5678','\\D','');
select regexp_replace('135a-123b4-c5678','[^0-9]','');
- regexp_extract: regular-expression extraction; returns the capture group with the given index (0 = the whole match)
- regexp_extract(string subject, string pattern, int index):
select regexp_extract('foothebar','foo(.*?)(bar)',1); -- the
select regexp_extract('foothebar','foo(.*?)(bar)',2); -- bar
select regexp_extract('foothebar','foo(.*?)(bar)',0); -- foothebar
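The same replacement and group-extraction semantics can be checked with Python's re module (an illustration of the regex behavior, not Hive's implementation; Python and Java regex syntax agree for these patterns):

```python
import re

# regexp_replace: replace every match of the pattern
masked = re.sub("暴力|枪支", "**", "我喜欢和平,反对暴力和枪支")
digits = re.sub(r"\D", "", "135a-123b4-c5678")  # equivalent to [^0-9]

# regexp_extract: return the numbered capture group (0 = whole match)
m = re.search(r"foo(.*?)(bar)", "foothebar")
g1, g2, g0 = m.group(1), m.group(2), m.group(0)

print(masked, digits, g1, g2, g0)
```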
- parse_url: URL parsing function. For example, https://www.baidu.com/s?tn=baidu&wd=cda decomposes into protocol https, host www.baidu.com, path /s and query tn=baidu&wd=cda:
select parse_url('https://cwiki.apache.org/confluence/display/Hive','PROTOCOL');
select parse_url('https://cwiki.apache.org/confluence/display/Hive?k1=v1&k2=v2','PROTOCOL');
select parse_url('https://cwiki.apache.org/confluence/display/Hive?k1=v1&k2=v2','HOST');
select parse_url('https://cwiki.apache.org/confluence/display/Hive?k1=v1&k2=v2','PATH');
select parse_url('https://cwiki.apache.org/confluence/display/Hive?k1=v1&k2=v2','QUERY');
select parse_url("https://cwiki.apache.org/confluence/display/Hive?k1=v1&k2=v2",'QUERY','k1');
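The pieces that parse_url extracts correspond to the components of a parsed URL; Python's urllib.parse pulls out the same parts, which makes the mapping easy to see (a reference sketch, not Hive code):

```python
from urllib.parse import urlparse, parse_qs

url = "https://cwiki.apache.org/confluence/display/Hive?k1=v1&k2=v2"
u = urlparse(url)

protocol = u.scheme                # parse_url(..., 'PROTOCOL')
host = u.netloc                    # parse_url(..., 'HOST')
path = u.path                      # parse_url(..., 'PATH')
query = u.query                    # parse_url(..., 'QUERY')
k1 = parse_qs(u.query)["k1"][0]    # parse_url(..., 'QUERY', 'k1')

print(protocol, host, path, query, k1)
```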
- get_json_object: JSON parsing function
- JSON is a lightweight format for exchanging data over the network. It has only two container types: JSON objects and JSON arrays.
- JSON object: '{"key1":"val1","key2":"val2",...}'
- JSON array: '[{"k1":"v1","k2":"v2",...},{},...]'
- All keys in JSON data must be strings. Example document:
'{
  "store":{
    "fruit": [{"weight":8,"type":"apple"},{"weight":9,"type":"pear"}],
    "bicycle":{"price":19.95,"color":"red"}
  },
  "email": "aaa@163.com",
  "owner": "aaa"
}'
- Extract a single value:
select get_json_object('{ "store":{ "fruit": [{"weight":8,"type":"apple"},{"weight":9,"type":"pear"}], "bicycle":{"price":19.95,"color":"red"} }, "email": "aaa@163.com", "owner": "aaa" }', '$.email');
- Extract a nested value:
select get_json_object('{ "store":{ "fruit": [{"weight":8,"type":"apple"},{"weight":9,"type":"pear"}], "bicycle":{"price":19.95,"color":"red"} }, "email": "aaa@163.com", "owner": "aaa" }', '$.store.bicycle.price');
- Extract an array element:
select get_json_object('{ "store":{ "fruit": [{"weight":8,"type":"apple"},{"weight":9,"type":"pear"}], "bicycle":{"price":19.95,"color":"red"} }, "email": "aaa@163.com", "owner": "aaa" }', '$.store.fruit[0].type');
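The JSONPath expressions above ($.email, $.store.bicycle.price, $.store.fruit[0].type) navigate the document one key or index at a time; the same lookups in plain Python make the path structure explicit (illustration only, not Hive's parser):

```python
import json

doc = json.loads('''{
  "store": {
    "fruit": [{"weight": 8, "type": "apple"}, {"weight": 9, "type": "pear"}],
    "bicycle": {"price": 19.95, "color": "red"}
  },
  "email": "aaa@163.com",
  "owner": "aaa"
}''')

email = doc["email"]                       # $.email
price = doc["store"]["bicycle"]["price"]   # $.store.bicycle.price
fruit0 = doc["store"]["fruit"][0]["type"]  # $.store.fruit[0].type

print(email, price, fruit0)
```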
- Math functions
- round: round half away from zero
select round(46.5);
select round(46.499);
select round(12.3456,2);
- ceil: round up
select ceil(87.2);
select ceil(-87.2);
- floor: round down
select floor(2.89);
select floor(-2.09);
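One pitfall when checking these examples in Python: the built-in round() rounds ties to even, while Hive's round() rounds ties away from zero, so round(46.5) differs between the two. The Decimal sketch below (hive_round is a hypothetical helper name) makes the half-up behavior explicit; ceil and floor behave the same in both:

```python
import math
from decimal import Decimal, ROUND_HALF_UP

# Mimic Hive's round(): ties round away from zero, unlike Python's
# built-in round(), which rounds ties to the nearest even number.
def hive_round(x, d=0):
    q = Decimal(1).scaleb(-d)  # quantum: 1 for d=0, 0.01 for d=2, ...
    return float(Decimal(str(x)).quantize(q, rounding=ROUND_HALF_UP))

print(hive_round(46.5))                      # matches Hive's round(46.5)
print(hive_round(12.3456, 2))
print(math.ceil(87.2), math.ceil(-87.2))     # ceil
print(math.floor(2.89), math.floor(-2.09))   # floor
```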
- cast: type conversion function
- Convert a numeric string to a numeric type (or a date string to a date):
select cast('3.56' as double);
select cast('2022-09-25' as date);
- Date functions
- to_date(): extract the date part
select to_date('2022-09-25 16:32:45') as sub_date;
- year(), month(), day(), hour()
select year('2022-09-25 16:32:45'),month('2022-09-25 16:32:45');
- current_date()
select current_date();
- date_format(): date formatting function
select date_format('2022-09-25 16:32:45', 'MM-dd');
select date_format('2022-09-25 16:32:45', 'MM/dd/yyyy');
select date_format('2022-09-25 16:32:45', 'yyyy年MM月dd日 HH点mm分ss秒');
- unix_timestamp(): convert a date to a unix timestamp (10 digits = seconds, 13 digits = milliseconds); string -> long integer
select unix_timestamp('2022-09-25 16:32:45');
select unix_timestamp('12/09/2022','dd/MM/yyyy');
- from_unixtime(): long integer -> string
select from_unixtime(1664094765);
select from_unixtime(1664094765,'yyyy/MM/dd');
- Converting non-standard dates: '20220618' -> long integer -> formatted string
select from_unixtime(unix_timestamp('20220618','yyyyMMdd'), 'yyyy-MM-dd');
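The round trip string -> epoch seconds -> formatted string, including the non-standard '20220618' conversion, can be sketched with Python's datetime (Hive interprets timestamps in the session time zone; UTC is pinned here so the sketch is deterministic, and the format letters differ: Java's yyyy-MM-dd corresponds to Python's %Y-%m-%d):

```python
from datetime import datetime, timezone

# unix_timestamp('2022-09-25 16:32:45'): string -> epoch seconds
ts = int(datetime.strptime("2022-09-25 16:32:45", "%Y-%m-%d %H:%M:%S")
         .replace(tzinfo=timezone.utc).timestamp())

# from_unixtime(ts, 'yyyy/MM/dd'): epoch seconds -> formatted string
formatted = datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y/%m/%d")

# Non-standard date: '20220618' -> parsed date -> 'yyyy-MM-dd'
d = datetime.strptime("20220618", "%Y%m%d").strftime("%Y-%m-%d")

print(ts, formatted, d)
```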