通过 Sqoop1.4.7 将 Mysql5.7、Hive2.3.4、Hbase1.4.9 之间的数据导入导出 – Lion

目录

1、什么是 Sqoop?

  Sqoop 是一种用于在 Hadoop 和关系数据库或大型机之间传输数据的工具。
  您可以使用 Sqoop 将数据从关系数据库管理系统RDBMS(如 MySQLOracle)导入 Hadoop 分布式文件系统 HDFS,转换 Hadoop MapReduce 中的数据,然后将数据导出回 RDBMS
  Sqoop 自动执行此过程的大部分过程,依靠数据库来描述要导入的数据的模式。Sqoop 使用 MapReduce 导入和导出数据,提供并行操作和容错。
  

2、下载应用程序及配置环境变量

2.1、下载 Sqoop 1.4.7

  通过以下命令下载 Sqoop,解压后,放到/home/work/_app/ 目录中:

[[email protected] _src]# pwd
/home/work/_src
[[email protected] _src]# wget http://mirrors.shu.edu.cn/apache/sqoop/1.4.7/sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz
[[email protected] _src]# tar -xzvf sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz
[[email protected] _src]# mv sqoop-1.4.7.bin__hadoop-2.6.0 /home/work/_app/

  

2.2、设置环境变量

  在每一台机器上设置 Sqoop 环境变量,运行以下命令

echo "" >> /etc/bashrc
echo "# Sqoop 1.4.7" >> /etc/bashrc
echo "export SQOOP_HOME=/home/work/_app/sqoop-1.4.7.bin__hadoop-2.6.0" >> /etc/bashrc

echo "" >> /etc/bashrc
echo "# Path" >> /etc/bashrc
echo "export PATH=$PATH:$SQOOP_HOME/bin" >> /etc/bashrc
source /etc/bashrc

  

2.3、设置安装所需环境

  安装和运行 Sqoop 需要用到 HiveMySqlHadoop环境。可以参考文章:Hadoop 3.1.2(HA)+Zookeeper3.4.13+Hbase1.4.9(HA)+Hive2.3.4+Spark2.4.0(HA)高可用集群搭建
  

3、安装 Sqoop 1.4.7

3.1、修改 Sqoop 配置文件

3.1.1、修改配置文件 sqoop-env.sh

   创建 /home/work/_app/sqoop-1.4.7.bin__hadoop-2.6.0/conf/sqoop-env.sh 文件编辑并保存,内容为空,因为我们在本文的配置环境变量章节中已经配置了环境变量,同时也在<Hadoop 3.1.2(HA)+Zookeeper3.4.13+Hbase1.4.9(HA)+Hive2.3.4+Spark2.4.0(HA)高可用集群搭建>一文中配置了 HiveHadoop 环境变量:

[[email protected] ~]# echo "" > /home/work/_app/sqoop-1.4.7.bin__hadoop-2.6.0/conf/sqoop-env.sh

  
  将 /home/work/_app/hive-2.3.4/lib/ 目录下的 hive-hcatalog-core-2.3.4.jarmysql-connector-java-5.1.47-bin.jarhive-common-2.3.4.jarlibthrift-0.9.3.jar 文件,复制到/home/work/_app/sqoop-1.4.7.bin__hadoop-2.6.0/lib 目录下

[[email protected] ~]# cp /home/work/_app/hive-2.3.4/lib/hive-hcatalog-core-2.3.4.jar /home/work/_app/hive-2.3.4/lib/mysql-connector-java-5.1.47-bin.jar /home/work/_app/hive-2.3.4/lib/libthrift-0.9.3.jar /home/work/_app/hive-2.3.4/lib/hive-common-2.3.4.jar /home/work/_app/sqoop-1.4.7.bin__hadoop-2.6.0/lib/

  

3.1.2、修改配置文件 configure-sqoop

  编辑 /home/work/_app/sqoop-1.4.7.bin__hadoop-2.6.0/bin/configure-sqoop 文件并保存,内容如下:

[[email protected] _src]# cat /home/work/_app/sqoop-1.4.7.bin__hadoop-2.6.0/bin/configure-sqoop
#!/bin/bash
#
# Copyright 2011 The Apache Software Foundation
#
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements.  See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership.  The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License.  You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# This is sourced in by bin/sqoop to set environment variables prior to
# invoking Hadoop.

bin="$1"

if [ -z "${bin}" ]; then
  bin=`dirname $0`
  bin=`cd ${bin} && pwd`
fi

if [ -z "$SQOOP_HOME" ]; then
  export SQOOP_HOME=${bin}/..
fi

SQOOP_CONF_DIR=${SQOOP_CONF_DIR:-${SQOOP_HOME}/conf}

if [ -f "${SQOOP_CONF_DIR}/sqoop-env.sh" ]; then
  . "${SQOOP_CONF_DIR}/sqoop-env.sh"
fi

# Find paths to our dependency systems. If they are unset, use CDH defaults.

if [ -z "${HADOOP_COMMON_HOME}" ]; then
  if [ -n "${HADOOP_HOME}" ]; then
    HADOOP_COMMON_HOME=${HADOOP_HOME}
  else
    if [ -d "/usr/lib/hadoop" ]; then
      HADOOP_COMMON_HOME=/usr/lib/hadoop
    else
      HADOOP_COMMON_HOME=${SQOOP_HOME}/../hadoop
    fi
  fi
fi

if [ -z "${HADOOP_MAPRED_HOME}" ]; then
  HADOOP_MAPRED_HOME=/usr/lib/hadoop-mapreduce
  if [ ! -d "${HADOOP_MAPRED_HOME}" ]; then
    if [ -n "${HADOOP_HOME}" ]; then
      HADOOP_MAPRED_HOME=${HADOOP_HOME}
    else
      HADOOP_MAPRED_HOME=${SQOOP_HOME}/../hadoop-mapreduce
    fi
  fi
fi

# We are setting HADOOP_HOME to HADOOP_COMMON_HOME if it is not set
# so that hcat script works correctly on BigTop
if [ -z "${HADOOP_HOME}" ]; then
  if [ -n "${HADOOP_COMMON_HOME}" ]; then
     HADOOP_HOME=${HADOOP_COMMON_HOME}
     export HADOOP_HOME
  fi
fi

if [ -z "${HBASE_HOME}" ]; then
  if [ -d "/usr/lib/hbase" ]; then
    HBASE_HOME=/usr/lib/hbase
  else
    HBASE_HOME=${SQOOP_HOME}/../hbase
  fi
fi
#if [ -z "${HCAT_HOME}" ]; then
#  if [ -d "/usr/lib/hive-hcatalog" ]; then
#    HCAT_HOME=/usr/lib/hive-hcatalog
#  elif [ -d "/usr/lib/hcatalog" ]; then
#    HCAT_HOME=/usr/lib/hcatalog
#  else
#    HCAT_HOME=${SQOOP_HOME}/../hive-hcatalog
#    if [ ! -d ${HCAT_HOME} ]; then
#       HCAT_HOME=${SQOOP_HOME}/../hcatalog
#    fi
#  fi
#fi
#if [ -z "${ACCUMULO_HOME}" ]; then
#  if [ -d "/usr/lib/accumulo" ]; then
#    ACCUMULO_HOME=/usr/lib/accumulo
#  else
#    ACCUMULO_HOME=${SQOOP_HOME}/../accumulo
#  fi
#fi
if [ -z "${ZOOKEEPER_HOME}" ]; then
  if [ -d "/usr/lib/zookeeper" ]; then
    ZOOKEEPER_HOME=/usr/lib/zookeeper
  else
    ZOOKEEPER_HOME=${SQOOP_HOME}/../zookeeper
  fi
fi
if [ -z "${HIVE_HOME}" ]; then
  if [ -d "/usr/lib/hive" ]; then
    export HIVE_HOME=/usr/lib/hive
  elif [ -d ${SQOOP_HOME}/../hive ]; then
    export HIVE_HOME=${SQOOP_HOME}/../hive
  fi
fi

# Check: If we can't find our dependencies, give up here.
if [ ! -d "${HADOOP_COMMON_HOME}" ]; then
  echo "Error: $HADOOP_COMMON_HOME does not exist!"
  echo 'Please set $HADOOP_COMMON_HOME to the root of your Hadoop installation.'
  exit 1
fi
if [ ! -d "${HADOOP_MAPRED_HOME}" ]; then
  echo "Error: $HADOOP_MAPRED_HOME does not exist!"
  echo 'Please set $HADOOP_MAPRED_HOME to the root of your Hadoop MapReduce installation.'
  exit 1
fi

## Moved to be a runtime check in sqoop.
if [ ! -d "${HBASE_HOME}" ]; then
  echo "Warning: $HBASE_HOME does not exist! HBase imports will fail."
  echo 'Please set $HBASE_HOME to the root of your HBase installation.'
fi

## Moved to be a runtime check in sqoop.
#if [ ! -d "${HCAT_HOME}" ]; then
#  echo "Warning: $HCAT_HOME does not exist! HCatalog jobs will fail."
#  echo 'Please set $HCAT_HOME to the root of your HCatalog installation.'
#fi

#if [ ! -d "${ACCUMULO_HOME}" ]; then
#  echo "Warning: $ACCUMULO_HOME does not exist! Accumulo imports will fail."
#  echo 'Please set $ACCUMULO_HOME to the root of your Accumulo installation.'
#fi
if [ ! -d "${ZOOKEEPER_HOME}" ]; then
  echo "Warning: $ZOOKEEPER_HOME does not exist! Accumulo imports will fail."
  echo 'Please set $ZOOKEEPER_HOME to the root of your Zookeeper installation.'
fi

# Where to find the main Sqoop jar
SQOOP_JAR_DIR=$SQOOP_HOME

# If there's a "build" subdir, override with this, so we use
# the newly-compiled copy.
if [ -d "$SQOOP_JAR_DIR/build" ]; then
  SQOOP_JAR_DIR="${SQOOP_JAR_DIR}/build"
fi

function add_to_classpath() {
  dir=$1
  for f in $dir/*.jar; do
    SQOOP_CLASSPATH=${SQOOP_CLASSPATH}:$f;
  done

  export SQOOP_CLASSPATH
}

# Add sqoop dependencies to classpath.
SQOOP_CLASSPATH=""
if [ -d "$SQOOP_HOME/lib" ]; then
  add_to_classpath $SQOOP_HOME/lib
fi

# Add HBase to dependency list
if [ -e "$HBASE_HOME/bin/hbase" ]; then
  TMP_SQOOP_CLASSPATH=${SQOOP_CLASSPATH}:`$HBASE_HOME/bin/hbase classpath`
  SQOOP_CLASSPATH=${TMP_SQOOP_CLASSPATH}
fi

# Add HCatalog to dependency list
if [ -e "${HCAT_HOME}/bin/hcat" ]; then
  TMP_SQOOP_CLASSPATH=${SQOOP_CLASSPATH}:`${HCAT_HOME}/bin/hcat -classpath`
  if [ -z "${HIVE_CONF_DIR}" ]; then
    TMP_SQOOP_CLASSPATH=${TMP_SQOOP_CLASSPATH}:${HIVE_CONF_DIR}
  fi
  SQOOP_CLASSPATH=${TMP_SQOOP_CLASSPATH}
fi

# Add Accumulo to dependency list
if [ -e "$ACCUMULO_HOME/bin/accumulo" ]; then
  for jn in `$ACCUMULO_HOME/bin/accumulo classpath | grep file:.*accumulo.*jar | cut -d':' -f2`; do
    SQOOP_CLASSPATH=$SQOOP_CLASSPATH:$jn
  done
  for jn in `$ACCUMULO_HOME/bin/accumulo classpath | grep file:.*zookeeper.*jar | cut -d':' -f2`; do
    SQOOP_CLASSPATH=$SQOOP_CLASSPATH:$jn
  done
fi

ZOOCFGDIR=${ZOOCFGDIR:-/etc/zookeeper}
if [ -d "${ZOOCFGDIR}" ]; then
  SQOOP_CLASSPATH=$ZOOCFGDIR:$SQOOP_CLASSPATH
fi

SQOOP_CLASSPATH=${SQOOP_CONF_DIR}:${SQOOP_CLASSPATH}

# If there's a build subdir, use Ivy-retrieved dependencies too.
if [ -d "$SQOOP_HOME/build/ivy/lib/sqoop" ]; then
  for f in $SQOOP_HOME/build/ivy/lib/sqoop/*/*.jar; do
    SQOOP_CLASSPATH=${SQOOP_CLASSPATH}:$f;
  done
fi

add_to_classpath ${SQOOP_JAR_DIR}

HADOOP_CLASSPATH="${SQOOP_CLASSPATH}:${HADOOP_CLASSPATH}"
if [ ! -z "$SQOOP_USER_CLASSPATH" ]; then
  # User has elements to prepend to the classpath, forcibly overriding
  # Sqoop's own lib directories.
  export HADOOP_CLASSPATH="${SQOOP_USER_CLASSPATH}:${HADOOP_CLASSPATH}"
fi

export SQOOP_CLASSPATH
export SQOOP_CONF_DIR
export SQOOP_JAR_DIR
export HADOOP_CLASSPATH
export HADOOP_COMMON_HOME
export HADOOP_MAPRED_HOME
export HBASE_HOME
export HCAT_HOME
export HIVE_CONF_DIR
export ACCUMULO_HOME
export ZOOKEEPER_HOME

  

3.2、查看 Sqoop 版本

[[email protected] _src]# sqoop version
2019-03-11 22:30:16,837 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7
Sqoop 1.4.7
git commit id 2328971411f57f0cb683dfb79d19d4d19d185dd8
Compiled by maugli on Thu Dec 21 15:59:58 STD 2017

  

4、启动和测试 Sqoop 的数据导入、导出

4.1、Sqoop 通过 Hive 导入数据到 Sqoop

  在 mysql 中创建数据库 testmshk 并授权给 root 用户,同时创建 hive2mysql_mshk

[[email protected] _src]# mysql -uroot -p123456
mysql: [Warning] Using a password on the command line interface can be insecure.
Welcome to the MySQL monitor.  Commands end with ; or g.
Your MySQL connection id is 21
Server version: 5.7.25 MySQL Community Server (GPL)

Copyright (c) 2000, 2019, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or 'h' for help. Type 'c' to clear the current input statement.

mysql> CREATE DATABASE testmshk DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci;
Query OK, 1 row affected (0.00 sec)

mysql> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| hive               |
| mysql              |
| performance_schema |
| sys                |
| testmshk           |
+--------------------+
6 rows in set (0.02 sec)

mysql> grant select,insert,update,delete,create on testmshk.* to root;
Query OK, 0 rows affected (0.01 sec)

mysql> flush  privileges;
Query OK, 0 rows affected (0.01 sec)


mysql> use testmshk;
Database changed

mysql> create table hive2mysql_mshk(id int,namea varchar(50),nameb varchar(50));
Query OK, 0 rows affected (0.02 sec)

mysql> quit;
Bye

  
  通过 Sqoop 查询 Mysql 中表的内容,这时可以看到表中的内容是空的

[[email protected] ~]# sqoop eval --connect  jdbc:mysql://c0:3306/testmshk?useSSL=false  --username root --password 123456 --query "select * from hive2mysql_mshk"

2019-03-11 23:44:06,894 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7
2019-03-11 23:44:06,945 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
2019-03-11 23:44:07,100 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
-------------------------------------------------------------
| id          | namea                | nameb                |
-------------------------------------------------------------
-------------------------------------------------------------

  
  在<Hadoop 3.1.2(HA)+Zookeeper3.4.13+Hbase1.4.9(HA)+Hive2.3.4+Spark2.4.0(HA)高可用集群搭建>一文中,我们在测试 Hive 时创建了测试数据 /hive/warehouse/testtable/testdata001.dat 我们将这个数据,导入到 Mysql

[[email protected] ~]# sqoop export --connect  jdbc:mysql://c0:3306/testmshk?useSSL=false  --username root --password 123456 --table hive2mysql_mshk  --export-dir /hive/warehouse/testtable/testdata001.dat  --input-fields-terminated-by ','
2019-03-11 23:47:10,400 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7
2019-03-11 23:47:10,437 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
2019-03-11 23:47:10,571 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
2019-03-11 23:47:10,574 INFO tool.CodeGenTool: Beginning code generation
2019-03-11 23:47:10,914 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `hive2mysql_mshk` AS t LIMIT 1
2019-03-11 23:47:10,943 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `hive2mysql_mshk` AS t LIMIT 1
2019-03-11 23:47:10,952 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /home/work/_app/hadoop-3.1.2
Note: /tmp/sqoop-root/compile/9ea7f54fe87f35ed071ed75c293f25d8/hive2mysql_mshk.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
2019-03-11 23:47:12,652 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-root/compile/9ea7f54fe87f35ed071ed75c293f25d8/hive2mysql_mshk.jar
2019-03-11 23:47:12,669 INFO mapreduce.ExportJobBase: Beginning export of hive2mysql_mshk
2019-03-11 23:47:12,669 INFO Configuration.deprecation: mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2019-03-11 23:47:12,804 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
2019-03-11 23:47:14,106 INFO Configuration.deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative
2019-03-11 23:47:14,112 INFO Configuration.deprecation: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative
2019-03-11 23:47:14,112 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
2019-03-11 23:47:14,479 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm2
2019-03-11 23:47:14,808 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/root/.staging/job_1552315366846_0003
2019-03-11 23:47:16,429 INFO input.FileInputFormat: Total input files to process : 1
2019-03-11 23:47:16,432 INFO input.FileInputFormat: Total input files to process : 1
2019-03-11 23:47:16,513 INFO mapreduce.JobSubmitter: number of splits:4
2019-03-11 23:47:16,577 INFO Configuration.deprecation: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative
2019-03-11 23:47:16,684 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1552315366846_0003
2019-03-11 23:47:16,686 INFO mapreduce.JobSubmitter: Executing with tokens: []
2019-03-11 23:47:16,924 INFO conf.Configuration: resource-types.xml not found
2019-03-11 23:47:16,924 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
2019-03-11 23:47:17,213 INFO impl.YarnClientImpl: Submitted application application_1552315366846_0003
2019-03-11 23:47:17,261 INFO mapreduce.Job: The url to track the job: http://c1:8088/proxy/application_1552315366846_0003/
2019-03-11 23:47:17,262 INFO mapreduce.Job: Running job: job_1552315366846_0003
2019-03-11 23:47:23,359 INFO mapreduce.Job: Job job_1552315366846_0003 running in uber mode : false
2019-03-11 23:47:23,360 INFO mapreduce.Job:  map 0% reduce 0%
2019-03-11 23:47:31,454 INFO mapreduce.Job:  map 75% reduce 0%
2019-03-11 23:47:32,462 INFO mapreduce.Job:  map 100% reduce 0%
2019-03-11 23:47:32,473 INFO mapreduce.Job: Job job_1552315366846_0003 completed successfully
2019-03-11 23:47:32,619 INFO mapreduce.Job: Counters: 32
    File System Counters
        FILE: Number of bytes read=0
        FILE: Number of bytes written=913424
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=782
        HDFS: Number of bytes written=0
        HDFS: Number of read operations=19
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=0
    Job Counters
        Launched map tasks=4
        Data-local map tasks=4
        Total time spent by all maps in occupied slots (ms)=23446
        Total time spent by all reduces in occupied slots (ms)=0
        Total time spent by all map tasks (ms)=23446
        Total vcore-milliseconds taken by all map tasks=23446
        Total megabyte-milliseconds taken by all map tasks=24008704
    Map-Reduce Framework
        Map input records=2
        Map output records=2
        Input split bytes=636
        Spilled Records=0
        Failed Shuffles=0
        Merged Map outputs=0
        GC time elapsed (ms)=582
        CPU time spent (ms)=3960
        Physical memory (bytes) snapshot=830259200
        Virtual memory (bytes) snapshot=11165683712
        Total committed heap usage (bytes)=454557696
        Peak Map Physical memory (bytes)=208502784
        Peak Map Virtual memory (bytes)=2793611264
    File Input Format Counters
        Bytes Read=0
    File Output Format Counters
        Bytes Written=0
2019-03-11 23:47:32,626 INFO mapreduce.ExportJobBase: Transferred 782 bytes in 18.5015 seconds (42.2668 bytes/sec)
2019-03-11 23:47:32,629 INFO mapreduce.ExportJobBase: Exported 2 records.

--export-dir 表示在 HDFS 对应的 Hive 数据库文件位置
–input-fields-terminated-by 表示要处理的间隔符

  
  再次通过 Sqoop 查看 MySql 中的内容,可以看到数据已经成功导入

[[email protected] ~]# sqoop eval --connect  jdbc:mysql://c0:3306/testmshk?useSSL=false  --username root --password 123456 --query "select * from hive2mysql_mshk"
2019-03-11 23:48:56,848 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7
2019-03-11 23:48:56,884 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
2019-03-11 23:48:57,024 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
-------------------------------------------------------------
| id          | namea                | nameb                |
-------------------------------------------------------------
| 10086       | my.mshk.top          | you.mshk.top         |
| 12306       | mname                | yname                |
-------------------------------------------------------------

  在 MySql 中能够看到我们创建的 hive2mysql_mshk 表有2行数据
  

  

4.2、Sqoop 通过 MySql 导入数据到 Hive

  刚刚我们创建的 hive2mysql_mshk 表没有任何主键,我们只是从 Hive 中添加了一些记录到 Mysql
  默认情况下,Sqoop 将识别表中的主键列(如果存在)并将其用作拆分列。
  从数据库中检索拆分列的低值和高值,并且映射任务在总范围的大小均匀的组件上运行。
  如果主键的实际值在其范围内不均匀分布,则可能导致任务不平衡。
  您应该使用 --split-by 参数明确选择不同的列。例如-- split-by id

  
  在将 MySql 的数据导入到 Hive中的 Sqoop 命令添加了更多参数:

sqoop import --connect jdbc:mysql://c0:3306/testmshk?useSSL=false --username root --password 123456 --split-by id  --table hive2mysql_mshk --target-dir /hive/warehouse/mysql2hive_mshk  --fields-terminated-by "," --hive-import --hive-table testmshk.mysql2hive_mshk

--split-by <column-name> 用哪个列来拆分
--table 告诉计算机您要从MySQL导入哪个表
--target-dir <dir> HDFS要存储的目录
--hive-import 将表导入Hive
--hive-overwrite 覆盖Hive表中的现有数据
--hive-table <table-name> 设置导入Hive时要使用的表名
--fields-terminated-by <char> 设置字段分隔符

  接下来 Sqoop 的操作是一个 map-reduce 工作。

[[email protected] _src]# sqoop import --connect jdbc:mysql://c0:3306/testmshk?useSSL=false --username root --password 123456 --split-by id  --table hive2mysql_mshk --target-dir /hive/warehouse/mysql2hive_mshk  --fields-terminated-by "," --hive-import --hive-table testmshk.mysql2hive_mshk --hive-overwrite
2019-03-12 20:21:05,060 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7
2019-03-12 20:21:05,137 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
2019-03-12 20:21:05,337 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
2019-03-12 20:21:05,348 INFO tool.CodeGenTool: Beginning code generation
2019-03-12 20:21:05,785 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `hive2mysql_mshk` AS t LIMIT 1
2019-03-12 20:21:05,821 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `hive2mysql_mshk` AS t LIMIT 1
2019-03-12 20:21:05,831 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /home/work/_app/hadoop-3.1.2
Note: /tmp/sqoop-root/compile/202a5bda3950a7ccc6782f3cfcc3a99d/hive2mysql_mshk.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
2019-03-12 20:21:08,747 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-root/compile/202a5bda3950a7ccc6782f3cfcc3a99d/hive2mysql_mshk.jar
2019-03-12 20:21:08,761 WARN manager.MySQLManager: It looks like you are importing from mysql.
2019-03-12 20:21:08,761 WARN manager.MySQLManager: This transfer can be faster! Use the --direct
2019-03-12 20:21:08,761 WARN manager.MySQLManager: option to exercise a MySQL-specific fast path.
2019-03-12 20:21:08,762 INFO manager.MySQLManager: Setting zero DATETIME behavior to convertToNull (mysql)
2019-03-12 20:21:08,764 INFO mapreduce.ImportJobBase: Beginning import of hive2mysql_mshk
2019-03-12 20:21:08,765 INFO Configuration.deprecation: mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2019-03-12 20:21:08,928 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
2019-03-12 20:21:09,656 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
2019-03-12 20:21:10,332 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm2
2019-03-12 20:21:10,688 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/root/.staging/job_1552315366846_0011
2019-03-12 20:21:12,618 INFO db.DBInputFormat: Using read commited transaction isolation
2019-03-12 20:21:12,619 INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(`id`), MAX(`id`) FROM `hive2mysql_mshk`
2019-03-12 20:21:12,622 INFO db.IntegerSplitter: Split size: 555; Num splits: 4 from: 10086 to: 12306
2019-03-12 20:21:12,696 INFO mapreduce.JobSubmitter: number of splits:4
2019-03-12 20:21:13,137 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1552315366846_0011
2019-03-12 20:21:13,140 INFO mapreduce.JobSubmitter: Executing with tokens: []
2019-03-12 20:21:13,443 INFO conf.Configuration: resource-types.xml not found
2019-03-12 20:21:13,443 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
2019-03-12 20:21:13,533 INFO impl.YarnClientImpl: Submitted application application_1552315366846_0011
2019-03-12 20:21:13,593 INFO mapreduce.Job: The url to track the job: http://c1:8088/proxy/application_1552315366846_0011/
2019-03-12 20:21:13,594 INFO mapreduce.Job: Running job: job_1552315366846_0011
2019-03-12 20:21:20,705 INFO mapreduce.Job: Job job_1552315366846_0011 running in uber mode : false
2019-03-12 20:21:20,727 INFO mapreduce.Job:  map 0% reduce 0%
2019-03-12 20:21:29,927 INFO mapreduce.Job:  map 50% reduce 0%
2019-03-12 20:21:29,930 INFO mapreduce.Job: Task Id : attempt_1552315366846_0011_m_000000_0, Status : FAILED
[2019-03-12 20:21:28.236]Container [pid=19941,containerID=container_e15_1552315366846_0011_01_000002] is running 539445760B beyond the 'VIRTUAL' memory limit. Current usage: 199.9 MB of 1 GB physical memory used; 2.6 GB of 2.1 GB virtual memory used. Killing container.
Dump of the process-tree for container_e15_1552315366846_0011_01_000002 :
    |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
    |- 20026 19941 19941 19941 (java) 950 81 2678403072 50861 /opt/jdk1.8.0_201/bin/java -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx820m -Djava.io.tmpdir=/home/work/_data/hadoop-3.1.2/nm-local-dir/usercache/root/appcache/application_1552315366846_0011/container_e15_1552315366846_0011_01_000002/tmp -Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=/home/work/_logs/hadoop-3.1.2/userlogs/application_1552315366846_0011/container_e15_1552315366846_0011_01_000002 -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA -Dhadoop.root.logfile=syslog org.apache.hadoop.mapred.YarnChild 192.168.11.32 41274 attempt_1552315366846_0011_m_000000_0 16492674416642
    |- 19941 19939 19941 19941 (bash) 1 2 115900416 307 /bin/bash -c /opt/jdk1.8.0_201/bin/java -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN   -Xmx820m -Djava.io.tmpdir=/home/work/_data/hadoop-3.1.2/nm-local-dir/usercache/root/appcache/application_1552315366846_0011/container_e15_1552315366846_0011_01_000002/tmp -Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=/home/work/_logs/hadoop-3.1.2/userlogs/application_1552315366846_0011/container_e15_1552315366846_0011_01_000002 -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA -Dhadoop.root.logfile=syslog org.apache.hadoop.mapred.YarnChild 192.168.11.32 41274 attempt_1552315366846_0011_m_000000_0 16492674416642 1>/home/work/_logs/hadoop-3.1.2/userlogs/application_1552315366846_0011/container_e15_1552315366846_0011_01_000002/stdout 2>/home/work/_logs/hadoop-3.1.2/userlogs/application_1552315366846_0011/container_e15_1552315366846_0011_01_000002/stderr

[2019-03-12 20:21:28.324]Container killed on request. Exit code is 143
[2019-03-12 20:21:28.335]Container exited with a non-zero exit code 143.

2019-03-12 20:21:30,978 INFO mapreduce.Job:  map 75% reduce 0%
2019-03-12 20:21:37,021 INFO mapreduce.Job:  map 100% reduce 0%
2019-03-12 20:21:37,032 INFO mapreduce.Job: Job job_1552315366846_0011 completed successfully
2019-03-12 20:21:37,145 INFO mapreduce.Job: Counters: 33
    File System Counters
        FILE: Number of bytes read=0
        FILE: Number of bytes written=915840
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=425
        HDFS: Number of bytes written=49
        HDFS: Number of read operations=24
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=8
    Job Counters
        Failed map tasks=1
        Launched map tasks=5
        Other local map tasks=5
        Total time spent by all maps in occupied slots (ms)=31981
        Total time spent by all reduces in occupied slots (ms)=0
        Total time spent by all map tasks (ms)=31981
        Total vcore-milliseconds taken by all map tasks=31981
        Total megabyte-milliseconds taken by all map tasks=32748544
    Map-Reduce Framework
        Map input records=2
        Map output records=2
        Input split bytes=425
        Spilled Records=0
        Failed Shuffles=0
        Merged Map outputs=0
        GC time elapsed (ms)=318
        CPU time spent (ms)=6520
        Physical memory (bytes) snapshot=815542272
        Virtual memory (bytes) snapshot=11174408192
        Total committed heap usage (bytes)=437780480
        Peak Map Physical memory (bytes)=206934016
        Peak Map Virtual memory (bytes)=2795565056
    File Input Format Counters
        Bytes Read=0
    File Output Format Counters
        Bytes Written=49
2019-03-12 20:21:37,154 INFO mapreduce.ImportJobBase: Transferred 49 bytes in 27.4776 seconds (1.7833 bytes/sec)
2019-03-12 20:21:37,159 INFO mapreduce.ImportJobBase: Retrieved 2 records.
2019-03-12 20:21:37,159 INFO mapreduce.ImportJobBase: Publishing Hive/Hcat import job data to Listeners for table hive2mysql_mshk
2019-03-12 20:21:37,188 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `hive2mysql_mshk` AS t LIMIT 1
2019-03-12 20:21:37,208 INFO hive.HiveImport: Loading uploaded data into Hive
2019-03-12 20:21:37,220 INFO conf.HiveConf: Found configuration file file:/home/work/_app/hive-2.3.4/conf/hive-site.xml
2019-03-12 20:21:49,491 INFO hive.HiveImport:
2019-03-12 20:21:49,492 INFO hive.HiveImport: Logging initialized using configuration in jar:file:/home/work/_app/hive-2.3.4/lib/hive-common-2.3.4.jar!/hive-log4j2.properties Async: true
2019-03-12 20:21:56,558 INFO hive.HiveImport: OK
2019-03-12 20:21:56,561 INFO hive.HiveImport: Time taken: 5.954 seconds
2019-03-12 20:21:57,005 INFO hive.HiveImport: Loading data to table testmshk.mysql2hive_mshk
2019-03-12 20:21:58,181 INFO hive.HiveImport: OK
2019-03-12 20:21:58,181 INFO hive.HiveImport: Time taken: 1.619 seconds
2019-03-12 20:21:58,681 INFO hive.HiveImport: Hive import complete.

    
  最后,让我们验证 Hive 中的输出:

[[email protected] ~]# hive

Logging initialized using configuration in jar:file:/home/work/_app/hive-2.3.4/lib/hive-common-2.3.4.jar!/hive-log4j2.properties Async: true
Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
hive> show databases;
OK
default
testmshk
Time taken: 3.512 seconds, Fetched: 2 row(s)
hive> use testmshk;
OK
Time taken: 0.032 seconds
hive> show tables;
OK
mysql2hive_mshk
testtable
Time taken: 0.083 seconds, Fetched: 2 row(s)
hive> select * from mysql2hive_mshk;
OK
10086   my.mshk.top you.mshk.top
12306   mname   yname
Time taken: 1.634 seconds, Fetched: 2 row(s)
hive> quit;

  
  同时我们在 HDFS 中也可以看到创建的数据:
  
  

4.3、Sqoop 通过 MySql 导入数据到 Hbase

  接下来我们将 MySql 中的表 hive2mysql_mshk 数据,导入到 Hbase ,同时在 Hbase 中创建表 mysql2hase_mshk

[[email protected] ~]# sqoop import --connect jdbc:mysql://c0:3306/testmshk?useSSL=false --username root --password 123456 --split-by id --table hive2mysql_mshk --hbase-table mysql2hase_mshk --hbase-create-table --hbase-row-key id --column-family id
2019-03-13 12:04:33,647 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7
2019-03-13 12:04:33,694 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
2019-03-13 12:04:33,841 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
2019-03-13 12:04:33,841 INFO tool.CodeGenTool: Beginning code generation
2019-03-13 12:04:34,162 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `hive2mysql_mshk` AS t LIMIT 1
2019-03-13 12:04:34,197 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `hive2mysql_mshk` AS t LIMIT 1
...
2019-03-13 12:05:13,782 INFO mapreduce.Job:  map 75% reduce 0%
2019-03-13 12:05:15,813 INFO mapreduce.Job:  map 100% reduce 0%
2019-03-13 12:05:16,827 INFO mapreduce.Job: Job job_1552397454797_0002 completed successfully
2019-03-13 12:05:16,942 INFO mapreduce.Job: Counters: 33
    File System Counters
        FILE: Number of bytes read=0
        FILE: Number of bytes written=1041632
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=425
        HDFS: Number of bytes written=0
        HDFS: Number of read operations=4
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=0
    Job Counters
        Failed map tasks=5
        Launched map tasks=9
        Other local map tasks=9
        Total time spent by all maps in occupied slots (ms)=68882
        Total time spent by all reduces in occupied slots (ms)=0
        Total time spent by all map tasks (ms)=68882
        Total vcore-milliseconds taken by all map tasks=68882
        Total megabyte-milliseconds taken by all map tasks=70535168
    Map-Reduce Framework
        Map input records=2
        Map output records=2
        Input split bytes=425
        Spilled Records=0
        Failed Shuffles=0
        Merged Map outputs=0
        GC time elapsed (ms)=801
        CPU time spent (ms)=15480
        Physical memory (bytes) snapshot=1097326592
        Virtual memory (bytes) snapshot=11271196672
        Total committed heap usage (bytes)=629669888
        Peak Map Physical memory (bytes)=295751680
        Peak Map Virtual memory (bytes)=2828283904
    File Input Format Counters
        Bytes Read=0
    File Output Format Counters
        Bytes Written=0
2019-03-13 12:05:16,949 INFO mapreduce.ImportJobBase: Transferred 0 bytes in 38.224 seconds (0 bytes/sec)
2019-03-13 12:05:16,954 INFO mapreduce.ImportJobBase: Retrieved 2 records.

  
  这时,我们再用 shell 测试连接 Hbase ,查看我们刚刚导入的数据,能够看到 mysql2hase_mshk 已经存在,并且可以获取其中的数据

[[email protected] ~]# hbase shell
HBase Shell
Use "help" to get list of supported commands.
Use "exit" to quit this interactive shell.
Version 1.4.9, rd625b212e46d01cb17db9ac2e9e927fdb201afa1, Wed Dec  5 11:54:10 PST 2018

hbase(main):001:0> list
TABLE
mysql2hase_mshk
1 row(s) in 0.1870 seconds

=> ["mysql2hase_mshk"]
hbase(main):002:0> scan 'mysql2hase_mshk'
ROW                                                        COLUMN+CELL
 10086                                                     column=id:namea, timestamp=1552449912494, value=my.mshk.top
 10086                                                     column=id:nameb, timestamp=1552449912494, value=you.mshk.top
 12306                                                     column=id:namea, timestamp=1552449906986, value=mname
 12306                                                     column=id:nameb, timestamp=1552449906986, value=yname
2 row(s) in 0.1330 seconds

hbase(main):003:0> hbase(main):003:0> get 'mysql2hase_mshk','10086'
COLUMN                                                     CELL
 id:namea                                                  timestamp=1552449912494, value=my.mshk.top
 id:nameb                                                  timestamp=1552449912494, value=you.mshk.top
1 row(s) in 0.0230 seconds

hbase(main):004:0>

  
  如何在 HbaseHive 中互相导入、导出数据,请参考文章:Hadoop 3.1.2(HA)+Zookeeper3.4.13+Hbase1.4.9(HA)+Hive2.3.4+Spark2.4.0(HA)高可用集群搭建 中的 9.2.49.2.5 章节。
  
  希望您发现它很有用,感谢您的支持和阅读我的博客。


博文作者:迦壹
博客地址:通过 Sqoop1.4.7 将 Mysql5.7、Hive2.3.4、Hbase1.4.9 之间的数据导入导出
转载声明:可以转载, 但必须以超链接形式标明文章原始出处和作者信息及版权声明,谢谢合作!
  
假设您认为这篇文章对您有帮助,可以通过以下方式进行捐赠,谢谢!

比特币地址:1KdgydfKMcFVpicj5w4vyn3T88dwjBst6Y
以太坊地址:0xbB0a92d634D7b9Ac69079ed0e521CC2e0a97c420




You must enable javascript to see captcha here!

Copyright © All Rights Reserved · Green Hope Theme by Sivan & schiy · Proudly powered by WordPress

无觅相关文章插件,快速提升流量