Questions addressed in this article:
1. Which Sqoop tool imports multiple tables at once?
2. What three conditions must a multi-table import satisfy?
3. How do you import into a specific HDFS directory, and into a specific Hive database?
1. Introduction
Sometimes we need to import several tables from a relational database into HDFS or Hive in one go. Sqoop provides a dedicated tool for this: sqoop-import-all-tables. Each table's data is stored in its own HDFS directory, named after the table.
Before using a multi-table import, all three of the following conditions must hold:
1. Every table must have a single-column primary key;
2. Every table must be imported in full, not partially;
3. You must use the default split column, and must not impose any conditions via a WHERE clause.
The --table, --split-by, --columns, and --where arguments are not valid with sqoop-import-all-tables. The --exclude-tables argument can be used to skip specific tables. Otherwise, usage is much the same as a single-table import.
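As a sketch of the --exclude-tables option (reusing the connection string and credentials from the commands later in this article; adjust them to your environment):

```shell
# Import every table in the spice database except vmLog.
# --exclude-tables takes a comma-separated list of table names.
sqoop-import-all-tables \
  --connect jdbc:mysql://secondmgt:3306/spice \
  --username hive --password hive \
  --as-textfile \
  --warehouse-dir /output/ \
  --exclude-tables vmLog
```

This requires a running MySQL instance and Hadoop cluster, so it is illustrative rather than something you can run as-is.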
2. The source tables
My spice database contains the following four tables:
- mysql> show tables;
- +-----------------+
- | Tables_in_spice |
- +-----------------+
- | servers |
- | users |
- | vmLog |
- | vms |
- +-----------------+
- 4 rows in set (0.00 sec)
3. Importing all tables into HDFS
- [hadoopUser@secondmgt ~]$ sqoop-import-all-tables --connect jdbc:mysql://secondmgt:3306/spice --username hive --password hive --as-textfile --warehouse-dir /output/
- Warning: /usr/lib/hcatalog does not exist! HCatalog jobs will fail.
- Please set $HCAT_HOME to the root of your HCatalog installation.
- 15/01/19 20:21:15 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
- 15/01/19 20:21:15 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
- 15/01/19 20:21:15 INFO tool.CodeGenTool: Beginning code generation
- 15/01/19 20:21:15 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `servers` AS t LIMIT 1
- 15/01/19 20:21:15 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `servers` AS t LIMIT 1
- 15/01/19 20:21:15 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /home/hadoopUser/cloud/hadoop/programs/hadoop-2.2.0
- Note: /tmp/sqoop-hadoopUser/compile/0bdbced5e58f170e1670516db3339f91/servers.java uses or overrides a deprecated API.
- Note: Recompile with -Xlint:deprecation for details.
- 15/01/19 20:21:16 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-hadoopUser/compile/0bdbced5e58f170e1670516db3339f91/servers.jar
- 15/01/19 20:21:16 WARN manager.MySQLManager: It looks like you are importing from mysql.
- 15/01/19 20:21:16 WARN manager.MySQLManager: This transfer can be faster! Use the --direct
- 15/01/19 20:21:16 WARN manager.MySQLManager: option to exercise a MySQL-specific fast path.
- 15/01/19 20:21:16 INFO manager.MySQLManager: Setting zero DATETIME behavior to convertToNull (mysql)
- 15/01/19 20:21:16 INFO mapreduce.ImportJobBase: Beginning import of servers
- 15/01/19 20:21:16 INFO Configuration.deprecation: mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
- SLF4J: Class path contains multiple SLF4J bindings.
- SLF4J: Found binding in [jar:file:/home/hadoopUser/cloud/hadoop/programs/hadoop-2.2.0/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
- SLF4J: Found binding in [jar:file:/home/hadoopUser/cloud/hbase/hbase-0.96.2-hadoop2/lib/slf4j-log4j12-1.6.4.jar!/org/slf4j/impl/StaticLoggerBinder.class]
- SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
- SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
- 15/01/19 20:21:17 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
- 15/01/19 20:21:17 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
- 15/01/19 20:21:17 INFO client.RMProxy: Connecting to ResourceManager at secondmgt/192.168.2.133:8032
- 15/01/19 20:21:18 INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(`src_id`), MAX(`src_id`) FROM `servers`
- 15/01/19 20:21:18 INFO mapreduce.JobSubmitter: number of splits:3
- 15/01/19 20:21:19 INFO Configuration.deprecation: mapred.job.classpath.files is deprecated. Instead, use mapreduce.job.classpath.files
- 15/01/19 20:21:19 INFO Configuration.deprecation: user.name is deprecated. Instead, use mapreduce.job.user.name
- 15/01/19 20:21:19 INFO Configuration.deprecation: mapred.cache.files.filesizes is deprecated. Instead, use mapreduce.job.cache.files.filesizes
- 15/01/19 20:21:19 INFO Configuration.deprecation: mapred.cache.files is deprecated. Instead, use mapreduce.job.cache.files
- 15/01/19 20:21:19 INFO Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
- 15/01/19 20:21:19 INFO Configuration.deprecation: mapred.output.value.class is deprecated. Instead, use mapreduce.job.output.value.class
- 15/01/19 20:21:19 INFO Configuration.deprecation: mapreduce.map.class is deprecated. Instead, use mapreduce.job.map.class
- 15/01/19 20:21:19 INFO Configuration.deprecation: mapred.job.name is deprecated. Instead, use mapreduce.job.name
- 15/01/19 20:21:19 INFO Configuration.deprecation: mapreduce.inputformat.class is deprecated. Instead, use mapreduce.job.inputformat.class
- 15/01/19 20:21:19 INFO Configuration.deprecation: mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
- 15/01/19 20:21:19 INFO Configuration.deprecation: mapreduce.outputformat.class is deprecated. Instead, use mapreduce.job.outputformat.class
- 15/01/19 20:21:19 INFO Configuration.deprecation: mapred.cache.files.timestamps is deprecated. Instead, use mapreduce.job.cache.files.timestamps
- 15/01/19 20:21:19 INFO Configuration.deprecation: mapred.output.key.class is deprecated. Instead, use mapreduce.job.output.key.class
- 15/01/19 20:21:19 INFO Configuration.deprecation: mapred.working.dir is deprecated. Instead, use mapreduce.job.working.dir
- 15/01/19 20:21:19 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1421373857783_0035
- 15/01/19 20:21:19 INFO impl.YarnClientImpl: Submitted application application_1421373857783_0035 to ResourceManager at secondmgt/192.168.2.133:8032
- 15/01/19 20:21:19 INFO mapreduce.Job: The url to track the job: http://secondmgt:8088/proxy/application_1421373857783_0035/
- 15/01/19 20:21:19 INFO mapreduce.Job: Running job: job_1421373857783_0035
- 15/01/19 20:21:32 INFO mapreduce.Job: Job job_1421373857783_0035 running in uber mode : false
- 15/01/19 20:21:32 INFO mapreduce.Job: map 0% reduce 0%
- 15/01/19 20:21:43 INFO mapreduce.Job: map 33% reduce 0%
- 15/01/19 20:21:46 INFO mapreduce.Job: map 67% reduce 0%
- 15/01/19 20:21:48 INFO mapreduce.Job: map 100% reduce 0%
- 15/01/19 20:21:48 INFO mapreduce.Job: Job job_1421373857783_0035 completed successfully
- 15/01/19 20:21:48 INFO mapreduce.Job: Counters: 27
- File System Counters
- FILE: Number of bytes read=0
- FILE: Number of bytes written=275913
- FILE: Number of read operations=0
- FILE: Number of large read operations=0
- FILE: Number of write operations=0
- HDFS: Number of bytes read=319
- HDFS: Number of bytes written=39
- HDFS: Number of read operations=12
- HDFS: Number of large read operations=0
- HDFS: Number of write operations=6
- Job Counters
- Launched map tasks=3
- Other local map tasks=3
- Total time spent by all maps in occupied slots (ms)=127888
- Total time spent by all reduces in occupied slots (ms)=0
- Map-Reduce Framework
- Map input records=3
- Map output records=3
- Input split bytes=319
- Spilled Records=0
- Failed Shuffles=0
- Merged Map outputs=0
- GC time elapsed (ms)=201
- CPU time spent (ms)=7470
- Physical memory (bytes) snapshot=439209984
- Virtual memory (bytes) snapshot=2636304384
- Total committed heap usage (bytes)=252706816
- File Input Format Counters
- Bytes Read=0
- File Output Format Counters
- Bytes Written=39
- 15/01/19 20:21:48 INFO mapreduce.ImportJobBase: Transferred 39 bytes in 30.8769 seconds (1.2631 bytes/sec)
- 15/01/19 20:21:48 INFO mapreduce.ImportJobBase: Retrieved 3 records.
- 15/01/19 20:21:48 INFO tool.CodeGenTool: Beginning code generation
- 15/01/19 20:21:48 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `users` AS t LIMIT 1
- 15/01/19 20:21:48 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /home/hadoopUser/cloud/hadoop/programs/hadoop-2.2.0
- Note: /tmp/sqoop-hadoopUser/compile/0bdbced5e58f170e1670516db3339f91/users.java uses or overrides a deprecated API.
- Note: Recompile with -Xlint:deprecation for details.
- 15/01/19 20:21:48 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-hadoopUser/compile/0bdbced5e58f170e1670516db3339f91/users.jar
- 15/01/19 20:21:48 INFO mapreduce.ImportJobBase: Beginning import of users
- 15/01/19 20:21:48 INFO client.RMProxy: Connecting to ResourceManager at secondmgt/192.168.2.133:8032
- 15/01/19 20:21:49 INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(`id`), MAX(`id`) FROM `users`
- 15/01/19 20:21:49 INFO mapreduce.JobSubmitter: number of splits:4
- 15/01/19 20:21:49 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1421373857783_0036
- 15/01/19 20:21:49 INFO impl.YarnClientImpl: Submitted application application_1421373857783_0036 to ResourceManager at secondmgt/192.168.2.133:8032
- 15/01/19 20:21:49 INFO mapreduce.Job: The url to track the job: http://secondmgt:8088/proxy/application_1421373857783_0036/
- 15/01/19 20:21:49 INFO mapreduce.Job: Running job: job_1421373857783_0036
- 15/01/19 20:22:02 INFO mapreduce.Job: Job job_1421373857783_0036 running in uber mode : false
- 15/01/19 20:22:02 INFO mapreduce.Job: map 0% reduce 0%
- 15/01/19 20:22:13 INFO mapreduce.Job: map 25% reduce 0%
- 15/01/19 20:22:18 INFO mapreduce.Job: map 75% reduce 0%
- 15/01/19 20:22:23 INFO mapreduce.Job: map 100% reduce 0%
- 15/01/19 20:22:23 INFO mapreduce.Job: Job job_1421373857783_0036 completed successfully
- 15/01/19 20:22:23 INFO mapreduce.Job: Counters: 27
- File System Counters
- FILE: Number of bytes read=0
- FILE: Number of bytes written=368040
- FILE: Number of read operations=0
- FILE: Number of large read operations=0
- FILE: Number of write operations=0
- HDFS: Number of bytes read=401
- HDFS: Number of bytes written=521
- HDFS: Number of read operations=16
- HDFS: Number of large read operations=0
- HDFS: Number of write operations=8
- Job Counters
- Launched map tasks=4
- Other local map tasks=4
- Total time spent by all maps in occupied slots (ms)=175152
- Total time spent by all reduces in occupied slots (ms)=0
- Map-Reduce Framework
- Map input records=13
- Map output records=13
- Input split bytes=401
- Spilled Records=0
- Failed Shuffles=0
- Merged Map outputs=0
- GC time elapsed (ms)=257
- CPU time spent (ms)=10250
- Physical memory (bytes) snapshot=627642368
- Virtual memory (bytes) snapshot=3547209728
- Total committed heap usage (bytes)=335544320
- File Input Format Counters
- Bytes Read=0
- File Output Format Counters
- Bytes Written=521
- 15/01/19 20:22:23 INFO mapreduce.ImportJobBase: Transferred 521 bytes in 34.6285 seconds (15.0454 bytes/sec)
- 15/01/19 20:22:23 INFO mapreduce.ImportJobBase: Retrieved 13 records.
- 15/01/19 20:22:23 INFO tool.CodeGenTool: Beginning code generation
- 15/01/19 20:22:23 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `vmLog` AS t LIMIT 1
- 15/01/19 20:22:23 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /home/hadoopUser/cloud/hadoop/programs/hadoop-2.2.0
- Note: /tmp/sqoop-hadoopUser/compile/0bdbced5e58f170e1670516db3339f91/vmLog.java uses or overrides a deprecated API.
- Note: Recompile with -Xlint:deprecation for details.
- 15/01/19 20:22:23 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-hadoopUser/compile/0bdbced5e58f170e1670516db3339f91/vmLog.jar
- 15/01/19 20:22:23 INFO mapreduce.ImportJobBase: Beginning import of vmLog
- 15/01/19 20:22:23 INFO client.RMProxy: Connecting to ResourceManager at secondmgt/192.168.2.133:8032
- 15/01/19 20:22:24 INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(`id`), MAX(`id`) FROM `vmLog`
- 15/01/19 20:22:24 INFO mapreduce.JobSubmitter: number of splits:4
- 15/01/19 20:22:24 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1421373857783_0037
- 15/01/19 20:22:24 INFO impl.YarnClientImpl: Submitted application application_1421373857783_0037 to ResourceManager at secondmgt/192.168.2.133:8032
- 15/01/19 20:22:24 INFO mapreduce.Job: The url to track the job: http://secondmgt:8088/proxy/application_1421373857783_0037/
- 15/01/19 20:22:24 INFO mapreduce.Job: Running job: job_1421373857783_0037
- 15/01/19 20:22:37 INFO mapreduce.Job: Job job_1421373857783_0037 running in uber mode : false
- 15/01/19 20:22:37 INFO mapreduce.Job: map 0% reduce 0%
- 15/01/19 20:22:47 INFO mapreduce.Job: map 25% reduce 0%
- 15/01/19 20:22:52 INFO mapreduce.Job: map 75% reduce 0%
- 15/01/19 20:22:58 INFO mapreduce.Job: map 100% reduce 0%
- 15/01/19 20:22:58 INFO mapreduce.Job: Job job_1421373857783_0037 completed successfully
- 15/01/19 20:22:59 INFO mapreduce.Job: Counters: 27
- File System Counters
- FILE: Number of bytes read=0
- FILE: Number of bytes written=367872
- FILE: Number of read operations=0
- FILE: Number of large read operations=0
- FILE: Number of write operations=0
- HDFS: Number of bytes read=398
- HDFS: Number of bytes written=635
- HDFS: Number of read operations=16
- HDFS: Number of large read operations=0
- HDFS: Number of write operations=8
- Job Counters
- Launched map tasks=4
- Other local map tasks=4
- Total time spent by all maps in occupied slots (ms)=171552
- Total time spent by all reduces in occupied slots (ms)=0
- Map-Reduce Framework
- Map input records=23
- Map output records=23
- Input split bytes=398
- Spilled Records=0
- Failed Shuffles=0
- Merged Map outputs=0
- GC time elapsed (ms)=182
- CPU time spent (ms)=10480
- Physical memory (bytes) snapshot=588107776
- Virtual memory (bytes) snapshot=3523424256
- Total committed heap usage (bytes)=337641472
- File Input Format Counters
- Bytes Read=0
- File Output Format Counters
- Bytes Written=635
- 15/01/19 20:22:59 INFO mapreduce.ImportJobBase: Transferred 635 bytes in 35.147 seconds (18.067 bytes/sec)
- 15/01/19 20:22:59 INFO mapreduce.ImportJobBase: Retrieved 23 records.
- 15/01/19 20:22:59 INFO tool.CodeGenTool: Beginning code generation
- 15/01/19 20:22:59 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `vms` AS t LIMIT 1
- 15/01/19 20:22:59 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /home/hadoopUser/cloud/hadoop/programs/hadoop-2.2.0
- Note: /tmp/sqoop-hadoopUser/compile/0bdbced5e58f170e1670516db3339f91/vms.java uses or overrides a deprecated API.
- Note: Recompile with -Xlint:deprecation for details.
- 15/01/19 20:22:59 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-hadoopUser/compile/0bdbced5e58f170e1670516db3339f91/vms.jar
- 15/01/19 20:22:59 INFO mapreduce.ImportJobBase: Beginning import of vms
- 15/01/19 20:22:59 INFO client.RMProxy: Connecting to ResourceManager at secondmgt/192.168.2.133:8032
- 15/01/19 20:23:00 INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(`id`), MAX(`id`) FROM `vms`
- 15/01/19 20:23:00 INFO mapreduce.JobSubmitter: number of splits:4
- 15/01/19 20:23:00 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1421373857783_0038
- 15/01/19 20:23:00 INFO impl.YarnClientImpl: Submitted application application_1421373857783_0038 to ResourceManager at secondmgt/192.168.2.133:8032
- 15/01/19 20:23:00 INFO mapreduce.Job: The url to track the job: http://secondmgt:8088/proxy/application_1421373857783_0038/
- 15/01/19 20:23:00 INFO mapreduce.Job: Running job: job_1421373857783_0038
- 15/01/19 20:23:13 INFO mapreduce.Job: Job job_1421373857783_0038 running in uber mode : false
- 15/01/19 20:23:13 INFO mapreduce.Job: map 0% reduce 0%
- 15/01/19 20:23:24 INFO mapreduce.Job: map 25% reduce 0%
- 15/01/19 20:23:28 INFO mapreduce.Job: map 75% reduce 0%
- 15/01/19 20:23:34 INFO mapreduce.Job: map 100% reduce 0%
- 15/01/19 20:23:35 INFO mapreduce.Job: Job job_1421373857783_0038 completed successfully
- 15/01/19 20:23:35 INFO mapreduce.Job: Counters: 27
- File System Counters
- FILE: Number of bytes read=0
- FILE: Number of bytes written=367932
- FILE: Number of read operations=0
- FILE: Number of large read operations=0
- FILE: Number of write operations=0
- HDFS: Number of bytes read=401
- HDFS: Number of bytes written=240
- HDFS: Number of read operations=16
- HDFS: Number of large read operations=0
- HDFS: Number of write operations=8
- Job Counters
- Launched map tasks=4
- Other local map tasks=4
- Total time spent by all maps in occupied slots (ms)=168328
- Total time spent by all reduces in occupied slots (ms)=0
- Map-Reduce Framework
- Map input records=8
- Map output records=8
- Input split bytes=401
- Spilled Records=0
- Failed Shuffles=0
- Merged Map outputs=0
- GC time elapsed (ms)=210
- CPU time spent (ms)=10990
- Physical memory (bytes) snapshot=600018944
- Virtual memory (bytes) snapshot=3536568320
- Total committed heap usage (bytes)=335544320
- File Input Format Counters
- Bytes Read=0
- File Output Format Counters
- Bytes Written=240
- 15/01/19 20:23:35 INFO mapreduce.ImportJobBase: Transferred 240 bytes in 35.9131 seconds (6.6828 bytes/sec)
- 15/01/19 20:23:35 INFO mapreduce.ImportJobBase: Retrieved 8 records.
As the log shows, a multi-table import is essentially a sequence of single-table imports, one job per table.
Check the result:
- [hadoopUser@secondmgt ~]$ hadoop fs -ls /output/
- Found 4 items
- drwxr-xr-x - hadoopUser supergroup 0 2015-01-19 20:21 /output/servers
- drwxr-xr-x - hadoopUser supergroup 0 2015-01-19 20:22 /output/users
- drwxr-xr-x - hadoopUser supergroup 0 2015-01-19 20:22 /output/vmLog
- drwxr-xr-x - hadoopUser supergroup 0 2015-01-19 20:23 /output/vms
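To spot-check the contents of one of the imported directories (the path is taken from the listing above):

```shell
# Each table directory holds one text part file per map task;
# print the servers table's output.
hadoop fs -cat /output/servers/part-m-*
```

Again, this assumes access to the same HDFS cluster used above.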
4. Importing all tables into Hive
Next we import the same four tables into Hive:
- [hadoopUser@secondmgt ~]$ sqoop-import-all-tables --connect jdbc:mysql://secondmgt:3306/spice --username hive --password hive --hive-import --as-textfile --create-hive-table
Check the result:
- hive> show tables;
- OK
- servers
- users
- vmlog
- vms
- Time taken: 0.022 seconds, Fetched: 4 row(s)
By default the tables land in Hive's default database (note in the listing above that Hive stores table names in lowercase, so vmLog becomes vmlog). To import into a specific database instead, use the --hive-database argument:
- [hadoopUser@secondmgt ~]$ sqoop-import-all-tables --connect jdbc:mysql://secondmgt:3306/spice --username hive --password hive --hive-import --hive-database test --as-textfile --create-hive-table
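A quick way to confirm the tables landed in the test database (this assumes the hive CLI is on the PATH of the same cluster):

```shell
# List the tables Sqoop created in the test database.
hive -e 'USE test; SHOW TABLES;'
```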