Skip to content

Latest commit

 

History

History
202 lines (164 loc) · 12.7 KB

README.md

File metadata and controls

202 lines (164 loc) · 12.7 KB

Hive-TiDB-Storage-Handler

1 Environment

Component Version
JDK 8
Maven 3.6+
Hive 1.x / 2.x / 3.x

Hive Version to TiBigData Module:

Hive Version TiBigData Module
1.0.0 - 1.2.2 hive-2.2.0
2.0.0 - 2.2.0 hive-2.2.0
2.3.0 - 2.3.6 hive-2.3.6
3.0.0 - 3.1.2 hive-3.1.2

2 Compile Hive Storage Handler

# clone project
git clone [email protected]:tidb-incubator/TiBigData.git
cd TiBigData

# compile Hive Storage Handler,using hive-2.2.0 as an example
mvn clean package -DskipTests -am -pl hive/hive-2.2.0

The following parameters are available for compiling:

parameter default description
-Dmysql.driver.scope test Whether the dependency mysql-jdbc-driver is included

3 Install Hive Storage Handler

Both of the following are available.

3.1 Using ADD JAR syntax

# upload TiBigData jar to HDFS
hadoop fs -put $TIBIGDATA_HOME/hive/hive-2.2.0/target/hive-tidb-storage-handler-2.2.0-0.0.5-SNAPSHOT.jar /tmp
# run add jar sql in hive shell or beeline
ADD JAR hdfs:///tmp/hive-tidb-storage-handler-2.2.0-0.0.5-SNAPSHOT.jar;

3.2 Copying lib to Hive

cp $TIBIGDATA_HOME/hive/hive-2.2.0/target/hive-tidb-storage-handler-2.2.0-0.0.5-SNAPSHOT.jar $HIVE_HOME/lib/

4 Reading from TiDB by Hive

Using mysql client to create a table in TiDB:

# connect to TiDB
mysql --host 127.0.0.1 --port 4000 -uroot --database test

Insert data to TiDB:

CREATE TABLE `people`(
  `id` int,
  `name` varchar(16)
);

INSERT INTO `people`(`id`,`name`) VALUES (1,'zs');

Create the TiDB mapping table in Hive, where the columns are arbitrary and TiBigData will automatically correct the columns.

CREATE TEMPORARY TABLE `people`(
 `fake_column` int
)
STORED BY
  'io.tidb.bigdata.hive.TiDBStorageHandler'
TBLPROPERTIES (
  'tidb.database.url'='jdbc:mysql://localhost:4000/',
  'tidb.database.name'='test',
  'tidb.table.name'='people',
  'tidb.username'='root',
  'tidb.password'=''
 );

Using hive sql to query TiDB schema:

SHOW CREATE TABLE `people`;

We can see that the columns of the table have been automatically corrected:

hive (default)> SHOW CREATE TABLE `people`;
OK
createtab_stmt
CREATE TABLE `people`(
  `id` int COMMENT 'from deserializer',
  `name` string COMMENT 'from deserializer')
ROW FORMAT SERDE
  'io.tidb.bigdata.hive.TiDBSerde'
STORED BY
  'io.tidb.bigdata.hive.TiDBStorageHandler'
WITH SERDEPROPERTIES (
  'serialization.format'='1')
LOCATION
  'hdfs://XXX:8020/user/hive/warehouse/people'
TBLPROPERTIES (
  'COLUMN_STATS_ACCURATE'='{\"BASIC_STATS\":\"true\"}',
  'numFiles'='0',
  'numRows'='0',
  'rawDataSize'='0',
  'tidb.database.name'='test',
  'tidb.database.url'='jdbc:mysql://localhost:4000/',
  'tidb.password'='',
  'tidb.table.name'='people',
  'tidb.username'='root',
  'totalSize'='0',
  'transient_lastDdlTime'='1642779466')
Time taken: 2.276 seconds, Fetched: 23 row(s)
hive (default)>

Query TiDB in Hive:

SELECT * FROM `people`;

output:

hive (default)> SELECT * FROM `people`;
OK
people.id	people.name
1	zs
Time taken: 4.979 seconds, Fetched: 1 row(s)
hive (default)>

5 DataTypes

TiDB Hive
TINYINT INT
SMALLINT INT
MEDIUMINT INT
INT INT
BIGINT BIGINT
CHAR STRING
VARCHAR STRING
TINYTEXT STRING
MEDIUMTEXT STRING
TEXT STRING
LONGTEXT STRING
BINARY BYTES
VARBINARY BYTES
TINYBLOB BYTES
MEDIUMBLOB BYTES
BLOB BYTES
LONGBLOB BYTES
FLOAT FLOAT
DOUBLE DOUBLE
DECIMAL(p,s) DECIMAL(p,s)
DATE DATE
TIME STRING
DATETIME TIMESTAMP
TIMESTAMP TIMESTAMP
YEAR INT
BOOL BOOLEAN
JSON STRING
ENUM STRING
SET STRING

6 Configuration

Configuration Default Value Description
tidb.database.url - You should provide your own TiDB server address with a jdbc url format: jdbc:mysql://host:port/database or jdbc:tidb://host:port/database. If you have multiple TiDB server addresses and the amount of data to be inserted is huge, it would be better to use TiDB jdbc driver rather then MySQL jdbc driver. TiDB driver is a load-balancing driver, it will query all TiDB server addresses and pick one randomly when establishing connections.
tidb.username - JDBC username.
tidb.password null JDBC password.
tidb.jdbc.connection-provider-impl io.tidb.bigdata.tidb.JdbcConnectionProviderFactory.BasicJdbcConnectionProvider JDBC connection provider implements: set 'io.tidb.bigdata.tidb.JdbcConnectionProviderFactory.HikariDataSourceJdbcConnectionProvider', TiBigData will use JDBC connection pool implemented by HikariCP to provider connection; set 'io.tidb.bigdata.tidb.JdbcConnectionProviderFactory.BasicJdbcConnectionProvider', connection pool will not be used.
tidb.maximum.pool.size 10 Connection pool size.
tidb.minimum.idle.size 10 The minimum number of idle connections that HikariCP tries to maintain in the pool.
tidb.replica-read leader Read data from specified role. The optional roles are leader, follower and learner. You can also specify multiple roles, and we will pick the roles you specify in order.
tidb.replica-read.label null Only select TiKV store match specified labels. Format: label_x=value_x,label_y=value_y
tidb.replica-read.whitelist null Only select TiKV store with given ip addresses.
tidb.replica-read.blacklist null Do not select TiKV store with given ip addresses.
tidb.snapshot_timestamp null It is available for TiDB connector to read snapshot. You could set it by SET SESSION tidb.snapshot_timestamp='2021-01-01T14:00:00+08:00' and unset it by SET SESSION tidb.snapshot_timestamp='' within a session. The format of timestamp may refer to java.time.format.DateTimeFormatter#ISO_ZONED_DATE_TIME.
tidb.dns.search null Append dns search suffix to host names. It's especially necessary to map K8S cluster local name to FQDN.
tidb.regions-per-split 1 Number of TiDB regions one InputSplit handles.