Load Data from Hive Data Warehouse

The Hive data warehouse is built on HDFS in a Hadoop cluster, so data in Hive is also stored in HDFS. Currently, SynxDB Cloud supports writing data to and reading data from HDFS (see Load Data from Object Storage and HDFS), as well as reading data from Hive via the Hive Connector.

The Hive Connector maps tables from the Hive cluster to foreign tables in SynxDB Cloud, which store the paths to the data in HDFS. The datalake_fdw extension reads data through these foreign tables, thus loading data from Hive into SynxDB Cloud.

The general steps to use the Hive Connector are as follows.

Step 1. Configure Hive and HDFS information on SynxDB Cloud

On containerized SynxDB Cloud, you need to configure the connection details for your external Hive and HDFS clusters. This is now done through the DBaaS Admin Console.

  • Provide Hive and HDFS connection details. In the DBaaS Admin Console's Database Config page, create configurations for your HDFS cluster and your Hive Metastore service.

  • If using Kerberos, provide KDC and keytab details. If your Hive cluster uses Kerberos authentication, you must also provide the Kerberos configuration (krb5.conf) and the necessary keytab files.

    • Provide krb5.conf content: In the KDC tab of the Database Config page, create a new KDC configuration. You can paste the content of your krb5.conf file into the Manual Input field. For detailed instructions, see Configure a Kerberos connection.

    • Upload keytab files: When you configure the HDFS and Hive connections with Kerberos Authentication in the DBaaS Admin Console, you will be prompted to upload the corresponding hdfs.keytab and hive.keytab files directly in the user interface.

Attention

The default port for the data_lake agent has been changed from 5888 to 3888 to avoid conflict with PXF.

Step 2. Create foreign data wrapper and Hive Connector extension

Before syncing, create the datalake_fdw extension used for reading from HDFS, and create the foreign data wrapper for the foreign tables.

  1. Create the necessary extensions.

    CREATE EXTENSION dfs_tablespace;
    CREATE EXTENSION gp_toolkit;
    CREATE EXTENSION datalake_fdw;
    
  2. Create the foreign data wrapper.

    CREATE FOREIGN DATA WRAPPER datalake_fdw
    HANDLER datalake_fdw_handler
    VALIDATOR datalake_fdw_validator
    OPTIONS (mpp_execute 'all segments');
    
  3. Load the Hive Connector extension before calling its functions.

    CREATE EXTENSION hive_connector;
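    
    -- Optional check (a sketch): confirm the extensions created in the steps
    -- above are installed before proceeding.
    SELECT extname FROM pg_extension
    WHERE extname IN ('datalake_fdw', 'hive_connector');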
    

Step 3. Create server and user mapping

After creating the foreign data wrapper and Hive Connector, you need to create the server and user mapping, as shown in the following example:

SELECT public.create_foreign_server('sync_server', 'gpadmin', 'datalake_fdw', 'hdfs-cluster-1');

In the above example, the create_foreign_server function has the following general form:

create_foreign_server(serverName,
                     userMapName,
                     dataWrapName,
                     hdfsClusterName);

This function creates a server and user mapping pointing to an HDFS cluster, which can be used by the Hive Connector to create foreign tables. The datalake_fdw uses the server configuration to read data from the corresponding HDFS cluster when accessing external tables.

The parameters in the function are explained as follows:

  • serverName: The name of the server to be created.

  • userMapName: The name of the user to be created on the server.

  • dataWrapName: The name of the data wrapper used for reading HDFS data.

  • hdfsClusterName: The name of the HDFS cluster where the Hive cluster is located, as specified in the configuration file.
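
After calling create_foreign_server, you can optionally confirm the result in the system catalogs. This is a sketch using the names from the example above ('sync_server'); pg_foreign_server and pg_user_mappings are standard PostgreSQL catalog views:

-- Shows the server definition and its options.
SELECT srvname, srvoptions FROM pg_foreign_server WHERE srvname = 'sync_server';

-- Shows the user mapping created on that server.
SELECT * FROM pg_user_mappings WHERE srvname = 'sync_server';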

Tip

By default, the datalake_fdw accesses HDFS using the system role gpadmin. You can use the user option in CREATE USER MAPPING to control which HDFS user will be used when accessing the file system. This allows finer access control to HDFS resources.

Example:

CREATE SERVER foreign_server
FOREIGN DATA WRAPPER datalake_fdw
OPTIONS (
   protocol 'hdfs',
   hdfs_namenodes 'hadoop-nn',
   hdfs_port '9000',
   hdfs_auth_method 'simple',
   hadoop_rpc_protection 'authentication');

CREATE USER MAPPING FOR current_user SERVER foreign_server
OPTIONS (user 'hdfs_reader');

In this example, the HDFS storage will be accessed with the hdfs_reader user rather than the default gpadmin. This method is recommended for managing access permissions in multi-tenant or multi-user environments.

Step 4. Sync Hive objects to SynxDB Cloud

There are two ways to synchronize Hive metadata to SynxDB Cloud:

  • Manual synchronization (function-based): Use SQL functions to sync tables or partitions on demand. This is suitable for one-time loads or batch updates.

  • Automatic synchronization (event-based): Use the Hive Metadata Auto Sync feature to automatically capture changes from Hive Metastore via Kafka. This is recommended for scenarios requiring real-time metadata consistency. Auto sync requires additional setup on the Hive side (installing a listener plugin and configuring hive-site.xml) and an available Kafka broker; see the prerequisites in the link below before using it.

For configuration details of automatic synchronization, refer to Configure Hive Metadata Auto Sync.

Manual synchronization

Syncing a Hive table

To sync a table from Hive to SynxDB Cloud, see the following example:

-- Syncs Hive tables in psql.

gpadmin=# select public.sync_hive_table('hive-cluster-1', 'mytestdb', 'weblogs', 'hdfs-cluster-1', 'myschema.weblogs', 'sync_server');
 sync_hive_table
-----------------
 t
(1 row)

The above example uses the sync_hive_table function to perform the synchronization. The general form of the function is as follows:

sync_hive_table(hiveClusterName,
               hiveDatabaseName,
               hiveTableName,
               hdfsClusterName,
               destTableName,
               serverName,
               forceSync);

This function syncs a table to SynxDB Cloud in either non-forced or forced mode. When forceSync is set to true, the sync is forced: if a table with the same name already exists in SynxDB Cloud, the existing table is dropped before syncing. If forceSync is omitted or set to false, an error occurs when a table with the same name already exists.

The parameters are explained as follows:

  • hiveClusterName: The name of the Hive cluster where the table to be synced is located, as specified in the configuration file.

  • hiveDatabaseName: The name of the database in Hive where the table to be synced belongs.

  • hiveTableName: The name of the table to be synced.

  • hdfsClusterName: The name of the HDFS cluster where the Hive cluster is located, as specified in the configuration file.

  • destTableName: The name of the table in SynxDB Cloud where the data will be synced.

  • serverName: The name of the server to be used when creating the foreign table with the datalake_fdw extension.

  • forceSync: Optional parameter. Default value is false. Indicates whether the sync should be forced.
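
For example, to force a re-sync of the weblogs table from the earlier example (dropping the existing myschema.weblogs foreign table if present), you could pass true as the optional forceSync argument:

-- Forced sync: drops myschema.weblogs first if it already exists.
SELECT public.sync_hive_table('hive-cluster-1', 'mytestdb', 'weblogs',
                              'hdfs-cluster-1', 'myschema.weblogs',
                              'sync_server', true);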

Sync a partitioned Hive table using sync_hive_partition_table

SynxDB Cloud supports synchronizing only the latest partition of a Hive table using the sync_hive_partition_table function. This function is used to sync a single partition specified by the highest-level partition key (for example, prov if the table is partitioned by prov, month, and day). It does not support specifying lower-level partition keys directly (such as month or day), and will return an error if you attempt to do so.

Function prototype:

CREATE OR REPLACE FUNCTION sync_hive_partition_table(
   hiveClusterName text,
   hiveDatabaseName text,
   hiveTableName text,
   hivePartitionValue text,
   hdfsClusterName text,
   destTableName text
) RETURNS boolean
AS '$libdir/hive_connector', 'sync_hive_partition_table'
LANGUAGE C STRICT EXECUTE ON MASTER;

The parameter hivePartitionValue means the value for the highest-level partition key. It must be the first key in the partition column list.

Example Hive table:

CREATE TABLE hive_table (
   id int,
   name string
)
PARTITIONED BY (
   prov int,
   month int,
   day int
);

Example usage:

SELECT sync_hive_partition_table(
   'hive-cluster-1',
   'mydb',
   'hive_table',
   '06',
   'hdfs-cluster-1',
   'myschema.hive_table_06'
);

This call will sync only the partition data under prov=06. If you try to specify values like month=06 or day=15, the function will return an error.

Note: This function only supports specifying the value of the first partition key. Multi-level partition value specification is currently not supported.

Resulting external table structure:

CREATE TABLE mpp_table (
   id int,
   name string,
   prov int,
   month int,
   day int
)
LOCATION('gphdfs://example/prov=06/ hdfs_cluster_name=paa_cluster partitionkey=month,day partitionvalue=06')
FORMAT 'xxx';

More examples

Sync a Hive text table

  1. Create the following text table in Hive.

    -- Creates the Hive table in Beeline.
    
    CREATE TABLE weblogs
    (
        client_ip           STRING,
        full_request_date   STRING,
        day                 STRING,
        month               STRING,
        month_num           INT,
        year                STRING,
        referrer            STRING,
        user_agent          STRING
    ) STORED AS TEXTFILE;
    
  2. Sync the text table to SynxDB Cloud.

    -- Syncs the Hive table in psql.
    
    gpadmin=# select public.sync_hive_table('hive-cluster-1', 'mytestdb', 'weblogs', 'hdfs-cluster-1', 'myschema.weblogs', 'sync_server');
     sync_hive_table
    -----------------
     t
    (1 row)
    
  3. Query the external table.

    SELECT * FROM myschema.weblogs LIMIT 10;
    

Sync a Hive ORC table

  1. Create an ORC table in Hive.

    -- Creates the Hive table in Beeline.
    CREATE TABLE test_all_type
    (
        column_a tinyint,
        column_b smallint,
        column_c int,
        column_d bigint,
        column_e float,
        column_f double,
        column_g string,
        column_h timestamp,
        column_i date,
        column_j char(20),
        column_k varchar(20),
        column_l decimal(20, 10)
    ) STORED AS ORC;
    
  2. Sync the ORC table to SynxDB Cloud:

    -- Syncs the Hive table in psql.
    
    gpadmin=# select public.sync_hive_table('hive-cluster-1', 'mytestdb', 'test_all_type', 'hdfs-cluster-1', 'myschema.test_all_type', 'sync_server');
     sync_hive_table
    -----------------
     t
    (1 row)
    
  3. Query the external table.

    SELECT * FROM myschema.test_all_type LIMIT 10;
    

Sync a Hive ORC partitioned table

  1. Create an ORC partitioned table in Hive.

    -- Creates the Hive table in Beeline.
    
    CREATE TABLE test_partition_1_int
    (
        a tinyint,
        b smallint,
        c int,
        d bigint,
        e float,
        f double,
        g string,
        h timestamp,
        i date,
        j char(20),
        k varchar(20),
        l decimal(20, 10)
    )
    PARTITIONED BY
    (
        m int
    )
    STORED AS ORC;
    INSERT INTO test_partition_1_int VALUES (1, 1, 1, 1, 1, 1, '1', '2020-01-01 01:01:01', '2020-01-01', '1', '1', 10.01, 1);
    INSERT INTO test_partition_1_int VALUES (2, 2, 2, 2, 2, 2, '2', '2020-02-02 02:02:02', '2020-02-01', '2', '2', 11.01, 2);
    INSERT INTO test_partition_1_int VALUES (3, 3, 3, 3, 3, 3, '3', '2020-03-03 03:03:03', '2020-03-01', '3', '3', 12.01, 3);
    INSERT INTO test_partition_1_int VALUES (4, 4, 4, 4, 4, 4, '4', '2020-04-04 04:04:04', '2020-04-01', '4', '4', 13.01, 4);
    INSERT INTO test_partition_1_int VALUES (5, 5, 5, 5, 5, 5, '5', '2020-05-05 05:05:05', '2020-05-01', '5', '5', 14.01, 5);
    
  2. Sync the ORC partitioned table to SynxDB Cloud.

    -- Syncs the Hive partitioned table as one foreign table in psql.
    
    gpadmin=# select public.sync_hive_table('hive-cluster-1', 'mytestdb', 'test_partition_1_int', 'hdfs-cluster-1', 'myschema.test_partition_1_int', 'sync_server');
    sync_hive_table
    -----------------
    t
    (1 row)
    
  3. Query the external table.

    SELECT * FROM myschema.test_partition_1_int LIMIT 10;
    

Supported usage and limitations

Supported Hive file formats

You can load files in TEXT, CSV, ORC, or PARQUET formats from Hive into SynxDB Cloud.

Data type mapping

The following table shows how table data types on a Hive cluster map to table data types in SynxDB Cloud.

| Hive      | SynxDB Cloud     |
|-----------|------------------|
| binary    | bytea            |
| tinyint   | smallint         |
| smallint  | smallint         |
| int       | int              |
| bigint    | bigint           |
| float     | float4           |
| double    | double precision |
| string    | text             |
| timestamp | timestamp        |
| date      | date             |
| char      | char             |
| varchar   | varchar          |
| decimal   | decimal          |
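
To see this mapping applied, you can inspect the column types of a synced foreign table through the catalog. This sketch assumes the test_all_type table from the earlier example has been synced to myschema.test_all_type:

-- Lists each column of the synced foreign table with its SynxDB Cloud type.
SELECT a.attname, format_type(a.atttypid, a.atttypmod) AS synxdb_type
FROM pg_attribute a
WHERE a.attrelid = 'myschema.test_all_type'::regclass
  AND a.attnum > 0
  AND NOT a.attisdropped
ORDER BY a.attnum;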

Usage limitations

  • Synchronizing Hive external tables is not supported.

  • Synchronizing Hive table statistics is not supported.

  • SynxDB Cloud can read data from HDFS and write data to HDFS, but the written data cannot be read by Hive.

  • When using sync_hive_partition_table, only the first-level partition key is supported. Specifying a value from a secondary or lower-level partition key will result in an error.

Note

Q: How are writes and updates on HDFS synchronized to SynxDB Cloud? Are there any limitations?

A: The data remains stored in HDFS; the foreign data wrapper only reads the data from HDFS at query time.