Configure Hive Metadata Auto Sync

Warning

Hive Metadata Auto Sync is an experimental feature in the current version. Do not use it in production environments.

SynxDB Cloud can synchronize Hive metadata in real time through Kafka. The feature listens for Hive Metastore change events and updates the matching external table definitions in SynxDB Cloud without operator action. It complements the manual synchronization functions.

Hive Metadata Auto Sync runs as an independent component, managed separately from the database cluster. You configure it through the Hive Meta Sync tab on the Database Config page of the DBaaS Admin Console, but the feature also needs preparation on the Hive cluster and inside the target SynxDB Cloud database. This document covers the full setup end to end.

How it works

The synchronization pipeline has four components:

  • Hive Metastore with the SynxDB Cloud listener plugin installed. The plugin intercepts metadata change events such as CREATE TABLE, ALTER TABLE, and DROP TABLE, and publishes them to a Kafka topic.

  • Kafka broker that transports the metadata change events.

  • Meta Sync component running in the SynxDB Cloud cluster. It consumes events from Kafka and translates each event into a matching CREATE FOREIGN TABLE or DROP FOREIGN TABLE statement.

  • Target database in SynxDB Cloud (typically named hivedb), where the foreign tables are created against a pre-provisioned foreign server named __hive_auto_sync_server.
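
As an illustration of the consume-and-translate step, here is a minimal sketch in Python. The event field names ("type", "table", "columns") are hypothetical; the real event schema and generated DDL are internal to the Meta Sync component:

```python
# Hypothetical sketch of the Meta Sync translation step. The event field
# names ("type", "table", "columns") are illustrative, not the real schema.
def translate(event):
    table = 'public."{}"'.format(event["table"])
    if event["type"] == "CREATE_TABLE":
        cols = ", ".join('"{}" {}'.format(c, t) for c, t in event["columns"])
        return ("CREATE FOREIGN TABLE {} ({}) SERVER __hive_auto_sync_server"
                .format(table, cols))
    if event["type"] == "DROP_TABLE":
        return "DROP FOREIGN TABLE {}".format(table)
    return None  # other event types are ignored in this sketch
```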

The Kafka topic name uses the format <hdw.catalog.name>_fdb-<catalog>_hms, where:

  • hdw.catalog.name is the value of cn.cbdb.apiary.kafka.hdw.catalog.name configured in hive-site.xml on the Hive side. The same value must appear as hdw.catalog.name in the Meta Sync configuration.

  • <catalog> is each entry in cn.cbdb.apiary.kafka.autosync.catalogs. The same value must appear as hive.catalog.name in the Meta Sync configuration so that the consumer subscribes to the correct topic.
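
For example, with hdw.catalog.name set to hdw_catalog and a single catalog named hive, the derived topic name can be computed as follows (a small illustration of the naming rule, not part of the product):

```python
# Derive the Kafka topic name from the two catalog settings:
# <hdw.catalog.name>_fdb-<catalog>_hms
def topic_name(hdw_catalog_name, catalog):
    return "{}_fdb-{}_hms".format(hdw_catalog_name, catalog)

print(topic_name("hdw_catalog", "hive"))  # hdw_catalog_fdb-hive_hms
```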

Prerequisites

Before configuring Hive Metadata Auto Sync, confirm the following:

  • The listener plugin is installed on the Hive Metastore. Contact SynxDB Cloud technical support to obtain the kafka-metastore-listener-<version>-all.jar file, then install it as described in Step 1.

  • A Kafka cluster is reachable from both the Hive Metastore and the SynxDB Cloud cluster. Both PLAINTEXT and SASL_PLAINTEXT security protocols are supported. When SASL_PLAINTEXT is used, the supported SASL mechanism is SCRAM-SHA-256.

  • The HDFS connection and Hive Connector are already configured in the DBaaS Admin Console. Complete Configure an HDFS connection and Configure a Hive connection first.

  • A Hive Meta Sync profile is available on the Profile page. If none exists, create one before proceeding.
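
The Kafka reachability prerequisite can be checked with a plain TCP probe run from a host inside the relevant network. This sketch only verifies that the broker port accepts connections; it does not test Kafka-level authentication:

```python
import socket

def broker_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```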

Step 1. Install the listener on the Hive Metastore

You perform this step on the Hive cluster, not in the SynxDB Cloud console.

  1. Place the listener jar in the Hive Metastore classpath (typically $HIVE_HOME/lib/). If clients connect through HiveServer2, install the jar there as well.

  2. Add the following properties to hive-site.xml on every Hive Metastore node:

    <property>
      <name>hive.metastore.event.listeners</name>
      <value>cn.cbdb.apiary.kafka.listener.HiveMetaStoreEventListener</value>
    </property>
    <property>
      <name>cn.cbdb.apiary.kafka.bootstrap.servers</name>
      <value>kafka-host:9092</value>
    </property>
    <property>
      <name>cn.cbdb.apiary.kafka.hdw.catalog.name</name>
      <value>hdw_catalog</value>
    </property>
    <property>
      <name>cn.cbdb.apiary.kafka.hive.cluster.name</name>
      <value>cluster-1</value>
    </property>
    <property>
      <name>cn.cbdb.apiary.kafka.autosync.catalogs</name>
      <value>hive</value>
    </property>
    <property>
      <name>cn.cbdb.apiary.kafka.autosync.databases.hive</name>
      <value>*</value>
    </property>
    <property>
      <name>cn.cbdb.apiary.kafka.sync.catalog.wise</name>
      <value>true</value>
    </property>
    

    For Kafka brokers that require SASL authentication, also add:

    <property>
      <name>cn.cbdb.apiary.kafka.security.protocol</name>
      <value>SASL_PLAINTEXT</value>
    </property>
    <property>
      <name>cn.cbdb.apiary.kafka.sasl.mechanism</name>
      <value>SCRAM-SHA-256</value>
    </property>
    <property>
      <name>cn.cbdb.apiary.kafka.sasl.jaas.config</name>
      <value>org.apache.kafka.common.security.scram.ScramLoginModule required username="kafka-user" password="kafka-password";</value>
    </property>
    

    Property reference:

    • hive.metastore.event.listeners: Fully qualified class name of the SynxDB Cloud listener. Must be cn.cbdb.apiary.kafka.listener.HiveMetaStoreEventListener.

    • cn.cbdb.apiary.kafka.bootstrap.servers: Kafka broker address or addresses, comma-separated.

    • cn.cbdb.apiary.kafka.hdw.catalog.name: Catalog identifier used as the topic prefix. Must match hdw.catalog.name in the Meta Sync configuration.

    • cn.cbdb.apiary.kafka.hive.cluster.name: Logical name of this Hive cluster.

    • cn.cbdb.apiary.kafka.autosync.catalogs: Comma-separated list of Hive catalogs to publish. Each value in this list must also appear as hive.catalog.name in the Meta Sync configuration that consumes from it.

    • cn.cbdb.apiary.kafka.autosync.databases.<catalog>: Comma-separated list of Hive databases to publish for the given catalog. Use * to publish all databases.

    • cn.cbdb.apiary.kafka.sync.catalog.wise: Whether a separate topic is used per catalog. Must match sync.catalog.wise in the Meta Sync configuration.

    • cn.cbdb.apiary.kafka.security.protocol: PLAINTEXT or SASL_PLAINTEXT.

    • cn.cbdb.apiary.kafka.sasl.mechanism: SCRAM-SHA-256 when SASL is used.

    • cn.cbdb.apiary.kafka.sasl.jaas.config: JAAS login string for the Kafka client used by the listener.

  3. Restart the Hive Metastore process to load the listener. To confirm the listener is publishing events, run CREATE TABLE in Hive and check that an event appears in the matching Kafka topic:

    kafka-console-consumer.sh \
        --bootstrap-server <broker-address> \
        --topic <hdw.catalog.name>_fdb-<catalog>_hms \
        --from-beginning --max-messages 1
    

Step 2. Prepare the target database

The Meta Sync component writes foreign tables into a designated database in SynxDB Cloud. You need to create this database and provision a foreign server inside it before any sync events arrive.

  1. Create the target database. Run the following statement from the DBaaS User Console worksheet or through psql. The database name must match the value you set as hdw.database in the Meta Sync configuration.

    CREATE DATABASE hivedb;
    
  2. Create the foreign server. Switch to the target database and create the foreign server that Meta Sync uses:

    \c hivedb
    
    SELECT public.create_foreign_server(
        '__hive_auto_sync_server',     -- exact name required by Meta Sync; do not change
        'gpadmin',                      -- existing role for the initial user mapping
        'datalake_fdw',                 -- foreign data wrapper
        'hdfs-cluster-1'                -- must match hdfs.gp.name in the Meta Sync configuration
    );
    

    Note

    The Meta Sync component requires the server name __hive_auto_sync_server. Use this exact name.
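
    To confirm the server exists, you can optionally query the standard pg_foreign_server catalog:

    SELECT srvname FROM pg_foreign_server
    WHERE srvname = '__hive_auto_sync_server';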

  3. Grant privileges to the Meta Sync database user. When the Meta Sync pod connects to SynxDB Cloud, it logs in as a database user that belongs to the account you will select in Step 4. Look up that user in the DBaaS Admin Console under Organizations > your organization > your account > Users; the username is shown in the Name column (for example, 123123).

    In the target database, grant privileges to that user and create the user mapping. The example uses <sync_user> as a placeholder; substitute the real username:

    GRANT USAGE ON FOREIGN SERVER __hive_auto_sync_server TO "<sync_user>";
    GRANT ALL   ON SCHEMA public                          TO "<sync_user>";
    CREATE USER MAPPING FOR "<sync_user>" SERVER __hive_auto_sync_server;
    

    Warning

    Double-quote the username in SQL to avoid identifier parsing errors.
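
    To confirm the grant and mapping took effect, you can optionally query the standard pg_user_mappings view:

    SELECT srvname, usename FROM pg_user_mappings
    WHERE srvname = '__hive_auto_sync_server';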

Step 3. Access the Hive Meta Sync tab

  1. Log in to the DBaaS Admin Console.

  2. In the left navigation pane, click Database Config.

  3. Click the Hive Meta Sync tab at the top of the page. This page lists all current Hive Metadata Auto Sync configurations.

Step 4. Create a configuration (basic information)

  1. Click + Create in the upper-right corner of the list.

  2. In the Basic Information step, provide the following details:

    • Organization: Select the organization for this configuration.

    • Account: Select the account for this configuration.

    • Service Configuration Template: Select the appropriate template (for example, Hive Meta Sync Template).

  3. Click Next.

Step 5. Configure sync parameters

  1. Select a Profile for the Hive Meta Sync component. The profile determines the resource allocation (CPU, memory, and storage) for the sync service. This field is required.

  2. In the Hive Meta Sync Content input area, choose Manual Input and provide the YAML body. The body must not be indented as a whole: top-level keys start at column 0 and nested entries keep only their relative indentation. See the YAML formatting caveat below.

    Template for a PLAINTEXT Kafka broker:

    bootstrap.servers:
      - kafka-host:9092
    hdw.catalog.name: hdw_catalog
    security.protocol: PLAINTEXT
    prometheus.port: 15888
    sync.catalog.wise: true
    
    hive.clusters:
      - hive.gp.name: hive-cluster-1
        hive.cluster.name: hive
        hive.catalog.list:
          - hive.catalog.name: hive
            hdw.database: hivedb
            hive.partition.prov_id: 001
            hdw.auth.user:
              - <sync_user>
        hdfs.gp.name: hdfs-cluster-1
    

    Template for a SASL_PLAINTEXT Kafka broker:

    bootstrap.servers:
      - kafka-host:9092
    hdw.catalog.name: hdw_catalog
    security.protocol: SASL_PLAINTEXT
    sasl.mechanism: SCRAM-SHA-256
    sasl.jaas.config: 'org.apache.kafka.common.security.scram.ScramLoginModule required username="kafka-user" password="kafka-password";'
    prometheus.port: 15888
    sync.catalog.wise: true
    
    hive.clusters:
      - hive.gp.name: hive-cluster-1
        hive.cluster.name: hive
        hive.catalog.list:
          - hive.catalog.name: hive
            hdw.database: hivedb
            hive.partition.prov_id: 001
            hdw.auth.user:
              - <sync_user>
        hdfs.gp.name: hdfs-cluster-1
    

    Key parameter descriptions:

    • bootstrap.servers: Kafka broker address or addresses. Must be reachable from inside the SynxDB Cloud cluster network.

    • hdw.catalog.name: Must match cn.cbdb.apiary.kafka.hdw.catalog.name set on the Hive side. Otherwise the consumer subscribes to a topic that no producer writes to.

    • security.protocol, sasl.mechanism, sasl.jaas.config: Authentication settings. Must match the Kafka broker’s configuration.

    • sync.catalog.wise: Must match cn.cbdb.apiary.kafka.sync.catalog.wise on the Hive side.

    • hive.gp.name: Must match the connection name created on the Hive Connector tab (for example, hive-cluster-1).

    • hive.cluster.name: Logical cluster name for display purposes.

    • hive.catalog.name (inner): Must appear in cn.cbdb.apiary.kafka.autosync.catalogs on the Hive side. The topic name is derived as <hdw.catalog.name>_fdb-<hive.catalog.name>_hms.

    • hdw.database: Target database name in SynxDB Cloud. Must already exist (see Step 2).

    • hdw.auth.user: List of SynxDB Cloud users that automatically receive SELECT permission on synchronized schemas.

    • hdfs.gp.name: Must match both the connection name created on the HDFS tab and the hdfsClusterName argument passed to create_foreign_server in Step 2.
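
    Several of these values must agree with the Hive-side listener properties. As an illustration (the helper and its mapping are not part of the product), the matching rules can be expressed as key pairs and checked mechanically:

```python
# Settings that must agree across the two sides: Hive-side hive-site.xml
# property -> Meta Sync YAML key. Illustrative helper, not part of the product.
MATCHED_KEYS = {
    "cn.cbdb.apiary.kafka.hdw.catalog.name": "hdw.catalog.name",
    "cn.cbdb.apiary.kafka.security.protocol": "security.protocol",
    "cn.cbdb.apiary.kafka.sync.catalog.wise": "sync.catalog.wise",
}

def mismatches(hive_props, sync_conf):
    """Return the Hive-side property names whose counterparts differ."""
    return [hive_key for hive_key, sync_key in MATCHED_KEYS.items()
            if str(hive_props.get(hive_key)).lower()
               != str(sync_conf.get(sync_key)).lower()]
```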

  3. Optionally, select an Environment Spec to specify the Kubernetes runtime environment for the sync component.

  4. Click Next.

YAML formatting caveat

Follow these two rules when filling in the Hive Meta Sync Content field:

  • No extra indentation of the block as a whole. Top-level keys, including bootstrap.servers: and hive.clusters:, must start at column 0; nested entries keep their normal relative indentation.

  • Single-quote the sasl.jaas.config value. It contains double quotes, semicolons, and equals signs that can be misparsed without single quotes.

Breaking either rule causes the Meta Sync pod to fail to start with a YAML parse error. If this happens, edit the configuration to satisfy both rules and resubmit.
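
Both rules can be checked mechanically before submitting. The following function is a minimal sketch (illustrative, not part of the console); the whole-block indentation check only inspects the first non-empty line:

```python
def check_sync_content(text):
    """Return a list of rule violations in a Hive Meta Sync YAML body."""
    problems = []
    lines = text.splitlines()
    # Rule 1: top-level keys must start at column 0, so the first
    # non-empty line must not be indented.
    first = next((l for l in lines if l.strip()), "")
    if first and first[0].isspace():
        problems.append("block is indented; top-level keys must start at column 0")
    # Rule 2: the sasl.jaas.config value must be single-quoted.
    for l in lines:
        if l.lstrip().startswith("sasl.jaas.config:"):
            value = l.split(":", 1)[1].strip()
            if not (value.startswith("'") and value.endswith("'")):
                problems.append("sasl.jaas.config value must be single-quoted")
    return problems
```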

Step 6. Preview and submit

  1. In the Configuration Preview step, review the following sections:

    • Basic Information: Confirms the account and service configuration template.

    • Hive Meta Sync Content: Shows the selected profile, environment spec, and a full preview of the sync configuration content.

  2. If everything is correct, click Submit to create the synchronization configuration. SynxDB Cloud provisions the Meta Sync pod automatically, and the pod begins consuming from the configured Kafka topic.

Verify the synchronization

  1. From a Hive client, create a test table in a database listed in cn.cbdb.apiary.kafka.autosync.databases.<catalog>:

    -- in beeline
    CREATE TABLE default.sync_test (id INT, name STRING) STORED AS PARQUET;
    
  2. After a few seconds, list foreign tables in the target SynxDB Cloud database. The new table appears in the public schema:

    -- in psql, connected to hivedb
    \det+
    

    Expected output:

     Schema |   Table   |         Server          | FDW options
    --------+-----------+-------------------------+-------------
     public | sync_test | __hive_auto_sync_server | ...
    
  3. Verify drop synchronization as well:

    -- in beeline
    DROP TABLE default.sync_test;
    

    After a few seconds, \det in hivedb no longer lists the table.

If the expected foreign table does not appear, check the Meta Sync pod log for errors.

Manage sync tasks

After a configuration is submitted, you can manage the corresponding synchronization task from the list view on the Hive Meta Sync tab. The list displays the following columns: ID, Account Name, Status, Created, Active/Deactivate, and Action.

Status

Each Hive Meta Sync task has one of the following statuses:

  • Pending: The task has been created but is not yet running.

  • Running: The task is actively synchronizing Hive metadata.

  • Suspended: The task has been deactivated and is not processing metadata changes.

Available operations

  • Active/Deactivate toggle: Activates or deactivates the sync task. When you deactivate the task, SynxDB Cloud stops synchronizing metadata changes and the status changes to Suspended.

  • Edit: Opens the edit form, where you can modify the profile, sync configuration content, and environment spec. Click Submit to save your changes.

  • Delete: Permanently removes the sync task.