Hive connector with Alluxio
The Hive connector can read and write tables stored in the Alluxio Data
Orchestration System, leveraging Alluxio's distributed block-level
read/write caching functionality. The tables must be created in the Hive
metastore with the alluxio:// location prefix (see Running Apache Hive
with Alluxio for details and examples).
Trino queries will then transparently retrieve and cache files or objects from a variety of disparate storage systems including HDFS and S3.
Alluxio client-side configuration
To configure Alluxio client-side properties on Trino, append the Alluxio
configuration directory (${ALLUXIO_HOME}/conf) to the Trino JVM
classpath, so that the Alluxio properties file alluxio-site.properties
can be loaded as a resource. Update the Trino JVM config file
etc/jvm.config to include the following:
-Xbootclasspath/a:<path-to-alluxio-conf>
The advantage of this approach is that all the Alluxio properties are
set in the single alluxio-site.properties file. For details, see
Customize Alluxio Presto Properties.
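As an illustration, the two files involved might look as follows. This is a minimal sketch rather than a recommended setup: the installation path /opt/alluxio, the host name master-node, and the chosen property values are assumptions, so substitute the values for your own deployment. First, the line added to etc/jvm.config:

```text
-Xbootclasspath/a:/opt/alluxio/conf
```

Then a small alluxio-site.properties in that directory, here setting the master host name and the default write type as example values only:

```text
alluxio.master.hostname=master-node
alluxio.user.file.writetype.default=CACHE_THROUGH
```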
Alternatively, add Alluxio configuration properties to the Hadoop
configuration files (core-site.xml, hdfs-site.xml) and configure the
Hive connector to use the Hadoop configuration files via the
hive.config.resources connector property.
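For example, assuming the Hadoop configuration files live under /etc/hadoop/conf (a hypothetical path), the catalog properties file of the Hive connector could reference them as a comma-separated list:

```text
hive.config.resources=/etc/hadoop/conf/core-site.xml,/etc/hadoop/conf/hdfs-site.xml
```

The referenced files must exist at the same path on every Trino node.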
Deploy Alluxio with Trino
To achieve the best performance running Trino on Alluxio, it is recommended to collocate Trino workers with Alluxio workers. This allows reads and writes to bypass the network (short-circuit). See Performance Tuning Tips for Presto with Alluxio for more details.
Alluxio catalog service
An alternative way for Trino to interact with Alluxio is via the Alluxio catalog service. The primary benefits of using the Alluxio catalog service are simpler deployment of Alluxio with Trino, and schema-aware optimizations such as transparent caching and transformations. Currently, the catalog service supports read-only workloads.
The Alluxio catalog service is a metastore that can cache the information from different underlying metastores. It currently supports the Hive metastore as an underlying metastore. In order for the Alluxio catalog to manage the metadata of other existing metastores, the other metastores must be "attached" to the Alluxio catalog. To attach an existing Hive metastore to the Alluxio catalog, simply use the Alluxio CLI attachdb command. The appropriate Hive metastore location and Hive database name need to be provided.
./bin/alluxio table attachdb hive thrift://HOSTNAME:9083 hive_db_name
Once a metastore is attached, the Alluxio catalog can manage and serve
the information to Trino. To configure the Hive connector for the
Alluxio catalog service, set the connector to use the Alluxio metastore
type, and provide the address of the Alluxio cluster. For example, your
etc/catalog/alluxio.properties should include the following:
connector.name=hive
hive.metastore=alluxio
hive.metastore.alluxio.master.address=HOSTNAME:PORT
Replace HOSTNAME with the Alluxio master hostname, and replace PORT with
the Alluxio master port. An example of an Alluxio master address is
master-node:19998. Now, Trino queries can take advantage of Alluxio
catalog service features, such as transparent caching and transparent
transformations, without any modifications to existing Hive metastore
deployments.