Skip to main content

Elasticsearch connector

elasticsearch.png

The Elasticsearch Connector allows access to Elasticsearch data from Trino. This document describes how to setup the Elasticsearch Connector to run SQL queries against Elasticsearch.

note

Elasticsearch 6.0.0 or later is required.

Configuration

To configure the Elasticsearch connector, create a catalog properties file etc/catalog/elasticsearch.properties with the following contents, replacing the properties as appropriate:

connector.name=elasticsearch
elasticsearch.host=localhost
elasticsearch.port=9200
elasticsearch.default-schema-name=default

Configuration properties

Elasticsearch configuration properties
Property nameDescriptionDefault
elasticsearch.hostHostname of the Elasticsearch node to connect to. This property is required.
elasticsearch.portPort of the Elasticsearch node to connect to.9200
elasticsearch.default-schema-nameThe schema that contains all tables defined without a qualifying schema name.default
elasticsearch.scroll-sizeSets the maximum number of hits that can be returned with each Elasticsearch scroll request.1000
elasticsearch.scroll-timeoutAmount of time Elasticsearch keeps the search context alive for scroll requests.1m
elasticsearch.request-timeoutTimeout value for all Elasticsearch requests.10s
elasticsearch.connect-timeoutTimeout value for all Elasticsearch connection attempts.1s
elasticsearch.backoff-init-delayThe minimum duration between backpressure retry attempts for a single request to Elasticsearch. Setting it too low might overwhelm an already struggling ES cluster.500ms
elasticsearch.backoff-max-delayThe maximum duration between backpressure retry attempts for a single request to Elasticsearch.20s
elasticsearch.max-retry-timeThe maximum duration across all retry attempts for a single request to Elasticsearch.20s
elasticsearch.node-refresh-intervalHow often the list of available Elasticsearch nodes is refreshed.1m
elasticsearch.ignore-publish-addressDisables using the address published by Elasticsearch to connect for queries.

TLS security

The Elasticsearch connector provides additional security options to support Elasticsearch clusters that have been configured to use TLS.

If your cluster has globally-trusted certificates, you should only need to enable TLS. If you require custom configuration for certificates, the connector supports key stores and trust stores in PEM or Java Key Store (JKS) format.

The allowed configuration values are:

TLS Security Properties
Property nameDescription
elasticsearch.tls.enabledEnables TLS security.
elasticsearch.tls.keystore-pathThe path to the PEM or JKS key store. This file must be readable by the operating system user running Trino.
elasticsearch.tls.truststore-pathThe path to PEM or JKS trust store. This file must be readable by the operating system user running Trino.
elasticsearch.tls.keystore-passwordThe key password for the key store specified byelasticsearch.tls.keystore-path.
elasticsearch.tls.truststore-passwordThe key password for the trust store specified byelasticsearch.tls.truststore-path.

Data types

The data type mappings are as follows:

Primitive types

ElasticsearchTrino
binaryVARBINARY
booleanBOOLEAN
doubleDOUBLE
floatREAL
byteTINYINT
shortSMALLINT
integerINTEGER
longBIGINT
keywordVARCHAR
textVARCHAR
dateTIMESTAMP
ipIPADDRESS
(all others)(unsupported)

Array types

Fields in Elasticsearch can contain zero or more values , but there is no dedicated array type. To indicate a field contains an array, it can be annotated in a Trino-specific structure in the _meta section of the index mapping.

For example, you can have an Elasticsearch index that contains documents with the following structure:

{
"array_string_field": ["trino","the","lean","machine-ohs"],
"long_field": 314159265359,
"id_field": "564e6982-88ee-4498-aa98-df9e3f6b6109",
"timestamp_field": "1987-09-17T06:22:48.000Z",
"object_field": {
"array_int_field": [86,75,309],
"int_field": 2
}
}

The array fields of this structure can be defined by using the following command to add the field property definition to the _meta.trino property of the target index mapping.

curl --request PUT \
--url localhost:9200/doc/_mapping \
--header 'content-type: application/json' \
--data '
{
"_meta": {
"trino":{
"array_string_field":{
"isArray":true
},
"object_field":{
"array_int_field":{
"isArray":true
}
},
}
}
}'
note

It is not allowed to use asRawJson and isArray flags simultaneously for the same column.

Date types

Elasticsearch supports a wide array of date formats including built-in date formats and also custom date formats. The Elasticsearch connector supports only the default date type. All other date formats including built-in date formats and custom date formats are not supported. Dates with the format property are ignored.

Raw JSON transform

There are many occurrences where documents in Elasticsearch have more complex structures that are not represented in the mapping. For example, a single keyword field can have widely different content including a single keyword value, an array, or a multidimensional keyword array with any level of nesting.

curl --request PUT \
--url localhost:9200/doc/_mapping \
--header 'content-type: application/json' \
--data '
{
"properties": {
"array_string_field":{
"type": "keyword"
}
}
}'

Notice for the array_string_field that all the following documents are legal for Elasticsearch. See the Elasticsearch array documentation for more details.

[
{
"array_string_field": "trino"
},
{
"array_string_field": ["trino","is","the","besto"]
},
{
"array_string_field": ["trino",["is","the","besto"]]
},
{
"array_string_field": ["trino",["is",["the","besto"]]]
}
]

Further, Elasticsearch supports types, such as dense_vector, that are not supported in Trino. New types are constantly emerging which can cause parsing exceptions for users that use of these types in Elasticsearch. To manage all of these scenarios, you can transform fields to raw JSON by annotating it in a Trino-specific structure in the _meta section of the index mapping. This indicates to Trino that the field, and all nested fields beneath, need to be cast to a VARCHAR field that contains the raw JSON content. These fields can be defined by using the following command to add the field property definition to the _meta.presto property of the target index mapping.

curl --request PUT \
--url localhost:9200/doc/_mapping \
--header 'content-type: application/json' \
--data '
{
"_meta": {
"presto":{
"array_string_field":{
"asRawJson":true
}
}
}
}'

This preceding configurations causes Trino to return the array_string_field field as a VARCHAR containing raw JSON. You can parse these fields with the built-in JSON functions.

note

It is not allowed to use asRawJson and isArray flags simultaneously for the same column.

Special columns

The following hidden columns are available:

ColumnDescription
_idThe Elasticsearch document ID
_scoreThe document score returned by the Elasticsearch query
_sourceThe source of the original document

Full text queries

Trino SQL queries can be combined with Elasticsearch queries by providing the full text query as part of the table name, separated by a colon. For example:

SELECT * FROM "tweets: +trino SQL^2"

Predicate push down

The connector supports predicate push down of below data types:

ElasticsearchTrinoSupports
binaryVARBINARYNO
booleanBOOLEANYES
doubleDOUBLEYES
floatREALYES
byteTINYINTYES
shortSMALLINTYES
integerINTEGERYES
longBIGINTYES
keywordVARCHARYES
textVARCHARNO
dateTIMESTAMPYES
ipIPADDRESSNO
(all others)(unsupported)(unsupported)

Pass-through queries

The Elasticsearch connector allows you to embed any valid Elasticsearch query, that uses the Elasticsearch Query DSL in your SQL query.

The results can then be used in any SQL statement, wrapping the Elasticsearch query. The syntax extends the syntax of the enhanced Elasticsearch table names with the following:

SELECT * FROM es.default."<index>$query:<es-query>"

The Elasticsearch query string es-query is base32-encoded to avoid having to deal with escaping quotes and case sensitivity issues in table identifiers.

The result of these query tables is a table with a single row and a single column named result of type VARCHAR. It contains the JSON payload returned by Elasticsearch, and can be processed with the built-in JSON functions.

AWS authorization

To enable AWS authorization using IAM policies, the elasticsearch.security option needs to be set to AWS. Additionally, the following options need to be configured appropriately:

Property NameDescription

elasticsearch.aws.region

AWS region or the Elasticsearch endpoint. This option is required.

elasticsearch.aws.access-key

AWS access key to use to connect to the Elasticsearch domain. If not set, the Default AWS Credentials Provider chain will be used.

elasticsearch.aws.secret-key

AWS secret key to use to connect to the Elasticsearch domain. If not set, the Default AWS Credentials Provider chain will be used.

elasticsearch.aws.iam-role

Optional ARN of an IAM Role to assume to connect to the Elasticsearch domain. Note: the configured IAM user has to be able to assume this role.

elasticsearch.aws.external-idOptional external ID to pass while assuming an AWS IAM Role.

Password authentication

To enable password authentication, the elasticsearch.security option needs to be set to PASSWORD. Additionally the following options need to be configured appropriately:

Property NameDescription
elasticsearch.auth.userUser name to use to connect to Elasticsearch.
elasticsearch.auth.passwordPassword to use to connect to Elasticsearch.

SQL support

The connector provides globally available and read operation statements to access data and metadata in the Elasticsearch catalog.