This document describes the asynchronous, WebSockets-based API of BoilingData. JSON messages are sent (published) as per this specification over authenticated WebSocket connection(s), and the client receives (subscribes to) messages that carry JSON query results.
The service also sends query processing information, such as query run times and Lambda container state events, that lets you follow the lifecycle of data and queries if you like.
Building in a JavaScript environment? First, create an account for BoilingData. Your username and password can then be used to log in to the service (AWS Cognito). The resulting AWS credentials are used to sign the BoilingData application WebSockets URL and connect to the service. See the NodeJS/JS SDK as an example.
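For example, once you have AWS credentials from Cognito, the WebSocket URL can be presigned with SigV4 before connecting. The sketch below is a minimal illustration using the AWS JS SDK v3 signing utilities; the endpoint hostname, region, and service name are placeholder assumptions, not the real BoilingData values (the NodeJS/JS SDK does this for you).

// Minimal sketch: SigV4-presign a WebSocket URL with credentials obtained from Cognito.
// NOTE: hostname, region, and service below are placeholder assumptions.
import { SignatureV4 } from "@aws-sdk/signature-v4";
import { HttpRequest } from "@aws-sdk/protocol-http";
import { Sha256 } from "@aws-crypto/sha256-js";
import { formatUrl } from "@aws-sdk/util-format-url";

export async function signWebSocketUrl(credentials: {
  accessKeyId: string;
  secretAccessKey: string;
  sessionToken?: string;
}): Promise<string> {
  const hostname = "<boilingdata-websocket-endpoint>"; // placeholder
  const signer = new SignatureV4({
    credentials,
    region: "eu-west-1",    // assumption
    service: "execute-api", // assumption
    sha256: Sha256,
  });
  const request = new HttpRequest({
    method: "GET",
    protocol: "wss:",
    hostname,
    path: "/",
    headers: { host: hostname },
  });
  const signed = await signer.presign(request, { expiresIn: 60 });
  return formatUrl(signed); // pass this URL to your WebSocket client
}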
As the API is asynchronous, an ID must be assigned to each query, and this ID is returned with all response messages. BoilingData may send multiple responses for each query. Each response carries batch information so that the API consumer can collect all response batches and declare the query finished (please note that multiple identical batches may be received). Communication is done using SQL. For instance:
-- Show user's configuration metadata
SELECT * FROM boilingdata;
-- (optional) Based on the configuration metadata, create an IAM Role on your AWS
-- Account that BoilingData can assume
PRAGMA s3AccessRoleArn='arn:aws:iam::123456789012:role/bdS3';
-- List accessible S3 Buckets and their contents
-- (optional) BoilingData uses the IAM Role to read your S3 Bucket(s)
SELECT * FROM list('s3://');
SELECT * FROM list('s3://mybucket/');
-- List BoilingData specific PRAGMAs, like the s3AccessRoleArn
-- NOTE: The list does not contain real in-use values, only examples!
SELECT * FROM pragmas;
-- Get all shared data sets from/to you
SELECT * FROM boilingshares;
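To make the request/response flow concrete, here is a minimal client-side sketch in TypeScript. It assumes you already have a signed WebSocket URL (see the SDK), uses the "ws" package, and sends the SQL_QUERY message format documented below; the helper name is our own.

// Minimal sketch: send a SQL_QUERY message and correlate responses by requestId.
import WebSocket from "ws";
import { randomUUID } from "crypto";

// signedUrl: the SigV4-signed WebSocket URL (see the signing sketch above / the SDK).
export function runQuery(signedUrl: string, sql: string): void {
  const ws = new WebSocket(signedUrl);
  const requestId = randomUUID();

  ws.on("open", () => {
    ws.send(JSON.stringify({ messageType: "SQL_QUERY", sql, requestId }));
  });

  ws.on("message", (raw) => {
    const msg = JSON.parse(raw.toString());
    if (msg.requestId !== requestId) return; // responses are correlated by requestId
    console.log(msg.messageType, msg);
  });
}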
You can also get the schema of your response data by prepending DESCRIBE to a query that produces data. In that case we return the schema rows instead of the data rows. Note that metadata queries currently do not support DESCRIBE.
DESCRIBE SELECT * FROM parquet_scan('s3://boilingdata-demo/demo.parquet');
Parquet files have first-class support in BoilingData and are accessed with parquet_scan(). CSV files are read with read_csv_auto().
The data in your S3 Buckets can be queried directly in-place, or you can query data sets shared to you, even without any AWS access of your own. For accessing your own S3 Bucket(s), an IAM Role and Policy need to be created on your AWS Account, assumable by the BoilingData AWS Account and requiring the externalId to be set to your BoilingData account's externalId parameter. In other words, the IAM Role should be assumable from the BoilingData AWS Account only if the externalId parameter matches your BoilingData externalId parameter.
As the IAM role is created by you, you always control the access to your data (via the IAM role permissions). BoilingData will use the externalId parameter, populated with a hash value based on your Cognito username (i.e. unique to your username), in the sts:AssumeRole API call when assuming your IAM role.
Here is an example trust policy that you need to provide for the IAM Role. Replace the placeholders with real values you get from the BoilingData API (see below).
Note that BDCLI can do all this for you, such as creating the IAM Role with the help of your Boiling account details.
{
"Version": "2012-10-17",
"Statement": {
"Effect": "Allow",
"Principal": {
"AWS": "AWS_ACCOUNT_ID"
},
"Action": "sts:AssumeRole",
"Condition": {
"StringEquals": {
"sts:ExternalId": "BOILINGDATA_EXTERNAL_ID"
}
}
}
}
The BoilingData service AWS Account ID and your own externalId parameter are available in the app via the API.
SELECT * FROM boilingdata;
The response looks like this (not real values):
{
"awsAccountId": "589434896614",
"externalId": "MjEzNDZiZjItNmMzMS00Y2FmLThlN2UtOTgzMjIwNWZmZGFhCg==",
}
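If you prefer not to use BDCLI, the IAM Role can also be created programmatically with these two values. A hedged sketch with the AWS JS SDK v3 follows; the role name is an arbitrary example and must match the ARN you later set with PRAGMA s3AccessRoleArn.

// Sketch: create the cross-account IAM Role using the awsAccountId and externalId
// returned by SELECT * FROM boilingdata;. Role name "bdS3" is an example only.
import { IAMClient, CreateRoleCommand } from "@aws-sdk/client-iam";

export async function createBoilingDataRole(awsAccountId: string, externalId: string) {
  const iam = new IAMClient({});
  const trustPolicy = {
    Version: "2012-10-17",
    Statement: {
      Effect: "Allow",
      Principal: { AWS: awsAccountId },
      Action: "sts:AssumeRole",
      Condition: { StringEquals: { "sts:ExternalId": externalId } },
    },
  };
  return iam.send(new CreateRoleCommand({
    RoleName: "bdS3",
    AssumeRolePolicyDocument: JSON.stringify(trustPolicy),
  }));
}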
To give your newly created IAM role permission to access your files on S3, an IAM Policy will need to be created.
Note that BDCLI can do all this for you, such as creating the IAM Role with the help of your Boiling account details and a YAML configuration file you created. It also supports multiple profiles, so you can configure multiple users at the same time.
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "BoilingData0",
"Effect": "Allow",
"Action": [
"s3:PutObject",
"s3:GetObject",
"s3:ListBucket",
"s3:GetBucketLocation",
"s3:GetBucketRequestPayment"
],
"Resource": [
"arn:aws:s3:::BUCKET-NAME/*",
"arn:aws:s3:::BUCKET-NAME"
]
},
{
"Sid": "BoilingData1",
"Effect": "Allow",
"Action": "s3:ListAllMyBuckets",
"Resource": "*"
}
]
}
(replace BUCKET-NAME with your bucket name)
A production environment will ideally have the most restrictive permissions. In this case, the S3 object paths will be known in advance, and traversing buckets or uploading is not needed. The minimum required permissions to allow querying of a known S3 object are s3:GetObject, s3:GetBucketLocation, and s3:GetBucketRequestPayment.
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "BoilingData0",
"Effect": "Allow",
"Action": [
"s3:GetObject",
"s3:GetBucketLocation",
"s3:GetBucketRequestPayment"
],
"Resource": [
"arn:aws:s3:::BUCKET-NAME",
"arn:aws:s3:::BUCKET-NAME/*"
]
}
]
}
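The policy can likewise be attached as an inline policy of the role programmatically. A sketch (the role and policy names are illustrative):

// Sketch: attach the minimal S3 read policy as an inline policy of the role created above.
import { IAMClient, PutRolePolicyCommand } from "@aws-sdk/client-iam";

export async function attachS3ReadPolicy(bucketName: string) {
  const iam = new IAMClient({});
  const policy = {
    Version: "2012-10-17",
    Statement: [{
      Sid: "BoilingData0",
      Effect: "Allow",
      Action: ["s3:GetObject", "s3:GetBucketLocation", "s3:GetBucketRequestPayment"],
      Resource: [`arn:aws:s3:::${bucketName}`, `arn:aws:s3:::${bucketName}/*`],
    }],
  };
  return iam.send(new PutRolePolicyCommand({
    RoleName: "bdS3",                // example name, see the trust policy above
    PolicyName: "BoilingDataS3Read", // illustrative name
    PolicyDocument: JSON.stringify(policy),
  }));
}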
We have added some additional PRAGMA statements on top of standard SQL support, for instance:
PRAGMA s3AccessRoleArn='arn:aws:iam::123456789012:role/bdS3';
This PRAGMA can be used to set the IAM assume role ARN for accessing S3 from the BoilingData service. The setting is persisted on the service side, so it needs to be set only once.
All of the custom BoilingData PRAGMA statements can be queried with
SELECT * FROM pragmas;
Send SQL queries.
With this API call you send the SQL queries, including any PRAGMA statements.
Accepts the following message:
User/client sends SQL queries with this message
BoilingData PRAGMA to set IAM role for accessing your data on your S3 Data Lake.
{
"messageType": "SQL_QUERY",
"sql": "PRAGMA s3AccessRoleArn='arn:aws:iam::123456789012:role/bdS3';",
"requestId": "reqId1"
}
Special function to list S3 bucket(s)
{
"messageType": "SQL_QUERY",
"sql": "SELECT * FROM list('s3://');",
"requestId": "reqId2"
}
Special function (non-recursive) to list S3 bucket contents when bucket is defined.
{
"messageType": "SQL_QUERY",
"sql": "SELECT * FROM list('s3://myBucket/');",
"requestId": "reqId3"
}
parquet_scan function to access Parquet files on S3.
{
"messageType": "SQL_QUERY",
"sql": "SELECT COUNT(*) FROM parquet_scan('s3://boilingdata-demo/demo2.parquet');",
"requestId": "reqId4",
"readCache": "NONE",
"tags": [
{
"name": "CostCenter",
"value": "930"
},
{
"name": "ProjectId",
"value": "Top secret Area 53"
}
]
}
Include the additional key "keys", which is a list of S3 URLs including the s3:// prefix and full key path.
{
"messageType": "SQL_QUERY",
"sql": "SELECT * FROM duckdb('s3://KEY','referers') LIMIT 1;",
"keys": [
"s3://boilingdata-demo/referers.1.duckdb.zst",
"s3://boilingdata-demo/referers.49.duckdb.zst"
],
"requestId": "reqId5"
}
Include the additional key "keys", which is a list of Glue Tables with the "glue." prefix.
{
"messageType": "SQL_QUERY",
"sql": "SELECT * FROM parquet_scan('s3://KEY','referers') LIMIT 1;",
"keys": [
"glue.default.nyctaxis1",
"glue.default.nyctaxis2"
],
"requestId": "reqId6"
}
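For reference, the shape of the SQL_QUERY message as inferred from the examples above can be typed roughly as follows; the optionality of the fields is our interpretation, not an authoritative schema.

// Approximate TypeScript shape of the SQL_QUERY message, inferred from the examples above.
interface SqlQueryMessage {
  messageType: "SQL_QUERY";
  sql: string;
  requestId: string;
  readCache?: string;                       // e.g. "NONE"
  tags?: { name: string; value: string }[]; // optional tagging metadata
  keys?: string[];                          // S3 URLs or "glue."-prefixed Glue Tables
}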
Query response
Query responses, at least one for each requestId.
Accepts the following message:
These messages carry the actual response data for the query. Responses may come in multiple batches and sub batches to match with the distributed streaming design.
{
"messageType": "DATA",
"requestId": "test1",
"batchSerial": 1,
"totalBatches": 1024,
"subBatchSerial": 3,
"totalSubBatches": 16,
"data": [
{
"VendorID": 1,
"tpep_pickup_datetime": 1556669690000,
"tpep_dropoff_datetime": 1556669808000,
"passenger_count": 1,
"trip_distance": 0,
"RatecodeID": 1,
"store_and_fwd_flag": "N",
"PULocationID": 145,
"DOLocationID": 145,
"payment_type": 2,
"fare_amount": 3,
"extra": 0.5,
"mta_tax": 0.5,
"tip_amount": 0,
"tolls_amount": 0,
"improvement_surcharge": 0.3,
"total_amount": 4.3,
"congestion_surcharge": 0
},
{
"VendorID": 1,
"tpep_pickup_datetime": 1556670954000,
"tpep_dropoff_datetime": 1556671047000,
"passenger_count": 1,
"trip_distance": 1.5,
"RatecodeID": 1,
"store_and_fwd_flag": "N",
"PULocationID": 145,
"DOLocationID": 145,
"payment_type": 2,
"fare_amount": 3,
"extra": 0.5,
"mta_tax": 0.5,
"tip_amount": 0,
"tolls_amount": 0,
"improvement_surcharge": 0.3,
"total_amount": 4.3,
"congestion_surcharge": 0
}
]
}
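One way to decide when a query has finished is to track which (batchSerial, subBatchSerial) pairs have arrived and compare against totalBatches and totalSubBatches, discarding duplicate batches. A sketch, assuming the totals stay constant across the batches of one request:

// Sketch: collect DATA batches per requestId and detect completion.
// Duplicate batches (same batchSerial/subBatchSerial) are counted only once.
interface DataMessage {
  messageType: "DATA";
  requestId: string;
  batchSerial: number;
  totalBatches: number;
  subBatchSerial: number;
  totalSubBatches: number;
  data: Record<string, unknown>[];
}

const seenBatches = new Map<string, Set<string>>();              // requestId -> received batch keys
const collectedRows = new Map<string, Record<string, unknown>[]>();

export function onData(msg: DataMessage): boolean {
  const key = `${msg.batchSerial}/${msg.subBatchSerial}`;
  const batches = seenBatches.get(msg.requestId) ?? new Set<string>();
  if (batches.has(key)) return false;                            // identical batch received again, ignore
  batches.add(key);
  seenBatches.set(msg.requestId, batches);
  collectedRows.set(msg.requestId, [...(collectedRows.get(msg.requestId) ?? []), ...msg.data]);
  // Finished when every batch/sub-batch combination has been seen.
  return batches.size >= msg.totalBatches * msg.totalSubBatches;
}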
Query processing information events
You get query processing information events from the service.
Accepts the following message:
Information about query progress and processing
{
"messageType": "INFO",
"requestId": "test1",
"info": [
{
"name": "PROCESSING_SQL",
"value": "SELECT 1;"
}
]
}
{
"messageType": "INFO",
"requestId": "test1",
"info": [
{
"name": "DATA_LOAD_MS",
"value": "71"
}
]
}
{
"messageType": "INFO",
"requestId": "test1",
"info": [
{
"name": "QUERY_TIME_MS",
"value": "31"
}
]
}
{
"messageType": "INFO",
"requestId": "test1",
"info": [
{
"name": "LAMBDA_TOTAL_EXECUTION_TIME_ESTIMATE_MS",
"value": "97"
}
]
}
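You could, for example, collect these name/value pairs per requestId to report end-to-end timings. A sketch based on the event names shown above:

// Sketch: accumulate INFO events (e.g. DATA_LOAD_MS, QUERY_TIME_MS) per requestId.
interface InfoMessage {
  messageType: "INFO";
  requestId: string;
  info: { name: string; value: string }[];
}

const queryStats = new Map<string, Record<string, string>>();

export function onInfo(msg: InfoMessage): void {
  const entry = queryStats.get(msg.requestId) ?? {};
  for (const { name, value } of msg.info) entry[name] = value;
  queryStats.set(msg.requestId, entry);
  // e.g. queryStats.get("test1") -> { PROCESSING_SQL: "SELECT 1;", QUERY_TIME_MS: "31", ... }
}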
Information events about Lambda Assured Warm Concurrency
With these events you can follow what happens with the Lambda containers behind the scenes. These events reflect the state of the warm Lambdas, how many there are and their lifecycles.
Accepts the following message:
Information about Lambda containers, their lifecycles, and related hot datasets.
{
"messageType": "LAMBDA_EVENT",
"lambdaEvent": {
"instanceId": "1638791651869__808",
"username": "aac5c1d9-a0a9-4855-b896-0f3998b2f16b",
"dataset": "aac5c1d9-a0a9-4855-b896-0f3998b2f16b__s3://myBucket/demo.parquet",
"status": "warm"
}
}
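If you want to visualize the warm pool, a simple tracker keyed by instanceId is enough. A sketch based on the event shape above:

// Sketch: keep the latest status per Lambda instance, keyed by instanceId.
interface LambdaEventMessage {
  messageType: "LAMBDA_EVENT";
  lambdaEvent: { instanceId: string; username: string; dataset: string; status: string };
}

const warmLambdas = new Map<string, LambdaEventMessage["lambdaEvent"]>();

export function onLambdaEvent(msg: LambdaEventMessage): void {
  warmLambdas.set(msg.lambdaEvent.instanceId, msg.lambdaEvent);
  console.log(`warm lambdas: ${warmLambdas.size}`);
}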
General logging messages
Logging messages with varying levels.
Accepts the following message:
General logging information. May also be unrelated to query.
{
"messageType": "LOG_MESSAGE",
"logLevel": "ERROR",
"logMessage": "string",
"requestId": "string"
}
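Putting it together, a single message handler can dispatch on messageType. A sketch that reuses the handlers sketched above (onData, onInfo, onLambdaEvent):

// Sketch: dispatch incoming WebSocket messages on messageType.
// onData / onInfo / onLambdaEvent refer to the sketches earlier in this document.
export function onMessage(raw: string): void {
  const msg = JSON.parse(raw);
  switch (msg.messageType) {
    case "DATA":
      if (onData(msg)) console.log(`query ${msg.requestId} finished`);
      break;
    case "INFO":
      onInfo(msg);
      break;
    case "LAMBDA_EVENT":
      onLambdaEvent(msg);
      break;
    case "LOG_MESSAGE":
      console.log(`[${msg.logLevel}] ${msg.logMessage} (${msg.requestId})`);
      break;
    default:
      console.warn("unknown messageType", msg.messageType);
  }
}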