BoilingData API v1.0.1

Introduction

This document describes the asynchronous, WebSocket-based API of BoilingData. JSON messages are sent (published) as per this specification over an authenticated WebSocket connection, and the client receives (subscribes to) messages that carry JSON query results.

The service also sends query processing information, such as query run times and Lambda container state events, which let you follow the lifecycle of your data and queries if you like.

Building in a JavaScript environment?
Check out our JS/TS SDK - @boilingdata/node-boilingdata on GitHub.

Registering with BoilingData

First, create an account for BoilingData. Your username and password are then used to log in to the service (AWS Cognito), and the resulting AWS credentials are used to sign the BoilingData application WebSocket URL and connect to the service. See the Node.js/JS SDK for an example.

Metadata Queries

As the API is asynchronous, an ID must be assigned to each query, and this ID is returned with every response message. BoilingData may send multiple responses for a single query. Each response carries batch information so that the API consumer can collect all response batches and declare the query finished (note that multiple identical batches may be received). Communication is done using SQL. For instance:

-- Show user's configuration metadata
SELECT * FROM boilingdata;
-- (optional) Based on the configuration metadata, create an IAM Role on your
-- AWS Account that BoilingData can assume
PRAGMA s3AccessRoleArn='arn:aws:iam::123456789012:role/bdS3';
-- List accessible S3 Buckets and their contents
-- (optional) BoilingData uses the IAM Role to read your S3 Bucket(s)
SELECT * FROM list('s3://');
SELECT * FROM list('s3://mybucket/');
-- List BoilingData specific PRAGMAs, like the s3AccessRoleArn
-- NOTE: The list does not contain real in-use values, only examples!
SELECT * FROM pragmas;
-- Get all shared data sets from/to you
SELECT * FROM boilingshares;
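The batch bookkeeping described above can be sketched in client code. Note that the field names used here (requestId, batchSerial, totalBatchSerials, data) are illustrative assumptions, not the authoritative message schema; consult the DataResponse message definition for the real shape.

```typescript
// Minimal sketch of collecting asynchronous response batches per query.
// Field names are illustrative assumptions, not the real schema.
interface DataResponse {
  requestId: string;
  batchSerial: number;       // 1-based index of this batch
  totalBatchSerials: number; // total number of batches for this query
  data: unknown[];
}

class BatchCollector {
  private batches = new Map<string, Map<number, unknown[]>>();

  // Returns all rows once every batch has arrived, otherwise undefined.
  // Duplicate batches (same requestId + batchSerial) are silently ignored.
  collect(msg: DataResponse): unknown[] | undefined {
    let perQuery = this.batches.get(msg.requestId);
    if (!perQuery) {
      perQuery = new Map();
      this.batches.set(msg.requestId, perQuery);
    }
    if (!perQuery.has(msg.batchSerial)) perQuery.set(msg.batchSerial, msg.data);
    if (perQuery.size === msg.totalBatchSerials) {
      this.batches.delete(msg.requestId);
      // Concatenate batches in serial order.
      return [...perQuery.entries()]
        .sort(([a], [b]) => a - b)
        .flatMap(([, d]) => d);
    }
    return undefined;
  }
}
```

The deduplication via the per-query Map is what makes the "multiple identical batches may be received" guarantee safe to consume.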

You can also get the schema of your response data by prefixing a data-producing query with DESCRIBE. In that case we return the schema rows instead of the data rows. Note that metadata queries do not currently support DESCRIBE.

DESCRIBE SELECT * FROM parquet_scan('s3://boilingdata-demo/demo.parquet');

Reading various data formats

Parquet files have first-class support on BoilingData and are read with parquet_scan(); CSV files are read with read_csv_auto().

Accessing your Data with BoilingData

Use BDCLI for configuring S3 sandboxes
Check out our @boilingdata/boilingdata-bdcli on GitHub.

The data in your S3 Buckets can be queried directly in place, or you can query data sets shared with you, even without any AWS access of your own. To access your own S3 Bucket(s), an IAM Role and Policy need to be created on your AWS Account that can be assumed by the BoilingData AWS Account, with the externalId condition set to your BoilingData account's externalId parameter. In other words, the IAM Role should be assumable from the BoilingData AWS Account only when the externalId parameter matches your BoilingData externalId.

As the IAM Role is created by you, you always control access to your data (via the IAM Role's permissions). BoilingData passes the externalId parameter in the sts:AssumeRole API call when assuming your IAM Role; the value is a hash derived from your Cognito username (i.e. unique to your username).

Here is an example trust policy that you need to provide for the IAM Role. Replace the placeholders with real values you get from the BoilingData API (see below).

Note that BDCLI can do all of this for you, such as creating the IAM Role with the help of your Boiling account details.

{
  "Version": "2012-10-17",
  "Statement": {
    "Effect": "Allow",
    "Principal": {
      "AWS": "AWS_ACCOUNT_ID"
    },
    "Action": "sts:AssumeRole",
    "Condition": {
      "StringEquals": {
        "sts:ExternalId": "BOILINGDATA_EXTERNAL_ID"
      }
    }
  }
}

The BoilingData service AWS Account ID and your own externalId parameter are available in the app via the API.

SELECT * FROM boilingdata;

The response looks like this (not real values):

{
  "awsAccountId": "589434896614",
  "externalId": "MjEzNDZiZjItNmMzMS00Y2FmLThlN2UtOTgzMjIwNWZmZGFhCg=="
}
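If you assemble the trust policy yourself instead of using BDCLI, the two values above slot into the placeholders of the trust policy shown earlier. A minimal sketch, assuming a hypothetical helper of our own (buildTrustPolicy is not part of any SDK):

```typescript
// Build the IAM trust policy from the values returned by
// `SELECT * FROM boilingdata;` (awsAccountId, externalId).
// buildTrustPolicy is a hypothetical helper for illustration only.
interface BoilingMetadata {
  awsAccountId: string;
  externalId: string;
}

function buildTrustPolicy(meta: BoilingMetadata) {
  return {
    Version: "2012-10-17",
    Statement: {
      Effect: "Allow",
      // Trust only the BoilingData service account...
      Principal: { AWS: meta.awsAccountId },
      Action: "sts:AssumeRole",
      // ...and only when it presents your externalId.
      Condition: {
        StringEquals: { "sts:ExternalId": meta.externalId },
      },
    },
  };
}
```

The resulting JSON can then be passed to the AWS CLI, e.g. as the `--assume-role-policy-document` argument of `aws iam create-role`.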

Permissions Policy

To give your newly created IAM role permission to access your files on S3, an IAM Policy will need to be created.

Note that BDCLI can do all of this for you, such as creating the IAM Role with the help of your Boiling account details and a YAML configuration file you created. It also supports multiple profiles, so you can configure multiple users at the same time.

Development Environment / Just Testing

The most permissive policy will allow BoilingData to see all of your buckets, list the objects in a bucket, read them, and upload new files. This policy is perfect if you are just playing around with the BoilingData interface or are using a dev account.
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "BoilingData0",
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:GetObject",
                "s3:ListBucket",
                "s3:GetBucketLocation",
                "s3:GetBucketRequestPayment"
            ],
            "Resource": [
                "arn:aws:s3:::BUCKET-NAME/*",
                "arn:aws:s3:::BUCKET-NAME"
            ]
        },
        {
            "Sid": "BoilingData1",
            "Effect": "Allow",
            "Action": "s3:ListAllMyBuckets",
            "Resource": "*"
        }
    ]
}

(replace BUCKET-NAME with your bucket name)

Production Environment

A production environment will ideally have the most restrictive permissions. In this case, the S3 object paths will be known in advance, and traversing buckets or uploading is not needed. The minimum required permissions to allow querying of a known S3 object are s3:GetObject, s3:GetBucketLocation, and s3:GetBucketRequestPayment.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "BoilingData0",
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:GetBucketLocation",
                "s3:GetBucketRequestPayment"
            ],
            "Resource": [
                "arn:aws:s3:::BUCKET-NAME",
                "arn:aws:s3:::BUCKET-NAME/*"
            ]
        }
    ]
}

BoilingData PRAGMAs

We have added some additional PRAGMA statements on top of standard SQL support, for instance:

PRAGMA s3AccessRoleArn='arn:aws:iam::123456789012:role/bdS3';

This PRAGMA sets the IAM Role ARN that the BoilingData service assumes when accessing S3. The setting is persisted on the service side, so it only needs to be set once.

All of the custom BoilingData PRAGMA statements can be queried with

SELECT * FROM pragmas;

Operations

  • RECEIVE DataQuery

    Send SQL queries.

    With this API call you send the SQL queries, including any PRAGMA statements.

    Operation ID: publish

    Accepts the following message:

    Data Query (DataQuery)
    Message ID: DataQuery

    User/client sends SQL queries with this message.
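As a sketch of what publishing a DataQuery might look like on the client side: the envelope fields below (messageType, requestId, sql) are our assumptions for illustration; consult the DataQuery message schema or the JS/TS SDK for the authoritative shape.

```typescript
// Hypothetical DataQuery envelope -- the field names are assumptions
// for illustration, not the real message schema.
import { randomUUID } from "node:crypto";

interface DataQueryMessage {
  messageType: "SQL_QUERY";
  requestId: string; // echoed back in every response message
  sql: string;
}

function makeDataQuery(sql: string): DataQueryMessage {
  return { messageType: "SQL_QUERY", requestId: randomUUID(), sql };
}

// The message would then be sent over the signed WebSocket connection, e.g.
//   ws.send(JSON.stringify(makeDataQuery("SELECT * FROM boilingdata;")));
```

Because the API is asynchronous, keeping the generated requestId around is what lets you correlate the DataResponse, QueryInfo, and LogMessage events back to this query.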

  • SEND DataResponse

    Query response

    Query responses, at least one for each requestId.

    Operation ID: subscribe

    Accepts the following message:

    Data Response (DataResponse)
    Message ID: DataResponse

    These messages carry the actual response data for the query. Responses may come in multiple batches and sub-batches to match the distributed streaming design.

  • SEND QueryInfo

    Query processing information events

    You receive query processing information events from the service.

    Operation ID: subscribe

    Accepts the following message:

    Query Information (QueryInfo)
    Message ID: QueryInfo

    Information about query progress and processing.

  • SEND LambdaEvent

    Information events about Lambda Assured Warm Concurrency

    With these events you can follow what happens with the Lambda containers behind the scenes. They reflect the state of the warm Lambdas: how many there are and their lifecycles.

    Operation ID: subscribe

    Accepts the following message:

    Lambda Event Information (LambdaEvent)
    Message ID: LambdaEvent

    Information about Lambda containers, their lifecycles, and related hot data sets.

  • SEND LogMessage

    General logging messages

    Logging messages with varying levels.

    Operation ID: subscribe

    Accepts the following message:

    Log Message (LogMessage)
    Message ID: LogMessage

    General logging information. May also be unrelated to a query.

Messages

  • #1 Data Query (DataQuery)
    Message ID: DataQuery

    User/client sends SQL queries with this message.

  • #2 Lambda Event Information (LambdaEvent)
    Message ID: LambdaEvent

    Information about Lambda containers, their lifecycles, and related hot data sets.

  • #3 Data Response (DataResponse)
    Message ID: DataResponse

    These messages carry the actual response data for the query. Responses may come in multiple batches and sub-batches to match the distributed streaming design.

  • #4 Query Information (QueryInfo)
    Message ID: QueryInfo

    Information about query progress and processing.

  • #5 Log Message (LogMessage)
    Message ID: LogMessage

    General logging information. May also be unrelated to a query.