Decrypting data with BigQuery Remote Functions

by Yash Mehta, Consultant

Data encryption is a growing concern among all cloud users and most enterprises have policies to ensure that data is always encrypted on the cloud. But, in some use cases, you need to be able to decrypt data and hence necessary actions needs to be taken to ensure PII data is handled properly. In this article, we will discuss how to decrypt data on BigQuery without having to materialise un-encrypted data.

  1. Introduction to remote functions
  2. Implementation workflow
  3. Other use cases
  4. Design considerations
  5. Summary

1. Introduction

With BigQuery’s remote functions, you can integrate Google SQL with external processing like Cloud Run or Cloud Functions. On the backend, BigQuery sends an HTTP/s request to the deployed endpoint and then waits for a response from the service. For a query user, remote function acts exactly like user-defined functions (UDF).

2. Implementation workflow

Define the input and output for Cloud Run/Function. Deploy Cloud Run/Function. Create external connection Create remote function

2.1. Input and Output format for the endpoint

The most important keys on the request to the endpoint are are userDefinedContext and calls. User defined context contains any override flags you want to the endpoint on how to process the current batch. Calls contains all the input data for the endpoint.

{
 "requestId": "124ab1c",
 "caller": "//bigquery.googleapis.com/projects/myproject/jobs/myproject:US.bquxjob_5b4c112c_17961fafeaf",
 "sessionUser": "test-user@test-company.com",
 "userDefinedContext": {
  "key1": "value1",
  "key2": "v2"
 },
 "calls": [
  [null, 1, "", "abc"],
  ["abc", "9007199254740993", null, null]
 ]
}

Replies will be parsed by BigQuery and used within the query, errorMessage is useful to identify the cause for failure which will enable BigQuery to setup retries. In certain scenarios where there was an internal failure, BigQuery will retry based on the error code.

{
  "replies": [
    1,
    0
  ],
  "errorMessage": "Received but not expected that the argument 0 be null".
}

2.2. Deploy Cloud Run / Function

A sample python code to base64 decrypt BigQuery data. This is not the ideal use case as BigQuery has base64 functions built-in but this give you an idea on how to write a function to accept BigQuery requests. In practice, your decryption function will be using a sophisticated algorithm and would have much complex code. You will also have to manage the keys for you decryption function.

from flask import jsonify
import base64

def decrypt(message):
	return base64.b64decode(message.encode("ascii")).decode("ascii")

def accept_request(request):
    request_json = request.get_json()
    calls = request_json['calls']
    return_value = []
    try:
        for call in calls:
            return_value.append(decrypt(call[0]))
            return_json = jsonify( { "replies":  return_value } )
        return return_json
    except Exception as e:
        return jsonify( { "errorMessage": str(e) } ), 400

Deploying Cloud Run: https://cloud.google.com/run/docs/quickstarts/deploy-container Deploying Cloud Functions: https://cloud.google.com/functions/docs/deploy

Secretes management is a whole topic in itself but for the context of this article, you can mount secretes from Secret Manager to your Cloud Run or Cloud Function instance.

https://cloud.google.com/functions/docs/configuring/secrets

2.3. Create external connection

If you are not familiar with external connections on BigQuery, they are essentially gateway to use non-BigQuery resources like Cloud SQL, S3, Azure Blob storage, BigLake, and many more.

Creating an external connection can be achieved by either Clickops (Console), BQ CLI, and Terraform (recommended as it can be tied back to the entire deployment process)

resource "google_bigquery_connection" "connection" {
    connection_id = "CONNECTION_ID"
    project = "PROJECT_ID"
    location = "REGION"
    cloud_resource {}
}   

Speaker Highlight

NOTE: Cloud Function or Cloud Run will be triggered by the Service account created for external connection. This SA will not be created by you, you can only grant it permissions like triggering the end point.

2.4. Create remote function

Currently, the only way to create a remote function is via SQL or BigQuery DataFrames. To ensure that the entire process can be automated, we deploy the function via Terraform.

resource "google_bigquery_job" "job" {
  job_id     = "job_query"
  ....
  ...
  query {
    query = <<EOT
      CREATE FUNCTION PROJECT_ID.DATASET_ID.remote_add(x INT64, y INT64) RETURNS INT64
      REMOTE WITH CONNECTION PROJECT_ID.LOCATION.CONNECTION_NAME
      OPTIONS (
        endpoint = 'ENDPOINT_URL'
      )
    EOT
}

3. Other use Cases

With remote functions, the use cases are limitless. One needs to consider aspects where we can integrate non-SQL with SQL workflows. Some examples of using remote functions to streamline processing are:

  • Machine Learning
    • Image processing
    • Adding structure to image datasets
  • NLP
    • Sentiment analysis
    • Translation
  • ….
  • Encryption / Decryption on the fly
  • API access
  • Real-time data access
  • Implement complex logic on traditional ETL tools

4. Design Considerations

As with any service, there are limitations and drawbacks to remote functions. Although with remote functions, you are not managing any infrastructure (serverless), there are many considerations to account for before architecting the solution.

4.1 Cloud Run vs Cloud Function

A simple rule to decide between these services is how long will it take for the process to complete. If the process is quick (less than 10 minutes) – use Cloud Function, else Cloud Run. Another factor to consider is, can the code run on standard libraries or doe it require complex setup which can be handled by a container. If it needs to be containerised, then use Cloud Run, else Cloud Function.

4.2 Cold Start

Cloud Run and Cloud Functions can scale rapidly depending on the requests, but its not always quick. As will all serverless offerings, this will run into cold start. There are few ways to ensure faster start-up times.

Write idempotent functions which yield the same results when re-triggered. This will ensure that in case of a time-out failure, the function can be re-triggered and return the expected result. Avoiding background activities on the execution node Add concise requirements and avoid importing unnecessary modules. Set minimum instances

4.3 Parallelism and Scaling

Cloud Run and Cloud Functions can scale very quickly depending on the requests. It is ideal to set these parameters when deploying. For both these services we can set max instances that will ensure that they don’t scale infinitely. Another option is to set batch size when creating remote function. This will ensure that the endpoint receives input of the size of max_batching_rows.

CREATE FUNCTION PROJECT_ID.DATASET_ID.FUNCTION_NAME(INPUT_NAME DATA_TYPE) RETURNS DATA_TYPE
REMOTE WITH CONNECTION PROJECT_ID.LOCATION.CONNECTION_NAME
OPTIONS (
  endpoint = 'ENDPOINT_URL'
  max_batching_rows = BATCH_SIZE
)

Speaker Highlight

NOTE: the maximum value is 50,000

4.4 Security and networking

Given BigQuery and Cloud Run / Cloud Functions are serverless offerings, it is impossible to identify IP addresses in advance. To ensure that the data between these services do not flow over public internet, we need to setup VPC Service Controls (VPC SC) and set internal traffic as the ingress / egress settings.

5. Summary

In this post, you learned how to setup BigQuery remote functions and also some of the best practices when architecting solutions using it. As a best practice, it is ideal to integrate all these steps into Terraform to ensure that organisational policies are adhered to, and apply IAM policies to restrict access to these functions. The applications are limitless, keep exploring and reach out if there are any questions!

For code sample and starter guide, checkout my Github repository for BigQuery Remote functions. https://github.com/yashmehta10/terraform_bigquery_remote_functions

More articles

Experiencing GDG Sydney x Google I/O Extended: A Deep Dive into Innovation

Last Saturday, on a sun-soaked morning, my colleagues and I attended the GDG Sydney x Google I/O Extended event. We met a diverse group of people from various fields and regions, enjoying the chance to network and reconnect with familiar organisers. For those who couldn’t make it to the event, we’ve penned this article to share our experiences and key takeaways. Whether you're a developer, IT professional, or simply curious about the latest tech trends, read on to discover what GDG Sydney x Google I/O Extended was all about and what exciting innovations were unveiled.

Read more

Retrieval-Augmented Generation: A High-Level Overview

In the rapidly evolving landscape of artificial intelligence, Retrieval-Augmented Generation (RAG) stands out as a groundbreaking approach to enhancing the capabilities of language models. By integrating information retrieval with advanced generation techniques, RAG systems can provide more accurate, contextually relevant, and informative responses. This article provides a high-level overview of RAG, delving into its architectural principles, the role of word embeddings and vector stores, and the process of preparing and managing data.

Read more

Tell us about your project

Our offices

  • Sydney
    Level 7, 117 York Street
    Sydney, NSW 2000
  • Melbourne
    Level 27, 101 Collins St
    Melbourne, VIC 3000
  • Brisbane
    Level 23, 300 Queen St
    Brisbane, QLD 4000
  • New York
    132 West 31st St, 9th Floor
    New York, NY 10001
  • Washington, DC
    1900 Reston Metro Plaza
    Reston, VA 20190
  • London
    United Kingdom
    1 St Katharines Way, E1W 1UN
worldmap