Decrypting data with BigQuery Remote Functions
by Yash Mehta, Consultant
Data encryption is a growing concern among cloud users, and most enterprises have policies to ensure that data is always encrypted in the cloud. In some use cases, however, you need to be able to decrypt data, and the necessary steps must be taken to ensure PII data is handled properly. In this article, we will discuss how to decrypt data in BigQuery without having to materialise unencrypted data.
- Introduction to remote functions
- Implementation workflow
- Other use cases
- Design considerations
- Summary
1. Introduction to remote functions
With BigQuery's remote functions, you can integrate GoogleSQL with external processing on Cloud Run or Cloud Functions. On the backend, BigQuery sends an HTTPS request to the deployed endpoint and then waits for a response from the service. To the query user, a remote function acts exactly like a user-defined function (UDF).
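For illustration, once a remote function has been created (the steps are covered below), it is invoked like any other function inside a query. A hypothetical sketch, assuming a remote function my_dataset.decrypt and a table my_dataset.customers already exist:
SELECT
  customer_id,
  my_dataset.decrypt(encrypted_email) AS email
FROM my_dataset.customers;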
2. Implementation workflow
- Define the input and output for the Cloud Run service / Cloud Function
- Deploy the Cloud Run service / Cloud Function
- Create an external connection
- Create the remote function
2.1. Input and Output format for the endpoint
The most important keys in the request sent to the endpoint are userDefinedContext and calls. userDefinedContext contains any override flags you want to pass to the endpoint about how it should process the current batch, and calls contains all the input data for the endpoint. How the context values are defined is shown after the example request below.
{
  "requestId": "124ab1c",
  "caller": "//bigquery.googleapis.com/projects/myproject/jobs/myproject:US.bquxjob_5b4c112c_17961fafeaf",
  "sessionUser": "test-user@test-company.com",
  "userDefinedContext": {
    "key1": "value1",
    "key2": "v2"
  },
  "calls": [
    [null, 1, "", "abc"],
    ["abc", "9007199254740993", null, null]
  ]
}
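The values in userDefinedContext are not sent by the query; they are fixed when the remote function is created, via the user_defined_context option. A minimal sketch (the key/value pairs are illustrative):
CREATE FUNCTION PROJECT_ID.DATASET_ID.remote_add(x INT64, y INT64) RETURNS INT64
REMOTE WITH CONNECTION PROJECT_ID.LOCATION.CONNECTION_NAME
OPTIONS (
  endpoint = 'ENDPOINT_URL',
  user_defined_context = [("key1", "value1"), ("key2", "v2")]
)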
Replies are parsed by BigQuery and used within the query. errorMessage is useful for identifying the cause of a failure and enables BigQuery to set up retries; in certain scenarios where there was an internal failure, BigQuery will retry based on the error code.
{
  "replies": [
    1,
    0
  ],
  "errorMessage": "Received but not expected that the argument 0 be null"
}
2.2. Deploy Cloud Run / Function
Below is sample Python code to base64-decode BigQuery data. This is not the ideal use case, as BigQuery has base64 functions built in, but it gives you an idea of how to write a function that accepts BigQuery requests. In practice, your decryption function will use a more sophisticated algorithm and have much more complex code. You will also have to manage the keys for your decryption function.
from flask import jsonify
import base64

def decrypt(message):
    # Placeholder "decryption": simply base64-decode the incoming string.
    return base64.b64decode(message.encode("ascii")).decode("ascii")

def accept_request(request):
    # BigQuery sends a JSON payload; "calls" holds one list of arguments per row.
    request_json = request.get_json()
    calls = request_json['calls']
    return_value = []
    try:
        for call in calls:
            # The first (and only) argument of each call is the encrypted value.
            return_value.append(decrypt(call[0]))
        # BigQuery expects the results under "replies", in the same order as "calls".
        return_json = jsonify({"replies": return_value})
        return return_json
    except Exception as e:
        # Returning "errorMessage" with a non-2xx status lets BigQuery surface the failure.
        return jsonify({"errorMessage": str(e)}), 400
- Deploying Cloud Run: https://cloud.google.com/run/docs/quickstarts/deploy-container
- Deploying Cloud Functions: https://cloud.google.com/functions/docs/deploy
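If you prefer to keep the deployment in Terraform as well (as recommended later for the connection and the function definition), a minimal sketch of a 2nd-gen Cloud Function is shown below. The bucket, object, and resource names are illustrative, and the zipped source is assumed to already be in Cloud Storage:
resource "google_cloudfunctions2_function" "decrypt_fn" {
  name     = "decrypt-fn"
  location = "REGION"
  project  = "PROJECT_ID"

  build_config {
    runtime     = "python311"
    entry_point = "accept_request"
    source {
      storage_source {
        bucket = "SOURCE_BUCKET"          # bucket holding the zipped source
        object = "function-source.zip"
      }
    }
  }

  service_config {
    available_memory   = "256M"
    timeout_seconds    = 60
    max_instance_count = 10
  }
}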
Secrets management is a whole topic in itself, but for the purposes of this article, you can mount secrets from Secret Manager into your Cloud Run or Cloud Function instance.
https://cloud.google.com/functions/docs/configuring/secrets
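For example, if a secret is exposed to the function as an environment variable, the handler can read it at runtime. A minimal sketch; the variable name DECRYPTION_KEY is illustrative:
import os

# Assumes the secret was mounted as an environment variable named DECRYPTION_KEY
# when the Cloud Run service / Cloud Function was deployed.
DECRYPTION_KEY = os.environ.get("DECRYPTION_KEY")
if DECRYPTION_KEY is None:
    raise RuntimeError("DECRYPTION_KEY is not mounted on this instance")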
2.3. Create external connection
If you are not familiar with external connections in BigQuery, they are essentially gateways for using non-BigQuery resources such as Cloud SQL, S3, Azure Blob Storage, BigLake, and many more.
An external connection can be created via the Console (ClickOps), the bq CLI, or Terraform (recommended, as it can be tied back to the entire deployment process).
resource "google_bigquery_connection" "connection" {
connection_id = "CONNECTION_ID"
project = "PROJECT_ID"
location = "REGION"
cloud_resource {}
}
NOTE: The Cloud Function or Cloud Run service will be invoked by the service account created for the external connection. This service account is not created by you; you can only grant it permissions, such as permission to invoke the endpoint.
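With Terraform, that grant could look something like the sketch below. The resource names are illustrative, and for a 2nd-gen function the underlying Cloud Run service may require roles/run.invoker instead:
resource "google_cloudfunctions2_function_iam_member" "invoker" {
  project        = "PROJECT_ID"
  location       = "REGION"
  cloud_function = google_cloudfunctions2_function.decrypt_fn.name
  role           = "roles/cloudfunctions.invoker"
  # Service account automatically created for the BigQuery external connection.
  member         = "serviceAccount:${google_bigquery_connection.connection.cloud_resource[0].service_account_id}"
}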
2.4. Create remote function
Currently, the only way to create a remote function is via SQL or BigQuery DataFrames. To ensure that the entire process can be automated, we run the CREATE FUNCTION statement as a BigQuery job deployed via Terraform.
resource "google_bigquery_job" "job" {
job_id = "job_query"
....
...
query {
query = <<EOT
CREATE FUNCTION PROJECT_ID.DATASET_ID.remote_add(x INT64, y INT64) RETURNS INT64
REMOTE WITH CONNECTION PROJECT_ID.LOCATION.CONNECTION_NAME
OPTIONS (
endpoint = 'ENDPOINT_URL'
)
EOT
}
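Once the job has run, the function can be called like any other function. For example, with the remote_add function created above:
SELECT
  val,
  PROJECT_ID.DATASET_ID.remote_add(val, 2) AS val_plus_two
FROM UNNEST([1, 2, 3]) AS val;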
3. Other use cases
With remote functions, the use cases are limitless. Consider any place where non-SQL processing can be integrated into SQL workflows. Some examples of using remote functions to streamline processing are:
- Machine Learning
- Image processing
- Adding structure to image datasets
- …
- NLP
- Sentiment analysis
- Translation
- ….
- Encryption / Decryption on the fly
- API access
- Real-time data access
- Implement complex logic on traditional ETL tools
4. Design Considerations
As with any service, there are limitations and drawbacks to remote functions. Although remote functions are serverless and you do not manage any infrastructure, there are several considerations to account for before architecting a solution.
4.1 Cloud Run vs Cloud Function
A simple rule for deciding between these services is how long the process will take to complete: if the process is quick (less than 10 minutes), use a Cloud Function; otherwise use Cloud Run. Another factor to consider is whether the code can run on standard libraries or requires a complex setup that is better handled by a container. If it needs to be containerised, use Cloud Run; otherwise a Cloud Function.
4.2 Cold Start
Cloud Run and Cloud Functions can scale rapidly depending on the requests, but scaling is not always quick. As with all serverless offerings, they will run into cold starts. There are a few ways to ensure faster start-up times:
- Write idempotent functions that yield the same results when re-triggered. This ensures that, in case of a time-out failure, the function can be re-triggered and return the expected result.
- Avoid background activities on the execution node.
- Keep requirements concise and avoid importing unnecessary modules.
- Set minimum instances (see the sketch after this list).
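Minimum instances are set on the service itself at deploy time. For the 2nd-gen Cloud Function sketched in section 2.2, this could look like the following (values are illustrative):
resource "google_cloudfunctions2_function" "decrypt_fn" {
  # ... name, location and build_config as in section 2.2 ...

  service_config {
    min_instance_count = 1    # keep one instance warm to reduce cold starts
    max_instance_count = 10
  }
}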
4.3 Parallelism and Scaling
Cloud Run and Cloud Functions can scale very quickly depending on the requests, so it is best to set scaling parameters at deployment time. For both services, we can set max instances to ensure they do not scale without bound. Another option is to set a batch size when creating the remote function; this ensures the endpoint receives inputs of at most max_batching_rows rows per request.
CREATE FUNCTION PROJECT_ID.DATASET_ID.FUNCTION_NAME(INPUT_NAME DATA_TYPE) RETURNS DATA_TYPE
REMOTE WITH CONNECTION PROJECT_ID.LOCATION.CONNECTION_NAME
OPTIONS (
  endpoint = 'ENDPOINT_URL',
  max_batching_rows = BATCH_SIZE
)
NOTE: the maximum value for max_batching_rows is 50,000
4.4 Security and networking
Given that BigQuery and Cloud Run / Cloud Functions are serverless offerings, it is impossible to identify their IP addresses in advance. To ensure that data flowing between these services does not traverse the public internet, we need to set up VPC Service Controls (VPC SC) and configure the ingress / egress settings to allow internal traffic only.
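On the endpoint side, the ingress restriction can be declared at deploy time; the VPC SC perimeter itself is configured separately. A sketch for the 2nd-gen Cloud Function from section 2.2:
resource "google_cloudfunctions2_function" "decrypt_fn" {
  # ... name, location and build_config as in section 2.2 ...

  service_config {
    ingress_settings = "ALLOW_INTERNAL_ONLY"    # block traffic from the public internet
  }
}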
5. Summary
In this post, you learned how to set up BigQuery remote functions, as well as some best practices for architecting solutions with them. As a best practice, it is ideal to integrate all these steps into Terraform to ensure that organisational policies are adhered to, and to apply IAM policies that restrict access to these functions. The applications are limitless; keep exploring and reach out if there are any questions!
For a code sample and starter guide, check out my GitHub repository for BigQuery remote functions: https://github.com/yashmehta10/terraform_bigquery_remote_functions