DMTN-317
Technical Design for the Prompt Products Database (PPDB) in BigQuery#
Abstract
This technote presents technical requirements and design notes for the Prompt Products Database (PPDB) in BigQuery.
Overview#
The Prompt Products Database (PPDB) will store Prompt Data Products, as described in the Data Products Definition Document [Jurić et al., 2023], with the goal of making the data easily accessible for scientific analysis and querying. The data will be ingested from the Alert Production Database (APDB) with the current plan being to host the data in Google BigQuery. The justification for using BigQuery compared with other options is discussed at length in DMTN-308 [McCormick, 2025].
Some work has already been done to prototype the ingestion of data from the APDB to BigQuery, as well as to set up a TAP service for querying the data. As much as possible, this technote will attempt to make it clear which parts of the design are still under consideration and which parts have already been implemented. Even those aspects that have been implemented are still subject to change as the design is refined.
Requirements#
Functional Requirements#
The following requirements may be assumed based on Rubin design documents:
Comprehensive Level 1 Data Access: Provide rapid access to the complete history of Level 1 catalogs, including DIASource, DIAObject, SSObject, and DIAForcedSource, along with associated metadata and provenance [LSE-163, LSE-61].
Low-Latency User Query Performance: Support at least 20 simultaneous users for live queries against the Prompt Products Database, with each query completing within 10 seconds [LSE-61]. Additionally, low-volume, ad-hoc user queries must be answered within a few seconds, supporting approximately 100 such queries simultaneously [LDM-135].
Public Data Availability: Ensure that the public-facing view of updated Level 1 database content (DIASources, DIAObjects, SSObjects) is available within 24 hours of the corresponding observation or orbit determination [LSE-61].
Reproducible & Complex Querying: Enable queries on all Level 1 data products to be reproducible over time [LDM-555, LSE-61] and support complex queries, including spatial correlations and time series comparisons, using ADQL (a superset of SQL92) [LDM-135, LSE-61, LDM-555].
High System Availability and Data Integrity: The database system must maintain at least 98% uptime and must not lose data due to hardware/software failures, incorporating fault-tolerance features and avoiding extended offline periods for maintenance [LSE-61]. (Unclear whether this is referring to the APDB or PPDB.)
Adaptable Database Schema: The Level 1 database schema must be modifiable during the lifetime of the survey without altering query results, and data access services must accommodate the evolution of the LSST data model across Data Releases [LSE-61].
Comprehensive Logging and Reporting: Every query of LSST databases must be logged, including user, query text, start/end times, and results size, with history available to the user. Nightly Data Quality, DMS Performance, and Calibration Reports must be generated within 4 hours [LSE-61].
These should be considered preliminary requirements and are subject to change. The APDB and PPDB are used interchangeably in some design documents, in particular within the DPDD, since these were not two distinct systems when the documents were written. Further clarification is needed to determine which requirements apply to which database, though some are shared between both systems. Finally, some of the requirements are generic ones that apply to any Rubin database system, though they may still be relevant to the PPDB.
Non-Functional Requirements#
Data Processing#
Creating a robust and error-tolerant data processing pipeline for ingesting data is critical to the success of the PPDB.
The following non-functional requirements may be assumed in this area:
Robust error handling should be implemented for each step of the data processing pipeline. It should be straightforward to identify and retry failed jobs without causing duplicate data.
Processes should be designed to keep up with projected data rates from the APDB (TODO: include projections here???). In particular, bottlenecks should be avoided for each process, which may involve careful monitoring and optimization work.
The data ingestion pipeline should eventually be designed to handle different versions of the schema gracefully and seamlessly. This may involve using different BigQuery datasets for different versions of the schema or developing tools for converting from one schema version to another. Migration tools may also be needed to convert existing data to a new schema.
Each step of the data processing pipeline should be idempotent, meaning that if a step is retried, it will not cause duplicate data or errors. Implementing this is highly involved, as it requires careful tracking of state and ensuring that each operation can be safely retried. (In particular, explicit checks must be performed so that duplicate data is not inserted.)
Each component in the system should be visible for monitoring and debugging purposes, allowing for easy identification of issues and performance bottlenecks. This should be achievable using logging, metrics, and monitoring tools such as Google Cloud Logging or custom dashboards.
The processes which have been implemented so far in the prototype system generally do not satisfy all of these requirements; in particular, achieving an adequate level of idempotency and robustness to failure will require a significant amount of additional development and testing.
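As an illustration of the idempotency requirement, a staging or promotion step could guard against re-processing by checking whether rows tagged with a given chunk ID already exist in the target table. The sketch below is illustrative only; the staging-table layout (including the apdb_replica_chunk column) is described later in this technote, and the function here is not part of the implemented pipeline.

# Illustrative idempotency guard, not the implemented pipeline code.
from google.cloud import bigquery

client = bigquery.Client()

def chunk_already_staged(chunk_id: int, staging_table: str) -> bool:
    """Return True if any rows for this replica chunk are already in the staging table."""
    job = client.query(
        f"SELECT COUNT(*) AS n FROM `{staging_table}` WHERE apdb_replica_chunk = @chunk_id",
        job_config=bigquery.QueryJobConfig(
            query_parameters=[bigquery.ScalarQueryParameter("chunk_id", "INT64", chunk_id)]
        ),
    )
    return next(iter(job.result())).n > 0

# A staging job can skip (or first delete) rows for a chunk that is already present,
# so that a retried job cannot insert duplicates.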
Data Processing#
Overview#
Data processing involves the ingestion of data from the APDB into the PPDB. (Other terms could be used but this document will generally use “data processing” to refer to the entire process.) The APDB exports data in “replica chunks” (hereafter referred to simply as “chunks”), where each chunk contains a set of records from multiple tables that were created or modified within a certain time window, e.g., ~10 minutes. These chunks need to be ingested into the PPDB in a way that maintains data integrity, which heavily influences the overall design of the system. For instance, later replica chunks may contain records that reference records in earlier chunks, so it is important that chunks are inserted into the production system in order. To satisfy this requirement, the data ingestion to BigQuery should occur in two stages: staging and promotion. The staging process involves copying data into staging tables, one per production table, where each row is associated with a chunk ID; data in these tables is not visible to users. After the data has been staged, it should be copied from these staging tables to production tables in an atomic operation that ensures data integrity and consistency. For efficiency, the promotion process should operate on multiple chunks at a time, inserting them in a batch operation, while ensuring that chunks are promoted in order.
Because the APDB is not externally accessible, the data processing is a hybrid system with both on-premises and cloud components. The on-premises components run on the Rubin Science Platform (RSP) at the USDF and are responsible for exporting data from the APDB and uploading it to Google Cloud Storage (GCS). The cloud components run on Google Cloud Platform (GCP) and are responsible for all subsequent steps after data upload, including staging the data into temporary tables in BigQuery and promoting the data to production tables. Any components which must be accessed from both the on-premises and cloud environments will generally be stood up as cloud services, as accessing RSP resources from GCP is more involved and harder to manage administratively.
The data processing is currently implemented in Python, with both the on-premises and cloud components using version 3.11. However, Apache Dataflow is constrained to Python 3.9 at this time, and it is unclear when support for Python 3.11 will be added. In practice, this should not be an issue, because the current Dataflow job used for data staging runs as a separate process and need not share any code with the other processes.
Chunk Tracking Database#
A Postgres database (hereafter referred to as the “chunk tracking database”) will be used to track the status of each chunk as it moves through the data processing pipeline and is used to coordinate the various processes involved in exporting, uploading, staging, and promoting chunks. Though this could be labeled as part of the PPDB (and is a component of it), we use separate terminology in this document for clarity so that it is not confused with the actual data stored in BigQuery.
The chunk tracking database has a single PpdbReplicaChunk table with one row per chunk and columns for the chunk ID, timestamp, status, local export directory, and unique ID.
Because it must be accessible from both the on-premises and cloud environments,
it should be hosted on a managed Postgres instance on GCP.
This could be implemented with a Cloud SQL instance or (as in the current
prototype implementation) a CloudNativePG (CNPG) instance on
Kubernetes.
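For illustration, a minimal SQLAlchemy model corresponding to this table might look like the following. The column names and types are inferred from the description above rather than taken from the actual implementation.

# Illustrative sketch of the chunk tracking table; column names and types are
# inferred from the description above, not copied from the real implementation.
from sqlalchemy import BigInteger, Column, DateTime, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class PpdbReplicaChunk(Base):
    __tablename__ = "PpdbReplicaChunk"

    apdb_replica_chunk = Column(BigInteger, primary_key=True)  # chunk ID
    last_update_time = Column(DateTime, nullable=False)        # chunk timestamp
    unique_id = Column(String(64), nullable=False)             # unique ID for the export
    export_directory = Column(String(255))                     # local export directory
    status = Column(String(32), nullable=False)                # e.g. EXPORTED, UPLOADED, STAGED, PROMOTED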
The chunk tracking database is accessible from the RSP via an external IP address, which is secured using a firewall to only allow connections from specific IP addresses; a VPC connector is also used for additional security. Within GCP, the Cloud Run functions and Dataflow jobs can access the database using its internal IP address, which should eventually be converted to use a private DNS name. (The external address is not itself accessible from within GCP.)
Export and Upload#
The data ingestion process begins with exporting data from the APDB into local Parquet files, which are organized into directories by chunk ID. These chunks are then uploaded to a Google Cloud Storage (GCS) bucket to a specific prefix, where they can be accessed for staging the data into temporary tables.
The process for exporting and uploading the data from the APDB is described below:
The APDB is continually queried by a long-running process to find chunks that are ready to be exported.
For each chunk that is ready, the data is exported from the APDB to Parquet files, one per table, with a configurable compression codec.
The information for the exported replica chunk is inserted into the chunk tracking database as a new record.
A separate process reads from the chunk tracking database to find chunks that have been exported to disk but not yet uploaded.
For each chunk that is ready, the Parquet files are uploaded to a Google Cloud Storage (GCS) bucket under a unique prefix (“folder”), along with a sidecar metadata file containing useful information about the chunk.
After the upload for a particular chunk is complete, the uploader publishes a Pub/Sub message with information necessary for importing the chunk’s data, such as the GCS bucket and path, chunk ID, and timestamp.
As currently implemented, the export process runs serially and writes each chunk to Parquet one after another; the uploader also processes chunks one at a time, though the copying of the Parquet files into GCS is parallelized. It is not yet known whether a single instance of each of these processes will be able to keep up with production data rates from the APDB, but ideally they could be made to do so. Single-process optimization should be aggressively pursued before either of these is split into multiple parallel processes, which would significantly complicate the system and likely require an additional orchestration component.
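A rough sketch of the upload-and-notify step is shown below. The bucket, prefix, and topic names are placeholders, and the real uploader parallelizes the file copies, but the overall flow (copy the chunk's Parquet files, then publish a Pub/Sub message describing the chunk) follows the description above.

# Illustrative sketch only; bucket, prefix, and topic names are placeholders.
import json
from pathlib import Path

from google.cloud import pubsub_v1, storage

def upload_chunk(chunk_id: int, local_dir: Path, bucket_name: str,
                 project: str, topic: str) -> None:
    """Copy a chunk's Parquet files to GCS, then announce the chunk on Pub/Sub."""
    bucket = storage.Client().bucket(bucket_name)
    prefix = f"chunks/{chunk_id}"

    for parquet_file in sorted(local_dir.glob("*.parquet")):
        bucket.blob(f"{prefix}/{parquet_file.name}").upload_from_filename(str(parquet_file))

    publisher = pubsub_v1.PublisherClient()
    message = {"chunk_id": chunk_id, "bucket": bucket_name, "prefix": prefix}
    # publish() returns a future; result() blocks until the message has been accepted.
    publisher.publish(publisher.topic_path(project, topic),
                      json.dumps(message).encode("utf-8")).result()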
Staging#
Once data has been copied to GCS, a cloud-native process will stage the data into temporary tables in BigQuery.
This process involves a few steps:
On Google Cloud Platform, a Cloud Run function which is subscribed to the appropriate Pub/Sub topic is triggered when a new chunk upload is complete. (The uploader from the previous section publishes a Pub/Sub message after each successful upload.)
The stage_chunk function unpacks the Pub/Sub message, extracts the necessary information (either from the message or from the chunk’s manifest file), and launches an Apache Beam job on Google Dataflow to process the chunk.
Based on the job parameters, the Beam job reads the Parquet files from GCS and copies their data into the appropriate staging tables in BigQuery.
The Dataflow jobs execute asynchronously and in parallel, as multiple chunks can be staged at the same time without issue. Experience with this system has shown that the total time to launch and complete a Dataflow job is on the order of 8-15 minutes and varies based on the size of the chunk. While overall BigQuery insert limits must be considered, it is unlikely that the staging process will be a bottleneck in the overall system, given that many chunks can be staged in parallel and the insert operations into BigQuery are highly optimized.
The staging tables themselves are long-lived rather than temporary, as they are continually receiving data from new chunks generated by different Dataflow jobs. These tables have the same schema as the production ones, with the addition of an apdb_replica_chunk column which keeps track of which chunk each record came from.
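The staging job can be thought of as a small Apache Beam pipeline along the following lines. This is a simplified sketch with placeholder project, path, and table names rather than the actual flex template; it stages a single table and omits schema handling and error paths.

# Simplified staging pipeline sketch; paths, table names, and options are placeholders.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def tag_with_chunk(record: dict, chunk_id: int) -> dict:
    """Add the apdb_replica_chunk column so staged rows can be traced to their chunk."""
    return {**record, "apdb_replica_chunk": chunk_id}

def run(chunk_id: int, input_pattern: str, staging_table: str) -> None:
    options = PipelineOptions(
        runner="DataflowRunner",
        project="my-project",
        region="us-central1",
        temp_location="gs://my-dataflow-temp/tmp",
    )
    with beam.Pipeline(options=options) as pipeline:
        (
            pipeline
            | "ReadParquet" >> beam.io.ReadFromParquet(input_pattern)
            | "TagChunk" >> beam.Map(tag_with_chunk, chunk_id)
            | "WriteStaging" >> beam.io.WriteToBigQuery(
                staging_table,  # e.g. "my-project:ppdb_staging.DiaSource"
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
            )
        )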
Promotion#
After data has been uploaded to the staging tables, a separate process will promote the data to production tables. This process is described in more detail below.
A Cloud Run promotion function runs on a periodic schedule, e.g., every 15 minutes. The interval is configurable and can be set in Cloud Scheduler.
The next set of promotable chunks is identified from the chunk tracking database. This is the next sequence of chunks that have been staged where all prior chunks have already been promoted. If there are no promotable chunks, the process logs this and exits.
For each table, a temporary table is created from a snapshot of the current production table. This is a logical “zero copy” operation in BigQuery, which is very fast and efficient and will not incur additional storage costs.
Data is copied from the staging tables to their corresponding temporary table.
It is likely that during this step, some computed or derived columns will be calculated and populated. This may include temporal columns for partitioning and spatial columns for clustering.
An atomic table swap is performed to promote the temporary table to production, ensuring that the promotion does not cause any downtime or data inconsistency.
Concretely, these steps are implemented as follows:
A bq cp command is used to copy the production table to a new temporary table, e.g., DiaObject_promotion_tmp.
Using the BigQuery Python API, data is copied from the staging tables to their corresponding temporary table.
A BigQuery copy job with write_disposition=WRITE_TRUNCATE is used to replace the production table with the temporary table. (Though this is not strictly atomic, it is very fast and should not cause any significant downtime beyond a few seconds.) A sketch of this sequence is shown after this list.
Optionally, the previous production table can be archived before swapping, e.g., by copying it to a backup table with a timestamp suffix.
The records for the chunks that were promoted are deleted from the staging tables.
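As referenced in the list above, a hedged sketch of the copy-and-swap sequence using the BigQuery Python client is shown below. The dataset and table names are placeholders, and the real promotion function would also handle chunk ordering, derived columns, archiving, and cleanup of the staging tables.

# Illustrative promotion sketch; table names are placeholders and error handling is omitted.
from google.cloud import bigquery

client = bigquery.Client()

def promote_table(table: str, chunk_ids: list[int],
                  prod_dataset: str = "ppdb", staging_dataset: str = "ppdb_staging") -> None:
    """Copy production to a temporary table, append staged chunks, then swap it back."""
    prod = f"{client.project}.{prod_dataset}.{table}"
    tmp = f"{prod}_promotion_tmp"
    staging = f"{client.project}.{staging_dataset}.{table}"

    # 1. Copy the current production table to a temporary working table.
    client.copy_table(prod, tmp).result()

    # 2. Append the staged rows for the chunks being promoted, dropping the chunk ID column.
    ids = ", ".join(str(i) for i in chunk_ids)
    client.query(
        f"INSERT INTO `{tmp}` SELECT * EXCEPT(apdb_replica_chunk) FROM `{staging}` "
        f"WHERE apdb_replica_chunk IN ({ids})"
    ).result()

    # 3. Replace the production table with the temporary table in a single copy job.
    swap = bigquery.CopyJobConfig(write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE)
    client.copy_table(tmp, prod, job_config=swap).result()
    client.delete_table(tmp)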
The time cadence of the promotion process has not been finalized. Near real-time promotion is infeasible given the data rates and processing times, as well as the requirement that the chunks be inserted in order. A daily promotion job, starting in the morning or early afternoon (Chilean time) after nightly data taking, would satisfy the formal requirement for L1 data to be made available within 24 hours of acquisition. However, some discussion has suggested that more frequent promotion would be desirable so that data is available to scientists sooner. It is likely that promotion jobs could be run as frequently as once per hour, but a much shorter interval, e.g., every 15 minutes, is probably too often given the operations which need to be performed during this process. Experimentation can be done to determine the optimal cadence for this process while taking into account the data rates, processing times, and the desire to make data available as soon as possible.
Updating Existing Data#
The system must also take into account that updates to existing records may
occur in the APDB, which will need to be propagated to the PPDB.
One known example of this is updating the validity end time of DiaObject
records, which is not being done currently in the prototype system.
DiaObject records may have multiple versions, whose uniqueness is distinguished
by the concatenation of diaObjectId
and the validity start time.
When a new version of a DiaObject is created and written to the chunk data, the
previous version’s validity end time should be updated to be the new version’s
validity start time.
This should occur not as a separate operation but as part of the data ingestion
pipeline, likely when the data is being promoted from staging to production.
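For example, closing out superseded DiaObject versions during promotion could be expressed as a single DML statement along the following lines. The table and column names are illustrative placeholders, not the implemented schema handling.

# Illustrative sketch of closing out superseded DiaObject versions during promotion.
# Table and column names are placeholders.
from google.cloud import bigquery

client = bigquery.Client()

UPDATE_VALIDITY = """
UPDATE `my-project.ppdb.DiaObject_promotion_tmp` AS prev
SET validityEnd = new.validityStart
FROM `my-project.ppdb_staging.DiaObject` AS new
WHERE prev.diaObjectId = new.diaObjectId
  AND prev.validityEnd IS NULL
  AND new.validityStart > prev.validityStart
"""

client.query(UPDATE_VALIDITY).result()  # runs as a standard BigQuery DML job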
There are several other update operations that may need to be performed as well, including but not limited to:
Re-assigning a fraction of DiaSource records to SSObjects.
Deletion of DiaObject records, e.g., those without matching DiaSource records after re-assignment.
Withdrawal of DiaSource or DiaForcedSource records by setting the time_withdrawn column to a valid date.
An additional complexity is that updates may need to be ordered in time, even within the same chunk. Results of multiple operations may depend on the order in which they are applied. The DM-50190 Jira ticket provides more information on all of these operations and scenarios as well as a place to track progress on implementing them.
User Access#
TAP Service#
The PPDB will be accessed by users via a TAP service (CITATION: TAP), which will allow users to query the data using ADQL, similar to how other Rubin Science Data Model (SDM) databases are accessed. A BigQuery TAP service was developed by Burwood Group in collaboration with Google and is described in the TAP BigQuery document. This implementation has been used for testing and is a fork of the open-source CADC TAP repository, a Java implementation of the server-side TAP protocol. Connectivity is provided using the Simba driver for JDBC which can accept configuration parameters via a connection string. In the production system, a service account key file (JSON format) will be used for authenticating to an IAM service account with appropriate permissions to query the datasets. The contents of this key file can be stored securely in a secret management system and retrieved at runtime by the TAP service.
A test version of this TAP service is currently deployed standalone (not inside an RSP environment) on GCP and has been accessed successfully through the RSP’s portal and notebook aspects. The authentication in this test environment currently uses network-level restrictions, which would not be used in the fully integrated production system. Initial testing with this deployment on a small dataset, with about 15 GB of data in the DiaObject table, has delivered low latency, with results of simple queries returned in ~3 seconds. Actual performance will depend heavily on the size of the dataset and the complexity of the query.
This TAP service implementation is still a prototype and an unintegrated fork of the primary CADC TAP repository. Much work remains to be done to improve the features of this TAP service, including but not limited to:
Allowing runtime configuration of the BigQuery dataset mapping, where ppdb, or some other generic name utilized by users in their queries, is mapped to the fully qualified dataset name. (This is currently hardcoded in the prototype’s implementation but it should be configurable in some way.)
Uploading of results to cloud storage, as the current results store is set to a local directory. (This is not a standard feature of the CADC TAP implementation but is needed for integration with the RSP.)
Support for optimized spatial queries using the BigQuery GIS functions with GEOGRAPHY columns (covered in the next section).
Integration with other Rubin extensions and modifications to the TAP service used on the RSP, some of which are described in SQR-099 [Voutsinas, 2025].
Overall, the current TAP service implementation should be viewed as a proof of concept rather than a production-ready system, and it may undergo significant revision before it is suitable for an operational environment.
TAP_SCHEMA#
The TAP service will need to implement the standard TAP_SCHEMA
tables
(CITATION: TAP) to describe the tables and columns available for querying, as
well as to parse the query for correctness (such as validating table and column
names).
Instead of the TAP_SCHEMA database being stored in BigQuery, the prototype TAP
service uses a Postgres implementation, which has been tested and confirmed to
work, in particular in the generic TAP query interface of the RSP portal.
The database is implemented in Postgres primarily for performance reasons, as
the TAP service needs to query these tables frequently with low latency, and
Postgres is better suited for this type of workload.
Eventually, tools should be developed for populating the TAP_SCHEMA tables
from the BigQuery table schema for a given version.
There are a few “quirks” having to do with datatypes which may necessitate
usage of custom tooling rather than the standard Felis (CITATION: felis)
command-line interface.
For instance, the VOTable (CITATION: VOTable) float type does not currently map correctly to BigQuery’s FLOAT64, and so double, which does map correctly, should be substituted instead (Q: Is this because of the Simba driver?).
Similarly, all of BigQuery’s integer types are 64-bit, so the VOTable long
type should always be used instead of int
or short
, which will not be
interpreted correctly.
It is possible that further work on the TAP service and the mapping between
VOTable and BigQuery types could resolve some of these issues around the
datatype mappings so that the standard tooling could be used for generating the
TAP_SCHEMA data, instead.
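As a sketch of what such tooling might look like, the BigQuery client can read a table's schema and emit rows for the TAP_SCHEMA.columns table, applying the datatype substitutions noted above (FLOAT64 to double, all integer types to long). The mapping and output format below are illustrative assumptions, not existing tooling.

# Illustrative sketch of generating TAP_SCHEMA.columns entries from a BigQuery table.
# The datatype mapping reflects the substitutions discussed above and is an assumption.
from google.cloud import bigquery

# BigQuery type -> VOTable datatype (all integers map to long; FLOAT64 maps to double).
TYPE_MAP = {
    "INTEGER": "long",
    "INT64": "long",
    "FLOAT": "double",
    "FLOAT64": "double",
    "STRING": "char",
    "BOOLEAN": "boolean",
    "BOOL": "boolean",
    "TIMESTAMP": "char",  # serialized representation; illustrative choice
}

def tap_schema_columns(full_table_id: str, tap_table_name: str) -> list[dict]:
    """Build TAP_SCHEMA.columns rows for one BigQuery table."""
    client = bigquery.Client()
    table = client.get_table(full_table_id)  # e.g. "my-project.ppdb.DiaObject"
    return [
        {
            "table_name": tap_table_name,      # e.g. "ppdb.DiaObject"
            "column_name": field.name,
            "datatype": TYPE_MAP.get(field.field_type, "char"),
            "description": field.description or "",
        }
        for field in table.schema
    ]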
Spatial Query Support#
Spatial queries are a commonly used pattern when searching astronomical databases, and the PPDB will need to support them efficiently.
Cone Search#
One of the most common spatial query patterns is a cone search, where all
objects within a certain radius of a given position on the sky are returned.
In ADQL, this is typically expressed using the CONTAINS, POINT, and CIRCLE functions.
Below is an example of a cone search that finds all DiaObject
records
with spatial coordinates that are within 1.0 degrees of a given position:
SELECT *
FROM ppdb.DiaObject
WHERE CONTAINS(
POINT('ICRS', ra, dec),
CIRCLE('ICRS', 186.8, 7.0, 1.0)) = 1
The ADQL functions are rewritten to use BigQuery GIS functions in the TAP service so that the query can be executed natively.
When using the numeric (FLOAT64) ra and dec columns, this type of query is not optimized for performance and will generally result in a full table scan.
Work is ongoing to optimize this type of query for performance by adding GEOGRAPHY columns to relevant tables, such as DiaObject, which represent spherical points on the sky. These GEOGRAPHY columns can then be used in the query instead of the numeric ra and dec columns, which will allow the query to be executed much more efficiently.
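For example, if a GEOGRAPHY column (called, say, coord) were added to DiaObject, the cone search above could be rewritten into BigQuery SQL along the following lines; the column name and the degree-to-metre conversion factor are assumptions for this sketch.

# Illustrative BigQuery GIS rewrite of the cone search; the GEOGRAPHY column name
# ("coord") and the degrees-to-metres conversion are assumptions for this sketch.
from google.cloud import bigquery

DEG_TO_M = 111_195.0  # approximate length of one degree of arc on BigQuery's sphere

ra, dec, radius_deg = 186.8, 7.0, 1.0

CONE_SEARCH = f"""
SELECT *
FROM `my-project.ppdb.DiaObject`
WHERE ST_DWITHIN(coord, ST_GEOGPOINT({ra}, {dec}), {radius_deg * DEG_TO_M})
"""

rows = bigquery.Client().query(CONE_SEARCH).result()

If the table is also clustered on the GEOGRAPHY column, BigQuery can prune blocks that cannot contain matching points, avoiding a full table scan.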
Nearest Neighbor Search#
Another common spatial query pattern is a nearest neighbor search, where the closest object to a given position on the sky is returned. This may also be expressed in ADQL using a query such as the following:
SELECT o1.diaObjectId AS id1, o2.diaObjectId AS id2,
DISTANCE(POINT('ICRS', o1.ra, o1.dec), POINT('ICRS', o2.ra, o2.dec)) AS d
FROM ppdb.DiaObject o1
JOIN ppdb.DiaObject o2
ON o1.diaObjectId < o2.diaObjectId
WHERE CONTAINS(POINT('ICRS', o1.ra, o1.dec),
CIRCLE('ICRS', 186.5, 7.05, 0.5))=1
AND DISTANCE(POINT('ICRS', o1.ra, o1.dec),
POINT('ICRS', o2.ra, o2.dec)) < 0.5;
These types of searches are typically constrained to a certain region of the
sky to limit the number of objects that need to be considered.
As with cone searches, using GEOGRAPHY
columns instead of numeric ra
and dec
columns will allow these types of queries to be executed much more
efficiently.
The above query will not actually work correctly though, as the diaObjectId
column in DiaObject
is not unique by itself in the APDB schema, so the join
condition will not work as intended.
It is likely that another table with deduplicated objects will need to be
created to support this type of query, such as a DiaObjectLast
table,
showing only the most recent version of each object, or the query itself will
need to be rewritten to use the full primary key (diaObjectId
plus validity
start) in order to ensure uniqueness.
Additional Query Patterns#
In addition to spatial queries, the PPDB will need to support other common query patterns. These will not be covered exhaustively here, but several common ones will be considered.
Single Object Selection#
Sometimes users will want to select a single object by its unique identifier.
This is typically done using a simple WHERE
clause, as in the following
example:
SELECT * FROM ppdb.DiaObject WHERE diaObjectId=24624704620855420
On unoptimized tables without clustering or partitioning, this type of query
will result in a full table scan.
Clustering on the ID column should significantly improve the performance of
this type of query, though there will be trade-offs to consider when choosing
which columns to cluster on and in which order.
If diaObjectId is the first clustering column, this will optimize single object queries, but it may not be optimal for other types of query patterns such as spatial queries.
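As an illustration of the trade-off, a clustered copy of the table could be created with DDL along the following lines; the choice and order of clustering columns here are an example only, not a recommended production layout.

# Illustrative DDL for a clustered copy of DiaObject; clustering choices are examples only.
from google.cloud import bigquery

CREATE_CLUSTERED = """
CREATE TABLE `my-project.ppdb.DiaObject_clustered`
CLUSTER BY diaObjectId
AS SELECT * FROM `my-project.ppdb.DiaObject`
"""

bigquery.Client().query(CREATE_CLUSTERED).result()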
Table Joins#
Joins between tables are a common query pattern, such as joining DiaSource
to DiaObject
to get the object information for a given source.
These types of queries can be expensive, as they typically require a full table
scan of one or both tables.
For instance, the following query joins DiaSource
to DiaObject
to get
all sources for objects within a certain region of the sky:
SELECT s.*, o.*
FROM ppdb.DiaSource s
JOIN ppdb.DiaObject o
ON s.diaObjectId = o.diaObjectId
WHERE CONTAINS(
POINT('ICRS', o.ra, o.dec),
CIRCLE('ICRS', 186.8, 7.0, 1.0)) = 1
Some experimentation has been performed with clustering both the DiaSource
and DiaObject
tables on the diaObjectId
column, though this did not
seem to significantly improve the performance.
Further experimentation is needed to determine the optimal clustering strategy
or other techniques that could be used to reduce data scanned when joining
tables.
Table Scans#
Some types of queries will naturally require a full table scan, such as filtering on a magnitude column without any spatial or ID constraints.
One example of this is the following query which searches for all objects where the mean PSF flux in the r-band is less than a certain value:
SELECT * FROM ppdb.DiaObject WHERE r_psfFluxMean < 1000.0
Unless the table is clustered on the r_psfFluxMean
column, this type of
query will always require a full table scan.
Backup and Recovery#
The PPDB will need to implement a backup and recovery strategy to ensure data integrity and availability in the event of a failure or data loss. BigQuery provides built-in features for data backup and recovery, including table snapshots and point-in-time recovery. Table snapshots allow for creating a read-only copy of a table at a specific point in time, which can be used for recovery in the event of accidental data deletion or corruption. Point-in-time recovery allows for restoring a table to a specific point in time within the last 7 days. These features can be used to implement a backup and recovery strategy for the PPDB, ensuring that data is protected and can be recovered in the event of a failure or data loss.
Table snapshots may be stored in a separate dataset or project for additional protection against data loss and for de-cluttering the main dataset. An automated process can be implemented using a scheduled Cloud Function or Cloud Run job to create snapshots of the production tables at regular intervals. A daily snapshot before bulk data processing occurs, such as the promotion job were it to run once per day, would be a good starting point. These snapshots can be retained for a configurable period, e.g., 30 days, before being automatically deleted to manage storage costs. More frequent snapshots should not incur significant additional costs, because they are logical copies and do not require duplicating the underlying data.
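A scheduled job could create such snapshots with DDL like the following; the dataset names and the 30-day expiration are placeholders matching the example retention period above.

# Illustrative snapshot creation; dataset names and retention are placeholders.
from datetime import datetime, timezone

from google.cloud import bigquery

def snapshot_table(table: str, source_dataset: str = "ppdb",
                   backup_dataset: str = "ppdb_backups") -> None:
    """Create a dated snapshot of a production table with a 30-day expiration."""
    client = bigquery.Client()
    stamp = datetime.now(timezone.utc).strftime("%Y%m%d")
    client.query(f"""
        CREATE SNAPSHOT TABLE `{client.project}.{backup_dataset}.{table}_{stamp}`
        CLONE `{client.project}.{source_dataset}.{table}`
        OPTIONS (expiration_timestamp = TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 30 DAY))
    """).result()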
Restoring data from a snapshot can be done using the BigQuery console, command-line interface, or API. These operations are generally fast and efficient, as they also do not require copying data but instead create a logical view of the data at the specified point in time.
Another source of backup data is the original Parquet files uploaded to GCS. These files could be retained for a configurable period, e.g., 30 days, to allow for recovery in the event of a catastrophic failure that affects both the BigQuery dataset and the snapshots. They could also be kept indefinitely as a long-term archive, though this would incur additional storage costs. Parquet files could also be stored on-premises for additional protection, though this could require significant disk space as the data volume grows over time.
Finally, the chunk tracking database will also need to be backed up and
recovered as well, as it is a critical component of the data processing
pipeline.
If a managed Cloud SQL instance is used, automated backups can be enabled
using the built-in features of Cloud SQL.
If a CNPG instance is used, a custom backup and recovery process will need to
be implemented, such as using pg_dump
to create logical backups of the
database at regular intervals and storing them in GCS.
Deployment and Operations#
The PPDB will mainly be deployed and operated on Google Cloud Platform (GCP), though some components will run on-premises on the Rubin Science Platform (RSP) at the United States Data Facility (USDF).
USDF Deployment#
The two components of the system which must run on-premises at the USDF are the
export and upload processes, as they need direct access to the APDB, which is
not a publicly accessible database.
The two processes are implemented in the ppdb-replication
application
within the Phalanx repository, which is the
standard project used to deploy and manage applications on the RSP.
The application can be deployed and managed using Argo CD, which provides a web
interface for monitoring and administration.
Specific versions of the processors can be deployed by referencing Docker image
tags in the Phalanx configuration.
Compute resource requirements for these two processes are relatively modest, and the exporter will additionally need disk space for storing the exported Parquet files before they are uploaded to GCS. It has yet to be determined whether or not the files will be kept long-term as an on-premises backup. They could potentially be deleted as soon as the files are copied into GCS, though it may be prudent to keep them for a certain period, e.g., 30 days, to allow for recovery in the event of a catastrophic failure. Exactly how much space these files require will depend on the data rates from the APDB and the retention period, as well as the Parquet compression level, but it is likely that at least several TB of disk space should be allocated for this purpose.
GCP Deployment#
The cloud deployment will be much more complicated than at the USDF, as it involves many services which need to be deployed and configured. Most of these are fully managed by Google using a Software as a Service (SaaS) model, which reduces the operational burden, but all together they still require significant configuration and management.
GCP Services#
Required GCP services must be enabled in the project before they can be used. These are summarized below with their purpose for the PPDB and the tasks involved in setting them up.
Service Name | Purpose for PPDB | Setup Tasks
--- | --- | ---
BigQuery | Data storage and querying | Dataset creation; table creation; permissions and roles
Google Cloud Storage (GCS) | Parquet file storage; Dataflow flex templates and other artifacts | Bucket creation
Cloud Run | Cloud functions for data processing | Function deployments
Cloud Scheduler | Scheduling periodic tasks | Promotion job scheduling
Dataflow | Data ingestion and processing | Flex templates, etc.
CNPG or Cloud SQL | Chunk tracking database | Instance creation; user and database setup; GKE cluster setup (if CNPG)
IAM | Permissions and roles for services | Service accounts; roles and permissions
Virtual networking | Secure connectivity between services | VPC setup; firewall rules
Pub/Sub | Messaging between components | Topic creation; subscription creation
Terraform Deployment#
Existing Rubin GCP projects are managed using the
idf_deploy repository, which uses
Terraform to define and deploy project settings and resources.
This repository should be used to manage the PPDB project as well, ensuring
that the configuration can be version controlled and easily deployed into new
projects if needed, particularly in the case of a disaster recovery scenario.
In addition to the configuration of the GCP services themselves, idf_deploy
is also used to manage project-level settings such as the project name, billing
account, budget amount, etc.
Currently the PPDB cloud setup is managed by a set of ad hoc scripts for development in the ppdb-scripts repository. Porting these to Terraform, as well as adding the additional configuration needed for a production deployment, will be a major sub-project of the overall PPDB effort. An individual with significant expertise in this area will be needed to lead the effort.
Even after the port to Terraform has been completed, some aspects of the configuration may be better managed outside of it. This includes the creation of the BigQuery tables, which will likely be done using a custom script or tool that can read the schema from a file and create the tables programmatically. Additionally, Cloud Run functions will likely be deployed in GitHub Actions where new versions are pushed automatically when new code is merged to the main branch.
References#
Mario Jurić, Tim S. Axelrod, Andrew C. Becker, and others. Data Products Definition Document. Systems Engineering Controlled Document LSE-163, NSF-DOE Vera C. Rubin Observatory, July 2023. URL: https://lse-163.lsst.io/, doi:10.71929/rubin/2587118.
Jeremy McCormick. Database Platform Comparison for the Prompt Products Database (PPDB). Data Management Technical Note DMTN-308, NSF-DOE Vera C. Rubin Observatory, February 2025. URL: https://dmtn-308.lsst.io/.
Stelios Voutsinas. Breakdown of adaptations to the CADC TAP Service for the RSP. SQuaRE Technical Note SQR-099, NSF-DOE Vera C. Rubin Observatory, April 2025. URL: https://sqr-099.lsst.io/.