Platform cost dashboard with Evidence and DuckDB

Every time someone asked "how much are we spending?", I was opening two tabs: AWS Cost Explorer and Datadog's billing page. Adding them up manually. No consolidated view, no budget tracking.

I wanted a single dashboard showing total platform cost and budget vs actual for AWS. The constraint I set for myself: no backend, no database server, no ingestion pipeline.


Evidence and DuckDB

Evidence is a framework for building data apps from Markdown and SQL. You write .md files with SQL code blocks; Evidence executes the queries and renders the results as charts, tables, and KPI cards. The output is a static site with no runtime server.

The component library covers a lot of ground: charts (bar, line, area, scatter, funnel, Sankey, maps), interactive inputs (dropdowns, date ranges, sliders), and UI primitives like tabs, modals, and alerts. See the full component reference.

The SQL runs against DuckDB during the build. DuckDB is an in-process analytical database: it runs inside your process, reads files directly (CSV, Parquet, JSON), and executes OLAP-style queries without a server. Evidence uses it as the query engine. Your SQL gets executed against the configured data sources, and the results get baked into static HTML.

The consequence is that the entire analytical stack collapses to a build step. You write SQL, Evidence queries DuckDB, you get a static site. Nothing to operate at runtime.


Data model

All data lives in sources/csv/ inside the project:

File Contents
fact_cost_monthly.csv Monthly cost by AWS account
fact_cost_daily.csv Daily cost by AWS account
fact_cost_aws_daily.csv Daily cost by account, service, and region
dim_account.csv AWS accounts (ID + name)
budget_aws.csv Monthly AWS budget
datadog.csv Monthly Datadog cost

No schema definitions, no migrations. The CSVs are the schema. Numeric values use a dot as decimal separator, no currency symbol (1255392.00, not $1,255,392.00).


Queries in Markdown

A page in Evidence is a .md file with named SQL blocks. The blocks are fenced with triple backticks and a query name instead of a language. Evidence makes the result available as a variable you can pass to chart components.

The home page consolidates AWS and Datadog into a single monthly series:

WITH aws AS (
  SELECT date_trunc('month', month)::DATE AS period,
         SUM(amortized_cost + tax_cost + edp_cost + refund_cost) AS cost
  FROM csv.fact_cost_monthly
  WHERE provider = 'aws'
    AND month BETWEEN '${inputs.date_filter.start}'::DATE
                  AND '${inputs.date_filter.end}'::DATE
  GROUP BY 1
)
SELECT period, 'AWS' AS provider, cost FROM aws
UNION ALL
SELECT period::DATE, 'Datadog', cost FROM csv.datadog
  WHERE period::DATE BETWEEN '${inputs.date_filter.start}'::DATE
                         AND '${inputs.date_filter.end}'::DATE

The ${inputs.date_filter.start} and ${inputs.date_filter.end} bindings come from a <DateRange> component at the top of the page. Evidence handles the wiring between UI components and query parameters.


Build and deploy

The build runs in GitHub Actions: install deps, run npm run sources to initialize the DuckDB connector, then npm run build to generate the static site. The build/ directory goes into a Docker image on top of nginx:alpine:

FROM nginx:alpine
COPY build/ /usr/share/nginx/html
COPY nginx.conf /etc/nginx/conf.d/default.conf
EXPOSE 80

The image is published to GHCR as ghcr.io/getulioruiz/duckfin:sha-<7-chars>. From there it's a standard GitOps flow: update the image tag in homelab/gitops/apps/duckfin/values.yaml, push, ArgoCD syncs, Kubernetes rolls out the new pod.

The running container requests 50m CPU and 64Mi memory. It's nginx serving static files.


Updating data

There's no live data connection, so refreshing is manual: edit the CSV, commit, push to main, CI builds and publishes a new image, update the tag in values.yaml, ArgoCD rolls it out.

More friction than a live pipeline. For a dashboard where the data cadence is monthly, it's a workable trade. But it's the obvious thing to improve.


From CSVs to CUR: the CronJob approach

The manual CSV editing was the first thing to go. AWS publishes the Cost and Usage Report 2.0 (CUR) as Parquet files to S3 from the AWS account. DuckDB reads Parquet natively. The idea: a Kubernetes CronJob that runs every few hours, reads the CUR Parquets, builds a DuckDB database, generates the static site, and publishes it.

The memory problem

The first attempt was naive: read all CUR Parquets in a single INSERT INTO ... SELECT * FROM read_parquet('s3://bucket/prefix/**/*.parquet'). For a multi-account organization with hourly granularity, a single month of CUR can be 20+ GB of Parquet. Five months? The DuckDB process would balloon past 100 GB of memory and get OOM-killed by Kubernetes.

The root cause: DuckDB keeps the entire transaction in its WAL (write-ahead log) until commit. A single INSERT spanning months of data means the WAL grows monotonically until the transaction completes. Peak memory and disk usage are proportional to the total data volume, not the steady-state database size.

Chunked processing: one month at a time

The solution was to process one BILLING_PERIOD partition at a time, each as its own transaction:

for each BILLING_PERIOD=YYYY-MM partition:
    1. BEGIN TRANSACTION
    2. DELETE rows for that month (if re-processing)
    3. INSERT INTO stg_aws_cur FROM read_parquet('aws/costs/data/BILLING_PERIOD=YYYY-MM/*.parquet')
    4. CHECKPOINT (flush WAL to disk, reclaim space)
    5. INSERT into ingest_logs (partition name, row count, duration, DB size on disk)
    6. Print progress from ingest_logs

Each partition's metadata (row count, duration, database size after checkpoint) is logged to an ingest_logs table inside the same DuckDB file. This gives visibility into how long each month took and how the database grew over time — useful for debugging slow partitions or estimating disk requirements.

This bounds peak memory to roughly one month's worth of data instead of the entire history. After each CHECKPOINT, DuckDB compacts the WAL back into the main database file, so disk usage stays predictable.

Production CronJob

In production (EKS), the CronJob mounts a PVC at /app/data — that is where processed.duckdb lives between runs. The PVC is the source of truth; we do not ship the DB to S3 because the underlying EBS volume is covered by AWS Backup, so point-in-time recovery is handled out of band.

Each run, once a day:

  1. Reads the existing processed.duckdb from the PVC (or initializes a fresh one if the volume is empty)
  2. Determines the refresh window: start of previous month to today (a 2-month window catches AWS retroactive corrections like SP/RI adjustments and late-arriving refunds)
  3. Deletes rows in that window from the database
  4. Streams those months from S3 CUR via read_parquet with hive_partitioning=true directly into DuckDB
  5. Runs npm run sources and npm run build — this step dominates total runtime. npm run sources is where Evidence initializes the DuckDB connector by scanning the full database file; with a database that grows to tens or hundreds of GB depending on org size and history depth, this alone can take several minutes
  6. Uploads only the build/ directory to S3 — the DB stays on the PVC
  7. Triggers a rolling restart of the nginx deployment so it picks up the new site

The CronJob image includes Node.js, DuckDB CLI, and aws-cli. The nginx deployment is separate: an init container syncs the static site from S3, then nginx serves it. Total runtime resources for serving: 50m CPU, 64Mi memory. The heavy lifting happens only during the CronJob window.


What I'd do differently

A few things I learned building this:

  1. Start with chunked processing from day one. The single-transaction approach caused a few OOM kills before I understood the WAL behavior.

  2. Separate CI from data processing. The CI pipeline only builds and pushes the Docker image. It never touches S3 or runs npm run build. Data processing happens at runtime in the CronJob. This separation means a code change doesn't require re-processing months of billing data.

  3. The 2-month refresh window is a pragmatic choice. AWS CUR corrections (SP amortization adjustments, late refunds, RI fee redistribution) can arrive up to 3 days after the billing period closes. Re-processing the previous month on every run guarantees you always have the final numbers.

  4. DuckDB-WASM in the browser is surprisingly capable. The final site has zero backend queries. Users load the page, DuckDB-WASM initializes, loads the Parquet shards on demand, and runs the SQL client-side. For a dashboard with ~5M rows of pre-aggregated data, query response is sub-second.