fix(ci): restore Forgejo runner autoscaler capacity #14

Open
opened 2026-05-18 21:56:52 +00:00 by simon · 0 comments
Owner

Summary

During the web/v0.159.1 publication recovery, organization-level Forgejo Actions runners did not scale as expected. The queue had many waiting ubuntu-latest jobs, but only one active runner was attached to a stale pre-patch publish job. The documented runner host (claude@192.168.8.235) was not reachable from this environment (No route to host), so I could not restart or inspect the autoscaler directly.

Evidence

  • orgs/carrtech/actions/runners/jobs showed many ubuntu-latest jobs waiting while the stale Build and publish web image task remained running.
  • Three old jobs remain waiting on ubuntu-24.04-arm:
    • Build web image (linux/arm64)
    • Build ingest image (linux/arm64)
    • Build ansible-api image (linux/arm64)
  • Deleting the active runner registration did not clear the stuck task.
  • The Forgejo API on this instance does not expose cancel or stop endpoints for Actions runs.
  • I had to register a temporary local ephemeral runner to publish docker.io/carrtechdev/ct-ops-web:v0.159.1.
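
To make the queue picture above easier to reproduce, here is a minimal sketch that tallies waiting jobs per label. It assumes the job listing has already been fetched and decoded into a list of dicts with `name`, `status`, and `runs_on` keys; that shape is an assumption and should be checked against the actual response from this instance. The sample data reuses the job names from this issue, plus one hypothetical `ubuntu-latest` entry standing in for the many waiting jobs.

```python
from collections import Counter


def waiting_jobs_by_label(jobs):
    """Count waiting jobs per runs_on label.

    `jobs` is assumed to be a list of dicts with "status" and "runs_on"
    keys; this shape is an assumption, not confirmed by the issue.
    """
    counts = Counter()
    for job in jobs:
        if job.get("status") != "waiting":
            continue
        # A job may request several labels; count it once under each.
        for label in job.get("runs_on") or []:
            counts[label] += 1
    return dict(counts)


# Illustrative data only: statuses and labels mirror the description above.
sample_jobs = [
    {"name": "Build and publish web image", "status": "running", "runs_on": ["ubuntu-latest"]},
    {"name": "Build web image (linux/arm64)", "status": "waiting", "runs_on": ["ubuntu-24.04-arm"]},
    {"name": "Build ingest image (linux/arm64)", "status": "waiting", "runs_on": ["ubuntu-24.04-arm"]},
    {"name": "Build ansible-api image (linux/arm64)", "status": "waiting", "runs_on": ["ubuntu-24.04-arm"]},
    # Hypothetical placeholder for one of the waiting ubuntu-latest jobs.
    {"name": "hypothetical ubuntu-latest job", "status": "waiting", "runs_on": ["ubuntu-latest"]},
]

print(waiting_jobs_by_label(sample_jobs))
```

A per-label count like this separates the two failure modes: `ubuntu-latest` jobs waiting on autoscaler capacity versus `ubuntu-24.04-arm` jobs waiting on a label nobody serves.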

Impact

Release publication can stall when the autoscaler does not start enough ubuntu-latest runners or when jobs target labels with no active runner, especially ubuntu-24.04-arm.

Suggested Fix

  • Inspect and restart forgejo-runner-autoscaler.service on the runner host.
  • Confirm stale Incus runner VMs are cleaned up and MAX_RUNNERS capacity is available.
  • Either provision ubuntu-24.04-arm runners or migrate those jobs to the supported ubuntu-latest multi-arch Buildx path.
  • Add an operational runbook for cancelling stuck Forgejo Actions runs or stale runner registrations.
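
As a starting point for the runbook, the check below sketches how to flag waiting jobs that no registered runner can pick up. It assumes a runner is eligible only when it carries every label the job requests, and that both the job list and the runner label sets are supplied by the operator; the function name and data shapes are hypothetical, not part of Forgejo.

```python
def unserved_waiting_jobs(jobs, runner_labels):
    """Return names of waiting jobs that no registered runner can serve.

    `jobs` is a list of dicts with "name", "status", and "runs_on" keys
    (assumed shape). `runner_labels` is a list of label lists, one per
    registered runner (also an assumption).
    """
    stuck = []
    for job in jobs:
        if job.get("status") != "waiting":
            continue
        wanted = set(job.get("runs_on") or [])
        # Assumed matching rule: a runner must offer all requested labels.
        if not any(wanted <= set(labels) for labels in runner_labels):
            stuck.append(job.get("name"))
    return stuck


# Illustrative data only, based on the jobs named in this issue.
jobs = [
    {"name": "Build and publish web image", "status": "running", "runs_on": ["ubuntu-latest"]},
    {"name": "Build web image (linux/arm64)", "status": "waiting", "runs_on": ["ubuntu-24.04-arm"]},
    {"name": "Build ingest image (linux/arm64)", "status": "waiting", "runs_on": ["ubuntu-24.04-arm"]},
    {"name": "Build ansible-api image (linux/arm64)", "status": "waiting", "runs_on": ["ubuntu-24.04-arm"]},
]
# One registered runner offering only ubuntu-latest, as during the incident.
runners = [["ubuntu-latest"]]

for name in unserved_waiting_jobs(jobs, runners):
    print("no runner for:", name)
```

Jobs reported here will never start regardless of `MAX_RUNNERS`, so they need either new runners with the matching label or a workflow change, not an autoscaler restart.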
Reference
carrtech/ct-ops#14