Automatic S3 inventory and diff

A focused CLI for large S3-compatible buckets: list, diff, and export to Parquet, while startup discovery, runtime splitting, dry-run plans, and run manifests handle the operational detail.

Rust · Apache-2.0 · v0.20.0 · S3-compatible · Parquet · Agent-safe JSON

Current evidence

1M objects, ~19s at -c 8

The current README cites a third-party Alibaba Cloud OSS benchmark. The product direction is now automatic-first: startup discovery, runtime splitting, and verifiable artifacts.

benchmark: 1M objects (Alibaba Cloud OSS third-party run, v0.20.0)
speedup: ~55s to ~19s (single stream vs -c 8)
release: Apache-2.0 (v0.20.0 and later)

Benchmark path      Approx. objects/s    Time
single stream       18,182               55s
-c 8 automatic      52,632               19s

Default automation path
  1. Discover real prefix boundaries
  2. Split long-tail segments while throughput improves
  3. Write Parquet, plans, manifests, and traces

The chart stays limited to benchmark data; automation is shown as product behavior.

Output: Parquet · key-space CSV · NDJSON · TSV · run manifest
Startup discovery by default · Throughput-aware runtime splitting · Agent-safe dry-run plans · Run manifests with local checks

A look inside

The default path handles the hard parts

s3-turbo-list now behaves more like an automated inventory and diff tool: startup discovery, runtime splitting, dry-run plans, run manifests, and artifact checks sit on the default path.

s3-turbo-list v0.20.0 · automatic inventory path

startup discovery: zero hints (first recursive list can run in parallel)
OSS benchmark: ~19s (1M objects at -c 8)
license: Apache-2.0 (from v0.20.0 onward)

preflight: doctor --json · --dry-run --agent · --plan-json
artifacts: Parquet · key-space CSV · run manifest · trace JSONL
verify: manifest-summary --check · hashes · row counts · schema
provider logs: app/s3-turbo-list User-Agent · compat-probe

Evidence chain

One run from discovery to verification

s3-turbo-list is not a pile of knobs. It pulls the error-prone parts of large object-storage list and diff runs into the default path.

01

Startup discovery

Delimiter probes find real prefix boundaries before a first recursive list fans out.

02

Throughput splitting

Runtime splitting fans out long-tail segments only while throughput is still improving.

03

Artifact checks

Parquet, manifests, hashes, row counts, and schema metadata keep CI and agents honest.

Third-party OSS run: 1M objects
single stream: 55s (18,182 objects/s)
-c 8 automatic: 19s (52,632 objects/s)
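
The objects/s figures above follow directly from the object count and the wall-clock times. A minimal sketch of that arithmetic, assuming exactly 1,000,000 objects and the rounded 55s and 19s timings (the function names are illustrative, not part of s3-turbo-list):

```rust
/// Average listing rate in objects per second.
fn objects_per_second(objects: u64, seconds: f64) -> f64 {
    objects as f64 / seconds
}

/// How many times faster a run is than the baseline run.
fn speedup(baseline_seconds: f64, faster_seconds: f64) -> f64 {
    baseline_seconds / faster_seconds
}
```

With these inputs, `objects_per_second(1_000_000, 55.0)` rounds to 18,182, `objects_per_second(1_000_000, 19.0)` rounds to 52,632, and `speedup(55.0, 19.0)` is roughly 2.9x. Because the timings are approximate, the rates are approximate too.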

What it solves

Use it when

You need an inventory or diff of a large S3-compatible bucket without first designing a partitioning plan or tuning a pile of flags.

It is not

A backup tool, sync engine, or storage browser. It is one focused binary for automatic listing, bounded diffing, and machine-readable evidence.

What stays simple

First runs parallelize with no hints file. Defaults choose the common path; Parquet, manifests, trace JSONL, and stable exit codes keep the result inspectable.

Automation path

Automatic key-space discovery

Startup delimiter probes find real prefix boundaries and cache them, so a first recursive list can run in parallel with no setup.
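
To make the idea of a delimiter probe concrete, here is a local simulation of the `CommonPrefixes` that a delimiter listing would return for a given prefix. It works on an in-memory key list and makes no S3 requests; the function name and signature are assumptions for illustration, not the tool's internals.

```rust
use std::collections::BTreeSet;

/// Simulate the common prefixes a delimiter listing would report under
/// `prefix`: each matching key is cut at the first delimiter after the prefix.
/// Keys with no delimiter after the prefix are plain objects and are skipped.
fn common_prefixes(keys: &[&str], prefix: &str, delimiter: char) -> Vec<String> {
    let mut found = BTreeSet::new();
    for &key in keys {
        if let Some(rest) = key.strip_prefix(prefix) {
            if let Some(pos) = rest.find(delimiter) {
                let end = prefix.len() + pos + delimiter.len_utf8();
                found.insert(key[..end].to_string());
            }
        }
    }
    // BTreeSet iteration yields the prefixes in sorted order, deduplicated.
    found.into_iter().collect()
}
```

Each returned prefix is a candidate segment boundary: probing `""` on a bucket with keys under `logs/` and `images/` yields those two prefixes, and probing `logs/` drills one level deeper.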

Throughput-aware fan-out

Runtime splitting fans out long-tail segments while throughput is still improving, treating concurrency as an upper bound rather than a target.
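
One way to picture "concurrency as an upper bound rather than a target" is a controller that keeps splitting only while each extra worker still buys a meaningful throughput gain. The sketch below is a simplified model under stated assumptions: the `min_gain` threshold, the function names, and the one-sample-per-worker-count input are all hypothetical and do not describe the actual scheduler.

```rust
/// Decide whether one more split is worthwhile.
/// `min_gain` is a relative improvement threshold (0.10 = 10%), an assumption.
fn should_split(
    prev_rate: f64,
    current_rate: f64,
    active: usize,
    max_concurrency: usize,
    min_gain: f64,
) -> bool {
    if active >= max_concurrency {
        return false; // the concurrency flag caps fan-out
    }
    if prev_rate <= 0.0 {
        return true; // no baseline yet, allow the first split
    }
    (current_rate - prev_rate) / prev_rate >= min_gain
}

/// Replay observed rates, where `rate_at[n - 1]` is the rate seen with `n`
/// workers, and return the worker count at which fan-out stops.
fn settle_workers(rate_at: &[f64], max_concurrency: usize, min_gain: f64) -> usize {
    let mut active = 1;
    let mut prev_rate = 0.0;
    while active <= rate_at.len() {
        let current = rate_at[active - 1];
        if !should_split(prev_rate, current, active, max_concurrency, min_gain) {
            break;
        }
        prev_rate = current;
        active += 1;
    }
    active.min(rate_at.len()).max(1)
}
```

With rates that flatten after the third worker, the model stops at four workers even when the cap is eight; with a cap of two, it stops at two regardless of how well throughput scales.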

Bounded diff

Diff partitions both sides, lists them in parallel, and streams an ordered merge with `DiffFlag` rows without holding the combined keyspace in memory.
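
The ordered merge can be sketched as a two-pointer walk over two key-sorted streams, emitting a row as soon as its flag is known so that neither side has to be fully buffered. The `DiffFlag` variants below are an assumption based on the changed, left-only, and right-only categories mentioned elsewhere on this page, and the comparison token (for example an ETag) is likewise hypothetical.

```rust
use std::cmp::Ordering;

/// Hypothetical flag set; the real `DiffFlag` values may differ.
#[derive(Debug, Clone, PartialEq, Eq)]
enum DiffFlag {
    LeftOnly,
    RightOnly,
    Changed,
}

/// One listed object: (key, comparison token such as an ETag).
type Entry = (String, String);

/// Convenience constructor for examples.
fn entry(key: &str, token: &str) -> Entry {
    (key.to_string(), token.to_string())
}

/// Merge two key-sorted streams and pass each difference to `emit`.
/// Only one pending entry per side is held at a time.
fn merge_diff<L, R, F>(left: L, right: R, mut emit: F)
where
    L: Iterator<Item = Entry>,
    R: Iterator<Item = Entry>,
    F: FnMut(String, DiffFlag),
{
    let mut left = left.peekable();
    let mut right = right.peekable();
    loop {
        let order = match (left.peek(), right.peek()) {
            (None, None) => break,
            (Some(_), None) => Ordering::Less,
            (None, Some(_)) => Ordering::Greater,
            (Some(l), Some(r)) => l.0.cmp(&r.0),
        };
        match order {
            Ordering::Less => {
                let (key, _) = left.next().unwrap();
                emit(key, DiffFlag::LeftOnly);
            }
            Ordering::Greater => {
                let (key, _) = right.next().unwrap();
                emit(key, DiffFlag::RightOnly);
            }
            Ordering::Equal => {
                let (key, left_token) = left.next().unwrap();
                let (_, right_token) = right.next().unwrap();
                if left_token != right_token {
                    emit(key, DiffFlag::Changed);
                }
            }
        }
    }
}
```

The sketch skips identical objects; whether the real output also records unchanged rows is not stated here. Both inputs must already be sorted by key, which matches the lexicographic order in which S3-style list APIs return keys.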

Adaptive Parquet output

The writer stays single-file on rate-limited stores and adds part-file writers only when Parquet encoding becomes the bottleneck.

Dry-run plans

`--dry-run --agent` and `--plan-json` emit the resolved plan, output paths, config source, warnings, and file conflicts without S3 requests.

Run manifests

`--run-manifest` records status, metrics, artifacts, warnings, SHA256, Parquet row counts, and schema metadata for completed runs.
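
The kind of check a manifest enables can be sketched as comparing what the manifest recorded against what is found on disk. The struct and field names below are hypothetical and do not reflect the actual manifest schema; hashes and row counts are assumed to have been computed elsewhere and passed in.

```rust
use std::collections::HashMap;

/// Hypothetical view of one artifact entry (not the real manifest schema).
#[derive(Debug, Clone, PartialEq)]
struct ArtifactRecord {
    path: String,
    sha256: String,
    rows: u64,
}

/// Convenience constructor for examples.
fn artifact(path: &str, sha256: &str, rows: u64) -> ArtifactRecord {
    ArtifactRecord { path: path.to_string(), sha256: sha256.to_string(), rows }
}

/// Compare recorded artifacts with observed ones and describe every mismatch.
/// An empty result means all recorded artifacts were found and match.
fn check_artifacts(recorded: &[ArtifactRecord], observed: &[ArtifactRecord]) -> Vec<String> {
    let by_path: HashMap<&str, &ArtifactRecord> =
        observed.iter().map(|a| (a.path.as_str(), a)).collect();
    let mut problems = Vec::new();
    for want in recorded {
        match by_path.get(want.path.as_str()) {
            None => problems.push(format!("{}: missing", want.path)),
            Some(got) => {
                if got.sha256 != want.sha256 {
                    problems.push(format!("{}: sha256 mismatch", want.path));
                }
                if got.rows != want.rows {
                    problems.push(format!(
                        "{}: expected {} rows, found {}",
                        want.path, want.rows, got.rows
                    ));
                }
            }
        }
    }
    problems
}
```

A CI job could treat any non-empty result as a failed run, which is the spirit of `manifest-summary --check`.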

Stable exit codes

Exit classes distinguish validation, provider setup, network, output, data validation, and interrupted runs for CI and agents.
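
For a CI job or agent, the value of distinct exit classes is that each one can map to a different follow-up. The sketch below models the six classes named above as an enum and pairs them with an illustrative policy. The policy is an assumption, and no numeric exit codes are shown because they are not given here.

```rust
/// The exit classes named above, modeled as an enum for illustration.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ExitClass {
    Validation,
    ProviderSetup,
    Network,
    Output,
    DataValidation,
    Interrupted,
}

/// What a wrapper might do next. This policy is hypothetical.
#[derive(Debug, PartialEq, Eq)]
enum NextStep {
    Retry,
    FixAndRerun,
    Stop,
}

/// Map an exit class to a follow-up action.
fn next_step(class: ExitClass) -> NextStep {
    match class {
        // Transient conditions: running again may succeed unchanged.
        ExitClass::Network | ExitClass::Interrupted => NextStep::Retry,
        // Local or configuration problems: change something, then rerun.
        ExitClass::Validation | ExitClass::ProviderSetup | ExitClass::Output => {
            NextStep::FixAndRerun
        }
        // The data itself failed checks: escalate instead of retrying.
        ExitClass::DataValidation => NextStep::Stop,
    }
}
```

The exhaustive `match` means that adding a class forces the policy to be revisited at compile time, which suits automation that must not silently mishandle a new failure mode.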

Traceable provider behavior

`compat-probe` and `--trace-compat` make endpoint behavior and S3 API calls visible before scaling a run.

Why it exists

Automation is the product

Object storage work often starts with two primitive questions: what is in this bucket, and how does it differ from another bucket? The hard part is not adding knobs; it is making the default path fast, inspectable, and safe to automate.

s3-turbo-list keeps the surface intentionally small. It discovers structure, adapts fan-out, writes analysis-ready artifacts, and exposes JSON plans and manifests so humans, CI jobs, and agents can all trust the run.

One run

01

Preflight

Run `doctor --json`, `doctor --simple`, or `--dry-run --agent` to resolve local config, planned outputs, and warnings before touching S3.

02

List or diff

Use the default recursive list path, or diff two buckets; startup discovery and cached hints provide parallel segments automatically.

03

Verify artifacts

Write a run manifest, then use `manifest-summary --check` to validate status, artifact hashes, row counts, schema metadata, and exit class.

04

Analyze

Load Parquet into DuckDB, pandas, or pyarrow, or stream TSV/NDJSON when shell tools or agents need rows directly.

Practical runs

First bucket inventory

Run `list` with a bucket and region; startup discovery finds boundaries and writes Parquet plus key-space counts automatically.

Agent-safe preflight

Generate a dry-run JSON plan before the scan, then let an agent inspect warnings, planned artifacts, and config source.

Migration diff

Diff source and target buckets into one Parquet dataset with `DiffFlag`, then filter changed, left-only, and right-only rows downstream.

Compatible endpoints

Real S3-compatible endpoints

AWS S3 · MinIO · Cloudflare R2 · BOS · OSS · B2

Endpoint presets, compat-probe, trace JSONL, and provider-specific warnings make it easier to understand how each S3-compatible service behaves before scaling up a scan.

Questions & how-to

Why not just use `aws s3 ls`?

For small buckets, sequential listing is fine. s3-turbo-list is for large buckets where the default path should discover structure, list in parallel, and leave structured artifacts.

Do I need to prepare hints first?

No. Startup discovery probes real `CommonPrefixes` boundaries and caches them. Hints files are optional controls for repeated inventories.

What changed in v0.20.0?

The release fixes retry-resume behavior for root segments, adds timeouts to runtime split-probe requests, and stops clean interrupts from inflating the fatal errors reported in the run manifest. The project remains Apache-2.0.
