Rust CLI / Automatic S3 Inventory
A focused CLI for large S3-compatible buckets: it lists, diffs, and exports Parquet, while startup discovery, runtime splitting, dry-run plans, and run manifests handle the operational detail.
Current evidence
The current README cites a third-party Alibaba Cloud OSS benchmark. The product direction is now automatic-first: startup discovery, runtime splitting, and verifiable artifacts.
Chart: Alibaba Cloud OSS third-party run, single stream vs `-c 8`, v0.20.0 and later. The chart stays limited to benchmark data; automation is shown as product behavior.
A look inside
s3-turbo-list now behaves more like an automated inventory and diff tool: startup discovery, runtime splitting, dry-run plans, run manifests, and artifact checks sit on the default path.
first recursive list can run in parallel
1M objects at -c 8
from v0.20.0 onward
Evidence chain
s3-turbo-list is not a pile of knobs. It pulls the error-prone parts of large object-storage list and diff runs into the default path.
Delimiter probes find real prefix boundaries before a first recursive list fans out.
Runtime splitting fans out long-tail segments only while throughput is still improving.
Parquet, manifests, hashes, row counts, and schema metadata keep CI and agents honest.
Benchmark throughput: 18,182 objects/s and 52,632 objects/s.
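As a rough reading of the two throughput figures above, the arithmetic below works out the implied speedup and the time to list one million objects at each rate. It assumes the lower figure is the single-stream run and the higher one is the `-c 8` run; the page does not label them explicitly, so treat that mapping as an assumption.

```python
# Assumption: 18,182 objects/s is the single-stream rate and
# 52,632 objects/s is the -c 8 rate from the third-party benchmark.
SINGLE_STREAM_RATE = 18_182  # objects per second
CONCURRENT_RATE = 52_632     # objects per second
OBJECTS = 1_000_000


def seconds_to_list(objects: int, rate: float) -> float:
    """Time to list `objects` keys at a steady `rate` in objects/s."""
    return objects / rate


speedup = CONCURRENT_RATE / SINGLE_STREAM_RATE
single_s = seconds_to_list(OBJECTS, SINGLE_STREAM_RATE)
concurrent_s = seconds_to_list(OBJECTS, CONCURRENT_RATE)

print(f"speedup: {speedup:.2f}x")
print(f"1M objects at the lower rate: {single_s:.1f} s")
print(f"1M objects at the higher rate: {concurrent_s:.1f} s")
```

Under that assumption the gain is a little under 3x, which is well below 8x and fits the idea of concurrency as an upper bound rather than a target.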
What it solves
Use it when you need an inventory or diff of a large S3-compatible bucket without first designing a partitioning plan or tuning a pile of flags.
It is not a backup tool, sync engine, or storage browser. It is one focused binary for automatic listing, bounded diffing, and machine-readable evidence.
First runs parallelize with no hints file. Defaults choose the common path; Parquet, manifests, trace JSONL, and stable exit codes keep the result inspectable.
Automation path
Startup delimiter probes find real prefix boundaries and cache them, so a first recursive list can run in parallel with no setup.
Runtime splitting fans out long-tail segments while throughput is still improving, treating concurrency as an upper bound rather than a target.
Diff partitions both sides, lists them in parallel, and streams an ordered merge with `DiffFlag` rows without holding the combined keyspace in memory.
The writer stays single-file on rate-limited stores and adds part-file writers only when Parquet encoding becomes the bottleneck.
`--dry-run --agent` and `--plan-json` emit the resolved plan, output paths, config source, warnings, and file conflicts without S3 requests.
`--run-manifest` records status, metrics, artifacts, warnings, SHA256, Parquet row counts, and schema metadata for completed runs.
Exit classes distinguish validation, provider setup, network, output, data validation, and interrupted runs for CI and agents.
`compat-probe` and `--trace-compat` make endpoint behavior and S3 API calls visible before scaling a run.
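To make the startup-discovery idea above concrete, here is a minimal sketch of delimiter probing over an in-memory key list. It is not the tool's implementation: `probe` is a hypothetical stand-in for one delimiter list request that returns `CommonPrefixes`-style groups, and the breadth-first policy in `discover_segments` is an assumption chosen for illustration.

```python
from collections import deque


def probe(keys, prefix, delimiter="/"):
    """Stand-in for one delimiter list request (hypothetical helper).

    Returns (common_prefixes, direct_keys), grouping keys under `prefix`
    the way an S3-style listing with a delimiter would.
    """
    prefixes, direct, seen = [], [], set()
    for key in keys:
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        cut = rest.find(delimiter)
        if cut == -1:
            direct.append(key)  # key sits directly under this prefix
        else:
            common = prefix + rest[:cut + len(delimiter)]
            if common not in seen:
                seen.add(common)
                prefixes.append(common)
    return prefixes, direct


def discover_segments(keys, target, delimiter="/"):
    """Probe breadth-first until `target` segments exist or nothing splits."""
    queue = deque([""])
    done = []  # prefixes kept whole
    while queue and len(queue) + len(done) < target:
        prefix = queue.popleft()
        children, direct = probe(keys, prefix, delimiter)
        if children and not direct:
            queue.extend(children)  # children cover every key under prefix
        else:
            done.append(prefix)     # splitting would drop the direct keys
    return sorted(done + list(queue))


KEYS = ["logs/2024/a", "logs/2024/b", "logs/2025/a", "media/img/1", "media/vid/1"]
segments = discover_segments(KEYS, target=4)
print(segments)
```

Each resulting segment can be listed independently, and every key falls under exactly one segment, which is the property a parallel recursive list needs.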
Why it exists
Object storage work often starts with two primitive questions: what is in this bucket, and how does it differ from another bucket? The hard part is not adding knobs; it is making the default path fast, inspectable, and safe to automate.
s3-turbo-list keeps the surface intentionally small. It discovers structure, adapts fan-out, writes analysis-ready artifacts, and exposes JSON plans and manifests so humans, CI jobs, and agents can all trust the run.
One run
Run `doctor --json`, `doctor --simple`, or `--dry-run --agent` to resolve local config, planned outputs, and warnings before touching S3.
Use the default recursive list path, or diff two buckets; startup discovery and cached hints provide parallel segments automatically.
Write a run manifest, then use `manifest-summary --check` to validate status, artifact hashes, row counts, schema metadata, and exit class.
Load Parquet into DuckDB, pandas, or pyarrow, or stream TSV/NDJSON when shell tools or agents need rows directly.
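The manifest check in the steps above is done by `manifest-summary --check`. The sketch below only illustrates what verifying status and artifact hashes amounts to. The manifest layout used here (`status`, `artifacts`, `path`, `sha256`) and the `"success"` value are hypothetical, not the tool's actual schema.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large artifacts are not read at once."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_manifest(manifest: dict, base_dir: Path) -> list:
    """Return a list of problems; an empty list means everything matched."""
    problems = []
    if manifest.get("status") != "success":  # hypothetical status value
        problems.append(f"status is {manifest.get('status')!r}")
    for artifact in manifest.get("artifacts", []):
        path = base_dir / artifact["path"]
        if not path.is_file():
            problems.append(f"missing artifact: {artifact['path']}")
        elif sha256_of(path) != artifact["sha256"]:
            problems.append(f"hash mismatch: {artifact['path']}")
    return problems


# Demo with a throwaway artifact and a hand-built manifest.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    artifact_path = base / "inventory.parquet"
    artifact_path.write_bytes(b"placeholder bytes")
    manifest = {
        "status": "success",
        "artifacts": [
            {"path": "inventory.parquet", "sha256": sha256_of(artifact_path)}
        ],
    }
    ok_problems = check_manifest(manifest, base)
    artifact_path.write_bytes(b"tampered")  # simulate a modified artifact
    bad_problems = check_manifest(manifest, base)

print(ok_problems)
print(bad_problems)
```

A CI job or agent can treat a non-empty problem list as a failed run, in the same spirit as the exit classes the tool reports.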
Practical runs
Run `list` with a bucket and region; startup discovery finds boundaries and writes Parquet plus key-space counts automatically.
Generate a dry-run JSON plan before the scan, then let an agent inspect warnings, planned artifacts, and config source.
Diff source and target buckets into one Parquet dataset with `DiffFlag`, then filter changed, left-only, and right-only rows downstream.
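The streaming ordered merge behind a diff can be sketched as a two-pointer walk over two key-sorted listings. This is an illustration under assumptions, not the tool's code: the flag strings `left_only`, `right_only`, and `changed` are hypothetical stand-ins for `DiffFlag` values, entries are compared by ETag only, and unchanged keys are skipped.

```python
from typing import Iterable, Iterator, Tuple

Entry = Tuple[str, str]  # (key, etag); each listing must be sorted by key


def merge_diff(left: Iterable[Entry], right: Iterable[Entry]) -> Iterator[Tuple[str, str]]:
    """Yield (key, flag) for differing keys, holding one entry per side."""
    left_it, right_it = iter(left), iter(right)
    l = next(left_it, None)
    r = next(right_it, None)
    while l is not None or r is not None:
        if r is None or (l is not None and l[0] < r[0]):
            yield l[0], "left_only"      # key exists only on the left
            l = next(left_it, None)
        elif l is None or r[0] < l[0]:
            yield r[0], "right_only"     # key exists only on the right
            r = next(right_it, None)
        else:
            if l[1] != r[1]:
                yield l[0], "changed"    # same key, different ETag
            l = next(left_it, None)
            r = next(right_it, None)


source = [("a", "1"), ("b", "2"), ("d", "4")]
target = [("b", "2"), ("c", "3"), ("d", "5")]
rows = list(merge_diff(source, target))
print(rows)
```

Because only the current entry from each side is held, memory use stays flat regardless of how many keys the two buckets contain.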
Compatible endpoints
Endpoint presets, compat-probe, trace JSONL, and provider-specific warnings make it easier to understand how each S3-compatible service behaves before scaling up a scan.
Questions & how-to
For small buckets, sequential listing is fine. s3-turbo-list is for large buckets where the default path should discover structure, list in parallel, and leave structured artifacts.
No hints file is required. Startup discovery probes real `CommonPrefixes` boundaries and caches them. Hints files are optional controls for repeated inventories.
The release fixes root-segment retry resume behavior, adds timeouts to runtime split-probe requests, and keeps clean interrupts from inflating run-manifest fatal errors. The project remains Apache-2.0.