Module `trase.tools.aws.s3_helpers`

Functions

def bucket_exists(client, bucket: str)

def download_file(key, bucket='trase-storage', version_id=None, path=None, client=None, force=False)

Download an object stored in S3, but skip if the file already exists.

In particular, we skip the download if the local file exists, has the same bytesize as the remote, and is newer than the remote.

Examples

>>> download_file("candyland/metadata/assets.csv")
'/var/folders/_m/s00h1rdj3m75ptngsv9tv52w0000gp/T/assets.csv'

Args

key : the key to the object on S3, e.g. "candyland/metadata/assets.csv"
bucket : optional: the S3 bucket, defaults to bucket defined in trase.config.settings
version_id : optional: the specific version ID of the file to download, e.g. "yiJXp1MC0N5U128RoBtlWOcAEjHFVowi"
path : optional: the file path on the local file system to download the file to. If not provided, a file is created in the temporary directory of the operating system (such temporary directories are usually cleared on every restart of the operating system). This path is checked to determine whether to skip downloading the file.
force : bool, optional: whether to re-download the file even if it already exists on the file system.

Returns

The path to the downloaded file
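The default destination described under path can be sketched as follows. This is a minimal illustration, assuming the local file name is the last component of the key (which is consistent with the example output above). The helper name default_download_path is hypothetical and not part of this module.

```python
import os
import tempfile


def default_download_path(key: str) -> str:
    """Hypothetical helper: build a path in the OS temporary directory,
    named after the last component of the S3 key."""
    filename = key.rsplit("/", 1)[-1]
    return os.path.join(tempfile.gettempdir(), filename)


# The directory part depends on the operating system and user.
print(default_download_path("candyland/metadata/assets.csv"))
```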

def file_differs(client, bucket: str, key: str, path: str, version_id: Optional[str] = None) ‑> bool

Determine whether the object at s3://BUCKET/KEY is different from a local file. This is done by first comparing the bytesize and then by comparing the "last modified" timestamp.

This algorithm will produce a false negative if the file was modified locally after it was downloaded from S3 but its byte size did not change.

If the local file does not exist, this is considered to be different from the remote and this function will return True.
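The comparison described above might look roughly like the sketch below. It is not the module's implementation: the function name local_file_differs is hypothetical, the remote size and timestamp are passed in directly instead of being fetched from S3, and it assumes that a local copy older than the remote object counts as different.

```python
import os
import tempfile
from datetime import datetime, timezone


def local_file_differs(path: str, remote_size: int, remote_last_modified: datetime) -> bool:
    """Hypothetical sketch: compare byte size first, then modification time."""
    if not os.path.exists(path):
        return True  # a missing local file counts as different
    stat = os.stat(path)
    if stat.st_size != remote_size:
        return True  # the sizes disagree, so the contents must differ
    local_modified = datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc)
    # Assumption: a local copy older than the remote object is stale.
    return local_modified < remote_last_modified


# Usage: a freshly written 5-byte file compared against made-up remote metadata.
with tempfile.NamedTemporaryFile(delete=False) as handle:
    handle.write(b"hello")
    demo_path = handle.name

long_ago = datetime(2000, 1, 1, tzinfo=timezone.utc)
far_future = datetime(2999, 1, 1, tzinfo=timezone.utc)

unchanged = local_file_differs(demo_path, 5, long_ago)   # same size, local is newer
resized = local_file_differs(demo_path, 6, long_ago)     # size mismatch
stale = local_file_differs(demo_path, 5, far_future)     # same size, remote is newer
os.remove(demo_path)
missing = local_file_differs(demo_path, 5, long_ago)     # local file no longer exists
```

The false negative mentioned above corresponds to the unchanged case: a local edit that keeps the byte size the same leaves the file newer than the remote, so it is reported as not different.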

def get_etag(client, bucket: str, key: str, version_id: Optional[str] = None) ‑> str

def get_latest_s3_inventory(client=<botocore.client.S3 object>) ‑> pandas.core.frame.DataFrame

Read the most recent S3-Inventory snapshot (see read_s3_inventory_manifest()).

def get_version_id(client, bucket: str, key: str) ‑> Optional[str]

Fetch the latest version id for an S3 object.

Returns: the version id or None if it does not exist (for example, if the bucket does not have version control enabled)
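One way such a lookup could work is sketched below. It assumes the version id is read from a HeadObject response, which only carries a VersionId entry when the bucket is versioned. Whether this module uses that call is not stated here. The names latest_version_id and StubClient are hypothetical, and the stub stands in for a boto3 S3 client so the sketch runs without AWS access.

```python
from typing import Optional


def latest_version_id(client, bucket: str, key: str) -> Optional[str]:
    """Hypothetical sketch: read VersionId from a HeadObject response."""
    response = client.head_object(Bucket=bucket, Key=key)
    # The entry is absent when the bucket has no version control enabled.
    return response.get("VersionId")


class StubClient:
    """Stand-in for a boto3 S3 client that returns a fixed response."""

    def __init__(self, response: dict):
        self._response = response

    def head_object(self, Bucket: str, Key: str) -> dict:
        return self._response


versioned = latest_version_id(
    StubClient({"VersionId": "yiJXp1MC0N5U128RoBtlWOcAEjHFVowi"}),
    "example-bucket",
    "candyland/metadata/assets.csv",
)
unversioned = latest_version_id(
    StubClient({}), "example-bucket", "candyland/metadata/assets.csv"
)
```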

def head_object(client, bucket: str, key: str, version_id: Optional[str] = None)

def head_objects(client, bucket: str, keys: List[str]) ‑> List[Optional[dict]]

Head multiple objects at once, issuing as few requests to S3 as possible.

The S3 API has an operation called HeadObject. However, it only allows you to specify a single S3 object. When you have a lot of objects, the fact that you have to issue one request per object means that the whole operation can be quite slow.

This function gets around this by using the ListObjectsV2 operation instead. This allows you to specify an S3 prefix, and returns much of the same information as HeadObject, including LastModified, ETag, and Size.

However, this function does not support specifying a VersionId: it will only list the latest version of each object. If you need a particular VersionId, use head_object().

Returns: a list of dictionaries containing the response from ListObjectsV2. Each dictionary will have the keys Key, LastModified, ETag, Size and StorageClass. The order and size of the list will match the keys parameter that was passed in. If an object does not exist, the entry will be None.
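The approach described above can be sketched as follows. This is an illustration under assumptions, not the module's code: it lists everything under the longest common prefix of the requested keys, follows continuation tokens, and then reorders the results to match the input. The names head_many and FakeClient are hypothetical, and FakeClient replaces a real boto3 client so the example runs offline.

```python
import os
from typing import Dict, List, Optional


def head_many(client, bucket: str, keys: List[str]) -> List[Optional[dict]]:
    """Hypothetical sketch: one paginated listing instead of one request per key."""
    if not keys:
        return []
    prefix = os.path.commonprefix(keys)  # longest shared leading string
    wanted = set(keys)
    found: Dict[str, dict] = {}
    kwargs = {"Bucket": bucket, "Prefix": prefix}
    while True:
        page = client.list_objects_v2(**kwargs)
        for entry in page.get("Contents", []):
            if entry["Key"] in wanted:
                found[entry["Key"]] = entry
        if not page.get("IsTruncated"):
            break
        kwargs["ContinuationToken"] = page["NextContinuationToken"]
    # Preserve the order and length of the input; missing objects become None.
    return [found.get(key) for key in keys]


class FakeClient:
    """Stand-in for a boto3 S3 client that serves a single fixed page."""

    def list_objects_v2(self, **kwargs):
        objects = [{"Key": "a/1.csv", "Size": 3}, {"Key": "a/2.csv", "Size": 5}]
        matching = [o for o in objects if o["Key"].startswith(kwargs["Prefix"])]
        return {"Contents": matching, "IsTruncated": False}


result = head_many(FakeClient(), "example-bucket", ["a/2.csv", "a/missing.csv", "a/1.csv"])
```

A trade-off of this design is that a short common prefix can list many objects that were never requested, so it pays off mainly when the keys share a deep prefix.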

def list_access_log_keys(after_key=None, client=<botocore.client.S3 object>)

Return access-log object keys under trase-storage-access-logs/, sorted ascending and skipping zero-byte objects.

Filenames are date-prefixed (YYYY-MM-DD-HH-MM-SS-<id>) so a lexical sort is chronological; after_key (exclusive) acts as a high-watermark for incremental loads.
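The high-watermark behaviour of after_key can be illustrated with plain strings. The sketch below assumes only what is stated above (lexical order equals chronological order, and after_key is exclusive). The function name keys_after and the sample keys are made up for the example.

```python
from typing import Iterable, List, Optional


def keys_after(keys: Iterable[str], after_key: Optional[str] = None) -> List[str]:
    """Hypothetical sketch: sort keys and keep those strictly after the watermark."""
    ordered = sorted(keys)
    if after_key is None:
        return ordered
    return [key for key in ordered if key > after_key]


# Made-up keys following the YYYY-MM-DD-HH-MM-SS-<id> naming pattern.
sample_keys = [
    "2024-03-01-10-00-00-BBB",
    "2024-02-28-23-59-59-AAA",
    "2024-03-01-10-05-00-CCC",
]
everything = keys_after(sample_keys)
incremental = keys_after(sample_keys, after_key="2024-03-01-10-00-00-BBB")
```

An incremental load would store the last key it processed and pass it as after_key on the next run.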

def list_objects(key_prefix, bucket='trase-storage', client=<botocore.client.S3 object>)

def list_s3_inventory_manifests(client=<botocore.client.S3 object>)

Return [(snapshot_ts, manifest_key), …] for every S3-Inventory snapshot, sorted oldest→newest.

snapshot_ts is the per-run directory name S3 writes (e.g. 2026-05-31T01-00Z).
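If a real timestamp is needed, a snapshot_ts string in the format shown above can be parsed as in this minimal sketch. The helper name parse_snapshot_ts is hypothetical, and the format string is inferred from the single example given here.

```python
from datetime import datetime, timezone


def parse_snapshot_ts(snapshot_ts: str) -> datetime:
    """Hypothetical helper: parse a directory name such as 2026-05-31T01-00Z as UTC."""
    parsed = datetime.strptime(snapshot_ts, "%Y-%m-%dT%H-%MZ")
    return parsed.replace(tzinfo=timezone.utc)


parsed = parse_snapshot_ts("2026-05-31T01-00Z")
print(parsed.isoformat())
```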

def object_exists(client, bucket: str, key: str, version_id: Optional[str] = None)

def read_s3_inventory_manifest(manifest_key, client=<botocore.client.S3 object>, progress: bool = False) ‑> pandas.core.frame.DataFrame

Read a single inventory snapshot, that is, every gz CSV data file referenced by its manifest.json, into a DataFrame whose columns are given by INVENTORY_COLUMNS (all strings).

Set progress=True to print a line per data file (the gz CSVs can be tens of MB each, so this confirms that a long run is still progressing).
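The overall flow (walk the data files listed in the manifest, decompress each one, parse the CSV rows as strings) can be sketched with the standard library alone. This is not the module's implementation. It returns a list of dictionaries instead of a DataFrame, takes column names from the manifest's fileSchema field instead of INVENTORY_COLUMNS, and receives the object bytes through a hypothetical fetch callable instead of an S3 client. The function name read_inventory_rows is also made up.

```python
import csv
import gzip
import io
from typing import Callable, Dict, List


def read_inventory_rows(
    manifest: dict, fetch: Callable[[str], bytes], progress: bool = False
) -> List[Dict[str, str]]:
    """Hypothetical sketch: read every gz CSV listed in an inventory manifest."""
    columns = [name.strip() for name in manifest["fileSchema"].split(",")]
    files = manifest["files"]
    rows: List[Dict[str, str]] = []
    for index, entry in enumerate(files, start=1):
        if progress:
            print(f"[{index}/{len(files)}] {entry['key']}")
        text = gzip.decompress(fetch(entry["key"])).decode("utf-8")
        for record in csv.reader(io.StringIO(text)):
            rows.append(dict(zip(columns, record)))  # every value stays a string
    return rows


# Made-up manifest and payload standing in for objects stored on S3.
manifest = {"fileSchema": "Bucket, Key, Size", "files": [{"key": "data/part-0.csv.gz"}]}
payloads = {"data/part-0.csv.gz": gzip.compress(b'"trase-storage","a/1.csv","3"\n')}
rows = read_inventory_rows(manifest, payloads.__getitem__, progress=True)
```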

def split_s3_path(s3_path: str)

Splits an S3 path into bucket and key.

Args

s3_path : str: Full S3 path (e.g., 's3://my-bucket/path/to/file.txt')

Returns

tuple: (bucket, key)
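A minimal sketch of this kind of split is shown below. The function name split_path is hypothetical, and raising ValueError for paths without the s3:// scheme is an assumption. How the real function handles malformed input is not documented here.

```python
from typing import Tuple


def split_path(s3_path: str) -> Tuple[str, str]:
    """Hypothetical sketch: split 's3://bucket/key' into (bucket, key)."""
    scheme = "s3://"
    if not s3_path.startswith(scheme):
        # Assumption: reject anything that is not an s3:// path.
        raise ValueError(f"not an S3 path: {s3_path!r}")
    bucket, _, key = s3_path[len(scheme):].partition("/")
    return bucket, key


bucket, key = split_path("s3://my-bucket/path/to/file.txt")
```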