Module trase.tools.aws.s3_helpers
Functions
def bucket_exists(client, bucket: str)
def download_file(key, bucket='trase-storage', version_id=None, path=None, client=None, force=False)-
Download an object stored in S3, but skip if the file already exists.
In particular, the download is skipped only if the local file exists, has the same byte size as the remote object, and is newer than the remote object.
Examples
>>> download_file("candyland/metadata/assets.csv")
'/var/folders/_m/s00h1rdj3m75ptngsv9tv52w0000gp/T/assets.csv'
Args
key- the key to the object on S3, e.g. "candyland/metadata/assets.csv"
bucket:optional- the S3 bucket, defaults to the bucket defined in trase.config.settings
version_id:optional- the specific version ID of the file to download, e.g. "yiJXp1MC0N5U128RoBtlWOcAEjHFVowi"
path:optional- the file path on the local file system to download the file to. If not provided, the file is created in the temporary directory of the operating system (such directories are usually cleared on every restart). This path is checked to determine whether the download can be skipped.
force:bool, optional- whether to re-download the file even if it already exists on the file system.
Returns
The path to the downloaded file
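The default location used when path is omitted can be sketched as follows. This is only an illustration inferred from the example output above, where the object's base name appears inside the system temporary directory; default_download_path is a hypothetical name and not part of this module.

```python
import posixpath
import tempfile
import os


def default_download_path(key: str) -> str:
    """Hypothetical helper: put the object in the OS temp directory under its base name."""
    # S3 keys always use "/" as the separator, so posixpath extracts the file name.
    filename = posixpath.basename(key)
    return os.path.join(tempfile.gettempdir(), filename)


# Prints something like <system temp dir>/assets.csv
print(default_download_path("candyland/metadata/assets.csv"))
```

Because the temporary directory is usually cleared on restart, passing an explicit path is the safer choice when the file must persist.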
def file_differs(client, bucket: str, key: str, path: str, version_id: Optional[str] = None) ‑> bool-
Determine whether the object at s3://BUCKET/KEY is different from a local file. This is done by first comparing the bytesize and then by comparing the "last modified" timestamp.
This algorithm produces a false negative if the file was modified locally after it was downloaded from S3 but its byte size did not change.
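A minimal sketch of the size-then-timestamp comparison, assuming the remote size and "last modified" value have already been fetched (for example from a HeadObject response). The function name differs_from_remote and the exact timestamp rule (the local copy counts as stale when it is older than the remote) are assumptions for illustration, not the module's actual code.

```python
import os
from datetime import datetime, timezone


def differs_from_remote(path: str, remote_size: int, remote_last_modified: datetime) -> bool:
    """Hypothetical stand-in for the comparison; expects a timezone-aware timestamp."""
    # A missing local file always counts as different.
    if not os.path.exists(path):
        return True
    stat = os.stat(path)
    # Step 1: a different byte size means the content must differ.
    if stat.st_size != remote_size:
        return True
    # Step 2: sizes match, so fall back to timestamps. Assumption: the local
    # copy is stale when it was last modified before the remote object was.
    local_modified = datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc)
    return local_modified < remote_last_modified
```

The false negative mentioned above is visible here: a local edit that keeps the byte size unchanged makes the local file newer than the remote, so the sketch returns False.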
If the local file does not exist, this is considered to be different from the remote and this function will return True.
def get_etag(client, bucket: str, key: str, version_id: Optional[str] = None) ‑> str
def get_latest_s3_inventory(client=<botocore.client.S3 object>) ‑> pandas.core.frame.DataFrame-
Read the most recent S3-Inventory snapshot (see read_s3_inventory_manifest()).
def get_version_id(client, bucket: str, key: str) ‑> Optional[str]-
Fetch the latest version id for an S3 object.
Returns: the version id, or None if it does not exist (for example, if the bucket does not have version control enabled).
def head_object(client, bucket: str, key: str, version_id: Optional[str] = None)
def head_objects(client, bucket: str, keys: List[str]) ‑> List[Optional[dict]]-
Head multiple objects at once, issuing as few requests to S3 as possible.
The S3 API has an operation called HeadObject. However, it only allows you to specify a single S3 object. When you have a lot of objects, the fact that you have to issue one request per object means that the whole operation can be quite slow.
This function gets around this by using the ListObjectsV2 operation instead. This allows you to specify an S3 prefix, and returns much of the same information as HeadObject, including LastModified, ETag, and Size.
However, this function does not support specifying a VersionId: it will only list the latest version of each object. If you wish to specify a VersionId, use head_object().
Returns: a list of dictionaries containing the response from ListObjectsV2. Each dictionary will have the keys Key, LastModified, ETag, Size and StorageClass. The order and size of the list will match the keys parameter that was passed in. If an object does not exist, the entry will be None.
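The listing approach described above might look roughly like the sketch below. It assumes a boto3 S3 client and issues one paginated ListObjectsV2 listing under the longest common prefix of all keys. The real function may group keys differently to minimise requests; head_objects_sketch is a hypothetical name used only for illustration.

```python
import os
from typing import Dict, List, Optional


def head_objects_sketch(client, bucket: str, keys: List[str]) -> List[Optional[dict]]:
    """Hypothetical illustration: look up many keys with a single prefix listing."""
    if not keys:
        return []
    # The longest common prefix of all keys narrows the listing as far as possible.
    prefix = os.path.commonprefix(keys)
    wanted = set(keys)
    found: Dict[str, dict] = {}
    # get_paginator("list_objects_v2") follows continuation tokens for us.
    paginator = client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            if obj["Key"] in wanted:
                found[obj["Key"]] = obj
    # Preserve the order and length of `keys`; missing objects become None.
    return [found.get(key) for key in keys]
```

With a real client this would be called as head_objects_sketch(boto3.client("s3"), bucket, keys). When the keys share little or no common prefix, a single listing can scan far more objects than needed, which is one reason a production version might split the keys into several prefixes.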
def list_access_log_keys(after_key=None, client=<botocore.client.S3 object>)-
Return access-log object keys under trase-storage-access-logs/, sorted ascending and skipping zero-byte objects.
Filenames are date-prefixed (YYYY-MM-DD-HH-MM-SS-<id>) so a lexical sort is chronological; after_key (exclusive) acts as a high-watermark for incremental loads.
def list_objects(key_prefix, bucket='trase-storage', client=<botocore.client.S3 object>)
def list_s3_inventory_manifests(client=<botocore.client.S3 object>)-
Return [(snapshot_ts, manifest_key), …] for every S3-Inventory snapshot, sorted oldest→newest. snapshot_ts is the per-run directory name S3 writes (e.g. 2026-05-31T01-00Z).
def object_exists(client, bucket: str, key: str, version_id: Optional[str] = None)
def read_s3_inventory_manifest(manifest_key, client=<botocore.client.S3 object>, progress: bool = False) ‑> pandas.core.frame.DataFrame-
Read a single inventory snapshot, that is, every gz CSV data file referenced by its manifest.json, into a DataFrame with the columns INVENTORY_COLUMNS (all strings).
Set progress=True to print a line per data file (the gz CSVs can be tens of MB each, so this confirms that a long run is still progressing).
def split_s3_path(s3_path: str)-
Splits an S3 path into bucket and key.
Args
s3_path:str- Full S3 path (e.g., 's3://my-bucket/path/to/file.txt')
Returns
tuple- (bucket, key)
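A minimal sketch of such a split, assuming the path always starts with the s3:// scheme. The name split_s3_path_sketch and the ValueError raised for malformed input are assumptions for illustration; the module's own error handling is not documented here.

```python
from typing import Tuple


def split_s3_path_sketch(s3_path: str) -> Tuple[str, str]:
    """Hypothetical illustration: split 's3://bucket/key' into (bucket, key)."""
    scheme = "s3://"
    if not s3_path.startswith(scheme):
        # Assumption: reject anything that is not an s3:// path.
        raise ValueError(f"Not an S3 path: {s3_path!r}")
    # Everything up to the first "/" is the bucket; the rest is the key.
    bucket, _, key = s3_path[len(scheme):].partition("/")
    return bucket, key


print(split_s3_path_sketch("s3://my-bucket/path/to/file.txt"))
# ('my-bucket', 'path/to/file.txt')
```

Plain string partitioning is used rather than a URL parser so that characters such as ? or # inside a key are left untouched.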