Module trase.tools.aws.s3_helpers

Functions

def bucket_exists(client, bucket: str)
def download_file(key, bucket='trase-storage', version_id=None, path=None, client=None, track=True, force=False)

Download an object stored in S3, but skip if the file already exists.

In particular, we skip the download if the local file exists, is at least as new as the remote object, and has the same bytesize as the remote.

Examples

>>> download_file("candyland/metadata/assets.csv")
'/var/folders/_m/s00h1rdj3m75ptngsv9tv52w0000gp/T/assets.csv'

Args

key
the key to the object on S3, e.g. "candyland/metadata/assets.csv"
bucket : optional
the S3 bucket, defaults to bucket defined in trase.config.settings
version_id : optional
the specific version ID of the file to download, e.g. "yiJXp1MC0N5U128RoBtlWOcAEjHFVowi"
path : optional
the file path on the local file system to download the file to. If not provided, a file will be created in the operating system's temporary directory (such temporary directories are usually cleared when the operating system restarts). This path is checked to determine whether the download can be skipped.
track : bool, optional
whether to add the object to the S3 tracker
force : bool, optional
whether to re-download the file even if it already exists on the file system.

Returns

The path to the downloaded file
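The default-path behaviour described above can be sketched as follows. Note that `resolve_target_path` is a hypothetical helper written for illustration, not a function in this module:

```python
import os
import tempfile


def resolve_target_path(key, path=None):
    # Hypothetical sketch: mirror the documented default of placing the
    # downloaded file in the OS temporary directory, named after the
    # final component of the S3 key.
    if path is not None:
        return path
    return os.path.join(tempfile.gettempdir(), os.path.basename(key))
```

This matches the example output above, where "candyland/metadata/assets.csv" resolves to a file named assets.csv inside the temporary directory.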

def file_differs(client, bucket: str, key: str, path: str, version_id: Optional[str] = None) ‑> bool

Determine whether the object at s3://BUCKET/KEY is different from a local file. This is done by first comparing the bytesize and then by comparing the "last modified" timestamp.

This check will produce a false negative if the file was modified locally after it was downloaded from S3 but its byte size did not change.

If the local file does not exist, this is considered to be different from the remote and this function will return True.
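The documented comparison can be sketched as pure logic, separated from any S3 calls. The function name and parameters here are hypothetical, chosen for illustration:

```python
from datetime import datetime, timezone


def file_differs_sketch(local_size, local_mtime, remote_size,
                        remote_last_modified, local_exists=True):
    # A missing local file is considered different from the remote.
    if not local_exists:
        return True
    # Compare bytesizes first: a size mismatch means the files differ.
    if local_size != remote_size:
        return True
    # Then compare "last modified" timestamps: a remote object newer
    # than the local file means the files differ.
    return remote_last_modified > local_mtime
```

As noted above, a local edit that preserves the byte size and keeps the local timestamp newer would slip through this comparison undetected.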

def get_etag(client, bucket: str, key: str, version_id: Optional[str] = None) ‑> str
def get_latest_s3_inventory(client=<botocore.client.S3 object>) ‑> pandas.core.frame.DataFrame
def get_version_id(client, bucket: str, key: str) ‑> Optional[str]

Fetch the latest version id for an S3 object.

Returns: the version id, or None if there is no version id (for example, if the bucket does not have versioning enabled)

def head_object(client, bucket: str, key: str, version_id: Optional[str] = None)
def head_objects(client, bucket: str, keys: List[str]) ‑> List[Optional[dict]]

Head multiple objects at once, issuing as few requests to S3 as possible.

The S3 API has an operation called HeadObject, but it only accepts a single S3 object. When you have many objects, issuing one request per object makes the whole operation quite slow.

This function gets around this by using the ListObjectsV2 operation instead. That operation accepts an S3 prefix and returns much of the same information as HeadObject, including LastModified, ETag, and Size.

However, this function does not support specifying a specific VersionId: it only lists the latest version of each object. If you need a specific VersionId, use head_object().

Returns: a list of dictionaries containing the response from ListObjectsV2. Each dictionary will have the keys Key, LastModified, ETag, Size and StorageClass. The order and length of the list will match the keys parameter that was passed in. If an object does not exist, the corresponding entry will be None.
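The re-ordering described in the Returns note can be sketched as follows. `order_listing` is a hypothetical helper showing how ListObjectsV2-style entries might be matched back to the requested keys; it is not part of this module:

```python
def order_listing(keys, listing):
    # `listing` is a list of dicts, each with at least a "Key" entry,
    # as returned under "Contents" by ListObjectsV2. Build an index so
    # each requested key can be looked up in constant time.
    by_key = {entry["Key"]: entry for entry in listing}
    # Preserve the order and length of `keys`; absent keys map to None.
    return [by_key.get(k) for k in keys]
```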

def list_objects(key_prefix, bucket='trase-storage', client=<botocore.client.S3 object>)
def object_exists(client, bucket: str, key: str, version_id: Optional[str] = None)
def split_s3_path(s3_path: str)

Splits an S3 path into bucket and key.

Args

s3_path : str
Full S3 path (e.g., 's3://my-bucket/path/to/file.txt')

Returns

tuple
(bucket, key)
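A minimal sketch of the documented behaviour, assuming the path always carries an "s3://" scheme (this is an illustrative reimplementation, not the module's source):

```python
def split_s3_path_sketch(s3_path):
    # Drop the "s3://" scheme, then split on the first "/" so that the
    # bucket is everything before it and the key is everything after.
    without_scheme = s3_path.removeprefix("s3://")
    bucket, _, key = without_scheme.partition("/")
    return bucket, key
```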