Module `trase.tools.aws.metadata`

This module creates and manages metadata for objects uploaded to AWS S3.

The aim of this metadata is to record some basic provenance for our S3 objects:

- The script that produced the object (file path relative to the Git root, plus the commit hash)
- The person who ran the script

The metadata is written to a YAML file on disk (next to the data file) so that it can be inspected before upload. It is not uploaded as a separate S3 object. Instead, the same information is attached to the data object itself as S3 user-defined metadata (see s3_metadata_from_dict()). This can be viewed with, for example:

aws s3api head-object --bucket trase-storage --key path/to/object.csv

or via the Trase CLI / boto3 head_object call (see the "Viewing S3 metadata" section of the preprocessing documentation).
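
As a minimal sketch of the boto3 route, the helper below wraps head_object and returns only the user-defined metadata. The name read_object_metadata is hypothetical and not part of this module, and the client is assumed to be a boto3 S3 client such as boto3.client("s3").

```python
def read_object_metadata(client, bucket: str, key: str) -> dict:
    """Return the user-defined metadata attached to an S3 object.

    `client` is assumed to be a boto3 S3 client, e.g. boto3.client("s3").
    This helper is illustrative only; it is not part of trase.tools.aws.metadata.
    """
    response = client.head_object(Bucket=bucket, Key=key)
    # boto3 returns user-defined metadata under the "Metadata" key,
    # with the "x-amz-meta-" header prefix already stripped.
    return response.get("Metadata", {})
```

A call would look like read_object_metadata(client, "trase-storage", "path/to/object.csv"), mirroring the CLI command above.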

Metadata generation is always best-effort: if it fails for any reason (for example, the script is not in a Git repository) the upload of the data object still proceeds.

Functions

def dataclass_to_dict(object: ) ‑> dict

def generate_metadata(script_path: str) ‑> Metadata

def generate_metadata_path(path)

def get_commit_hash(script_path)

def get_metadata_from_git(script_path) ‑> Script

Fetches Git commit hash information for the script. Requires "git" to be installed and assumes that the file is inside a Git repository.
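
One way such a lookup could work is sketched below, assuming the hash of interest is the repository's current HEAD; the module may instead use the last commit that touched the file. The name commit_hash_for is hypothetical, and returning None on failure is a choice made for this sketch rather than documented behaviour.

```python
import os
import subprocess
from typing import Optional


def commit_hash_for(script_path: str) -> Optional[str]:
    """Return the HEAD commit hash of the repository containing script_path.

    Illustrative only: returns None if git is not installed or the file
    is not inside a Git repository.
    """
    directory = os.path.dirname(os.path.abspath(script_path))
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            cwd=directory,
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # OSError: git executable or directory is missing.
        # CalledProcessError: git ran but failed, e.g. "not a git repository".
        return None
    return result.stdout.strip()
```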

def get_script_path_relative_to_root(script_path)

def s3_metadata_from_dict(metadata: dict) ‑> dict

Flatten a metadata dictionary (as produced by dataclass_to_dict() or read back from the YAML file) into the string key/value pairs that can be attached to an S3 object as user-defined metadata.

Keys whose value is missing are omitted. All values are coerced to strings, as required by S3.
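
The flattening described above can be sketched as follows. This is not the module's implementation: the function name flatten_for_s3 and the "-" separator used to join nested keys are assumptions made for illustration.

```python
def flatten_for_s3(metadata: dict, prefix: str = "", separator: str = "-") -> dict:
    """Flatten a nested dictionary into string key/value pairs.

    Illustrative only: the key-joining scheme is an assumption.
    """
    flat = {}
    for key, value in metadata.items():
        name = f"{prefix}{separator}{key}" if prefix else str(key)
        if value is None:
            # Missing values are omitted entirely.
            continue
        if isinstance(value, dict):
            # Recurse into nested dictionaries, carrying the joined key.
            flat.update(flatten_for_s3(value, name, separator))
        else:
            # S3 user-defined metadata values must be strings.
            flat[name] = str(value)
    return flat
```

With this scheme a nested entry such as {"script": {"path": "a/b.py"}} becomes {"script-path": "a/b.py"}.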

def s3_metadata_from_file(metadata_path: str) ‑> dict

Read the metadata YAML file from disk and return the S3 user-defined metadata dictionary (see s3_metadata_from_dict()). Returns an empty dictionary if the file does not exist.

def upload_file_with_metadata(client, bucket: str, key: str, path: str, metadata_path: str = None, printer=<bound method Logger.debug of <Logger trase.tools.aws.metadata (WARNING)>>)

Uploads a file to S3, attaching the contents of its metadata file (if present) as S3 user-defined metadata on the object.

The metadata is read from the YAML file on disk; it is not uploaded as a separate S3 object. If the metadata file cannot be read for any reason the object is still uploaded, just without the extra metadata.

Args

bucket: the s3 bucket that the file should be uploaded to
key: the s3 key that the file should be uploaded to
path: the location of the file on disk
metadata_path: the location of the metadata file on disk. Defaults to the data file with a ".yml" extension.
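
The best-effort behaviour described above might look roughly like the sketch below. It relies on boto3's upload_file accepting an ExtraArgs dictionary with a "Metadata" entry; the function name upload_with_metadata and the load_metadata callable are hypothetical stand-ins, not this module's API.

```python
def upload_with_metadata(client, bucket: str, key: str, path: str, load_metadata) -> dict:
    """Upload a file, attaching metadata when it can be loaded.

    `client` is assumed to be a boto3 S3 client. `load_metadata` is a
    hypothetical zero-argument callable returning a dict of strings.
    """
    try:
        metadata = load_metadata()
    except Exception:
        # Best-effort: a metadata failure must never block the upload.
        metadata = {}
    extra_args = {"Metadata": metadata} if metadata else None
    client.upload_file(path, bucket, key, ExtraArgs=extra_args)
    return metadata
```

Catching every exception is deliberate here: the data object matters more than its provenance record, so the upload goes ahead regardless.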

def write_csv_for_upload(df: pd.DataFrame, key: str, script_path: str = None, path: str = None, metadata_path: str = None, bucket: str = 'trase-storage', do_upload: bool = None, skip_metadata: bool = False, **pd_to_csv_kwargs)

Writes a CSV file and either tells the user how to upload it to S3 or performs the upload itself.

By design, the upload to S3 does not happen by default. We want the user to explicitly choose to upload, to minimise the chance that they accidentally overwrite data on S3. Putting a "human in the loop" also gives the user a chance to inspect the data before upload.

Args

df: the Pandas dataframe that should be serialized
key: the s3 key that the dataframe should be uploaded to
script_path : optional: the path of the script that generated the file. Often this can be taken from __file__. If it is not supplied then we will inspect the call stack and take the filename of the file which called this function, which may or may not be what was intended.
path : optional: the (temporary) file location where the Pandas dataframe will be written to before uploading. The reason it is written to disk is to give the user the chance to inspect its contents before uploading. If not provided then a path will be generated in the operating system's temporary directory.
metadata_path : optional: the (temporary) file location where the metadata will be written to before uploading. If not provided then it will be written next to the data file with a ".yml" extension.
bucket : optional: the s3 bucket to upload the file and metadata to. Defaults to the bucket defined in trase.config.settings.
do_upload : optional: whether to actually do the upload or just print to the user that they should do it themselves. If not provided then we will look for the presence of "--upload" in sys.argv to determine its value.
skip_metadata : optional: if true, no metadata file is generated and no S3 metadata is attached to the object.
pd_to_csv_kwargs: keyword arguments that will be passed to pandas.DataFrame.to_csv
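
The two fallbacks described for do_upload and script_path can be sketched as below. Both helper names are hypothetical, and the stack depth used to find the caller is an assumption: inside the real function the relevant frame may sit at a different index.

```python
import inspect
import sys
from typing import Optional, Sequence


def resolve_do_upload(
    do_upload: Optional[bool] = None, argv: Optional[Sequence[str]] = None
) -> bool:
    """Return the explicit value if given, else look for "--upload" in argv."""
    if do_upload is not None:
        return do_upload
    args = sys.argv if argv is None else argv
    return "--upload" in args


def calling_script_path() -> str:
    """Return the filename of whichever code called this function."""
    # stack()[0] is this function's own frame; stack()[1] is its caller.
    return inspect.stack()[1].filename
```

Running a script as "python my_script.py --upload" would then make resolve_do_upload() return True, while an explicit do_upload=False still wins over the command line.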

def write_file_for_upload(key: str, do_write, script_path: str = None, path: str = None, metadata_path: str = None, bucket: str = 'trase-storage', do_upload: bool = None, skip_metadata: bool = False)

def write_geopandas_for_upload(gdf: gpd.GeoDataFrame, key: str, script_path: str = None, path: str = None, metadata_path: str = None, bucket: str = 'trase-storage', do_upload: bool = None, skip_metadata: bool = False, driver: str = 'geojson', **to_file_kwargs)

Writes a GeoDataFrame to disk and optionally uploads to S3 with metadata.

Supported drivers:

- "parquet" -> GeoParquet
- "geojson" -> GeoJSON
- "gpkg" -> GeoPackage
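
A dispatch over these three driver names could be sketched as follows, using GeoDataFrame.to_parquet and GeoDataFrame.to_file from geopandas. The function name write_geodataframe is hypothetical, and raising ValueError for unknown drivers is an assumption rather than documented behaviour.

```python
def write_geodataframe(gdf, path: str, driver: str = "geojson", **to_file_kwargs) -> None:
    """Write a GeoDataFrame to `path` in the format named by `driver`.

    Illustrative only: `gdf` is assumed to be a geopandas.GeoDataFrame.
    """
    driver = driver.lower()
    if driver == "parquet":
        # GeoParquet is written through the dedicated to_parquet method.
        gdf.to_parquet(path, **to_file_kwargs)
    elif driver == "geojson":
        gdf.to_file(path, driver="GeoJSON", **to_file_kwargs)
    elif driver == "gpkg":
        gdf.to_file(path, driver="GPKG", **to_file_kwargs)
    else:
        raise ValueError(f"Unsupported driver: {driver!r}")
```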

def write_json_for_upload(data, key: str, script_path: str = None, path: str = None, metadata_path: str = None, bucket: str = 'trase-storage', do_upload: bool = None, skip_metadata: bool = False, **json_dump_kwargs)

def write_metadata(metadata: Metadata, path)

def write_parquet_for_upload(df: pd.DataFrame, key: str, is_polars=False, script_path: str = None, path: str = None, metadata_path: str = None, bucket: str = 'trase-storage', do_upload: bool = None, skip_metadata: bool = False, **to_parquet_kwargs)

Classes

class Metadata (script: Script, user: str)

Instance variables

var script : Script
var user : str

class Script

The location of a script in a GitHub repository

Instance variables

var commit_hash : str | None
var github_repository : str | None
var github_username : str | None
var path : str | None
var type : str | None
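
To show how these two classes fit together, here is a minimal sketch using stand-in dataclasses with the fields listed above. The None defaults, the field order and the example values (including "github" for type) are assumptions, and dataclasses.asdict stands in for whatever dataclass_to_dict() does internally.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class Script:
    # Stand-in for the documented class; defaults are assumed.
    path: Optional[str] = None
    commit_hash: Optional[str] = None
    github_username: Optional[str] = None
    github_repository: Optional[str] = None
    type: Optional[str] = None


@dataclass
class Metadata:
    script: Script
    user: str


metadata = Metadata(
    script=Script(path="scripts/example.py", type="github"),
    user="alice",
)
# asdict converts nested dataclasses into nested dictionaries.
as_dict = asdict(metadata)
```

The resulting dictionary nests the script fields under a "script" key next to "user", which is the shape a YAML dump or a flattening step would then consume.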