Module trase.tools.aws.metadata
This module creates and manages metadata for objects uploaded to AWS S3.
The aim of this metadata is to record some basic provenance for our S3 objects:
- The script that produced the object (file path relative to Git and commit hash)
- The person that ran the script
The metadata is written to a YAML file on disk (next to the data file) so that it can
be inspected before upload. It is not uploaded as a separate S3 object. Instead, the
same information is attached to the data object itself as S3 user-defined metadata (see
s3_metadata_from_dict()). This can be viewed with, for example:
aws s3api head-object --bucket trase-storage --key path/to/object.csv
or via the Trase CLI / boto3 head_object call (see the "Viewing S3 metadata" section
of the preprocessing documentation).
Metadata generation is always best-effort: if it fails for any reason (for example, the script is not in a Git repository) the upload of the data object still proceeds.
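As a rough illustration of the boto3 route mentioned above, the following sketch reads the user-defined metadata back from an object. The helper name `read_user_metadata` is hypothetical and not part of this module. It assumes a client that behaves like a boto3 S3 client, whose `head_object` response carries user-defined metadata under the `Metadata` key.

```python
def read_user_metadata(client, bucket: str, key: str) -> dict:
    """Return the user-defined metadata attached to an S3 object.

    Hypothetical helper for illustration only. `client` is assumed to
    behave like a boto3 S3 client, e.g. boto3.client("s3").
    """
    response = client.head_object(Bucket=bucket, Key=key)
    # boto3 exposes the object's user-defined metadata as a plain dict
    # under "Metadata"; fall back to an empty dict if it is absent.
    return dict(response.get("Metadata", {}))
```

With a real client this would be called as, for example, `read_user_metadata(boto3.client("s3"), "trase-storage", "path/to/object.csv")`.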
Functions
def dataclass_to_dict(object:) ‑> dict
def generate_metadata(script_path: str) ‑> Metadata
def generate_metadata_path(path)
def get_commit_hash(script_path)
def get_metadata_from_git(script_path) ‑> Script-
Fetch Git commit hash information. This requires the user to have "git" installed and assumes that the file is in a Git repository.
def get_script_path_relative_to_root(script_path)
def s3_metadata_from_dict(metadata: dict) ‑> dict-
Flatten a metadata dictionary (as produced by dataclass_to_dict() or read back from the YAML file) into the string key/value pairs that can be attached to an S3 object as user-defined metadata. Keys whose value is missing are omitted. All values are coerced to strings, as required by S3.
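The flattening described above might look roughly like the sketch below. This is not the module's actual implementation: the function name `flatten_metadata` is hypothetical, and joining nested keys with "-" is an assumption, since the real key naming scheme is not documented here.

```python
def flatten_metadata(metadata: dict, prefix: str = "", sep: str = "-") -> dict:
    """Flatten a nested metadata dict into string key/value pairs.

    Illustrative sketch only: the "-" separator for nested keys is an
    assumption, not the documented behaviour of s3_metadata_from_dict().
    """
    flat = {}
    for key, value in metadata.items():
        name = f"{prefix}{sep}{key}" if prefix else str(key)
        if value is None:
            # Keys whose value is missing are omitted.
            continue
        if isinstance(value, dict):
            # Recurse into nested dictionaries, carrying the key prefix.
            flat.update(flatten_metadata(value, prefix=name, sep=sep))
        else:
            # S3 user-defined metadata values must be strings.
            flat[name] = str(value)
    return flat
```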
def s3_metadata_from_file(metadata_path: str) ‑> dict-
Read the metadata YAML file from disk and return the S3 user-defined metadata dictionary (see s3_metadata_from_dict()). Returns an empty dictionary if the file does not exist.
def upload_file_with_metadata(client, bucket: str, key: str, path: str, metadata_path: str = None, printer=<bound method Logger.debug of <Logger trase.tools.aws.metadata (WARNING)>>)-
Uploads a file to S3, attaching the contents of its metadata file (if present) as S3 user-defined metadata on the object.
The metadata is read from the YAML file on disk; it is not uploaded as a separate S3 object. If the metadata file cannot be read for any reason the object is still uploaded, just without the extra metadata.
Args
bucket- the s3 bucket that the file should be uploaded to
key- the s3 key that the file should be uploaded to
path- the location of the file on disk
metadata_path- the location of the metadata file on disk. Defaults to the data file with a ".yml" extension.
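The best-effort behaviour described above could be sketched as follows. The name `upload_best_effort` and the `load_metadata` callable are hypothetical, introduced only for this example. The sketch assumes a client exposing boto3's `upload_file(Filename, Bucket, Key, ExtraArgs=...)`, where `ExtraArgs={"Metadata": {...}}` sets user-defined metadata on the object.

```python
def upload_best_effort(client, bucket: str, key: str, path: str, load_metadata) -> dict:
    """Upload a file, attaching metadata only if it can be loaded.

    Hypothetical sketch: `load_metadata` is any zero-argument callable
    returning a dict of string key/value pairs (for example, a wrapper
    around reading the YAML file next to the data file).
    """
    try:
        s3_metadata = load_metadata()
    except Exception:
        # Metadata is best-effort: a failure here must not block the upload.
        s3_metadata = {}

    # Only pass ExtraArgs when there is metadata to attach.
    extra_args = {"Metadata": s3_metadata} if s3_metadata else None
    client.upload_file(Filename=path, Bucket=bucket, Key=key, ExtraArgs=extra_args)
    return s3_metadata
```

Returning the metadata that was actually attached makes it easy for a caller to log whether the object went up with or without provenance information.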
def write_csv_for_upload(df: pd.DataFrame, key: str, script_path: str = None, path: str = None, metadata_path: str = None, bucket: str = 'trase-storage', do_upload: bool = None, skip_metadata: bool = False, **pd_to_csv_kwargs)-
Writes a CSV file and either tells the user how to upload it to S3 or performs the upload itself.
It is a very intentional design that, by default, the upload to S3 does not occur. We want the user to explicitly choose to do the upload to minimise the chance that they accidentally overwrite data on S3. By putting a "human in the loop" we also give the user a chance to inspect the data before upload.
Args
df- the Pandas dataframe that should be serialized
key- the s3 key that the dataframe should be uploaded to
script_path:optional- the path of the script that generated the file. Often this can be taken from __file__. If it is not supplied then we will inspect the call stack and take the filename of the file which called this function, which may or may not be what was intended.
path:optional- the (temporary) file location where the Pandas dataframe will be written to before uploading. The reason it is written to disk is to give the user the chance to inspect its contents before uploading. If not provided then a path will be generated in the operating system's temporary directory.
metadata_path:optional- the (temporary) file location where the metadata will be written to before uploading. If not provided then it will be written next to the data file with a ".yml" extension.
bucket:optional- the s3 bucket to upload the file and metadata to. Defaults to the bucket defined in trase.config.settings.
do_upload:optional- whether to actually do the upload or just print to the user that they should do it themselves. If not provided then we will look for the presence of "--upload" in sys.argv to determine its value.
skip_metadata:optional- if true, no metadata file is generated and no S3 metadata is attached to the object.
pd_to_csv_kwargs- keyword arguments that will be passed to the dataframe's to_csv method
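The way do_upload falls back to the command line can be sketched as below. The function name `resolve_do_upload` is hypothetical, and the flag is assumed to be spelled `--upload` (two ASCII hyphens). The real function may resolve the value differently.

```python
import sys


def resolve_do_upload(do_upload=None, argv=None) -> bool:
    """Decide whether an upload should actually happen.

    Illustrative sketch: an explicit True/False always wins; otherwise
    the presence of "--upload" in the argument list decides.
    """
    if do_upload is not None:
        return bool(do_upload)
    # Fall back to the process arguments when none are supplied.
    argv = sys.argv if argv is None else argv
    return "--upload" in argv
```

Under this reading, running a script without the flag only writes the file locally, which keeps the "human in the loop" default described above.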
def write_file_for_upload(key: str, do_write, script_path: str = None, path: str = None, metadata_path: str = None, bucket: str = 'trase-storage', do_upload: bool = None, skip_metadata: bool = False)
def write_geopandas_for_upload(gdf: gpd.GeoDataFrame, key: str, script_path: str = None, path: str = None, metadata_path: str = None, bucket: str = 'trase-storage', do_upload: bool = None, skip_metadata: bool = False, driver: str = 'geojson', **to_file_kwargs)-
Writes a GeoDataFrame to disk and optionally uploads to S3 with metadata.
Supported drivers:
- "parquet" -> GeoParquet
- "geojson" -> GeoJSON
- "gpkg" -> GeoPackage
def write_json_for_upload(data, key: str, script_path: str = None, path: str = None, metadata_path: str = None, bucket: str = 'trase-storage', do_upload: bool = None, skip_metadata: bool = False, **json_dump_kwargs)
def write_metadata(metadata: Metadata, path)
def write_parquet_for_upload(df: pd.DataFrame, key: str, is_polars=False, script_path: str = None, path: str = None, metadata_path: str = None, bucket: str = 'trase-storage', do_upload: bool = None, skip_metadata: bool = False, **to_parquet_kwargs)
Classes
class Metadata (script: Script, user: str)-
Metadata(script: trase.tools.aws.metadata.Script, user: str)
Class variables
var script : Script
var user : str
class Script (path: Optional[str], commit_hash: Optional[str], github_username: Optional[str] = 'sei-international', github_repository: Optional[str] = 'TRASE', type: Optional[str] = 'github_script')-
The location of a GitHub script
Class variables
var commit_hash : Optional[str]
var github_repository : Optional[str]
var github_username : Optional[str]
var path : Optional[str]
var type : Optional[str]
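To make the shape of the metadata concrete, the sketch below mirrors the two class signatures above as standalone dataclasses and converts an instance to a nested dictionary. It uses the standard library's `dataclasses.asdict` as a stand-in for dataclass_to_dict(), which is an assumption about how that function behaves. The script path and user name are placeholder values.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class Script:
    """Standalone mirror of the Script signature documented above."""
    path: Optional[str]
    commit_hash: Optional[str]
    github_username: Optional[str] = "sei-international"
    github_repository: Optional[str] = "TRASE"
    type: Optional[str] = "github_script"


@dataclass
class Metadata:
    """Standalone mirror of the Metadata signature documented above."""
    script: Script
    user: str


# Placeholder values; commit_hash is None, as it might be when the
# script is not inside a Git repository.
metadata = Metadata(
    script=Script(path="path/to/script.py", commit_hash=None),
    user="alice",
)

# asdict() recurses into nested dataclasses, giving a plain nested dict
# that could then be written to YAML or flattened for S3.
as_dict = asdict(metadata)
```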