Module trase.tools.aws.aws_helpers
Functions for S3 access and parsing
Functions
def get_pandas_df(key, bucket='trase-storage', version_id=None, client=<botocore.client.S3 object>, track=True, sep=';', encoding='utf8', xlsx=False, print_version_id=False, **kwargs) ‑> pandas.core.frame.DataFrame
Read a CSV or XLSX dataset from S3 to a pandas DataFrame. See https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html and https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html
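A minimal usage sketch; the keys are hypothetical. Extra keyword arguments are forwarded to pandas.read_csv, or to pandas.read_excel when xlsx=True:

    from trase.tools.aws.aws_helpers import get_pandas_df

    # Read a semicolon-separated CSV from the default bucket (hypothetical key)
    df = get_pandas_df("some/folder/data.csv")

    # Read an Excel workbook instead; sheet_name is forwarded to pandas.read_excel
    df = get_pandas_df("some/folder/data.xlsx", xlsx=True, sheet_name=0)
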
def get_s3_json(key, s3_resource=s3.ServiceResource(), bucket_name='trase-storage', track=True)
def get_s3_object_body(key, s3_client=<botocore.client.S3 object>, bucket_name='trase-storage', track=True)
def is_good_s3_key(key: str) ‑> bool
AWS S3 allows a key to be constructed from any UTF-8 character. In practice, however, some characters such as space or "!" cause problems.
To make problems less likely we should stick to a limited set of characters:
0-9 a-z A-Z _ . ( )
plus, of course, the directory separator. This is in line with AWS's recommendation (https://docs.aws.amazon.com/AmazonS3/latest/dev/UsingMetadata.html).
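A minimal sketch of validating keys before upload; the example keys are hypothetical:

    from trase.tools.aws.aws_helpers import is_good_s3_key, make_good_s3_key

    is_good_s3_key("exports/2020/soy_(v1.2).csv")  # True: only characters from the safe set
    is_good_s3_key("exports/2020/soy v1!.csv")     # False: contains a space and "!"

    # make_good_s3_key (documented below) can be used to sanitise a key first
    safe_key = make_good_s3_key("exports/2020/soy v1!.csv")
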
def list_version_ids(key, bucket='trase-storage', client=<botocore.client.S3 object>, ascending=True) ‑> List[VersionIdResponse]
List Version IDs for a given S3 object.
These can be a mix of versions and delete markers. By default, versions are returned in ascending order of last-modified time, so the latest version ID is the last element in the list.
Args
key: a key to an S3 object
bucket: S3 bucket
ascending: sort the list by last-modified time, ascending (default) or descending
Returns: a list of VersionId tuples (version_id, modified, is_delete) containing the string version ID, a datetime object representing when the version was created, and a boolean indicating whether the version is a delete marker.
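A minimal sketch of inspecting an object's version history; the key is hypothetical:

    from trase.tools.aws.aws_helpers import list_version_ids

    versions = list_version_ids("some/folder/data.csv")
    latest = versions[-1]  # ascending order (default): latest version is last
    print(latest.version_id, latest.modified, latest.is_delete)
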
def make_good_s3_key(string) ‑> str
def parse_aws_object2(key: str, encoding='utf-8', separator=';', quoting=None, bucket='trase-storage', Range='bytes=-', track=True)
def read_csv(key, bucket='trase-storage', version_id=None, client=<botocore.client.S3 object>, track=True, print_version_id=False, **kwargs) ‑> pandas.core.frame.DataFrame
Read an S3 object containing CSV data to a DataFrame
Args
kwargs: passed through to pd.read_csv
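A minimal sketch; the key and the forwarded pandas keyword are illustrative:

    from trase.tools.aws.aws_helpers import read_csv

    # Any pandas.read_csv keyword can be forwarded, e.g. the separator
    df = read_csv("some/folder/data.csv", sep=";")
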
def read_geojson(key, bucket='trase-storage', version_id=None, client=<botocore.client.S3 object>, track=True, print_version_id=False, **kwargs) ‑> pandas.core.frame.DataFrame
Import GeoPandas and read an S3 object containing geometry data to a DataFrame
Args
kwargs: passed through to geopandas.read_file
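A minimal sketch, assuming geopandas is installed; the key is hypothetical:

    from trase.tools.aws.aws_helpers import read_geojson

    gdf = read_geojson("some/folder/municipalities.geojson")
    print(gdf.geometry.head())
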
def read_json(key, bucket='trase-storage', version_id=None, client=<botocore.client.S3 object>, track=True, print_version_id=False, **kwargs)
Read an S3 object containing JSON data
Args
kwargs: passed through to json.load
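A minimal sketch; the key is hypothetical, and the result is whatever json.load returns for the object body:

    from trase.tools.aws.aws_helpers import read_json

    data = read_json("some/folder/metadata.json")
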
def read_s3_csv(key, s3_client=<botocore.client.S3 object>, bucket_name='trase-storage', track=True)
def read_s3_folder(folder, prefix='', s3_resource=s3.ServiceResource(), bucket_name='trase-storage')
def read_s3_object(key, s3_resource=s3.ServiceResource(), bucket_name='trase-storage', track=True)
def read_s3_parquet(key, bucket='trase-storage', version_id=None, client=<botocore.client.S3 object>, track=True, print_version_id=False, **kwargs) ‑> pandas.core.frame.DataFrame
Read a Parquet dataset from S3 to a Pandas DataFrame. See https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_parquet.html
Args
kwargs: passed through to pd.read_parquet
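A minimal sketch; the key and column names are hypothetical:

    from trase.tools.aws.aws_helpers import read_s3_parquet

    # columns is forwarded to pandas.read_parquet to read a subset of columns
    df = read_s3_parquet("some/folder/data.parquet", columns=["iso3", "year"])
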
def read_xlsx(key, bucket='trase-storage', version_id=None, client=<botocore.client.S3 object>, track=True, print_version_id=False, **kwargs) ‑> pandas.core.frame.DataFrame
Read an S3 object containing XLSX data to a DataFrame
Args
kwargs: passed through to pd.read_excel
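A minimal sketch; the key and sheet name are hypothetical:

    from trase.tools.aws.aws_helpers import read_xlsx

    # sheet_name is forwarded to pandas.read_excel
    df = read_xlsx("some/folder/data.xlsx", sheet_name="2020")
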
def read_yaml(key, bucket='trase-storage', version_id=None, client=<botocore.client.S3 object>, track=True, print_version_id=False, **kwargs)
Read an S3 object containing YAML data
Args
kwargs: passed through to yaml.safe_load
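A minimal sketch; the key is hypothetical:

    from trase.tools.aws.aws_helpers import read_yaml

    settings = read_yaml("some/folder/settings.yaml")
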
def stream_object(key, bucket='trase-storage', version_id=None, client=<botocore.client.S3 object>, track=True, print_version_id=False, decode=True, encoding='utf8', decoding_errors='strict', **kwargs) ‑> ContextManager[IO]
Returns a stream containing the body of an S3 object. This function is a context manager to ensure that the HTTP stream gets closed:
    with stream_object("my-object", "my_bucket") as file:
        print("The contents are", file.read())

Args
decode: if true, the stream will contain text; otherwise it will contain bytes
decoding_errors: see the "errors" parameter of https://docs.python.org/3/library/io.html#io.TextIOWrapper
track: if true, the S3 object will be added to trase.tools.aws.metadata.S3_OBJECTS_ACCESSED_IN_CURRENT_SESSION
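A sketch of the decode flag; the key is hypothetical:

    from trase.tools.aws.aws_helpers import stream_object

    # decode=False yields a binary stream, useful for hashing or raw copies
    with stream_object("some/folder/data.bin", decode=False) as stream:
        raw_bytes = stream.read()
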
def upload_pandas_df_to_s3(df, new_key, sep=';', encoding='utf8', float_format='%.2f', quotechar='"', bucket_name='trase-storage')
Upload a CSV dataset to S3 from a pandas DataFrame
:param df: pandas DataFrame object
:param new_key: S3 path
:param sep: separator
:param bucket_name: S3 bucket name
:param encoding: encoding str
:param float_format: format of float columns
:param quotechar: quoting character
:return: AWS ServiceResource object
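A minimal sketch; the DataFrame contents and key are hypothetical:

    import pandas as pd

    from trase.tools.aws.aws_helpers import upload_pandas_df_to_s3

    df = pd.DataFrame({"iso3": ["BRA", "ARG"], "volume": [1234.5, 678.9]})
    # Uploaded as a ';'-separated CSV with floats formatted to two decimals
    upload_pandas_df_to_s3(df, "some/folder/output.csv")
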
def upload_s3_csv_buffer(csv_buffer, key_name, s3_client=<botocore.client.S3 object>, bucket_name='trase-storage')
Uploads a local CSV buffer to S3
Classes
class VersionId(version_id, modified, is_delete)
VersionIdResponse(version_id, modified, is_delete)
Ancestors
- builtins.tuple
Instance variables
var is_delete
Alias for field number 2
var modified
Alias for field number 1
var version_id
Alias for field number 0