Module trase.tools.aws.aws_helpers
Functions for S3 access and parsing
Functions
def get_pandas_df(key, bucket='trase-storage', version_id=None, client=<botocore.client.S3 object>, track=True, sep=';', encoding='utf8', xlsx=False, print_version_id=False, **kwargs) ‑> pandas.core.frame.DataFrame
Read a CSV or XLSX dataset from S3 to a pandas DataFrame. See https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html and https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html
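A minimal usage sketch; the keys are hypothetical. Extra keyword arguments are forwarded to pandas.read_csv, or to pandas.read_excel when xlsx=True:

    from trase.tools.aws.aws_helpers import get_pandas_df

    # Read a semicolon-separated CSV from the default bucket (hypothetical key)
    df = get_pandas_df("some/folder/data.csv")

    # Read an Excel workbook instead; sheet_name is forwarded to pandas.read_excel
    df = get_pandas_df("some/folder/data.xlsx", xlsx=True, sheet_name=0)
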
def get_s3_json(key, s3_resource=s3.ServiceResource(), bucket_name='trase-storage', track=True)
def get_s3_object_body(key, s3_client=<botocore.client.S3 object>, bucket_name='trase-storage', track=True)
def is_good_s3_key(key: str) ‑> bool
AWS S3 allows a key to be constructed from any UTF-8 character. In practice, however, some characters such as space or "!" cause problems.
To make problems less likely we should stick to a limited set of characters:
0-9 a-z A-Z _ . ( )
plus, of course, the directory separator. This is in line with AWS's recommendation (https://docs.aws.amazon.com/AmazonS3/latest/dev/UsingMetadata.html).
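A minimal sketch of validating keys before upload; the example keys are hypothetical:

    from trase.tools.aws.aws_helpers import is_good_s3_key, make_good_s3_key

    is_good_s3_key("exports/2020/soy_(v1.2).csv")  # True: only characters from the safe set
    is_good_s3_key("exports/2020/soy v1!.csv")     # False: contains a space and "!"

    # make_good_s3_key (documented below) can be used to sanitise a key first
    safe_key = make_good_s3_key("exports/2020/soy v1!.csv")
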
def list_version_ids(key, bucket='trase-storage', client=<botocore.client.S3 object>, ascending=True) ‑> List[VersionIdResponse]
List Version IDs for a given S3 object.
These can be a mix of versions and delete markers. By default, versions are returned in ascending order of last-modified time, so the latest version ID is the last element in the list.
Args
key: a key to an S3 object
bucket: S3 bucket
ascending: sort the list by last-modified time, ascending (default) or descending
Returns: a list of VersionId tuples (version_id, modified, is_delete) containing the string version ID, a datetime object representing when the version was created, and a boolean indicating whether the version is a delete marker.
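A minimal sketch of inspecting an object's version history; the key is hypothetical:

    from trase.tools.aws.aws_helpers import list_version_ids

    versions = list_version_ids("some/folder/data.csv")
    latest = versions[-1]  # ascending order (default): latest version is last
    print(latest.version_id, latest.modified, latest.is_delete)
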
def make_good_s3_key(string) ‑> str
def parse_aws_object2(key: str, encoding='utf-8', separator=';', quoting=None, bucket='trase-storage', Range='bytes=-', track=True)
def read_csv(key, bucket='trase-storage', version_id=None, client=<botocore.client.S3 object>, track=True, print_version_id=False, **kwargs) ‑> pandas.core.frame.DataFrame
Read an S3 object containing CSV data to a DataFrame
Args
kwargs: passed through to pd.read_csv
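A minimal sketch; the key and the forwarded pandas keyword are illustrative:

    from trase.tools.aws.aws_helpers import read_csv

    # Any pandas.read_csv keyword can be forwarded, e.g. the separator
    df = read_csv("some/folder/data.csv", sep=";")
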
def read_geojson(key, bucket='trase-storage', version_id=None, client=<botocore.client.S3 object>, track=True, print_version_id=False, **kwargs) ‑> pandas.core.frame.DataFrame
Import GeoPandas and read an S3 object containing geometry data to a DataFrame
Args
kwargs: passed through to geopandas.read_file
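A minimal sketch, assuming geopandas is installed; the key is hypothetical:

    from trase.tools.aws.aws_helpers import read_geojson

    gdf = read_geojson("some/folder/municipalities.geojson")
    print(gdf.geometry.head())
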
def read_json(key, bucket='trase-storage', version_id=None, client=<botocore.client.S3 object>, track=True, print_version_id=False, **kwargs)
Read an S3 object containing JSON data
Args
kwargs: passed through to json.load
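A minimal sketch; the key is hypothetical, and the result is whatever json.load returns for the object body:

    from trase.tools.aws.aws_helpers import read_json

    data = read_json("some/folder/metadata.json")
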
def read_s3_csv(key, s3_client=<botocore.client.S3 object>, bucket_name='trase-storage', track=True)
def read_s3_folder(folder, prefix='', s3_resource=s3.ServiceResource(), bucket_name='trase-storage')
def read_s3_object(key, s3_resource=s3.ServiceResource(), bucket_name='trase-storage', track=True)
def read_s3_parquet(key, bucket='trase-storage', version_id=None, client=<botocore.client.S3 object>, track=True, print_version_id=False, **kwargs) ‑> pandas.core.frame.DataFrame
Read a Parquet dataset from S3 to a Pandas DataFrame. See https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_parquet.html
Args
kwargs: passed through to pd.read_parquet
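A minimal sketch; the key and column names are hypothetical:

    from trase.tools.aws.aws_helpers import read_s3_parquet

    # columns is forwarded to pandas.read_parquet to read a subset of columns
    df = read_s3_parquet("some/folder/data.parquet", columns=["iso3", "year"])
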
def read_xlsx(key, bucket='trase-storage', version_id=None, client=<botocore.client.S3 object>, track=True, print_version_id=False, **kwargs) ‑> pandas.core.frame.DataFrame
Read an S3 object containing XLSX data to a DataFrame
Args
kwargs: passed through to pd.read_excel
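A minimal sketch; the key and sheet name are hypothetical:

    from trase.tools.aws.aws_helpers import read_xlsx

    # sheet_name is forwarded to pandas.read_excel
    df = read_xlsx("some/folder/data.xlsx", sheet_name="2020")
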
def read_yaml(key, bucket='trase-storage', version_id=None, client=<botocore.client.S3 object>, track=True, print_version_id=False, **kwargs)
Read an S3 object containing YAML data
Args
kwargs: passed through to yaml.safe_load
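A minimal sketch; the key is hypothetical:

    from trase.tools.aws.aws_helpers import read_yaml

    settings = read_yaml("some/folder/settings.yaml")
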
def stream_object(key, bucket='trase-storage', version_id=None, client=<botocore.client.S3 object>, track=True, print_version_id=False, decode=True, encoding='utf8', decoding_errors='strict', **kwargs) ‑> ContextManager[IO]
Returns a stream containing the body of an S3 object. This function is a context manager to ensure that the HTTP stream gets closed:
    with stream_object("my-object", "my_bucket") as file:
        print("The contents are", file.read())

Args
decode: if true, the stream will contain text; otherwise it will contain bytes
decoding_errors: see the "errors" parameter of https://docs.python.org/3/library/io.html#io.TextIOWrapper
track: if true, the S3 object will be added to trase.tools.aws.metadata.S3_OBJECTS_ACCESSED_IN_CURRENT_SESSION
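A sketch of the decode flag; the key is hypothetical:

    from trase.tools.aws.aws_helpers import stream_object

    # decode=False yields a binary stream, useful for hashing or raw copies
    with stream_object("some/folder/data.bin", decode=False) as stream:
        raw_bytes = stream.read()
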
def upload_pandas_df_to_s3(df, new_key, sep=';', encoding='utf8', float_format='%.2f', quotechar='"', bucket_name='trase-storage')
Upload a CSV dataset to S3 from a pandas DataFrame
:param df: pandas DataFrame object
:param new_key: S3 path
:param sep: separator
:param bucket_name: S3 bucket name
:param encoding: encoding str
:param float_format: format of float columns
:param quotechar: quoting character
:return: AWS ServiceResource object
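A minimal sketch; the DataFrame contents and key are hypothetical:

    import pandas as pd

    from trase.tools.aws.aws_helpers import upload_pandas_df_to_s3

    df = pd.DataFrame({"iso3": ["BRA", "ARG"], "volume": [1234.5, 678.9]})
    # Uploaded as a ';'-separated CSV with floats formatted to two decimals
    upload_pandas_df_to_s3(df, "some/folder/output.csv")
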
def upload_s3_csv_buffer(csv_buffer, key_name, s3_client=<botocore.client.S3 object>, bucket_name='trase-storage')
Uploads a local CSV buffer to S3
Classes
class VersionId(version_id, modified, is_delete)
VersionIdResponse(version_id, modified, is_delete)
Ancestors
- builtins.tuple
Instance variables
var is_delete
Alias for field number 2
var modified
Alias for field number 1
var version_id
Alias for field number 0