Module trase.tools.sei_pcs.dataframe_container

How we store and update Pandas dataframes once they have been loaded from disk into memory.

Functions

def construct_export_dataframe(dataset: str, container: DataFrameContainer, export_columns: List[Export]) ‑> pandas.core.frame.DataFrame
def flow_report_by_attribute(dataframes: DataFrameContainer, name, *args, **kwargs)

Classes

class DataFrameContainer (dataframes: Dict[str, pandas.core.frame.DataFrame], links: Dict[str, Dict[str, Link]], defaults: dict = <factory>, validation: dict = <factory>)

This class serves several purposes. On the face of it, it is a dictionary-like container that lets you access Pandas dataframes by name.

Beyond that, it executes left joins recursively according to the "links" it was created with, which saves you from writing those joins yourself each time (see the trase.tools.sei_pcs.recursive_join module).
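The idea of following links with recursive left joins can be sketched in plain Pandas. This is an illustration only, not the implementation in trase.tools.sei_pcs.recursive_join: the function join_recursively, the table names, and the simplified links mapping (plain table names where the real class uses Link objects) are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical tables: flows point at municipalities, which point at states.
tables = {
    "flows": pd.DataFrame({"municipality_id": [1, 2, 2], "volume": [10.0, 5.0, 7.5]}),
    "municipalities": pd.DataFrame({"municipality_id": [1, 2], "state_id": [10, 20]}),
    "states": pd.DataFrame({"state_id": [10, 20], "state_name": ["A", "B"]}),
}

# Simplified stand-in for the links recipe: table -> {key column: target table}.
links = {
    "flows": {"municipality_id": "municipalities"},
    "municipalities": {"state_id": "states"},
}


def join_recursively(name, tables, links):
    """Left-join every linked table onto `name`, following links recursively."""
    df = tables[name]
    for key, target in links.get(name, {}).items():
        # Resolve the target's own links first, then join the result on.
        target_df = join_recursively(target, tables, links)
        # validate= makes Pandas raise if the join would duplicate rows.
        df = df.merge(target_df, on=key, how="left", validate="many_to_one")
    return df


joined = join_recursively("flows", tables, links)
```

Here joined keeps one row per flow and gains the state_id and state_name columns, without the caller writing either merge by hand.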

However, its biggest value is the safety checks it applies when you modify one of the dataframes in the container. These ensure that the dataframe keeps its original columns: none can be added, deleted, or change dtype. See DataFrameContainer.update() for more on how this works.
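The kind of check described above can be sketched as follows. This is a minimal illustration in plain Pandas, assuming a hypothetical helper named check_same_schema; it is not the validation code used by DataFrameContainer.

```python
import pandas as pd


def check_same_schema(original: pd.DataFrame, candidate: pd.DataFrame) -> None:
    """Raise ValueError if columns were added, removed, or changed dtype."""
    added = set(candidate.columns) - set(original.columns)
    removed = set(original.columns) - set(candidate.columns)
    if added or removed:
        raise ValueError(
            f"columns changed: added={sorted(added)}, removed={sorted(removed)}"
        )
    changed = [c for c in original.columns if original[c].dtype != candidate[c].dtype]
    if changed:
        raise ValueError(f"dtype changed for columns: {changed}")


original = pd.DataFrame({"volume": [1.0, 2.0], "country": ["BR", "AR"]})

# Changing values while keeping columns and dtypes passes the check silently.
ok = original.assign(volume=original["volume"] * 2)
check_same_schema(original, ok)
```

Adding a column, dropping one, or replacing the float volume column with integers would each raise ValueError in this sketch.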

Class variables

var dataframes : Dict[str, pandas.core.frame.DataFrame]
var defaults : dict
var validation : dict

Methods

def get(self, name, copy=True, ids=True)

Get a dataframe in the container by name. You should assume that the returned object is a copy, not a reference: altering it will only alter the copy, not the original stored in this class.

All joins specified in the links recipe when this class was created will be executed (see the trase.tools.sei_pcs.recursive_join module).

Args

copy : bool
If you know that you will not alter the returned object and you want extra performance, pass copy=False, which may avoid an in-memory copy operation. If you do this, never alter the object you receive, or you may corrupt the data stored in this class.
ids : bool
Adds a column called "_id", which is required if you ever want to call DataFrameContainer.update() with this dataframe.
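
The risk that the copy argument describes can be shown with a toy stand-in. ToyContainer below is hypothetical and only mimics the copy behaviour described above; it is not the real DataFrameContainer and ignores links and the "_id" column.

```python
import pandas as pd


class ToyContainer:
    """Hypothetical stand-in that mimics only the copy semantics of get()."""

    def __init__(self, dataframes):
        self._dataframes = dict(dataframes)

    def get(self, name, copy=True):
        df = self._dataframes[name]
        # With copy=False the caller receives the stored object itself.
        return df.copy() if copy else df


toy = ToyContainer({"flows": pd.DataFrame({"volume": [1.0, 2.0]})})

# Default: edits touch only the caller's copy.
safe = toy.get("flows")
safe.loc[0, "volume"] = 99.0
stored_after_safe = toy.get("flows").loc[0, "volume"]

# copy=False: the same edit changes the stored dataframe.
shared = toy.get("flows", copy=False)
shared.loc[0, "volume"] = 99.0
stored_after_shared = toy.get("flows").loc[0, "volume"]
```

In this sketch stored_after_safe is still 1.0, while stored_after_shared is 99.0, which is the kind of silent corruption the warning refers to.
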
def replace(self, name, df: pandas.core.frame.DataFrame, conserved_columns=None, missing_value_columns='warn', extra_columns='raise')
def update(self, name, df: pandas.core.frame.DataFrame, columns, conserved_columns=None)