Module trase.tools.sei_pcs.pandas_utilities
Useful functions that utilise Pandas with the goal of (a) reducing code repetition and (b) minimising common errors.
Functions
def append(df: pandas.core.frame.DataFrame, items) ‑> pandas.core.frame.DataFrame

def assert_frame_like(a, b, **kwargs)
Assert two dataframes are equal, ignoring row and column order and index
def cartesian_product(df_a, df_b, suffixes=('_x', '_y')) ‑> pandas.core.frame.DataFrame
Returns the Cartesian product of two dataframes.
For example, given the following two dataframes
    | foo |     | bar |
    |-----|     |-----|
    | a   |     | 1   |
    | b   |     | 2   |

The Cartesian product would be:

    | foo | bar |
    |-----|-----|
    | a   | 1   |
    | a   | 2   |
    | b   | 1   |
    | b   | 2   |

The returned dataframe will always have len(df_a) * len(df_b) rows. In the example above both dataframes have only one column, but this is not a requirement.
Args
suffixes: suffixes which will be applied to columns that appear in both dataframes

Note: this function is superseded in Pandas 1.2+ by pd.merge(how="cross").
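The Pandas 1.2+ equivalent mentioned above can be demonstrated directly; a minimal sketch using the example dataframes:

```python
import pandas as pd

df_a = pd.DataFrame({"foo": ["a", "b"]})
df_b = pd.DataFrame({"bar": [1, 2]})

# Pandas 1.2+ equivalent of cartesian_product(df_a, df_b): every row of
# df_a is paired with every row of df_b, giving len(df_a) * len(df_b) rows.
df = pd.merge(df_a, df_b, how="cross")
```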
def compare_dataframes(df_a, df_b, comparators: Dict[str, List], group_by: Optional[Iterable[str]] = None, suffixes=('_a', '_b')) ‑> pandas.core.frame.DataFrame
Compares two dataframes. Before comparison, the dataframes are aggregated via summation. Missing rows in either dataframe are filled with zero before comparison.
Args
comparators: a dictionary of {value_column: compare_methods}.
    value_column: str, the column which will be compared numerically. It will
        be summed over the group_by columns. The column must exist in both
        dataframes.
    compare_methods: a list of Compare functions, which take two series and
        return their comparison. You can also pass in the string name of a
        method on Compare. See Compare for some examples.
group_by: a list of columns to group the dataframes by during aggregation. The columns must exist in both dataframes.
suffixes: how to suffix the two value columns, defaults to "_a" and "_b".
def compare_dataframes_single(df_a, df_b, value: str, group_by: Optional[Iterable[str]] = None, comparator=Compare.signed_symmetric_relative_error, column='comparison', suffixes=('_a', '_b'), sort_absolute=True, sort='ascending', factor=1) ‑> pandas.core.frame.DataFrame
A wrapper around compare_dataframes() which:

- Takes just one value column
- Renames the comparison column
- Sorts dataframe for you
The intention is that the return value is slightly easier to use since columns only have one level, and it performs sorting for you (which is usually what you want).
Example
>>> compare_dataframes_single(df_a, df_b, "volume", ["exporter"])
          total_a  total_b  comparison
exporter
BUNGE         100      200         1.0
CARGILL       500        0        -inf

Args
value: the column which will be compared numerically
group_by: columns over which to group
comparator: a function which takes two series and returns their comparison.
    You can also pass in the string name of a method on Compare. See Compare
    for some examples. Defaults to Compare.signed_symmetric_relative_error().
column: the name of the comparison column. Defaults to "comparison".
sort_absolute: whether to use absolute values as the sorting key. Defaults to True.
sort: optional, sorting order, either "ascending" (default) or "descending"
factor: optional, multiply the comparison by a factor. The multiplication is done before any sorting occurs.
Returns: a pd.DataFrame with the groups as the index and three columns named (unless overridden) "total_a", "total_b", and "comparison".

def concat(dfs, *args, sort=False, ignore_index=True, **kwargs)
Some useful defaults for concatenating two dataframes together.
In particular:
- Do not sort the dataframes.
- Ignore indexes: we assume the indexes carry no useful information.
- Validate that all dataframes have the same columns, since mismatched columns silently introduce NaNs, which are painful to deal with in Pandas.
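These defaults could be sketched as a thin wrapper over pd.concat. The column check below is an illustrative guess at the validation described above, not the module's actual implementation (hence the hypothetical name concat_checked):

```python
import pandas as pd

def concat_checked(dfs, *args, sort=False, ignore_index=True, **kwargs):
    # Hypothetical sketch of the defaults described above.
    dfs = list(dfs)
    expected = set(dfs[0].columns)
    # Validate that all dataframes share the same columns; a mismatch
    # would silently introduce NaN-filled columns in pd.concat.
    for df in dfs[1:]:
        if set(df.columns) != expected:
            raise ValueError("All dataframes must have the same columns")
    return pd.concat(dfs, *args, sort=sort, ignore_index=ignore_index, **kwargs)
```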
def full_merge(*args, how='outer', indicator=False, **kwargs)
Performs a Pandas merge but adds an additional check that no keys are missing.
Suppose that we have two dataframes, df_production and df_demand
    df_production:               df_demand:
    name    | production         name    | demand
    --------+-----------         --------+-------
    BALDIM  | 16                 BALDIM  | 16
    CAXIAS  | 73                 CAXIAS  | 34
    ...     | ...                ...     | ...
    PALMAS  | 109                PALMAS  | 46
    URUOCA  | 121                URUOCA  | 47

Suppose that we wish to merge production and demand together. Our model expects that merge keys present in the left are present in the right and vice-versa. However, how do we assert this in code? If we don't assert it, we might accidentally lose rows.
This function is a wrapper around pd.merge that will raise an error if any keys are missing on the left or the right. The behaviour depends on the value of the how parameter:

- how="outer": error if any keys in the left are not present in the right and vice-versa
- how="left": error if any keys in the left are not present in the right
- how="right": error if any keys in the right are not present in the left
- how="inner": behaviour will match that of pd.merge
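The check described above can be sketched using pd.merge's indicator column, which labels each row as "left_only", "right_only", or "both". This is an illustrative reimplementation, not the module's actual code (hence the hypothetical name full_merge_sketch):

```python
import pandas as pd

def full_merge_sketch(*args, how="outer", indicator=False, **kwargs):
    # Merge with an indicator column so unmatched keys can be detected.
    df = pd.merge(*args, how=how, indicator=True, **kwargs)
    # Which indicator values count as "missing" depends on `how`:
    # an inner merge never errors, matching plain pd.merge behaviour.
    bad = {
        "outer": ["left_only", "right_only"],
        "left": ["left_only"],
        "right": ["right_only"],
        "inner": [],
    }[how]
    missing = df[df["_merge"].isin(bad)]
    if len(missing):
        raise ValueError(f"Not all merge keys match:\n{missing}")
    return df if indicator else df.drop(columns="_merge")
```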
All arguments are passed to pd.merge.

Example:

    # let's make two dataframes with keys that don't match
    df_a = pd.DataFrame({"name": ["POMBAL"]})
    df_b = pd.DataFrame({"name": ["PALMAS"]})

    # pandas merge (defaulting to how="inner") will return an empty dataframe
    df = pd.merge(df_a, df_b)
    df.empty  # True

    # full_merge (defaulting to how="outer") will raise an error
    full_merge(df_a, df_b)
    # ValueError: Not all merge keys match:
    #     name     _merge
    #   POMBAL  left_only
    #   PALMAS  right_only

def group_by_and_zip(df_a, df_b, on)

def grouped_cartesian_product(df_a, df_b, common: List[str], suffixes=('_x', '_y')) ‑> pandas.core.frame.DataFrame
Like cartesian_product(), but merging on some common columns.

def grouped_proportion(df, numerical_column: str, categorical_columns: List[str]) ‑> pandas.core.series.Series
Calculates the proportion of some numerical value relative to a group.
For example, given the following dataframe
df:

    | company | province | vol |
    |---------|----------|-----|
    | BIFFI   | ASCOPE   | 100 |
    | ARVUT   | ASCOPE   | 300 |
    | ARVUT   | PUNO     | 500 |

Calling grouped_proportion(df, "vol", ["province"]) would return:

    | vol  |
    |------|
    | 0.25 |
    | 0.75 |
    | 1.00 |

because 0.25 = 100 / (100 + 300), 0.75 = 300 / (100 + 300), and 1.00 = 500 / 500.

The function returns a pd.Series with the same index as the dataframe so that it can be used to add a column:

    df["proportion"] = grouped_proportion(df, "vol", ["province"])

Args
numerical_column: string name of the numerical column, for example "vol"
categorical_columns: list of one or more categorical columns that will be considered to define the "groups"
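The calculation maps directly onto a groupby-transform. A sketch reproducing the example above (illustrative, with a hypothetical name, rather than the module's source):

```python
import pandas as pd

def grouped_proportion_sketch(df, numerical_column, categorical_columns):
    # Divide each value by its group total; transform("sum") keeps df's
    # index, so the result can be assigned straight back as a new column.
    totals = df.groupby(categorical_columns)[numerical_column].transform("sum")
    return df[numerical_column] / totals

df = pd.DataFrame({
    "company": ["BIFFI", "ARVUT", "ARVUT"],
    "province": ["ASCOPE", "ASCOPE", "PUNO"],
    "vol": [100, 300, 500],
})
df["proportion"] = grouped_proportion_sketch(df, "vol", ["province"])
```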
def pandas_display_context(rows=None, columns=None, float_format=None, precision: int = None, significant_digits=None)
Temporarily alters the display settings that Pandas uses to pretty-print dataframes.
Example
    with pd.pandas_display_context(rows=100):
        print(df)  # up to 100 rows will be printed
    print(df)  # setting has now been reset to the previous value

Args
rows: int, optional - the maximum number of rows to print before truncating them
columns: int, optional - the maximum number of columns to print before truncating them
float_format: callable, optional - a function that takes a floating-point number and returns a string representation
precision: int, optional - floating-point output precision in terms of the number of places after the decimal
significant_digits: int, optional - number of significant figures to print a floating-point number to. This setting will be ignored if float_format is supplied.
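Such a context manager can be built on top of pd.option_context, which restores the previous option values on exit. A minimal sketch covering only the rows and columns settings (the function name is hypothetical; the option names are the standard Pandas display options):

```python
import contextlib
import pandas as pd

@contextlib.contextmanager
def display_context(rows=None, columns=None):
    # Collect (option, value) pairs only for the settings that were supplied.
    options = []
    if rows is not None:
        options += ["display.max_rows", rows]
    if columns is not None:
        options += ["display.max_columns", columns]
    if not options:
        # Nothing to override.
        yield
        return
    with pd.option_context(*options):
        # Inside the block the options are overridden; on exit,
        # pd.option_context restores the previous values.
        yield
```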
def pivot(df, index: List, column: str, value: str, agg='sum')-
A version of pandas.DataFrame.pivot which allows a list of index names, a feature only added in Pandas > 1.1.0. Once we upgrade Pandas we can delete this function.

def prefix_columns(df, prefix='', suffix='')

def pretty_summarise(obj: Union[pandas.core.frame.DataFrame, pandas.core.series.Series, Iterable], index=False, max_rows=5, length=True, prefix='') ‑> str
Returns the top five values of an iterable, along with the number of items. This is useful for printing a potentially long list of items.
Example
>>> text = pretty_summarise([
...     "Berlin", "Frankfurt", "Heidelberg", "Munich", "Hamburg", "Bremen"
... ])
>>> print(text)
Berlin
Frankfurt
...
Hamburg
Bremen
Length: 6

Args
obj- a Pandas dataframe, Pandas series, or iterable of any kind
index: bool, optional - whether to print the index of the Pandas object (default: False)
max_rows: int, optional - number of rows to print before collapsing with "…" (default: 5)
length: bool, optional - whether to print the total number of rows in the object (default: True)
Returns: a string representation of the pretty-printed object
def rename(df, columns)

def select_minimum(df, by, on, include_id=True)

def select_subset(df, subset)

def set_pandas_display(rows=None, columns=None, float_format=None, precision: int = None, significant_digits=None)
Alters the display settings that Pandas uses to pretty-print dataframes.
When this function is called the setting will be altered globally for the current Python session. To alter a setting temporarily, use pandas_display_context() instead.

Example
    pd.set_pandas_display(rows=100)
    print(df)  # up to 100 rows will be printed

Args
rows: int, optional - the maximum number of rows to print before truncating them
columns: int, optional - the maximum number of columns to print before truncating them
float_format: callable, optional - a function that takes a floating-point number and returns a string representation
precision: int, optional - floating-point output precision in terms of the number of places after the decimal
significant_digits: int, optional - number of significant figures to print a floating-point number to. This setting will be ignored if float_format is supplied.
def split_dataframe_using_proportions(df: pandas.core.frame.DataFrame, df_proportions: pandas.core.frame.DataFrame, values: List[str], on: List[str], by: str, where=None, validate=None) ‑> pandas.core.frame.DataFrame

def stitch_dataframes(df_left, df_right, volume_column: str, values_left: List = None, values_right: List = None, fillna='UNKNOWN', indicator=False)
Merge dataframes on common columns, splitting the values in values_left and values_right according to proportions in volume_column.

    df_left = pd.DataFrame(
        [
            ("K1", "M1", 10, 55, "good"),
            ("K2", "M1", 5, 32, "great"),
            ("K1", "M2", 2.5, 11, "awesome"),
            ("K2", "M2", 3, 19.2, "awful"),
        ],
        columns=["kab", "mill", "vol", "deforestation", "quality"],
    )
    df_right = pd.DataFrame(
        [
            ("M1", "T1", "C1", 8, 853, 3),
            ("M1", "T2", "C2", 8, 732, 5),
            ("M2", "T1", "C1", 1, 97, 3),
            ("M2", "T2", "C3", 4, 333, 5),
        ],
        columns=["mill", "exporter", "country", "vol", "fob", "forest_500"],
    )
    df = stitch_dataframes(
        df_left,
        df_right,
        volume_column="vol",
        values_left=["deforestation"],
        values_right=["fob"],
    )

def to_nested_dict(df, columns, leaf_key, one=False) ‑> dict

def warnings_if_pd_merge_called_without_validate()
Classes
class Compare
Static methods
def absolute_error(a, b)

    The function:

        |b| - |a|

def absolute_symmetric_relative_error(a, b)

def cosine_similarity(a, b)

def l1_distance(a, b)

def l2_distance(a, b)

def relative_error(a, b)

    Implements the function:

        b - a
        -----
         |a|

    Note that this function will be undefined for zero and near-zero values of a.
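As a formula check, the definition above is a one-liner (an illustrative sketch with a hypothetical name, not the module's code):

```python
import numpy as np

def relative_error_sketch(a, b):
    # (b - a) / |a|; undefined (inf or nan) when a is zero or near zero.
    return (b - a) / np.abs(a)
```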
def signed_symmetric_relative_error(a, b)-
Implements the below function, where division by zero is +/- infinity:

    |b| - |a|
    -------------
    min(|a|, |b|)

The motivation for this comparator is to meaningfully handle zero values in either a or b: positive infinity means that a is zero but there is a value in b, and negative infinity means that a has a value but b is zero.
Example
    signed_symmetric_relative_error( 10,  10)  # 0
    signed_symmetric_relative_error(250, 100)  # -1.5
    signed_symmetric_relative_error(100, 250)  # 1.5
    signed_symmetric_relative_error(100,   0)  # -inf
    signed_symmetric_relative_error(  0, 100)  # inf
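The formula and the worked examples above can be reproduced with a short sketch (illustrative, with a hypothetical name, rather than the module's actual implementation):

```python
import numpy as np

def ssre(a, b):
    # (|b| - |a|) / min(|a|, |b|); dividing by zero yields signed
    # infinity, matching the behaviour described above.
    a, b = np.abs(a), np.abs(b)
    with np.errstate(divide="ignore"):
        return (b - a) / np.minimum(a, b)
```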
def signed_total_relative_error(a, b)-
Implements the below function:

    |b| - |a|
    ---------
    sum(|a|)

The original use case for this comparator was to get the difference in volumes exported to e.g. China relative to the entire export volume for the year.
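A sketch of this comparator for array-like inputs (illustrative, with a hypothetical name): each element's change is expressed relative to the total of |a|, e.g. the change in one destination's volume relative to the whole year's exports.

```python
import numpy as np

def stre(a, b):
    # (|b| - |a|) / sum(|a|): elementwise differences scaled by the
    # total of the reference series a.
    a = np.abs(np.asarray(a, dtype=float))
    b = np.abs(np.asarray(b, dtype=float))
    return (b - a) / a.sum()
```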