Module trase.tools.sei_pcs.pandas_utilities

Useful functions that utilise Pandas with the goal of (a) reducing code repetition and (b) minimising common errors.

Functions

def append(df: pandas.core.frame.DataFrame, items) ‑> pandas.core.frame.DataFrame
def assert_frame_like(a, b, **kwargs)

Assert two dataframes are equal, ignoring row and column order and index

def cartesian_product(df_a, df_b, suffixes=('_x', '_y')) ‑> pandas.core.frame.DataFrame

Returns the Cartesian product of two dataframes.

For example, given the following two dataframes

| foo |    | bar |
|-----|    |-----|
| a   |    | 1   |
| b   |    | 2   |

The Cartesian product would be:

| foo | bar |
|-----|-----|
| a   | 1   |
| a   | 2   |
| b   | 1   |
| b   | 2   |

The returned dataframe will always have len(df1) * len(df2) rows. In the example above both dataframes have only one column, but this is not a requirement.

Args

suffixes
suffixes which will be applied to columns which appear in both dataframes

Note: this function is superceded in Pandas 1.2+ as pd.merge(how="cross")

def compare_dataframes(df_a, df_b, comparators: Dict[str, List], group_by: Optional[Iterable[str]] = None, suffixes=('_a', '_b')) ‑> pandas.core.frame.DataFrame

Compares two dataframes. Before comparison, the dataframes are aggregated via summation. Missing rows in either dataframe are filled with zero before comparison.

Args

comparators
A dictionary of {value_column: compare_methods}. value_column: str, The column which will be compared numerically. It will be summed over the group_by columns. The column must exist in both dataframes compare_methods: a list of Compare functions, which take two series and returns their comparison. You can also pass in the string name of a method on Compare. See Compare for some examples.
group_by
A list of columns to group the dataframes by during aggregation. The columns must exist in both dataframes.
suffixes
how to suffix the two value columns, defaults to "_a" and "_b".
def compare_dataframes_single(df_a, df_b, value: str, group_by: Optional[Iterable[str]] = None, comparator=<function Compare.signed_symmetric_relative_error>, column='comparison', suffixes=('_a', '_b'), sort_absolute=True, sort='ascending', factor=1) ‑> pandas.core.frame.DataFrame

A wrapper around compare_dataframes() which:

  1. Takes just one value column
  2. Renames the comparison column
  3. Sorts dataframe for you

The intention is that the return value is slightly easier to use since columns only have one level, and it performs sorting for you (which is usually what you want).

Example

>>> compare_dataframes_single(df_a, df_b, "volume", ["exporter"])
          total_a  total_b  comparison
exporter
   BUNGE      100      200         1.0
 CARGILL      500        0        -inf

Args

value
the column which will be compared numerically
group_by
columns over which to group
comparator
a function which takes two series and returns their comparison. You can also pass in the string name of a method on Compare. See Compare for some examples. Defaults to Compare.signed_symmetric_relative_error()
column
the name of the comparison column. Defaults to "comparison"
sort_absolute
whether to use absolute values as the sorting key. Defaults to True
sort : optional
sorting order, either "ascending" (default) or "descending"
factor : optional
multiply the comparison by a factor. The multiplication is done before any sorting occurs.

Returns: a pd.DataFrame with the groups as the index. and three columns named (unless overridden) "total_a", "total_b", and "comparison"

def concat(dfs, *args, sort=False, ignore_index=True, **kwargs)

Some useful defaults for concatenating two dataframes together.

In particular:

  • Do not sort the dataframes.
  • Ignore indexes: we assume the indexes carry not useful information.
  • Validate that all dataframes have the same columns. Because NaNs are really annoying in Pandas.
def full_merge(*args, how='outer', indicator=False, **kwargs)

Performs a Pandas merge but adds an additional check that no keys are missing.

Suppose that we have two dataframes, df_production and df_demand

df_production:                    df_demand:

  name  | production               name  | demand
--------+-----------             --------+-------
 BALDIM |         16              BALDIM |     16
 CAXIAS |         73              CAXIAS |     34
 ...    |        ...              ...    |    ...
 PALMAS |        109              PALMAS |     46
 URUOCA |        121              URUOCA |     47

Suppose that we wish to merge production and demand together. Our model expects that merge keys present in the left are present in the right and vice-versa. However, how do we assert this in code? If we don't assert it, we might accidentally lose rows.

This function is a wrapper around pd.merge that will raise an error if any keys are missing on the left or the right.

The behaviour depends on the value of the how parameter:

  • how="outer": error if any keys in the left are not present in the right and vice-versa
  • how="left": error if any keys in the left are not present in the right
  • how="right": error if any keys in the right are not present in the left
  • how="inner": behaviour will match that of pd.merge.

All arguments are passed to pd.merge.

Example:

# let's make two dataframes with keys that don't match
df_a = pd.DataFrame({"name": ["POMBAL"]})
df_b = pd.DataFrame({"name": ["PALMAS"]})

# pandas merge (defaulting to how=inner) will return an empty dataframe
df = pd.merge(df_a, df_b)
df.is_empty  # True

# full_merge (defaulting to how=outer) will raise an error
full_merge(df_a, df_b)  # ValueError: Not all merge keys match:
                        #    name      _merge
                        #  POMBAL   left_only
                        #  PALMAS  right_only
def group_by_and_zip(df_a, df_b, on)
def grouped_cartesian_product(df_a, df_b, common: List[str], suffixes=('_x', '_y')) ‑> pandas.core.frame.DataFrame

Like cartesian_product(), but merging on some common columns

def grouped_proportion(df, numerical_column: str, categorical_columns: List[str]) ‑> pandas.core.series.Series

Calculates the proportion of some numerical value relative to a group.

For example, given the following dataframe df:

| company | province | vol
| BIFFI   | ASCOPE   | 100
| ARVUT   | ASCOPE   | 300
| ARVUT   | PUNO     | 500

Calling grouped_proportion(df, "vol", ["province"]) would return:

|  vol |
| 0.25 |
| 0.75 |
| 1.00 |

because 0.25 = 100 / (100 + 300), 0.75 = 300 / (100 + 300), and 1.00 = 500 / 500. The function returns a pd.Series with the same index as the dataframe so that it can be used to add a column:

df["proportion"] = grouped_proportion(df, "vol", ["province"])

Args

numerical_column
string name of the numerical column, for eaxmple "vol"
categorical_columns
list of one or more categorical columns that will be considered to define the "groups"
def pandas_display_context(rows=None, columns=None, float_format=None, precision: int = None, significant_digits=None)

Temporarily alters the display settings that Pandas uses to pretty-print dataframes.

Example

with pd.pandas_display_context(rows=100):
   print(df)  # up to 100 rows will be printed

print(df)  # setting has now been reset to the previous value

Args

rows : int, optional
the maximum number of rows to print before truncating them
columns : int, optional
the maximum number of columns to print before truncating them
float_format : callable, optional
a function that takes a floating-point number and returns a string representation
precision : int, optional
floating-point output precision in terms of number of places after the decimal
significant_digits : int, optional
number of significant figures to print a floating-point number to. This setting will be ignored if float_format is supplied
def pivot(df, index: List, column: str, value: str, agg='sum')

A version of pandas.DataFrame.pivot which allows a list of index names, a feature only added in Pandas > 1.1.0. Once we upgrade Pandas we can delete this function

def prefix_columns(df, prefix='', suffix='')
def pretty_summarise(obj: Union[pandas.core.frame.DataFrame, pandas.core.series.Series, Iterable], index=False, max_rows=5, length=True, prefix='') ‑> str

Returns the the top five values of an iterable, along with the number of items. This is useful for printing a potentially long list of items.

Example

>>> text = pretty_summarise([
...     "Berlin", "Frankfurt", "Heidelberg", "Munich", "Hamburg", "Bremen"
... ])
>>> print(text)
    Berlin
 Frankfurt
   ...
   Hamburg
    Bremen
Length: 6

Args

obj
a Pandas dataframe, Pandas series, or iterable of any kind
index : bool, optional
whether to print the index of the Pandas object (default: False)
max_rows : int, optional
number rows to print before collapsing with "…" (default: 5)
length : bool, optional
whether to print the total number of rows in the object (default: True)

Returns: a string representation of the pretty-printed object

def rename(df, columns)
def select_minimum(df, by, on, include_id=True)
def select_subset(df, subset)
def set_pandas_display(rows=None, columns=None, float_format=None, precision: int = None, significant_digits=None)

Alters the display settings that Pandas uses to pretty-print dataframes.

When this function is called the setting will be altered globally for the current Python session. To alter a setting temporarily use pandas_display_context() instead.

Example

pd.set_pandas_display(rows=100)
print(df)  # up to 100 rows will be printed

Args

rows : int, optional
the maximum number of rows to print before truncating them
columns : int, optional
the maximum number of columns to print before truncating them
float_format : callable, optional
a function that takes a floating-point number and returns a string representation
precision : int, optional
floating-point output precision in terms of number of places after the decimal
significant_digits : int, optional
number of significant figures to print a floating-point number to. This setting will be ignored if float_format is supplied
def split_dataframe_using_proportions(df: pandas.core.frame.DataFrame, df_proportions: pandas.core.frame.DataFrame, values: List[str], on: List[str], by: str, where=None, validate=None) ‑> pandas.core.frame.DataFrame
def stitch_dataframes(df_left, df_right, volume_column: str, values_left: List = None, values_right: List = None, fillna='UNKNOWN', indicator=False)

Merge dataframes on common columns, splitting the values in values_left and value_right according to proportions in volume_column.

df_left = pd.DataFrame(
    [
        ("K1", "M1", 10, 55, "good"),
        ("K2", "M1", 5, 32, "great"),
        ("K1", "M2", 2.5, 11, "awesome"),
        ("K2", "M2", 3, 19.2, "awful"),
    ],
    columns=["kab", "mill", "vol", "deforestation", "quality"],
)

df_right = pd.DataFrame(
    [
        ("M1", "T1", "C1", 8, 853, 3),
        ("M1", "T2", "C2", 8, 732, 5),
        ("M2", "T1", "C1", 1, 97, 3),
        ("M2", "T2", "C3", 4, 333, 5),
    ],
    columns=["mill", "exporter", "country", "vol", "fob", "forest_500"],
)

df = stitch_dataframes(
    df_left,
    df_right,
    volume_column="vol",
    values_left=["deforestation"],
    values_right=["fob"],
)
def to_nested_dict(df, columns, leaf_key, one=False) ‑> dict
def warnings_if_pd_merge_called_without_validate()

Classes

class Compare

Static methods

def absolute_error(a, b)

The function |b| - |a|

def absolute_symmetric_relative_error(a, b)
def cosine_similarity(a, b)
def l1_distance(a, b)
def l2_distance(a, b)
def relative_error(a, b)

Implements the function:

b - a
-----
 |a|

Note that this function will be undefined for zero and near-zero values of a

def signed_symmetric_relative_error(a, b)

Implements the below function, where division by zero is +/- infinity:

  |b| - |a|
-------------
min(|a|, |b|)

The motivation for this comparator is to meaningfully handle zero values in either a or b: positive infinity means that a is zero but there is a value in b, and negative infinity means that a has a value but b is zero.

Example

signed_symmetric_relative_error( 10, 10) # 0 signed_symmetric_relative_error(250, 100) # -1.5 signed_symmetric_relative_error(100, 250) # 1.5 signed_symmetric_relative_error(100, 0) # -inf signed_symmetric_relative_error(0, 100) # inf

def signed_total_relative_error(a, b)

Implements the below function:

|b| - |a|
---------
sum(|a|)

The original use case for this comparator was to get the difference in volumes exported to e.g. China relative to the entire export volume for the year.