Module trase.tools.sei_pcs.recursive_join

To reduce the number of joins that users have to write in their model code, we allow them to specify a recipe of joins that is then performed for them.

A typical example is that they might have a customs dataset with a column containing geocodes of municipalities which they always want to be left-joined on to a dataset of municipalities.

This module turns such a recipe into actual pandas merge operations, with user-friendly errors for when things go wrong. It also supports more deeply-nested joining operations.

Functions

def dot(a, b)
def prejoin(dataframes: Dict[str, pandas.core.frame.DataFrame], links: Dict[str, Dict[str, Link]], excluding) ‑> Tuple[Dict[str, pandas.core.frame.DataFrame], Dict[str, Dict[str, Link]]]
def recursive_join(dataset: str, dataframes: Dict[str, pandas.core.frame.DataFrame], links: Dict[str, Dict[str, Link]], copy=True, limit=100) ‑> pandas.core.frame.DataFrame

Perform a "recursive join" for the given dataset.

For example, given the dataframes:

dataframes = {
    "state": pd.DataFrame([
        {"code": 12, "name": "ACRE"},
        # ...
    ]),
    "municipality": pd.DataFrame([
        {"state": 12, "code": 1200302, "name": "FEIJO"},
        # ...
    ]),
}

We might specify the following links:

links = {
    "state": {},
    "municipality": {"state": Link("state", "code")},
}

That is to say: the "state" column of the "municipality" dataframe should be left-joined on to the "code" column of the "state" dataframe.

We would perform such a join as follows:

recursive_join("municipality", dataframes, links)

That would return the following dataframe:

| state.code | state.name |    code |  name |
|------------|------------|---------|-------|
| 12         | ACRE       | 1200302 | FEIJO |

The above example is simple, but this function also handles deeply-nested links across many dataframes.
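
For intuition, the single-link example above corresponds roughly to the following plain pandas merge. This is only a sketch and not the module's implementation: the "state." column prefix and the dropping of the original "state" column are assumptions inferred from the result table above.

```python
import pandas as pd

state = pd.DataFrame([{"code": 12, "name": "ACRE"}])
municipality = pd.DataFrame([{"state": 12, "code": 1200302, "name": "FEIJO"}])

# Assumption: joined columns are prefixed with the name of the link column.
right = state.add_prefix("state.")

joined = municipality.merge(
    right,
    how="left",
    left_on="state",
    right_on="state.code",
    validate="many_to_one",  # pandas raises MergeError if right-hand keys are duplicated
).drop(columns=["state"])  # assumption: the link column is replaced by the joined columns

# Reorder the columns to match the table above.
joined = joined[["state.code", "state.name", "code", "name"]]
print(joined)
```

Note that `validate="many_to_one"` only guards against duplicated keys on the right; keys missing from the right would silently produce NaN values in a plain left merge.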

Args

copy
if False and there are no joins to be made, a reference to the original dataframe in the dictionary is returned rather than a copy. This can be slightly faster if you know you will not modify the return value.
limit
the maximum number of recursive entries into this call before a RecursionError is raised

Raises

LinkError
if any left joins have missing or duplicated keys on the right.
ValueError
if any of the targets of the links are themselves linked, for example dataset1.column1 left joins on dataset2.column2, but dataset2.column2 also left joins on some dataset3.column3.
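
To illustrate the two conditions that lead to a LinkError, the sketch below detects missing and duplicated right-hand keys with plain pandas. The helper `check_left_join_keys` is hypothetical and is not part of this module; how the module itself performs the check and builds its error message is not shown here.

```python
import pandas as pd


def check_left_join_keys(left, right, left_on, right_on):
    """Hypothetical helper: return (missing, duplicated) key values for a left join."""
    right_keys = right[right_on]
    # Keys that occur more than once on the right would multiply rows on the left.
    duplicated = sorted(right_keys[right_keys.duplicated()].unique().tolist())
    # Keys on the left with no match on the right would produce NaN after the join.
    missing = sorted(set(left[left_on].tolist()) - set(right_keys.tolist()))
    return missing, duplicated


state = pd.DataFrame([
    {"code": 12, "name": "ACRE"},
    {"code": 12, "name": "ACRE (duplicate row)"},
])
municipality = pd.DataFrame([
    {"state": 12, "code": 1200302, "name": "FEIJO"},
    {"state": 99, "code": 9900001, "name": "UNKNOWN"},
])

missing, duplicated = check_left_join_keys(municipality, state, "state", "code")
print(missing, duplicated)  # [99] [12]
```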

Classes

Link(dataset: str, column: str, validate_only: bool = False)

Class variables

var column : str
var dataset : str
var validate_only : bool