Module trase.tools.sei_pcs.recursive_join
To reduce the number of joins that a user has to write in their model code, we allow them to specify a recipe of joins that is then performed for them.
A typical example is that they might have a customs dataset with a column containing geocodes of municipalities which they always want to be left-joined on to a dataset of municipalities.
This module turns such recipes into actual pandas merge operations, with user-friendly errors for when things go wrong. It also supports more deeply nested joins.
Functions
def dot(a, b)
def prejoin(dataframes: Dict[str, pandas.core.frame.DataFrame], links: Dict[str, Dict[str, Link]], excluding) -> Tuple[Dict[str, pandas.core.frame.DataFrame], Dict[str, Dict[str, Link]]]
def recursive_join(dataset: str, dataframes: Dict[str, pandas.core.frame.DataFrame], links: Dict[str, Dict[str, Link]], copy=True, limit=100) -> pandas.core.frame.DataFrame
Perform a "recursive join" for the given dataset.
For example, given the dataframes:
    dataframes = {
        "state": pd.DataFrame([
            {"code": 12, "name": "ACRE"},
            # ...
        ]),
        "municipality": pd.DataFrame([
            {"state": 12, "code": 1200302, "name": "FEIJO"},
            # ...
        ]),
    }

We might specify the following links:
    links = {
        "state": {},
        "municipality": {"state": Link("state", "code")},
    }

That is to say: the "state" column of the "municipality" dataframe should be left-joined on to the "code" column of the "state" dataframe.
We would perform such a join as follows:
    recursive_join("municipality", dataframes, links)

That would return the following dataframe:
| state.code | state.name | code    | name  |
|------------|------------|---------|-------|
| 12         | ACRE       | 1200302 | FEIJO |

The above example is simple, but this function also handles deeply nested links across many dataframes.
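For the two-dataframe example above, the same result can be reproduced with a single pandas merge. This is only a sketch of the idea, not the module's implementation: prefixing the right-hand columns with the dataset name and dropping the original key column are assumptions made to match the table shown.

```python
import pandas as pd

state = pd.DataFrame([{"code": 12, "name": "ACRE"}])
municipality = pd.DataFrame([{"state": 12, "code": 1200302, "name": "FEIJO"}])

# Prefix the right-hand columns with the dataset name so they cannot clash
# with the "code" and "name" columns already on the municipality dataframe.
state_prefixed = state.add_prefix("state.")

joined = municipality.merge(
    state_prefixed,
    how="left",
    left_on="state",
    right_on="state.code",
    validate="many_to_one",  # pandas raises if a state code is duplicated
).drop(columns=["state"])  # the key value now lives in "state.code"

# Reorder the columns to match the table above.
joined = joined[["state.code", "state.name", "code", "name"]]
print(joined)
```

With deeper nesting (for example, a dataset that links to "municipality"), the same step would be applied again to the already-joined result, which is where a recursive formulation becomes convenient.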
Args
copy: if false and there are no joins to be made, a reference to the original dataframe in the dictionary is returned rather than a copy. This can be a bit faster if you know you are not going to modify the return value.
limit: the maximum number of recursive entries into this call before RecursionError is raised.
Raises
LinkError: if any left join has missing or duplicated keys on the right.
ValueError: if the target of any link is itself linked, for example if dataset1.column1 left-joins on dataset2.column2, but dataset2.column2 also left-joins on some dataset3.column3.
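The two conditions behind LinkError (keys missing from the right, keys duplicated on the right) can be checked before merging. The sketch below uses a hypothetical helper, find_key_problems, and made-up rows to illustrate the check; it is not the module's own validation code and does not raise LinkError itself.

```python
import pandas as pd


def find_key_problems(left, right, left_on, right_on):
    """Hypothetical helper: return (missing, duplicated) key values.

    missing    -- keys used on the left that do not exist on the right
    duplicated -- keys that occur more than once on the right
    """
    right_keys = right[right_on]
    duplicated = sorted(right_keys[right_keys.duplicated()].unique().tolist())
    missing = sorted(set(left[left_on].dropna().tolist()) - set(right_keys.tolist()))
    return missing, duplicated


# Illustrative data only: state 12 appears twice, state 99 does not exist.
state = pd.DataFrame([
    {"code": 12, "name": "ACRE"},
    {"code": 12, "name": "ACRE (duplicate row)"},
])
municipality = pd.DataFrame([
    {"state": 12, "code": 1200302, "name": "FEIJO"},
    {"state": 99, "code": 9900000, "name": "UNKNOWN"},
])

missing, duplicated = find_key_problems(municipality, state, "state", "code")
print("missing on the right:", missing)
print("duplicated on the right:", duplicated)
```

Reporting the offending key values, rather than only the fact that a merge failed, is what makes this kind of error user-friendly.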
Classes
class Link(dataset: str, column: str, validate_only: bool = False)
Class variables
var column : str
var dataset : str
var validate_only : bool
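The constructor signature and class variables suggest a plain record type. A stand-in with the same three fields, written here as a dataclass (an assumption about how it is defined), is enough to experiment with building a links dictionary of the Dict[str, Dict[str, Link]] shape used in the function signatures. The behaviour controlled by validate_only is not described on this page, so the sketch only shows its default.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Link:
    """Stand-in for trase.tools.sei_pcs.recursive_join.Link (assumed shape)."""

    dataset: str  # name of the dataset to join on to
    column: str  # column of that dataset holding the join key
    validate_only: bool = False


# Outer key: source dataset. Inner key: column in the source dataset.
links: Dict[str, Dict[str, Link]] = {
    "state": {},
    "municipality": {"state": Link("state", "code")},
}

link = links["municipality"]["state"]
print(link.dataset, link.column, link.validate_only)
```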