Module trase.tools.pandasdb.find

Clean a Pandas dataframe containing node labels.

Functions

def find_biomes_by_trase_id(df, returning: Iterable[str], node_attributes_ref_id: int, trase_id=Identifier('trase_id'), on_extra_columns='error', cur=None, cnx=None)
def find_default_name_by_node_id(df, returning: Iterable[str], node_id: psycopg2.sql.Composable = Identifier('node_id'), on_extra_columns='error', cur=None, cnx=None)
def find_economic_blocs_by_trase_id(df, returning: Iterable[str], trase_id=Identifier('trase_id'), on_extra_columns='error', cur=None, cnx=None)
def find_nodes_by_code(df, returning: Iterable[str], code_id=Identifier('code_id'), value=Identifier('value'), year: psycopg2.sql.Composable = Literal(None), on_extra_columns='error', cur=None, cnx=None) ‑> List[Tuple]
def find_nodes_by_name(df, returning: Iterable[str], name=Identifier('name'), level: psycopg2.sql.Composable = Literal(None), parent_id: psycopg2.sql.Composable = Literal(None), sub_type_id: psycopg2.sql.Composable = Literal(None), type_id: psycopg2.sql.Composable = Literal(None), year: psycopg2.sql.Composable = Literal(None), on_extra_columns='error', cur=None, cnx=None) ‑> List[Tuple]

Performs a lookup into the main.nodes and main.node_names table given some search criteria.

The following search terms are required:

  • name (str)

The following search terms are optional:

  • parent_id (int), which should be an id of the main.nodes table
  • sub_type_id (int), which should be an id of the main.node_sub_types table
  • type_id (int)
  • level (int)
  • year (int)

The function requires a Pandas dataframe. Each row of the dataframe represents one "lookup" in the main schema table.

The search terms themselves can come from one of two sources:

  1. A column in the dataframe, indicated using sql.Identifier
  2. A value that is constant for all rows of the dataframe, indicated using sql.Literal

For example, the following call would get the name search terms using the "name" column in the dataframe, and would use the level=4 for every search:

from psycopg2 import sql

find_nodes_by_name(
    df,
    name=sql.Identifier("name"),
    level=sql.Literal(4),
    ...

The function additionally requires you to specify which information should be returned from the search via the "returning" argument. Supported values are:

  • default_name
  • lookup_value
  • parent_id
  • type_id
  • sub_type_id
  • level
  • year
  • node_id
  • trase_id
  • count

Example usage:

import pandas as pd
from psycopg2 import sql
from trase.tools.pandasdb.find import find_nodes_by_name

df = pd.DataFrame({"country_label": ["COOK ISLAS", "ZIMBAWE", "SAMOA AMERICANA"]})
returning = ["default_name", "node_id"]

result = find_nodes_by_name(
    df,
    returning,
    name=sql.Identifier("country_label"),
    level=sql.Literal(1),
)

print(result)
# [('COOK ISLANDS', 55), ('ZIMBABWE', 211), ('AMERICAN SAMOA', 7)]

df_result = pd.Dataframe(result, columns=returning)

Let's break down the call to find_nodes_by_name():

  1. First we pass the data containing the search terms we want to look up, in the form of a dataframe. This function is designed to take large dataframes:

    find_nodes_by_name(
        df,
    
  2. We say that the function should return the (default) name of the node that it identifies, along with the Node ID:

    returning=["default_name", "node_id"]
    
  3. Next, we say that the data used to match the node name should come from the "country_label" column in the dataframe. Here sql.Identifer means a column in the dataframe:

    name=sql.Identifier("country_label"),
    
  4. Then we say that the level should be 1, which is the level for a country. Here sql.Literal means a value that should be the same for all rows:

    level=sql.Literal(1),
    
  5. The result of the function is a list of tuples, which we can convert to a dataframe

    df_result = pd.Dataframe(result, columns=returning)
    

TODO:

  • Test that columns adhere to postgres standards (or get around this with a hack!)

Args

df
a non-empty Pandas DataFrame
returning
a list of strings
name
a column in the dataframe or literal value (defaults to the column "name")
on_extra_columns
behaviour if the dataframe contains columns which are not used by this function. One of "error", "warn" or "ignore" (defaults to "error").

Returns

A list of tuples containing the search results. The list will be the same length as the input dataframe, and will be in the same order (i.e. the fifth entry in the return value is the result of the fifth row in the input dataframe). If a search result returned zero or not matches then it will contain None values. The tuples will be in the same order as the order of the "returning" argument.

def find_nodes_by_trase_id(df, returning: Iterable[str], trase_id=Identifier('trase_id'), year: psycopg2.sql.Composable = Literal(None), on_extra_columns='error', cur=None, cnx=None) ‑> List[Tuple]
def find_parent_regions_by_trase_id(df, returning: Iterable[str], parent_region_level_name: psycopg2.sql.Composable, trase_id=Identifier('trase_id'), on_extra_columns='error', cur=None, cnx=None)
def find_trader_groups_by_name(df, returning: Iterable[str], group_name=Identifier('group_name'), year: psycopg2.sql.Composable = Literal(None), on_extra_columns='error', cur=None, cnx=None) ‑> List[Tuple]
def find_trader_groups_by_trader_id(df, returning: Iterable[str], trader_id=Identifier('trader_id'), year: psycopg2.sql.Composable = Literal(None), on_extra_columns='error', cur=None, cnx=None) ‑> List[Tuple]

Example

>>> df = pd.DataFrame({"group_name": ["H B EMPREENDIMENTOS"]})
>>> df_result = find_trader_groups_by_name(
...     df,
...     returning=["node_id"],
... )
>>> print(df_result)
[(1828696,)]
def find_traders_and_groups_by_label(df, returning: Iterable[str], trader_label=Identifier('trader_label'), country_id: psycopg2.sql.Composable = Literal(None), year: psycopg2.sql.Composable = Literal(None), on_extra_columns='error', cur=None, cnx=None)

TODO: unclear if the "count" return type is from the trader lookup or the group lookup. Make this clearer.

def find_traders_and_groups_by_trase_id(df, returning: Iterable[str], trase_id=Identifier('trase_id'), year: psycopg2.sql.Composable = Literal(None), on_extra_columns='error', cur=None, cnx=None)
def find_traders_by_label(df, returning: Iterable[str], trader_label=Identifier('trader_label'), country_id: psycopg2.sql.Composable = Literal(None), on_extra_columns='error', cur=None, cnx=None) ‑> List[Tuple]

Query the traders hierarchy in the database to clean the trader name and identify the trader group.

The following search terms are required:

  • trader_label (str)

The following search terms are optional:

  • country_id (int), which should be an id of the main.nodes table for a country node (type 'REGION', level 1) when specified

The function requires a Pandas dataframe. Each row of the dataframe represents one "lookup" in the main schema table.

The search terms themselves can come from one of two sources:

  1. A column in the dataframe, indicated using sql.Identifier
  2. A value that is constant for all rows of the dataframe, indicated using sql.Literal

When applying the function to the exporter, the "country_id" can be provided as a sql.Literal argument as it is the same for all rows:

from psycopg2 import sql

find_nodes_by_name(
    df,
    trader_label=sql.Identifier("exporter"),
    country_id=sql.Literal(27),
    ...

When applying the function to the importer, each row will have its own country of import, so the "country_id" should be supplied in the dataframe alongside with "trader_label" as sql.Identifier:

from psycopg2 import sql

find_nodes_by_name(
    df,
    trader_label=sql.Identifier("importer"),
    country_id=sql.Identifier("country_of_import_id"),
    ...

The function additionally requires you to specify which information should be returned from the search via the "returning" argument. Supported values are:

  • trader_id
  • trader_name
  • trase_id
  • lookup_value
  • count

Example usage:

import pandas as pd
from psycopg2 import sql
from trase.tools.pandasdb.find import find_traders_by_label

returning = ["trader_name", "trase_id", "count"]

df = pd.DataFrame({"trader_label": ["CARGILL"]})
result = find_traders_by_label(df, returning)
print(result)
# [
#     (
#         'CARGILL',
#         'BR-TRADER-53404893',
#         1,
#     ),
# ]

df = pd.DataFrame({"trader_label": ["BAYER SA"], "country_id": [12]})
result = find_traders_by_label(df, returning, country_id=sql.Identifier("country_id"))
print(result)
# [
#     (
#         'BAYER S.A.',
#         'AR-TRADER-3050381106',
#         1,
#     ),
# ]

df = pd.DataFrame({"trader_label": ["JOSE GARCIA"], "country_id": [27]})
result = find_traders_by_label(df, returning, country_id=sql.Identifier("country_id"))
print(result)
# [
#     (
#         None,
#         None,
#         8,
#     ),
# ]

Args

df
a non-empty Pandas DataFrame
returning
a list of strings
trader_label
a column in the dataframe (defaults to the column "trader_label")
country_id
a column in the dataframe or literal value (defaults to the column "country_id")
on_extra_columns
behaviour if the dataframe contains columns which are not used by this function. One of "error", "warn" or "ignore" (defaults to "error").

Returns

A list of tuples containing the search results. The list will be the same length as the input dataframe, and will be in the same order (i.e. the fifth entry in the return value is the result of the fifth row in the input dataframe). If a search result returned zero or not matches then it will contain None values. The tuples will be in the same order as the order of the "returning" argument.

def none_factory()

Classes

class LookupBase

Subclasses

Methods

def find(self, df, returning, on_extra_columns, cur, cnx)
def report(self) ‑> str
class LookupBiomesByTraseId (trase_id: psycopg2.sql.Composable, ref_id: psycopg2.sql.Literal)

LookupBiomesByTraseId(trase_id: psycopg2.sql.Composable, ref_id: psycopg2.sql.Literal)

Ancestors

Class variables

var SUPPORTS_RETURNING
var ref_id : psycopg2.sql.Literal
var trase_id : psycopg2.sql.Composable

Methods

def query(self, table_identifier)
class LookupByNodeCode (code_id: psycopg2.sql.Composable, value: psycopg2.sql.Composable, year: psycopg2.sql.Composable = <factory>)

LookupByNodeCode(code_id: psycopg2.sql.Composable, value: psycopg2.sql.Composable, year: psycopg2.sql.Composable = )

Ancestors

Class variables

var SUPPORTS_RETURNING
var code_id : psycopg2.sql.Composable
var value : psycopg2.sql.Composable
var year : psycopg2.sql.Composable

Methods

def query(self, table_identifier: psycopg2.sql.Identifier)
class LookupByNodeName (name: psycopg2.sql.Composable, level: psycopg2.sql.Composable = <factory>, parent_id: psycopg2.sql.Composable = <factory>, sub_type_id: psycopg2.sql.Composable = <factory>, type_id: psycopg2.sql.Composable = <factory>, year: psycopg2.sql.Composable = <factory>)

LookupByNodeName(name: psycopg2.sql.Composable, level: psycopg2.sql.Composable = , parent_id: psycopg2.sql.Composable = , sub_type_id: psycopg2.sql.Composable = , type_id: psycopg2.sql.Composable = , year: psycopg2.sql.Composable = )

Ancestors

Class variables

var SUPPORTS_RETURNING
var level : psycopg2.sql.Composable
var name : psycopg2.sql.Composable
var parent_id : psycopg2.sql.Composable
var sub_type_id : psycopg2.sql.Composable
var type_id : psycopg2.sql.Composable
var year : psycopg2.sql.Composable

Methods

def query(self, table_identifier)
class LookupByNodeTraseId (trase_id: psycopg2.sql.Composable, year: psycopg2.sql.Composable = <factory>)

LookupByNodeTraseId(trase_id: psycopg2.sql.Composable, year: psycopg2.sql.Composable = )

Ancestors

Class variables

var SUPPORTS_RETURNING
var trase_id : psycopg2.sql.Composable
var year : psycopg2.sql.Composable

Methods

def query(self, table_identifier)
class LookupDefaultNameByNodeId (node_id: psycopg2.sql.Composable)

LookupDefaultNameByNodeId(node_id: psycopg2.sql.Composable)

Ancestors

Class variables

var SUPPORTS_RETURNING
var node_id : psycopg2.sql.Composable

Methods

def query(self, table_identifier)
class LookupEconomicBlocsByTraseId (trase_id: psycopg2.sql.Composable)

LookupEconomicBlocsByTraseId(trase_id: psycopg2.sql.Composable)

Ancestors

Class variables

var SUPPORTS_RETURNING
var trase_id : psycopg2.sql.Composable

Methods

def query(self, table_identifier)
class LookupParentRegionsByTraseId (parent_region_level_name: psycopg2.sql.Composable, trase_id: psycopg2.sql.Composable)

LookupParentRegionsByTraseId(parent_region_level_name: psycopg2.sql.Composable, trase_id: psycopg2.sql.Composable)

Ancestors

Class variables

var SUPPORTS_RETURNING
var parent_region_level_name : psycopg2.sql.Composable
var trase_id : psycopg2.sql.Composable

Methods

def query(self, table_identifier)
class LookupTraderGroupsByName (group_name: psycopg2.sql.Composable, year: psycopg2.sql.Composable = <factory>)

LookupTraderGroupsByName(group_name: psycopg2.sql.Composable, year: psycopg2.sql.Composable = )

Ancestors

Class variables

var SUPPORTS_RETURNING
var group_name : psycopg2.sql.Composable
var year : psycopg2.sql.Composable

Methods

def query(self, table_identifier)
class LookupTraderGroupsByTraderId (trader_id: psycopg2.sql.Composable, year: psycopg2.sql.Composable = <factory>)

LookupTraderGroupsByTraderId(trader_id: psycopg2.sql.Composable, year: psycopg2.sql.Composable = )

Ancestors

Class variables

var SUPPORTS_RETURNING
var trader_id : psycopg2.sql.Composable
var year : psycopg2.sql.Composable

Methods

def query(self, table_identifier)
class LookupTradersByLabel (trader_label: psycopg2.sql.Composable, country_id: psycopg2.sql.Composable)

LookupTradersByLabel(trader_label: psycopg2.sql.Composable, country_id: psycopg2.sql.Composable)

Ancestors

Class variables

var SUPPORTS_RETURNING
var country_id : psycopg2.sql.Composable
var trader_label : psycopg2.sql.Composable

Methods

def query(self, table_identifier)