Module trase.tools.pandasdb.query

Run arbitrary SQL commands using data stored in a Pandas dataframe.

Functions

def query_with_dataframe(df, query, name='df', drop=True, cnx=None, cur=None)

Perform an SQL query that uses the data contained in a dataframe.

Before running the SQL query, the dataframe will be streamed to a temporary table in the database. The query will then be wrapped with a common table expression (CTE) which provides access to the table in the query.


                      1. stream to
┌───────────────────┐    temp. table   ┌─────────────────┐
│ Pandas dataframe  ├──────────────────► pg_temp.tmp1234 │
└───────────────────┘                  └────────┬────────┘
                                                │
┌───────────────────┐                           │
│ Pandas dataframe  ◄───────────────────────────┘
└───────────────────┘ 2. Perform SQL query
                      3. Return data to Pandas

For example, here we are streaming a dataframe of trase ids and then performing an SQL JOIN. The contents of the dataframe are referred to as "df" in the query:

Example

>>> df = pd.DataFrame({"trase_id": ["EC-TRADER-0917021586001"]})
>>> query_with_dataframe(df, "select id from df join nodes using (trase_id)")
         id
0  10596850

Performance

The entire dataframe you pass will be streamed to the database. For speed consider selecting only the columns you require before passing the dataframe to this function.

Queries that involve CTES

Bear in mind that the query will be wrapped to the following:

with df as (select * from pg_temp.random1234) <query>

Therefore if your query itself contains a CTE, it should start with a comma rather than with:

query_with(
    df,
    '''
    , my_table as (select * from nodes)
    select id from df join my_table using (trase_id)
    '''
)

becomes:

with
    df as (select * from pg_temp.random1234),
    my_table as (select * from nodes)
select id from df join my_table using (trase_id)

Arguments

name: the alias for the data in the query, defaults to "df" drop: whether to drop the temporary table after the query runs, defaults to True