Module trase.tools.splitgraph_utilities.bulk_extract

Access Splitgraph data via the Export API

Functions

def extract_from_splitgraph(query, remote='data.splitgraph.com', poll_interval_seconds=1, poll_timeout_seconds=120) ‑> pandas.core.frame.DataFrame

Extract data from Splitgraph into a Pandas dataframe.

It is also possible to access data from Splitgraph by connecting to it as a database, or querying their DDN (data delivery network) over HTTP.

However, these methods have the drawback that the results will be truncated at 10,000 rows. In this function we use the "CSV export" API programmatically, which does not have this limit. You can also use this via the www.splitgraph.com website by clicking "export" above the SQL query.

Note that the dataframe will only contain strings. If you want to cast to specific types then do so after calling the function. The dataframe will also not contain any NaN values: these will be represented as empty strings.

Required setup

In order to access Splitgraph you will need to obtain an API token. This requires the ability to use the command-line on your laptop. To do so:

  1. Install the sgr command-line tool
  2. Create an account at splitgraph.com, if you haven't done already.
  3. Run the following and provide your Splitgraph email/username and password from the previous step:

    sgr cloud login

  4. (Optional) if you would like to access trase.splitgraph.io then also run::

    sgr cloud add --remote trase trase.splitgraph.io

    sgr cloud login --remote trase

If you do not feel comfortable using the command line then get in touch with the data team and we will help you.

Example

Here is an example accessing the latest Brazilian soy model:

df = extract_from_splitgraph(
    """
    select * from "trase-development/data-source"."supply-chains-latest"
    where latest and country_of_production = 'BRAZIL' and commodity = 'SOY'
    """
)

Args

query
an SQL query that will run on Splitgraph. You can visit https://www.splitgraph.com/trase-development (or whatever repository you are interested in) to get a sample SQL query that you can use as a starting point.
internal : bool
whether to use www.splitgraph.com (default, false) or trase.splitgraph.io (true).
poll_interval_seconds : int
how often to poll Splitgraph for the status of the CSV export job (default: 1).
poll_timeout_seconds : int
how many seconds to wait before giving up on the job and assuming it failed (default: 120).

Returns: A Pandas dataframe