Module trase.tools.splitgraph_utilities.bulk_extract
Access Splitgraph data via the Export API
Functions
def extract_from_splitgraph(query, remote='data.splitgraph.com', poll_interval_seconds=1, poll_timeout_seconds=120) ‑> pandas.core.frame.DataFrame-
Extract data from Splitgraph into a Pandas dataframe.
It is also possible to access data from Splitgraph by connecting to it as a database, or querying their DDN (data delivery network) over HTTP.
However, these methods have the drawback that the results will be truncated at 10,000 rows. In this function we use the "CSV export" API programmatically, which does not have this limit. You can also use this via the www.splitgraph.com website by clicking "export" above the SQL query.
Note that the dataframe will only contain strings. If you want to cast to specific types then do so after calling the function. The dataframe will also not contain any NaN values: these will be represented as empty strings.
Required setup
In order to access Splitgraph you will need to obtain an API token. This requires the ability to use the command-line on your laptop. To do so:
- Install the sgr command-line tool
- Create an account at splitgraph.com, if you haven't done already.
-
Run the following and provide your Splitgraph email/username and password from the previous step:
sgr cloud login -
(Optional) if you would like to access trase.splitgraph.io then also run::
sgr cloud add --remote trase trase.splitgraph.iosgr cloud login --remote trase
If you do not feel comfortable using the command line then get in touch with the data team and we will help you.
Example
Here is an example accessing the latest Brazilian soy model:
df = extract_from_splitgraph( """ select * from "trase-development/data-source"."supply-chains-latest" where latest and country_of_production = 'BRAZIL' and commodity = 'SOY' """ )Args
query- an SQL query that will run on Splitgraph. You can visit https://www.splitgraph.com/trase-development (or whatever repository you are interested in) to get a sample SQL query that you can use as a starting point.
internal:bool- whether to use www.splitgraph.com (default, false) or trase.splitgraph.io (true).
poll_interval_seconds:int- how often to poll Splitgraph for the status of the CSV export job (default: 1).
poll_timeout_seconds:int- how many seconds to wait before giving up on the job and assuming it failed (default: 120).
Returns: A Pandas dataframe