Module trase.tools.etl.pandas_wrapper

Functions

def cast_dataframe_to_types(df, converters)
def open_filepath_or_buffer(filepath_or_buffer)
def read_csv(path_or_buffer, converters: Optional[dict], delimiter: Optional[str], quotechar, encoding, header='infer', skiprows=None)
def read_excel(location, converters: dict)
def read_json(source: Union[str, pathlib.Path, IO[~AnyStr]], types: Dict) ‑> pandas.core.frame.DataFrame

Loads a JSON file from disk. The JSON file must be in a row-wise layout:

[
    {"my_column": 1},
    {"my_column": 2},
    ...
]

The dictionary of types should map column names to their expected Python types. A column will only appear in the dataframe if it also appears in this mapping. Constructing typed Python objects is delegated to the json module, except that we reject Python's float NaN object and other esoteric float objects. Unlike pd.read_json, types are strictly enforced, even more strictly than in Python itself: for example, None is not an acceptable value for a boolean column. To skip validation for a column, use the built-in type object as that column's type.

Unlike pd.read_json we do not interpret certain "missing values" as NaNs (except for a float NaN interpreted as an object).
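The strict checking described above can be illustrated with a minimal sketch that uses only the standard library. This is not the module's implementation: the name load_rows_strict is hypothetical, it returns a plain dict of column lists rather than a DataFrame, and the points at which it raises (an exact type mismatch, a non-finite float in a float column, a missing key) are assumptions made for illustration.

```python
import io
import json
import math


def load_rows_strict(source, types):
    """Parse row-wise JSON and keep only the columns named in `types`.

    Hypothetical helper for illustration; not part of pandas_wrapper.
    """
    rows = json.load(source)
    columns = {name: [] for name in types}
    for index, row in enumerate(rows):
        for name, expected in types.items():
            value = row[name]  # assumption: a missing key is an error (KeyError)
            # The built-in `object` opts the column out of validation.
            if expected is not object:
                # Exact type match: None is not a bool, True is not an int,
                # and the integer 1 is not a float.
                if type(value) is not expected:
                    raise TypeError(
                        f"row {index}, column {name!r}: expected "
                        f"{expected.__name__}, got {type(value).__name__}"
                    )
                # Reject NaN and infinities in float columns.
                if expected is float and not math.isfinite(value):
                    raise ValueError(
                        f"row {index}, column {name!r}: non-finite float"
                    )
            columns[name].append(value)
    return columns


sample = io.StringIO(
    '[{"id": 1, "flag": true, "extra": "x"},'
    ' {"id": 2, "flag": false, "extra": "y"}]'
)
# "extra" is dropped because it is not listed in the types mapping.
columns = load_rows_strict(sample, {"id": int, "flag": bool})
```

Comparing with `type(value) is not expected` rather than isinstance is what makes the check stricter than ordinary Python, where bool is a subclass of int.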

TODO explore combining validation and casting for clarity and performance

def sniff_csv_delimiter(path_or_buffer, encoding='utf8')

Pandas is not very good at this, so we instead use the csv sniffer from Python's standard library. We run the sniffer on a sample of the file. If the object is buffer-like, we need to be careful to reset it to its original position after reading.
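A rough sketch of this approach, using csv.Sniffer on a bounded sample and restoring the buffer position afterwards. The function name sniff_delimiter, the sample_size parameter and the candidate delimiter set are assumptions for illustration, and the sketch only handles text-mode buffers; the real function may differ.

```python
import csv
import io


def sniff_delimiter(path_or_buffer, encoding="utf8", sample_size=4096):
    """Guess the delimiter of a CSV from a sample of its contents.

    Hypothetical helper for illustration; not part of pandas_wrapper.
    """
    if hasattr(path_or_buffer, "read"):
        # Buffer-like: remember where we are and always seek back,
        # so the caller can still read the file from the same position.
        start = path_or_buffer.tell()
        try:
            sample = path_or_buffer.read(sample_size)
        finally:
            path_or_buffer.seek(start)
    else:
        with open(path_or_buffer, encoding=encoding, newline="") as handle:
            sample = handle.read(sample_size)
    # Assumption: restrict the guess to a few common delimiters.
    dialect = csv.Sniffer().sniff(sample, delimiters=",;\t|")
    return dialect.delimiter


buffer = io.StringIO("a;b;c\n1;2;3\n4;5;6\n")
delimiter = sniff_delimiter(buffer)
position_after = buffer.tell()
```

The try/finally ensures the position is restored even if reading the sample fails.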

def types_similar(dtype, python_type: )

Classes

class WellBehavedCSV

Pandas and CSVs can do unexpected things. For example, how can you represent each of None, np.nan, and "" in a CSV? How can you write a number with a leading zero that doesn't get stripped on read?
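The ambiguity can be seen without pandas at all. The following sketch uses only the standard library's csv module with made-up sample values: None and the empty string are written identically, and a reader that infers integer types loses the leading zero.

```python
import csv
import io

# Write one row containing None and one containing an empty string.
buffer = io.StringIO()
csv.writer(buffer).writerows([["code", "note"], ["007", None], ["042", ""]])

# Read everything back: every field comes back as a string.
buffer.seek(0)
rows = list(csv.reader(buffer))

# None and "" are now indistinguishable.
none_cell = rows[1][1]
empty_cell = rows[2][1]

# A naive integer cast, as type inference might apply, drops the leading zero.
inferred = int(rows[1][0])
round_tripped = str(inferred)
```

After this runs, none_cell and empty_cell are both the empty string, and round_tripped no longer equals the original "007".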

This class imposes a set of restrictions, with the guarantee that if you serialize a DataFrame and then deserialize it, you will always recover exactly the original:

write(df_2, location, types)
df_2 == read(location, types)

The restrictions are as follows:

- Columns must be either str, float, int or bool type
- None values are not allowed (but NaN is OK in non-string columns)
- String columns may contain neither None nor NaN
- No named or multi-level indexes are allowed (this isn't for any good reason,
    I just didn't spend the time understanding the behaviour of this)

Class variables

var ALLOWED_DTYPES
var Types

Static methods

def read(source: Union[str, pathlib.Path, IO[~AnyStr]], types: Optional[Dict[str, type]] = None) ‑> pandas.core.frame.DataFrame
def read_csv(source: Union[str, pathlib.Path, IO[~AnyStr]])
def write(df: pandas.core.frame.DataFrame, destination: Union[str, pathlib.Path, IO[~AnyStr]], types: Optional[Dict[str, type]] = None)