datapunt_processing.transform.preprocessing package

Submodules

datapunt_processing.transform.preprocessing.data_selection module

class datapunt_processing.transform.preprocessing.data_selection.DateRange(start, end)

Bases: object

Class representing a range of dates.

It can be used to select rows from a data frame, provided the data frame contains a column ‘date’ populated with datetime.date values.

static from_dataframe(df: pandas.core.frame.DataFrame)

Create a date range from a data frame.

Args:
df (pd.DataFrame): a data frame having a date column populated with dt.datetime values.
Returns:
DateRange object
length()

Returns the time delta between end and start

Returns:
dt.timedelta
select(df)

Apply the date range selection to a data frame.

Args:
df (pd.DataFrame): input data frame
Returns:
pd.DataFrame
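
The behaviour described for DateRange can be illustrated with a small stand-in class. This is a minimal sketch, not the package's implementation: the name DateRangeSketch is hypothetical, and it assumes that both bounds are inclusive and that from_dataframe spans the earliest to the latest value in the ‘date’ column.

```python
import datetime as dt

import pandas as pd


class DateRangeSketch:
    """Hypothetical stand-in for DateRange, for illustration only."""

    def __init__(self, start, end):
        self.start = start
        self.end = end

    @staticmethod
    def from_dataframe(df: pd.DataFrame) -> "DateRangeSketch":
        # Assumption: the range runs from the earliest to the latest 'date'.
        return DateRangeSketch(df["date"].min(), df["date"].max())

    def length(self) -> dt.timedelta:
        # Time delta between end and start.
        return self.end - self.start

    def select(self, df: pd.DataFrame) -> pd.DataFrame:
        # Assumption: both bounds are inclusive.
        mask = (df["date"] >= self.start) & (df["date"] <= self.end)
        return df[mask]


df = pd.DataFrame({
    "date": [dt.datetime(2017, 1, 1), dt.datetime(2017, 1, 15), dt.datetime(2017, 2, 1)],
    "value": [1, 2, 3],
})
january = DateRangeSketch(dt.datetime(2017, 1, 1), dt.datetime(2017, 1, 31))
selected = january.select(df)  # keeps the two January rows
```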
datapunt_processing.transform.preprocessing.data_selection.remove_nan_targets(df: pandas.core.frame.DataFrame)

Remove NaNs in the target list.

Args:
df (pd.DataFrame): data frame on which to operate
run (Run): run parameters
Returns:
pd.DataFrame
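
One plausible way to drop rows with missing targets is pandas' dropna with a subset. The sketch below is an assumption about the behaviour, not the package's code: the function name and the explicit targets argument are hypothetical, since the documentation does not say where the target list comes from.

```python
import numpy as np
import pandas as pd


def remove_nan_targets_sketch(df: pd.DataFrame, targets: list) -> pd.DataFrame:
    # 'targets' is a hypothetical argument naming the target columns.
    # Rows with NaN in any target column are dropped; NaNs elsewhere are kept.
    return df.dropna(subset=targets)


df = pd.DataFrame({
    "feature": [1.0, np.nan, 3.0],
    "target": [10.0, 20.0, np.nan],
})
cleaned = remove_nan_targets_sketch(df, ["target"])
```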
datapunt_processing.transform.preprocessing.data_selection.select_and_report(df, selection, description, max_drop_rate=None)

Select data from a data frame and report how many rows were dropped.

Args:
df (pd.DataFrame): apply selection on this data frame
selection (pd.Series): boolean selection
description (str): description in log message
max_drop_rate (float, optional): max fraction of dropped rows
Returns:
pd.DataFrame
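
The select-and-report pattern can be sketched as follows. This is an illustrative re-implementation under stated assumptions: the function name is hypothetical, the report goes to the standard logging module, and exceeding max_drop_rate is assumed to raise a ValueError (the documentation does not say what happens).

```python
import logging

import pandas as pd

logger = logging.getLogger(__name__)


def select_and_report_sketch(df, selection, description, max_drop_rate=None):
    """Hypothetical sketch: apply a boolean mask and log the drop count."""
    n_before = len(df)
    selected = df[selection]
    n_dropped = n_before - len(selected)
    drop_rate = n_dropped / n_before if n_before else 0.0
    logger.info("%s: dropped %d of %d rows", description, n_dropped, n_before)
    # Assumption: a drop rate above the limit is treated as an error.
    if max_drop_rate is not None and drop_rate > max_drop_rate:
        raise ValueError(
            f"{description}: drop rate {drop_rate:.2f} exceeds {max_drop_rate}"
        )
    return selected


df = pd.DataFrame({"x": [1, -2, 3, 4]})
positive = select_and_report_sketch(df, df["x"] > 0, "keep positive x", max_drop_rate=0.5)
```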

datapunt_processing.transform.preprocessing.enrichment module

datapunt_processing.transform.preprocessing.enrichment.enrich_datetime(df)

Add datetime features (e.g. year, day of week), which can also be used for ML feature generation.

Args:
df (pd.DataFrame): schedule or historical data
Returns:
pd.DataFrame
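
Datetime enrichment of this kind is commonly done with the pandas .dt accessor. The sketch below assumes the timestamps live in a column named ‘date’ and picks four example features; the actual column name and feature set used by enrich_datetime are not documented here.

```python
import pandas as pd


def enrich_datetime_sketch(df: pd.DataFrame) -> pd.DataFrame:
    # Assumption: the timestamp column is called 'date'.
    out = df.copy()
    stamps = pd.to_datetime(out["date"])
    out["year"] = stamps.dt.year
    out["month"] = stamps.dt.month
    out["dayofweek"] = stamps.dt.dayofweek  # Monday=0 ... Sunday=6
    out["dayofyear"] = stamps.dt.dayofyear
    return out


enriched = enrich_datetime_sketch(pd.DataFrame({"date": ["2017-01-02", "2016-12-31"]}))
```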

datapunt_processing.transform.preprocessing.ml_helperfunctions module

datapunt_processing.transform.preprocessing.ml_helperfunctions.accuracy_score(y_true, y_pred)

Compare y_true to y_pred and return the accuracy

datapunt_processing.transform.preprocessing.ml_helperfunctions.calculate_covariance_matrix(X, Y=None)

Calculate the covariance matrix for the dataset X

datapunt_processing.transform.preprocessing.ml_helperfunctions.calculate_std_dev(X)

Calculate the standard deviations of the features in dataset X

datapunt_processing.transform.preprocessing.ml_helperfunctions.calculate_variance(X)

Return the variance of the features in dataset X

datapunt_processing.transform.preprocessing.ml_helperfunctions.euclidean_distance(x1, x2)

Calculates the L2 (Euclidean) distance between two vectors

datapunt_processing.transform.preprocessing.ml_helperfunctions.mean_squared_error(y_true, y_pred)

Returns the mean squared error between y_true and y_pred
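
The quantities named in this module have standard definitions, which can be written with NumPy as below. These are sketches with a _sketch suffix, not the module's code. Two choices are assumptions: the variance divides by n (population variance), and the covariance divides by n - 1 (sample covariance) with Y defaulting to X.

```python
import numpy as np


def accuracy_score_sketch(y_true, y_pred):
    # Fraction of positions where prediction equals truth.
    return np.mean(np.asarray(y_true) == np.asarray(y_pred))


def mean_squared_error_sketch(y_true, y_pred):
    return np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)


def euclidean_distance_sketch(x1, x2):
    return np.sqrt(np.sum((np.asarray(x1) - np.asarray(x2)) ** 2))


def calculate_variance_sketch(X):
    # Assumption: population variance per feature (column), dividing by n.
    X = np.asarray(X, dtype=float)
    return np.mean((X - X.mean(axis=0)) ** 2, axis=0)


def calculate_std_dev_sketch(X):
    return np.sqrt(calculate_variance_sketch(X))


def calculate_covariance_matrix_sketch(X, Y=None):
    # Assumption: sample covariance (divide by n - 1); Y defaults to X.
    X = np.asarray(X, dtype=float)
    Y = X if Y is None else np.asarray(Y, dtype=float)
    n = X.shape[0]
    return (X - X.mean(axis=0)).T @ (Y - Y.mean(axis=0)) / (n - 1)


X = np.array([[1.0, 2.0], [3.0, 6.0], [5.0, 7.0]])
cov = calculate_covariance_matrix_sketch(X)
```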

datapunt_processing.transform.preprocessing.ml_preprocessing module

datapunt_processing.transform.preprocessing.utilities module

class datapunt_processing.transform.preprocessing.utilities.BigFile(f)

Bases: object

Wrapper for pickling big files. See the pickle_big_dump and pickle_big_load functions.

read(n)

Read n bytes. Reads a big file in batches of approximately 1 GB.

Args:
n (int): number of bytes
Returns:
bytearray: buffer
write(buffer)

Write buffer. Writes in batches of approximately 1 GB.

Args:
buffer (bytearray): buffer
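
A chunked file wrapper of this kind, together with dump and load helpers, could look like the sketch below. It is an assumption about the design, not the package's code: the names carry a _sketch suffix, the batch size is taken as 2**30 bytes, and all other file methods are delegated to the wrapped file object so that pickle can still call them.

```python
import os
import pickle
import tempfile

CHUNK = 1 << 30  # assumption: "approximately 1 GB" taken as 2**30 bytes


class BigFileSketch:
    """Hypothetical wrapper that reads and writes in bounded batches."""

    def __init__(self, f):
        self.f = f

    def __getattr__(self, item):
        # Delegate everything else (e.g. readline) to the wrapped file.
        return getattr(self.f, item)

    def read(self, n):
        buffer = bytearray()
        while len(buffer) < n:
            chunk = self.f.read(min(n - len(buffer), CHUNK))
            if not chunk:  # end of file reached early
                break
            buffer.extend(chunk)
        return buffer

    def write(self, buffer):
        view = memoryview(buffer)
        for idx in range(0, len(view), CHUNK):
            self.f.write(view[idx:idx + CHUNK])


def pickle_big_dump_sketch(obj, file_path):
    with open(file_path, "wb") as f:
        pickle.dump(obj, BigFileSketch(f), protocol=pickle.HIGHEST_PROTOCOL)


def pickle_big_load_sketch(file_path):
    with open(file_path, "rb") as f:
        return pickle.load(BigFileSketch(f))


path = os.path.join(tempfile.mkdtemp(), "obj.pkl")
pickle_big_dump_sketch({"a": list(range(10))}, path)
restored = pickle_big_load_sketch(path)
```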
datapunt_processing.transform.preprocessing.utilities.assert_unique(series)

Assert that all values are unique. Raises a value error if not. Empty series are ignored.
Parameters: series – input
Returns: none
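
A uniqueness check with these semantics can be sketched with pandas' duplicated. The function name is hypothetical, and ValueError is assumed to be the "value error" the description refers to.

```python
import pandas as pd


def assert_unique_sketch(series: pd.Series) -> None:
    # Empty series are ignored, as in the description.
    if series.empty:
        return
    duplicated = series[series.duplicated()]
    if not duplicated.empty:
        # Assumption: the "value error" is Python's built-in ValueError.
        raise ValueError(f"values are not unique: {duplicated.unique().tolist()}")


assert_unique_sketch(pd.Series([1, 2, 3]))        # passes silently
assert_unique_sketch(pd.Series([], dtype=float))  # empty: ignored
```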

datapunt_processing.transform.preprocessing.utilities.calc_error(df)
Parameters: df (DataFrame) – predictions data frame.
Returns: Series – error (truth - prediction).
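
The error definition (truth - prediction) amounts to a column subtraction. In the sketch below the column names ‘truth’ and ‘prediction’ are assumptions; the documentation does not state how the predictions data frame is laid out.

```python
import pandas as pd


def calc_error_sketch(df: pd.DataFrame) -> pd.Series:
    # Assumption: the columns are named 'truth' and 'prediction'.
    return df["truth"] - df["prediction"]


errors = calc_error_sketch(
    pd.DataFrame({"truth": [3.0, 5.0], "prediction": [2.5, 6.0]})
)
```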
datapunt_processing.transform.preprocessing.utilities.cols_not_in(columns, df: pandas.core.frame.DataFrame)
datapunt_processing.transform.preprocessing.utilities.get_last_full_year(df)

Find the last full year in a data frame. E.g. will return 2016 if you are in 2017.

Args:
df (pd.DataFrame): input data, has column dayofyear
Returns:
int: last full year
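
How a "full year" is detected is not documented. One possible reading, sketched below, assumes the data frame also has a ‘year’ column and counts a year as full once its data reaches day 365; both points are assumptions, as is the function name.

```python
import pandas as pd


def get_last_full_year_sketch(df: pd.DataFrame) -> int:
    # Assumptions: a 'year' column exists next to 'dayofyear', and a year
    # is "full" when its highest dayofyear is at least 365.
    last_day = df.groupby("year")["dayofyear"].max()
    full_years = last_day[last_day >= 365]
    return int(full_years.index.max())


days_2016 = pd.DataFrame({"year": 2016, "dayofyear": range(1, 367)})
days_2017 = pd.DataFrame({"year": 2017, "dayofyear": range(1, 101)})
last_full = get_last_full_year_sketch(pd.concat([days_2016, days_2017]))
```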
datapunt_processing.transform.preprocessing.utilities.get_script_dir()
datapunt_processing.transform.preprocessing.utilities.is_numeric(column)
Parameters: column (Series) – a column.
Returns: bool – True when the column has a numeric type.
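
A numeric-type check on a column is available in pandas as pandas.api.types.is_numeric_dtype. The sketch below wraps it; whether is_numeric does the same internally is an assumption. Note that pandas treats boolean columns as numeric under this check.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def is_numeric_sketch(column: pd.Series) -> bool:
    # Hypothetical equivalent built on the pandas dtype helper.
    return bool(is_numeric_dtype(column))


numeric = is_numeric_sketch(pd.Series([1, 2, 3]))
textual = is_numeric_sketch(pd.Series(["a", "b"]))
```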
datapunt_processing.transform.preprocessing.utilities.merge_and_report(left, right, on: list, description='', n_unmatched_limit=None) → pandas.core.frame.DataFrame

Performs a merge between two data frames and reports stats on matches. Only left merges are supported for now. If a column is present in both the left and right data frame, the left column has priority and the right column is ignored.

Args:
left (pd.DataFrame): lhs
right (pd.DataFrame): rhs
on (list[str]): columns to match on
description (str or None): description of what merge is done
n_unmatched_limit (int): throw error when number of rows not found in left side is larger than this number
Returns:
pd.DataFrame
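
The merge behaviour described above can be sketched with pandas' merge and its indicator option. This is an illustrative version under assumptions: the function name is hypothetical, unmatched rows are counted as left rows without a partner on the right, and exceeding n_unmatched_limit is assumed to raise a ValueError.

```python
import logging

import pandas as pd

logger = logging.getLogger(__name__)


def merge_and_report_sketch(left, right, on, description="", n_unmatched_limit=None):
    # Left columns have priority: drop overlapping non-key columns from right.
    overlap = [c for c in right.columns if c in left.columns and c not in on]
    merged = left.merge(
        right.drop(columns=overlap), on=on, how="left", indicator=True
    )
    n_unmatched = int((merged["_merge"] == "left_only").sum())
    logger.info("%s: %d of %d rows unmatched", description, n_unmatched, len(merged))
    # Assumption: too many unmatched rows is reported as a ValueError.
    if n_unmatched_limit is not None and n_unmatched > n_unmatched_limit:
        raise ValueError(f"{description}: {n_unmatched} unmatched rows")
    return merged.drop(columns="_merge")


left = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})
right = pd.DataFrame({"id": [1, 2], "name": ["x", "y"], "score": [0.1, 0.2]})
merged = merge_and_report_sketch(left, right, on=["id"], description="add scores")
```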
datapunt_processing.transform.preprocessing.utilities.optional_make_dir(path)

Creates a directory at path if it does not exist yet.

Args:

path (str): path to create directory in
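
In the standard library the same effect is available through os.makedirs with exist_ok=True. The sketch below shows that approach; whether optional_make_dir uses it internally is an assumption.

```python
import os
import tempfile


def optional_make_dir_sketch(path: str) -> None:
    # Creates the directory (and missing parents); no error if it exists.
    os.makedirs(path, exist_ok=True)


target = os.path.join(tempfile.mkdtemp(), "output", "models")
optional_make_dir_sketch(target)
optional_make_dir_sketch(target)  # second call is a no-op
```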
datapunt_processing.transform.preprocessing.utilities.pickle_big_dump(obj, file_path)
datapunt_processing.transform.preprocessing.utilities.pickle_big_load(file_path)
datapunt_processing.transform.preprocessing.utilities.rms(array)

Calculate the root mean square of an array
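
The root mean square of values x_1 ... x_n is sqrt((x_1^2 + ... + x_n^2) / n). A minimal NumPy sketch of that formula follows; the function name is hypothetical.

```python
import numpy as np


def rms_sketch(array) -> float:
    # Square, average, then take the square root.
    values = np.asarray(array, dtype=float)
    return float(np.sqrt(np.mean(np.square(values))))


alternating = rms_sketch([1, -1, 1, -1])
```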

Module contents