datapunt_processing.transform.preprocessing package

Submodules

datapunt_processing.transform.preprocessing.data_selection module

class datapunt_processing.transform.preprocessing.data_selection.DateRange(start, end)

Bases: object

Class representing a range of dates

Can be used to select data in data frames if the data frame contains a column ‘date’ and is populated with datetime.date values.

static from_dataframe(df: pandas.core.frame.DataFrame)

Create a date range from a data frame.

Args:
    df (pd.DataFrame): a data frame having a date column populated with dt.datetime values.

Returns:
    DateRange object

length()

Returns the time delta between end and start.

Returns:
    dt.timedelta

select(df)

Apply the date range selection to a data frame.

Args:
    df (pd.DataFrame): input data frame

Returns:
    pd.DataFrame
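The behaviour described for DateRange can be illustrated with a minimal sketch. This is not the package's implementation: the class name `DateRangeSketch`, the inclusive bounds on both ends, and the use of Python's built-in `min`/`max` are assumptions made for illustration. Only the `date` column name comes from the description above.

```python
import datetime as dt

import pandas as pd


class DateRangeSketch:
    """Hypothetical stand-in for DateRange; both bounds are assumed inclusive."""

    def __init__(self, start, end):
        self.start = start
        self.end = end

    @staticmethod
    def from_dataframe(df):
        # Span the earliest to the latest value found in the 'date' column.
        return DateRangeSketch(min(df["date"]), max(df["date"]))

    def length(self):
        # Subtracting two dates yields a dt.timedelta.
        return self.end - self.start

    def select(self, df):
        # Keep only rows whose 'date' lies inside [start, end].
        mask = (df["date"] >= self.start) & (df["date"] <= self.end)
        return df[mask]


df = pd.DataFrame(
    {
        "date": [dt.date(2017, 1, 1), dt.date(2017, 1, 15), dt.date(2017, 2, 1)],
        "value": [1, 2, 3],
    }
)
january = DateRangeSketch(dt.date(2017, 1, 1), dt.date(2017, 1, 31))
selected = january.select(df)          # keeps the two January rows
full_range = DateRangeSketch.from_dataframe(df)
```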

datapunt_processing.transform.preprocessing.data_selection.remove_nan_targets(df: pandas.core.frame.DataFrame)

Remove NaNs in the target list.

Args:
    df (pd.DataFrame): data frame on which to operate
    run (Run): run parameters

Returns:
    pd.DataFrame

datapunt_processing.transform.preprocessing.data_selection.select_and_report(df, selection, description, max_drop_rate=None)

Select data from a data frame and report how many rows were dropped.

Args:
    df (pd.DataFrame): apply selection on this data frame
    selection (pd.Series): boolean selection
    description (str): description in log message
    max_drop_rate (float, optional): max fraction of dropped rows

Returns:
    pd.DataFrame
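A rough sketch of the select-and-report pattern is given below. It is an illustration, not the package's code: the function name `select_and_report_sketch`, the log message format, and the choice to raise a `ValueError` when `max_drop_rate` is exceeded are all assumptions.

```python
import logging

import pandas as pd

logger = logging.getLogger(__name__)


def select_and_report_sketch(df, selection, description, max_drop_rate=None):
    """Hypothetical sketch: filter df with a boolean mask and log the drop count."""
    selected = df[selection]
    n_total = len(df)
    n_dropped = n_total - len(selected)
    drop_rate = n_dropped / n_total if n_total else 0.0
    logger.info(
        "%s: dropped %d of %d rows (%.1f%%)",
        description, n_dropped, n_total, 100 * drop_rate,
    )
    # Assumption: exceeding the allowed drop rate is treated as an error.
    if max_drop_rate is not None and drop_rate > max_drop_rate:
        raise ValueError(
            f"{description}: drop rate {drop_rate:.2f} exceeds {max_drop_rate:.2f}"
        )
    return selected


df = pd.DataFrame({"target": [1.0, None, 3.0, 4.0]})
kept = select_and_report_sketch(
    df, df["target"].notna(), "remove missing targets", max_drop_rate=0.5
)
```

Here one of four rows is dropped (a drop rate of 0.25), which stays under the 0.5 limit, so the call returns the three remaining rows.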

datapunt_processing.transform.preprocessing.enrichment module

datapunt_processing.transform.preprocessing.enrichment.enrich_datetime(df)

Add datetime features (e.g. year, day of week), which can also be used for ML feature generation.

Args:
    df (pd.DataFrame): schedule or historical data

Returns:
    pd.DataFrame
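Datetime enrichment of this kind is commonly done with the pandas `.dt` accessor. The sketch below shows the idea under stated assumptions: the function name `enrich_datetime_sketch`, the source column name `date`, and the exact set of generated columns are hypothetical and may differ from what the package produces.

```python
import pandas as pd


def enrich_datetime_sketch(df, column="date"):
    """Hypothetical sketch: derive calendar features from a datetime column."""
    out = df.copy()  # leave the caller's data frame untouched
    stamps = pd.to_datetime(out[column])
    out["year"] = stamps.dt.year
    out["month"] = stamps.dt.month
    out["dayofweek"] = stamps.dt.dayofweek  # Monday is 0, Sunday is 6
    out["dayofyear"] = stamps.dt.dayofyear
    return out


df = pd.DataFrame({"date": ["2017-01-02", "2017-02-01"], "value": [10, 20]})
enriched = enrich_datetime_sketch(df)
```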

datapunt_processing.transform.preprocessing.ml_helperfunctions module

datapunt_processing.transform.preprocessing.ml_helperfunctions.accuracy_score(y_true, y_pred)

Compare y_true to y_pred and return the accuracy.

datapunt_processing.transform.preprocessing.ml_helperfunctions.calculate_covariance_matrix(X, Y=None)

Calculate the covariance matrix for the dataset X.

datapunt_processing.transform.preprocessing.ml_helperfunctions.calculate_std_dev(X)

Calculate the standard deviations of the features in dataset X.

datapunt_processing.transform.preprocessing.ml_helperfunctions.calculate_variance(X)

Return the variance of the features in dataset X.

datapunt_processing.transform.preprocessing.ml_helperfunctions.euclidean_distance(x1, x2)

Calculate the L2 distance between two vectors.

datapunt_processing.transform.preprocessing.ml_helperfunctions.mean_squared_error(y_true, y_pred)

Return the mean squared error between y_true and y_pred.
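The quantities computed by these helpers have standard definitions, sketched below with NumPy. The `_sketch` function names are hypothetical, and the population normalisation (dividing by n rather than n - 1) for variance and covariance is an assumption; the package may normalise differently.

```python
import numpy as np


def euclidean_distance_sketch(x1, x2):
    # L2 distance: square root of the summed squared differences.
    diff = np.asarray(x1, dtype=float) - np.asarray(x2, dtype=float)
    return float(np.sqrt(np.sum(diff ** 2)))


def mean_squared_error_sketch(y_true, y_pred):
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(diff ** 2))


def accuracy_score_sketch(y_true, y_pred):
    # Fraction of positions where the prediction equals the truth.
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))


def calculate_covariance_matrix_sketch(X, Y=None):
    # Assumption: divide by n (population form); rows are samples.
    X = np.asarray(X, dtype=float)
    Y = X if Y is None else np.asarray(Y, dtype=float)
    n_samples = X.shape[0]
    return (X - X.mean(axis=0)).T @ (Y - Y.mean(axis=0)) / n_samples


def calculate_variance_sketch(X):
    # Per-feature variance, using the same normalisation as above.
    X = np.asarray(X, dtype=float)
    return np.mean((X - X.mean(axis=0)) ** 2, axis=0)


def calculate_std_dev_sketch(X):
    return np.sqrt(calculate_variance_sketch(X))


X = np.array([[1.0, 2.0], [3.0, 6.0]])
cov = calculate_covariance_matrix_sketch(X)
var = calculate_variance_sketch(X)
```

With this normalisation the diagonal of the covariance matrix equals the per-feature variance, which is a convenient consistency check.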

datapunt_processing.transform.preprocessing.ml_preprocessing module

datapunt_processing.transform.preprocessing.utilities module

class datapunt_processing.transform.preprocessing.utilities.BigFile(f)

Bases: object

Wrapper for pickling big files. See the pickle_big_dump and pickle_big_load functions.

read(n)

Read n bytes. Reads a big file in batches of approximately 1 GB.

Args:
    n (int): number of bytes

Returns:
    bytearray: buffer

write(buffer)

Write buffer. Writes in batches of approximately 1 GB.

Args:
    buffer (bytearray): buffer
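The batching idea behind such a wrapper can be sketched with the standard library alone. The class below is a hypothetical illustration, not the package's BigFile: the name `BigFileSketch` and the configurable `batch_size` parameter are assumptions, added so the chunking can be exercised on a small in-memory file.

```python
import io
import pickle


class BigFileSketch:
    """Hypothetical chunked wrapper around a binary file object."""

    def __init__(self, f, batch_size=1 << 30):
        self.f = f
        self.batch_size = batch_size  # roughly 1 GB by default

    def read(self, n):
        # Fill a preallocated buffer one batch at a time.
        buffer = bytearray(n)
        idx = 0
        while idx < n:
            chunk = self.f.read(min(n - idx, self.batch_size))
            if not chunk:  # end of file reached early
                break
            buffer[idx:idx + len(chunk)] = chunk
            idx += len(chunk)
        return buffer[:idx]

    def write(self, buffer):
        # Hand the underlying file at most batch_size bytes per call.
        n = len(buffer)
        idx = 0
        while idx < n:
            end = min(idx + self.batch_size, n)
            self.f.write(buffer[idx:end])
            idx = end


# Round trip through an in-memory file, using a tiny batch size for the demo.
payload = pickle.dumps({"rows": list(range(100))})
raw = io.BytesIO()
BigFileSketch(raw, batch_size=16).write(payload)
raw.seek(0)
restored = pickle.loads(BigFileSketch(raw, batch_size=16).read(len(payload)))
```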

datapunt_processing.transform.preprocessing.utilities.assert_unique(series)

Assert that all values are unique. Raises a ValueError if not. Empty series are ignored.

Args:
    series: input

Returns:
    None
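A check of this kind can be written in a few lines of pandas. The sketch below assumes the input is a `pd.Series`; the function name `assert_unique_sketch` and the error message are hypothetical.

```python
import pandas as pd


def assert_unique_sketch(series):
    """Hypothetical sketch: raise ValueError when a series holds duplicates."""
    if series.empty:
        return  # empty series are ignored
    if not series.is_unique:
        duplicates = series[series.duplicated()].unique().tolist()
        raise ValueError(f"Values are not unique, duplicates: {duplicates}")


assert_unique_sketch(pd.Series([1, 2, 3]))          # passes silently
assert_unique_sketch(pd.Series([], dtype=float))    # ignored
```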

datapunt_processing.transform.preprocessing.utilities.calc_error(df)

Args:
    df (pd.DataFrame): predictions data frame

Returns:
    pd.Series: error (truth - prediction)

datapunt_processing.transform.preprocessing.utilities.cols_not_in(columns, df: pandas.core.frame.DataFrame)

datapunt_processing.transform.preprocessing.utilities.get_last_full_year(df)

Find the last full year in a data frame. E.g. will return 2016 if you are in 2017.

Args:
    df (pd.DataFrame): input data, has column dayofyear

Returns:
    int: last full year

datapunt_processing.transform.preprocessing.utilities.get_script_dir()

datapunt_processing.transform.preprocessing.utilities.is_numeric(column)

Args:
    column (pd.Series): a column

Returns:
    bool: True when the column is a numeric type

datapunt_processing.transform.preprocessing.utilities.merge_and_report(left, right, on: list, description='', n_unmatched_limit=None) → pandas.core.frame.DataFrame

Performs a merge between two data frames and reports stats on matches. Only left merges are supported for now. If a column is present in both the left and right data frames, the left column has priority and the right column is ignored.

Args:
    left (pd.DataFrame): lhs
    right (pd.DataFrame): rhs
    on (list[str]): columns to match on
    description (str or None): description of what merge is done
    n_unmatched_limit (int): throw error when number of rows not found in left side is larger than this number

Returns:
    pd.DataFrame
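The merge-and-report pattern can be sketched with `DataFrame.merge` and its `indicator` option. This is an illustration under assumptions, not the package's code: the function name `merge_and_report_sketch`, the log format, the interpretation of "unmatched" as left rows without a partner in the right frame, and raising a `ValueError` above the limit are all hypothetical.

```python
import logging

import pandas as pd

logger = logging.getLogger(__name__)


def merge_and_report_sketch(left, right, on, description="", n_unmatched_limit=None):
    """Hypothetical sketch: left merge that logs how many rows found no match."""
    # The left column has priority: drop overlapping non-key columns on the right.
    overlap = [c for c in right.columns if c in left.columns and c not in on]
    merged = left.merge(
        right.drop(columns=overlap), on=on, how="left", indicator=True
    )
    # Assumption: "unmatched" means left rows with no partner in the right frame.
    n_unmatched = int((merged["_merge"] == "left_only").sum())
    logger.info("%s: %d of %d rows unmatched", description, n_unmatched, len(merged))
    if n_unmatched_limit is not None and n_unmatched > n_unmatched_limit:
        raise ValueError(f"{description}: {n_unmatched} unmatched rows")
    return merged.drop(columns="_merge")


left = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})
right = pd.DataFrame({"id": [1, 2], "name": ["x", "y"], "score": [10, 20]})
merged = merge_and_report_sketch(left, right, on=["id"], description="add scores")
```

In this example the `name` column from the right frame is discarded, and the row with `id` 3 keeps a missing `score` because it has no match.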

datapunt_processing.transform.preprocessing.utilities.optional_make_dir(path)

Creates a directory at path if it does not exist yet.

Args:
    path (str): path to create directory in

datapunt_processing.transform.preprocessing.utilities.pickle_big_dump(obj, file_path)

datapunt_processing.transform.preprocessing.utilities.pickle_big_load(file_path)

datapunt_processing.transform.preprocessing.utilities.rms(array)

Calculate the root mean square of an array.
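For reference, the root mean square is the square root of the mean of the squared values. A minimal standard-library sketch follows; the function name `rms_sketch` and the choice to reject empty input are assumptions.

```python
import math


def rms_sketch(array):
    """Hypothetical sketch: root mean square of a sequence of numbers."""
    values = [float(v) for v in array]
    if not values:
        # Assumption: an empty input is treated as an error.
        raise ValueError("rms of an empty array is undefined")
    return math.sqrt(sum(v * v for v in values) / len(values))


result = rms_sketch([3, 4])  # sqrt((9 + 16) / 2)
```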