datapunt_processing.transform.preprocessing package
Submodules
datapunt_processing.transform.preprocessing.data_selection module
class datapunt_processing.transform.preprocessing.data_selection.DateRange(start, end)
Bases: object
Class representing a range of dates.
Can be used to select rows in a data frame, provided the data frame contains a column ‘date’ that is populated with datetime.date values.
static from_dataframe(df: pandas.core.frame.DataFrame)
Create a date range from a data frame.
Args:
    df (pd.DataFrame): a data frame having a date column populated with dt.datetime values.
Returns:
    DateRange object
length()
Returns the time delta between end and start.
Returns:
    dt.timedelta
select(df)
Apply the date range selection on a data frame.
Args:
    df (pd.DataFrame): input data frame
Returns:
    pd.DataFrame
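The behaviour described for DateRange can be sketched as follows. This is an illustrative stand-in, not the package's implementation: the class name DateRangeSketch is hypothetical, and it assumes that from_dataframe spans the earliest to the latest value in the ‘date’ column and that select treats both bounds as inclusive.

```python
import datetime as dt

import pandas as pd


class DateRangeSketch:
    """Illustrative stand-in for DateRange; not the package's implementation."""

    def __init__(self, start, end):
        self.start = start
        self.end = end

    @staticmethod
    def from_dataframe(df: pd.DataFrame) -> "DateRangeSketch":
        # Assumption: the range spans the earliest to the latest 'date' value.
        return DateRangeSketch(df["date"].min(), df["date"].max())

    def length(self) -> dt.timedelta:
        # Time delta between end and start, as documented.
        return self.end - self.start

    def select(self, df: pd.DataFrame) -> pd.DataFrame:
        # Assumption: both bounds are inclusive.
        mask = (df["date"] >= self.start) & (df["date"] <= self.end)
        return df[mask]


frame = pd.DataFrame({
    "date": [dt.date(2017, 1, 1), dt.date(2017, 1, 15), dt.date(2017, 2, 1)],
    "value": [1, 2, 3],
})
january = DateRangeSketch(dt.date(2017, 1, 1), dt.date(2017, 1, 31))
selected = january.select(frame)          # keeps the two January rows
full_range = DateRangeSketch.from_dataframe(frame)
```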
datapunt_processing.transform.preprocessing.data_selection.remove_nan_targets(df: pandas.core.frame.DataFrame)
Remove NaNs in the target list.
Args:
    df (pd.DataFrame): data frame on which to operate
    run (Run): run parameters
Returns:
    pd.DataFrame
datapunt_processing.transform.preprocessing.data_selection.select_and_report(df, selection, description, max_drop_rate=None)
Select data from a data frame and report how many rows were dropped.
Args:
    df (pd.DataFrame): apply selection on this data frame
    selection (pd.Series): boolean selection
    description (str): description in log message
    max_drop_rate (float, optional): max fraction of dropped rows
Returns:
    pd.DataFrame
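A minimal sketch of the select-and-report pattern is shown below. The function name select_and_report_sketch is hypothetical, and two details are assumptions rather than documented behaviour: the report goes through the standard logging module, and exceeding max_drop_rate raises a ValueError.

```python
import logging

import pandas as pd

logger = logging.getLogger(__name__)


def select_and_report_sketch(df, selection, description, max_drop_rate=None):
    """Keep rows where `selection` is True and log how many rows were dropped."""
    selected = df[selection]
    n_dropped = len(df) - len(selected)
    drop_rate = n_dropped / len(df) if len(df) else 0.0
    logger.info("%s: dropped %d of %d rows (%.1f%%)",
                description, n_dropped, len(df), 100 * drop_rate)
    # Assumption: a drop rate above the limit is treated as an error.
    if max_drop_rate is not None and drop_rate > max_drop_rate:
        raise ValueError(
            f"{description}: drop rate {drop_rate:.2f} exceeds {max_drop_rate}")
    return selected


frame = pd.DataFrame({"target": [1.0, None, 3.0, 4.0]})
kept = select_and_report_sketch(
    frame, frame["target"].notna(), "remove NaN targets", max_drop_rate=0.5)
```

The same call with a boolean mask built from notna() is one plausible way a helper such as remove_nan_targets could be expressed, although the source does not say how it is implemented.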
datapunt_processing.transform.preprocessing.enrichment module
datapunt_processing.transform.preprocessing.enrichment.enrich_datetime(df)
Add datetime features (e.g. year, day of week; also used for ML feature generation).
Args:
    df (pd.DataFrame): schedule or historical data
Returns:
    pd.DataFrame
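The kind of enrichment described above could look roughly like the sketch below. The source does not list the generated columns or name the input column, so the column ‘date’ and the feature names year, month, dayofweek and dayofyear are assumptions, as is the function name enrich_datetime_sketch.

```python
import pandas as pd


def enrich_datetime_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with calendar features derived from 'date'."""
    out = df.copy()
    # Assumption: timestamps live in a column named 'date'.
    stamps = pd.to_datetime(out["date"])
    out["year"] = stamps.dt.year
    out["month"] = stamps.dt.month
    out["dayofweek"] = stamps.dt.dayofweek  # Monday is 0, Sunday is 6
    out["dayofyear"] = stamps.dt.dayofyear
    return out


enriched = enrich_datetime_sketch(
    pd.DataFrame({"date": ["2017-03-01", "2016-12-31"]}))
```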
datapunt_processing.transform.preprocessing.ml_helperfunctions module
datapunt_processing.transform.preprocessing.ml_helperfunctions.accuracy_score(y_true, y_pred)
Compare y_true to y_pred and return the accuracy.
datapunt_processing.transform.preprocessing.ml_helperfunctions.calculate_covariance_matrix(X, Y=None)
Calculate the covariance matrix for the dataset X.
datapunt_processing.transform.preprocessing.ml_helperfunctions.calculate_std_dev(X)
Calculate the standard deviations of the features in dataset X.
datapunt_processing.transform.preprocessing.ml_helperfunctions.calculate_variance(X)
Return the variance of the features in dataset X.
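The three statistics helpers above can be sketched with NumPy as follows. The function names carry a _sketch suffix to mark them as hypothetical, rows are assumed to be samples and columns features, and the normalisation is an assumption: the variance divides by n, the covariance by n - 1. The package may use different conventions.

```python
import numpy as np


def calculate_variance_sketch(X):
    # Per-feature population variance (divide by n); the divisor is an assumption.
    X = np.asarray(X, dtype=float)
    return np.mean((X - X.mean(axis=0)) ** 2, axis=0)


def calculate_std_dev_sketch(X):
    # Standard deviation is the square root of the variance.
    return np.sqrt(calculate_variance_sketch(X))


def calculate_covariance_matrix_sketch(X, Y=None):
    # With Y omitted, the covariance of X with itself is returned.
    X = np.asarray(X, dtype=float)
    Y = X if Y is None else np.asarray(Y, dtype=float)
    n_samples = X.shape[0]
    # Sample covariance (divide by n - 1); also an assumption.
    return (X - X.mean(axis=0)).T @ (Y - Y.mean(axis=0)) / (n_samples - 1)


X = np.array([[1.0, 2.0], [3.0, 6.0], [5.0, 10.0]])
variance = calculate_variance_sketch(X)
std_dev = calculate_std_dev_sketch(X)
covariance = calculate_covariance_matrix_sketch(X)
```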
datapunt_processing.transform.preprocessing.ml_helperfunctions.euclidean_distance(x1, x2)
Calculates the L2 distance between two vectors.
datapunt_processing.transform.preprocessing.ml_helperfunctions.mean_squared_error(y_true, y_pred)
Returns the mean squared error between y_true and y_pred.
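For reference, the standard definitions behind accuracy_score, euclidean_distance and mean_squared_error can be written in a few lines of NumPy. The _sketch names are hypothetical and the code illustrates the formulas only; it is not taken from the package.

```python
import numpy as np


def accuracy_score_sketch(y_true, y_pred):
    # Fraction of positions where the prediction equals the true label.
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))


def euclidean_distance_sketch(x1, x2):
    # L2 distance: square root of the summed squared differences.
    diff = np.asarray(x1, dtype=float) - np.asarray(x2, dtype=float)
    return float(np.sqrt(np.sum(diff ** 2)))


def mean_squared_error_sketch(y_true, y_pred):
    # Mean of the squared differences between truth and prediction.
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(diff ** 2))


accuracy = accuracy_score_sketch([1, 0, 1, 1], [1, 1, 1, 0])
distance = euclidean_distance_sketch([0, 0], [3, 4])
mse = mean_squared_error_sketch([1, 2, 3], [1, 2, 5])
```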
datapunt_processing.transform.preprocessing.ml_preprocessing module
datapunt_processing.transform.preprocessing.utilities module
class datapunt_processing.transform.preprocessing.utilities.BigFile(f)
Bases: object
Wrapper for pickling big files. See the pickle_big_dump and pickle_big_load functions.
read(n)
Read n bytes. Reads a big file in batches of approximately 1 GB.
Args:
    n (int): number of bytes
Returns:
    bytearray: buffer
write(buffer)
Write buffer. Writes in batches of approximately 1 GB.
Args:
    buffer (bytearray): buffer
datapunt_processing.transform.preprocessing.utilities.assert_unique(series)
Assert that all values are unique. Raises a value error if they are not. Empty series are ignored.
Args:
    series: input
Returns:
    None
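A uniqueness check of this kind can be sketched with pandas as below. The name assert_unique_sketch is hypothetical, and the sketch assumes that "value error" refers to Python's built-in ValueError.

```python
import pandas as pd


def assert_unique_sketch(series: pd.Series) -> None:
    """Raise ValueError if the series contains duplicate values."""
    if series.empty:
        return  # empty series are ignored, as documented
    if not series.is_unique:
        duplicates = series[series.duplicated()].unique().tolist()
        raise ValueError(f"Values are not unique; duplicated: {duplicates}")


assert_unique_sketch(pd.Series([1, 2, 3]))        # passes silently
assert_unique_sketch(pd.Series([], dtype=float))  # empty input is ignored
```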
datapunt_processing.transform.preprocessing.utilities.calc_error(df)
Parameters:
    df (DataFrame): predictions data frame.
Returns:
    Series: error (truth - prediction)
datapunt_processing.transform.preprocessing.utilities.cols_not_in(columns, df: pandas.core.frame.DataFrame)
datapunt_processing.transform.preprocessing.utilities.get_last_full_year(df)
Find the last full year in a data frame. For example, it will return 2016 if you are in 2017.
Args:
    df (pd.DataFrame): input data, has column dayofyear
Returns:
    int: last full year
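One way such a lookup might work is sketched below. The source only states that the input has a dayofyear column, so the rest is assumed: a companion column named year, and the rule that a year counts as full once its data reaches day 365. The name get_last_full_year_sketch is hypothetical.

```python
import pandas as pd


def get_last_full_year_sketch(df: pd.DataFrame) -> int:
    """Return the most recent year whose data runs to the end of the year."""
    # Assumption: df has 'year' and 'dayofyear' columns.
    last_day = df.groupby("year")["dayofyear"].max()
    # Assumption: a year is "full" when it contains day 365 or later.
    full_years = last_day[last_day >= 365]
    if full_years.empty:
        raise ValueError("no full year present in the data")
    return int(full_years.index.max())


frame = pd.DataFrame({
    "year": [2016, 2016, 2017, 2017],
    "dayofyear": [1, 366, 1, 120],
})
last_full = get_last_full_year_sketch(frame)
```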
datapunt_processing.transform.preprocessing.utilities.get_script_dir()
datapunt_processing.transform.preprocessing.utilities.is_numeric(column)
Parameters:
    column (Series): a column.
Returns:
    bool: True when the column has a numeric type.
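In pandas, a check like this is commonly written with pandas.api.types.is_numeric_dtype. The sketch below uses that function; whether the package does the same is not stated in the source, and the name is_numeric_sketch is hypothetical.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def is_numeric_sketch(column: pd.Series) -> bool:
    # Delegates the dtype inspection to pandas.
    return bool(is_numeric_dtype(column))


numbers_are_numeric = is_numeric_sketch(pd.Series([1, 2, 3]))
floats_are_numeric = is_numeric_sketch(pd.Series([0.5, 1.5]))
strings_are_numeric = is_numeric_sketch(pd.Series(["a", "b"]))
```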
datapunt_processing.transform.preprocessing.utilities.merge_and_report(left, right, on: list, description='', n_unmatched_limit=None) → pandas.core.frame.DataFrame
Performs a merge between two data frames and reports statistics on the matches. Only left merges are supported for now. If a column is present in both the left and right data frame, the left column has priority and the right column is ignored.
Args:
    left (pd.DataFrame): lhs
    right (pd.DataFrame): rhs
    on (list[str]): columns to match on.
    description (str or None): description of what merge is done
    n_unmatched_limit (int): throw an error when the number of rows not found in the left side is larger than this number
Returns:
    pd.DataFrame
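The merge behaviour described above can be approximated with DataFrame.merge and its indicator option, as in the following sketch. The name merge_and_report_sketch is hypothetical. It assumes that "unmatched" means left rows with no partner in the right frame, that the report is a log message, and that exceeding n_unmatched_limit raises a ValueError.

```python
import logging

import pandas as pd

logger = logging.getLogger(__name__)


def merge_and_report_sketch(left, right, on, description="", n_unmatched_limit=None):
    """Left-merge two frames and log how many left rows found no match."""
    # Left columns have priority: drop overlapping non-key columns from right.
    overlap = [c for c in right.columns if c in left.columns and c not in on]
    merged = left.merge(right.drop(columns=overlap), on=on, how="left",
                        indicator=True)
    # Assumption: "unmatched" counts left rows without a partner in right.
    n_unmatched = int((merged["_merge"] == "left_only").sum())
    logger.info("%s: %d of %d rows unmatched", description, n_unmatched, len(merged))
    if n_unmatched_limit is not None and n_unmatched > n_unmatched_limit:
        raise ValueError(
            f"{description}: {n_unmatched} unmatched rows exceed limit {n_unmatched_limit}")
    return merged.drop(columns="_merge")


left = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})
right = pd.DataFrame({"id": [1, 2], "name": ["x", "y"], "score": [10, 20]})
merged = merge_and_report_sketch(left, right, on=["id"], description="add scores")
```

In this example the right frame's name column is dropped before merging, so the result keeps the left values a, b and c, and the row with id 3 has a missing score.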
datapunt_processing.transform.preprocessing.utilities.optional_make_dir(path)
Creates a directory at the given path if it does not exist yet.
Args:
    path (str): path to create directory in
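With the standard library, creating a directory only when it is missing is a one-liner. The sketch below uses os.makedirs with exist_ok=True; the name optional_make_dir_sketch is hypothetical, and creating intermediate directories is an assumption about the intended behaviour.

```python
import os
import tempfile


def optional_make_dir_sketch(path: str) -> None:
    # exist_ok=True makes the call a no-op when the directory already exists.
    os.makedirs(path, exist_ok=True)


base = tempfile.mkdtemp()
target = os.path.join(base, "output", "models")
optional_make_dir_sketch(target)
optional_make_dir_sketch(target)  # calling it again does not raise
```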
datapunt_processing.transform.preprocessing.utilities.pickle_big_dump(obj, file_path)
datapunt_processing.transform.preprocessing.utilities.pickle_big_load(file_path)
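The BigFile wrapper and the two pickle helpers appear to follow a common pattern: wrap the file object so that every read and write is split into chunks below a size limit, then hand the wrapper to pickle. The sketch below illustrates that pattern under stated assumptions. The names ending in Sketch or _sketch are hypothetical, the chunk_size parameter is an addition that keeps the example testable with small files, and the 1 GB default is taken from the documented batch size.

```python
import os
import pickle
import tempfile

CHUNK_SIZE = 1 << 30  # about 1 GB per underlying read or write call


class BigFileSketch:
    """File wrapper that splits large reads and writes into chunks."""

    def __init__(self, f, chunk_size=CHUNK_SIZE):
        self.f = f
        self.chunk_size = chunk_size

    def __getattr__(self, name):
        # Delegate everything else (readline, flush, ...) to the wrapped file.
        return getattr(self.f, name)

    def read(self, n):
        buffer = bytearray(n)
        idx = 0
        while idx < n:
            chunk = self.f.read(min(n - idx, self.chunk_size))
            if not chunk:
                break  # reached end of file early
            buffer[idx:idx + len(chunk)] = chunk
            idx += len(chunk)
        del buffer[idx:]  # trim if fewer than n bytes were available
        return buffer

    def write(self, buffer):
        view = memoryview(buffer)
        for idx in range(0, len(view), self.chunk_size):
            self.f.write(view[idx:idx + self.chunk_size])


def pickle_big_dump_sketch(obj, file_path, chunk_size=CHUNK_SIZE):
    with open(file_path, "wb") as f:
        pickle.dump(obj, BigFileSketch(f, chunk_size),
                    protocol=pickle.HIGHEST_PROTOCOL)


def pickle_big_load_sketch(file_path, chunk_size=CHUNK_SIZE):
    with open(file_path, "rb") as f:
        return pickle.load(BigFileSketch(f, chunk_size))


payload = {"values": list(range(1000))}
path = os.path.join(tempfile.mkdtemp(), "payload.pkl")
# A tiny chunk size forces the chunking loops to run even for a small file.
pickle_big_dump_sketch(payload, path, chunk_size=16)
restored = pickle_big_load_sketch(path, chunk_size=16)
```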
datapunt_processing.transform.preprocessing.utilities.rms(array)
Calculate the root mean square of an array.
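The root mean square is the square root of the mean of the squared values. A NumPy sketch of that definition follows; the name rms_sketch is hypothetical and the code illustrates the formula rather than the package's implementation.

```python
import numpy as np


def rms_sketch(array):
    # Square each value, average the squares, then take the square root.
    values = np.asarray(array, dtype=float)
    return float(np.sqrt(np.mean(values ** 2)))


rms_of_signs = rms_sketch([1, -1, 1, -1])
rms_of_pair = rms_sketch([3, 4])
```

Applied to an error series such as truth minus prediction, this yields the root mean square error.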