wtphm.pred_processing

This module contains functions for processing SCADA data before it is used for fault detection or prognostics. Read more in the “Labelling the SCADA data” section of the User Guide.

wtphm.pred_processing.label_stoppages(scada_data, fault_batches, drop_fault_batches=True, label_pre_stop=True, pre_stop_lims=['90 minutes', 0], oth_batches_to_drop=None, drop_type=None)

Label the times in the SCADA data that fall during a stoppage, and those leading up to one.

This adds a “stoppage” column to the passed scada_data, and optionally a “pre_stop” column. “stoppage” is given a 1 if the scada point in question occurs during a stoppage, and “pre_stop” is given a 1 in the samples leading up to the stoppage. Both are 0 otherwise. Exactly which samples are labelled, and which are dropped, depends on the parameters described below. It also adds a “batch_id” column. For entries with a “pre_stop” or “stoppage” value of 1, “batch_id” identifies the batch that gave the entry that label.

Parameters:
  • scada_data (pandas.DataFrame) – Full set of SCADA data for the turbine.
  • fault_batches (pandas.DataFrame) – The dataframe of batches of fault events, a subset of the output of wtphm.batch.get_batch_data().
  • drop_fault_batches (bool, default=True) – Whether to drop the scada entries that fall within the stoppage periods covered by fault_batches, i.e. the fault data itself, not the pre-fault data. This is highly recommended: otherwise the stoppages themselves are kept in the returned data, with their “stoppage” column labelled 1 and the fault-free data labelled 0.
  • label_pre_stop (bool; default=True) – If True, add a “pre_stop” column to the returned scada_data_l. Samples in the time leading up to a stoppage are given label 1, and 0 otherwise.
  • pre_stop_lims (2*1 list of pd.Timedelta-compatible strings, default=[‘90 minutes’, 0]) – The window of time before a stoppage in which scada samples are labelled as “pre_stop”. E.g., by default, “pre_stop” is labelled as 1 in the time between 90 minutes and 0 minutes before the stoppage occurs. If [‘120 mins’, ‘20 mins’] is passed, scada samples from 120 minutes before until 20 minutes before the stoppage are given the “pre_stop” label 1.
  • oth_batches_to_drop (pd.DataFrame, optional; default=None) – Additional batches to drop from the scada data, independently of whether the fault_batches are dropped via drop_fault_batches. If this is passed, drop_type must also be given as a string.
  • drop_type (str, optional; default=None) – Only used when oth_batches_to_drop has been passed. If ‘both’, the stoppage and pre-stop entries (according to pre_stop_lims) corresponding to batches in oth_batches_to_drop are dropped from the scada data. If ‘stop’, only the stoppage entries are dropped. If ‘pre’, only the pre-stop entries are dropped.
Returns:

scada_data_l (pd.DataFrame) – The original scada_data dataframe with the “pre_stop”, “stoppage” and “batch_id” columns added.
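
The labelling rules described above can be illustrated with a small, self-contained pandas sketch. This is not the wtphm implementation: the function name label_stoppages_sketch, the column names “time”, “start” and “end”, the use of -1 as the “no batch” value for “batch_id”, and the exact treatment of window boundaries are all assumptions made for the example. It only shows how pre_stop_lims and drop_fault_batches might interact.

```python
import pandas as pd


def label_stoppages_sketch(scada, batches, pre_stop_lims=("90 minutes", 0),
                           drop_fault_batches=True):
    """Illustrative re-creation of the labelling rules; NOT the wtphm code.

    Assumed (hypothetical) columns: scada["time"], batches["start"] and
    batches["end"]. The batch id is taken from the index of `batches`.
    """
    out = scada.copy()
    out["stoppage"] = 0
    out["pre_stop"] = 0
    out["batch_id"] = -1  # assumption: -1 marks "no batch"

    far = pd.Timedelta(pre_stop_lims[0])   # window opens this long before the stop
    near = pd.Timedelta(pre_stop_lims[1])  # window closes this long before the stop

    for batch_id, batch in batches.iterrows():
        # Samples inside the stoppage itself (both ends inclusive).
        in_stop = out["time"].between(batch["start"], batch["end"])
        # Samples inside the pre-stop window, just before the stoppage.
        in_pre = ((out["time"] >= batch["start"] - far)
                  & (out["time"] < batch["start"] - near))
        out.loc[in_stop, "stoppage"] = 1
        out.loc[in_pre, "pre_stop"] = 1
        out.loc[in_stop | in_pre, "batch_id"] = batch_id

    if drop_fault_batches:
        # Keep the pre-stop samples, drop the stoppage samples themselves.
        out = out[out["stoppage"] == 0]
    return out


# Toy data: 10-minute samples over four hours, one stoppage from 02:00 to 02:30.
scada = pd.DataFrame(
    {"time": pd.date_range("2020-01-01 00:00", periods=25, freq="10min")})
batches = pd.DataFrame({"start": [pd.Timestamp("2020-01-01 02:00")],
                        "end": [pd.Timestamp("2020-01-01 02:30")]})

kept = label_stoppages_sketch(scada, batches, drop_fault_batches=False)
dropped = label_stoppages_sketch(scada, batches)
```

In this toy case, kept retains all 25 samples with the stoppage rows flagged, while dropped contains only fault-free and pre-stop rows, which is the behaviour the drop_fault_batches description recommends for building a training set.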

wtphm.pred_processing.get_lagged_features(X, y, features_to_lag_inds, steps)

Returns an array in which selected columns of X are also included as lagged features, for use in classification.

Parameters:
  • X (m*n np.ndarray) – The input features, with m samples and n features
  • y (m*1 np.ndarray) – The m target values
  • features_to_lag_inds (np.array) – The indices of the columns in X which will be lagged
  • steps (int) – The number of lagging steps. This means for feature ‘B’ at time T, features will be added to X at T for B@(T-1), B@(T-2)…B@(T-steps).
Returns:

  • X_lagged (np.ndarray) – An array with the original features and the lagged features appended. The number of samples is necessarily smaller than in X, because the first few samples would have NA values for their lagged features.
  • y_lagged (np.ndarray) – An updated array of target values corresponding to the new number of samples in X_lagged.
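
The lagging scheme described above can be sketched in plain NumPy. This is an illustration only, not the wtphm implementation: the name lagged_features_sketch is hypothetical, and the column order of the result (original features first, then lag 1, lag 2, and so on) and the choice to drop the first `steps` samples rather than pad them are assumptions.

```python
import numpy as np


def lagged_features_sketch(X, y, features_to_lag_inds, steps):
    """Append lagged copies of selected columns; NOT the wtphm code."""
    m = X.shape[0]
    # The first `steps` rows have no complete history, so they are dropped.
    blocks = [X[steps:]]
    for lag in range(1, steps + 1):
        # Rows shifted back by `lag`, restricted to the columns being lagged.
        blocks.append(X[steps - lag:m - lag][:, features_to_lag_inds])
    X_lagged = np.hstack(blocks)
    y_lagged = y[steps:]
    return X_lagged, y_lagged


# Five samples, two features; lag the second feature (index 1) by two steps.
X = np.arange(10).reshape(5, 2)
y = np.arange(5)
X_lagged, y_lagged = lagged_features_sketch(X, y, np.array([1]), steps=2)
# X_lagged:
# [[4 5 3 1]
#  [6 7 5 3]
#  [8 9 7 5]]
# y_lagged: [2 3 4]
```

Each output row at time T holds the original features at T followed by the lagged feature at T-1 and T-2, and two samples are lost from the start of both X and y.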