wtphm.clustering.batch_clustering

This module clusters similar batches of turbine events together.

It contains functions for extracting clustering-related features from the batches, as well as functions that produce silhouette plots for evaluating the resulting clusters.

This code was used in the following paper:

Leahy, Kevin, et al. “Cluster analysis of wind turbine alarms for characterising and classifying stoppages.” IET Renewable Power Generation 12.10 (2018): 1146-1154.

wtphm.clustering.batch_clustering.get_batch_features(event_data, fault_codes, batch_data, method, lo=1, hi=10, num=1, event_type='fault_events')

Extract features from batches of events which appear during stoppages, to be used for clustering.

Only features from batches that comply with certain constraints are included. These constraints are chosen depending on which feature extraction method is used. Details of the feature extraction methods can be found in [1].

Note: For each “batch” of alarms, there are up to num_codes unique alarm codes. Each alarm has an associated start time, time_on.

Parameters:
  • event_data (pandas.DataFrame) – The original events/fault data. May be grouped (see wtphm.batch_clustering.get_grouped_events_data).

  • fault_codes (numpy.ndarray) – All event codes that will be treated as fault events for the batches

  • batch_data (pandas.DataFrame) – The dataframe holding the indices in event_data and start and end times for each batch

  • method (string) – One of ‘basic’, ‘t_on’, ‘time’.

    basic:
    • Only considers batches with between lo and hi individual alarms.
    • An array of zeros is filled with num at positions corresponding to the order in which the alarms appeared.
    • Does not take into account whether alarms occurred simultaneously.
    • The resulting vector has length num_codes * hi.
    t_on:
    • Only considers batches with between lo and hi individual time_ons.
    • For each time_on in each batch, an array of zeros is filled with ones in the places corresponding to the alarms that fired at that time.
    • Results in a pattern array of length num_codes * hi which shows the sequential order in which the alarms fired.
    time:
    • Same as t_on, but with extra features added showing the amount of time between each time_on.
  • lo (integer, default=1) – For method='basic', only batches with a minimum of lo alarms will be included in the returned feature set. For method='t_on' or method='time', it is the minimum number of time_ons.

  • hi (integer, default=10) – For method='basic', only batches with a maximum of hi alarms will be included in the returned feature set. For method='t_on' or method='time', it is the maximum number of time_ons.

  • num (integer, float, default=1) – The number to be placed in the feature vector to indicate the presence of a particular alarm

  • event_type (string, default='fault_events') – The members of batch_data to include for building the feature set. Should normally be ‘fault_events’ or ‘all_events’.

Returns:

  • feature_array (numpy.ndarray) – An array of feature arrays, one for each batch that has met the hi and lo criteria.
  • assoc_batch (numpy.ndarray) – An array of 2-length index arrays. It is the same length as feature_array, and each entry points to the corresponding feature_array’s index in batch_data, which in turn contains the index of the feature_array’s associated events in the original events/fault data (event_data).

References

[1] Leahy, Kevin, et al. “Cluster analysis of wind turbine alarms for characterising and classifying stoppages.” IET Renewable Power Generation 12.10 (2018): 1146-1154.
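
Example

The ‘basic’ and ‘t_on’ encodings described above can be illustrated with a small pure-Python sketch. This is not the wtphm implementation: the helper names basic_features and t_on_features, the (time_on, code) tuple layout, and the slot-by-slot arrangement of the vector are assumptions made for illustration, based only on the descriptions in the Parameters section.

```python
def basic_features(codes_in_order, fault_codes, lo=1, hi=10, num=1):
    """Encode the order in which alarms appeared (hypothetical helper).

    Returns None when the batch falls outside the lo..hi alarm count.
    """
    if not lo <= len(codes_in_order) <= hi:
        return None
    n = len(fault_codes)  # plays the role of num_codes
    col = {code: i for i, code in enumerate(fault_codes)}
    vec = [0] * (n * hi)
    for slot, code in enumerate(codes_in_order):
        # slot k of the vector holds the k-th alarm to appear
        vec[slot * n + col[code]] = num
    return vec


def t_on_features(alarms, fault_codes, lo=1, hi=10):
    """Encode which alarms fired at each distinct time_on (hypothetical helper).

    alarms is a list of (time_on, code) pairs.
    """
    time_ons = sorted({t for t, _ in alarms})
    if not lo <= len(time_ons) <= hi:
        return None
    n = len(fault_codes)
    col = {code: i for i, code in enumerate(fault_codes)}
    vec = [0] * (n * hi)
    for slot, t in enumerate(time_ons):
        for t_alarm, code in alarms:
            if t_alarm == t:
                # alarms sharing a time_on share a slot
                vec[slot * n + col[code]] = 1
    return vec


fault_codes = [10, 20, 30]
batch = [(0, 20), (0, 30), (5, 10)]  # codes 20 and 30 fire together at t=0

basic = basic_features([code for _, code in batch], fault_codes, hi=3)
t_on = t_on_features(batch, fault_codes, hi=3)
print(basic)  # [0, 1, 0, 0, 0, 1, 1, 0, 0]
print(t_on)   # [0, 1, 1, 1, 0, 0, 0, 0, 0]
```

In this sketch the ‘basic’ vector uses three slots because there are three alarms, while the ‘t_on’ vector uses only two because two alarms share a time_on.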

wtphm.clustering.batch_clustering.sil_1_cluster(X, cluster_labels, axis_label=True, save=False, save_name=None, x_label='Silhouette coefficient values', avg_pos=0.02, w=2.3, h=2.4)

Plot the silhouette scores for a single clustering result, display the plot, and optionally save it.

Parameters:
  • X (np.array or list-like) – Features (possibly the feature_array returned by get_batch_features; this is unconfirmed)
  • cluster_labels (list of strings) – The labels of each cluster
  • axis_label (Boolean, default=True) – Whether or not to label the cluster plot with each cluster’s number
  • save (Boolean, default=False) – Whether or not to save the resulting silhouette plot
  • save_name (String) – The saved filename
  • x_label (String) – The x axis label for the plot
  • avg_pos (float) – Where to position the text for the average silhouette score relative to the position of the “average” line
  • w (float or int) – Width of plot
  • h (float or int) – Height of plot
Returns:

fig (matplotlib figure object) – The silhouette analysis
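
Example

For readers unfamiliar with the quantity being plotted, the sketch below computes the standard per-sample silhouette coefficient, s = (b - a) / max(a, b), where a is the mean distance to the other members of the sample's own cluster and b is the mean distance to the nearest other cluster. The helper name silhouette_per_sample and the toy data are assumptions for illustration. The source does not state how sil_1_cluster computes its scores internally.

```python
from math import dist  # Euclidean distance, Python 3.8+


def silhouette_per_sample(X, labels):
    """Return the silhouette coefficient of every sample (illustrative helper)."""
    scores = []
    for i, (xi, li) in enumerate(zip(X, labels)):
        # distances from sample i to every other sample, grouped by cluster
        by_cluster = {}
        for j, (xj, lj) in enumerate(zip(X, labels)):
            if i != j:
                by_cluster.setdefault(lj, []).append(dist(xi, xj))
        own = by_cluster.pop(li, [])
        if not own or not by_cluster:
            scores.append(0.0)  # singleton cluster or single cluster: 0 by convention
            continue
        a = sum(own) / len(own)
        b = min(sum(d) / len(d) for d in by_cluster.values())
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return scores


X = [[0, 0], [0, 1], [10, 0], [10, 1]]
good = silhouette_per_sample(X, [0, 0, 1, 1])  # tight, well-separated clusters
bad = silhouette_per_sample(X, [0, 1, 0, 1])   # each cluster spans both groups

print(sum(good) / len(good))  # close to 1
print(sum(bad) / len(bad))    # negative
```

Scores near 1 indicate samples that sit well inside their cluster, and negative scores indicate samples that are closer to another cluster than to their own.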

wtphm.clustering.batch_clustering.sil_n_clusters(X, range_n_clusters, clust)

Compare silhouette scores across different numbers of clusters for AgglomerativeClustering, KMeans or similar

Parameters:
  • X (np.array or list-like) – Features (possibly the feature_array returned by get_batch_features; this is unconfirmed)
  • range_n_clusters (list-like) – The range of clusters you want, e.g. [2,3,4,5,10,20]
  • clust (sklearn clusterer) – The sklearn clusterer to use, e.g. KMeans
Returns:

  • cluster_labels (numpy.ndarray) – The labels for the clusters, with each one corresponding to a feature vector in X.
  • Also prints the silhouette analysis
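
Example

The comparison this function performs can be approximated directly with scikit-learn. The sketch below is an assumption about the general idea, not the body of sil_n_clusters: it fits KMeans for several cluster counts on synthetic data and keeps the mean silhouette score of each. The synthetic blobs and the variable names are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for a feature array: three well-separated blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.1, size=(20, 2)) for c in (0, 5, 10)])

scores = {}
for k in [2, 3, 4, 5]:
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    # mean silhouette coefficient over all samples for this cluster count
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(scores)
print(best_k)
```

The cluster count with the highest mean silhouette score is a reasonable starting point, though the per-cluster silhouette plot is usually worth inspecting as well.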

wtphm.clustering.batch_clustering.cluster_times(batch_data, cluster_labels, assoc_batch, event_dur_type='down_dur')

Returns a DataFrame with a summary of the size and durations of batch members

Parameters:
  • batch_data (pandas.DataFrame) – The dataframe holding the indices in event_data and start and end times for each batch
  • cluster_labels (numpy.ndarray) – The labels for the clusters, with each one corresponding to a feature vector in assoc_batch
  • assoc_batch (numpy.ndarray) – Indices of batches associated with each feature_array. Obtained from get_batch_features
  • event_dur_type (string) – The event group duration in batch_data to return, i.e. either ‘fault_dur’ or ‘down_dur’. ‘down_dur’ means the entire time the turbine was offline, ‘fault_dur’ just means while the turbine was faulting. See wtphm.batch.Batches.get_batch_data for details
Returns:

summary (pandas.DataFrame) – The DataFrame has the total duration, mean duration, standard deviation of the duration and number of stoppages in each cluster.
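
Example

A summary of this shape can be produced with a pandas groupby. The sketch below is illustrative only: the clustered frame, its column names and the durations (given here in hours as plain floats) are hypothetical, and the real function derives its input from batch_data, cluster_labels and assoc_batch.

```python
import pandas as pd

# Hypothetical per-batch data: one row per clustered batch
clustered = pd.DataFrame({
    "cluster": [0, 0, 1, 1, 1],
    "down_dur_hours": [0.5, 1.5, 1.0, 2.0, 3.0],
})

# Total, mean and standard deviation of duration, plus stoppage count, per cluster
summary = clustered.groupby("cluster")["down_dur_hours"].agg(
    total="sum", mean="mean", std="std", count="count"
)
print(summary)
```

Here cluster 1 contains three stoppages totalling 6.0 hours, with a mean of 2.0 hours and a standard deviation of 1.0 hour.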