wtphm.clustering.batch_clustering
This module deals with clustering similar batches of turbine events together.
It contains functions for extracting clustering-related features from the batches, as well as functions for producing silhouette plots to evaluate the resulting clusters.
This code was used in the following paper:
Leahy, Kevin, et al. “Cluster analysis of wind turbine alarms for characterising and classifying stoppages.” IET Renewable Power Generation 12.10 (2018): 1146-1154.
wtphm.clustering.batch_clustering.get_batch_features(event_data, fault_codes, batch_data, method, lo=1, hi=10, num=1, event_type='fault_events')
Extract features from batches of events which appear during stoppages, to be used for clustering.
Only features from batches that comply with certain constraints are included. These constraints are chosen depending on which feature extraction method is used. Details of the feature extraction methods can be found in [1].
Note: For each “batch” of alarms, there are up to num_codes unique alarm codes. Each alarm has an associated start time, time_on.
Parameters:
- event_data (pandas.DataFrame) – The original events/fault data. May be grouped (see wtphm.batch_clustering.get_grouped_events_data).
- fault_codes (numpy.ndarray) – All event codes that will be treated as fault events for the batches.
- batch_data (pandas.DataFrame) – The dataframe holding the indices in event_data and start and end times for each batch.
- method (string) – One of ‘basic’, ‘t_on’, ‘time’.
  - basic:
    - Only considers batches with between lo and hi individual alarms.
    - Array of zeros is filled with num corresponding to order of alarms’ appearance.
    - Does not take into account whether alarms occurred simultaneously.
    - Resultant vector of length num_codes * hi.
  - t_on:
    - Only considers batches with between lo and hi individual time_ons.
    - For each time_on in each batch, an array of zeros is filled with ones in places corresponding to an alarm that has fired at that time.
    - Results in a pattern array of length num_codes * hi which shows the sequential order of the alarms which have been fired.
  - time:
    - Same as above, but extra features are added showing the amount of time between each time_on.
- lo (integer, default=1) – For method='basic', only batches with a minimum of lo alarms will be included in the returned feature set. For method='t_on' or method='time', it’s the minimum number of time_ons.
- hi (integer, default=10) – For method='basic', only batches with a maximum of hi alarms will be included in the returned feature set. For method='t_on' or method='time', it’s the maximum number of time_ons.
- num (integer, float, default=1) – The number to be placed in the feature vector to indicate the presence of a particular alarm.
- event_type (string, default=’fault_events’) – The members of batch_data to include for building the feature set. Should normally be ‘fault_events’ or ‘all_events’.
Returns:
- feature_array (numpy.ndarray) – An array of feature arrays corresponding to each batch that has met the hi and lo criteria.
- assoc_batch (numpy.ndarray) – An array of 2-length index arrays. It is the same length as feature_array, and each entry points to the corresponding feature_array’s index in batch_data, which in turn contains the index of the feature_array’s associated events in the original events_data or fault_data.
References
[1] Leahy, Kevin, et al. “Cluster analysis of wind turbine alarms for characterising and classifying stoppages.” IET Renewable Power Generation 12.10 (2018): 1146-1154.
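As a rough illustration of the ‘basic’ and ‘t_on’ encodings described for get_batch_features, the sketch below builds both vectors for a single batch using plain NumPy. It reflects one reading of the description above rather than wtphm's actual implementation: the helper names basic_features and t_on_features, the slot layout (position * num_codes + code index), and the toy data are all assumptions made for illustration.

```python
import numpy as np


def basic_features(codes, fault_codes, lo=1, hi=10, num=1):
    """Hypothetical helper: one slot of num_codes entries per alarm, in order."""
    if not lo <= len(codes) <= hi:
        return None  # batch falls outside the lo/hi constraint
    fault_codes = list(fault_codes)
    num_codes = len(fault_codes)
    features = np.zeros(num_codes * hi)
    for position, code in enumerate(codes):
        # Mark the code that fired at this position in the sequence
        features[position * num_codes + fault_codes.index(code)] = num
    return features


def t_on_features(codes, times_on, fault_codes, lo=1, hi=10):
    """Hypothetical helper: one slot per unique time_on; simultaneous alarms share it."""
    unique_times = sorted(set(times_on))
    if not lo <= len(unique_times) <= hi:
        return None  # too few or too many distinct time_ons
    fault_codes = list(fault_codes)
    num_codes = len(fault_codes)
    features = np.zeros(num_codes * hi)
    for slot, t in enumerate(unique_times):
        for code, t_on in zip(codes, times_on):
            if t_on == t:
                features[slot * num_codes + fault_codes.index(code)] = 1
    return features


# Toy batch (made-up values): three alarms, the first two firing at the same time
fault_codes = [10, 20, 30]
codes = [20, 10, 30]
times_on = [0, 0, 5]

basic = basic_features(codes, fault_codes, hi=3)
t_on = t_on_features(codes, times_on, fault_codes, hi=3)
print(basic)
print(t_on)
```

In this toy batch the first two alarms share a time_on, so the ‘t_on’ vector places them in the same slot, while the ‘basic’ vector gives each alarm its own slot regardless of timing.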
wtphm.clustering.batch_clustering.sil_1_cluster(X, cluster_labels, axis_label=True, save=False, save_name=None, x_label='Silhouette coefficient values', avg_pos=0.02, w=2.3, h=2.4)
Show the silhouette scores for clusterer, print the plot, and optionally save it.
Parameters:
- X (np.array or list-like) – Features (possibly feature_array - need to check!)
- cluster_labels (list of strings) – the labels of each cluster
- axis_label (Boolean, default=True) – Whether or not to label the cluster plot with each cluster’s number
- save (Boolean, default=False) – Whether or not to save the resulting silhouette plot
- save_name (String) – The saved filename
- x_label (String) – The x axis label for the plot
- avg_pos (float) – Where to position the text for the average silhouette score relative to the position of the “average” line
- w (float or int) – width of plot
- h (float or int) – height of plot
Returns: fig (matplotlib figure object) – The silhouette analysis
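The quantities that a silhouette plot for a single clustering visualises can be computed directly with scikit-learn. The following is a minimal sketch, not the body of sil_1_cluster: it uses silhouette_samples and silhouette_score from sklearn.metrics on made-up toy data, and prints per-cluster summaries instead of drawing a figure.

```python
import numpy as np
from sklearn.metrics import silhouette_samples, silhouette_score

# Toy feature vectors: two well-separated groups (illustrative data only)
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
cluster_labels = np.array([0, 0, 0, 1, 1, 1])

# One silhouette coefficient per sample, plus the overall average
sample_scores = silhouette_samples(X, cluster_labels)
average_score = silhouette_score(X, cluster_labels)

# A silhouette plot draws each cluster's sorted coefficients as a band
for label in np.unique(cluster_labels):
    member_scores = np.sort(sample_scores[cluster_labels == label])
    print(f"cluster {label}: n={member_scores.size}, "
          f"mean={member_scores.mean():.3f}")
print(f"average silhouette score: {average_score:.3f}")
```

The average score corresponds to the “average” line on the plot, and the sorted per-sample coefficients correspond to the horizontal bands drawn for each cluster.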
wtphm.clustering.batch_clustering.sil_n_clusters(X, range_n_clusters, clust)
Compare silhouette scores across different numbers of clusters for AgglomerativeClustering, KMeans or similar.
Parameters:
- X (np.array or list-like) – Features (possibly feature_array - need to check!)
- range_n_clusters (list-like) – The range of clusters you want, e.g. [2,3,4,5,10,20]
- clust (sklearn clusterer) – the sklearn clusterer to use, e.g. KMeans
Returns:
- cluster_labels (numpy.ndarray) – The labels for the clusters, with each one corresponding to a feature vector in X.
- Also prints the silhouette analysis.
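Comparing silhouette scores across several cluster counts can be sketched with scikit-learn alone. The example below is an assumption-laden illustration rather than the implementation of sil_n_clusters: it fixes the clusterer to KMeans, generates synthetic blobs as stand-in features, and records the average silhouette score for each candidate number of clusters.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in features: three tight, well-separated blobs
rng = np.random.default_rng(0)
centres = np.array([[0, 0], [10, 10], [0, 10]])
X = np.vstack([c + rng.normal(scale=0.3, size=(20, 2)) for c in centres])

range_n_clusters = [2, 3, 4, 5]
scores = {}
for n_clusters in range_n_clusters:
    clusterer = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    labels = clusterer.fit_predict(X)
    # Average silhouette coefficient over all samples for this clustering
    scores[n_clusters] = silhouette_score(X, labels)
    print(f"n_clusters={n_clusters}: average silhouette={scores[n_clusters]:.3f}")

# The candidate with the highest average score is the usual pick
best_n = max(scores, key=scores.get)
print(f"best number of clusters: {best_n}")
```

Any clusterer that accepts n_clusters and provides fit_predict, such as AgglomerativeClustering, could be swapped in for KMeans in the loop.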
wtphm.clustering.batch_clustering.cluster_times(batch_data, cluster_labels, assoc_batch, event_dur_type='down_dur')
Returns a DataFrame with a summary of the size and durations of batch members.
Parameters:
- batch_data (pandas.DataFrame) – The dataframe holding the indices in event_data and start and end times for each batch
- cluster_labels (numpy.ndarray) – The labels for the clusters, with each one corresponding to a feature vector in assoc_batch
- assoc_batch (numpy.ndarray) – Indices of batches associated with each feature_array. Obtained from get_batch_features
- event_dur_type (string) – The event group duration in batch_data to return, i.e. either ‘fault_dur’ or ‘down_dur’. ‘down_dur’ means the entire time the turbine was offline, ‘fault_dur’ just means while the turbine was faulting. See wtphm.batch.Batches.get_batch_data for details
Returns: summary (pandas.DataFrame) – The DataFrame has the total duration, mean duration, standard deviation of the duration and number of stoppages in each cluster.
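A per-cluster duration summary of this kind can be approximated with a pandas groupby. The sketch below is not the code of cluster_times: the toy batch_data, the flat layout assumed for assoc_batch (one row label per clustered batch), and the output column names total, mean, std and count are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Toy batch_data with one duration column (made-up values)
batch_data = pd.DataFrame({
    "down_dur": pd.to_timedelta([30, 45, 120, 10, 15], unit="min"),
})

# Assumed layout: one batch_data row label per clustered batch
assoc_batch = np.array([0, 1, 2, 4])
cluster_labels = np.array([0, 0, 1, 1])

# Keep only the clustered batches and attach their cluster label
clustered = batch_data.loc[assoc_batch, ["down_dur"]].copy()
clustered["cluster"] = cluster_labels

# Convert to hours so the aggregations operate on plain floats
clustered["down_hours"] = clustered["down_dur"].dt.total_seconds() / 3600

summary = clustered.groupby("cluster")["down_hours"].agg(
    total="sum", mean="mean", std="std", count="count"
)
print(summary)
```

Batch 3 in the toy data did not receive a feature vector, so it has no cluster label and is left out of the summary.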