Correlation Based Feature Selection Per Group (CFS Per Group)¶
Correlation Based Feature Selection Per Group (CFS Per Group)¶
CorrelationBasedFeatureSelectionPerGroup(
non_longitudinal_features: Optional[List[int]] = None,
search_method: str = 'greedySearch',
features_group: Optional[List[List[int]]] = None, parallel: bool = False,
outer_search_method: str = None, inner_search_method: str = 'exhaustiveSearch',
version = 1, num_cpus: int = -1
)
Correlation-based Feature Selection (CFS) per group (CFS Per Group).
This class performs feature selection using the CFS-Per-Group algorithm on given data. The CFS algorithm is, in-a-nutshell, a filter method that selects features based on their correlation with the target variable and their mutual correlation with each other. CFS-Per-Group, on the other hand, is an implementation that is adapted from the original CFS, tailored to understand the Longitudinal temporality.
CFS-Per-Group a Longitudinal Variation of the Standard CFS Method
CFS-Per-Group, also known as Exh-CFS-Gr
in the literature, is a longitudinal variation of the standard CFS method.
It is designed to handle longitudinal data by considering temporal variations across multiple waves (time points).
The method works in two phases:
- Phase 1 (Can work independently from Phase 2): For each longitudinal feature, CFS with exhaustive search (or any other search method available) is applied to a small set of temporal variations across waves to select a subset of relevant and non-redundant features. The selected temporal variations of features are then merged into a single feature set.
- Phase 2 (Works with Phase 1): The feature set from Phase 1 is combined with non-longitudinal features (for a less-biased process) and standard CFS is applied to further remove redundant features.
For more scientific references, refer to the Notes section.
Standard CFS Algorithm implementation available
If you would like to use the standard CFS algorithm, please refer to the CorrelationBasedFeatureSelection
class.
Given that this is out of the scope of this documentation, we recommend checking the source code for more information.
Parameters¶
- features_group (
Optional[List[Tuple[int, ...]]]
, default=None): A temporal matrix representing the temporal dependency of a longitudinal dataset. Each tuple/list of integers in the outer list represents the indices of a longitudinal attribute's waves, with each longitudinal attribute having its own sublist in that outer list. For more details, see the documentation's "Temporal Dependency" page. - non_longitudinal_features (
Optional[List[int]]
): A list of feature indices that are considered non-longitudinal. In version-2, these features will be employed in the second phase of the CFS per group algorithm. - search_method (
str
, default="greedySearch"): The search method to use (Phase-1). Options are "exhaustiveSearch" and "greedySearch". - version (
int
, default=2): The version of the CFS per group algorithm to use. Options are "1" and "2". Version 2 is the improved version with an outer search out of the final aggregated list of features of the first phase. - outer_search_method (
str
, default=None): The outer (to the final aggregated list of features) search method to use for the CFS per group (longitudinal component). If None, it defaults to the same as thesearch_method
. - inner_search_method (
str
, default="exhaustiveSearch"): The inner (to each longitudinal attributes' waves') search method to use for the CFS per group (longitudinal component). - parallel (
bool
, default=False): Whether to use parallel processing for the CFS algorithm (especially useful for the exhaustive search method with the CFS per group, i.e., longitudinal component). - num_cpus (
int
, default=-1): The number of CPUs to use for parallel processing. If -1, all available CPUs will be used.
Attributes¶
- selected_features_ (
ndarray
of shape (n_features,)): The indices of the selected features.
Methods¶
Fit¶
Fits the CFS algorithm on the input data and target variable.
Parameters¶
- X (
np.ndarray
): The input data of shape (n_samples, n_features). - y (
np.ndarray
): The target variable of shape (n_samples).
Returns¶
- CorrelationBasedFeatureSelectionPerGroup: The fitted instance of the CFS algorithm.
Transform¶
Reduces the input data to only the selected features.
Warning
Not to be used directly. Use the apply_selected_features_and_rename
method instead.
Parameters¶
- X (
np.ndarray
): A numpy array of shape (n_samples, n_features) representing the input data.
Returns¶
- np.ndarray: The reduced input data as a numpy array of shape (n_samples, n_selected_features).
Apply selected features and rename¶
.apply_selected_features_and_rename(
df: pd.DataFrame, selected_features: List, regex_match = '^(.+)_w(\\d+)$'
)
Apply selected features to the input DataFrame and rename non-longitudinal features. This function applies the
selected features using the selected_features_
attribute given. Therefore, you can capture by your_model.selected_features_
.
It also renames the non-longitudinal features that may have become non-longitudinal if only one wave remains after the
feature selection process, to avoid them being considered as longitudinal attributes during future automatic feature
grouping.
Note
To avoid adding a "transform" parameter to the Transformer Mixin class, this function was created instead.
To avoid misunderstanding, given that changes to Longitudinal data features (longitudinal and non-longitudinal) are needed,
we created this new function replacing Transform
. Rest assured, the LongitudinalPipeline
interprets
this workaround by default without the need for anything from the user perspective.
Parameters¶
- df (
pd.DataFrame
): The input DataFrame to apply the selected features and perform renaming. - selected_features (
List
): The list of selected features to apply to the input DataFrame. - regex_match (
str
): The regex pattern to use for renaming non-longitudinal features. Follows by default theElsa
naming convention for longitudinal features. For more information, see the source code or open an issue.
Returns¶
- pd.DataFrame: The modified DataFrame with selected features applied and non-longitudinal features renamed.
Examples¶
Dummy Longitudinal Dataset¶
Consider the following dataset: stroke.csv
Features:
smoke
(longitudinal) with two waves/time-pointscholesterol
(longitudinal) with two waves/time-pointsage
(non-longitudinal)gender
(non-longitudinal)
Target:
stroke
(binary classification) at wave/time-point 2 only for the sake of the example
The dataset is shown below (w
stands for wave
in ELSA):
smoke_w1 | smoke_w2 | cholesterol_w1 | cholesterol_w2 | age | gender | stroke_w2 |
---|---|---|---|---|---|---|
0 | 1 | 0 | 1 | 45 | 1 | 0 |
1 | 1 | 1 | 1 | 50 | 0 | 1 |
0 | 0 | 0 | 0 | 55 | 1 | 0 |
1 | 1 | 1 | 1 | 60 | 0 | 1 |
0 | 1 | 0 | 1 | 65 | 1 | 0 |
Example 1: Basic Usage¶
- Define the features_group manually or use a pre-set from the LongitudinalDataset class. If the data was from the ELSA database, you could have used the pre-sets such as
.setup_features_group('elsa')
. - Define the non-longitudinal features manually or use a pre-set from the LongitudinalDataset class. If the data was from the ELSA database, you could have used the pre-sets such as
.setup_features_group('elsa')
, then the non-longitudinal features would have been automatically set.
Example 2: Play with the Hyperparameters¶
- Define the features_group manually or use a pre-set from the LongitudinalDataset class. If the data was from the ELSA database, you could have used the pre-sets such as
.setup_features_group('elsa')
. - Define the non-longitudinal features manually or use a pre-set from the LongitudinalDataset class. If the data was from the ELSA database, you could have used the pre-sets such as
.setup_features_group('elsa')
, then the non-longitudinal features would have been automatically set. - Choose among the search methods: "greedySearch" or "exhaustiveSearch" (default).
- Enable parallel processing or not.
- Set the number of CPUs to use for parallel processing. Here we use 4 CPUs. This means that the CFS algorithm will use 4 CPUs for parallel processing. Or in another word, that each CFS running on each set of longitudinal attributes waves will have as much as dedicated CPUs available. If not enough CPUs, the algorithm will wait for the next available CPU to start the next CFS.
Example 3: Play with the two different versions of the CFS Per Group¶
- Define the features_group manually or use a pre-set from the LongitudinalDataset class. If the data was from the ELSA database, you could have used the pre-sets such as
.setup_features_group('elsa')
. - Define the non-longitudinal features manually or use a pre-set from the LongitudinalDataset class. If the data was from the ELSA database, you could have used the pre-sets such as
.setup_features_group('elsa')
, then the non-longitudinal features would have been automatically set. - Choose among the two versions of the CFS Per Group: "1" or "2" (default). See beginning of the documentation for more information on the versions.
Example 4: How to transform (acquire the final feature sets) the data¶
- Define the features_group manually or use a pre-set from the LongitudinalDataset class. If the data was from the ELSA database, you could have used the pre-sets such as
.setup_features_group('elsa')
. - Define the non-longitudinal features manually or use a pre-set from the LongitudinalDataset class. If the data was from the ELSA database, you could have used the pre-sets such as
.setup_features_group('elsa')
, then the non-longitudinal features would have been automatically set. - Print the number of selected features after fitting the CFS-Per-Group algorithm.
Notes¶
The improved Correlation-Based Feature Selection (CFS) algorithm is built upon the following key references:
GitHub Repositories¶
- Zixiao Shen's CFS Implementation:
- Zixiao. S. (2019, August 11). GitHub - ZixiaoShen/Correlation-based-Feature-Selection. Available at: GitHub
- Mastervii's CFS 2-Phase Variant:
- Pomsuwan, T. (2023, February 24). GitHub - mastervii/CSF_2-phase-variant. Available at: GitHub
Longitudinal Component References¶
- VERSION-1 of the CFS Per Group:
- Pomsuwan, T. and Freitas, A.A. (2017, November). Feature selection for the classification of longitudinal human ageing data. In 2017 IEEE International Conference on Data Mining Workshops (ICDMW) (pp. 739-746). IEEE.
- VERSION-2 of the CFS Per Group:
- Pomsuwan, T. and Freitas, A.A. (2018, February). Feature selection for the classification of longitudinal human ageing data. Master's thesis, University of Kent. Available at: University of Kent