Aggregation Function for Longitudinal Data¶
AggrFunc¶
```python
AggrFunc(
    features_group: List[List[int]] = None,
    non_longitudinal_features: List[Union[int, str]] = None,
    feature_list_names: List[str] = None,
    aggregation_func: Union[str, Callable] = "mean",
    parallel: bool = False,
    num_cpus: int = -1
)
```
The AggrFunc class applies aggregation functions to feature groups in longitudinal datasets. The motivation is to exploit some of the dataset's temporal information before applying traditional machine learning algorithms such as those in Scikit-Learn. However, it is worth noting that aggregation significantly diminishes the overall temporal information of the dataset.
A feature group refers to a collection of features that possess a common base longitudinal attribute while originating from distinct waves of data collection. Refer to the documentation's "Temporal Dependency" page for more details.
Aggregation Function
In a given scenario, it is observed that a dataset comprises three distinct features, namely "income_wave1", "income_wave2", and "income_wave3". It is noteworthy that these features collectively constitute a group within the dataset.
The aggregation function is applied across the waves of each feature group, producing one aggregated feature per group. For example, when the designated aggregation function is the mean, the individual features "income_wave1", "income_wave2", and "income_wave3" are reduced to a single consolidated feature named "mean_income".
Support for Custom Functions
The class also supports custom aggregation functions, as long as they adhere to the callable interface: the user may pass a function that accepts a pandas Series as input and produces a single value as output. The pandas Series represents the longitudinal attribute across the waves.
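As an illustration of that interface, any plain Python function mapping a pandas Series to a single scalar qualifies. The function name below is a hypothetical example for illustration, not part of the library:

```python
import pandas as pd

def wave_range(series: pd.Series) -> float:
    # Receives the Series of one longitudinal attribute's wave values
    # and returns a single scalar: the spread across the waves.
    return series.max() - series.min()

income_waves = pd.Series([1200, 1350, 1500])  # income_wave1..income_wave3
wave_range(income_waves)  # 300
```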
Parameters¶
- `features_group` (`List[List[int]]`): A temporal matrix representing the temporal dependency of a longitudinal dataset. Each tuple/list of integers in the outer list represents the indices of a longitudinal attribute's waves, with each longitudinal attribute having its own sublist in that outer list. For more details, see the documentation's "Temporal Dependency" page.
- `non_longitudinal_features` (`List[Union[int, str]]`, optional): A list of indices of features that are not longitudinal attributes. Defaults to None.
- `feature_list_names` (`List[str]`): A list of feature names in the dataset.
- `aggregation_func` (`Union[str, Callable]`, optional): The aggregation function to apply. Can be "mean", "median", "mode", or a custom function. Defaults to "mean".
- `parallel` (`bool`, optional): Whether to use parallel processing for the aggregation. Defaults to False.
- `num_cpus` (`int`, optional): The number of CPUs to use for parallel processing. Defaults to -1, which uses all available CPUs.
Methods¶
get_params¶
Get the parameters of the AggrFunc instance.
Parameters¶
- `deep` (`bool`, optional): If True, will return the parameters for this estimator and contained subobjects that are estimators. Defaults to True.
Returns¶
- dict: The parameters of the AggrFunc instance.
prepare_data¶
Prepare the data for the transformation.
Parameters¶
- `X` (`np.ndarray`): The input data.
- `y` (`np.ndarray`, optional): The target data. Not particularly relevant for this class. Defaults to None.
Returns¶
- AggrFunc: The instance of the class with prepared data.
transform¶
Apply the aggregation function to the feature groups in the dataset.
Returns¶
- pd.DataFrame: The transformed dataset.
- List[List[int]]: The feature groups in the transformed dataset. This should be empty, since the aggregation function collapses every longitudinal feature group into a single feature.
- List[Union[int, str]]: The non-longitudinal features in the transformed dataset.
- List[str]: The names of the features in the transformed dataset.
Examples¶
Dummy Longitudinal Dataset¶
Consider the following dataset: stroke.csv
Features:

- `smoke` (longitudinal) with two waves/time-points
- `cholesterol` (longitudinal) with two waves/time-points
- `age` (non-longitudinal)
- `gender` (non-longitudinal)
Target:

- `stroke` (binary classification), taken at wave/time-point 2 only for the sake of the example
The dataset is shown below (`w` stands for *wave* in ELSA):
| smoke_w1 | smoke_w2 | cholesterol_w1 | cholesterol_w2 | age | gender | stroke_w2 |
|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 1 | 45 | 1 | 0 |
| 1 | 1 | 1 | 1 | 50 | 0 | 1 |
| 0 | 0 | 0 | 0 | 55 | 1 | 0 |
| 1 | 1 | 1 | 1 | 60 | 0 | 1 |
| 0 | 1 | 0 | 1 | 65 | 1 | 0 |
Example 1: Basic Usage with Mean Aggregation¶
- Note that you could have instantiated the features group manually. `features_group = [[0, 1], [2, 3]]` would have been equivalent to `dataset.setup_features_group("elsa")` in this very scenario, while the non-longitudinal features could have been set with `non_longitudinal_features = [4, 5]`. However, the `elsa` pre-set does this for you.
Example 2: Using Custom Aggregation Function¶
- Note that you could have instantiated the features group manually. `features_group = [[0, 1], [2, 3]]` would have been equivalent to `dataset.setup_features_group("elsa")` in this very scenario, while the non-longitudinal features could have been set with `non_longitudinal_features = [4, 5]`. However, the `elsa` pre-set does this for you.
Example 3: Using Parallel Processing¶
- Note that you could have instantiated the features group manually. `features_group = [[0, 1], [2, 3]]` would have been equivalent to `dataset.setup_features_group("elsa")` in this very scenario, while the non-longitudinal features could have been set with `non_longitudinal_features = [4, 5]`. However, the `elsa` pre-set does this for you.
- In this example, we set the number of CPUs for parallel processing to 4, meaning the aggregation function is applied to the feature groups using 4 CPUs. The aggregation step can then be up to 4 times faster than non-parallel processing, provided the dataset has enough feature groups to keep all 4 CPUs busy.