Longitudinal Dataset¶
LongitudinalDataset¶
The LongitudinalDataset class is a container for managing and preparing longitudinal datasets. It provides the data management and transformation capabilities needed to develop and apply machine learning algorithms tailored to longitudinal data classification tasks.
Feature Groups and Non-Longitudinal Characteristics
The class relies on two attributes, feature_groups and non_longitudinal_features, which allow adapted or newly designed machine learning algorithms to interpret the temporal structure of longitudinal datasets.
- feature_groups: A temporal matrix representing the temporal dependency of a longitudinal dataset. It is an outer list in which each inner list (or tuple) of integers holds the column indices of one longitudinal attribute's waves, with one inner list per longitudinal attribute. For more details, see the documentation's "Temporal Dependency" page.
- non_longitudinal_features: A list of feature indices that are considered non-longitudinal. These features are not part of the temporal matrix; whether they are treated as static features depends on the technique applied afterwards.
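To make the two attributes concrete, here is a minimal sketch in plain Python. The helper non_longitudinal_from_groups and the column layout are hypothetical and are not part of the library; the sketch assumes the non-longitudinal features are simply the column indices that no feature group covers.

```python
from typing import List


def non_longitudinal_from_groups(feature_groups: List[List[int]], n_features: int) -> List[int]:
    """Return the column indices that are not covered by any feature group.

    Hypothetical helper for illustration only; -1 placeholders are ignored.
    """
    grouped = {idx for group in feature_groups for idx in group if idx != -1}
    return [idx for idx in range(n_features) if idx not in grouped]


# Assumed column layout: two longitudinal attributes with two waves each,
# followed by two static columns.
columns = ["smoke_w1", "smoke_w2", "cholesterol_w1", "cholesterol_w2", "age", "gender"]

# One inner list per longitudinal attribute, ordered from oldest to most recent wave.
feature_groups = [[0, 1], [2, 3]]

non_longitudinal_features = non_longitudinal_from_groups(feature_groups, len(columns))
print(non_longitudinal_features)  # [4, 5]
```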
Wrapper Around Pandas DataFrame
This class wraps a pandas DataFrame, offering a familiar interface while incorporating enhancements for longitudinal data. It ensures effective processing and learning from data collected over multiple time points.
Parameters¶
- file_path (Union[str, Path]): Path to the dataset file. Supports both ARFF and CSV formats.
- data_frame (Optional[pd.DataFrame], optional): If provided, this pandas DataFrame will serve as the dataset, and the file_path parameter will be ignored.
Properties¶
- data (pd.DataFrame): A read-only property that returns the loaded dataset as a pandas DataFrame.
- target (pd.Series): A read-only property that returns the target variable (class variable) as a pandas Series.
- X_train (np.ndarray): A read-only property that returns the training data as a numpy array.
- X_test (np.ndarray): A read-only property that returns the test data as a numpy array.
- y_train (pd.Series): A read-only property that returns the training target data as a pandas Series.
- y_test (pd.Series): A read-only property that returns the test target data as a pandas Series.
Methods¶
load_data¶
Load the data from the specified file into a pandas DataFrame.
Raises¶
- ValueError: If the file format is not supported. Only ARFF and CSV are supported.
- FileNotFoundError: If the file specified in the file_path parameter does not exist.
load_target¶
.load_target(
target_column: str,
target_wave_prefix: str = "class_",
remove_target_waves: bool = False
)
Parameters¶
- target_column (str): The name of the column in the dataset to be used as the target variable.
- target_wave_prefix (str, optional): The prefix of the columns that represent different waves of the target variable. Defaults to "class_".
- remove_target_waves (bool, optional): If True, all columns starting with target_wave_prefix, as well as the target_column, are removed from the dataset after the target variable is extracted. In longitudinal studies the class variable is sometimes collected at several time points as well, which is why this automatic removal is offered. Defaults to False.
Raises¶
- ValueError: If no data is loaded or the target_column is not found in the dataset.
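The effect of remove_target_waves=True on the column set can be sketched without the library. The function drop_target_waves and the column names below are hypothetical; the sketch only mirrors the parameter description above (columns starting with the prefix, plus the target column, are dropped) and is not the library's implementation.

```python
from typing import List


def drop_target_waves(
    columns: List[str],
    target_column: str,
    target_wave_prefix: str = "class_",
) -> List[str]:
    """Return the columns left once the target and its other waves are removed.

    Hypothetical helper modelling the remove_target_waves=True case only.
    """
    if target_column not in columns:
        raise ValueError(f"Target column {target_column!r} not found in the dataset")
    return [
        name
        for name in columns
        if name != target_column and not name.startswith(target_wave_prefix)
    ]


# Assumed dataset where the class was recorded at two waves.
columns = ["smoke_w1", "smoke_w2", "class_stroke_w1", "class_stroke_w2", "age"]
remaining = drop_target_waves(columns, target_column="class_stroke_w2")
print(remaining)  # ['smoke_w1', 'smoke_w2', 'age']
```

Dropping the earlier class waves keeps past values of the target from being used as ordinary predictors.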
load_train_test_split¶
Split the data into training and testing sets and save them as attributes.
Parameters¶
- test_size (float, optional): The proportion of the dataset to include in the test split. Defaults to 0.2.
- random_state (int, optional): Controls the shuffling applied to the data before applying the split. Pass an int for reproducible output across multiple function calls. Defaults to None.
Raises¶
- ValueError: If no data or target is loaded.
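The roles of test_size and random_state can be illustrated with a small standard-library sketch. The function split_indices is hypothetical, and this page does not state which splitting routine the library actually uses, so treat this as a picture of the semantics (seeded shuffle, then a proportional cut) rather than the real implementation.

```python
import math
import random
from typing import List, Optional, Tuple


def split_indices(
    n_samples: int,
    test_size: float = 0.2,
    random_state: Optional[int] = None,
) -> Tuple[List[int], List[int]]:
    """Shuffle row indices and cut them into (train, test) index lists.

    Hypothetical helper: the test set size is rounded up, and a fixed
    random_state yields the same shuffle on every call.
    """
    rng = random.Random(random_state)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    n_test = math.ceil(n_samples * test_size)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(10, test_size=0.2, random_state=42)
print(len(train_idx), len(test_idx))  # 8 2

# The same seed reproduces the same split.
assert split_indices(10, test_size=0.2, random_state=42) == (train_idx, test_idx)
```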
load_data_target_train_test_split¶
.load_data_target_train_test_split(
target_column: str,
target_wave_prefix: str = "class_",
remove_target_waves: bool = False,
test_size: float = 0.2,
random_state: int = None
)
Parameters¶
- target_column (str): The name of the column in the dataset to be used as the target variable.
- target_wave_prefix (str, optional): The prefix of the columns that represent different waves of the target variable. Defaults to "class_".
- remove_target_waves (bool, optional): If True, all the columns with target_wave_prefix and the target_column will be removed from the dataset after extracting the target variable. Defaults to False.
- test_size (float, optional): The proportion of the dataset to include in the test split. Defaults to 0.2.
- random_state (int, optional): Controls the shuffling applied to the data before applying the split. Pass an int for reproducible output across multiple function calls. Defaults to None.
convert¶
Convert the dataset between ARFF and CSV formats.
Parameters¶
- output_path (Union[str, Path]): Path to store the resulting file.
Raises¶
- ValueError: If no data to convert or unsupported file format.
save_data¶
Save the DataFrame to the specified file format.
Parameters¶
- output_path (Union[str, Path]): Path to store the resulting file.
Raises¶
- ValueError: If no data to save.
setup_features_group¶
Set up the feature groups based on the input data and populate the non-longitudinal features attribute.
Feature Group Setup
The method sets up feature groups from the input data provided. The input can be a list of lists of integers, a list of lists of strings (feature names), or the name of a pre-set strategy (e.g., "elsa").
The list of lists of integers/strings works as follows:
- Each sublist represents a feature group, in other words a longitudinal attribute.
- Each element in a sublist is the index (or name) of the corresponding feature in the dataset.
- So that the waves of two longitudinal attributes can be compared, gaps in a sublist can be filled with -1. For example, if the first longitudinal attribute has 3 waves and the second has 5 waves, the first sublist could be [0, 1, 2, -1, -1] and the second sublist could be [3, 4, 5, 6, 7]. The first wave of the first attribute can then be compared with the first wave of the second attribute, and so on (i.e., to see which one is older or more recent).
For more information, see the documentation's "Temporal Dependency" page.
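The -1 padding described above can be sketched as follows. The helper pad_feature_groups is hypothetical and assumes, as in the example above, that the missing waves are the trailing ones.

```python
from typing import List


def pad_feature_groups(groups: List[List[int]], placeholder: int = -1) -> List[List[int]]:
    """Right-pad every group with a placeholder so all groups share one length.

    Hypothetical helper for illustration; it assumes missing waves come last.
    """
    width = max((len(group) for group in groups), default=0)
    return [group + [placeholder] * (width - len(group)) for group in groups]


padded = pad_feature_groups([[0, 1, 2], [3, 4, 5, 6, 7]])
print(padded)  # [[0, 1, 2, -1, -1], [3, 4, 5, 6, 7]]

# Once padded, position k in every sublist refers to the same wave,
# so the groups can be read wave by wave.
waves = list(zip(*padded))
print(waves[0])  # (0, 3)   first wave of both attributes
print(waves[3])  # (-1, 6)  fourth wave exists only for the second attribute
```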
Pre-set Strategy
The "elsa" strategy groups features by their base name and wave suffix ("_w1", "_w2", etc.). For example, if the dataset has the features "age_w1" and "age_w2", the method groups them together and treats w2 as more recent than w1 in the feature groups.
More pre-set strategies may be added in the future. Open an issue if you have a suggestion or would like to contribute one.
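A suffix-based grouping of this kind can be approximated with a regular expression. The function group_by_wave_suffix and the column names are hypothetical; the library's "elsa" strategy may handle edge cases (for instance attributes with a single wave) differently.

```python
import re
from collections import defaultdict
from typing import Dict, List, Tuple

# Matches names such as "age_w1" and captures the base name and the wave number.
WAVE_SUFFIX = re.compile(r"^(?P<base>.+)_w(?P<wave>\d+)$")


def group_by_wave_suffix(columns: List[str]) -> Tuple[List[List[int]], List[int]]:
    """Group column indices by base name, ordered from oldest to newest wave.

    Hypothetical helper; columns without a "_w<N>" suffix are returned
    separately as non-longitudinal.
    """
    waves: Dict[str, List[Tuple[int, int]]] = defaultdict(list)
    non_longitudinal: List[int] = []
    for idx, name in enumerate(columns):
        match = WAVE_SUFFIX.match(name)
        if match:
            waves[match.group("base")].append((int(match.group("wave")), idx))
        else:
            non_longitudinal.append(idx)
    # Sorting the (wave, index) pairs puts earlier waves first.
    groups = [[idx for _, idx in sorted(entries)] for entries in waves.values()]
    return groups, non_longitudinal


groups, static = group_by_wave_suffix(["age_w1", "age_w2", "bmi_w2", "bmi_w1", "sex"])
print(groups)  # [[0, 1], [3, 2]]
print(static)  # [4]
```

Note that "bmi_w1" sits at index 3 but is listed first in its group, because ordering follows the wave number rather than the column position.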
Parameters¶
- input_data (Union[str, List[List[Union[str, int]]]]): The input data for setting up the feature groups:
  - If "elsa" is passed, it groups features based on their name and suffix "_w1", "_w2", etc.
  - If a list of lists of integers is passed, it assigns the input directly to the feature groups without modification.
  - If a list of lists of strings (feature names) is passed, it converts the names to indices and creates feature groups.
Raises¶
- ValueError: If input_data is not one of the expected types or if a feature name is not found in the dataset.
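The name-to-index conversion for a list of lists of strings can be pictured like this. The helper names_to_indices and the column names are hypothetical; only the documented behaviour (names become indices, unknown names raise ValueError) is taken from this page.

```python
from typing import List


def names_to_indices(groups: List[List[str]], columns: List[str]) -> List[List[int]]:
    """Translate feature names into their column positions.

    Hypothetical helper; raises ValueError for a name absent from the columns.
    """
    position = {name: idx for idx, name in enumerate(columns)}
    converted: List[List[int]] = []
    for group in groups:
        indices: List[int] = []
        for name in group:
            if name not in position:
                raise ValueError(f"Feature {name!r} not found in the dataset")
            indices.append(position[name])
        converted.append(indices)
    return converted


# Assumed columns that do not follow the "_w<N>" naming scheme.
columns = ["bp_2010", "bp_2015", "glucose_2010", "glucose_2015", "sex"]
custom_groups = [["bp_2010", "bp_2015"], ["glucose_2010", "glucose_2015"]]
print(names_to_indices(custom_groups, columns))  # [[0, 1], [2, 3]]
```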
feature_groups¶
Return the feature groups. When the names parameter is set to True, any placeholders (-1) are substituted with "N/A".
Parameters¶
- names (bool, optional): If True, the feature names will be returned instead of the indices. Defaults to False.
Returns¶
- List[List[Union[int, str]]]: The feature groups as a list of lists of feature names or indices.
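The substitution of -1 placeholders by "N/A" when names are requested can be sketched as below. The helper groups_as_names and the column names are hypothetical and only mirror the description above.

```python
from typing import List


def groups_as_names(feature_groups: List[List[int]], columns: List[str]) -> List[List[str]]:
    """Replace each index by its column name and each -1 placeholder by "N/A".

    Hypothetical helper for illustration only.
    """
    return [
        [columns[idx] if idx != -1 else "N/A" for idx in group]
        for group in feature_groups
    ]


# Assumed columns: "a" has two waves, "b" has only one, hence the -1 padding.
columns = ["a_w1", "a_w2", "b_w1"]
feature_groups = [[0, 1], [2, -1]]
print(groups_as_names(feature_groups, columns))
# [['a_w1', 'a_w2'], ['b_w1', 'N/A']]
```

Checking explicitly for -1 matters in Python, because columns[-1] would otherwise silently return the last column name.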
non_longitudinal_features¶
Return the non-longitudinal features.
Parameters¶
- names (bool, optional): If True, the feature names will be returned instead of the indices. Defaults to False.
Returns¶
- List[Union[int, str]]: The non-longitudinal features as a list of feature names or indices.
set_data¶
Set the data attribute.
Parameters¶
- data (pd.DataFrame): The data.
set_target¶
Set the target attribute.
Parameters¶
- target (pd.Series): The target.
setX_train¶
Set the training data attribute.
Parameters¶
- X_train (pd.DataFrame): The training data.
setX_test¶
Set the test data attribute.
Parameters¶
- X_test (pd.DataFrame): The test data.
sety_train¶
Set the training target data attribute.
Parameters¶
- y_train (pd.Series): The training target data.
sety_test¶
Set the test target data attribute.
Parameters¶
- y_test (pd.Series): The test target data.
Examples¶
Dummy Longitudinal Dataset¶
Consider the following dataset: stroke.csv
Features:
- smoke (longitudinal) with two waves/time-points
- cholesterol (longitudinal) with two waves/time-points
- age (non-longitudinal)
- gender (non-longitudinal)
Target:
- stroke (binary classification) at wave/time-point 2 only for the sake of the example
The dataset is shown below (w stands for wave in ELSA):
| smoke_w1 | smoke_w2 | cholesterol_w1 | cholesterol_w2 | age | gender | stroke_w2 |
|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 1 | 45 | 1 | 0 |
| 1 | 1 | 1 | 1 | 50 | 0 | 1 |
| 0 | 0 | 0 | 0 | 55 | 1 | 0 |
| 1 | 1 | 1 | 1 | 60 | 0 | 1 |
| 0 | 1 | 0 | 1 | 65 | 1 | 0 |
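For readers who want to recreate the dummy dataset, the table can be written out as CSV text and parsed with the standard library. The variable names here are illustrative; saving STROKE_CSV to a file named stroke.csv would give a file with the layout described above.

```python
import csv
import io

# The dummy dataset from the table, as CSV text.
STROKE_CSV = """smoke_w1,smoke_w2,cholesterol_w1,cholesterol_w2,age,gender,stroke_w2
0,1,0,1,45,1,0
1,1,1,1,50,0,1
0,0,0,0,55,1,0
1,1,1,1,60,0,1
0,1,0,1,65,1,0
"""

# Parse the text as if it were a file; csv yields every value as a string.
rows = list(csv.DictReader(io.StringIO(STROKE_CSV)))
columns = list(rows[0].keys())

print(len(rows))    # 5
print(columns[-1])  # stroke_w2
```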
Example 1: Basic Usage¶
- Note that you could have instantiated the features group manually. features_group = [[0, 1], [2, 3]] would have been equivalent to dataset.setup_features_group("elsa") in this very scenario, while non_longitudinal_features could have been non_longitudinal_features = [4, 5]. However, the elsa pre-set does it for you.
Example 2: Use faster setup with load_data_target_train_test_split¶
- Note that you could have instantiated the features group manually. features_group = [[0, 1], [2, 3]] would have been equivalent to dataset.setup_features_group("elsa") in this very scenario, while non_longitudinal_features could have been non_longitudinal_features = [4, 5]. However, the elsa pre-set does it for you.
Example 2: Using Custom Feature Groups (e.g., for data other than ELSA)¶
- Note that the non-longitudinal features are not included in the custom feature groups. They are automatically detected and stored in the non_longitudinal_features attribute.
Example 3: Print my feature groups and non-longitudinal features¶
- Note that you could have instantiated the features group manually. features_group = [[0, 1], [2, 3]] would have been equivalent to dataset.setup_features_group("elsa") in this very scenario, while non_longitudinal_features could have been non_longitudinal_features = [4, 5]. However, the elsa pre-set does it for you.