Gama Longitudinal Classifier¶
GamaLongitudinalClassifier¶
GamaLongitudinalClassifier(
features_group: List[List[Union[int, str]]],
non_longitudinal_features: List[Union[int, str]],
feature_list_names: List[str], update_feature_groups_callback: Union[Callable, str] = 'default',
search_space: Optional[cs.ConfigurationSpace] = None, scoring: Metric = 'roc_auc',
max_pipeline_length: Optional[int] = None, search: Optional[BaseSearch] = None,
*args, **kwargs
)
GamaLongitudinalClassifier
is the principal class for running Auto-Sklong
, specifically designed to
handle longitudinal datasets. It integrates the proposed search space
and employs various search methods to effectively
explore the longitudinal-based search space
for your dataset.
Naming Convention: GamaLongitudinalClassifier
The name GamaLongitudinalClassifier
reflects its foundation on the GAMA
framework, ensuring consistency with
the framework's naming conventions, even though it is a specialised fork tailored for longitudinal data.
Understanding features_group
and non_longitudinal_features
To leverage the full potential of GamaLongitudinalClassifier
, it is crucial to understand the mandatory parameters
features_group
and non_longitudinal_features
. These parameters enable the classifiers' abilities to manage the
temporal dependencies in your dataset. Detailed information can be found in the Temporal Dependencies documentation page
Refer to GAMA base
Documentation
The GamaLongitudinalClassifier
inherits many parameters from the GAMA
base class (e.g the max total time the AutoML search goes on).
A core class in the GAMA
AutoML framework to all sub-class such as the standard GamaClassifier
or GamaRegressor
classes.
For comprehensive details on these inherited parameters, consult the GAMA Base API documentation
Parameters¶
- features_group (
List[List[Union[int, str]]]
): A temporal matrix representing the temporal dependency of a longitudinal dataset. Each tuple/list of integers in the outer list represents the indices of a longitudinal attribute's waves, with each longitudinal attribute having its own sublist in that outer list. For more details, see the documentation'sTemporal Dependency
page. - non_longitudinal_features (
List[Union[int, str]]
): A list of indices or names of features that are not longitudinal. - feature_list_names (
List[str]
): A list of feature names in the dataset. - update_feature_groups_callback (
Union[Callable, str]
, optional): Callback function to update feature groups. This function is invoked to update the structure of longitudinal features during pipeline transformations. By default, if you did not change thesearch space
, you should be all right not to set it it up. Read further in theSklong
documentation here - search_space (
Optional[cs.ConfigurationSpace]
): Thesearch space
being explored by the system. The original is by default, another one could be passed. Further explore theSearch_Space
documentation page. - scoring (
Metric
, default='roc_auc'): The metric used for evaluating the performance of the model. Similar toScikit-Learn
/Auto-Sklearn
, you can pass a customMetric
scoring function that the AutoML system will have to optimise. - max_pipeline_length (
Optional[int]
, default=None): The maximum length of the pipeline to be searched. Whether you would like to have apre-processing
technique in the middle or not. If you put2
, it will discard thepre-processing
step of thesearch space
. - search (
Optional[BaseSearch]
, default=None): The search strategy used for pipeline optimization. Either,BayesianOptimization
,RandomSearch
,Asynchronous Evolutionary Algorithm
, orAsynchronous Successive Halving
. Further explore the Search Methods documentation section. - args: Additional arguments.
- kwargs: Additional keyword arguments.
More are available, here in the GAMA base documentation. Such as, the max_total_time
, max_eval_time
, etc.
Attributes¶
- classes_ (
ndarray
of shape (n_classes,)): The class labels. - n_classes_ (
int
): The number of classes. - feature_importances_ (
ndarray
of shape (n_features,)): The feature importances. - best_pipeline_ (
Pipeline
object): The best pipeline found during the search.
Methods¶
Fit¶
Fit the Gama Longitudinal Classifier
. Or in another word, start the AutoML experiment on your input data.
This takes the amount of time you set in the max_total_time
parameter of the GAMA Base class.
Parameters¶
- x (
Union[pd.DataFrame, np.ndarray]
): The training input samples of shape(n_samples, n_features)
. - y (
Union[pd.Series, np.ndarray]
): The target values of shape(n_samples,)
. - args: Additional arguments.
- kwargs: Additional keyword arguments.
Returns¶
- GamaLongitudinalClassifier: The fitted AutoML classifier. Capable of predicting the target for input X using the best found pipeline.
Predict¶
Predict the target for input X using the best found pipeline.
This method computes the class labels for the input samples.
Parameters¶
- x (
Union[pd.DataFrame, np.ndarray]
): The input samples of shape(n_samples, n_features)
.
Returns¶
- np.ndarray: The predicted class labels of shape
(n_samples,)
.
Predict Proba¶
Predict class probabilities for X using the best found pipeline.
This method computes the class probabilities for the input samples.
Parameters¶
- x (
Union[pd.DataFrame, np.ndarray]
): The input samples of shape(n_samples, n_features)
.
Returns¶
- np.ndarray: The predicted class probabilities of shape
(n_samples, n_classes)
.
Predict Proba from File¶
.predict_proba_from_file(
arff_file_path: str, target_column: Optional[str] = None,
encoding: Optional[str] = None
)
Predict class probabilities from an ARFF file.
This method computes the class probabilities for the input samples in the ARFF file.
Parameters¶
- arff_file_path (
str
): The path to the ARFF file containing the input samples. - target_column (
Optional[str]
, default=None): The name of the target column in the ARFF file. If None, the last column is assumed to be the target. - encoding (
Optional[str]
, default=None): The encoding of the ARFF file.
Returns¶
- np.ndarray: The predicted class probabilities of shape
(n_samples, n_classes)
.
Examples¶
Make sure to open the little +
icon next to any line having them
The examples below are just a starting point. Make sure to open the little +
icon next to any line having them to see further explanation / guidance.
Dummy Longitudinal Dataset¶
Consider the following dataset: stroke.csv
Features:
smoke
(longitudinal) with two waves/time-pointscholesterol
(longitudinal) with two waves/time-pointsage
(non-longitudinal)gender
(non-longitudinal)
Target:
stroke
(binary classification) at wave/time-point 2 only for the sake of the example
The dataset is shown below:
smoke_w1 | smoke_w2 | cholesterol_w1 | cholesterol_w2 | age | gender | stroke_w2 |
---|---|---|---|---|---|---|
0 | 1 | 0 | 1 | 45 | 1 | 0 |
1 | 1 | 1 | 1 | 50 | 0 | 1 |
0 | 0 | 0 | 0 | 55 | 1 | 0 |
1 | 1 | 1 | 1 | 60 | 0 | 1 |
0 | 1 | 0 | 1 | 65 | 1 | 0 |
Example 1: Basic Usage¶
from sklearn.metrics import classification_report
from scikit_longitudinal.data_preparation import LongitudinalDataset
from gama.GamaLongitudinalClassifier import GamaLongitudinalClassifier
# Load your longitudinal dataset
dataset = LongitudinalDataset('./stroke.csv') # (1)
dataset.load_data_target_train_test_split(
target_column="stroke_w2",
)
# Pre-set or manually set your temporal dependencies
dataset.setup_features_group(input_data="elsa") # (2)
# Instantiate the AutoML system
automl = GamaLongitudinalClassifier( # (3)
features_group=dataset.features_group(),
non_longitudinal_features=dataset.non_longitudinal_features(), # (4)
feature_list_names=dataset.data.columns,
)
# Run the AutoML system to find the best model and hyperparameters
automl.fit(dataset.X_train, dataset.y_train)
# Predictions and prediction probabilities
label_predictions = automl.predict(X_test)
probability_predictions = automl.predict_proba(X_test)
# Classification report
print(classification_report(y_test, label_predictions))
- To further explore the documentation about
LongitudinalDataset
available viaSklong
, read here. - Define the features_group manually or use a pre-set from the
LongitudinalDataset
class. If the data was from the ELSA database, you could use as per the example the pre-sets such as.setup_features_group('elsa')
. - Instantiate the
GamaLongitudinalClassifier
class with the features_group and non-longitudinal features and the rest by default. - Define the non-longitudinal features manually or use a pre-set from the
LongitudinalDataset
class. If the data was from the ELSA database, you could use as per the example the pre-sets such as.setup_features_group('elsa')
, then the non-longitudinal features would have been automatically set.
Example 2: How to set my own scoring function¶
from sklearn.metrics import f1_score
from scikit_longitudinal.data_preparation import LongitudinalDataset
from gama.GamaLongitudinalClassifier import GamaLongitudinalClassifier
# Load your longitudinal dataset
dataset = LongitudinalDataset('./stroke.csv') # (1)
dataset.load_data_target_train_test_split(
target_column="stroke_w2",
)
# Pre-set or manually set your temporal dependencies
dataset.setup_features_group(input_data="elsa") # (2)
# Instantiate the AutoML system
automl = GamaLongitudinalClassifier( # (3)
features_group=dataset.features_group(),
non_longitudinal_features=dataset.non_longitudinal_features(), # (4)
feature_list_names=dataset.data.columns,
scoring=f1_score # (5)
)
# Run the AutoML system to find the best model and hyperparameters
automl.fit(dataset.X_train, dataset.y_train)
# Predictions and prediction probabilities
label_predictions = automl.predict(X_test)
probability_predictions = automl.predict_proba(X_test)
# Classification report
print(classification_report(y_test, label_predictions))
- To further explore the documentation about
LongitudinalDataset
available viaSklong
, read here. - Define the features_group manually or use a pre-set from the
LongitudinalDataset
class. If the data was from the ELSA database, you could use as per the example the pre-sets such as.setup_features_group('elsa')
. - Instantiate the
GamaLongitudinalClassifier
class with the features_group and non-longitudinal features and the rest by default. - Define the non-longitudinal features manually or use a pre-set from the
LongitudinalDataset
class. If the data was from the ELSA database, you could use as per the example the pre-sets such as.setup_features_group('elsa')
, then the non-longitudinal features would have been automatically set. - Set the scoring function to
f1_score
instead of the defaultroc_auc
. You could create your own scorer withmake_scorer
fromsklearn.metrics
if you want to use a custom metric. See furthere here.
Example 3: How to export the best pipeline¶
from sklearn.metrics import classification_report
from scikit_longitudinal.data_preparation import LongitudinalDataset
from gama.GamaLongitudinalClassifier import GamaLongitudinalClassifier
# Load your longitudinal dataset
dataset = LongitudinalDataset('./stroke.csv') # (1)
dataset.load_data_target_train_test_split(
target_column="stroke_w2",
)
# Pre-set or manually set your temporal dependencies
dataset.setup_features_group(input_data="elsa") # (2)
# Instantiate the AutoML system
automl = GamaLongitudinalClassifier( # (3)
features_group=dataset.features_group(),
non_longitudinal_features=dataset.non_longitudinal_features(), # (4)
feature_list_names=dataset.data.columns,
)
# Run the AutoML system to find the best model and hyperparameters
automl.fit(dataset.X_train, dataset.y_train)
# Export the best pipeline
automl.export_script('my_pipeline_script.py') # (5)
- To further explore the documentation about
LongitudinalDataset
available viaSklong
, read here. - Define the features_group manually or use a pre-set from the
LongitudinalDataset
class. If the data was from the ELSA database, you could use as per the example the pre-sets such as.setup_features_group('elsa')
. - Instantiate the
GamaLongitudinalClassifier
class with the features_group and non-longitudinal features and the rest by default. - Define the non-longitudinal features manually or use a pre-set from the
LongitudinalDataset
class. If the data was from the ELSA database, you could use as per the example the pre-sets such as.setup_features_group('elsa')
, then the non-longitudinal features would have been automatically set. - Export the best pipeline to a Python file named
my_pipeline_script.py
. The resulting script will define a variablepipeline
orensemble
, depending on the post-processing method that was used after search. Reader further about the post-ensembling methods in theGAMA
class documentation.
Example 4: How to play with the GAMA
base parameters¶
from sklearn.metrics import classification_report
from scikit_longitudinal.data_preparation import LongitudinalDataset
from gama.GamaLongitudinalClassifier import GamaLongitudinalClassifier
from gama.search_methods import RandomSearch
# Load your longitudinal dataset
dataset = LongitudinalDataset('./stroke.csv') # (1)
dataset.load_data_target_train_test_split(
target_column="stroke_w2",
)
# Pre-set or manually set your temporal dependencies
dataset.setup_features_group(input_data="elsa") # (2)
# Instantiate the AutoML system
automl = GamaLongitudinalClassifier( # (3)
features_group=dataset.features_group(),
non_longitudinal_features=dataset.non_longitudinal_features(), # (4)
feature_list_names=dataset.data.columns,
max_total_time=86400, # (5)
max_eval_time=1000, # (6)
n_jobs=4, # (7)
max_memory_mb=2000, # (8)
post_processing=EnsemblePostProcessing(), # (9)
output_directory="my_output", # (10)
store="all", # (11)
search=RandomSearch(), # (12)
)
# Run the AutoML system to find the best model and hyperparameters
automl.fit(dataset.X_train, dataset.y_train)
# Predictions and prediction probabilities
label_predictions = automl.predict(X_test)
probability_predictions = automl.predict_proba(X_test)
# Classification report
print(classification_report(y_test, label_predictions))
- To further explore the documentation about
LongitudinalDataset
available viaSklong
, read here. - Define the features_group manually or use a pre-set from the
LongitudinalDataset
class. If the data was from the ELSA database, you could use as per the example the pre-sets such as.setup_features_group('elsa')
. - Instantiate the
GamaLongitudinalClassifier
class with the features_group and non-longitudinal features and the rest by default. - Define the non-longitudinal features manually or use a pre-set from the
LongitudinalDataset
class. If the data was from the ELSA database, you could use as per the example the pre-sets such as.setup_features_group('elsa')
, then the non-longitudinal features would have been automatically set. - Set the maximum total time – in seconds – for the AutoML system to 24 hours (86400 seconds). Read further in the
Gama base
class documentation. - Set the maximum evaluation time – in seconds – for each pipeline evaluation to 1000 seconds. Read further in the
Gama base
class documentation. - Turn on the parallel processing with 4 jobs. This means that 4 candidates at the same time will be able to be evaluated, if the number of CPUs availale permits-so. Read further in the
Gama base
class documentation. - Set the maximum memory usage – in megabytes – for the AutoML system to 2000 MB. This means that above the current candidate's evaluation will crash. Read further in the
Gama base
class documentation. - Set the post-processing method to
EnsemblePostProcessing
. This will ensemble the best pipelines found during the search. Read further in the officialGAMA
documentation here. - Set the output directory to
my_output
. This will save the results of the search in themy_output
directory. Read further in theGama base
class documentation. - Set the store level to
all
, which keep logs and cache with models and predictions. Read further in theGama base
class documentation. - Run the AutoML's search under
RandomSearch
strategy. Others are available, read further in the Search Methods section.