(gloe-scikit-learn)= # Gloe & Scikit-learn Let's create an example that demonstrates how to use Gloe with [Scikit-Learn](https://scikit-learn.org/) for a machine learning workflow. We will include steps for data loading, preprocessing, model training, and evaluation. A similar example using PyTorch is available {ref}`here `. ## Setup Ensure you have the necessary packages installed: ```shell pip install gloe scikit-learn pandas ``` The below imports are necessary for the rest of the code: ```python import pandas as pd from sklearn.model_selection import train_test_split from sklearn.preprocessing import StandardScaler from sklearn.linear_model import LogisticRegression from sklearn.metrics import accuracy_score from gloe import transformer, partial_transformer from gloe.utils import attach ``` ## Define the Transformers Let us create transformers for loading data, preprocessing, model creation, training, and evaluation. ### Data Load and Preprocessing Loading data from a CSV file into a Pandas DataFrame: ```python @transformer def load_data(file_path: str) -> pd.DataFrame: return pd.read_csv(file_path) ``` The `preprocess_data` transformer preprocesses the DataFrame, splits it into training and test sets, and standardizes the features. ```python @transformer def preprocess_data(df: pd.DataFrame) -> tuple[pd.DataFrame, pd.DataFrame, pd.Series, pd.Series]: # Assume the last column is the target X = df.iloc[:, :-1] y = df.iloc[:, -1] # Split data into training and test sets X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) # Standardize features scaler = StandardScaler() X_train = scaler.fit_transform(X_train) X_test = scaler.transform(X_test) return X_train, X_test, y_train, y_test ``` Once the type `tuple[pd.DataFrame, pd.DataFrame, pd.Series, pd.Series]` is used in many places, we can create a type alias for it: ```python from typing import TypeAlias Data: TypeAlias = tuple[pd.DataFrame, pd.DataFrame, pd.Series, pd.Series] ``` Thus, the definition of the `preprocess_data` transformer can be simplified: ```python @transformer def preprocess_data(df: pd.DataFrame) -> Data: ... ``` ### Model Training and Evaluation Defining a logistic regression model using Scikit-Learn API: ```python @transformer def create_model(_) -> LogisticRegression: model = LogisticRegression(random_state=42) return model ``` ```{note} Transformers need to have at least one argument. The `_` argument is a placeholder that is not used in the transformer. ``` Training the logistic regression model using the provided data, with the number of iterations specified as a partial argument: ```python @partial_transformer def train_model(entry: tuple[LogisticRegression, Data], max_iter: int = 100) -> tuple[LogisticRegression, Data]: model, data = entry X_train, _, y_train, _ = data model.max_iter = max_iter model.fit(X_train, y_train) return model, data ``` Finally, evaluates the trained model on the test data and returns the accuracy score. ```python @transformer def evaluate_model(entry: tuple[LogisticRegression, Data]) -> float: model, data = entry X_train, X_test, y_train, y_test = data y_pred = model.predict(X_test) accuracy = accuracy_score(y_test, y_pred) return accuracy ``` ## Create the Pipeline The transformers are composed into a pipeline using the `>>` operator. `load_data`, `preprocess_data`, `create_model`, `train_model`, and `evaluate_model` are chained together. The `train_model` transformer is provided with the `max_iter` parameter using the partial transformer method. ```python pipeline = ( load_data >> preprocess_data >> attach(create_model) >> train_model(max_iter=200) >> evaluate_model ) ``` The only point that needs attention is the {func}`attach ` function that is used to attach the input of the `create_model` transformer to its output. Let's analyze the types carefully: - `create_model` has a type `Transformer[Any, LogisticRegression]`. - `train_model` has a type `Transformer[tuple[LogisticRegression, Data], tuple[LogisticRegression, Data]`. So, we cannot connect them directly. The `attach` function receives the `create_model` transformer and returns a new transformer with the input of `create_model` attached to its output. The transformer returned by `attach` has the type `Transformer[Data, tuple[LogisticRegression, Data]`: exactly what the `train_model` needs. ### Run the Pipeline The pipeline is executed by calling it directly with the input file_path. The final output is the accuracy of the model evaluation. ```python def main(): file_path = 'data.csv' accuracy = pipeline(file_path) print(f"Model evaluation accuracy: {accuracy}") main() ``` ### Plot the Pipeline Finally, we can visualize the pipeline using the {meth}`.to_image() ` method: ```python pipeline.to_image('scikit-learn-pipeline.png') ``` ![Graph for Scikit-learn pipeline](../_static/assets/scikit-learn-pipeline.jpeg)