{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# How to optimize with optuna an MLP using the Pipeline"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Import classes and define paths"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "from cetaceo.pipeline import Pipeline\n",
    "from cetaceo.models import MLP\n",
    "from cetaceo.optimization import OptunaBaseOptimizer\n",
    "from cetaceo.data import VTUDataset\n",
    "from cetaceo.evaluators import RegressionEvaluator\n",
    "from cetaceo.utils import PathManager\n",
    "from pathlib import Path\n",
    "\n",
    "import torch\n",
    "from sklearn.preprocessing import MinMaxScaler\n",
    "\n",
    "import warnings\n",
    "warnings.filterwarnings(\"ignore\")\n",
    "\n",
    "import numpy as np\n",
    "# since version 1.20.0, numpy has changed the name of the bool type and np.bool is deprecated\n",
    "# we need to change it to np.bool_ to avoid error with some libraries that still use the old name\n",
    "np.bool = np.bool_"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "DATA_DIR = Path.cwd().parent / \"sample_data\"\n",
    "CASE_DIR = Path.cwd() / \"results\"\n",
    "PathManager.create_directory(CASE_DIR / 'models')\n",
    "PathManager.create_directory(CASE_DIR / 'hyperparameters')\n",
    "PathManager.create_directory(CASE_DIR / 'plots')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Define sklearn scalers if needed\n",
    "\n",
    "Here, we create 2 minmax scalers, one for scaling the inputs, and other for the outputs."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "x_scaler = MinMaxScaler()\n",
    "y_scaler = MinMaxScaler()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Create datasets\n",
    "For this example, we will use some VTU (VTK like) meshes obtained with CODA. Thus, a `VTUDataset` is needed. We create one for each dataset split.   \n",
    "The meshes are in separate files "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "mesh_files = list(DATA_DIR.glob(\"*.vtu\"))\n",
    "n_test_samples = int(0.4 * len(mesh_files))\n",
    "\n",
    "train_files = mesh_files[n_test_samples:]\n",
    "test_files = mesh_files[:n_test_samples // 2]\n",
    "valid_files = mesh_files[n_test_samples // 2:n_test_samples]\n",
    "\n",
    "train_dataset = VTUDataset(mesh_files=train_files, x_scaler=x_scaler, y_scaler=y_scaler)\n",
    "test_dataset = VTUDataset(mesh_files=test_files, x_scaler=x_scaler, y_scaler=y_scaler)\n",
    "valid_dataset = VTUDataset(mesh_files=valid_files, x_scaler=x_scaler, y_scaler=y_scaler)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Train dataset length : 175224\n",
      "Test dataset length : 58408\n",
      "Valid dataset length : 58408\n",
      "X, y train shapes:\n",
      " torch.Size([175224, 2]) torch.Size([175224, 7])\n"
     ]
    }
   ],
   "source": [
    "x, y = train_dataset[:]\n",
    "print(\"Train dataset length :\", len(train_dataset))\n",
    "print(\"Test dataset length :\", len(test_dataset))\n",
    "print(\"Valid dataset length :\", len(valid_dataset))\n",
    "print(\"X, y train shapes:\\n\", x.shape, y.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If we need to preprocess the data somehow, we can use the method `process_data` from the datasets. For instance, we need the velocities from the meshes, but we only have the momentum and the density. We can then apply the following transformation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "def process_cell_data(x, y):\n",
    "    rho = y[..., 0:1]\n",
    "    # get velocities from momentum and density\n",
    "    u = y[..., 2:3] / rho\n",
    "    v = y[..., 3:4] / rho\n",
    "    p = y[..., 4:5]\n",
    "    return x, torch.hstack([rho, u, v, p])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "train_dataset.process_data(process_function=process_cell_data)\n",
    "test_dataset.process_data(process_function=process_cell_data)\n",
    "valid_dataset.process_data(process_function=process_cell_data)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And now we can scale the data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "train_dataset.scale_data()\n",
    "valid_dataset.scale_data()\n",
    "test_dataset.scale_data()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Evaluator\n",
    "\n",
    "For this example, we are going to use a `RegressionEvaluator`, this time no plots will be generated"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "evaluator = RegressionEvaluator()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Optimization\n",
    "\n",
    "For optimization, first we need to define a set of parameters.  \n",
    "* If a parameter is not defined, its default value will be taken.  \n",
    "* If a parameter is defined with a single value, that parameter won't be optimized and that value will be constant during the whole optimization process.  \n",
    "* If a paramer is defined with a tuple with 2 elelements, that parameter will be optimized and the values chosen during the optimization phase will be on the range defined by the tuple. \n",
    "\n",
    "If we want to store the best parameters obtained during the optimization, a `save_dir` must be specified on the optimizer constructor. Then, they will be stored in a json file"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "device = torch.device(\"cuda:3\" if torch.cuda.is_available() else \"cpu\")\n",
    "optim_params = {\n",
    "        \"lr\": 0.01,  # fixed parameter\n",
    "        \"batch_size\": (10, 64),  # optimizable parameter\n",
    "        \"n_layers\": (1, 3),\n",
    "        \"hidden_size\": 32,\n",
    "        \"print_rate_epoch\": 10,\n",
    "        \"epochs\": 50,\n",
    "        \"device\": device,\n",
    "    }"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "optimizer = OptunaBaseOptimizer(\n",
    "        optimization_params=optim_params, n_trials=5, direction=\"minimize\", save_dir=CASE_DIR / 'hyperparameters'\n",
    "    )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Run the pipeline\n",
    "\n",
    "When optimizing, you have to specify the model class to optimize. In this case, we are optimizing an MLP."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[I 2024-09-11 14:54:38,215] A new study created in memory with name: no-name-d6c8ce02-114c-4fd1-8890-c0486091bf01\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Epoch 10/50 | Train loss (x1e5) 402.1901 \n",
      "Epoch 20/50 | Train loss (x1e5) 402.2421 \n",
      "Epoch 30/50 | Train loss (x1e5) 402.4459 \n",
      "Epoch 40/50 | Train loss (x1e5) 401.9750 \n",
      "Epoch 50/50 | Train loss (x1e5) 402.0932 \n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[I 2024-09-11 15:12:45,447] Trial 0 finished with value: 0.003009177435055923 and parameters: {'batch_size': 26, 'n_layers': 3}. Best is trial 0 with value: 0.003009177435055923.\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Epoch 10/50 | Train loss (x1e5) 334.2382 \n",
      "Epoch 20/50 | Train loss (x1e5) 321.1680 \n",
      "Epoch 30/50 | Train loss (x1e5) 307.4511 \n",
      "Epoch 40/50 | Train loss (x1e5) 299.3205 \n",
      "Epoch 50/50 | Train loss (x1e5) 295.5657 \n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[I 2024-09-11 15:20:02,835] Trial 1 finished with value: 0.0022019218127490216 and parameters: {'batch_size': 47, 'n_layers': 1}. Best is trial 1 with value: 0.0022019218127490216.\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Epoch 10/50 | Train loss (x1e5) 322.4274 \n",
      "Epoch 20/50 | Train loss (x1e5) 307.9958 \n",
      "Epoch 30/50 | Train loss (x1e5) 302.4046 \n",
      "Epoch 40/50 | Train loss (x1e5) 295.0588 \n",
      "Epoch 50/50 | Train loss (x1e5) 295.6092 \n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[I 2024-09-11 15:26:58,499] Trial 2 finished with value: 0.002151052706345495 and parameters: {'batch_size': 50, 'n_layers': 1}. Best is trial 2 with value: 0.002151052706345495.\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Epoch 10/50 | Train loss (x1e5) 316.7952 \n",
      "Epoch 20/50 | Train loss (x1e5) 305.2839 \n",
      "Epoch 30/50 | Train loss (x1e5) 300.2761 \n",
      "Epoch 40/50 | Train loss (x1e5) 297.5901 \n",
      "Epoch 50/50 | Train loss (x1e5) 294.3364 \n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[I 2024-09-11 15:58:24,977] Trial 3 finished with value: 0.002080027835441825 and parameters: {'batch_size': 13, 'n_layers': 2}. Best is trial 3 with value: 0.002080027835441825.\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Epoch 10/50 | Train loss (x1e5) 402.7308 \n",
      "Epoch 20/50 | Train loss (x1e5) 402.7523 \n",
      "Epoch 30/50 | Train loss (x1e5) 402.8952 \n",
      "Epoch 40/50 | Train loss (x1e5) 402.4810 \n",
      "Epoch 50/50 | Train loss (x1e5) 402.5232 \n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[I 2024-09-11 16:16:17,864] Trial 4 finished with value: 0.003013648607862439 and parameters: {'batch_size': 23, 'n_layers': 3}. Best is trial 3 with value: 0.002080027835441825.\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "Study statistics: \n",
      "  Number of finished trials:  5\n",
      "  Number of pruned trials:  0\n",
      "  Number of completed trials:  5\n",
      "Best trial:\n",
      "  Value:  0.002080027835441825\n",
      "  Params: \n",
      "    batch_size: 13\n",
      "    n_layers: 2\n",
      "\n",
      "\n",
      "Epoch 10/50 | Train loss (x1e5) 404.7949  | Test loss (x1e5) 309.9594\n",
      "Epoch 20/50 | Train loss (x1e5) 404.7411  | Test loss (x1e5) 307.5095\n",
      "Epoch 30/50 | Train loss (x1e5) 404.5180  | Test loss (x1e5) 299.7614\n",
      "Epoch 40/50 | Train loss (x1e5) 404.8046  | Test loss (x1e5) 314.5927\n",
      "Epoch 50/50 | Train loss (x1e5) 405.0134  | Test loss (x1e5) 299.8414\n",
      "\n",
      "--------------------------------------------------\n",
      "Metrics on train data:\n",
      "--------------------------------------------------\n",
      "\n",
      "Regression evaluator metrics:\n",
      "mse: 0.0040\n",
      "mae: 0.0299\n",
      "mre: 445.1378%\n",
      "ae_95: 0.1222\n",
      "ae_99: 0.3160\n",
      "r2: 0.9754\n",
      "l2_error: 0.0834\n",
      "--------------------------------------------------\n",
      "Metrics on test data:\n",
      "--------------------------------------------------\n",
      "\n",
      "Regression evaluator metrics:\n",
      "mse: 0.0031\n",
      "mae: 0.0276\n",
      "mre: 429.8295%\n",
      "ae_95: 0.1078\n",
      "ae_99: 0.2711\n",
      "r2: 0.9813\n",
      "l2_error: 0.0726\n"
     ]
    }
   ],
   "source": [
    "pipeline = Pipeline(\n",
    "        train_dataset=train_dataset,\n",
    "        test_dataset=test_dataset,\n",
    "        valid_dataset=valid_dataset,\n",
    "        model_class=MLP,\n",
    "        optimizer=optimizer,\n",
    "        evaluators=[evaluator]\n",
    "    )\n",
    "pipeline.run()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To save the trained model with the optimized parameters"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "pipeline.model.save(CASE_DIR / 'models' / 'model.pth')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To load a model, simply use Model.load method"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "model = MLP.load(CASE_DIR / 'models' / 'model.pth')"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "cetaceo",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}