MartialTerran
/

Neural_Nets_Doing_Simple_Tasks

Model card Files Files and versions Community

MartialTerran commited on 11 days ago

Commit

25652b9

verified ·

1 Parent(s): 93d3a73

Create GreaterThan_MLP_V1.1_with_FailuresAnalysis.ipynb

Browse files

Files changed (1) hide show

GreaterThan_MLP_V1.1_with_FailuresAnalysis.ipynb +467 -0

GreaterThan_MLP_V1.1_with_FailuresAnalysis.ipynb ADDED Viewed

	@@ -0,0 +1,467 @@

+# GreaterThan_MLP_V1.1_with_FailuresAnalysis.py
+"""
+The objective of GreaterThan_MLP_V1.0.py is to establish a fundamental performance baseline
+for a numerical comparison task using a deliberately simple Multi-Layer Perceptron (MLP).
+It avoids all natural language processing techniques by treating the problem as a pure binary classification
+on a fixed-size vector. The dataset consists of synthetically generated pairs of
+two-digit decimal numbers (e.g., 10.00 and 09.21),
+which are deconstructed and flattened into an 8-dimensional feature vector of their raw digits
+([1, 0, 0, 0,
+0, 9, 2, 1]).
+The model is then trained to predict a single binary label (0 for left > right, 1 for right > left),
+directly testing the MLP's capability to learn the hierarchical rules of numerical magnitude
+from the positional values of the input digits alone.
+The MLP model's task is to learn the rules of numerical magnitude from raw digits alone,
+treating the problem as a simple binary classification task.
+It's designed for maximum clarity and serves as a fundamental baseline for this reasoning problem.
+The plan is clear: a simple MLP for binary classification.
+The 8-dimensional input vector, constructed from the two 4-digit numbers, will be the focus.
+The output will cleanly indicate which number is greater. Using on-the-fly data generation.
+The generate_mlp_data function produces the correct 8-dimensional input vectors and binary labels.
+GreaterThan_MLP_V1.0.py presents a basic numerical comparison challenge using a rudimentary MLP as a baseline.
+The core approach hinges on framing the task as a binary classification problem on a fixed-length feature vector.
+Pairs of decimal numbers are converted into an 8-dimensional array of their digit values;
+for instance, 10.00 and 09.21 are transformed to [1, 0, 0, 0, 0, 9, 2, 1].
+The model's training focuses on predicting whether one number is greater than another through a single binary label.
+The MLP baseline model performs remarkably well, achieving over 99.9% accuracy in deciding "GreaterThan".
+This indicates that the underlying logic of numerical comparison can be learned from raw digits by a simple neural network,
+provided the input is structured as a fixed-size vector.
+However, even with high accuracy, failures still occur. Understanding why and on what data the model fails
+is the next critical step in ML engineering. This is how we discover dataset biases, edge cases, and architectural weaknesses.
+Here is the modified script, GreaterThan_MLP_V1.1_with_FailuresAnalysis.py
+It incorporates to automatically detect and log failures to a CSV file when accuracy is high,
+creating a valuable dataset artifact for future analysis and the development of more robust models.
+Here is the ouput of the first demonstration run in colab:
+Model initialized with 9473 parameters.
+--- Starting Training ---
+Epoch [1/100], Train Loss: 0.4015, Train Acc: 82.67%, | Val Loss: 0.1690, Val Acc: 97.03%
+Epoch [2/100], Train Loss: 0.1743, Train Acc: 92.94%, | Val Loss: 0.0974, Val Acc: 98.04%
+Epoch [3/100], Train Loss: 0.1300, Train Acc: 94.54%, | Val Loss: 0.0741, Val Acc: 98.61%
+Epoch [4/100], Train Loss: 0.1112, Train Acc: 95.20%, | Val Loss: 0.0618, Val Acc: 98.96%
+Epoch [5/100], Train Loss: 0.1019, Train Acc: 95.61%, | Val Loss: 0.0565, Val Acc: 98.79%
+Epoch [6/100], Train Loss: 0.0926, Train Acc: 96.04%, | Val Loss: 0.0498, Val Acc: 99.10%
+  -> High accuracy detected. Scanning for failures...
+    -> Logged 607 failures for 'train' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
+    -> Logged 180 failures for 'val' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
+Epoch [7/100], Train Loss: 0.0857, Train Acc: 96.33%, | Val Loss: 0.0456, Val Acc: 99.19%
+  -> High accuracy detected. Scanning for failures...
+    -> Logged 562 failures for 'train' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
+    -> Logged 161 failures for 'val' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
+Epoch [8/100], Train Loss: 0.0827, Train Acc: 96.47%, | Val Loss: 0.0430, Val Acc: 99.14%
+  -> High accuracy detected. Scanning for failures...
+    -> Logged 538 failures for 'train' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
+    -> Logged 171 failures for 'val' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
+Epoch [9/100], Train Loss: 0.0767, Train Acc: 96.73%, | Val Loss: 0.0398, Val Acc: 99.33%
+  -> High accuracy detected. Scanning for failures...
+    -> Logged 462 failures for 'train' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
+    -> Logged 133 failures for 'val' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
+Epoch [10/100], Train Loss: 0.0727, Train Acc: 96.87%, | Val Loss: 0.0376, Val Acc: 99.33%
+  -> High accuracy detected. Scanning for failures...
+    -> Logged 457 failures for 'train' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
+    -> Logged 134 failures for 'val' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
+Epoch [11/100], Train Loss: 0.0692, Train Acc: 97.04%, | Val Loss: 0.0380, Val Acc: 99.06%
+  -> High accuracy detected. Scanning for failures...
+    -> Logged 703 failures for 'train' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
+    -> Logged 189 failures for 'val' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
+Epoch [12/100], Train Loss: 0.0665, Train Acc: 97.17%, | Val Loss: 0.0333, Val Acc: 99.42%
+  -> High accuracy detected. Scanning for failures...
+    -> Logged 365 failures for 'train' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
+    -> Logged 117 failures for 'val' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
+Epoch [13/100], Train Loss: 0.0619, Train Acc: 97.36%, | Val Loss: 0.0316, Val Acc: 99.42%
+  -> High accuracy detected. Scanning for failures...
+    -> Logged 396 failures for 'train' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
+    -> Logged 115 failures for 'val' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
+Epoch [14/100], Train Loss: 0.0599, Train Acc: 97.46%, | Val Loss: 0.0301, Val Acc: 99.41%
+  -> High accuracy detected. Scanning for failures...
+    -> Logged 397 failures for 'train' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
+    -> Logged 119 failures for 'val' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
+Epoch [15/100], Train Loss: 0.0568, Train Acc: 97.63%, | Val Loss: 0.0282, Val Acc: 99.47%
+  -> High accuracy detected. Scanning for failures...
+    -> Logged 359 failures for 'train' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
+    -> Logged 107 failures for 'val' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
+Epoch [16/100], Train Loss: 0.0550, Train Acc: 97.72%, | Val Loss: 0.0266, Val Acc: 99.53%
+  -> High accuracy detected. Scanning for failures...
+    -> Logged 331 failures for 'train' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
+    -> Logged 94 failures for 'val' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
+Epoch [17/100], Train Loss: 0.0524, Train Acc: 97.80%, | Val Loss: 0.0256, Val Acc: 99.55%
+  -> High accuracy detected. Scanning for failures...
+    -> Logged 321 failures for 'train' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
+    -> Logged 91 failures for 'val' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
+Epoch [18/100], Train Loss: 0.0504, Train Acc: 97.93%, | Val Loss: 0.0240, Val Acc: 99.56%
+  -> High accuracy detected. Scanning for failures...
+    -> Logged 290 failures for 'train' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
+    -> Logged 87 failures for 'val' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
+Epoch [19/100], Train Loss: 0.0472, Train Acc: 98.04%, | Val Loss: 0.0228, Val Acc: 99.53%
+  -> High accuracy detected. Scanning for failures...
+    -> Logged 288 failures for 'train' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
+    -> Logged 93 failures for 'val' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
+Epoch [20/100], Train Loss: 0.0447, Train Acc: 98.16%, | Val Loss: 0.0216, Val Acc: 99.61%
+  -> High accuracy detected. Scanning for failures...
+    -> Logged 289 failures for 'train' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
+    -> Logged 78 failures for 'val' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
+Epoch [21/100], Train Loss: 0.0445, Train Acc: 98.12%, | Val Loss: 0.0201, Val Acc: 99.69%
+  -> High accuracy detected. Scanning for failures...
+    -> Logged 240 failures for 'train' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
+    -> Logged 63 failures for 'val' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
+Epoch [22/100], Train Loss: 0.0412, Train Acc: 98.29%, | Val Loss: 0.0191, Val Acc: 99.65%
+  -> High accuracy detected. Scanning for failures...
+    -> Logged 227 failures for 'train' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
+    -> Logged 70 failures for 'val' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
+Epoch [23/100], Train Loss: 0.0395, Train Acc: 98.35%, | Val Loss: 0.0181, Val Acc: 99.65%
+  -> High accuracy detected. Scanning for failures...
+    -> Logged 236 failures for 'train' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
+    -> Logged 70 failures for 'val' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
+Epoch [24/100], Train Loss: 0.0373, Train Acc: 98.48%, | Val Loss: 0.0170, Val Acc: 99.71%
+  -> High accuracy detected. Scanning for failures...
+    -> Logged 209 failures for 'train' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
+    -> Logged 58 failures for 'val' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
+Epoch [25/100], Train Loss: 0.0362, Train Acc: 98.53%, | Val Loss: 0.0164, Val Acc: 99.68%
+  -> High accuracy detected. Scanning for failures...
+    -> Logged 222 failures for 'train' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
+    -> Logged 64 failures for 'val' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
+Epoch [26/100], Train Loss: 0.0345, Train Acc: 98.61%, | Val Loss: 0.0153, Val Acc: 99.73%
+  -> High accuracy detected. Scanning for failures...
+    -> Logged 199 failures for 'train' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
+    -> Logged 53 failures for 'val' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
+Epoch [27/100], Train Loss: 0.0317, Train Acc: 98.74%, | Val Loss: 0.0149, Val Acc: 99.61%
+  -> High accuracy detected. Scanning for failures...
+    -> Logged 253 failures for 'train' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
+    -> Logged 78 failures for 'val' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
+Epoch [28/100], Train Loss: 0.0302, Train Acc: 98.80%, | Val Loss: 0.0134, Val Acc: 99.80%
+  -> High accuracy detected. Scanning for failures...
+    -> Logged 162 failures for 'train' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
+    -> Logged 40 failures for 'val' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
+Epoch [29/100], Train Loss: 0.0299, Train Acc: 98.80%, | Val Loss: 0.0127, Val Acc: 99.77%
+  -> High accuracy detected. Scanning for failures...
+    -> Logged 163 failures for 'train' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
+    -> Logged 46 failures for 'val' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
+Epoch [30/100], Train Loss: 0.0261, Train Acc: 98.98%, | Val Loss: 0.0125, Val Acc: 99.68%
+  -> High accuracy detected. Scanning for failures...
+    -> Logged 240 failures for 'train' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
+    -> Logged 64 failures for 'val' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
+Epoch [31/100], Train Loss: 0.0251, Train Acc: 99.05%, | Val Loss: 0.0110, Val Acc: 99.84%
+  -> High accuracy detected. Scanning for failures...
+    -> Logged 135 failures for 'train' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
+    -> Logged 32 failures for 'val' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
+Epoch [32/100], Train Loss: 0.0246, Train Acc: 99.01%, | Val Loss: 0.0108, Val Acc: 99.78%
+  -> High accuracy detected. Scanning for failures...
+    -> Logged 167 failures for 'train' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
+    -> Logged 43 failures for 'val' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
+Epoch [33/100], Train Loss: 0.0237, Train Acc: 99.07%, | Val Loss: 0.0103, Val Acc: 99.83%
+  -> High accuracy detected. Scanning for failures...
+    -> Logged 121 failures for 'train' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
+    -> Logged 34 failures for 'val' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
+Epoch [34/100], Train Loss: 0.0224, Train Acc: 99.14%, | Val Loss: 0.0096, Val Acc: 99.86%
+  -> High accuracy detected. Scanning for failures...
+    -> Logged 127 failures for 'train' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
+    -> Logged 29 failures for 'val' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
+Epoch [35/100], Train Loss: 0.0220, Train Acc: 99.15%, | Val Loss: 0.0092, Val Acc: 99.89%
+  -> High accuracy detected. Scanning for failures...
+    -> Logged 100 failures for 'train' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
+    -> Logged 23 failures for 'val' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
+Epoch [36/100], Train Loss: 0.0204, Train Acc: 99.22%, | Val Loss: 0.0090, Val Acc: 99.83%
+  -> High accuracy detected. Scanning for failures...
+    -> Logged 126 failures for 'train' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
+    -> Logged 34 failures for 'val' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
+Epoch [37/100], Train Loss: 0.0194, Train Acc: 99.25%, | Val Loss: 0.0083, Val Acc: 99.89%
+  -> High accuracy detected. Scanning for failures...
+    -> Logged 93 failures for 'train' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
+    -> Logged 23 failures for 'val' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
+Epoch [38/100], Train Loss: 0.0191, Train Acc: 99.25%, | Val Loss: 0.0081, Val Acc: 99.85%
+  -> High accuracy detected. Scanning for failures...
+    -> Logged 110 failures for 'train' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
+    -> Logged 30 failures for 'val' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
+Epoch [39/100], Train Loss: 0.0182, Train Acc: 99.31%, | Val Loss: 0.0076, Val Acc: 99.89%
+  -> High accuracy detected. Scanning for failures...
+    -> Logged 74 failures for 'train' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
+    -> Logged 22 failures for 'val' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
+"""
+# REFACTORING MISSION:
+# This script objective is to perform
+# automated failure analysis. When training or validation accuracy surpasses
+# a 99% threshold, the script will automatically log the specific samples
+# that the model failed on. These failures are appended to a CSV file for
+# later inspection, which is invaluable for creating targeted test sets or
+# improving the training data.
+import torch
+import torch.nn as nn
+from torch.utils.data import TensorDataset, DataLoader
+import random
+import numpy as np
+import zipfile
+import os
+import sys
+# ==============================================================================
+# Part 0: Configuration
+# ==============================================================================
+class Config:
+    # --- Data ---
+    num_samples = 100000 # Increased dataset size for more robust training
+    train_split = 0.8
+    # --- Model Architecture ---
+    input_size = 8
+    hidden_size_1 = 128
+    hidden_size_2 = 64
+    output_size = 1
+    # --- Training ---
+    learning_rate = 1e-4 # Slightly lower LR for finer tuning
+    batch_size = 256
+    epochs = 100 # Reduced epochs to 20 as convergence should be faster with more data
+    weight_decay = 1e-4
+    # --- NEW: Failure Analysis ---
+    # The accuracy threshold to trigger logging of failed samples.
+    failure_log_threshold = 99.0
+    # The name of the script, used for the output CSV file.
+    script_name = "GreaterThan_MLP_V1.1_FailureAnalysis"
+    failure_log_filename = f"{script_name}_failed_samples.csv"
+    # --- Device ---
+    device = 'cuda' if torch.cuda.is_available() else 'cpu'
+    print(f"Using device: {device}")
+config = Config()
+# For reproducibility
+torch.manual_seed(1337)
+random.seed(1337)
+np.random.seed(1337)
+# ==============================================================================
+# Part 1: Colab Utility & Data Generation (Unchanged from V1.0)
+# ==============================================================================
+def is_in_colab():
+    """Checks if the script is running in a Google Colab environment."""
+    try:
+        import google.colab
+        return True
+    except ImportError:
+        return False
+def generate_mlp_data(num_samples):
+    """Generates synthetic data for the MLP."""
+    print(f"Generating {num_samples} data points...")
+    features, labels = [], []
+    for _ in range(num_samples):
+        a = round(random.uniform(0, 99.99), 2)
+        b = round(random.uniform(0, 99.99), 2)
+        while a == b:
+            b = round(random.uniform(0, 99.99), 2)
+        a_str, b_str = f"{a:05.2f}", f"{b:05.2f}"
+        a_digits, b_digits = [int(d) for d in a_str if d.isdigit()], [int(d) for d in b_str if d.isdigit()]
+        features.append(a_digits + b_digits)
+        labels.append(0 if a > b else 1)
+    X = torch.tensor(features, dtype=torch.float32)
+    y = torch.tensor(labels, dtype=torch.float32).unsqueeze(1)
+    print("Data generation complete.")
+    return X, y
+# ==============================================================================
+# Part 2: Model Architecture (Unchanged from V1.0)
+# ==============================================================================
+class SimpleMLP(nn.Module):
+    def __init__(self, input_size, hidden_size_1, hidden_size_2, output_size):
+        super().__init__()
+        self.net = nn.Sequential(
+            nn.Linear(input_size, hidden_size_1), nn.ReLU(), nn.Dropout(0.2),
+            nn.Linear(hidden_size_1, hidden_size_2), nn.ReLU(), nn.Dropout(0.2),
+            nn.Linear(hidden_size_2, output_size)
+        )
+    def forward(self, x):
+        return self.net(x)
+# ==============================================================================
+# Part 3: MODIFIED Training and Evaluation Loop
+# ==============================================================================
+def log_failures(model, loader, split, epoch, filename, device):
+    """
+    NEW: Iterates through a data loader, finds incorrect predictions,
+    and appends them to the specified CSV log file.
+    """
+    model.eval()
+    failures_found = 0
+    with torch.no_grad():
+        for inputs, labels in loader:
+            inputs, labels = inputs.to(device), labels.to(device)
+            outputs = model(inputs)
+            predicted = torch.round(torch.sigmoid(outputs))
+            mismatch_indices = (predicted != labels).squeeze()
+            if mismatch_indices.any():
+                failed_inputs = inputs[mismatch_indices]
+                failed_true_labels = labels[mismatch_indices]
+                failed_pred_labels = predicted[mismatch_indices]
+                with open(filename, 'a') as f:
+                    for i in range(failed_inputs.size(0)):
+                        # Format the input vector back into a readable string
+                        input_vec_int = failed_inputs[i].cpu().numpy().astype(int)
+                        num1_str = f"{input_vec_int[0]}{input_vec_int[1]}.{input_vec_int[2]}{input_vec_int[3]}"
+                        num2_str = f"{input_vec_int[4]}{input_vec_int[5]}.{input_vec_int[6]}{input_vec_int[7]}"
+                        true_label = int(failed_true_labels[i].item())
+                        pred_label = int(failed_pred_labels[i].item())
+                        f.write(f"{epoch},{split},{num1_str},{num2_str},{true_label},{pred_label}\n")
+                        failures_found += 1
+    if failures_found > 0:
+        print(f"    -> Logged {failures_found} failures for '{split}' split to {filename}")
+def train_model(model, train_loader, val_loader, optimizer, criterion, epochs, config):
+    """The main training loop, now with failure logging."""
+    print("\n--- Starting Training ---")
+    # NEW: Initialize the failure log file with a header
+    with open(config.failure_log_filename, 'w') as f:
+        f.write("epoch,split,num1,num2,true_label(0:L>R;1:R>L),predicted_label\n")
+    for epoch in range(epochs):
+        model.train()
+        train_loss, train_correct, train_total = 0, 0, 0
+        for inputs, labels in train_loader:
+            inputs, labels = inputs.to(config.device), labels.to(config.device)
+            outputs = model(inputs)
+            loss = criterion(outputs, labels)
+            optimizer.zero_grad()
+            loss.backward()
+            optimizer.step()
+            train_loss += loss.item()
+            predicted = torch.round(torch.sigmoid(outputs))
+            train_total += labels.size(0)
+            train_correct += (predicted == labels).sum().item()
+        train_avg_loss = train_loss / len(train_loader)
+        train_accuracy = 100 * train_correct / train_total
+        model.eval()
+        val_loss, val_correct, val_total = 0, 0, 0
+        with torch.no_grad():
+            for inputs, labels in val_loader:
+                inputs, labels = inputs.to(config.device), labels.to(config.device)
+                outputs = model(inputs)
+                val_loss += criterion(outputs, labels).item()
+                predicted = torch.round(torch.sigmoid(outputs))
+                val_total += labels.size(0)
+                val_correct += (predicted == labels).sum().item()
+        val_avg_loss = val_loss / len(val_loader)
+        val_accuracy = 100 * val_correct / val_total
+        print(f"Epoch [{epoch+1}/{epochs}], "
+              f"Train Loss: {train_avg_loss:.4f}, Train Acc: {train_accuracy:.2f}%, | "
+              f"Val Loss: {val_avg_loss:.4f}, Val Acc: {val_accuracy:.2f}%")
+        # --- NEW: Conditional Failure Logging ---
+        if train_accuracy > config.failure_log_threshold or val_accuracy > config.failure_log_threshold:
+            print(f"  -> High accuracy detected. Scanning for failures...")
+            # Re-iterate over loaders to find and log the specific failures for this epoch
+            log_failures(model, train_loader, 'train', epoch + 1, config.failure_log_filename, config.device)
+            log_failures(model, val_loader, 'val', epoch + 1, config.failure_log_filename, config.device)
+    print("--- Training Finished ---")
+# ==============================================================================
+# Part 4: MODIFIED Final Test Suite
+# ==============================================================================
+def run_final_tests(model, config):
+    """Runs the trained model against a suite of hardcoded test cases and logs failures."""
+    print("\n--- Running Final Test Suite ---")
+    test_cases = [
+        ("Simple Greater", 10.00, 9.21), ("Simple Lesser", 5.50, 50.50),
+        ("Decimal Greater", 54.13, 54.12), ("Decimal Lesser", 99.98, 99.99),
+        ("Edge Case: Large Difference", 0.01, 99.99), ("Edge Case: Zero", 0.00, 5.00),
+        ("Tricky: Same Integer Part", 25.80, 25.79), ("Tricky: Crossover", 49.99, 50.00),
+    ]
+    results_log = "--- MLP Test Suite Results ---\n\n"
+    correct_tests = 0
+    model.eval()
+    with torch.no_grad():
+        for description, a, b in test_cases:
+            a_str, b_str = f"{a:05.2f}", f"{b:05.2f}"
+            a_digits, b_digits = [int(d) for d in a_str if d.isdigit()], [int(d) for d in b_str if d.isdigit()]
+            feature_vector = torch.tensor(a_digits + b_digits, dtype=torch.float32).to(config.device)
+            output = model(feature_vector)
+            predicted_class = 1 if torch.sigmoid(output).item() > 0.5 else 0
+            ground_truth_class = 0 if a > b else 1
+            result = "CORRECT"
+            if predicted_class != ground_truth_class:
+                result = "INCORRECT"
+                # --- NEW: Log failure to CSV ---
+                with open(config.failure_log_filename, 'a') as f:
+                    f.write(f"final_test,{description.replace(',',';')},{a_str},{b_str},{ground_truth_class},{predicted_class}\n")
+            else:
+                correct_tests += 1
+            predicted_winner = "Left" if predicted_class == 0 else "Right"
+            log_line = (f"Test: '{description}' | {a_str} vs {b_str}\n"
+                        f"  -> Model says: {predicted_winner} is greater\n"
+                        f"  -> Result: {result}\n" + "-"*30 + "\n")
+            print(log_line)
+            results_log += log_line
+    final_accuracy = 100 * correct_tests / len(test_cases)
+    summary = f"\nFinal Test Accuracy: {final_accuracy:.2f}% ({correct_tests}/{len(test_cases)} correct)\n"
+    print(summary)
+    results_log += summary
+    return results_log
+# ==============================================================================
+# Main Execution Block
+# ==============================================================================
+if __name__ == '__main__':
+    X, y = generate_mlp_data(config.num_samples)
+    dataset = TensorDataset(X, y)
+    train_size = int(config.train_split * len(dataset))
+    val_size = len(dataset) - train_size
+    train_dataset, val_dataset = torch.utils.data.random_split(dataset, [train_size, val_size])
+    train_loader = DataLoader(train_dataset, batch_size=config.batch_size, shuffle=True)
+    val_loader = DataLoader(val_dataset, batch_size=config.batch_size, shuffle=False)
+    model = SimpleMLP(config.input_size, config.hidden_size_1, config.hidden_size_2, config.output_size).to(config.device)
+    print(f"\nModel initialized with {sum(p.numel() for p in model.parameters())} parameters.")
+    optimizer = torch.optim.AdamW(model.parameters(), lr=config.learning_rate, weight_decay=config.weight_decay)
+    criterion = nn.BCEWithLogitsLoss()
+    train_model(model, train_loader, val_loader, optimizer, criterion, config.epochs, config)
+    test_results = run_final_tests(model, config)
+    if is_in_colab():
+        print(f"\nDetected Google Colab environment. Zipping and downloading results...")
+        results_filename = f"{config.script_name}_test_summary.txt"
+        with open(results_filename, "w") as f:
+            f.write(test_results)
+        zip_filename = f"{config.script_name}_outputs.zip"
+        with zipfile.ZipFile(zip_filename, 'w') as zipf:
+            zipf.write(results_filename)
+            # Also include the new failure log in the zip file
+            if os.path.exists(config.failure_log_filename):
+                zipf.write(config.failure_log_filename)
+        try:
+            from google.colab import files
+            files.download(zip_filename)
+            print(f"Downloaded {zip_filename} successfully.")
+        except Exception as e:
+            print(f"Could not initiate download. Error: {e}")
+    else:
+        print(f"\nNot running in Colab. Test results printed above. Failures logged to {config.failure_log_filename}")