# GreaterThan_MLP_V1.1_with_FailuresAnalysis.py
"""
The objective of GreaterThan_MLP_V1.0.py is to establish a fundamental performance baseline
for a numerical comparison task using a deliberately simple Multi-Layer Perceptron (MLP).
It avoids all natural language processing techniques by treating the problem as pure binary
classification on a fixed-size vector. The dataset consists of synthetically generated pairs
of decimal numbers in [00.00, 99.99], each formatted with two integer and two fractional
digits (e.g., 10.00 and 09.21), which are deconstructed and flattened into an 8-dimensional
feature vector of their raw digits ([1, 0, 0, 0, 0, 9, 2, 1]).
The model is then trained to predict a single binary label (0 for left > right, 1 for
right > left), directly testing the MLP's capability to learn the hierarchical rules of
numerical magnitude from the positional values of the input digits alone.
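
A minimal sketch of this digit encoding (mirroring the generate_mlp_data function below):

    a, b = 10.00, 9.21
    a_str, b_str = f"{a:05.2f}", f"{b:05.2f}"   # "10.00", "09.21"
    features = [int(d) for d in a_str + b_str if d.isdigit()]
    # -> [1, 0, 0, 0, 0, 9, 2, 1]
    label = 0 if a > b else 1                   # -> 0 (left is greater)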

The plan is deliberately simple: an MLP trained for binary classification, designed for
maximum clarity so it can serve as a fundamental baseline for this reasoning problem.
The 8-dimensional input vector, built from the four digits of each number, is the model's
entire interface, and the output cleanly indicates which number is greater. Data is
generated on the fly; the generate_mlp_data function produces the 8-dimensional input
vectors and binary labels. A small hand-picked test suite probes the trained model on
simple, edge, and tricky cases:

    test_cases = [
        ("Simple Greater", 10.00, 9.21), ("Simple Lesser", 5.50, 50.50),
        ("Decimal Greater", 54.13, 54.12), ("Decimal Lesser", 99.98, 99.99),
        ("Edge Case: Large Difference", 0.01, 99.99), ("Edge Case: Zero", 0.00, 5.00),
        ("Tricky: Same Integer Part", 25.80, 25.79), ("Tricky: Crossover", 49.99, 50.00),
    ]

The MLP baseline model performs remarkably well, reaching close to 99.9% validation accuracy
on the "GreaterThan" decision. This indicates that the underlying logic of numerical comparison
can be learned from raw digits by a simple neural network, provided the input is structured
as a fixed-size vector.
However, even at high accuracy, failures still occur. Understanding why, and on which data,
the model fails is the next critical step in ML engineering: it is how we discover dataset
biases, edge cases, and architectural weaknesses.
Here is the modified script, GreaterThan_MLP_V1.1_with_FailuresAnalysis.py.
It adds logic to automatically detect and log failures to a CSV file whenever accuracy
crosses a threshold, creating a valuable dataset artifact for future analysis and the
development of more robust models.
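
As a hedged sketch of how that CSV artifact might be mined afterwards (the pandas usage
below is an assumption for illustration, not part of this script):

    import pandas as pd
    df = pd.read_csv("GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv")
    # Hypothesis to test: most failures are near-ties with equal integer parts.
    near_ties = df[df["num1"].astype(float).astype(int) == df["num2"].astype(float).astype(int)]
    print(f"{len(near_ties)}/{len(df)} failures share the same integer part")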

Here is the output of the first demonstration run in Colab:
Model initialized with 9473 parameters.

--- Starting Training ---
Epoch [1/100], Train Loss: 0.4015, Train Acc: 82.67%, | Val Loss: 0.1690, Val Acc: 97.03%
Epoch [2/100], Train Loss: 0.1743, Train Acc: 92.94%, | Val Loss: 0.0974, Val Acc: 98.04%
Epoch [3/100], Train Loss: 0.1300, Train Acc: 94.54%, | Val Loss: 0.0741, Val Acc: 98.61%
Epoch [4/100], Train Loss: 0.1112, Train Acc: 95.20%, | Val Loss: 0.0618, Val Acc: 98.96%
Epoch [5/100], Train Loss: 0.1019, Train Acc: 95.61%, | Val Loss: 0.0565, Val Acc: 98.79%
Epoch [6/100], Train Loss: 0.0926, Train Acc: 96.04%, | Val Loss: 0.0498, Val Acc: 99.10%
  -> High accuracy detected. Scanning for failures...
    -> Logged 607 failures for 'train' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
    -> Logged 180 failures for 'val' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
Epoch [7/100], Train Loss: 0.0857, Train Acc: 96.33%, | Val Loss: 0.0456, Val Acc: 99.19%
  -> High accuracy detected. Scanning for failures...
    -> Logged 562 failures for 'train' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
    -> Logged 161 failures for 'val' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
Epoch [8/100], Train Loss: 0.0827, Train Acc: 96.47%, | Val Loss: 0.0430, Val Acc: 99.14%
  -> High accuracy detected. Scanning for failures...
    -> Logged 538 failures for 'train' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
    -> Logged 171 failures for 'val' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
Epoch [9/100], Train Loss: 0.0767, Train Acc: 96.73%, | Val Loss: 0.0398, Val Acc: 99.33%
  -> High accuracy detected. Scanning for failures...
    -> Logged 462 failures for 'train' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
    -> Logged 133 failures for 'val' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
Epoch [10/100], Train Loss: 0.0727, Train Acc: 96.87%, | Val Loss: 0.0376, Val Acc: 99.33%
  -> High accuracy detected. Scanning for failures...
    -> Logged 457 failures for 'train' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
    -> Logged 134 failures for 'val' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
Epoch [11/100], Train Loss: 0.0692, Train Acc: 97.04%, | Val Loss: 0.0380, Val Acc: 99.06%
  -> High accuracy detected. Scanning for failures...
    -> Logged 703 failures for 'train' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
    -> Logged 189 failures for 'val' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
Epoch [12/100], Train Loss: 0.0665, Train Acc: 97.17%, | Val Loss: 0.0333, Val Acc: 99.42%
  -> High accuracy detected. Scanning for failures...
    -> Logged 365 failures for 'train' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
    -> Logged 117 failures for 'val' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
Epoch [13/100], Train Loss: 0.0619, Train Acc: 97.36%, | Val Loss: 0.0316, Val Acc: 99.42%
  -> High accuracy detected. Scanning for failures...
    -> Logged 396 failures for 'train' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
    -> Logged 115 failures for 'val' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
Epoch [14/100], Train Loss: 0.0599, Train Acc: 97.46%, | Val Loss: 0.0301, Val Acc: 99.41%
  -> High accuracy detected. Scanning for failures...
    -> Logged 397 failures for 'train' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
    -> Logged 119 failures for 'val' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
Epoch [15/100], Train Loss: 0.0568, Train Acc: 97.63%, | Val Loss: 0.0282, Val Acc: 99.47%
  -> High accuracy detected. Scanning for failures...
    -> Logged 359 failures for 'train' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
    -> Logged 107 failures for 'val' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
Epoch [16/100], Train Loss: 0.0550, Train Acc: 97.72%, | Val Loss: 0.0266, Val Acc: 99.53%
  -> High accuracy detected. Scanning for failures...
    -> Logged 331 failures for 'train' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
    -> Logged 94 failures for 'val' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
Epoch [17/100], Train Loss: 0.0524, Train Acc: 97.80%, | Val Loss: 0.0256, Val Acc: 99.55%
  -> High accuracy detected. Scanning for failures...
    -> Logged 321 failures for 'train' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
    -> Logged 91 failures for 'val' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
Epoch [18/100], Train Loss: 0.0504, Train Acc: 97.93%, | Val Loss: 0.0240, Val Acc: 99.56%
  -> High accuracy detected. Scanning for failures...
    -> Logged 290 failures for 'train' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
    -> Logged 87 failures for 'val' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
Epoch [19/100], Train Loss: 0.0472, Train Acc: 98.04%, | Val Loss: 0.0228, Val Acc: 99.53%
  -> High accuracy detected. Scanning for failures...
    -> Logged 288 failures for 'train' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
    -> Logged 93 failures for 'val' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
Epoch [20/100], Train Loss: 0.0447, Train Acc: 98.16%, | Val Loss: 0.0216, Val Acc: 99.61%
  -> High accuracy detected. Scanning for failures...
    -> Logged 289 failures for 'train' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
    -> Logged 78 failures for 'val' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
Epoch [21/100], Train Loss: 0.0445, Train Acc: 98.12%, | Val Loss: 0.0201, Val Acc: 99.69%
  -> High accuracy detected. Scanning for failures...
    -> Logged 240 failures for 'train' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
    -> Logged 63 failures for 'val' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
Epoch [22/100], Train Loss: 0.0412, Train Acc: 98.29%, | Val Loss: 0.0191, Val Acc: 99.65%
  -> High accuracy detected. Scanning for failures...
    -> Logged 227 failures for 'train' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
    -> Logged 70 failures for 'val' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
Epoch [23/100], Train Loss: 0.0395, Train Acc: 98.35%, | Val Loss: 0.0181, Val Acc: 99.65%
  -> High accuracy detected. Scanning for failures...
    -> Logged 236 failures for 'train' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
    -> Logged 70 failures for 'val' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
Epoch [24/100], Train Loss: 0.0373, Train Acc: 98.48%, | Val Loss: 0.0170, Val Acc: 99.71%
  -> High accuracy detected. Scanning for failures...
    -> Logged 209 failures for 'train' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
    -> Logged 58 failures for 'val' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
Epoch [25/100], Train Loss: 0.0362, Train Acc: 98.53%, | Val Loss: 0.0164, Val Acc: 99.68%
  -> High accuracy detected. Scanning for failures...
    -> Logged 222 failures for 'train' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
    -> Logged 64 failures for 'val' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
Epoch [26/100], Train Loss: 0.0345, Train Acc: 98.61%, | Val Loss: 0.0153, Val Acc: 99.73%
  -> High accuracy detected. Scanning for failures...
    -> Logged 199 failures for 'train' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
    -> Logged 53 failures for 'val' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
Epoch [27/100], Train Loss: 0.0317, Train Acc: 98.74%, | Val Loss: 0.0149, Val Acc: 99.61%
  -> High accuracy detected. Scanning for failures...
    -> Logged 253 failures for 'train' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
    -> Logged 78 failures for 'val' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
Epoch [28/100], Train Loss: 0.0302, Train Acc: 98.80%, | Val Loss: 0.0134, Val Acc: 99.80%
  -> High accuracy detected. Scanning for failures...
    -> Logged 162 failures for 'train' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
    -> Logged 40 failures for 'val' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
Epoch [29/100], Train Loss: 0.0299, Train Acc: 98.80%, | Val Loss: 0.0127, Val Acc: 99.77%
  -> High accuracy detected. Scanning for failures...
    -> Logged 163 failures for 'train' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
    -> Logged 46 failures for 'val' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
Epoch [30/100], Train Loss: 0.0261, Train Acc: 98.98%, | Val Loss: 0.0125, Val Acc: 99.68%
  -> High accuracy detected. Scanning for failures...
    -> Logged 240 failures for 'train' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
    -> Logged 64 failures for 'val' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
Epoch [31/100], Train Loss: 0.0251, Train Acc: 99.05%, | Val Loss: 0.0110, Val Acc: 99.84%
  -> High accuracy detected. Scanning for failures...
    -> Logged 135 failures for 'train' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
    -> Logged 32 failures for 'val' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
Epoch [32/100], Train Loss: 0.0246, Train Acc: 99.01%, | Val Loss: 0.0108, Val Acc: 99.78%
  -> High accuracy detected. Scanning for failures...
    -> Logged 167 failures for 'train' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
    -> Logged 43 failures for 'val' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
Epoch [33/100], Train Loss: 0.0237, Train Acc: 99.07%, | Val Loss: 0.0103, Val Acc: 99.83%
  -> High accuracy detected. Scanning for failures...
    -> Logged 121 failures for 'train' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
    -> Logged 34 failures for 'val' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
Epoch [34/100], Train Loss: 0.0224, Train Acc: 99.14%, | Val Loss: 0.0096, Val Acc: 99.86%
  -> High accuracy detected. Scanning for failures...
    -> Logged 127 failures for 'train' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
    -> Logged 29 failures for 'val' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
Epoch [35/100], Train Loss: 0.0220, Train Acc: 99.15%, | Val Loss: 0.0092, Val Acc: 99.89%
  -> High accuracy detected. Scanning for failures...
    -> Logged 100 failures for 'train' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
    -> Logged 23 failures for 'val' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
Epoch [36/100], Train Loss: 0.0204, Train Acc: 99.22%, | Val Loss: 0.0090, Val Acc: 99.83%
  -> High accuracy detected. Scanning for failures...
    -> Logged 126 failures for 'train' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
    -> Logged 34 failures for 'val' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
Epoch [37/100], Train Loss: 0.0194, Train Acc: 99.25%, | Val Loss: 0.0083, Val Acc: 99.89%
  -> High accuracy detected. Scanning for failures...
    -> Logged 93 failures for 'train' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
    -> Logged 23 failures for 'val' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
Epoch [38/100], Train Loss: 0.0191, Train Acc: 99.25%, | Val Loss: 0.0081, Val Acc: 99.85%
  -> High accuracy detected. Scanning for failures...
    -> Logged 110 failures for 'train' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
    -> Logged 30 failures for 'val' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
Epoch [39/100], Train Loss: 0.0182, Train Acc: 99.31%, | Val Loss: 0.0076, Val Acc: 99.89%
  -> High accuracy detected. Scanning for failures...
    -> Logged 74 failures for 'train' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv
    -> Logged 22 failures for 'val' split to GreaterThan_MLP_V1.1_FailureAnalysis_failed_samples.csv

"""
# REFACTORING MISSION:
# This script's objective is to perform automated failure analysis. When
# training or validation accuracy surpasses a 99% threshold, the script
# automatically logs the specific samples the model failed on. These
# failures are appended to a CSV file for later inspection, which is
# invaluable for creating targeted test sets or improving the training data.

import torch
import torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader
import random
import numpy as np
import zipfile
import os

# ==============================================================================
# Part 0: Configuration
# ==============================================================================
class Config:
    # --- Data ---
    num_samples = 100000 # Increased dataset size for more robust training
    train_split = 0.8

    # --- Model Architecture ---
    input_size = 8
    hidden_size_1 = 128
    hidden_size_2 = 64
    output_size = 1

    # --- Training ---
    learning_rate = 1e-4 # Slightly lower LR for finer tuning
    batch_size = 256
    epochs = 100 # Early epochs already clear 99% val accuracy; the demo log above is truncated at epoch 39
    weight_decay = 1e-4

    # --- NEW: Failure Analysis ---
    # The accuracy threshold to trigger logging of failed samples.
    failure_log_threshold = 99.0
    # The name of the script, used for the output CSV file.
    script_name = "GreaterThan_MLP_V1.1_FailureAnalysis"
    failure_log_filename = f"{script_name}_failed_samples.csv"

    # --- Device ---
    device = 'cuda' if torch.cuda.is_available() else 'cpu'

config = Config()
print(f"Using device: {config.device}")

# For reproducibility
torch.manual_seed(1337)
random.seed(1337)
np.random.seed(1337)

# ==============================================================================
# Part 1: Colab Utility & Data Generation (Unchanged from V1.0)
# ==============================================================================
def is_in_colab():
    """Checks if the script is running in a Google Colab environment."""
    try:
        import google.colab
        return True
    except ImportError:
        return False

def generate_mlp_data(num_samples):
    """Generates synthetic data for the MLP."""
    print(f"Generating {num_samples} data points...")
    features, labels = [], []
    for _ in range(num_samples):
        a = round(random.uniform(0, 99.99), 2)
        b = round(random.uniform(0, 99.99), 2)
        while a == b:
            b = round(random.uniform(0, 99.99), 2)
        a_str, b_str = f"{a:05.2f}", f"{b:05.2f}"
        a_digits, b_digits = [int(d) for d in a_str if d.isdigit()], [int(d) for d in b_str if d.isdigit()]
        features.append(a_digits + b_digits)
        labels.append(0 if a > b else 1)
    X = torch.tensor(features, dtype=torch.float32)
    y = torch.tensor(labels, dtype=torch.float32).unsqueeze(1)
    print("Data generation complete.")
    return X, y
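
# A quick sanity check of the digit encoding (illustrative sketch; not part of
# the main flow -- run it manually if desired):
#   X, y = generate_mlp_data(2)
#   X[0]  # e.g. tensor([1., 0., 0., 0., 0., 9., 2., 1.]) for the pair (10.00, 09.21)
#   y[0]  # tensor([0.]) -> left number is greater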

# ==============================================================================
# Part 2: Model Architecture (Unchanged from V1.0)
# ==============================================================================
class SimpleMLP(nn.Module):
    def __init__(self, input_size, hidden_size_1, hidden_size_2, output_size):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_size, hidden_size_1), nn.ReLU(), nn.Dropout(0.2),
            nn.Linear(hidden_size_1, hidden_size_2), nn.ReLU(), nn.Dropout(0.2),
            nn.Linear(hidden_size_2, output_size)
        )
    def forward(self, x):
        return self.net(x)
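
# Parameter count: (8*128 + 128) + (128*64 + 64) + (64*1 + 1) = 9,473 weights
# and biases, matching the "Model initialized with 9473 parameters" line in
# the demo log above.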

# ==============================================================================
# Part 3: MODIFIED Training and Evaluation Loop
# ==============================================================================

def log_failures(model, loader, split, epoch, filename, device):
    """
    NEW: Iterates through a data loader, finds incorrect predictions,
    and appends them to the specified CSV log file.
    """
    model.eval()
    failures_found = 0
    with torch.no_grad():
        for inputs, labels in loader:
            inputs, labels = inputs.to(device), labels.to(device)
            outputs = model(inputs)
            predicted = torch.round(torch.sigmoid(outputs))
            mismatch_indices = (predicted != labels).squeeze(1)  # squeeze dim 1 only, so a final batch of size 1 keeps its batch dimension
            
            if mismatch_indices.any():
                failed_inputs = inputs[mismatch_indices]
                failed_true_labels = labels[mismatch_indices]
                failed_pred_labels = predicted[mismatch_indices]
                
                with open(filename, 'a') as f:
                    for i in range(failed_inputs.size(0)):
                        # Format the input vector back into a readable string
                        input_vec_int = failed_inputs[i].cpu().numpy().astype(int)
                        num1_str = f"{input_vec_int[0]}{input_vec_int[1]}.{input_vec_int[2]}{input_vec_int[3]}"
                        num2_str = f"{input_vec_int[4]}{input_vec_int[5]}.{input_vec_int[6]}{input_vec_int[7]}"
                        
                        true_label = int(failed_true_labels[i].item())
                        pred_label = int(failed_pred_labels[i].item())
                        
                        f.write(f"{epoch},{split},{num1_str},{num2_str},{true_label},{pred_label}\n")
                        failures_found += 1
    if failures_found > 0:
        print(f"    -> Logged {failures_found} failures for '{split}' split to {filename}")

def train_model(model, train_loader, val_loader, optimizer, criterion, epochs, config):
    """The main training loop, now with failure logging."""
    print("\n--- Starting Training ---")
    
    # NEW: Initialize the failure log file with a header
    with open(config.failure_log_filename, 'w') as f:
        f.write("epoch,split,num1,num2,true_label(0:L>R;1:R>L),predicted_label\n")

    for epoch in range(epochs):
        model.train()
        train_loss, train_correct, train_total = 0, 0, 0
        for inputs, labels in train_loader:
            inputs, labels = inputs.to(config.device), labels.to(config.device)
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            train_loss += loss.item()
            predicted = torch.round(torch.sigmoid(outputs))
            train_total += labels.size(0)
            train_correct += (predicted == labels).sum().item()

        train_avg_loss = train_loss / len(train_loader)
        train_accuracy = 100 * train_correct / train_total
        
        model.eval()
        val_loss, val_correct, val_total = 0, 0, 0
        with torch.no_grad():
            for inputs, labels in val_loader:
                inputs, labels = inputs.to(config.device), labels.to(config.device)
                outputs = model(inputs)
                val_loss += criterion(outputs, labels).item()
                predicted = torch.round(torch.sigmoid(outputs))
                val_total += labels.size(0)
                val_correct += (predicted == labels).sum().item()
        
        val_avg_loss = val_loss / len(val_loader)
        val_accuracy = 100 * val_correct / val_total
        
        print(f"Epoch [{epoch+1}/{epochs}], "
              f"Train Loss: {train_avg_loss:.4f}, Train Acc: {train_accuracy:.2f}%, | "
              f"Val Loss: {val_avg_loss:.4f}, Val Acc: {val_accuracy:.2f}%")
              
        # --- NEW: Conditional Failure Logging ---
        if train_accuracy > config.failure_log_threshold or val_accuracy > config.failure_log_threshold:
            print(f"  -> High accuracy detected. Scanning for failures...")
            # Re-iterate over loaders to find and log the specific failures for this epoch
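            # Note: this rescan runs in eval mode with end-of-epoch weights, so
            # the logged train-failure counts are typically far lower than the
            # running train accuracy above (computed with dropout active) implies.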
            log_failures(model, train_loader, 'train', epoch + 1, config.failure_log_filename, config.device)
            log_failures(model, val_loader, 'val', epoch + 1, config.failure_log_filename, config.device)

    print("--- Training Finished ---")

# ==============================================================================
# Part 4: MODIFIED Final Test Suite
# ==============================================================================
def run_final_tests(model, config):
    """Runs the trained model against a suite of hardcoded test cases and logs failures."""
    print("\n--- Running Final Test Suite ---")
    
    test_cases = [
        ("Simple Greater", 10.00, 9.21), ("Simple Lesser", 5.50, 50.50),
        ("Decimal Greater", 54.13, 54.12), ("Decimal Lesser", 99.98, 99.99),
        ("Edge Case: Large Difference", 0.01, 99.99), ("Edge Case: Zero", 0.00, 5.00),
        ("Tricky: Same Integer Part", 25.80, 25.79), ("Tricky: Crossover", 49.99, 50.00),
    ]
    
    results_log = "--- MLP Test Suite Results ---\n\n"
    correct_tests = 0
    
    model.eval()
    with torch.no_grad():
        for description, a, b in test_cases:
            a_str, b_str = f"{a:05.2f}", f"{b:05.2f}"
            a_digits, b_digits = [int(d) for d in a_str if d.isdigit()], [int(d) for d in b_str if d.isdigit()]
            feature_vector = torch.tensor(a_digits + b_digits, dtype=torch.float32).to(config.device)
            
            output = model(feature_vector)
            predicted_class = 1 if torch.sigmoid(output).item() > 0.5 else 0
            ground_truth_class = 0 if a > b else 1
            
            result = "CORRECT"
            if predicted_class != ground_truth_class:
                result = "INCORRECT"
                # --- NEW: Log failure to CSV ---
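                # (column reuse: 'final_test' goes in the epoch field and the
                # test description in the split field of the shared CSV schema)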
                with open(config.failure_log_filename, 'a') as f:
                    f.write(f"final_test,{description.replace(',',';')},{a_str},{b_str},{ground_truth_class},{predicted_class}\n")
            else:
                correct_tests += 1
            
            predicted_winner = "Left" if predicted_class == 0 else "Right"
            log_line = (f"Test: '{description}' | {a_str} vs {b_str}\n"
                        f"  -> Model says: {predicted_winner} is greater\n"
                        f"  -> Result: {result}\n" + "-"*30 + "\n")
            print(log_line)
            results_log += log_line

    final_accuracy = 100 * correct_tests / len(test_cases)
    summary = f"\nFinal Test Accuracy: {final_accuracy:.2f}% ({correct_tests}/{len(test_cases)} correct)\n"
    print(summary)
    results_log += summary
    
    return results_log

# ==============================================================================
# Main Execution Block
# ==============================================================================
if __name__ == '__main__':
    X, y = generate_mlp_data(config.num_samples)
    
    dataset = TensorDataset(X, y)
    train_size = int(config.train_split * len(dataset))
    val_size = len(dataset) - train_size
    train_dataset, val_dataset = torch.utils.data.random_split(dataset, [train_size, val_size])
    train_loader = DataLoader(train_dataset, batch_size=config.batch_size, shuffle=True)
    val_loader = DataLoader(val_dataset, batch_size=config.batch_size, shuffle=False)
    
    model = SimpleMLP(config.input_size, config.hidden_size_1, config.hidden_size_2, config.output_size).to(config.device)
    print(f"\nModel initialized with {sum(p.numel() for p in model.parameters())} parameters.")
    
    optimizer = torch.optim.AdamW(model.parameters(), lr=config.learning_rate, weight_decay=config.weight_decay)
    criterion = nn.BCEWithLogitsLoss()
    
    train_model(model, train_loader, val_loader, optimizer, criterion, config.epochs, config)
    
    test_results = run_final_tests(model, config)
    
    if is_in_colab():
        print(f"\nDetected Google Colab environment. Zipping and downloading results...")
        results_filename = f"{config.script_name}_test_summary.txt"
        with open(results_filename, "w") as f:
            f.write(test_results)
            
        zip_filename = f"{config.script_name}_outputs.zip"
        with zipfile.ZipFile(zip_filename, 'w') as zipf:
            zipf.write(results_filename)
            # Also include the new failure log in the zip file
            if os.path.exists(config.failure_log_filename):
                zipf.write(config.failure_log_filename)
            
        try:
            from google.colab import files
            files.download(zip_filename)
            print(f"Downloaded {zip_filename} successfully.")
        except Exception as e:
            print(f"Could not initiate download. Error: {e}")
    else:
        print(f"\nNot running in Colab. Test results printed above. Failures logged to {config.failure_log_filename}")