# Activation Functions
## **1. Sigmoid (Logistic)**
**Formula:** σ(x) = 1 / (1 + exp(-x))
**Strengths:** Maps any real-valued number to a value between 0 and 1, making it suitable for binary classification problems.
**Weaknesses:** Saturates (i.e., output values approach 0 or 1) for large inputs, leading to vanishing gradients during backpropagation.
**Usage:** Binary classification, logistic regression.
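
A minimal NumPy sketch of the formula above; the input values are arbitrary and only illustrate the (0, 1) output range:

```python
import numpy as np

def sigmoid(x):
    # sigma(x) = 1 / (1 + exp(-x)): squashes any real input into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

logits = np.array([-2.0, 0.0, 3.0])
print(sigmoid(logits))  # approximately [0.119, 0.5, 0.953]
```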
## **2. Hyperbolic Tangent (Tanh)**
**Formula:** tanh(x) = 2 / (1 + exp(-2x)) - 1
**Strengths:** Similar to sigmoid, but maps to (-1, 1); its zero-centered output often makes optimization easier.
**Weaknesses:** Also saturates, leading to vanishing gradients.
**Usage:** Similar settings to sigmoid, but preferred in hidden layers because its output is zero-centered; common in recurrent networks.
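
A quick NumPy check (on arbitrary sample points) that the formula above is just the standard hyperbolic tangent:

```python
import numpy as np

x = np.linspace(-3.0, 3.0, 7)
via_formula = 2.0 / (1.0 + np.exp(-2.0 * x)) - 1.0  # formula from this section
print(np.allclose(via_formula, np.tanh(x)))         # True: both compute tanh(x)
```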
## **3. Rectified Linear Unit (ReLU)**
**Formula:** f(x) = max(0, x)
**Strengths:** Computationally cheap, simple, and non-saturating for positive inputs, which keeps gradients flowing in deep networks.
**Weaknesses:** Not differentiable at x=0, and neurons can "die" (output zero with zero gradient) when their inputs stay negative.
**Usage:** Default choice for hidden layers in most feed-forward and convolutional networks.
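
A minimal NumPy sketch; the gradient helper reflects the usual convention of treating the gradient at x = 0 as 0:

```python
import numpy as np

def relu(x):
    # element-wise max(0, x)
    return np.maximum(0.0, x)

def relu_grad(x):
    # subgradient: 1 for x > 0, else 0 (the value at exactly 0 is a convention)
    return (x > 0).astype(float)

x = np.array([-1.5, 0.0, 2.0])
print(relu(x))       # [0. 0. 2.]
print(relu_grad(x))  # [0. 0. 1.]
```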
## **4. Leaky ReLU**
**Formula:** f(x) = max(αx, x), where α is a small constant (e.g., 0.01)
**Strengths:** Like ReLU, but lets a small, non-zero gradient (slope α) flow for negative inputs, which mitigates dying neurons.
**Weaknesses:** Still non-differentiable at x=0.
**Usage:** Alternative to ReLU, especially when dealing with dying neurons.
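
A short NumPy sketch; α = 0.01 is a commonly used default, not a required value:

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # keep positive inputs unchanged, scale negative inputs by alpha
    return np.where(x > 0, x, alpha * x)

print(leaky_relu(np.array([-3.0, 0.5])))  # [-0.03  0.5 ]
```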
## **5. Swish**
**Formula:** f(x) = x \* sigmoid(βx), where β is either fixed to 1 or learned per layer
**Strengths:** Self-gated, smooth, and non-monotonic; often matches or outperforms ReLU in deep models.
**Weaknesses:** More expensive to compute than ReLU; the learnable-β variant adds extra parameters.
**Usage:** Can be used in place of ReLU or other activations, but may not always outperform them.
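
A NumPy sketch with β as a plain argument; in the learnable variant, β would instead be a trainable per-layer parameter:

```python
import numpy as np

def swish(x, beta=1.0):
    # x gated by sigmoid(beta * x); beta = 1 recovers SiLU (section 12)
    return x / (1.0 + np.exp(-beta * x))

x = np.array([-2.0, 0.0, 2.0])
print(swish(x))            # approximately [-0.238  0.     1.762]
print(swish(x, beta=2.0))  # a sharper gate around zero
```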
## **6. Softmax**
**Formula:** softmax(x)_i = exp(x_i) / Σ_j exp(x_j)
**Strengths:** Normalizes output to ensure probabilities sum to 1, making it suitable for multi-class classification.
**Weaknesses:** Assumes mutually exclusive classes, and naive implementations can overflow for large logits (the sketch below shows the standard fix).
**Usage:** Output layer activation for multi-class classification problems.
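
A numerically stable NumPy sketch: subtracting the maximum logit before exponentiating avoids overflow and leaves the result unchanged:

```python
import numpy as np

def softmax(x):
    # shift by the max for numerical stability; the probabilities are identical
    z = np.exp(x - np.max(x))
    return z / z.sum()

probs = softmax(np.array([1.0, 2.0, 3.0]))
print(probs, probs.sum())  # probabilities that sum to 1.0
```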
## **7. Softsign**
**Formula:** f(x) = x / (1 + |x|)
**Strengths:** Similar to tanh (output in (-1, 1)), but approaches its asymptotes polynomially rather than exponentially, so it saturates more gently.
**Weaknesses:** Still saturates for large inputs and is rarely used; it seldom offers a clear benefit over tanh.
**Usage:** Alternative to sigmoid or tanh in certain situations.
## **8. ArcTan**
**Formula:** f(x) = arctan(x)
**Strengths:** Smooth and continuous, with output bounded in (-π/2, π/2) and gentler saturation than tanh.
**Weaknesses:** Still saturates for large inputs and is rarely used in practice.
**Usage:** Experimental or niche applications.
## **9. SoftPlus**
**Formula:** f(x) = log(1 + exp(x))
**Strengths:** Smooth, everywhere-differentiable approximation of ReLU; its output is always positive.
**Weaknesses:** More expensive to compute than ReLU, and its gradient still vanishes for large negative inputs.
**Usage:** Occasionally as a hidden-layer activation; more commonly used to constrain model outputs to be positive (e.g., predicted variances).
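
A small NumPy comparison of the three less common activations above (Softsign, ArcTan, SoftPlus) on a few sample points; softplus is computed with `logaddexp` so large inputs do not overflow:

```python
import numpy as np

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
softsign = x / (1.0 + np.abs(x))   # bounded in (-1, 1), polynomial saturation
arctan   = np.arctan(x)            # bounded in (-pi/2, pi/2)
softplus = np.logaddexp(0.0, x)    # log(1 + exp(x)), a smooth ReLU-like curve
print(softsign)
print(arctan)
print(softplus)
```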
## **10. Gaussian Error Linear Unit (GELU)**
**Formula:** f(x) = x \* Φ(x), where Φ is the cumulative distribution function of the standard normal distribution
**Strengths:** Smooth, non-monotonic near zero, and empirically strong in very deep networks.
**Weaknesses:** More expensive than ReLU, since it requires the normal CDF or an approximation (see section 13).
**Usage:** Default activation in most Transformer architectures (e.g., BERT, GPT); a common drop-in replacement for ReLU.
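
A NumPy/SciPy sketch of exact GELU, writing the normal CDF via the error function (assumes SciPy is available):

```python
import math
import numpy as np
from scipy.special import erf

def gelu(x):
    # Phi(x) = 0.5 * (1 + erf(x / sqrt(2))) is the standard normal CDF
    return x * 0.5 * (1.0 + erf(x / math.sqrt(2.0)))

print(gelu(np.array([-1.0, 0.0, 1.0])))  # approximately [-0.159, 0.0, 0.841]
```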
## **11. Mish**
**Formula:** f(x) = x \* tanh(softplus(x))
**Strengths:** Smooth, non-monotonic, and unbounded above, so it does not saturate for positive inputs.
**Weaknesses:** More expensive to compute than ReLU.
**Usage:** Alternative to ReLU, popularized in computer vision models (e.g., some YOLO variants).
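
A NumPy sketch; softplus is written as `logaddexp(0, x)` so large inputs do not overflow:

```python
import numpy as np

def mish(x):
    # x * tanh(softplus(x)), with softplus computed stably
    return x * np.tanh(np.logaddexp(0.0, x))

print(mish(np.array([-2.0, 0.0, 2.0])))  # approximately [-0.253, 0.0, 1.944]
```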
## **12. Sigmoid Linear Unit (SiLU)**
**Formula:** f(x) = x \* sigmoid(x)
**Strengths:** Smooth and self-gated; identical to Swish with β fixed to 1.
**Weaknesses:** Slightly more expensive to compute than ReLU.
**Usage:** Drop-in replacement for ReLU, widely used in modern vision models (e.g., EfficientNet).
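
A tiny NumPy check that SiLU is exactly the Swish formula with β = 1:

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))         # x * sigmoid(x)

def swish(x, beta=1.0):
    return x / (1.0 + np.exp(-beta * x))  # x * sigmoid(beta * x)

x = np.linspace(-4.0, 4.0, 9)
print(np.allclose(silu(x), swish(x, beta=1.0)))  # True
```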
## **13. GELU (Tanh Approximation)**
**Formula:** f(x) ≈ 0.5 \* x \* (1 + tanh(√(2/π) \* (x + 0.044715 \* x^3)))
**Strengths:** Fast, non-saturating, and smooth.
**Weaknesses:** Approximation, not exactly equal to GELU.
**Usage:** Alternative to GELU, especially when computational efficiency is crucial.
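
A NumPy/SciPy sketch comparing the tanh approximation above with exact GELU on a grid of points (assumes SciPy for `erf`); the gap stays small across the range:

```python
import math
import numpy as np
from scipy.special import erf

def gelu_exact(x):
    return x * 0.5 * (1.0 + erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    # tanh-based approximation from this section
    return 0.5 * x * (1.0 + np.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x**3)))

x = np.linspace(-5.0, 5.0, 101)
print(np.max(np.abs(gelu_exact(x) - gelu_tanh(x))))  # small maximum deviation
```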
## **14. SELU (Scaled Exponential Linear Unit)**
**Formula:** f(x) = λx if x > 0, λα(exp(x) - 1) if x ≤ 0, with fixed constants λ ≈ 1.0507 and α ≈ 1.6733
**Strengths:** Self-normalizing: with the right setup, activations tend toward zero mean and unit variance across layers.
**Weaknesses:** The self-normalizing property only holds with LeCun-normal initialization and plain feed-forward layers (no batch normalization); λ and α are fixed constants, not tuned.
**Usage:** Alternative to ReLU, especially in deep neural networks.
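
A NumPy sketch using the fixed constants from the SELU paper (Klambauer et al., 2017):

```python
import numpy as np

ALPHA = 1.6732632423543772   # fixed constant derived analytically in the paper
LAMBDA = 1.0507009873554805  # fixed scale factor

def selu(x):
    # lambda * x for positive inputs, scaled exponential for negative inputs
    return LAMBDA * np.where(x > 0, x, ALPHA * (np.exp(x) - 1.0))

print(selu(np.array([-1.0, 0.0, 1.0])))
```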
When choosing an activation function, consider the following:
* **Non-saturation:** In deep hidden layers, prefer non-saturating activations (e.g., the ReLU family) over sigmoid or tanh to limit vanishing gradients.
* **Computational efficiency:** Prefer cheap activations (e.g., ReLU, Leaky ReLU) for very large models or latency-sensitive applications.
* **Smoothness:** Smooth activations (e.g., GELU, Mish) can help with optimization and convergence.
* **Domain knowledge:** Select activations based on the problem domain and desired output (e.g., softmax for multi-class classification).
* **Experimentation:** Try several activations and evaluate their performance on your specific task; a minimal comparison sketch follows below.
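
A minimal sketch of the experimentation point, assuming PyTorch is available: the hidden activation of a small MLP is swapped among several candidates discussed above. The layer sizes and the activation list are illustrative choices only; in practice you would train and validate each variant on your task.

```python
import torch
import torch.nn as nn

def make_mlp(activation: nn.Module) -> nn.Sequential:
    # tiny MLP whose hidden activation is a pluggable module
    return nn.Sequential(nn.Linear(32, 64), activation, nn.Linear(64, 10))

candidates = [nn.ReLU(), nn.LeakyReLU(0.01), nn.GELU(), nn.SiLU(), nn.Mish(), nn.SELU()]
x = torch.randn(8, 32)  # a dummy batch of inputs
for act in candidates:
    model = make_mlp(act)
    print(type(act).__name__, model(x).shape)  # same output shape, different non-linearity
```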