# Activation Functions
## **1. Sigmoid (Logistic)**
**Formula:** σ(x) = 1 / (1 + exp(-x))
**Strengths:** Maps any real-valued number to a value between 0 and 1, making it suitable for binary classification problems.
**Weaknesses:** Saturates (i.e., output values approach 0 or 1) for large inputs, leading to vanishing gradients during backpropagation.
**Usage:** Binary classification, logistic regression.
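
A minimal NumPy sketch of the formula above; the input values are arbitrary and only illustrate the (0, 1) output range:

```python
import numpy as np

def sigmoid(x):
    # sigma(x) = 1 / (1 + exp(-x)): squashes any real input into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

logits = np.array([-2.0, 0.0, 3.0])
print(sigmoid(logits))  # approximately [0.119, 0.5, 0.953]
```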
## **2. Hyperbolic Tangent (Tanh)**
**Formula:** tanh(x) = 2 / (1 + exp(-2x)) - 1
**Strengths:** Similar to sigmoid, but maps to (-1, 1); its zero-centered output often makes optimization easier.
**Weaknesses:** Also saturates, leading to vanishing gradients.
**Usage:** Similar settings to sigmoid, but preferred in hidden layers because its output is zero-centered; common in recurrent networks.
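
A quick NumPy check (on arbitrary sample points) that the formula above is just the standard hyperbolic tangent:

```python
import numpy as np

x = np.linspace(-3.0, 3.0, 7)
via_formula = 2.0 / (1.0 + np.exp(-2.0 * x)) - 1.0  # formula from this section
print(np.allclose(via_formula, np.tanh(x)))         # True: both compute tanh(x)
```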
## **3. Rectified Linear Unit (ReLU)**
**Formula:** f(x) = max(0, x)
**Strengths:** Computationally cheap, simple, and non-saturating for positive inputs, which keeps gradients flowing in deep networks.
**Weaknesses:** Not differentiable at x=0, and neurons can "die" (output zero with zero gradient) when their inputs stay negative.
**Usage:** Default choice for hidden layers in most feed-forward and convolutional networks.
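
A minimal NumPy sketch; the gradient helper reflects the usual convention of treating the gradient at x = 0 as 0:

```python
import numpy as np

def relu(x):
    # element-wise max(0, x)
    return np.maximum(0.0, x)

def relu_grad(x):
    # subgradient: 1 for x > 0, else 0 (the value at exactly 0 is a convention)
    return (x > 0).astype(float)

x = np.array([-1.5, 0.0, 2.0])
print(relu(x))       # [0. 0. 2.]
print(relu_grad(x))  # [0. 0. 1.]
```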
## **4. Leaky ReLU**
**Formula:** f(x) = max(αx, x), where α is a small constant (e.g., 0.01)
**Strengths:** Like ReLU, but lets a small, non-zero gradient (slope α) flow for negative inputs, which mitigates dying neurons.
**Weaknesses:** Still non-differentiable at x=0.
**Usage:** Alternative to ReLU, especially when dealing with dying neurons.
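
A short NumPy sketch; α = 0.01 is a commonly used default, not a required value:

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # keep positive inputs unchanged, scale negative inputs by alpha
    return np.where(x > 0, x, alpha * x)

print(leaky_relu(np.array([-3.0, 0.5])))  # [-0.03  0.5 ]
```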
## **5. Swish**
**Formula:** f(x) = x \* sigmoid(βx), where β is either fixed to 1 or learned per layer
**Strengths:** Self-gated, smooth, and non-monotonic; often matches or outperforms ReLU in deep models.
**Weaknesses:** More expensive to compute than ReLU; the learnable-β variant adds extra parameters.
**Usage:** Can be used in place of ReLU or other activations, but may not always outperform them.
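
A NumPy sketch with β as a plain argument; in the learnable variant, β would instead be a trainable per-layer parameter:

```python
import numpy as np

def swish(x, beta=1.0):
    # x gated by sigmoid(beta * x); beta = 1 recovers SiLU (section 12)
    return x / (1.0 + np.exp(-beta * x))

x = np.array([-2.0, 0.0, 2.0])
print(swish(x))            # approximately [-0.238  0.     1.762]
print(swish(x, beta=2.0))  # a sharper gate around zero
```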
## **6. Softmax**
**Formula:** softmax(x)_i = exp(x_i) / Σ_j exp(x_j)
**Strengths:** Normalizes output to ensure probabilities sum to 1, making it suitable for multi-class classification.
**Weaknesses:** Assumes mutually exclusive classes, and naive implementations can overflow for large logits (the sketch below shows the standard fix).
**Usage:** Output layer activation for multi-class classification problems.
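
A numerically stable NumPy sketch: subtracting the maximum logit before exponentiating avoids overflow and leaves the result unchanged:

```python
import numpy as np

def softmax(x):
    # shift by the max for numerical stability; the probabilities are identical
    z = np.exp(x - np.max(x))
    return z / z.sum()

probs = softmax(np.array([1.0, 2.0, 3.0]))
print(probs, probs.sum())  # probabilities that sum to 1.0
```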
## **7. Softsign**
**Formula:** f(x) = x / (1 + |x|)
**Strengths:** Similar to tanh (output in (-1, 1)), but approaches its asymptotes polynomially rather than exponentially, so it saturates more gently.
**Weaknesses:** Still saturates for large inputs and is rarely used; it seldom offers a clear benefit over tanh.
**Usage:** Alternative to sigmoid or tanh in certain situations.
## **8. ArcTan**
**Formula:** f(x) = arctan(x)
**Strengths:** Smooth and continuous, with output bounded in (-π/2, π/2) and gentler saturation than tanh.
**Weaknesses:** Still saturates for large inputs and is rarely used in practice.
**Usage:** Experimental or niche applications.
## **9. SoftPlus**
**Formula:** f(x) = log(1 + exp(x))
**Strengths:** Smooth, everywhere-differentiable approximation of ReLU; its output is always positive.
**Weaknesses:** More expensive to compute than ReLU, and its gradient still vanishes for large negative inputs.
**Usage:** Occasionally as a hidden-layer activation; more commonly used to constrain model outputs to be positive (e.g., predicted variances).
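
A small NumPy comparison of the three less common activations above (Softsign, ArcTan, SoftPlus) on a few sample points; softplus is computed with `logaddexp` so large inputs do not overflow:

```python
import numpy as np

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
softsign = x / (1.0 + np.abs(x))   # bounded in (-1, 1), polynomial saturation
arctan   = np.arctan(x)            # bounded in (-pi/2, pi/2)
softplus = np.logaddexp(0.0, x)    # log(1 + exp(x)), a smooth ReLU-like curve
print(softsign)
print(arctan)
print(softplus)
```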
## **10. Gaussian Error Linear Unit (GELU)**
**Formula:** f(x) = x \* Φ(x), where Φ is the cumulative distribution function of the standard normal distribution
**Strengths:** Smooth, non-monotonic near zero, and empirically strong in very deep networks.
**Weaknesses:** More expensive than ReLU, since it requires the normal CDF or an approximation (see section 13).
**Usage:** Default activation in most Transformer architectures (e.g., BERT, GPT); a common drop-in replacement for ReLU.
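
A NumPy/SciPy sketch of exact GELU, writing the normal CDF via the error function (assumes SciPy is available):

```python
import math
import numpy as np
from scipy.special import erf

def gelu(x):
    # Phi(x) = 0.5 * (1 + erf(x / sqrt(2))) is the standard normal CDF
    return x * 0.5 * (1.0 + erf(x / math.sqrt(2.0)))

print(gelu(np.array([-1.0, 0.0, 1.0])))  # approximately [-0.159, 0.0, 0.841]
```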
## **11. Mish**
**Formula:** f(x) = x \* tanh(softplus(x))
**Strengths:** Smooth, non-monotonic, and unbounded above, so it does not saturate for positive inputs.
**Weaknesses:** More expensive to compute than ReLU.
**Usage:** Alternative to ReLU, popularized in computer vision models (e.g., some YOLO variants).
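
A NumPy sketch; softplus is written as `logaddexp(0, x)` so large inputs do not overflow:

```python
import numpy as np

def mish(x):
    # x * tanh(softplus(x)), with softplus computed stably
    return x * np.tanh(np.logaddexp(0.0, x))

print(mish(np.array([-2.0, 0.0, 2.0])))  # approximately [-0.253, 0.0, 1.944]
```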
## **12. Sigmoid Linear Unit (SiLU)**
**Formula:** f(x) = x \* sigmoid(x)
**Strengths:** Smooth and self-gated; identical to Swish with β fixed to 1.
**Weaknesses:** Slightly more expensive to compute than ReLU.
**Usage:** Drop-in replacement for ReLU, widely used in modern vision models (e.g., EfficientNet).
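
A tiny NumPy check that SiLU is exactly the Swish formula with β = 1:

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))         # x * sigmoid(x)

def swish(x, beta=1.0):
    return x / (1.0 + np.exp(-beta * x))  # x * sigmoid(beta * x)

x = np.linspace(-4.0, 4.0, 9)
print(np.allclose(silu(x), swish(x, beta=1.0)))  # True
```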
## **13. GELU (Tanh Approximation)**
**Formula:** f(x) ≈ 0.5 \* x \* (1 + tanh(√(2/π) \* (x + 0.044715 \* x^3)))
**Strengths:** Fast, non-saturating, and smooth.
**Weaknesses:** Approximation, not exactly equal to GELU.
**Usage:** Alternative to GELU, especially when computational efficiency is crucial.
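
A NumPy/SciPy sketch comparing the tanh approximation above with exact GELU on a grid of points (assumes SciPy for `erf`); the gap stays small across the range:

```python
import math
import numpy as np
from scipy.special import erf

def gelu_exact(x):
    return x * 0.5 * (1.0 + erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    # tanh-based approximation from this section
    return 0.5 * x * (1.0 + np.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x**3)))

x = np.linspace(-5.0, 5.0, 101)
print(np.max(np.abs(gelu_exact(x) - gelu_tanh(x))))  # small maximum deviation
```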
## **14. SELU (Scaled Exponential Linear Unit)**
**Formula:** f(x) = λx if x > 0, λα(exp(x) - 1) if x ≤ 0, with fixed constants λ ≈ 1.0507 and α ≈ 1.6733
**Strengths:** Self-normalizing: with the right setup, activations tend toward zero mean and unit variance across layers.
**Weaknesses:** The self-normalizing property only holds with LeCun-normal initialization and plain feed-forward layers (no batch normalization); λ and α are fixed constants, not tuned.
**Usage:** Alternative to ReLU, especially in deep neural networks.
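
A NumPy sketch using the fixed constants from the SELU paper (Klambauer et al., 2017):

```python
import numpy as np

ALPHA = 1.6732632423543772   # fixed constant derived analytically in the paper
LAMBDA = 1.0507009873554805  # fixed scale factor

def selu(x):
    # lambda * x for positive inputs, scaled exponential for negative inputs
    return LAMBDA * np.where(x > 0, x, ALPHA * (np.exp(x) - 1.0))

print(selu(np.array([-1.0, 0.0, 1.0])))
```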
When choosing an activation function, consider the following:
* **Non-saturation:** In deep hidden layers, prefer non-saturating activations (e.g., the ReLU family) over sigmoid or tanh to limit vanishing gradients.
* **Computational efficiency:** Prefer cheap activations (e.g., ReLU, Leaky ReLU) for very large models or latency-sensitive applications.
* **Smoothness:** Smooth activations (e.g., GELU, Mish) can help with optimization and convergence.
* **Domain knowledge:** Select activations based on the problem domain and desired output (e.g., softmax for multi-class classification).
* **Experimentation:** Try several activations and evaluate their performance on your specific task; a minimal comparison sketch follows below.
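
A minimal sketch of the experimentation point, assuming PyTorch is available: the hidden activation of a small MLP is swapped among several candidates discussed above. The layer sizes and the activation list are illustrative choices only; in practice you would train and validate each variant on your task.

```python
import torch
import torch.nn as nn

def make_mlp(activation: nn.Module) -> nn.Sequential:
    # tiny MLP whose hidden activation is a pluggable module
    return nn.Sequential(nn.Linear(32, 64), activation, nn.Linear(64, 10))

candidates = [nn.ReLU(), nn.LeakyReLU(0.01), nn.GELU(), nn.SiLU(), nn.Mish(), nn.SELU()]
x = torch.randn(8, 32)  # a dummy batch of inputs
for act in candidates:
    model = make_mlp(act)
    print(type(act).__name__, model(x).shape)  # same output shape, different non-linearity
```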