Commit ae08fb7 by ehartford (verified) · 1 parent: 67d3388

Update README.md

Files changed (1)
  1. README.md +4 -5
README.md CHANGED
@@ -458,9 +458,9 @@ ok can you please show the code to generate the pre-computed external commonsens
  > want to incorporate commonsense knowledge for a subset of input tokens marked by the
  > new [KNW] token; you may need to modify it according to your specific downstream NLP
  > task's requirements and desired level of commonsense integration.
- >
- > > Transformer architecture (Vaswani et al.) does poorly in theory of mind. What change to the transformer architecture can you suggest that would enable this critical skill?
- >
+
+ Transformer architecture (Vaswani et al.) does poorly in theory of mind. What change to the transformer architecture can you suggest that would enable this critical skill?
+
  > 1. **Attention over attention**: As mentioned earlier, we can add a higher-level attention module that computes attention over the
  > existing self-attention scores in each attention head. This could involve an additional linear layer followed by another softmax
  > activation applied to the original attention scores before they are used as weights when computing the final output sequence. By
@@ -472,8 +472,7 @@ ok can you please show the code to generate the pre-computed external commonsens
  > within each self-attention layer that compares two alternative input representations instead of just one actual input sequence and its
  > own past context as in standard transformer models. Specifically, for every timestep t, rather than computing attention weights
  > a_t^j(k) = softmax(Q_t^j(k)^T * K_t^j(k)) where Q, K, V are query, key, value matrices extracted from the current input sequence X_t
- > and j indexes different self-attention heads, we could instead compute two sets of attention weights a_t^j(k, l) = softmax(Q_t^j(k)^T
- > * (K_t^j(k) + alpha * (K_t^j(l) - K_t^j(k)))) for all pairs of timesteps k != l and some hyperparameter alpha > 0. This would
+ > and j indexes different self-attention heads, we could instead compute two sets of attention weights a_t^j(k, l) = softmax(Q_t^j(k)^T * (K_t^j(k) + alpha * (K_t^j(l) - K_t^j(k)))) for all pairs of timesteps k != l and some hyperparameter alpha > 0. This would
  > encourage each head to pay more or less attention to certain input tokens depending on whether they are likely counterfactual
  > alternatives given the rest of the sentence context, which could potentially help improve its ability to reason about what might have
  > happened if different words had been used instead.
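
Note on the second hunk: the quoted formula a(k, l) = softmax(Q(k)^T * (K(k) + alpha * (K(l) - K(k)))) leaves several details open, so the following is only a minimal NumPy sketch of one possible reading, not the README's or the model's implementation. Assumptions made here (none of them stated in the quoted text): a single head, the softmax is taken over l for each fixed k, the excluded case l == k is masked out, there is no 1/sqrt(d) scaling (the quoted formula has none), and the function name `counterfactual_attention` is hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax; -inf entries map to weight 0.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def counterfactual_attention(Q, K, alpha):
    """Sketch of a(k, l) for one head (hypothetical helper).

    Q, K: arrays of shape (n, d) with n >= 2.
    Returns an (n, n) array whose row k is a distribution over l != k.
    """
    Q = np.asarray(Q, dtype=float)
    K = np.asarray(K, dtype=float)
    # Interpolated keys: K_mix[k, l] = K[k] + alpha * (K[l] - K[k]), shape (n, n, d).
    K_mix = K[:, None, :] + alpha * (K[None, :, :] - K[:, None, :])
    # scores[k, l] = Q[k] . K_mix[k, l]
    scores = np.einsum("kd,kld->kl", Q, K_mix)
    # Assumption: the quoted text only defines pairs with k != l, so mask l == k.
    np.fill_diagonal(scores, -np.inf)
    # Assumption: normalize over l for each fixed k.
    return softmax(scores, axis=-1)
```

Under this reading, alpha = 1 reduces to ordinary unscaled dot-product attention from position k over all other positions, and alpha = 0 gives a uniform distribution over l, since every interpolated key collapses to K(k). Intermediate values of alpha interpolate between the two.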