multi_head_attention#

ivy.multi_head_attention(query, /, *, key=None, value=None, batch_first=True, num_heads=8, scale=None, attention_mask=None, in_proj_weights=None, q_proj_weights=None, k_proj_weights=None, v_proj_weights=None, out_proj_weights=None, in_proj_bias=None, out_proj_bias=None, is_causal=False, key_padding_mask=None, bias_k=None, bias_v=None, static_k=None, static_v=None, add_zero_attn=False, return_attention_weights=False, average_attention_weights=True, dropout=0.0, training=False, out=None)[source]#

Apply multi-head attention to the inputs. This is an implementation of multi-headed attention as described in the paper “Attention Is All You Need” (Vaswani et al., 2017). If query, key and value are the same, then this is self-attention: each timestep in query attends to the corresponding sequence in key and returns a fixed-width vector. This layer first projects query, key and value. These are (effectively) a list of tensors of length num_heads, with corresponding shapes (batch_size, <query dimensions>, key_dim), (batch_size, <key/value dimensions>, key_dim) and (batch_size, <key/value dimensions>, value_dim). The query and key tensors are then dot-producted and scaled, and the results are softmaxed to obtain attention probabilities. The value tensors are interpolated by these probabilities and concatenated back into a single tensor. Finally, the resulting tensor, with value_dim as its last dimension, can be passed through a final linear projection before being returned.
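
For example, a minimal self-attention call looks as follows (a rough sketch: it assumes a numpy backend has been set and that key and value default to query when omitted):

>>> import ivy
>>> ivy.set_backend("numpy")
>>> q = ivy.random_normal(shape=(2, 4, 8))          # (N, L, Q) = (2, 4, 8)
>>> out = ivy.multi_head_attention(q, num_heads=2)  # key and value default to query
>>> # out has shape (N, L, E) = (2, 4, 8), since no projection weights are given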

Parameters:
  • query (Union[Array, NativeArray]) – The query embeddings. Shape: (L, Q) or (N, L, Q), where L is the number of queries, N is the batch size, Q is the query embedding dimension.

  • key (Optional[Union[Array, NativeArray]], default: None) – The key embeddings. Shape: (S, K) or (N, S, K), where S is the number of keys, N is the batch size, K is the key embedding dimension.

  • value (Optional[Union[Array, NativeArray]], default: None) – The value embeddings. Shape (S, V) or (N, S, V), where S is the number of keys, N is the batch size, V is the value embedding dimension.

  • batch_first (bool, default: True) – If False, query, key and value will have shapes (L, N, Q), (S, N, K) and (S, N, V) respectively (if batched).

  • num_heads (int, default: 8) – The number of attention heads to use.

  • scale (Optional[float], default: None) – The value by which to scale the query-key similarity measure before softmax.

  • attention_mask (Optional[Union[Array, NativeArray]], default: None) – The mask to apply to the query-key values. Shape: (L, S) or (N*num_heads, L, S).

  • in_proj_weights (Optional[Union[Array, NativeArray]], default: None) – The weights used to project query, key and value. Shape: (3*E, E’), where E is the new embedding dimension and E’ is the input embedding dimension, i.e. E’ = Q = K = V.

  • q_proj_weights (Optional[Union[Array, NativeArray]], default: None) – The weights used to project query if in_proj_weights is None. Shape: (E, Q).

  • k_proj_weights (Optional[Union[Array, NativeArray]], default: None) – The weights used to project key if in_proj_weights is None. Shape: (E, K).

  • v_proj_weights (Optional[Union[Array, NativeArray]], default: None) – The weights used to project value if in_proj_weights is None. Shape: (E, V).

  • out_proj_weights (Optional[Union[Array, NativeArray]], default: None) – The weights used to project the attention output. Shape: (O, E), where O is the output embedding dimension.

  • in_proj_bias (Optional[Union[Array, NativeArray]], default: None) – The bias used when projecting query, key and value. Shape: (3*E,).

  • out_proj_bias (Optional[Union[Array, NativeArray]], default: None) – The bias used when projecting the output. Shape: (O,).

  • is_causal (bool, default: False) – If True, use a causal attention mask and ignore the provided attention_mask.

  • key_padding_mask (Optional[Union[Array, NativeArray]], default: None) – A binary mask to apply to the key sequence. Shape: (S,) or (N, S).

  • bias_k (Optional[Union[Array, NativeArray]], default: None) – An additional bias added to the key sequence. Shape: (E,).

  • bias_v (Optional[Union[Array, NativeArray]], default: None) – An additional bias added to the value sequence. Shape: (E,).

  • static_k (Optional[Union[Array, NativeArray]], default: None) – A static key to be used in the attention operators. Shape: (N*num_heads, S, E//num_heads).

  • static_v (Optional[Union[Array, NativeArray]], default: None) – A static value to be used in the attention operators. Shape: (N*num_heads, S, E//num_heads).

  • add_zero_attn (bool, default: False) – A boolean flag indicating whether to add a batch of zeros to key and value.

  • return_attention_weights (bool, default: False) – If True, return the attention weights alongside the attention output.

  • average_attention_weights (bool, default: True) – If True, the returned attention weights will be averaged across heads. Otherwise, the attention weights will be provided separately per head. Note that this flag only has an effect when return_attention_weights=True.

  • dropout (float, default: 0.0) – Specifies the dropout probability. Dropout is applied to the attention weights.

  • training (bool, default: False) – If True, dropout is applied to the attention weights; otherwise dropout is not used.

  • out (Optional[Array], default: None) – optional output array, for writing the result to. It must have a shape that the inputs broadcast to.

Return type:

Union[Array, NativeArray]

Returns:

  • ret – The output following the application of multi-head attention. Either output or (output, attention_weights). output will have shape (L, E) if the inputs were unbatched or (N, L, E) otherwise, and attention_weights will have shape (L, S) or (N, L, S) respectively. If batch_first is False and the inputs were batched, the output will have shape (L, N, E).

  Both the description and the type hints above assume an array input for simplicity, but this function is nestable, and therefore also accepts ivy.Container instances in place of any of the arguments.
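
As a further sketch, cross-attention with explicit (here randomly initialised, untrained) projection weights and returned attention weights might look like this:

>>> import ivy
>>> N, L, S, E = 2, 4, 6, 8
>>> q = ivy.random_normal(shape=(N, L, E))
>>> k = ivy.random_normal(shape=(N, S, E))
>>> v = ivy.random_normal(shape=(N, S, E))
>>> in_w = ivy.random_normal(shape=(3 * E, E))   # joint query/key/value projection, shape (3*E, E')
>>> out_w = ivy.random_normal(shape=(E, E))      # output projection, shape (O, E) with O = E
>>> out, attn = ivy.multi_head_attention(
...     q, key=k, value=v, num_heads=2,
...     in_proj_weights=in_w, out_proj_weights=out_w,
...     return_attention_weights=True)
>>> # out has shape (N, L, O) = (2, 4, 8); attn has shape (N, L, S) = (2, 4, 6),
>>> # averaged across heads since average_attention_weights defaults to True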

Array.multi_head_attention(self, /, *, key=None, value=None, num_heads=8, scale=None, attention_mask=None, in_proj_weights=None, q_proj_weights=None, k_proj_weights=None, v_proj_weights=None, out_proj_weights=None, in_proj_bias=None, out_proj_bias=None, is_causal=False, key_padding_mask=None, bias_k=None, bias_v=None, static_k=None, static_v=None, add_zero_attn=False, return_attention_weights=False, average_attention_weights=True, dropout=0.0, training=False, out=None)[source]#
Return type:

Array
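
For instance, the same self-attention sketch can be expressed via the instance method, with self acting as the query:

>>> q = ivy.random_normal(shape=(2, 4, 8))    # ivy.Array of shape (N, L, Q)
>>> out = q.multi_head_attention(num_heads=2)
>>> # out is an ivy.Array of shape (2, 4, 8)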

Container.multi_head_attention(self, /, *, key=None, value=None, num_heads=8, scale=None, attention_mask=None, in_proj_weights=None, q_proj_weights=None, k_proj_weights=None, v_proj_weights=None, out_proj_weights=None, in_proj_bias=None, out_proj_bias=None, is_causal=False, key_padding_mask=None, bias_k=None, bias_v=None, static_k=None, static_v=None, add_zero_attn=False, return_attention_weights=False, average_attention_weights=True, dropout=0.0, training=False, key_chains=None, to_apply=True, prune_unapplied=False, map_sequences=False, out=None)[source]#
Return type:

Container
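
And similarly for containers, where the operation is applied to each leaf array (a sketch, assuming leaves that share the same embedding dimension):

>>> c = ivy.Container(a=ivy.random_normal(shape=(2, 4, 8)),
...                   b=ivy.random_normal(shape=(2, 6, 8)))
>>> out = c.multi_head_attention(num_heads=2)
>>> # out is an ivy.Container with keys 'a' and 'b', each leaf keeping its input shape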