Upload ModularStarEncoder

Files changed (5)

README.md +199 -0
config.json +44 -0
config.py +81 -0
model.safetensors +3 -0
modularStarEncoder.py +356 -0

README.md ADDED

	@@ -0,0 +1,199 @@

+---
+library_name: transformers
+tags: []
+---
+# Model Card for Model ID
+<!-- Provide a quick summary of what the model is/does. -->
+## Model Details
+### Model Description
+<!-- Provide a longer summary of what this model is. -->
+This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
+- **Developed by:** [More Information Needed]
+- **Funded by [optional]:** [More Information Needed]
+- **Shared by [optional]:** [More Information Needed]
+- **Model type:** [More Information Needed]
+- **Language(s) (NLP):** [More Information Needed]
+- **License:** [More Information Needed]
+- **Finetuned from model [optional]:** [More Information Needed]
+### Model Sources [optional]
+<!-- Provide the basic links for the model. -->
+- **Repository:** [More Information Needed]
+- **Paper [optional]:** [More Information Needed]
+- **Demo [optional]:** [More Information Needed]
+## Uses
+<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
+### Direct Use
+<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
+[More Information Needed]
+### Downstream Use [optional]
+<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
+[More Information Needed]
+### Out-of-Scope Use
+<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
+[More Information Needed]
+## Bias, Risks, and Limitations
+<!-- This section is meant to convey both technical and sociotechnical limitations. -->
+[More Information Needed]
+### Recommendations
+<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
+Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
+## How to Get Started with the Model
+Use the code below to get started with the model.
+[More Information Needed]
+## Training Details
+### Training Data
+<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
+[More Information Needed]
+### Training Procedure
+<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
+#### Preprocessing [optional]
+[More Information Needed]
+#### Training Hyperparameters
+- **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
+#### Speeds, Sizes, Times [optional]
+<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
+[More Information Needed]
+## Evaluation
+<!-- This section describes the evaluation protocols and provides the results. -->
+### Testing Data, Factors & Metrics
+#### Testing Data
+<!-- This should link to a Dataset Card if possible. -->
+[More Information Needed]
+#### Factors
+<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
+[More Information Needed]
+#### Metrics
+<!-- These are the evaluation metrics being used, ideally with a description of why. -->
+[More Information Needed]
+### Results
+[More Information Needed]
+#### Summary
+## Model Examination [optional]
+<!-- Relevant interpretability work for the model goes here -->
+[More Information Needed]
+## Environmental Impact
+<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
+Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
+- **Hardware Type:** [More Information Needed]
+- **Hours used:** [More Information Needed]
+- **Cloud Provider:** [More Information Needed]
+- **Compute Region:** [More Information Needed]
+- **Carbon Emitted:** [More Information Needed]
+## Technical Specifications [optional]
+### Model Architecture and Objective
+[More Information Needed]
+### Compute Infrastructure
+[More Information Needed]
+#### Hardware
+[More Information Needed]
+#### Software
+[More Information Needed]
+## Citation [optional]
+<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
+**BibTeX:**
+[More Information Needed]
+**APA:**
+[More Information Needed]
+## Glossary [optional]
+<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
+[More Information Needed]
+## More Information [optional]
+[More Information Needed]
+## Model Card Authors [optional]
+[More Information Needed]
+## Model Card Contact
+[More Information Needed]

config.json ADDED

	@@ -0,0 +1,44 @@

+{
+  "architectures": [
+    "ModularStarEncoder"
+  ],
+  "attention_dropout": 0.1,
+  "auto_map": {
+    "AutoConfig": "config.ModularStarEncoderConfig",
+    "AutoModel": "modularStarEncoder.ModularStarEncoder"
+  },
+  "bos_token_id": 0,
+  "conditional_size": 4,
+  "embedding_dropout": 0.1,
+  "eos_token_id": 0,
+  "hidden_act": "gelu_pytorch_tanh",
+  "hidden_size": 1024,
+  "initializer_range": 0.018042,
+  "intermediate_size": 12288,
+  "keys_to_ignore_at_inference": "past_key_values",
+  "layer_matryoshka_loss": true,
+  "layer_norm_eps": 1e-05,
+  "matryoshka_layers": [
+    4,
+    9,
+    18,
+    27,
+    36
+  ],
+  "max_position_embeddings": 2048,
+  "mlp_type": "default",
+  "model_type": "ModularStarEncoder",
+  "norm_epsilon": 1e-05,
+  "norm_type": "layer_norm",
+  "num_attention_heads": 16,
+  "num_hidden_layers": 36,
+  "num_key_value_heads": 4,
+  "residual_dropout": 0.1,
+  "rope_theta": 999999.4420358813,
+  "sliding_window": null,
+  "torch_dtype": "bfloat16",
+  "transformers_version": "4.39.3",
+  "use_bias": true,
+  "use_cache": false,
+  "vocab_size": 49156
+}
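
The attention geometry implied by these values can be checked with a little arithmetic. The sketch below is not part of the commit. It copies a few fields from the JSON above and assumes the usual multi-head convention that the per-head width is `hidden_size / num_attention_heads`, with query heads shared evenly across key/value heads.

```python
import json

# Subset of the config.json shown above (values copied verbatim).
config = json.loads("""
{
  "hidden_size": 1024,
  "num_attention_heads": 16,
  "num_key_value_heads": 4,
  "num_hidden_layers": 36,
  "matryoshka_layers": [4, 9, 18, 27, 36]
}
""")

# Assumption: head_dim = hidden_size / num_attention_heads (standard layout).
head_dim = config["hidden_size"] // config["num_attention_heads"]

# Grouped-query attention: each key/value head serves this many query heads.
queries_per_kv_head = config["num_attention_heads"] // config["num_key_value_heads"]

# With output_hidden_states=True the backbone returns the embedding output plus
# one tensor per layer, so every matryoshka index must lie in [0, num_hidden_layers].
valid_exit_points = all(
    0 <= layer <= config["num_hidden_layers"]
    for layer in config["matryoshka_layers"]
)

print(head_dim, queries_per_kv_head, valid_exit_points)
```

Under these assumptions each head is 64 wide, four query heads share one key/value head, and all five matryoshka exit points index a valid hidden state.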

config.py ADDED

	@@ -0,0 +1,81 @@

+from transformers import PretrainedConfig
+from typing import List
+#STARCODER2_PRETRAINED_CONFIG_ARCHIVE_MAP = {}
+class ModularStarEncoderConfig(PretrainedConfig):
+    model_type = "ModularStarEncoder"
+    keys_to_ignore_at_inference = ["past_key_values"]
+    def __init__(
+        self,
+        attention_dropout= 0.1,
+        residual_dropout=  0.1,
+        embedding_dropout=  0.1,
+        bos_token_id=  0,
+        eos_token_id= 0,
+        hidden_act= "gelu_pytorch_tanh",
+        _attn_implementation="flash_attention_2",
+        hidden_size= 1024,
+        conditional_size= 4,
+        initializer_range= 0.018042,
+        intermediate_size= 12288,
+        max_position_embeddings= 2048,
+        mlp_type= "default",
+        model_type= "starcoder2",
+        torch_dtype= "bfloat16",
+        layer_matryoshka_loss= True,
+        matryoshka_layers= [4,9,18,27,36],
+        norm_epsilon= 1e-05,
+        layer_norm_eps=1e-05,
+        norm_type= "layer_norm",
+        num_attention_heads= 16,
+        num_hidden_layers= 36,
+        num_key_value_heads= 4,
+        rope_theta= 999999.4420358813,
+        sliding_window= None,
+        transformers_version= "4.39.3",
+        use_bias= True,
+        use_cache= False,
+        vocab_size= 49156,
+        pad_token_id=0,
+        **kwargs,
+    ):
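+        if _attn_implementation not in ["flash_attention_2", "sdpa"]:
+            raise ValueError(f"`_attn_implementation` must be 'flash_attention_2' or 'sdpa', got {_attn_implementation}.")
+        # NOTE: the trailing comma on each assignment below stores the value as a
+        # 1-tuple, e.g. (0.1,); ModularStarEncoder.__init__ unwraps these again.
+        self.attention_dropout=attention_dropout ,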
+        self.residual_dropout=  residual_dropout,
+        self.embedding_dropout=  embedding_dropout,
+        self.bos_token_id=  bos_token_id,
+        self.eos_token_id= eos_token_id,
+        self.hidden_act= hidden_act,
+        self._attn_implementation=_attn_implementation,
+        self.hidden_size= hidden_size,
+        self.conditional_size= conditional_size,
+        self.initializer_range= initializer_range,
+        self.intermediate_size= intermediate_size,
+        self.max_position_embeddings= max_position_embeddings,
+        self.mlp_type= mlp_type,
+        self.model_type= model_type,
+        self.torch_dtype= torch_dtype,
+        self.layer_matryoshka_loss= layer_matryoshka_loss,
+        self.matryoshka_layers= matryoshka_layers,
+        self.norm_epsilon= norm_epsilon,
+        self.layer_norm_eps=layer_norm_eps,
+        self.norm_type= norm_type,
+        self.num_attention_heads= num_attention_heads,
+        self.num_hidden_layers= num_hidden_layers,
+        self.num_key_value_heads= num_key_value_heads,
+        self.rope_theta= rope_theta,
+        self.sliding_window= sliding_window,
+        self.transformers_version= transformers_version,
+        self.use_bias= use_bias,
+        self.use_cache= use_cache,
+        self.vocab_size= vocab_size,
+        self.pad_token_id=pad_token_id,
+        super().__init__(
+            bos_token_id=bos_token_id,
+            eos_token_id=eos_token_id,
+            **kwargs)
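
Every assignment in `__init__` above ends with a comma, which in Python turns the right-hand side into a one-element tuple. The following sketch illustrates that behavior in isolation. `TupleConfig` and `unwrap_singletons` are hypothetical names invented for this example and do not appear in the commit. The helper is a narrower variant of the unwrapping loop in `ModularStarEncoder.__init__`, which also unwraps lists.

```python
class TupleConfig:
    """Hypothetical stand-in that mimics the assignment style of config.py."""

    def __init__(self, hidden_size=1024, matryoshka_layers=(4, 9, 18, 27, 36)):
        # The trailing commas make each stored value a 1-tuple.
        self.hidden_size = hidden_size,
        self.matryoshka_layers = list(matryoshka_layers),


def unwrap_singletons(obj):
    """Replace every 1-tuple attribute with its only element (hypothetical helper)."""
    for name, value in list(vars(obj).items()):
        if isinstance(value, tuple) and len(value) == 1:
            setattr(obj, name, value[0])
    return obj


wrapped = TupleConfig()
print(wrapped.hidden_size)        # (1024,)

unwrapped = unwrap_singletons(TupleConfig())
print(unwrapped.hidden_size)      # 1024
print(unwrapped.matryoshka_layers)
```

Restricting the unwrap to one-element tuples leaves genuine list values, such as the layer indices, untouched.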

model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:cb424de4f8c7ea7b7bc437b4c1aed15e71176e57ba2be1fd5558cd6887fbb866
+size 2123859442
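
The three lines above are a Git LFS pointer, not the weights themselves. As a rough sanity check, the sketch below parses the pointer and estimates the parameter count. It assumes nearly all bytes are `bfloat16` values (2 bytes each, matching `torch_dtype` in config.json) and ignores the safetensors header, so the figure is an upper-bound approximation only. `parse_lfs_pointer` is a helper written for this example.

```python
POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:cb424de4f8c7ea7b7bc437b4c1aed15e71176e57ba2be1fd5558cd6887fbb866
size 2123859442
"""


def parse_lfs_pointer(text: str) -> dict:
    """Split each 'key value' line of a Git LFS pointer into a dict entry."""
    fields = dict(line.split(" ", 1) for line in text.splitlines() if line.strip())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "digest": digest,
        "size_bytes": int(fields["size"]),
    }


pointer = parse_lfs_pointer(POINTER)
size_gib = pointer["size_bytes"] / 2**30

# Assumption: 2 bytes per parameter (bfloat16), header overhead ignored.
approx_params = pointer["size_bytes"] // 2

print(f"{size_gib:.2f} GiB, about {approx_params / 1e9:.2f}B parameters")
```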

modularStarEncoder.py ADDED

	@@ -0,0 +1,356 @@

+from transformers import  Starcoder2Model
+import sys
+from config import ModularStarEncoderConfig
+import os
+from dataclasses import dataclass
+from typing import Optional, Tuple, Union, List
+import torch
+import torch.utils.checkpoint
+from torch import nn
+from torch.nn import  CrossEntropyLoss
+from transformers.activations import ACT2FN
+from transformers.modeling_utils import PreTrainedModel
+from transformers.utils import (
+    ModelOutput,
+    logging,
+)
+logger = logging.get_logger(__name__)
+class StarEncoder2PreTrainedModel(PreTrainedModel):
+    """
+    An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained
+    models.
+    """
+    config_class = ModularStarEncoderConfig
+    base_model_prefix = "ModularStarEncoder"
+    model_type = "ModularStarEncoder"
+    supports_gradient_checkpointing = True
+    _supports_flash_attn_2 = True
+    _supports_sdpa = True
+    _supports_cache_class = True
+    def _init_weights(self, module):
+        """Initialize the weights"""
+        if isinstance(module, nn.Linear):
+            # Slightly different from the TF version which uses truncated_normal for initialization
+            # cf https://github.com/pytorch/pytorch/pull/5617
+            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
+            if module.bias is not None:
+                module.bias.data.zero_()
+        elif isinstance(module, nn.Embedding):
+            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
+            if module.padding_idx is not None:
+                module.weight.data[module.padding_idx].zero_()
+        elif isinstance(module, nn.LayerNorm):
+            module.bias.data.zero_()
+            module.weight.data.fill_(1.0)
+class StarEncoder2Pooler(nn.Module):
+    def __init__(self, config):
+        super().__init__()
+        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
+        self.activation = nn.Tanh()
+    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
+        # We "pool" the model by simply taking the hidden state corresponding
+        # to the last token.
+        last_token_tensor = hidden_states[:, -1]
+        pooled_output = self.dense(last_token_tensor)
+        pooled_output = self.activation(pooled_output)
+        return pooled_output
+@dataclass
+class ModularStarEncoderOutput(ModelOutput):
+    """
+    Output type of [`ModularStarEncoder`].
+    Args:
+        projected_pooled_normalized (`List[torch.FloatTensor]`, each of shape `(batch_size, hidden_size)`):
+            One embedding per entry of `config.matryoshka_layers`. The hidden states of that layer are passed
+            through the matching projection head, mean-pooled over the pooling mask, and rescaled so that their
+            norm is at most 1.
+        raw_hidden_states (`tuple(torch.FloatTensor)`):
+            Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of
+            shape `(batch_size, sequence_length, hidden_size)`.
+            Hidden-states of the model at the output of each layer plus the initial embedding outputs.
+        attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
+            Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
+            sequence_length)`.
+            Attention weights after the attention softmax, used to compute the weighted average in the self-attention
+            heads.
+    """
+    projected_pooled_normalized: Optional[List[torch.FloatTensor]] = None
+    raw_hidden_states: Optional[Tuple[torch.FloatTensor]] = None
+    attentions: Optional[Tuple[torch.FloatTensor]] = None
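+    # NOTE: leftover from a pre-training head. Nothing in this file calls it, and the attributes it reads
+    # (`is_matryoshka`, `predictions`, `seq_relationship`, `conditional_embeddings`) are not defined on this class.
+    def forward(self, sequence_output, pooled_output,idx_layer: Optional[torch.Tensor] = None):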
+        if self.is_matryoshka:
+            device_sequence = sequence_output.get_device()
+            if device_sequence<0:
+                device_sequence = "cpu"
+            prediction_scores = self.predictions(torch.cat([sequence_output , self.conditional_embeddings(torch.tensor(idx_layer,device=device_sequence).int()).expand(sequence_output.size()[0],sequence_output.size()[1],-1)],dim=-1))
+            seq_relationship_score = self.seq_relationship(torch.cat([pooled_output , self.conditional_embeddings(torch.tensor(idx_layer,device=device_sequence).int()).expand(pooled_output.size()[0],-1)],dim=-1))
+        else:
+            prediction_scores = self.predictions(sequence_output)
+            seq_relationship_score = self.seq_relationship(pooled_output)
+        return prediction_scores, seq_relationship_score
+def normalize(my_tensor):
+    embedding_norms = my_tensor.norm(dim=0)
+    normalizing_factor = torch.where(  # Only normalize embeddings with norm > 1.0.
+        embedding_norms > 1.0, embedding_norms, torch.tensor(1)
+    )
+    normalized_tensor = my_tensor / normalizing_factor
+    return normalized_tensor
+def pooling(x: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
+    """Pools a batch of vector sequences into a batch of vector global representations.
+    It does so by taking the average representation of the sequence, as indicated by the mask.
+    Args:
+        x (torch.Tensor): Batch of vector sequences with shape [B, T, F].
+        mask (torch.Tensor): Batch of masks with shape [B, T].
+    Returns:
+        torch.Tensor: Pooled version of the input batch with shape [B, F].
+    """
+    # Expand the mask to match the feature dimensions for proper masking
+    mask_expanded = mask.unsqueeze(-1)  # Shape [B, T, 1]
+    # Apply the mask to the input tensor
+    masked_x = x * mask_expanded  # Shape [B, T, F]
+    # Sum along the time dimension
+    sum_x = masked_x.sum(dim=1)  # Shape [B, F]
+    # Calculate the length of valid (non-padded) elements
+    valid_lengths = mask.sum(dim=1).clamp(min=1).unsqueeze(-1)  # Shape [B, 1]
+    # Calculate the average pooling, avoiding division by zero
+    pooled_x = sum_x / valid_lengths  # Shape [B, F]
+    return pooled_x
+def pool_and_normalize(
+    features_sequence: torch.Tensor,
+    attention_masks: torch.Tensor,
+    return_norms: bool = False,
+) -> Union[torch.Tensor, List[torch.Tensor]]:
+    """Temporal pooling of sequences of vectors and projection onto the unit ball
+    (vectors whose norm is already at most 1 are left unchanged).
+    Args:
+        features_sequence (torch.Tensor): Input features with shape [B, T, F].
+        attention_masks (torch.Tensor): Pooling masks with shape [B, T].
+        return_norms (bool, optional): Whether to additionally return the norms. Defaults to False.
+    Returns:
+        Union[torch.Tensor, List[torch.Tensor]]: Pooled and normalized vectors with shape [B, F],
+            followed by the pre-normalization norms with shape [B] when `return_norms` is True.
+    """
+    pooled_embeddings = pooling(features_sequence, attention_masks)
+    embedding_norms = pooled_embeddings.norm(dim=1)
+    normalizing_factor = torch.where(  # Only normalize embeddings with norm > 1.0.
+        embedding_norms > 1.0, embedding_norms, torch.ones_like(embedding_norms)
+    )
+    pooled_normalized_embeddings = pooled_embeddings / normalizing_factor[:, None]
+    if return_norms:
+        return pooled_normalized_embeddings, embedding_norms
+    else:
+        return pooled_normalized_embeddings
+def get_pooling_mask(
+    input_ids: torch.Tensor, sep_token_id: Union[int, float]
+) -> torch.Tensor:
+    """Gets pooling masks. For a sequence of input tokens, the mask is
+    a sequence of zeros up to the last [SEP] occurrence, and ones from that position onward.
+    If a row contains no [SEP], only its final position is set to 1.
+    Args:
+        input_ids (torch.Tensor): Batch of input ids with shape [B, T].
+        sep_token_id (Union[int, float]): Id for [SEP] token.
+    Returns:
+        torch.Tensor: Batch of pooling masks with shape [B, T]
+    """
+    # idx is the position of the last occurrence of sep_token_id in each row of input_ids
+    idx = (input_ids == sep_token_id).float().flip(1).argmax(1)
+    idx = input_ids.size(-1)-idx-1
+    repeated_idx = idx.unsqueeze(1).repeat(1, input_ids.size(1))
+    # Build the position grid on the same device as input_ids to avoid a device mismatch
+    ranges = torch.arange(input_ids.size(1), device=input_ids.device).repeat(input_ids.size(0), 1)
+    pooling_mask = (repeated_idx <= ranges).long()
+    return pooling_mask
+def adapt_model(model,config,till_layer:int):
+    model = model.starEncoder2
+    encoder_config = config
+    layers = encoder_config.matryoshka_layers
+    feature_dim = encoder_config.hidden_size
+    model.projection_heads = torch.nn.ModuleList()
+    if till_layer:
+        print(f"ATTENTION: till layer is on, you are pruning the model keeping just the first {till_layer} layers")
+        model.layers = model.layers[:till_layer]
+        model.projection_heads.append(torch.nn.Sequential(
+                    torch.nn.Linear(feature_dim, feature_dim),
+                    torch.nn.LeakyReLU(),
+                    torch.nn.Linear(feature_dim, feature_dim),
+                ))
+    else:
+        for layer in layers:
+            model.projection_heads.append(torch.nn.Sequential(
+                    torch.nn.Linear(feature_dim, feature_dim),
+                    torch.nn.LeakyReLU(),
+                    torch.nn.Linear(feature_dim, feature_dim),
+                ))
+    # Turn off causal masking so that attention is bidirectional
+    for layer in model.layers:
+        layer.self_attn.is_causal=False
+    model.temperature_coef = torch.nn.Parameter(torch.Tensor([10.0]),requires_grad=False)
+    return model
+class ModularStarEncoder(StarEncoder2PreTrainedModel):
+    _tied_weights_keys = ["predictions.decoder.bias", "cls.predictions.decoder.weight"]
+    config_class = ModularStarEncoderConfig
+    def __init__(self, config):
+        super().__init__(config)
+        self.model_type = "ModularStarEncoder"
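+        # config.py stores each attribute as a 1-tuple (trailing commas), so replace every
+        # non-empty tuple or list attribute with its first element before the config is read.
+        for element in dir(config):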
+            value = getattr(config, element)  # Get the attribute value
+            if (isinstance(value, tuple) or isinstance(value, list)) and len(value)>0:
+                setattr(config, element, value[0])
+        self.layer_matryoshka_loss = config.layer_matryoshka_loss
+        self.matryoshka_layers = config.matryoshka_layers
+        self.starEncoder2 = Starcoder2Model(config)
+        #setting off causal masking
+        for layer in self.starEncoder2.layers:
+            layer.self_attn.is_causal=False
+        # Initialize weights and apply final processing
+        self.post_init()
+        self.starEncoder2 = adapt_model(self ,config=config,till_layer=False)
+    def forward(
+        self,
+        input_ids: Optional[torch.Tensor] = None,
+        attention_mask: Optional[torch.Tensor] = None,
+        #token_type_ids: Optional[torch.Tensor] = None,
+        position_ids: Optional[torch.Tensor] = None,
+        head_mask: Optional[torch.Tensor] = None,
+        inputs_embeds: Optional[torch.Tensor] = None,
+        labels: Optional[torch.Tensor] = None,
+        next_sentence_label: Optional[torch.Tensor] = None,
+        output_attentions: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
+        sep_token_id:Optional[int] = 49152,
+    ) -> Union[Tuple[torch.Tensor], ModularStarEncoderOutput]:
+        r"""
+            labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
+                Labels for computing the masked language modeling loss. Indices should be in `[-100, 0, ...,
+                config.vocab_size]` (see `input_ids` docstring) Tokens with indices set to `-100` are ignored (masked),
+                the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`
+            next_sentence_label (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
+                This label is assigned to the in context loss:
+                - 0 indicates sequence B belongs to the same repository of A,
+                - 1 indicates sequence B is a random repository.
+            kwargs (`Dict[str, any]`, optional, defaults to *{}*):
+                Used to hide legacy arguments that have been deprecated.
+        """
+        # `return_dict` is not a parameter of this method, and the attribute access
+        # below requires a ModelOutput, so it is fixed to True.
+        return_dict = True
+        outputs = self.starEncoder2(
+                input_ids,
+                attention_mask=attention_mask,
+                position_ids=position_ids,
+                inputs_embeds=inputs_embeds,
+                output_attentions=output_attentions,
+                output_hidden_states=True,
+                return_dict=return_dict,
+            )
+        # Tuple holding the embedding output followed by the output of every layer
+        source_embedding = outputs.hidden_states
+        # Use `.device` rather than `get_device()`, which returns -1 for CPU tensors
+        DEVICE = source_embedding[-1].device
+        try:
+            projection_fn = self.starEncoder2.module.projection_heads
+            temp_coef = self.starEncoder2.module.temperature_coef
+        except AttributeError:
+            projection_fn = self.starEncoder2.projection_heads
+            temp_coef = self.starEncoder2.temperature_coef
+        for head in projection_fn:
+            head.to(DEVICE)
+        temp_coef.to(DEVICE)
+        pooling_mask_source_targets = get_pooling_mask(
+                input_ids, sep_token_id
+            )  # Pooling masks are 0 up to the last [SEP] occurrence, then all ones.
+        pooled_and_normalized = []
+        for idx,matr_layer in enumerate(self.matryoshka_layers):
+            # One projection head per matryoshka layer, applied to that layer's hidden states
+            source_embedding_proj = projection_fn[idx](source_embedding[matr_layer])
+            normalized_source_embedding, embedding_norms = pool_and_normalize(
+                        source_embedding_proj,
+                        pooling_mask_source_targets,
+                        return_norms=True,
+                    )
+            pooled_and_normalized.append(normalized_source_embedding)
+        return ModularStarEncoderOutput(
+            projected_pooled_normalized = pooled_and_normalized,
+            raw_hidden_states=outputs.hidden_states,
+            attentions=outputs.attentions,
+        )
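
The pooling path in `forward` chains three steps: build a mask from the last `[SEP]` position, average the masked hidden states, and rescale only when the norm exceeds 1. The dependency-free sketch below mirrors that logic on a single toy sequence so the arithmetic can be followed by hand. It uses plain Python lists instead of tensors, and the function names are invented for this illustration; it is not the code path the model runs.

```python
import math
from typing import List

SEP = 49152  # default `sep_token_id` in ModularStarEncoder.forward


def pooling_mask(ids: List[int], sep_id: int = SEP) -> List[int]:
    """0 before the last `sep_id`, 1 from it onward (last position only if absent)."""
    last = max((i for i, tok in enumerate(ids) if tok == sep_id), default=len(ids) - 1)
    return [int(i >= last) for i in range(len(ids))]


def masked_mean(vectors: List[List[float]], mask: List[int]) -> List[float]:
    """Average the vectors whose mask entry is 1; the count is clamped to at least 1."""
    count = max(sum(mask), 1)
    dim = len(vectors[0])
    return [sum(v[d] * m for v, m in zip(vectors, mask)) / count for d in range(dim)]


def clip_to_unit_ball(vec: List[float]) -> List[float]:
    """Divide by the L2 norm only when it exceeds 1, as pool_and_normalize does."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 1.0 else vec


# Toy sequence: two tokens, a separator, then two more tokens (2-dim hidden states).
ids = [5, 7, SEP, 11, 13]
hidden = [[9.0, 9.0], [9.0, 9.0], [3.0, 0.0], [0.0, 4.0], [3.0, 8.0]]

mask = pooling_mask(ids)              # [0, 0, 1, 1, 1]
pooled = masked_mean(hidden, mask)    # [2.0, 4.0]
embedding = clip_to_unit_ball(pooled)
print(mask, pooled, embedding)
```

Only the separator and the tokens after it contribute to the average, so the two leading `[9.0, 9.0]` vectors have no effect on the result.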