---
license: mit
datasets:
- dblpacm
language:
- en
metrics:
- loss
- accuracy
- recall
- precision
- f1
tags:
- entity-matching
- similarity-comparison
- preprocessing
- neer-match
model-index:
- name: DBLP-ACM Entity Matching Model
  results:
  - task:
      type: entity-matching
      name: Entity Matching
    dataset:
      type: dblpacm
      name: DBLP-ACM
      config: default
      split: test
    metrics:
    - type: loss
      value: 1.9029e-09
      name: Test Loss
    - type: accuracy
      value: 0.9999
      name: Test Accuracy
    - type: recall
      value: 0.9932
      name: Test Recall
    - type: precision
      value: 0.9419
      name: Test Precision
    - type: f1
      value: 0.9668946637
      name: Test F1 Score
---

## Preprocessing

Before training, the `DBLP-ACM` dataset was preprocessed using the `prepare.format` function from the `neer-match-utilities` library. The following preprocessing steps were applied:

1. **Numeric Harmonization**:
   - Missing numeric values were filled with 0.
   - The `year` column was converted to numeric format.

2. **String Standardization**:
   - Missing string values were replaced with placeholders.
   - All string fields were capitalized to ensure consistency in text formatting.

These preprocessing steps harmonized the input data, so that the similarity measures compare consistently formatted values when matching records.
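
As a rough illustration of the steps listed above, the following sketch applies them to a single record held in a plain dictionary. It is not the implementation of `prepare.format`: the function name `harmonize`, the `PLACEHOLDER` value, the column lists, and the reading of "capitalized" as upper-casing are all assumptions made for this example.

```python
import math

# Hypothetical placeholder; the source does not say which value was used.
PLACEHOLDER = "MISSING"

def harmonize(record, numeric_cols=("year",), string_cols=("title", "authors", "venue")):
    """Return a cleaned copy of one record (illustrative only)."""
    out = dict(record)
    for col in numeric_cols:
        value = out.get(col)
        try:
            number = float(value)  # e.g. "1999" -> 1999.0
        except (TypeError, ValueError):
            number = 0.0  # assumption: missing or unparseable values become 0
        out[col] = 0.0 if math.isnan(number) else number
    for col in string_cols:
        value = out.get(col)
        # Missing strings get the placeholder; all others are upper-cased.
        out[col] = PLACEHOLDER if value is None else str(value).upper()
    return out

example = harmonize(
    {"title": "Data Cleaning", "authors": None, "venue": "vldb", "year": "1999"}
)
```

Here `example["year"]` is a float and `example["authors"]` holds the placeholder, so every field has a defined value before similarities are computed.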

---

## Similarity Map

The model uses a `SimilarityMap` to compute similarity scores between attributes of records. The following similarity metrics were applied:

```python
similarity_map = {
    "title": ["levenshtein", "jaro_winkler", "partial_ratio", "token_sort_ratio", "token_set_ratio", "partial_token_set_ratio"],
    "authors": ["levenshtein", "jaro_winkler", "partial_ratio", "token_sort_ratio", "token_set_ratio", "partial_token_set_ratio"],
    "venue": ["levenshtein", "jaro_winkler", "partial_ratio", "token_sort_ratio", "token_set_ratio", "partial_token_set_ratio", "notmissing"],
    "year": ["euclidean", "gaussian", "notzero"],
}
```
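
To get a feel for the size of this map, the sketch below expands it into individual (field, metric) pairs in plain Python. It assumes that each pair contributes one similarity score per compared record pair; it does not call the `SimilarityMap` class, and the helper names `string_metrics`, `pairs`, and `n_features` are introduced only for this example.

```python
# Shared string metrics, written once to keep the example short.
string_metrics = [
    "levenshtein", "jaro_winkler", "partial_ratio",
    "token_sort_ratio", "token_set_ratio", "partial_token_set_ratio",
]

similarity_map = {
    "title": string_metrics,
    "authors": string_metrics,
    "venue": string_metrics + ["notmissing"],
    "year": ["euclidean", "gaussian", "notzero"],
}

# One (field, metric) pair per similarity score, under the stated assumption.
pairs = [
    (field, metric)
    for field, metrics in similarity_map.items()
    for metric in metrics
]
n_features = len(pairs)  # 6 + 6 + 7 + 3
```

Under that assumption, the map above yields 22 similarity scores for each candidate pair of records.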

---

## Fitting the Model

The model was trained using the `fit` method with the custom `focal_loss` loss function.
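
For readers unfamiliar with focal loss, here is a minimal sketch of the standard binary form for a single prediction. It is not the `focal_loss` implementation used for this model, and the `alpha` and `gamma` values are commonly cited defaults chosen for illustration; the settings actually used in training are not stated here.

```python
import math

def binary_focal_loss(y_true, p_pred, alpha=0.25, gamma=2.0, eps=1e-7):
    """Standard binary focal loss for one example (illustrative only)."""
    # Clip the prediction to avoid log(0).
    p = min(max(p_pred, eps), 1.0 - eps)
    # p_t is the predicted probability of the true class.
    p_t = p if y_true == 1 else 1.0 - p
    alpha_t = alpha if y_true == 1 else 1.0 - alpha
    # The (1 - p_t) ** gamma factor down-weights well-classified examples.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

easy = binary_focal_loss(1, 0.95)  # confident, correct prediction
hard = binary_focal_loss(1, 0.30)  # poorly classified match
```

Because of the modulating factor, `hard` is much larger than `easy`. This is why focal loss is often chosen for heavily imbalanced problems such as entity matching, where non-matches far outnumber matches.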

### Training Configuration
The training parameters deviated from the default values in the following ways:
- **Epochs**: 60
- **Mismatch Share**: 1.0

Before training, the labeled data was split into training and test sets using the `split_test_train` method of `neer_match_utilities` with a `test_ratio` of 0.8.
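
The effect of a `test_ratio` of 0.8 can be sketched with a simple random split in plain Python. The function below is hypothetical and does not reproduce the signature or behaviour of `split_test_train`; the `seed` parameter and the rounding rule are assumptions made for this example.

```python
import random

def split_test_train_sketch(rows, test_ratio, seed=0):
    """Shuffle rows and return (train, test) lists (illustrative only)."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_ratio)
    # The first n_test shuffled rows form the test set; the rest are for training.
    return shuffled[n_test:], shuffled[:n_test]

train, test = split_test_train_sketch(range(10), test_ratio=0.8)
```

With ten labeled rows, this puts eight in the test set and two in the training set, which shows that a ratio of 0.8 reserves most of the labeled data for evaluation.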