---
license: mit
datasets:
  - dblpacm
language:
  - en
metrics:
  - loss
  - accuracy
  - recall
  - precision
  - f1
tags:
  - entity-matching
  - similarity-comparison
  - preprocessing
  - neer-match
model-index:
  - name: DBLP-ACM Entity Matching Model
    results:
      - task:
          type: entity-matching
          name: Entity Matching
        dataset:
          type: dblpacm
          name: DBLP-ACM
          config: default
          split: test
        metrics:
          - type: loss
            value: 1.9029e-9
            name: Test Loss
          - type: accuracy
            value: 0.9999
            name: Test Accuracy
          - type: recall
            value: 0.9932
            name: Test Recall
          - type: precision
            value: 0.9419
            name: Test Precision
          - type: f1
            value: 0.9668946637
            name: Test F1 Score
---

## Preprocessing

Before training, the DBLP-ACM dataset was preprocessed using the `prepare.format` function from the `neer-match-utilities` library. The following preprocessing steps were applied:

1. Numeric Harmonization:
   - Missing numeric values were filled with 0.
   - The `year` column was converted to numeric format.
2. String Standardization:
   - Missing string values were replaced with placeholders.
   - All string fields were capitalized to ensure consistency in text formatting.

These preprocessing steps ensured that the input data was harmonized and ready for training, improving the model's ability to compare and match records effectively.
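
The sketch below reproduces these steps in plain pandas for illustration only. It is not the `prepare.format` call itself, whose exact signature may differ across `neer-match-utilities` versions; the placeholder string `"MISSING"` and the upper-casing rule are assumptions, and the column names follow the similarity map shown in the next section.

```python
import pandas as pd

def format_records(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative equivalent of the preprocessing described above."""
    df = df.copy()

    # Numeric harmonization: coerce `year` to a numeric type and fill missing values with 0.
    df["year"] = pd.to_numeric(df["year"], errors="coerce").fillna(0)

    # String standardization: replace missing strings with a placeholder and
    # normalize case (the exact capitalization rule used by the library is an assumption here).
    for col in ["title", "authors", "venue"]:
        df[col] = df[col].fillna("MISSING").astype(str).str.upper()

    return df
```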


## Similarity Map

The model uses a `SimilarityMap` to compute similarity scores between attributes of records. The following similarity metrics were applied:

```python
similarity_map = {
    "title": ["levenshtein", "jaro_winkler", "partial_ratio", "token_sort_ratio", "token_set_ratio", "partial_token_set_ratio"],
    "authors": ["levenshtein", "jaro_winkler", "partial_ratio", "token_sort_ratio", "token_set_ratio", "partial_token_set_ratio"],
    "venue": ["levenshtein", "jaro_winkler", "partial_ratio", "token_sort_ratio", "token_set_ratio", "partial_token_set_ratio", "notmissing"],
    "year": ["euclidean", "gaussian", "notzero"],
}
```
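
For orientation, this dictionary is typically wrapped in a `SimilarityMap` object before model construction. The import path below is an assumption based on the `neer-match` documentation and should be verified against the installed version.

```python
# Assumed import path -- verify against the installed neer-match version.
from neer_match.similarity_map import SimilarityMap

smap = SimilarityMap(similarity_map)
```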

## Fitting the Model

The model was trained using the `fit` method and the custom `focal_loss` loss function.
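
The custom `focal_loss` itself is not reproduced here. For orientation, a standard binary focal loss in TensorFlow looks like the minimal sketch below; the parameter values `gamma=2.0` and `alpha=0.25` are common defaults and are assumptions, not necessarily the values used for this model.

```python
import tensorflow as tf

def focal_loss(gamma: float = 2.0, alpha: float = 0.25):
    """Standard binary focal loss; parameter values are illustrative."""
    def loss_fn(y_true, y_pred):
        y_true = tf.cast(y_true, y_pred.dtype)
        # Clip predictions to avoid log(0).
        y_pred = tf.clip_by_value(y_pred, 1e-7, 1.0 - 1e-7)
        # Probability assigned to the true class.
        p_t = y_true * y_pred + (1.0 - y_true) * (1.0 - y_pred)
        # Class-balancing weight.
        alpha_t = y_true * alpha + (1.0 - y_true) * (1.0 - alpha)
        # Down-weight easy examples by (1 - p_t)^gamma.
        return -tf.reduce_mean(alpha_t * tf.pow(1.0 - p_t, gamma) * tf.math.log(p_t))
    return loss_fn
```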

### Training Configuration

The training parameters deviated from the default values in the following ways:

- Epochs: 60
- Mismatch Share: 1.0

Before training, the labeled data was split into training and test sets using the `split_test_train` method from `neer_match_utilities`, with a `test_ratio` of 0.8.
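
Combining the pieces above, a hypothetical training script could look like the sketch below. The import paths, the return shape of `split_test_train`, the model class `DLMatchingModel`, and the keyword arguments are all assumptions inferred from the description above and the `neer-match` / `neer-match-utilities` documentation; only the reported values (`test_ratio=0.8`, `epochs=60`, `mismatch_share=1.0`) come from this card.

```python
# Hypothetical end-to-end sketch -- import paths, class names, and argument
# names are assumptions; verify them against the installed package versions.
from neer_match.matching_model import DLMatchingModel
from neer_match_utilities.split import split_test_train

# `left`, `right`, and `matches` are the DBLP records, ACM records, and
# labeled matching pairs (pandas DataFrames) loaded and preprocessed beforehand.
left_train, right_train, matches_train, left_test, right_test, matches_test = \
    split_test_train(left, right, matches, test_ratio=0.8)

# Build and train the matching model with the configuration reported above.
model = DLMatchingModel(similarity_map=smap)
model.fit(
    left_train,
    right_train,
    matches_train,
    epochs=60,
    mismatch_share=1.0,
)
```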