license: mit
datasets:
- dblpacm
language:
- en
metrics:
- loss
- accuracy
- recall
- precision
- f1
tags:
- entity-matching
- similarity-comparison
- preprocessing
- neer-match
model-index:
- name: DBLP-ACM Entity Matching Model
results:
- task:
type: entity-matching
name: Entity Matching
dataset:
type: dblpacm
name: DBLP-ACM
config: default
split: test
metrics:
- type: loss
value: 1.9029e-9
name: Test Loss
- type: accuracy
value: 0.9999
name: Test Accuracy
- type: recall
value: 0.9932
name: Test Recall
- type: precision
value: 0.9419
name: Test Precision
- type: f1
value: 0.9668946637
name: Test F1 Score
## Preprocessing

Before training, the DBLP-ACM dataset was preprocessed using the `prepare.format` function from the `neer-match-utilities` library. The following preprocessing steps were applied:
**Numeric Harmonization:**
- Missing numeric values were filled with 0.
- The `year` column was converted to numeric format.
**String Standardization:**
- Missing string values were replaced with placeholders.
- All string fields were capitalized to ensure consistency in text formatting.
These preprocessing steps ensured that the input data was harmonized and ready for training, improving the model's ability to compare and match records effectively.
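For illustration only, the transformations described above can be approximated with plain pandas. This is a sketch of the equivalent logic, not the `prepare.format` call itself; the placeholder string and the exact casing rule are assumptions.

```python
import pandas as pd

def harmonize(df: pd.DataFrame) -> pd.DataFrame:
    """Sketch of the preprocessing described above (not the neer-match-utilities call)."""
    df = df.copy()

    # Numeric harmonization: coerce `year` to numeric and fill missing values with 0.
    df["year"] = pd.to_numeric(df["year"], errors="coerce").fillna(0)

    # String standardization: replace missing strings with a placeholder and
    # capitalize text fields. The placeholder value and the use of str.capitalize()
    # (rather than, e.g., str.upper()) are assumptions, not the library's behavior.
    for col in ["title", "authors", "venue"]:
        df[col] = df[col].fillna("unknown").astype(str).str.capitalize()

    return df
```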
## Similarity Map

The model uses a `SimilarityMap` to compute similarity scores between attributes of records. The following similarity metrics were applied:
```python
similarity_map = {
    "title": ["levenshtein", "jaro_winkler", "partial_ratio", "token_sort_ratio", "token_set_ratio", "partial_token_set_ratio"],
    "authors": ["levenshtein", "jaro_winkler", "partial_ratio", "token_sort_ratio", "token_set_ratio", "partial_token_set_ratio"],
    "venue": ["levenshtein", "jaro_winkler", "partial_ratio", "token_sort_ratio", "token_set_ratio", "partial_token_set_ratio", "notmissing"],
    "year": ["euclidean", "gaussian", "notzero"],
}
```
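The string measures in the map (Levenshtein, Jaro-Winkler, and the token-based ratios) are standard fuzzy-matching metrics. As a rough illustration of what they compute, comparable scores can be obtained with the `rapidfuzz` package; the scaling to [0, 1] shown here is an assumption, and this is not how neer-match evaluates the map internally.

```python
from rapidfuzz import fuzz
from rapidfuzz.distance import JaroWinkler, Levenshtein

a = "Efficient query evaluation on probabilistic databases"
b = "Query evaluation on probabilistic databases"

# Normalized similarities in [0, 1]; the scaling convention is an assumption.
print(Levenshtein.normalized_similarity(a, b))   # edit-distance based
print(JaroWinkler.similarity(a, b))              # prefix-weighted character similarity
print(fuzz.partial_ratio(a, b) / 100)            # best-matching substring alignment
print(fuzz.token_sort_ratio(a, b) / 100)         # order-insensitive token comparison
print(fuzz.token_set_ratio(a, b) / 100)          # set-based token comparison
```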
## Fitting the Model

The model was trained using the `fit` method with a custom `focal_loss` loss function.
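The card does not state the focal-loss definition or its hyperparameters. The snippet below is a minimal binary focal-loss sketch in TensorFlow; the `alpha` and `gamma` defaults are illustrative assumptions, not the values used for this model.

```python
import tensorflow as tf

def focal_loss(alpha: float = 0.25, gamma: float = 2.0):
    """Binary focal loss: FL(p_t) = -alpha_t * (1 - p_t)**gamma * log(p_t).

    The alpha/gamma defaults are illustrative assumptions, not this model's settings.
    """
    def loss(y_true, y_pred):
        y_true = tf.cast(y_true, y_pred.dtype)
        eps = tf.keras.backend.epsilon()
        y_pred = tf.clip_by_value(y_pred, eps, 1.0 - eps)
        # Probability assigned to the true class and the corresponding alpha weight.
        p_t = y_true * y_pred + (1.0 - y_true) * (1.0 - y_pred)
        alpha_t = y_true * alpha + (1.0 - y_true) * (1.0 - alpha)
        return tf.reduce_mean(-alpha_t * tf.pow(1.0 - p_t, gamma) * tf.math.log(p_t))
    return loss
```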
### Training Configuration
The training parameters deviated from the default values in the following ways:
- Epochs: 60
- Mismatch Share: 1.0
Before training, the labeled data was split into training and test sets using the `split_test_train` method of `neer_match_utilities` with a `test_ratio` of 0.8.
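Putting the pieces together, a training run along the lines described above could look like the sketch below. The import paths, the `split_test_train` return structure, and the full `fit` signature are assumptions pieced together from this card, not verified against the neer-match / neer-match-utilities APIs; only the epoch count, the mismatch share, and the use of a custom focal loss are stated by the card.

```python
# Hedged sketch only: import paths, return structures, and call signatures are
# assumptions based on this card's description, not verified library APIs.
from neer_match.similarity_map import SimilarityMap      # assumed import path
from neer_match.matching_model import DLMatchingModel    # assumed import path
from neer_match_utilities.split import split_test_train  # assumed import path

smap = SimilarityMap(similarity_map)  # the dict from the Similarity Map section
model = DLMatchingModel(smap)

# `left`, `right`, and `matches` are hypothetical names for the DBLP records,
# the ACM records, and the labeled matching pairs; the return structure of
# split_test_train is likewise assumed.
train, test = split_test_train(left, right, matches, test_ratio=0.8)

# Non-default settings stated in the card: 60 epochs, mismatch share of 1.0,
# and the custom focal loss (see the sketch above).
model.fit(
    *train,
    epochs=60,
    mismatch_share=1.0,
    loss=focal_loss(),
)
```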