maliedvp/dblpacm_neer_match_DL

entity-matching

similarity-comparison


maliedvp committed 3 days ago

Commit 1ebca24 (verified) · 1 Parent(s): e78c24d

Update README.md

Files changed (1)

README.md +89 -3


@@ -1,3 +1,89 @@
----
-license: mit
----

+---
+license: mit
+datasets:
+- dblpacm
+language:
+- en
+metrics:
+- loss
+- accuracy
+- recall
+- precision
+- f1
+tags:
+- entity-matching
+- similarity-comparison
+- preprocessing
+- neer-match
+model-index:
+- name: DBLP-ACM Entity Matching Model
+  results:
+  - task:
+      type: entity-matching
+      name: Entity Matching
+    dataset:
+      type: dblpacm
+      name: DBLP-ACM
+      config: default
+      split: test
+    metrics:
+    - type: loss
+      value: 1.9029e-09
+      name: Test Loss
+    - type: accuracy
+      value: 0.9999
+      name: Test Accuracy
+    - type: recall
+      value: 0.9932
+      name: Test Recall
+    - type: precision
+      value: 0.9419
+      name: Test Precision
+    - type: f1
+      value: 0.9668946637
+      name: Test F1 Score
+---
+## Preprocessing
+Before training, the `DBLP-ACM` dataset was preprocessed using the `prepare.format` function from the `neer-match-utilities` library. The following preprocessing steps were applied:
+1. **Numeric Harmonization**:
+   - Missing numeric values were filled with 0.
+   - The `year` column was converted to numeric format.
+2. **String Standardization**:
+   - Missing string values were replaced with placeholders.
+   - All string fields were capitalized to ensure consistency in text formatting.
+These steps harmonized the input data before training, so that differences in casing or missing values did not distort the similarity comparisons between records.
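The preprocessing steps above can be sketched in plain Python. This is an illustration only, not the implementation of `prepare.format`: the function name `format_records`, the `"MISSING"` placeholder, and the reading of "capitalized" as upper-casing are all assumptions made for the example.

```python
def format_records(records, numeric_fields, string_fields, placeholder="MISSING"):
    """Hypothetical stand-in for the described preprocessing (not the library code)."""
    cleaned = []
    for record in records:
        row = dict(record)
        for field in numeric_fields:
            # Convert to a number; missing or unparseable values become 0.
            try:
                row[field] = float(row.get(field))
            except (TypeError, ValueError):
                row[field] = 0.0
        for field in string_fields:
            value = row.get(field)
            # Replace missing strings with a placeholder (assumed value),
            # otherwise upper-case the text so casing is consistent.
            if value is None or value == "":
                row[field] = placeholder
            else:
                row[field] = str(value).upper()
        cleaned.append(row)
    return cleaned


raw = [
    {"title": "Query Optimization", "year": "1999"},
    {"title": None, "year": None},
]
clean = format_records(raw, numeric_fields=["year"], string_fields=["title"])
```

After this pass every record has a numeric `year` and a non-empty, upper-cased `title`, which is the property the similarity comparisons rely on.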
+---
+## Similarity Map
+The model uses a `SimilarityMap` to compute similarity scores between attributes of records. The following similarity metrics were applied:
+```python
+similarity_map = {
+    "title": ["levenshtein", "jaro_winkler", "partial_ratio", "token_sort_ratio", "token_set_ratio", "partial_token_set_ratio"],
+    "authors": ["levenshtein", "jaro_winkler", "partial_ratio", "token_sort_ratio", "token_set_ratio", "partial_token_set_ratio"],
+    "venue": ["levenshtein", "jaro_winkler", "partial_ratio", "token_sort_ratio", "token_set_ratio", "partial_token_set_ratio", "notmissing"],
+    "year": ["euclidean", "gaussian", "notzero"],
+}
+```
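To make the idea of a similarity map concrete, the sketch below applies a small field-to-metric mapping to one pair of records and returns one score per (field, metric) combination. It is a simplified illustration, not the `SimilarityMap` class: the helper names are hypothetical, the Levenshtein score is normalized by the longer string, and `not_zero` is assumed to return 1 only when both values are non-zero.

```python
def levenshtein_distance(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def levenshtein_similarity(a, b):
    """Edit distance rescaled to [0, 1], where 1 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein_distance(a, b) / longest


def not_zero(x, y):
    """Assumed semantics: 1.0 only if neither value is the 0 fill-in."""
    return 1.0 if x != 0 and y != 0 else 0.0


# Toy mapping in the same shape as the similarity map above (hypothetical helpers).
toy_map = {"title": [levenshtein_similarity], "year": [not_zero]}


def similarity_features(left, right, mapping):
    """One score per (field, metric) pair for a single record pair."""
    return {
        f"{field}_{fn.__name__}": fn(left[field], right[field])
        for field, fns in mapping.items()
        for fn in fns
    }


features = similarity_features(
    {"title": "KITTEN", "year": 1999},
    {"title": "SITTING", "year": 0},
    toy_map,
)
```

Each record pair is thus turned into a fixed-length vector of scores, which is the kind of input a matching model can learn from.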
+---
+## Fitting the Model
+The model was trained using the `fit` method with the custom `focal_loss` loss function.
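For readers unfamiliar with focal loss, the following is a minimal sketch of the standard binary formulation, which down-weights easy examples through the factor `(1 - p_t) ** gamma`. It is not the `focal_loss` used for this model: the function name and the `alpha` and `gamma` defaults are assumptions chosen for illustration.

```python
import math


def binary_focal_loss(y_true, y_pred, alpha=0.25, gamma=2.0, eps=1e-7):
    """Mean binary focal loss: -alpha_t * (1 - p_t)**gamma * log(p_t).

    alpha and gamma are illustrative defaults, not this model's settings.
    """
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)           # avoid log(0)
        p_t = p if y == 1 else 1.0 - p            # probability of the true class
        alpha_t = alpha if y == 1 else 1.0 - alpha
        total += -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
    return total / len(y_true)


easy = binary_focal_loss([1], [0.9])   # confident and correct: small loss
hard = binary_focal_loss([1], [0.1])   # confident and wrong: large loss
```

With `gamma=0` the expression reduces to class-weighted binary cross-entropy, which is a convenient sanity check when experimenting with the parameters.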
+### Training Configuration
+The training parameters deviated from the default values in the following ways:
+- **Epochs**: 60
+- **Mismatch Share**: 1.0
+Before training, the labeled data was split into training and test sets using the `split_test_train` method of `neer_match_utilities` with a `test_ratio` of 0.8.
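A ratio-based split of this kind can be sketched as follows. This is a generic illustration, not the `split_test_train` implementation: `split_by_ratio` and its `seed` parameter are hypothetical, and the sketch simply assumes that `test_ratio` is the share of records assigned to the test set.

```python
import random


def split_by_ratio(items, test_ratio, seed=0):
    """Shuffle deterministically, then assign the first share to the test set."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(items) * test_ratio)
    test = [items[i] for i in indices[:n_test]]
    train = [items[i] for i in indices[n_test:]]
    return train, test


train, test = split_by_ratio(list(range(10)), test_ratio=0.8)
```

Under that assumption, a `test_ratio` of 0.8 leaves only 20% of the labeled records for training, so the reported test metrics are computed on the larger portion of the data.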