---
license: mit
datasets:
- dblpacm
language:
- en
metrics:
- loss
- accuracy
- recall
- precision
- f1
tags:
- entity-matching
- similarity-comparison
- preprocessing
- neer-match
model-index:
- name: DBLP-ACM Entity Matching Model
  results:
  - task:
      type: entity-matching
      name: Entity Matching
    dataset:
      type: dblpacm
      name: DBLP-ACM
      config: default
      split: test
    metrics:
    - type: loss
      value: 1.9029e-09
      name: Test Loss
    - type: accuracy
      value: 0.9999
      name: Test Accuracy
    - type: recall
      value: 0.9932
      name: Test Recall
    - type: precision
      value: 0.9419
      name: Test Precision
    - type: f1
      value: 0.9668946637
      name: Test F1 Score
---

## Preprocessing

Before training, the `DBLP-ACM` dataset was preprocessed using the `prepare.format` function from the `neer-match-utilities` library. The following preprocessing steps were applied:

1. **Numeric Harmonization**:
   - Missing numeric values were filled with 0.
   - The `year` column was converted to numeric format.

2. **String Standardization**:
   - Missing string values were replaced with placeholders.
   - All string fields were capitalized to ensure consistency in text formatting.

These preprocessing steps ensured that the input data was harmonized and ready for training, improving the model's ability to compare and match records effectively.

---

## Similarity Map

The model uses a `SimilarityMap` to compute similarity scores between attributes of records.
The following similarity metrics were applied:

```python
similarity_map = {
    "title": ["levenshtein", "jaro_winkler", "partial_ratio", "token_sort_ratio", "token_set_ratio", "partial_token_set_ratio"],
    "authors": ["levenshtein", "jaro_winkler", "partial_ratio", "token_sort_ratio", "token_set_ratio", "partial_token_set_ratio"],
    "venue": ["levenshtein", "jaro_winkler", "partial_ratio", "token_sort_ratio", "token_set_ratio", "partial_token_set_ratio", "notmissing"],
    "year": ["euclidean", "gaussian", "notzero"],
}
```

---

## Fitting the Model

The model was trained using the `fit` method and the custom `focal_loss` loss function.

### Training Configuration

The training parameters deviated from the default values in the following ways:

- **Epochs**: 60
- **Mismatch Share**: 1.0

Before training, the labeled data was split into training and test sets using the `split_test_train` method of `neer_match_utilities` with a `test_ratio` of 0.8.
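
For readers unfamiliar with focal loss, the following is a minimal sketch of the standard binary formulation, which down-weights easy, well-classified pairs so that training concentrates on hard ones. It is not the `focal_loss` implementation used to train this model: the function name `binary_focal_loss` and the `alpha` and `gamma` defaults are illustrative assumptions, not values taken from the training configuration above.

```python
import math


def binary_focal_loss(y_true, y_pred, alpha=0.25, gamma=2.0, eps=1e-7):
    """Mean binary focal loss (illustrative sketch, hypothetical defaults).

    y_true: iterable of 0/1 labels (1 = matching pair).
    y_pred: iterable of predicted match probabilities in [0, 1].
    """
    y_true = list(y_true)
    y_pred = list(y_pred)
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equally long")

    total = 0.0
    for y, p in zip(y_true, y_pred):
        # Clip the probability to avoid log(0).
        p = min(max(p, eps), 1.0 - eps)
        # p_t is the probability the model assigned to the true class.
        p_t = p if y == 1 else 1.0 - p
        # alpha_t balances the contribution of matches and non-matches.
        alpha_t = alpha if y == 1 else 1.0 - alpha
        # (1 - p_t) ** gamma shrinks the loss of confident, correct predictions.
        total += -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
    return total / len(y_true)
```

With `gamma=0.0` and `alpha=0.5` the expression reduces to half of the ordinary binary cross-entropy, which is a convenient sanity check when experimenting with other settings.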