maliedvp committed on
Commit
1ebca24
·
verified ·
1 Parent(s): e78c24d

Update README.md

Files changed (1)
  1. README.md +89 -3
README.md CHANGED
@@ -1,3 +1,89 @@
1
- ---
2
- license: mit
3
- ---
1
+ ---
2
+ license: mit
3
+ datasets:
4
+ - dblpacm
5
+ language:
6
+ - en
7
+ metrics:
8
+ - loss
9
+ - accuracy
10
+ - recall
11
+ - precision
12
+ - f1
13
+ tags:
14
+ - entity-matching
15
+ - similarity-comparison
16
+ - preprocessing
17
+ - neer-match
18
+ model-index:
19
+ - name: DBLP-ACM Entity Matching Model
20
+   results:
21
+   - task:
22
+       type: entity-matching
23
+       name: Entity Matching
24
+     dataset:
25
+       type: dblpacm
26
+       name: DBLP-ACM
27
+       config: default
28
+       split: test
29
+     metrics:
30
+     - type: loss
31
+       value: 1.9029e-09
32
+       name: Test Loss
33
+     - type: accuracy
34
+       value: 0.9999
35
+       name: Test Accuracy
36
+     - type: recall
37
+       value: 0.9932
38
+       name: Test Recall
39
+     - type: precision
40
+       value: 0.9419
41
+       name: Test Precision
42
+     - type: f1
43
+       value: 0.9668946637
44
+       name: Test F1 Score
45
+ ---
46
+
47
+ ## Preprocessing
48
+
49
+ Before training, the `DBLP-ACM` dataset was preprocessed using the `prepare.format` function from the `neer-match-utilities` library. The following preprocessing steps were applied:
50
+
51
+ 1. **Numeric Harmonization**:
52
+    - Missing numeric values were filled with 0.
53
+    - The `year` column was converted to numeric format.
54
+
55
+ 2. **String Standardization**:
56
+    - Missing string values were replaced with placeholders.
57
+    - All string fields were capitalized to ensure consistency in text formatting.
58
+
59
+ These steps harmonized the input data before training, so that records from both sources are compared on consistently formatted fields.
60
+
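The steps above can be sketched in plain Python. This is only an illustration of the described harmonization, not the `prepare.format` implementation: the helper name `harmonize_record`, the `"MISSING"` placeholder, and the reading of "capitalized" as upper-casing are all assumptions.

```python
# Illustrative sketch only: `harmonize_record` and the "MISSING" placeholder
# are hypothetical and are not part of neer-match-utilities.
NUMERIC_FIELDS = ("year",)
STRING_FIELDS = ("title", "authors", "venue")
PLACEHOLDER = "MISSING"  # assumed placeholder; the actual value is not stated


def harmonize_record(record):
    """Return a copy of `record` with numeric and string fields harmonized."""
    out = dict(record)
    for field in NUMERIC_FIELDS:
        value = out.get(field)
        try:
            # Convert to a number; missing values become 0.
            # (Treating unparsable values as 0 as well is an assumption.)
            out[field] = float(value)
        except (TypeError, ValueError):
            out[field] = 0.0
    for field in STRING_FIELDS:
        value = out.get(field)
        if value is None or str(value).strip() == "":
            out[field] = PLACEHOLDER
        else:
            # Upper-case so that casing differences do not affect similarity.
            out[field] = str(value).strip().upper()
    return out


raw = {"title": "Query optimization", "authors": None, "venue": "vldb", "year": "1999"}
clean = harmonize_record(raw)
```

Applying the same function to both sources before comparison keeps the fields in a common format.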
61
+ ---
62
+
63
+ ## Similarity Map
64
+
65
+ The model uses a `SimilarityMap` to compute similarity scores between attributes of records. The following similarity metrics were applied:
66
+
67
+ ```python
68
+ similarity_map = {
69
+     "title": ["levenshtein", "jaro_winkler", "partial_ratio", "token_sort_ratio", "token_set_ratio", "partial_token_set_ratio"],
70
+     "authors": ["levenshtein", "jaro_winkler", "partial_ratio", "token_sort_ratio", "token_set_ratio", "partial_token_set_ratio"],
71
+     "venue": ["levenshtein", "jaro_winkler", "partial_ratio", "token_sort_ratio", "token_set_ratio", "partial_token_set_ratio", "notmissing"],
72
+     "year": ["euclidean", "gaussian", "notzero"],
73
+ }
74
+ ```
75
+
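To make the map concrete, the sketch below flattens it into (field, metric) pairs, one similarity score per pair for every compared record pair, and shows one possible way to compute a Levenshtein similarity. The `[0, 1]` rescaling in `levenshtein_similarity` is an assumption for illustration; the library may normalize differently.

```python
similarity_map = {
    "title": ["levenshtein", "jaro_winkler", "partial_ratio", "token_sort_ratio", "token_set_ratio", "partial_token_set_ratio"],
    "authors": ["levenshtein", "jaro_winkler", "partial_ratio", "token_sort_ratio", "token_set_ratio", "partial_token_set_ratio"],
    "venue": ["levenshtein", "jaro_winkler", "partial_ratio", "token_sort_ratio", "token_set_ratio", "partial_token_set_ratio", "notmissing"],
    "year": ["euclidean", "gaussian", "notzero"],
}

# Each (field, metric) pair corresponds to one similarity score per record pair.
feature_pairs = [
    (field, metric)
    for field, metrics in similarity_map.items()
    for metric in metrics
]


def levenshtein_similarity(a, b):
    """Edit distance rescaled to [0, 1]; 1.0 means identical strings.

    Hypothetical helper: the rescaling by the longer string is an assumption.
    """
    if not a and not b:
        return 1.0
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return 1.0 - previous[-1] / max(len(a), len(b))


n_features = len(feature_pairs)  # 6 + 6 + 7 + 3 = 22
```

With this map, each candidate record pair is described by 22 similarity scores.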
76
+ ---
77
+
78
+ ## Fitting the Model
79
+
80
+ The model was trained using the `fit` method with the custom `focal_loss` loss function.
81
+
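For readers unfamiliar with focal loss, here is a minimal sketch of the standard binary formulation. It is not the library's `focal_loss`; the function name `binary_focal_loss` is hypothetical, and the `alpha` and `gamma` defaults are the commonly cited values, not values confirmed for this model.

```python
import math


def binary_focal_loss(y_true, p_pred, alpha=0.25, gamma=2.0, eps=1e-7):
    """Focal loss for a single binary prediction.

    alpha and gamma are assumed defaults; the values used in training
    are not stated in this README.
    """
    p = min(max(p_pred, eps), 1.0 - eps)  # clip to avoid log(0)
    p_t = p if y_true == 1 else 1.0 - p   # probability assigned to the true class
    alpha_t = alpha if y_true == 1 else 1.0 - alpha
    # (1 - p_t) ** gamma down-weights examples the model already gets right.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy_match = binary_focal_loss(1, 0.95)  # confident, correct prediction
hard_match = binary_focal_loss(1, 0.30)  # poorly predicted match
```

Because well-classified pairs contribute very little, the loss concentrates on hard examples, which is useful when non-matches vastly outnumber matches.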
82
+ ### Training Configuration
83
+ The training parameters deviated from the default values in the following ways:
84
+ - **Epochs**: 60
85
+ - **Mismatch Share**: 1.0
86
+
87
+ Before training, the labeled data was split into training and test data using the `split_test_train` method of `neer_match_utilities` with a `test_ratio` of 0.8.
88
+
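The split can be illustrated with a small standalone sketch. It assumes that `test_ratio` is the share of labeled items assigned to the test set; `split_by_test_ratio` is a hypothetical helper and does not reproduce the signature or behavior of `split_test_train`.

```python
import random


def split_by_test_ratio(items, test_ratio, seed=0):
    """Shuffle `items` and return (train, test) with `test_ratio` of them in test.

    Hypothetical helper; assumes test_ratio is the test-set share.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_test = round(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]


labeled_pairs = list(range(100))  # stand-in for labeled record pairs
train, test = split_by_test_ratio(labeled_pairs, test_ratio=0.8)
```

Under that assumption, 100 labeled items yield 20 training and 80 test items.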
89
+