Preprocessing

Before training, the wer_leitet dataset was preprocessed using the prepare.format function from the neer-match-utilities library. The following preprocessing steps were applied:

  1. String Standardization:
    • Missing string values were replaced with placeholders.
    • All string fields were capitalized to ensure consistency in text formatting.
  2. Identification of Common Names:
    • Common names were defined as those falling within the 95th percentile of the distribution for first and last names.

These preprocessing steps ensured that the input data was harmonized and ready for training, improving the model's ability to compare and match records effectively.
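
The steps above can be sketched in plain Python as follows. This is an illustration only, not the code of prepare.format. The placeholder value "MISSING", the reading of "capitalized" as upper-casing, and the reading of the 95th-percentile rule as "name frequency at or above the 95th percentile of all name frequencies" are assumptions, and the function names are hypothetical.

```python
import math
from collections import Counter

PLACEHOLDER = "MISSING"  # hypothetical placeholder; the actual value is not documented here


def standardize(value):
    """Replace a missing string with the placeholder, otherwise strip and upper-case it."""
    if value is None or not str(value).strip():
        return PLACEHOLDER
    return str(value).strip().upper()


def common_names(names, percentile=0.95):
    """Return the names whose frequency reaches the given percentile of all name frequencies.

    Uses the nearest-rank percentile; this is one possible reading of the rule.
    """
    counts = Counter(n for n in names if n != PLACEHOLDER)
    if not counts:
        return set()
    freqs = sorted(counts.values())
    rank = max(1, math.ceil(percentile * len(freqs)))
    threshold = freqs[rank - 1]
    return {name for name, count in counts.items() if count >= threshold}


raw_first_names = ["anna", "Anna ", None, "max", "", "ANNA"]
cleaned = [standardize(v) for v in raw_first_names]
common = common_names(cleaned)
# Binary indicator that could feed a field such as "common_name"
is_common = [name in common for name in cleaned]
```

The resulting indicator is the kind of value that a discrete comparison on common_name or common_surname would operate on.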


Similarity Map

The model uses a SimilarityMap to compute similarity scores between attributes of records. The following similarity metrics were applied:

similarity_map = {
    "main_info": ["levenshtein", "jaro_winkler", "partial_ratio", "token_sort_ratio", "token_set_ratio", "partial_token_set_ratio"],
    "Vorstand": ["levenshtein", "jaro_winkler", "notmissing"],
    "StVdAR": ["levenshtein", "jaro_winkler", "notmissing"],
    "address": ["levenshtein", "jaro_winkler", "partial_ratio", "token_sort_ratio", "token_set_ratio", "partial_token_set_ratio", "notmissing"],
    "birth_date": ["discrete", "notmissing"],
    "raw_text": ["token_set_ratio", "partial_token_set_ratio", "notmissing"],
    "common_name": ["discrete", "notmissing"],
    "common_surname": ["discrete", "notmissing"],
}
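
To make the mapping more concrete, the following sketch shows how three of the listed metrics could be evaluated for one pair of records. It is a pure-Python illustration, not the implementation used by the model: the helper names and the "MISSING" placeholder are hypothetical, and the definition of notmissing (1 only if both values are present) is an assumption.

```python
def levenshtein_similarity(a, b):
    """Normalized Levenshtein similarity in [0, 1]; 1.0 means identical strings."""
    if not a and not b:
        return 1.0
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            # deletion, insertion, substitution
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return 1.0 - prev[-1] / max(len(a), len(b))


def discrete(a, b):
    """Exact-match indicator."""
    return 1.0 if a == b else 0.0


def notmissing(a, b, placeholder="MISSING"):
    """Assumed semantics: 1.0 only if neither value is the missing placeholder."""
    return 1.0 if a != placeholder and b != placeholder else 0.0


METRICS = {
    "levenshtein": levenshtein_similarity,
    "discrete": discrete,
    "notmissing": notmissing,
}


def score_pair(left, right, smap):
    """Compute one score per (field, metric) combination for a pair of records."""
    return {
        f"{field}_{metric}": METRICS[metric](left[field], right[field])
        for field, metrics in smap.items()
        for metric in metrics
    }


small_map = {"Vorstand": ["levenshtein", "notmissing"], "birth_date": ["discrete", "notmissing"]}
left = {"Vorstand": "MUELLER", "birth_date": "1900-01-01"}
right = {"Vorstand": "MULLER", "birth_date": "MISSING"}
scores = score_pair(left, right, small_map)
```

Each field therefore contributes as many scores as it has metrics, and the notmissing score lets a model tell a low similarity caused by different values apart from one caused by a missing value.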

Fitting the Model

The model was trained using the fit method and the binary cross-entropy (BCE) loss function.
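
For reference, binary cross-entropy averages the negative log-likelihood of the true match labels under the predicted match probabilities. The sketch below is a generic textbook implementation, not the loss code used during training; the function name and the clipping constant eps are illustrative choices.

```python
import math


def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy for labels in {0, 1} and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


labels = [1, 0, 1, 0]  # 1 = match, 0 = non-match
good = binary_cross_entropy(labels, [0.9, 0.1, 0.8, 0.2])
bad = binary_cross_entropy(labels, [0.1, 0.9, 0.2, 0.8])
```

Confident correct predictions yield a small loss, while confident wrong predictions are penalized heavily.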

Training Configuration

The training parameters deviated from the default values in the following ways:

  • Epochs: 150
  • Mismatch Share: 0.3

Before training, the labeled data was split into training and test sets using the split_test_train method of neer_match_utilities with a test_ratio of 0.3.
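
A 70/30 split of this kind can be sketched as below. This is a generic illustration and does not reproduce the signature or behavior of split_test_train; the function name split_labeled_pairs, the seed, and the toy data are hypothetical.

```python
import random


def split_labeled_pairs(pairs, test_ratio=0.3, seed=42):
    """Shuffle labeled pairs reproducibly and hold out a share of them for testing."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]


# Toy labeled pairs: (left_id, right_id); real data would carry record identifiers
pairs = [(i, i + 100) for i in range(10)]
train_pairs, test_pairs = split_labeled_pairs(pairs, test_ratio=0.3)
```

Holding the test pairs out before fitting keeps the evaluation independent of the records the model saw during training.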

Evaluation results