
German NER experiments: Presidio, spaCy, GLiNER

As I experimented with the Microsoft Presidio live demo for PII detection, I found that neither of the offered models does very well on German text when the objective also includes identifying organization names. Cloning the HuggingFace Space that hosts this demo allows other models to be enabled (by setting the environment variable ALLOW_OTHER_MODELS=1), but this approach largely failed because 1) not all model architectures are supported when loading HuggingFace models and 2) spaCy expects its own model-loading mechanism to be used, e.g. to load German language support. As a more scalable approach, I decided to set up my own NER ranking harness, including a custom dataset.
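
For context: spaCy resolves pipelines through its own install-and-load mechanism rather than a HuggingFace-style from_pretrained download. A minimal illustration, assuming the German pipeline has been installed via `python -m spacy download de_core_news_lg`:

```python
import spacy

# Load the locally installed German pipeline through spaCy's own mechanism
# (not a HuggingFace Hub download)
nlp = spacy.load("de_core_news_lg")
doc = nlp("Die Ostfriesische Brandkasse hat ihren Sitz in Aurich.")
print([(ent.text, ent.label_) for ent in doc.ents])
```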

This harness is, in its scrappy state, available on GitHub. It should not be used as-is, but rather serve as a reference. In addition to the ranker, it includes a plugin for Presidio that allows the GLiNER model to be used; a sketch of that integration follows below. The dataset tests world knowledge and language disambiguation by including well-known and lesser-known organization names such as BMW and Ostfriesische Brandkasse.
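
For illustration, here is a minimal sketch of what such a Presidio recognizer can look like. The class name, label mapping and the urchade/gliner_multi-v2.1 checkpoint are my assumptions for this example; the actual plugin in the repo may differ:

```python
from gliner import GLiNER
from presidio_analyzer import EntityRecognizer, RecognizerResult


class GLiNERRecognizer(EntityRecognizer):
    """Wraps a GLiNER model as a Presidio EntityRecognizer."""

    LABEL_MAP = {  # GLiNER label -> Presidio entity type (assumed mapping)
        "person": "PERSON",
        "organization": "ORGANIZATION",
        "location": "LOCATION",
    }

    def __init__(self, model_name: str = "urchade/gliner_multi-v2.1"):
        self.model = GLiNER.from_pretrained(model_name)
        super().__init__(
            supported_entities=list(self.LABEL_MAP.values()),
            supported_language="de",
            name="GLiNERRecognizer",
        )

    def load(self):
        pass  # model is loaded eagerly in __init__

    def analyze(self, text, entities, nlp_artifacts=None):
        # GLiNER takes the free-text labels it should look for
        predictions = self.model.predict_entities(text, list(self.LABEL_MAP.keys()))
        return [
            RecognizerResult(
                entity_type=self.LABEL_MAP[p["label"]],
                start=p["start"],
                end=p["end"],
                score=p["score"],
            )
            for p in predictions
            if self.LABEL_MAP.get(p["label"]) in entities
        ]


# The recognizer can then be registered with an AnalyzerEngine, e.g.:
#   analyzer.registry.add_recognizer(GLiNERRecognizer())
```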

Findings:

  1. GLiNER can be integrated with both Presidio and spaCy, the latter through the gliner-spacy package (see the pipeline sketch after this list)
  2. spaCy does not, by default, handle organization names; a sample configuration file explains that this is because of many false positives. My repo includes a config file that covers org names (a sketch of such a configuration follows the list).
    1. (my ranker only measures whether entities were successfully redacted; over-redaction is not penalized)
  3. there is a baseline score of 10-30% on my dataset, attained by configurations that do not recognize German and/or organization names; proper configurations achieve > 80%.
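
Regarding finding 1, this is roughly what wiring GLiNER into a spaCy pipeline via gliner-spacy looks like; the checkpoint and label set are illustrative choices, not necessarily what my harness uses:

```python
import spacy
from gliner_spacy.pipeline import GlinerSpacy  # noqa: F401 -- registers the "gliner_spacy" factory

nlp = spacy.blank("de")
nlp.add_pipe("gliner_spacy", config={
    "gliner_model": "urchade/gliner_multi-v2.1",       # multilingual GLiNER checkpoint
    "labels": ["person", "organization", "location"],  # free-text labels to detect
    "style": "ent",                                     # write matches to doc.ents
})

doc = nlp("Max Mustermann wechselt von BMW zur Ostfriesischen Brandkasse.")
print([(ent.text, ent.label_) for ent in doc.ents])
```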
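
Regarding finding 2, a hypothetical sketch of a Presidio NLP-engine configuration that maps spaCy's ORG label to ORGANIZATION instead of dropping it. It mirrors Presidio's configuration format, but it is not the exact config file from my repo, and depending on the Presidio version the spaCy recognizer may additionally need ORGANIZATION among its supported entities:

```python
from presidio_analyzer import AnalyzerEngine
from presidio_analyzer.nlp_engine import NlpEngineProvider

conf = {
    "nlp_engine_name": "spacy",
    "models": [{"lang_code": "de", "model_name": "de_core_news_lg"}],
    "ner_model_configuration": {
        "model_to_presidio_entity_mapping": {
            "PER": "PERSON",
            "LOC": "LOCATION",
            "ORG": "ORGANIZATION",  # kept, despite the false-positive warning
        },
        "labels_to_ignore": ["MISC"],
    },
}

nlp_engine = NlpEngineProvider(nlp_configuration=conf).create_engine()
analyzer = AnalyzerEngine(nlp_engine=nlp_engine, supported_languages=["de"])
print(analyzer.analyze("Sie arbeitet bei der Ostfriesischen Brandkasse in Aurich.",
                       language="de"))
```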

Overall results:

meta-llama/Llama-3.1-8B-Instruct: 86.44%
meta-llama/Llama-3.2-3B-Instruct: 81.36%
google/gemma-2-9b-it: 93.22%
gliner: 94.92%
spacy de_core_news_lg: 77.97%
spacy-gliner: 91.53%

In these tests, the Llama models stood out for over-redacting or otherwise falsifying the input text, which is not reflected negatively in the score. Gemma-2 did not appear to make such mistakes, but I didn't check closely.