Preprocessing Module Overview

Preprocessing is a core step in any Natural Language Understanding (NLU) pipeline. It normalizes input text so that text seen at inference time is represented the same way as the training data, which reduces vocabulary noise and can improve model accuracy. Our preprocessing module provides the following text-cleaning tasks:

  • Case Sensitivity Handling: Transforming all text to a uniform case (usually lower-case) for consistent interpretation.
  • Accent Normalization: Replacing accented characters with their base form (e.g., "ç" becomes "c").
  • Lemmatization: Reducing words to their dictionary form, or lemma (e.g., "running" becomes "run").
  • Stop Words Filtering: Eliminating common words that don't add significant meaning to the text.
  • Punctuation Removal: Stripping the text of any punctuation marks.
  • Data Augmentation: Using Facebook's library to generate additional training samples that contain intentional errors.
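
As an illustration of how several of the tasks above can be chained, here is a minimal sketch that uses only the Python standard library. It is not the module's actual implementation: the function names and the tiny STOP_WORDS set are hypothetical, and lemmatization and data augmentation are left out because they need external libraries.

```python
import string
import unicodedata
from typing import List

# Hypothetical, deliberately tiny stop-word list for illustration only;
# a real list would be language-specific and much larger.
STOP_WORDS = {"the", "a", "an", "of", "to", "is", "and"}

# Translation table that maps every ASCII punctuation mark to a space,
# so that words joined by punctuation stay separated.
_PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def strip_accents(text: str) -> str:
    """Replace accented characters with their base form (e.g. 'ç' -> 'c')."""
    decomposed = unicodedata.normalize("NFKD", text)
    # After NFKD decomposition, accents are separate combining characters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def remove_punctuation(text: str) -> str:
    """Replace ASCII punctuation marks with spaces."""
    return text.translate(_PUNCT_TABLE)


def preprocess(text: str) -> List[str]:
    """Lower-case, strip accents, drop punctuation, filter stop words."""
    text = text.lower()
    text = strip_accents(text)
    text = remove_punctuation(text)
    return [tok for tok in text.split() if tok not in STOP_WORDS]


tokens = preprocess("Le Garçon is HERE, déjà!")
```

The order of the steps matters in this sketch: lower-casing comes first so that the stop-word lookup only has to handle lower-case forms, and punctuation is replaced by spaces instead of being deleted so that "here,now" does not collapse into a single token.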

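Preprocessing matters most with basic vectorization techniques such as TF-IDF. These build their vocabulary directly from surface tokens and carry no pretrained language knowledge, so variants like "Hello" and "hello" are treated as unrelated features unless they are normalized first.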
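
To make the TF-IDF point concrete, the sketch below computes an unsmoothed, textbook TF-IDF in plain Python and compares a raw corpus with a lower-cased one. The tfidf helper and the toy corpus are assumptions made for illustration; libraries such as scikit-learn use a smoothed variant, so their exact weights will differ.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Return one {token: weight} dict per document.

    Uses term frequency (count / document length) multiplied by
    the unsmoothed inverse document frequency log(N / df).
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each token appears.
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            tok: (count / len(doc)) * math.log(n_docs / doc_freq[tok])
            for tok, count in counts.items()
        })
    return vectors


# Toy corpus (hypothetical): the same words appear with different casing.
raw = [["Hello", "world"], ["hello", "World"], ["goodbye", "world"]]
normalized = [[tok.lower() for tok in doc] for doc in raw]

raw_vocab = {tok for doc in raw for tok in doc}
norm_vocab = {tok for doc in normalized for tok in doc}

raw_vectors = tfidf(raw)
norm_vectors = tfidf(normalized)
```

Without normalization the vocabulary holds five distinct tokens, and "world" and "World" are counted as different features. After lower-casing there are three, and "world" is correctly recognized as occurring in every document, so its weight drops to zero.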

Note, however, that for vectorizers built on pretrained embeddings, such as FastText and BERT-based models like FlauBERT, preprocessing may be unnecessary and can even hurt performance. Such models are often pretrained on text that keeps its case, accents, and punctuation, so aggressive normalization can remove information they rely on.