Utils
The utils module provides essential text processing functions for YAKE (Yet Another Keyword Extractor), handling tokenization, normalization, and classification of textual elements.
Module Overview
The utils module contains functions for text preprocessing, tokenization, and classification that support the keyword extraction pipeline.
Functions
pre_filter
Pre-filters text before processing by normalizing its format. It maintains paragraph structure while standardizing spacing and line breaks to improve the accuracy of subsequent text analysis steps.
Parameters:
- text (str): Raw input text to be pre-filtered
Returns:
- str: Normalized text with consistent spacing and paragraph structure
How it works:
- Splits the text into parts based on newline characters
- Detects if a part starts with a capital letter (potentially a new paragraph)
- Adds appropriate spacing between parts:
  - Double newlines for parts starting with capital letters (likely new paragraphs)
  - Single spaces for other parts (likely continuing text)
- Replaces all tab characters with spaces for consistent formatting
Example:
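A minimal usage sketch; the import path is an assumption and may differ between releases:

```python
from yake.utils import pre_filter  # import path assumed for illustration

raw = "Some text that wraps\nonto a second line\nA new paragraph starts here\twith a tab"
clean = pre_filter(raw)

# Per the documented behavior: lowercase continuation lines are joined with
# a single space, a line starting with a capital letter is separated as a
# new paragraph, and tabs are replaced with spaces.
print(clean)
```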
tokenize_sentences
Performs two-level tokenization: first dividing the text into sentences, then tokenizing each sentence into words while handling contractions and filtering out invalid tokens.
Parameters:
- text (str): The input text to tokenize
Returns:
- list: A nested list where each inner list contains the tokens of a sentence
Example:
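A minimal usage sketch; the import path is an assumption:

```python
from yake.utils import tokenize_sentences  # import path assumed

text = "YAKE extracts keywords. It doesn't need a training corpus."
for sentence_tokens in tokenize_sentences(text):
    # One inner list per sentence; contractions such as "doesn't" are
    # split and invalid tokens are filtered out.
    print(sentence_tokens)
```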
get_tag
Categorizes words into different types based on their orthographic features (capitalization, digits, special characters), which affect keyword scoring and filtering.
Parameters:
- word (str): The word to classify
- i (int): Position of the word within its sentence (0 = first word)
- exclude (set): Set of characters to consider as punctuation/special chars
Returns:
- str: A single character tag representing the word type:
"d"
: Digit or numeric value"u"
: Unusual word (mixed alphanumeric or special characters)"a"
: Acronym (all uppercase)"n"
: Proper noun (capitalized, not at start of sentence)"p"
: Plain word (default)
Example:
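A minimal usage sketch; the import path and the exclude set are assumptions, and the expected tags follow from the documented classification rules:

```python
from string import punctuation
from yake.utils import get_tag  # import path assumed

exclude = set(punctuation)  # characters treated as punctuation/special

# i=1 marks a non-sentence-initial position, so capitalization is meaningful.
print(get_tag("42", 1, exclude))       # "d": numeric value
print(get_tag("B2B", 1, exclude))      # "u": mixed alphanumeric
print(get_tag("NASA", 1, exclude))     # "a": acronym (all uppercase)
print(get_tag("Paris", 1, exclude))    # "n": capitalized mid-sentence
print(get_tag("keyword", 1, exclude))  # "p": plain word
```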
Module Constants
STOPWORD_WEIGHT (str): Stopword weighting method for multi-word term scoring:
- "bi": Use bi-directional weighting (default, considers term connections)
- "h": Use direct term scores (treat stopwords like normal words)
- "none": Ignore stopwords completely
Usage in YAKE Pipeline
The utility functions serve as foundation components for the YAKE keyword extraction process:
1. pre_filter normalizes the input text
2. tokenize_sentences breaks the text into processable tokens
3. get_tag classifies each token for further analysis
These functions are primarily used by the DataCore class to build the data representation needed for keyword extraction.
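The sketch below chains the three utilities in pipeline order; the import path and the exclude set are illustrative assumptions, and DataCore performs far more bookkeeping than this:

```python
from string import punctuation
from yake.utils import pre_filter, tokenize_sentences, get_tag  # path assumed

text = "Keyword extraction with YAKE.\nIt doesn't require a training corpus."

clean = pre_filter(text)               # 1. normalize spacing and paragraphs
sentences = tokenize_sentences(clean)  # 2. sentence- and word-level tokens

exclude = set(punctuation)
for sentence in sentences:
    # 3. tag each token using its position and orthographic features
    tags = [get_tag(word, i, exclude) for i, word in enumerate(sentence)]
    print(list(zip(sentence, tags)))
```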
Dependencies
The utils module relies on:
- re: For regular expression operations
- segtok.segmenter: For sentence segmentation
- segtok.tokenizer: For tokenization and contraction handling
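The segtok calls below are the library's public API; exactly how the utils module composes them is inferred from the behavior described above:

```python
from segtok.segmenter import split_multi
from segtok.tokenizer import split_contractions, web_tokenizer

text = "Machine learning isn't magic. It is applied statistics."

for sentence in split_multi(text):  # sentence segmentation
    # web_tokenizer splits the sentence into word tokens;
    # split_contractions then separates "isn't" into "is" + "n't".
    tokens = list(split_contractions(web_tokenizer(sentence)))
    print(tokens)
```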