Utils

The utils module provides essential text processing functions for YAKE (Yet Another Keyword Extractor), handling tokenization, normalization, and classification of textual elements.

Module Overview

"""
Text processing utility module for YAKE keyword extraction.
 
This module provides essential text preprocessing functions for the YAKE algorithm,
including text normalization, sentence segmentation, tokenization, and word
categorization. These utilities form the foundation for clean and consistent
text analysis throughout the keyword extraction pipeline.
"""
 
import re
from segtok.segmenter import split_multi
from segtok.tokenizer import web_tokenizer, split_contractions
 
# Stopword weighting method for multi-word term scoring:
# - "bi": Use bi-directional weighting (default, considers term connections)
# - "h": Use direct term scores (treat stopwords like normal words)
# - "none": Ignore stopwords completely
STOPWORD_WEIGHT = "bi"

The utils module contains functions for text preprocessing, tokenization, and classification that support the keyword extraction pipeline.

Functions

pre_filter

Pre-filters text before processing by normalizing its format. It maintains paragraph structure while standardizing spacing and line breaks to improve the accuracy of subsequent text analysis steps.

Parameters:

  • text (str): Raw input text to be pre-filtered

Returns:

  • str: Normalized text with consistent spacing and paragraph structure

How it works (see the sketch after this list):

  1. Splits the text into parts based on newline characters
  2. Detects if a part starts with a capital letter (potentially a new paragraph)
  3. Adds appropriate spacing between parts:
    • Double newlines for parts starting with capital letters (likely new paragraphs)
    • Single spaces for other parts (likely continuing text)
  4. Replaces all tab characters with spaces for consistent formatting

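A minimal sketch of the logic described above (an approximation for illustration, not necessarily the exact implementation):

import re

def pre_filter_sketch(text):
    """Normalize spacing while keeping likely paragraph boundaries."""
    paragraph_start = re.compile(r"^\s*[A-Z]")   # part beginning with a capital letter
    parts = text.split("\n")
    buffer_text = ""
    for part in parts:
        # Capitalized parts are treated as new paragraphs; others continue the previous text.
        sep = "\n\n" if paragraph_start.match(part) else " "
        buffer_text += sep + part.replace("\t", " ")  # tabs become plain spaces
    return buffer_text
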
Example:

from yake.data.utils import pre_filter
 
raw_text = "This is line one.\nThis is line two.\tAnd this has a tab."
normalized_text = pre_filter(raw_text)
print(normalized_text)
# Output: " This is line one. This is line two. And this has a tab."
 
raw_text = "This is line one.\nAnother paragraph.\nThis continues."
normalized_text = pre_filter(raw_text)
print(normalized_text)
# Output: " This is line one.\n\nAnother paragraph. This continues."

tokenize_sentences

Performs two-level tokenization: first dividing the text into sentences, then tokenizing each sentence into words while handling contractions and filtering out invalid tokens.

Parameters:

  • text (str): The input text to tokenize

Returns:

  • list: A nested list where each inner list contains the tokens of a sentence

Example:

from yake.data.utils import tokenize_sentences
 
text = "Hello world! This is a sample text. It has multiple sentences."
sentences = tokenize_sentences(text)
print(sentences)
# Output: [['Hello', 'world', '!'], ['This', 'is', 'a', 'sample', 'text', '.'], ['It', 'has', 'multiple', 'sentences', '.']]
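
A minimal sketch of this two-level tokenization, assuming the segtok helpers imported in the module overview (the token filtering shown here is an illustrative simplification):

from segtok.segmenter import split_multi
from segtok.tokenizer import web_tokenizer, split_contractions

def tokenize_sentences_sketch(text):
    """Split text into sentences, then each sentence into word tokens."""
    result = []
    for sentence in split_multi(text):                              # first level: sentences
        if not sentence.strip():
            continue                                                # skip empty segments
        tokens = [
            w for w in split_contractions(web_tokenizer(sentence))  # second level: words
            if w.strip()                                            # drop empty tokens
        ]
        result.append(tokens)
    return result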

get_tag

Categorizes words into different types based on their orthographic features (capitalization, digits, special characters), which affect keyword scoring and filtering.

Parameters:

  • word (str): The word to classify
  • i (int): Position of the word within its sentence (0 = first word)
  • exclude (set): Set of characters to consider as punctuation/special chars

Returns:

  • str: A single character tag representing the word type:
    • "d": Digit or numeric value
    • "u": Unusual word (mixed alphanumeric or special characters)
    • "a": Acronym (all uppercase)
    • "n": Proper noun (capitalized, not at start of sentence)
    • "p": Plain word (default)

Example:

from yake.data.utils import get_tag
import string
 
exclude = set(string.punctuation)
 
# Examples of different word classifications
print(get_tag("Hello", 0, exclude))  # Output: "p" (plain word)
print(get_tag("Hello", 3, exclude))  # Output: "n" (proper noun, capitalized not at sentence start)
print(get_tag("123", 0, exclude))    # Output: "d" (digit)
print(get_tag("NASA", 0, exclude))   # Output: "a" (acronym)
print(get_tag("test@example", 0, exclude))  # Output: "u" (unusual)

Module Constants

  • STOPWORD_WEIGHT (str): Stopword weighting method for multi-word term scoring (illustrated in the sketch below):
    • "bi": Use bi-directional weighting (default, considers term connections)
    • "h": Use direct term scores (treat stopwords like normal words)
    • "none": Ignore stopwords completely

Usage in YAKE Pipeline

The utility functions serve as foundation components for the YAKE keyword extraction process:

  1. pre_filter normalizes the input text
  2. tokenize_sentences breaks the text into processable tokens
  3. get_tag classifies each token for further analysis

These functions are primarily used by the DataCore class to build the data representation needed for keyword extraction.
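
Put together, a minimal preprocessing pass over a document might look like this (a sketch, assuming the yake.data.utils import path used above):

import string
from yake.data.utils import pre_filter, tokenize_sentences, get_tag

exclude = set(string.punctuation)

text = "YAKE extracts keywords.\nIt works on raw text, even NASA reports."
normalized = pre_filter(text)                 # 1. normalize spacing and paragraph breaks
sentences = tokenize_sentences(normalized)    # 2. split into sentences, then tokens
for sentence in sentences:
    # 3. classify each token by position and orthography
    tags = [get_tag(word, i, exclude) for i, word in enumerate(sentence)]
    print(list(zip(sentence, tags)))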

Dependencies

The utils module relies on:

  • re: For regular expression operations
  • segtok.segmenter: For sentence segmentation
  • segtok.tokenizer: For tokenization and contraction handling
