
SingleWord

The SingleWord class represents individual terms in YAKE (Yet Another Keyword Extractor), providing the statistical features and measurements used in keyword extraction.


Module Overview

"""
Single word term representation module for YAKE keyword extraction.
 
This module contains the SingleWord class which represents individual terms
in a document for keyword extraction. It tracks statistical features like
term frequency, position, and relationships with other terms to calculate
a relevance score for each word.
"""
 
import math
import numpy as np

The SingleWord class stores and calculates statistical features for individual terms, including frequency, position, spread, and relationship metrics. These features are used to calculate a relevance score that indicates the word's importance in the document.

Constructor

Parameters:

  • unique (str): The unique normalized term this object represents
  • idx (int): Unique identifier for the term in the document
  • graph (networkx.DiGraph): Word co-occurrence graph from the document

Example:

import networkx as nx
from yake.data import SingleWord
 
# Create a graph
g = nx.DiGraph()
 
# Initialize a single word
term = SingleWord("algorithm", 1, g)

Dictionary-Style Access

The SingleWord class provides dictionary-style attribute access for flexibility:

Example:

# Dictionary-style access
term["wfreq"] = 2.5
score = term["h"]
position = term.get("wpos", 1.0)

Core Methods

Properties

The SingleWord class provides property accessors for its main attributes:

Basic Properties

  • unique_term: The normalized form of the word
  • stopword: Boolean indicating if the term is a stopword
  • h: The final score of the term (lower is better in YAKE)
  • tf: Term frequency in the document
  • occurs: Dictionary of sentence occurrences

# Examples
word = term.unique_term
is_stopword = term.stopword
score = term.h
frequency = term.tf
occurrences = term.occurs
 
# Setter examples
term.stopword = True
term.h = 0.25
term.tf = 5.0

Feature Properties

  • wfreq: Word frequency metric
  • wcase: Word case metric (uppercase/proper noun)
  • wrel: Word relevance metric (based on graph connections)
  • wpos: Word position metric
  • wspread: Word spread across document
  • pl: Probability left (graph-based)
  • pr: Probability right (graph-based)

# Examples
frequency_metric = term.wfreq
case_metric = term.wcase
relevance = term.wrel
position_metric = term.wpos
spread_metric = term.wspread
left_probability = term.pl
right_probability = term.pr
 
# Setter examples
term.wfreq = 0.75
term.wcase = 0.5
term.wrel = 1.2

Feature Calculation Logic

The SingleWord class calculates several features that contribute to keyword scoring:

Word Frequency (wfreq)

Measures how frequent the term is compared to the average document term frequency.

# Normalized term frequency compared to document statistics
wfreq = term_frequency / (average_term_frequency + standard_deviation)

Higher values indicate more frequent terms relative to the document average.
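
As a quick numeric sketch of this formula, using illustrative statistics (not taken from any real document):

```python
# Hypothetical document statistics for illustration only
term_frequency = 5.0
average_term_frequency = 3.0
standard_deviation = 2.0

# Normalized term frequency relative to the document average
wfreq = term_frequency / (average_term_frequency + standard_deviation)
# 5.0 / (3.0 + 2.0) = 1.0: the term's frequency equals avg + std
```

A value above 1.0 means the term occurs more often than one standard deviation above the document's average term frequency.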

Word Case (wcase)

Represents the significance of capitalization in determining proper nouns and acronyms.

# Case significance: higher values for acronyms and proper nouns
wcase = max(uppercase_freq, proper_noun_freq) / (1.0 + log(term_frequency))

Higher values indicate terms more likely to be acronyms or proper nouns.
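
A minimal sketch of this computation, with hypothetical frequency counts:

```python
import math

# Hypothetical counts for illustration only
uppercase_freq = 2.0    # occurrences in all-caps (acronym) form
proper_noun_freq = 3.0  # occurrences capitalized mid-sentence
term_frequency = 5.0

# Case significance dampened by log of total frequency
wcase = max(uppercase_freq, proper_noun_freq) / (1.0 + math.log(term_frequency))
# 3.0 / (1.0 + ln 5) ≈ 1.15
```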

Word Relevance (wrel)

Evaluates the term's importance based on its co-occurrence relationships.

# Relevance based on graph connection probabilities and term frequency
wrel = ((0.5 + (graph_metrics["pwl"] * (tf / max_tf)))
        + (0.5 + (graph_metrics["pwr"] * (tf / max_tf))))

Higher values indicate terms with more meaningful contextual relationships.
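
A small sketch with assumed graph probabilities (pwl, pwr are illustrative values, not outputs of the actual graph):

```python
# Assumed left/right co-occurrence probabilities for illustration
pwl, pwr = 0.4, 0.6
tf, max_tf = 5.0, 10.0

# Sum of left-side and right-side relevance contributions
wrel = (0.5 + pwl * (tf / max_tf)) + (0.5 + pwr * (tf / max_tf))
# (0.5 + 0.2) + (0.5 + 0.3) = 1.5
```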

Word Position (wpos)

Considers the typical position of the word in sentences, with the intuition that important terms appear earlier.

# Position score based on median sentence position
wpos = math.log(math.log(3.0 + median_position_in_sentences))

Lower values indicate terms that tend to appear earlier in sentences.
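
The double logarithm compresses position differences; a sketch with an assumed list of sentence positions (the class itself uses numpy's median):

```python
import math
import statistics

# Hypothetical sentence positions where the term occurs
positions = [0, 1, 2]
median_position = statistics.median(positions)  # 1

# Double log keeps the score small and slow-growing
wpos = math.log(math.log(3.0 + median_position))
# log(log(4.0)) ≈ 0.327
```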

Word Spread (wspread)

Measures how widely the term is distributed across the document's sentences.

# Document coverage: proportion of sentences containing the term
wspread = number_of_sentences_with_term / total_number_of_sentences

Higher values indicate terms that appear throughout the document.
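
This is a simple ratio; with illustrative counts:

```python
# Hypothetical counts for illustration only
sentences_with_term = 3
total_sentences = 5

# Fraction of the document's sentences containing the term
wspread = sentences_with_term / total_sentences
# 3 / 5 = 0.6
```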

Final Score Calculation

The final score (h) combines all metrics in a formula designed to rank candidate keywords:

# Lower scores indicate better keyword candidates
h = (wpos * wrel) / (wcase + (wfreq / wrel) + (wspread / wrel))

The formula balances:

  • Term position (earlier is better)
  • Term relevance (more connections is better)
  • Term case (proper nouns and acronyms preferred)
  • Term frequency (higher is better)
  • Term spread (wider distribution is better)

Lower scores indicate better keyword candidates in YAKE's ranking system.
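
Putting the formula together with illustrative feature values (not derived from a real document):

```python
# Assumed feature values for illustration only
wpos, wrel, wcase = 0.327, 1.5, 1.15
wfreq, wspread = 1.0, 0.6

# Position and relevance in the numerator; case, frequency,
# and spread (both normalized by relevance) in the denominator
h = (wpos * wrel) / (wcase + (wfreq / wrel) + (wspread / wrel))
```

Note how wrel appears both as a multiplier in the numerator and as a normalizer in the denominator, so highly connected terms influence the score in two directions.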

Usage Example

import networkx as nx
import numpy as np
from yake.data import SingleWord
 
# Create a graph for co-occurrence
g = nx.DiGraph()
g.add_node(1)
 
# Initialize a word
term = SingleWord("algorithm", 1, g)
 
# Add occurrences
term.add_occur("n", 0, 5, 5)  # In sentence 0, position 5
term.add_occur("n", 1, 2, 15) # In sentence 1, position 2
term.add_occur("n", 2, 8, 35) # In sentence 2, position 8
 
# Update the score with statistics
stats = {
    "max_tf": 10.0,
    "avg_tf": 3.0,
    "std_tf": 2.0,
    "number_of_sentences": 5
}
term.update_h(stats)
 
# Get the final score
print(f"Keyword score for 'algorithm': {term.h:.4f}")

Dependencies

The SingleWord class relies on:

  • math: For logarithmic calculations
  • numpy: For statistical operations (median)
  • networkx: Used implicitly through the graph parameter passed to the constructor