SingleWord
The SingleWord class represents individual terms in YAKE (Yet Another Keyword Extractor), providing the statistical features and measurements used in keyword extraction.
Module Overview
The SingleWord class stores and calculates statistical features for individual terms, including frequency, position, spread, and relationship metrics. These features are used to calculate a relevance score that indicates the word's importance in the document.
Constructor
Parameters:
- unique (str): The unique normalized term this object represents
- idx (int): Unique identifier for the term in the document
- graph (networkx.DiGraph): Word co-occurrence graph from the document
Example:
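A minimal sketch of the constructor's shape, using a hypothetical stand-in class rather than the library's own SingleWord. The name SingleWordSketch, the plain-dict graph (standing in for a networkx.DiGraph), and the initial tf and occurs values are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class SingleWordSketch:
    """Hypothetical stand-in for SingleWord; not the library class."""

    unique: str  # normalized term this object represents
    idx: int     # identifier of the term within the document
    graph: dict  # assumed stand-in for networkx.DiGraph: {source_id: {target_id: weight}}
    tf: float = 0.0                             # assumed starting term frequency
    occurs: dict = field(default_factory=dict)  # assumed: sentence id -> positions


# Two terms; term 0 is followed once by term 1.
cooccurrence = {0: {1: 1.0}, 1: {}}
word = SingleWordSketch(unique="keyword", idx=0, graph=cooccurrence)
print(word.unique, word.idx, word.tf)
```

With the real class, the third argument would be the document's co-occurrence graph rather than a dictionary.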
Dictionary-Style Access
The SingleWord class provides dictionary-style attribute access for flexibility:
Example:
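One way dictionary-style access can sit on top of ordinary attributes is sketched below. TermRecord is a hypothetical stand-in, and the choice to map keys straight onto attribute names is an assumption; the library may store its values differently.

```python
class TermRecord:
    """Hypothetical stand-in showing dictionary-style attribute access."""

    def __init__(self, unique, idx):
        self.unique_term = unique
        self.idx = idx
        self.tf = 0.0

    def __getitem__(self, key):
        # record["tf"] reads the attribute named "tf".
        try:
            return getattr(self, key)
        except AttributeError:
            raise KeyError(key) from None

    def __setitem__(self, key, value):
        # record["tf"] = 3.0 writes the attribute named "tf".
        setattr(self, key, value)

    def get(self, key, default=None):
        # Like dict.get: fall back to a default for unknown keys.
        return getattr(self, key, default)


record = TermRecord("keyword", 0)
record["tf"] = 3.0
print(record["tf"], record.tf, record.get("missing", -1))
```

Both spellings refer to the same value, so code that iterates over feature names as strings can share an object with code that uses plain attribute access.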
Core Methods
Properties
The SingleWord class provides property accessors for its main attributes:
Basic Properties
- unique_term: The normalized form of the word
- stopword: Boolean indicating if the term is a stopword
- h: The final score of the term (lower is better in YAKE)
- tf: Term frequency in the document
- occurs: Dictionary of sentence occurrences
Feature Properties
- wfreq: Word frequency metric
- wcase: Word case metric (uppercase/proper noun)
- wrel: Word relevance metric (based on graph connections)
- wpos: Word position metric
- wspread: Word spread across document
- pl: Probability left (graph-based)
- pr: Probability right (graph-based)
Feature Calculation Logic
The SingleWord class calculates several features that contribute to keyword scoring:
Word Frequency (wfreq)
Measures how frequent the term is compared to the average document term frequency.
Higher values indicate more frequent terms relative to the document average.
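As a rough sketch of this normalization, the function below divides a term's frequency by the mean plus one standard deviation of all term frequencies, which is the form given in the published YAKE description. The function name and the use of the standard-library statistics module are assumptions; the class itself may compute the statistics elsewhere and pass them in.

```python
from statistics import mean, pstdev


def word_frequency(tf, all_tfs):
    """Assumed form: tf / (mean of all tfs + population std of all tfs)."""
    avg_tf = mean(all_tfs)
    std_tf = pstdev(all_tfs)
    return tf / (avg_tf + std_tf)


# Term frequencies of every term in a toy document (mean 5, std 2).
document_tfs = [2, 4, 4, 4, 5, 5, 7, 9]
print(word_frequency(7, document_tfs))
```

Adding the standard deviation to the denominator keeps a handful of very frequent terms from dominating the feature.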
Word Case (wcase)
Represents the significance of capitalization in determining proper nouns and acronyms.
Higher values indicate terms more likely to be acronyms or proper nouns.
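A minimal sketch of a casing feature, assuming the form from the published YAKE description: the larger of the acronym count and the capitalized count, damped by the logarithm of the term frequency. The names tf_a (occurrences written entirely in uppercase) and tf_n (occurrences starting with a capital letter) are assumptions for illustration.

```python
import math


def word_case(tf, tf_a, tf_n):
    """Assumed form: max(tf_a, tf_n) / (1 + ln(tf)), for tf >= 1."""
    return max(tf_a, tf_n) / (1.0 + math.log(tf))


# A term seen 4 times: once in all caps, three times capitalized.
print(word_case(tf=4, tf_a=1, tf_n=3))
```

The logarithm keeps very frequent terms from accumulating a large casing score just because they occur often.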
Word Relevance (wrel)
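Evaluates how the term relates to its context, based on its co-occurrence relationships in the graph.
Higher values indicate terms that co-occur with many different words; in the published YAKE formulation such terms behave more like stopwords, so a higher wrel raises the final score.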
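The graph-based part can be sketched without networkx by treating the co-occurrence graph as a nested dictionary. Everything here is an assumption for illustration: the dict layout, the helper name, and the formula itself, which follows the published YAKE description (neighbour diversity on each side, scaled by the term's frequency relative to the maximum).

```python
def word_relevance(idx, graph, tf, max_tf):
    """Assumed form: (0.5 + pwl * tf/max_tf) + (0.5 + pwr * tf/max_tf).

    graph is a stand-in for a directed co-occurrence graph:
    {source_id: {target_id: co-occurrence count}}.
    """
    out_edges = graph.get(idx, {})
    in_edges = {src: targets[idx] for src, targets in graph.items() if idx in targets}

    def diversity(edges):
        # Distinct neighbours divided by total co-occurrences on that side.
        total = sum(edges.values())
        return len(edges) / total if total else 0.0

    pwr = diversity(out_edges)  # right-hand side (successors)
    pwl = diversity(in_edges)   # left-hand side (predecessors)
    ratio = tf / max_tf
    return (0.5 + pwl * ratio) + (0.5 + pwr * ratio)


toy_graph = {0: {1: 2.0, 2: 1.0}, 1: {0: 1.0}, 2: {}}
print(word_relevance(0, toy_graph, tf=3, max_tf=6))
```

A term with no neighbours on either side gets the baseline value of 1.0 under this form.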
Word Position (wpos)
Considers the typical position of the word in sentences, with the intuition that important terms appear earlier.
Lower values indicate terms that tend to appear earlier in sentences.
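One form of this feature, taken from the published YAKE description, is a double logarithm of the median position. The sketch below assumes that form; what exactly counts as a "position" is left to the caller, and statistics.median stands in for the numpy median listed under Dependencies.

```python
import math
from statistics import median


def word_position(positions):
    """Assumed form: ln(ln(3 + median(positions))), positions >= 0."""
    return math.log(math.log(3.0 + median(positions)))


early = word_position([0, 1])    # term seen near the start
late = word_position([10, 20])   # term seen much later
print(early < late)
```

The constant 3 keeps the inner logarithm above 1, so the outer logarithm stays positive, and the double logarithm flattens the difference between late and very late positions.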
Word Spread (wspread)
Measures how widely the term is distributed across the document's sentences.
Higher values indicate terms that appear throughout the document.
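A minimal sketch of the spread feature, assuming it is the fraction of sentences that contain the term. The occurs layout (sentence id mapped to a list of positions) and the function name are assumptions for illustration.

```python
def word_spread(occurs, n_sentences):
    """Assumed form: number of sentences containing the term / total sentences."""
    return len(occurs) / n_sentences


# Term found in sentences 0 and 3 of a four-sentence document.
toy_occurs = {0: [2], 3: [5, 9]}
print(word_spread(toy_occurs, n_sentences=4))
```

Repeated occurrences inside one sentence do not change the value; only the number of distinct sentences matters.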
Final Score Calculation
The final score (h) combines all metrics in a formula designed to rank candidate keywords:
The formula balances:
- Term position (earlier is better)
- Term relevance (terms that co-occur with many different words are penalized, as they behave more like stopwords)
- Term case (proper nouns and acronyms preferred)
- Term frequency (higher is better)
- Term spread (wider distribution is better)
Lower scores indicate better keyword candidates in YAKE's ranking system.
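The combination can be sketched as follows, assuming the scoring formula from the published YAKE description. The function name and argument order are illustrative, and wrel is assumed to be non-zero.

```python
def final_score(wpos, wrel, wcase, wfreq, wspread):
    """Assumed form: (wpos * wrel) / (wcase + wfreq / wrel + wspread / wrel)."""
    return (wpos * wrel) / (wcase + (wfreq / wrel) + (wspread / wrel))


baseline = final_score(wpos=1.0, wrel=1.0, wcase=1.0, wfreq=1.0, wspread=1.0)
more_frequent = final_score(wpos=1.0, wrel=1.0, wcase=1.0, wfreq=3.0, wspread=1.0)
print(baseline, more_frequent)
```

Under this form, wcase, wfreq, and wspread sit in the denominator, so increasing any of them lowers the score, while wpos sits in the numerator, so earlier positions also lower it.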
Usage Example
Dependencies
The SingleWord class relies on:
- math: For logarithmic calculations
- numpy: For statistical operations (median)
- networkx: Implicitly through the provided graph parameter