ComposedWord
The ComposedWord class represents multi-word terms in YAKE (Yet Another Keyword Extractor), providing the foundation for analyzing and scoring potential keyword phrases.
Class Overview
The ComposedWord class stores and aggregates information about multi-word keyword candidates and calculates combined scores from the properties of their constituent terms. It tracks statistics such as term frequency and integrity, and provides methods to validate whether a phrase is likely to be a good keyword.
Constructor
Parameters:
- terms (list): A list of term tuples in the format (tag, word, term_obj), where:
  - tag (str): The part-of-speech tag for the word
  - word (str): The actual word text
  - term_obj (SingleWord): The term object representation
Example:
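A minimal sketch of the expected input shape, not the library's own code. StubTerm is a hypothetical stand-in for SingleWord, the tag letter "n" is an arbitrary placeholder, and describe_candidate is an illustrative helper whose derivations are assumptions based on the property descriptions in this document.

```python
from dataclasses import dataclass


@dataclass
class StubTerm:
    """Hypothetical stand-in for SingleWord; only the fields used below."""
    word: str
    stopword: bool = False


def describe_candidate(terms):
    """Derive basic candidate fields from (tag, word, term_obj) tuples."""
    # Concatenating the per-word tags gives one tag combination, e.g. "nn".
    tags = {"".join(tag for tag, _, _ in terms)}
    # The keyword text keeps the original casing of each word.
    kw = " ".join(word for _, word, _ in terms)
    term_objs = [obj for _, _, obj in terms if obj is not None]
    return {
        "tags": tags,
        "kw": kw,
        "unique_kw": kw.lower(),
        "size": len(terms),
        "terms": term_objs,
        "start_or_end_stopwords": term_objs[0].stopword or term_objs[-1].stopword,
    }


terms = [
    ("n", "Machine", StubTerm("machine")),
    ("n", "Learning", StubTerm("learning")),
]
candidate = describe_candidate(terms)
print(candidate["kw"], candidate["size"], candidate["tags"])
```

Each tuple carries the tag, the surface form, and the term object together, so a candidate can be described without looking anything up again.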
Core Methods
Property Accessors
The ComposedWord class keeps its state in a dictionary and exposes it through property accessors for backward compatibility:
Basic Properties
- tags: Set of POS tag combinations for this candidate
- kw: The original keyword text
- unique_kw: Lowercase version of the keyword for uniqueness checks
- size: Number of terms in this candidate
- terms: List of term objects in this candidate
- start_or_end_stopwords: Boolean indicating if the candidate starts or ends with stopwords
Scoring Properties
- tf: Term frequency of this candidate
- integrity: Integrity score (default: 1.0)
- h: YAKE score for this candidate (lower is better)
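The dictionary-plus-accessor pattern described in this section can be sketched as follows. DictBackedCandidate and its data attribute are hypothetical names, and the initial values for tf and h are assumptions. Only the integrity default of 1.0 comes from the list above.

```python
class DictBackedCandidate:
    """Hypothetical sketch: values live in one dict, properties expose them."""

    def __init__(self, kw):
        # All state is stored in a single dictionary.
        self.data = {"kw": kw, "tf": 0.0, "integrity": 1.0, "h": 1.0}

    @property
    def tf(self):
        return self.data["tf"]

    @tf.setter
    def tf(self, value):
        self.data["tf"] = value

    @property
    def h(self):
        return self.data["h"]

    @h.setter
    def h(self, value):
        self.data["h"] = value

    @property
    def integrity(self):
        # Read-only in this sketch.
        return self.data["integrity"]


cand = DictBackedCandidate("machine learning")
cand.tf += 1       # attribute-style update, as older callers would write it
cand.h = 0.05
print(cand.data["tf"], cand.h, cand.integrity)
```

Callers written against plain attributes keep working, while the underlying storage stays in one place.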
Key Algorithms
Candidate Validation
Candidates are considered valid if:
- They contain no undefined ("u") or discarded ("d") POS tags
- They do not start or end with stopwords
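The two checks above can be sketched as a standalone function. is_valid_candidate is a hypothetical helper, not the class's method. It assumes tags is a set of tag-combination strings and that a single combination free of "u" and "d" is enough to pass the first check.

```python
def is_valid_candidate(tags, start_or_end_stopwords):
    """Return True when a candidate passes both validation checks.

    tags: set of POS tag-combination strings, e.g. {"nn"} or {"nu"}.
    Assumption: one combination without "u" or "d" is sufficient.
    """
    has_clean_tags = any("u" not in tag and "d" not in tag for tag in tags)
    return has_clean_tags and not start_or_end_stopwords


print(is_valid_candidate({"nn"}, False))   # clean tags, no boundary stopword
print(is_valid_candidate({"nu"}, False))   # contains an undefined tag
print(is_valid_candidate({"nn"}, True))    # starts or ends with a stopword
```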
Feature Composition
When analyzing multi-word terms, the ComposedWord class composes features from its constituent terms. For each feature, it calculates:
- Sum of feature values across terms
- Product of feature values across terms
- A ratio metric: product/(sum+1) measuring feature consistency
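The three quantities above can be sketched for a single feature as follows. compose_feature is a hypothetical helper that takes the per-term feature values as a plain list. Whether stopword terms are filtered out beforehand is left to the caller and is an assumption of this sketch.

```python
import math


def compose_feature(values):
    """Return (sum, product, product / (sum + 1)) for per-term feature values."""
    total = sum(values)
    product = math.prod(values)          # requires Python 3.8+
    ratio = product / (total + 1)        # the +1 avoids division by zero for a zero sum
    return total, product, ratio


total, product, ratio = compose_feature([0.5, 0.2])
print(total, product, ratio)
```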
Score Calculation
The YAKE score for a multi-word term is calculated from the following quantities:
- prod_h: Product of the h-scores of all terms
- sum_h: Sum of the h-scores of all terms
- tf_used: Term frequency (or average term frequency for virtual terms)
Lower scores indicate better keyword candidates.
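As a hedged sketch of how these quantities could combine, the function below assumes the score has the form prod_h / ((sum_h + 1) * tf_used), which mirrors the product/(sum+1) ratio used in feature composition. Treat the exact form as an assumption. composed_score is a hypothetical helper, and stopword handling is left out.

```python
import math


def composed_score(term_scores, tf_used):
    """Combine per-term h-scores into one candidate score (lower is better).

    Assumed form: prod_h / ((sum_h + 1) * tf_used), with tf_used > 0.
    """
    prod_h = math.prod(term_scores)
    sum_h = sum(term_scores)
    return prod_h / ((sum_h + 1) * tf_used)


rare = composed_score([0.1, 0.2], tf_used=1)
frequent = composed_score([0.1, 0.2], tf_used=3)
print(rare, frequent)
```

Under this form, a higher term frequency or lower constituent scores both push the result down, which is consistent with lower scores marking better candidates.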
Stopword Handling
The ComposedWord class handles stopwords differently based on the STOPWORD_WEIGHT configuration:
- "bi": Uses bi-directional co-occurrence probabilities to weight stopwords
- "h": Uses stopword h-scores directly (treats stopwords like normal words)
- "none": Ignores stopwords in scoring completely
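The three modes can be sketched as a loop that accumulates the sum and product of h-scores. accumulate_h and the dictionary layout of each term are hypothetical. The "bi" update rule shown here (multiply by 1 + (1 - prob), subtract 1 - prob from the sum) is an assumption, and prob is passed in directly instead of being derived from co-occurrence counts.

```python
def accumulate_h(terms, stopword_weight="bi"):
    """Return (sum_h, prod_h) over terms under the given stopword mode.

    terms: list of dicts with keys "h", "stopword" and, for stopwords
    in "bi" mode, an optional "prob" (assumed co-occurrence probability).
    """
    if stopword_weight not in ("bi", "h", "none"):
        raise ValueError(f"unknown stopword mode: {stopword_weight!r}")

    sum_h, prod_h = 0.0, 1.0
    for term in terms:
        if not term["stopword"] or stopword_weight == "h":
            # Normal words, and stopwords in "h" mode, contribute their h-score.
            sum_h += term["h"]
            prod_h *= term["h"]
        elif stopword_weight == "bi":
            # Assumed rule: weakly connected stopwords are penalized more.
            prob = term.get("prob", 0.0)
            prod_h *= 1 + (1 - prob)
            sum_h -= 1 - prob
        # In "none" mode a stopword contributes nothing.
    return sum_h, prod_h


phrase = [
    {"h": 0.1, "stopword": False},
    {"h": 0.5, "stopword": True, "prob": 0.25},
    {"h": 0.2, "stopword": False},
]
for mode in ("bi", "h", "none"):
    print(mode, accumulate_h(phrase, mode))
```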
Complete Usage Example
Dependencies
The ComposedWord class relies on:
- numpy: For statistical calculations
- jellyfish: For string similarity measurement
- Internal utility module:
  - utils: For stopword weighting constants
Integration with YAKE
ComposedWord works closely with the DataCore class:
- DataCore generates candidate ComposedWord instances
- Features are built for individual terms via build_single_terms_features()
- Features for multi-word terms are built via build_mult_terms_features()
- Candidates are scored using the update_h() method
- Lower scores indicate better keyword candidates