
ComposedWord

The ComposedWord class represents multi-word terms in YAKE (Yet Another Keyword Extractor), providing the foundation for analyzing and scoring potential keyword phrases.

Class Overview

class ComposedWord:
    """
    Representation of a multi-word term in the document.
    
    This class stores and aggregates information about multi-word keyword candidates,
    calculating combined scores from the properties of their constituent terms.
    It tracks statistics like term frequency, integrity, and provides methods to
    validate whether a phrase is likely to be a good keyword.
    
    Attributes:
        See property accessors below for available attributes.
    """

Constructor

ComposedWord(terms)

Parameters:

  • terms (list): A list of term tuples in the format (tag, word, term_obj) where:
    • tag (str): The part-of-speech tag for the word
    • word (str): The actual word text
    • term_obj (SingleWord): The term object representation

Example:

from yake.data import ComposedWord
 
# Create a composed word from term tuples
terms = [('n', 'natural', term_obj1), ('n', 'language', term_obj2)]
composed_word = ComposedWord(terms)
 
# Create an invalid composed word
invalid_composed = ComposedWord(None)

Core Methods

Property Accessors

The ComposedWord class stores its state in an internal dictionary and exposes it through property accessors for backward compatibility:

Basic Properties

  • tags: Set of POS tag combinations for this candidate
  • kw: The original keyword text
  • unique_kw: Lowercase version of the keyword for uniqueness checks
  • size: Number of terms in this candidate
  • terms: List of term objects in this candidate
  • start_or_end_stopwords: Boolean indicating if the candidate starts or ends with stopwords

# Examples
pos_tags = composed_word.tags
keyword = composed_word.kw
unique_key = composed_word.unique_kw
term_count = composed_word.size
term_objects = composed_word.terms
has_stopword_boundary = composed_word.start_or_end_stopwords

Scoring Properties

  • tf: Term frequency of this candidate
  • integrity: Integrity score (default: 1.0)
  • h: YAKE score for this candidate (lower is better)

# Examples
term_frequency = composed_word.tf
integrity_score = composed_word.integrity
yake_score = composed_word.h
 
# The tf property is settable
composed_word.tf = 5.0
 
# The h property is settable
composed_word.h = 0.25

Key Algorithms

Candidate Validation

Candidates are considered valid if:

  1. They contain no undefined ("u") or discarded ("d") POS tags
  2. They do not start or end with stopwords
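
The two rules above can be sketched as a small standalone check (hypothetical function and argument names; the actual logic lives in ComposedWord.is_valid()):

```python
# Sketch of the validation rules above, standalone and simplified.
def is_valid_candidate(tags, start_or_end_stopwords):
    """A candidate is valid if no tag combination contains an
    undefined ('u') or discarded ('d') tag, and the phrase neither
    starts nor ends with a stopword."""
    has_bad_tag = any("u" in tag or "d" in tag for tag in tags)
    return not has_bad_tag and not start_or_end_stopwords

print(is_valid_candidate({"nn"}, False))  # valid: clean tags, clean boundaries
print(is_valid_candidate({"nd"}, False))  # invalid: contains a discarded tag
print(is_valid_candidate({"nn"}, True))   # invalid: stopword at a boundary
```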

Feature Composition

When analyzing multi-word terms, the ComposedWord class composes features from its constituent terms:

def get_composed_feature(self, feature_name, discart_stopword=True):
    """
    Get composed feature values for the n-gram.

    Args:
        feature_name: Name of the SingleWord attribute to aggregate.
        discart_stopword: If True, skip stopword terms. (The spelling
            follows the original YAKE parameter name.)
    """
    # Collect the feature value from each term, filtering stopwords if requested
    list_of_features = [
        getattr(term, feature_name)
        for term in self.terms
        if not (discart_stopword and term.stopword)
    ]

    # Calculate aggregate statistics (numpy is imported as np at module level)
    sum_f = sum(list_of_features)
    prod_f = np.prod(list_of_features)

    # Return the three aggregated values: sum, product, and product/(sum+1)
    return (sum_f, prod_f, prod_f / (sum_f + 1))

For each feature, this method calculates:

  • Sum of feature values across terms
  • Product of feature values across terms
  • A ratio metric: product/(sum+1) measuring feature consistency
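
As a worked example of the three aggregates (illustrative values, not from a real document): two terms with feature values 2.0 and 3.0 yield a sum of 5.0, a product of 6.0, and a ratio of 1.0:

```python
import numpy as np

# Worked example of the three aggregates for feature values [2.0, 3.0]
values = [2.0, 3.0]
sum_f = sum(values)              # 2.0 + 3.0 = 5.0
prod_f = float(np.prod(values))  # 2.0 * 3.0 = 6.0
ratio = prod_f / (sum_f + 1)     # 6.0 / (5.0 + 1) = 1.0
print(sum_f, prod_f, ratio)
```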

Score Calculation

The YAKE score for a multi-word term is calculated using:

self.h = prod_h / ((sum_h + 1) * tf_used)

Where:

  • prod_h: Product of the h-scores of all terms
  • sum_h: Sum of the h-scores of all terms
  • tf_used: Term frequency (or average term frequency for virtual terms)

Lower scores indicate better keyword candidates.
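
As a worked example (illustrative numbers): two terms with h-scores 0.2 and 0.5 and a candidate term frequency of 4 give a combined score of roughly 0.0147:

```python
# Illustrative h-score calculation for a two-term candidate
h_scores = [0.2, 0.5]
tf_used = 4

prod_h = 1.0
for score in h_scores:
    prod_h *= score                  # 0.2 * 0.5 = 0.1
sum_h = sum(h_scores)                # 0.2 + 0.5 = 0.7

h = prod_h / ((sum_h + 1) * tf_used) # 0.1 / (1.7 * 4) ~= 0.0147
print(round(h, 4))
```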

Stopword Handling

The ComposedWord class handles stopwords differently based on the STOPWORD_WEIGHT configuration:

  • "bi": Uses bi-directional co-occurrence probabilities to weight stopwords
  • "h": Uses stopword h-scores directly (treats stopwords like normal words)
  • "none": Ignores stopwords in scoring completely
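
A simplified sketch of how a stopword's contribution might be folded into the running product and sum under each mode (hypothetical function and variable names; in the real code, prob is derived from co-occurrence counts with the neighbouring terms):

```python
# Simplified sketch of per-mode stopword handling (hypothetical names).
def fold_in_stopword(mode, prod_h, sum_h, term_h, prob):
    if mode == "bi":
        # Weight by bi-directional co-occurrence probability
        prod_h *= 1 + (1 - prob)
        sum_h -= 1 - prob
    elif mode == "h":
        # Treat the stopword like any other term
        prod_h *= term_h
        sum_h += term_h
    elif mode == "none":
        pass  # stopword contributes nothing to the score
    return prod_h, sum_h

print(fold_in_stopword("bi", 1.0, 0.0, 0.8, 0.25))    # (1.75, -0.75)
print(fold_in_stopword("h", 1.0, 0.0, 0.8, 0.25))     # (0.8, 0.8)
print(fold_in_stopword("none", 2.0, 3.0, 0.8, 0.25))  # (2.0, 3.0)
```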

Complete Usage Example

from yake.data import ComposedWord
from yake.data.utils import STOPWORD_WEIGHT
 
# Create a sample composed word
terms = [("n", "natural", term_obj1), ("n", "language", term_obj2)]
composed_word = ComposedWord(terms)
 
# Update the candidate's score
composed_word.update_h()
 
# Check if the candidate is valid
if composed_word.is_valid():
    print(f"Candidate: {composed_word.kw}")
    print(f"Score: {composed_word.h:.4f}")
    print(f"Size: {composed_word.size}")
    print(f"Term Frequency: {composed_word.tf}")

Dependencies

The ComposedWord class relies on:

  • numpy: For statistical calculations
  • jellyfish: For string similarity measurement
  • Internal utility module:
    • utils: For stopword weighting constants

Integration with YAKE

ComposedWord works closely with the DataCore class:

  1. DataCore generates candidate ComposedWord instances
  2. Features are built for individual terms via build_single_terms_features()
  3. Features for multi-word terms are built via build_mult_terms_features()
  4. Candidates are scored using the update_h() method
  5. Lower scores indicate better keyword candidates
