
KeywordExtractor Class

The KeywordExtractor class is the main entry point for YAKE (Yet Another Keyword Extractor), providing a simple API to extract meaningful keywords from textual content.


Module Overview

"""
Keyword extraction module for YAKE.
 
This module provides the KeywordExtractor class which serves as the main entry point 
for the YAKE keyword extraction algorithm. It handles configuration, stopword loading,
deduplication of similar keywords, and the entire extraction pipeline from raw text 
to ranked keywords.
"""
 
import os
import jellyfish
from yake.data import DataCore
from .Levenshtein import Levenshtein

The KeywordExtractor class handles the configuration, preprocessing, and extraction of keywords from text documents using statistical features without relying on dictionaries or external corpora.

Constructor

Parameters:

  • lan (str, optional): Language for stopwords (default: "en")
  • n (int, optional): Maximum n-gram size (default: 3)
  • dedup_lim (float, optional): Similarity threshold for deduplication (default: 0.9)
  • dedup_func (str, optional): Deduplication function to use (default: "seqm")
  • window_size (int, optional): Size of word window for co-occurrence (default: 1)
  • top (int, optional): Maximum number of keywords to return (default: 20)
  • features (list, optional): List of features to use for scoring (default: None = all features)
  • stopwords (set, optional): Custom stopwords set (default: None, loads from language file)

Core Methods

extract_keywords(text)

Extracts the most relevant keywords from the given text.

Parameters:

  • text (str): The text to extract keywords from

Returns:

  • list: A list of tuples containing (keyword, score) pairs, sorted by relevance (lower scores are better)
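As a quick illustration of the return shape (the scores below are invented for demonstration, not real YAKE output):

```python
# Hypothetical (keyword, score) pairs mimicking extract_keywords() output;
# YAKE ranks candidates so that LOWER scores indicate more relevant keywords.
keywords = [
    ("natural language processing", 0.032),
    ("artificial intelligence", 0.118),
    ("language data", 0.257),
]

# The list is sorted ascending by score, so the best keyword comes first.
best_keyword, best_score = keywords[0]
print(best_keyword, best_score)
```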

Helper Methods

Similarity Functions

Usage Examples

Basic Usage

from yake import KeywordExtractor
 
text = """
Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence
concerned with the interactions between computers and human language, in particular how to program computers
to process and analyze large amounts of natural language data.
"""
 
# Simple example with default parameters
kw_extractor = KeywordExtractor()
keywords = kw_extractor.extract_keywords(text)
 
# Print the keywords and their scores
for kw, score in keywords:
    print(f"{kw}: {score:.4f}")

Customized Usage

from yake import KeywordExtractor
 
# Create a custom stopwords set
custom_stopwords = {"the", "a", "an", "in", "on", "at", "of", "for", "with"}
 
# Initialize with custom parameters
kw_extractor = KeywordExtractor(
    lan="en",              # Language
    n=2,                   # Maximum n-gram size
    dedup_lim=0.8,         # Deduplication threshold
    dedup_func="jaro",     # Deduplication function
    window_size=2,         # Window size
    top=10,                # Number of keywords to extract
    stopwords=custom_stopwords
)
 
text = "Machine learning is the study of computer algorithms that improve automatically through experience."
keywords = kw_extractor.extract_keywords(text)
 
# Print the top 10 keywords
for kw, score in keywords:
    print(f"{kw}: {score:.4f}")

Deduplication Functions

The KeywordExtractor supports multiple string similarity algorithms for deduplication:

  1. Jaro-Winkler ("jaro", "jaro_winkler"): Based on character matches with higher weights for prefix matches

  2. Levenshtein Ratio ("levs"): Based on Levenshtein edit distance normalized by string length

  3. SequenceMatcher ("seqm", "sequencematcher"): Based on Python's difflib sequence matching algorithm
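A candidate keyword is discarded when its similarity to an already-accepted keyword exceeds dedup_lim. A minimal sketch of how two of these ratios behave, using Python's stdlib difflib for "seqm" and an illustrative re-implementation of the Levenshtein ratio (YAKE itself delegates to jellyfish and its own Levenshtein helper):

```python
from difflib import SequenceMatcher

def seqm_ratio(a: str, b: str) -> float:
    # "seqm": Python's difflib sequence-matching ratio, 2*matches / total length
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def lev_ratio(a: str, b: str) -> float:
    # "levs": edit distance normalized by the longer string's length
    # (illustrative re-implementation, not YAKE's own Levenshtein class)
    m, n = len(a), len(b)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cur[j] = min(prev[j] + 1,          # deletion
                         cur[j - 1] + 1,       # insertion
                         prev[j - 1] + (a[i - 1] != b[j - 1]))  # substitution
        prev = cur
    return 1.0 - prev[n] / max(m, n) if max(m, n) else 1.0

pair = ("machine learning", "machine learning models")
seqm = seqm_ratio(*pair)   # ~0.82: above a dedup_lim of 0.8, so deduplicated
levs = lev_ratio(*pair)    # ~0.70: below 0.8, so both candidates would survive
print(f"seqm: {seqm:.2f}  levs: {levs:.2f}")
```

Because the algorithms score the same pair differently, the same dedup_lim can deduplicate under one dedup_func and not another, which is why the threshold and the function are configured together.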

Dependencies

The module relies on:

  • os: For file operations and path handling
  • jellyfish: For Jaro-Winkler string similarity
  • yake.data.DataCore: For core data representation
  • .Levenshtein: For Levenshtein distance and ratio calculations
