
DataCore

The DataCore class is the foundation of YAKE (Yet Another Keyword Extractor), providing the core data representation for document analysis and keyword extraction.


Class Overview

class DataCore:
    """
    Core data representation for document analysis and keyword extraction.
    
    This class processes text documents to identify potential keywords based on 
    statistical features and contextual relationships between terms. It maintains 
    the document's structure, processes individual terms, and generates candidate 
    keywords.
    
    Attributes:
        See property accessors below for available attributes.
    """

Given an input text and a stopword set, the class splits the text into sentences and terms, records each term's statistical features and contextual relationships, and assembles candidate keywords of up to n words.

Constructor

Parameters:

  • text (str): The input text to analyze for keyword extraction
  • stopword_set (set): A set of stopwords to filter out non-content words
  • config (dict, optional): Configuration options including:
    • windows_size (int): Size of word window for co-occurrence (default: 2)
    • n (int): Maximum length of keyword phrases (default: 3)
  • tags_to_discard (set): POS tags to ignore (default: {"u", "d"})
    • exclude (set): Characters to exclude (default: string.punctuation)

Example:

from yake.data import DataCore
import string
from yake.stopword_remover import StopwordRemover
 
# Get stopwords
stopword_remover = StopwordRemover("en")
stopword_set = stopword_remover.get_stopword_set()
 
# Initialize with default configuration
data = DataCore("Sample text for analysis", stopword_set)
 
# Initialize with custom configuration
config = {
    "windows_size": 3,
    "n": 4,
    "tags_to_discard": {"u", "d", "p"},
    "exclude": set(string.punctuation)
}
data = DataCore("Sample text for analysis", stopword_set, config)
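The windows_size option controls how far apart two terms may be while still counting as co-occurring. As a rough illustration of what a co-occurrence window does, here is a dependency-free sketch (cooccurrence_pairs is a hypothetical helper for illustration, not part of the YAKE API):

```python
def cooccurrence_pairs(tokens, window_size=2):
    """Return (left, right) pairs of tokens that appear within
    window_size positions of each other, left before right."""
    pairs = []
    for i, left in enumerate(tokens):
        # Look at most window_size tokens ahead of the current one.
        for right in tokens[i + 1 : i + 1 + window_size]:
            pairs.append((left, right))
    return pairs

print(cooccurrence_pairs(["natural", "language", "processing", "systems"]))
# [('natural', 'language'), ('natural', 'processing'),
#  ('language', 'processing'), ('language', 'systems'),
#  ('processing', 'systems')]
```

With window_size=2, "natural" co-occurs with "processing" but not with "systems"; a larger window admits more distant pairs.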

Property Accessors

The DataCore class includes various property accessors for backward compatibility:

Configuration Properties

  • exclude: Characters to exclude from processing
  • tags_to_discard: Part-of-speech tags to ignore during analysis
  • stopword_set: Set of stopwords to filter out
  • g: Directed co-occurrence graph (a networkx DiGraph) linking terms
# Examples
excluded_chars = data.exclude
ignored_tags = data.tags_to_discard
stopwords = data.stopword_set
graph = data.g

Text Statistics Properties

  • number_of_sentences: Count of sentences in the processed text
  • number_of_words: Total number of words processed
# Examples
sentence_count = data.number_of_sentences
word_count = data.number_of_words

Collection Properties

  • terms: Dictionary of SingleWord objects representing individual terms
  • candidates: Dictionary of ComposedWord objects representing keyword candidates
  • sentences_obj: Processed sentence objects
  • sentences_str: Raw sentence strings from the original text
  • freq_ns: Frequency of n-grams by length
# Examples
all_terms = data.terms
all_candidates = data.candidates
processed_sentences = data.sentences_obj
raw_sentences = data.sentences_str
ngram_frequencies = data.freq_ns
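freq_ns maps each candidate length to how often candidates of that length occur. As a minimal sketch of a tally with the same {length: count} shape (ngram_counts_by_length is a hypothetical helper, not part of the YAKE API):

```python
def ngram_counts_by_length(tokens, n=3):
    """Tally how many n-grams of each length 1..n a token list contains,
    returning a {length: count} mapping shaped like freq_ns."""
    counts = {}
    for size in range(1, n + 1):
        # A sequence of L tokens holds L - size + 1 contiguous n-grams.
        counts[size] = max(0, len(tokens) - size + 1)
    return counts

print(ngram_counts_by_length(["keyword", "extraction", "with", "yake"]))
# {1: 4, 2: 3, 3: 2}
```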

Complete Usage Example

from yake.data import DataCore
from yake.stopword_remover import StopwordRemover
 
# Initialize stopwords
stopword_remover = StopwordRemover("en")
stopword_set = stopword_remover.get_stopword_set()
 
# Create DataCore instance
text = "Natural language processing is a field of artificial intelligence that focuses on the interaction between computers and humans using natural language."
data = DataCore(text, stopword_set)
 
# Build features for keyword extraction
data.build_single_terms_features()
data.build_mult_terms_features()
 
# Extract top candidates
candidates = [(cand.unique_kw, cand.h) for cand in data.candidates.values() if cand.is_valid()]
candidates.sort(key=lambda x: x[1])  # Sort by score (lower is better in YAKE)
 
# Print top 5 keywords
for keyword, score in candidates[:5]:
    print(f"{keyword}: {score:.4f}")

Dependencies

The DataCore class relies on:

  • string: For punctuation constants
  • networkx: For graph representation (co-occurrences)
  • numpy: For statistical calculations
  • segtok: For tokenization
  • Internal utility modules:
    • utils: For pre-filtering and tokenization
    • single_word: For representing individual terms
    • composed_word: For representing multi-word candidates
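To show conceptually how the networkx dependency fits in, here is a dependency-free sketch of a weighted, directed co-occurrence graph like the one exposed through the g property (build_cooccurrence_graph is a hypothetical name; DataCore itself uses a networkx DiGraph):

```python
def build_cooccurrence_graph(tokens, window_size=2):
    """Build a directed graph as {source: {target: weight}}, where the
    weight counts how often target follows source within the window."""
    graph = {}
    for i, source in enumerate(tokens):
        for target in tokens[i + 1 : i + 1 + window_size]:
            edges = graph.setdefault(source, {})
            edges[target] = edges.get(target, 0) + 1
    return graph

print(build_cooccurrence_graph(["data", "core", "data", "core"], window_size=1))
# {'data': {'core': 2}, 'core': {'data': 1}}
```

Edge weights of this kind are what let YAKE measure how varied a term's left and right contexts are, one of the statistical features used to score candidates.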
