
DataCore

The DataCore class is the foundation of YAKE (Yet Another Keyword Extractor), providing the core data representation for document analysis and keyword extraction.


Class Overview

class DataCore:
    """
    Core data representation for document analysis and keyword extraction.
    
    This class processes text documents to identify potential keywords based on 
    statistical features and contextual relationships between terms. It maintains 
    the document's structure, processes individual terms, and generates candidate 
    keywords.
    
    Attributes:
        See property accessors below for available attributes.
    """

Given an input text and a stopword set, the class splits the text into sentences and terms, records each term's statistical features and contextual relationships, and assembles candidate keywords of up to n words.

Constructor

Parameters:

  • text (str): The input text to analyze for keyword extraction
  • stopword_set (set): A set of stopwords to filter out non-content words
  • config (dict, optional): Configuration options including:
    • windows_size (int): Size of word window for co-occurrence (default: 2)
    • n (int): Maximum length of keyword phrases (default: 3)
  • tags_to_discard (set): POS tags to ignore (default: {"u", "d"})
    • exclude (set): Characters to exclude (default: string.punctuation)

Example:

from yake.data import DataCore
import string
from yake.stopword_remover import StopwordRemover
 
# Get stopwords
stopword_remover = StopwordRemover("en")
stopword_set = stopword_remover.get_stopword_set()
 
# Initialize with default configuration
data = DataCore("Sample text for analysis", stopword_set)
 
# Initialize with custom configuration
config = {
    "windows_size": 3,
    "n": 4,
    "tags_to_discard": {"u", "d", "p"},
    "exclude": set(string.punctuation)
}
data = DataCore("Sample text for analysis", stopword_set, config)
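The windows_size option controls how far apart two terms may be while still counting as co-occurring. As a rough illustration of what a co-occurrence window does, here is a dependency-free sketch (cooccurrence_pairs is a hypothetical helper for illustration, not part of the YAKE API):

```python
def cooccurrence_pairs(tokens, window_size=2):
    """Return (left, right) pairs of tokens that appear within
    window_size positions of each other, left before right."""
    pairs = []
    for i, left in enumerate(tokens):
        # Look at most window_size tokens ahead of the current one.
        for right in tokens[i + 1 : i + 1 + window_size]:
            pairs.append((left, right))
    return pairs

print(cooccurrence_pairs(["natural", "language", "processing", "systems"]))
# [('natural', 'language'), ('natural', 'processing'),
#  ('language', 'processing'), ('language', 'systems'),
#  ('processing', 'systems')]
```

With window_size=2, "natural" co-occurs with "processing" but not with "systems"; a larger window admits more distant pairs.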

Property Accessors

The DataCore class includes various property accessors for backward compatibility:

Configuration Properties

  • exclude: Characters to exclude from processing
  • tags_to_discard: Part-of-speech tags to ignore during analysis
  • stopword_set: Set of stopwords to filter out
  • g: Directed co-occurrence graph (a networkx DiGraph) linking terms
# Examples
excluded_chars = data.exclude
ignored_tags = data.tags_to_discard
stopwords = data.stopword_set
graph = data.g

Text Statistics Properties

  • number_of_sentences: Count of sentences in the processed text
  • number_of_words: Total number of words processed
# Examples
sentence_count = data.number_of_sentences
word_count = data.number_of_words

Collection Properties

  • terms: Dictionary of SingleWord objects representing individual terms
  • candidates: Dictionary of ComposedWord objects representing keyword candidates
  • sentences_obj: Processed sentence objects
  • sentences_str: Raw sentence strings from the original text
  • freq_ns: Frequency of n-grams by length
# Examples
all_terms = data.terms
all_candidates = data.candidates
processed_sentences = data.sentences_obj
raw_sentences = data.sentences_str
ngram_frequencies = data.freq_ns
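freq_ns maps each candidate length to how often candidates of that length occur. As a minimal sketch of a tally with the same {length: count} shape (ngram_counts_by_length is a hypothetical helper, not part of the YAKE API):

```python
def ngram_counts_by_length(tokens, n=3):
    """Tally how many n-grams of each length 1..n a token list contains,
    returning a {length: count} mapping shaped like freq_ns."""
    counts = {}
    for size in range(1, n + 1):
        # A sequence of L tokens holds L - size + 1 contiguous n-grams.
        counts[size] = max(0, len(tokens) - size + 1)
    return counts

print(ngram_counts_by_length(["keyword", "extraction", "with", "yake"]))
# {1: 4, 2: 3, 3: 2}
```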

Complete Usage Example

from yake.data import DataCore
from yake.stopword_remover import StopwordRemover
 
# Initialize stopwords
stopword_remover = StopwordRemover("en")
stopword_set = stopword_remover.get_stopword_set()
 
# Create DataCore instance
text = "Natural language processing is a field of artificial intelligence that focuses on the interaction between computers and humans using natural language."
data = DataCore(text, stopword_set)
 
# Build features for keyword extraction
data.build_single_terms_features()
data.build_mult_terms_features()
 
# Extract top candidates
candidates = [(cand.unique_kw, cand.h) for cand in data.candidates.values() if cand.is_valid()]
candidates.sort(key=lambda x: x[1])  # Sort by score (lower is better in YAKE)
 
# Print top 5 keywords
for keyword, score in candidates[:5]:
    print(f"{keyword}: {score:.4f}")

Dependencies

The DataCore class relies on:

  • string: For punctuation constants
  • networkx: For graph representation (co-occurrences)
  • numpy: For statistical calculations
  • segtok: For tokenization
  • Internal utility modules:
    • utils: For pre-filtering and tokenization
    • single_word: For representing individual terms
    • composed_word: For representing multi-word candidates
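To show conceptually how the networkx dependency fits in, here is a dependency-free sketch of a weighted, directed co-occurrence graph like the one exposed through the g property (build_cooccurrence_graph is a hypothetical name; DataCore itself uses a networkx DiGraph):

```python
def build_cooccurrence_graph(tokens, window_size=2):
    """Build a directed graph as {source: {target: weight}}, where the
    weight counts how often target follows source within the window."""
    graph = {}
    for i, source in enumerate(tokens):
        for target in tokens[i + 1 : i + 1 + window_size]:
            edges = graph.setdefault(source, {})
            edges[target] = edges.get(target, 0) + 1
    return graph

print(build_cooccurrence_graph(["data", "core", "data", "core"], window_size=1))
# {'data': {'core': 2}, 'core': {'data': 1}}
```

Edge weights of this kind are what let YAKE measure how varied a term's left and right contexts are, one of the statistical features used to score candidates.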
