DataCore
The DataCore class is the foundation of YAKE (Yet Another Keyword Extractor), providing the core data representation for document analysis and keyword extraction.
Class Overview
The DataCore class processes text documents to identify potential keywords based on statistical features and contextual relationships.
Constructor
Parameters:
- text (str): The input text to analyze for keyword extraction
- stopword_set (set): A set of stopwords to filter out non-content words
- config (dict, optional): Configuration options including:
  - windows_size (int): Size of word window for co-occurrence (default: 2)
  - n (int): Maximum length of keyword phrases (default: 3)
  - tags_to_discard (set): POS tags to ignore (default: {'u', 'd'})
  - exclude (set): Characters to exclude (default: string.punctuation)
Example:
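The sketch below shows one way the optional config dictionary could be merged with the documented defaults before it is handed to the constructor. The helper name resolve_config and the DEFAULTS table are hypothetical and not part of YAKE. The two-tag set used for tags_to_discard is an assumption about the default. The final constructor call is left as a comment because the import path is not stated here.

```python
import string

# Defaults taken from the parameter list; DEFAULTS and resolve_config are
# hypothetical names used only for this illustration.
DEFAULTS = {
    "windows_size": 2,                   # word window for co-occurrence
    "n": 3,                              # maximum keyword phrase length
    "tags_to_discard": {"u", "d"},       # assumed default tag set
    "exclude": set(string.punctuation),  # characters to exclude
}


def resolve_config(config=None):
    """Return the defaults, overridden by any keys present in config."""
    resolved = {
        key: set(value) if isinstance(value, set) else value
        for key, value in DEFAULTS.items()
    }
    resolved.update(config or {})
    return resolved


text = "YAKE is a light-weight unsupervised keyword extraction method."
stopword_set = {"is", "a"}
config = resolve_config({"n": 2})  # override only the phrase length

# With DataCore imported from the YAKE package, the call would then follow
# the parameter list above (not executed in this sketch):
# data = DataCore(text, stopword_set, config)
```

Passing only the keys that differ from the defaults keeps call sites short, and copying the default sets avoids accidentally mutating shared state.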
Core Methods
Public API Methods
Property Accessors
The DataCore class includes various property accessors for backward compatibility:
Configuration Properties
- exclude: Characters to exclude from processing
- tags_to_discard: Part-of-speech tags to ignore during analysis
- stopword_set: Set of stopwords to filter out
- g: DirectedGraph representing term co-occurrences
Text Statistics Properties
- number_of_sentences: Count of sentences in the processed text
- number_of_words: Total number of words processed
Collection Properties
- terms: Dictionary of SingleWord objects representing individual terms
- candidates: Dictionary of ComposedWord objects representing keyword candidates
- sentences_obj: Processed sentence objects
- sentences_str: Raw sentence strings from the original text
- freq_ns: Frequency of n-grams by length
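As a rough illustration of what a frequency-by-length table such as freq_ns might hold, the sketch below counts the contiguous n-grams of each length from 1 to n over tokenized sentences. This is one plausible reading of "frequency of n-grams by length". The library may count only valid candidates, so treat the function name ngram_counts_by_length and the counting rule as assumptions.

```python
def ngram_counts_by_length(sentences, n=3):
    """Map each n-gram length (1..n) to the number of contiguous n-grams."""
    counts = {length: 0 for length in range(1, n + 1)}
    for tokens in sentences:
        for length in range(1, n + 1):
            # A sentence with k tokens holds max(0, k - length + 1) n-grams.
            counts[length] += max(0, len(tokens) - length + 1)
    return counts


sentences = [
    ["yake", "extracts", "keywords"],
    ["it", "is", "unsupervised", "too"],
]
freq_ns = ngram_counts_by_length(sentences, n=3)
print(freq_ns)  # {1: 7, 2: 5, 3: 3}
```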
Complete Usage Example
Dependencies
The DataCore class relies on:
- string: For punctuation constants
- networkx: For graph representation (co-occurrences)
- numpy: For statistical calculations
- segtok: For tokenization
- Internal utility modules:
  - utils: For pre-filtering and tokenization
  - single_word: For representing individual terms
  - composed_word: For representing multi-word candidates