entry

de_wiktio.entry

This module provides methods to parse wikitext and extract data from Wiktionary pages.

Classes:

  • Entry

    Entry class for parsing wikitext from main content pages (ns = 0) of the German Wiktionary.

  • EntryFlexion

    Entry class for parsing wikitext from Flexion pages (ns = 108).

  • WordForm

    A class representing a word form.

  • Tools

    Collection of utility functions.

Classes

Entry(title, wikitext, status='OK', extracted_from=None)

Bases: _EntryBase

Entry class for parsing wikitext from main content pages (ns = 0) of the German Wiktionary.

This class deals with the German section of the page, i.e. the German-to-German dictionary. Therefore, it does not parse multilingual entries such as English-to-German or French-to-German.

Parameters:

  • title (str) –

    The title of the Wiktionary page.

  • wikitext (str) –

    The raw wikitext of the page.

  • status (str, default: 'OK' ) –

    The status of the entry.

  • extracted_from (Optional[str], default: None ) –

    The source of the extraction.

Methods:

  • from_export

    Create a class instance by fetching the wiki page online.

  • from_dump

    Create a class instance by fetching the wikitext from a local dictionary.

  • get_wikidict

    Load the dictionary from the pickle file.

  • print_sections_tree

    Print the headings tree.

Source code in de_wiktio\entry.py
def __init__(self, title: str, wikitext: str, status: str = 'OK', extracted_from: Optional[str] = None) -> None:
    """
    The Entry class constructor.

    Args:
        title: The title of the Wiktionary page.
        wikitext: The raw *wikitext* of the page.
        status: The status of the entry.
        extracted_from: The source of the extraction.
    """
    super().__init__(title, wikitext, status, extracted_from)
    self.german: Wikicode = self._get_section_de()
    """The German section of the page."""
Attributes
WIKIDICT: dict class-attribute instance-attribute

Class attribute: Dictionary of title-wikitext pairs.

To be accessed when using the from_dump class method.
This is a lazy attribute. It loads the dictionary when needed. After that, it is kept in memory as a class attribute so that the dictionary is not loaded multiple times when using the from_dump class method to create a new Entry object.

title: str = title instance-attribute

The title of the Wiktionary page.

text: str = wikitext instance-attribute

The wikitext of the page.

status: str = status instance-attribute

The status of the parsing and extraction.

Some values are: 'OK', 'No content for {title} in exported page' or 'No proper wiki namespace found for {title}'

extracted_from: str = extracted_from instance-attribute

The source of the extraction.

The possible values are: 'from dump', 'from export' or None. A None value indicates that the instance was created directly from the constructor passing the wikitext and title of the page.

parsed: Wikicode property

The parsed wikitext of the page.

The wikitext is parsed using the mwparserfromhell library.

NS: str = '0' class-attribute instance-attribute

Class attribute: The namespace of the entry, set to '0'.

german: Wikicode = self._get_section_de() instance-attribute

The German section of the page.

wordforms: List[WordForm] property

List of German word forms.

Functions
from_export(title) classmethod

Create a class instance by fetching the wiki page online.

Parameters:

  • title (str) –

    The title of the Wiktionary page.

Returns:

  • Self

    An EntryBase or a subclass instance

Source code in de_wiktio\entry.py
@classmethod
def from_export(cls, title: str) -> Self:
    """Create a class instance by fetching the wiki page online.

    Args:
        title: The title of the Wiktionary page.

    Returns:
        An EntryBase or a subclass instance
    """
    fetched = PageExport(title)
    wikitext = fetched.wikitext

    if fetched.wikitext == '':
        status = f'No content for {title} in exported page'
    elif fetched.ns != cls.NS:
        status = f'No proper wiki namespace found for {title}'
        wikitext = ''
    else:
        status = 'OK'
    return cls(title, wikitext, status,'from export')
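The three status branches above can be illustrated in isolation. The following is a minimal sketch, not part of de_wiktio: `resolve_export` is a hypothetical helper that takes the fetched wikitext and namespace as plain strings instead of calling `PageExport`.

```python
from typing import Tuple


def resolve_export(title: str, wikitext: str, ns: str, expected_ns: str) -> Tuple[str, str]:
    """Return (wikitext, status) using the same three branches as from_export."""
    if wikitext == '':
        return '', f'No content for {title} in exported page'
    if ns != expected_ns:
        # Wrong namespace: the fetched wikitext is discarded
        return '', f'No proper wiki namespace found for {title}'
    return wikitext, 'OK'


# Hypothetical sample inputs
ok = resolve_export('Haus', '== Haus ==', '0', '0')
wrong_ns = resolve_export('Flexion:gehen', '== gehen ==', '108', '0')
empty = resolve_export('Nichtvorhanden', '', '0', '0')
```

Note that a namespace mismatch yields an empty wikitext, so callers would likely want to check `status` rather than rely on the text alone.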
from_dump(title, dict_path=None) classmethod

Create a class instance by fetching the wikitext from a local dictionary.

During the session, only one dictionary is loaded. It is read from the pickle file 'wikidict_{cls.NS}.pkl', which is located in dict_path or, if dict_path is None, in the folder indicated in Settings.

Parameters:

  • title (str) –

    The title of the Wiktionary page to fetch.

  • dict_path (Optional[str], default: None ) –

    Path to the folder containing the dictionary. If None, the folder indicated in Settings will be used.

Returns:

  • Self

    An EntryBase or a subclass instance

Source code in de_wiktio\entry.py
@classmethod
def from_dump(cls, title: str, dict_path: Optional[str] = None) -> Self: 
    """
    Create a class instance by fetching the *wikitext* from local dictionary.

    During the session, only one dictionary is loaded. The dictionary is loaded from the pickle file `'wikidict_{*cls.NS*}.pkl'`, which is located in `dict_path`.  

    Args:
        title: The title of the Wiktionary page to fetch.
        dict_path: Path to the folder containing the dictionary. If `None`, the folder indicated in `Settings` will be used.

    Returns:
        An EntryBase or a subclass instance
    """
    wikidict = cls.get_wikidict(dict_path)
    wikitext = wikidict.get(title, '')
    status = 'OK' if wikitext != '' else f'No content for {title} in dump file'
    return cls(title, wikitext, status, 'from dump')
get_wikidict(dict_path=None) classmethod

Load the dictionary from the pickle file.

If the dictionary is already loaded, return the dictionary from memory. Per session, only one dictionary is loaded. The dictionary is loaded from the pickle file 'wikidict_{cls.NS}.pkl' in dict_path or in the folder indicated in Settings if dict_path is not provided.

Parameters:

  • dict_path (Optional[str], default: None ) –

    Path to the folder containing the dictionary. If None, the folder indicated in Settings will be used.

Returns:

  • Dict[str, str]

    A dictionary with page titles as keys and their corresponding wikitext as values.

Source code in de_wiktio\entry.py
@classmethod
def get_wikidict(cls, dict_path: Optional[str] = None) -> Dict[str, str]:
    """
    Load the dictionary from the pickle file. 

    If the dictionary is already loaded, return the dictionary from memory. Per session, only one dictionary is loaded. The dictionary is loaded from the pickle file 'wikidict_{*cls.NS*}.pkl' in `dict_path` or in the folder indicated in `Settings` if `dict_path` is not provided.

    Args:
        dict_path: Path to the folder containing the dictionary. If `None`, the folder indicated in `Settings` will be used.

    Returns:
        A dictionary with page titles as keys and their corresponding *wikitext* as values.
    """
    # if the dictionary is already loaded, return it
    if cls.WIKIDICT is not None:
        return cls.WIKIDICT

    # otherwise, load the dictionary
    if dict_path is None:
        cls.WIKIDICT = WikiDump.load_wikidict_by_ns(ns=cls.NS)
    else:
        _file = Path(dict_path) / f'wikidict_{cls.NS}.pkl'
        cls.WIKIDICT = WikiDump.load_wikidict_by_ns(file=_file, ns=cls.NS)

    return cls.WIKIDICT
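One consequence of this lazy, class-level cache is that a later call with a different `dict_path` returns the dictionary that was loaded first. The sketch below reproduces the pattern with the standard library only. `LazyDictBase`, `CACHE`, and `loads` are hypothetical names for illustration and do not exist in de_wiktio.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Dict, Optional


class LazyDictBase:
    """Hypothetical stand-in for the caching pattern; not part of de_wiktio."""
    NS = '0'
    CACHE: Optional[Dict[str, str]] = None
    loads = 0  # counts real disk reads, for demonstration only

    @classmethod
    def get_dict(cls, dict_path: str) -> Dict[str, str]:
        # Cache hit: return the dictionary already held in memory
        if cls.CACHE is not None:
            return cls.CACHE
        file = Path(dict_path) / f'wikidict_{cls.NS}.pkl'
        with open(file, 'rb') as fh:
            cls.CACHE = pickle.load(fh)
        cls.loads += 1
        return cls.CACHE


with tempfile.TemporaryDirectory() as dir_a, tempfile.TemporaryDirectory() as dir_b:
    # Write two different dictionaries under the same file name
    for folder, data in ((dir_a, {'Haus': 'text A'}), (dir_b, {'Haus': 'text B'})):
        with open(Path(folder) / 'wikidict_0.pkl', 'wb') as fh:
            pickle.dump(data, fh)
    first = LazyDictBase.get_dict(dir_a)
    second = LazyDictBase.get_dict(dir_b)  # cache hit: dir_b is never read
```

If switching dictionaries within one session is needed, the cached attribute would presumably have to be reset to None first.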
print_sections_tree(section=None, level=2)

Print the headings tree.

Parameters:

  • section (Optional[Wikicode], default: None ) –

    The Wikicode section to start printing from. If not provided, prints the headings tree of the entire wikitext.

  • level (int, default: 2 ) –

    The initial heading level from where to start printing.

Source code in de_wiktio\entry.py
def print_sections_tree(self, section: Optional[Wikicode] = None, level: int = 2) -> None:
    """
    Print the headings tree.

    Args:
        section: The `Wikicode` section to start printing from. If not provided, prints the headings tree of the entire wikitext.
        level: The initial heading level from where to start printing.
    """
    if section is None:
        section = self.parsed

    headings = section.filter_headings()
    for heading in headings:
        if level <= heading.level:
            print(' ' * 8 * (heading.level - level), heading.level, heading.title)
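To show how the indentation is computed, here is a small sketch that mirrors the loop above using plain `(level, title)` tuples in place of mwparserfromhell heading objects. `format_tree` and the sample headings are hypothetical and only serve as illustration.

```python
from typing import List, Tuple


def format_tree(headings: List[Tuple[int, str]], level: int = 2) -> List[str]:
    """Build the lines that the print call above would emit."""
    lines = []
    for h_level, title in headings:
        if level <= h_level:
            # Eight spaces of indentation per level below the starting level
            indent = ' ' * 8 * (h_level - level)
            # print(a, b, c) joins its arguments with single spaces
            lines.append(' '.join([indent, str(h_level), title]))
    return lines


sample = [(2, 'Haus (Deutsch)'), (3, 'Substantiv, n'), (4, 'Übersetzungen')]
tree = format_tree(sample)
subtree = format_tree(sample, level=3)  # level-2 headings are skipped
```

Headings shallower than `level` are skipped entirely, so raising `level` both filters the output and shifts the remaining headings to the left.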

EntryFlexion(title, wikitext, status='OK', extracted_from=None)

Bases: _EntryBase

Entry class for parsing wikitext from Flexion pages (ns = 108).

Flexion pages hold the complete inflection tables for verbs and adjectives. These pages are referred to as Flexionsseiten in the German Wiktionary. They extend the inflection tables of the main content pages.

This class deals with the German section of the page, i.e. the German-to-German dictionary. Therefore, it does not parse multilingual entries such as English-to-German or French-to-German.

Parameters:

  • title (str) –

    The title of the Wiktionary page.

  • wikitext (str) –

    The raw wikitext of the page.

  • status (str, default: 'OK' ) –

    The status of the entry.

  • extracted_from (Optional[str], default: None ) –

    The source of the extraction.

Methods:

  • from_export

    Create a class instance by fetching the wiki page online.

  • from_dump

    Create a class instance by fetching the wikitext from a local dictionary.

  • get_wikidict

    Load the dictionary from the pickle file.

  • print_sections_tree

    Print the headings tree.

  • inflections

    Retrieve a list of dictionaries from the inflection templates.

Source code in de_wiktio\entry.py
def __init__(self, title: str, wikitext: str, status: str = 'OK', extracted_from: Optional[str] = None) -> None:
    """
    The EntryFlexion class constructor.

    Args:
        title: The title of the Wiktionary page.
        wikitext: The raw *wikitext* of the page.
        status: The status of the entry.
        extracted_from: The source of the extraction.

    """
    super().__init__(title, wikitext, status, extracted_from)
    self.german: Wikicode = self._get_section_de()
    """The German section of the page."""
Attributes
WIKIDICT: dict class-attribute instance-attribute

Class attribute: Dictionary of title-wikitext pairs.

To be accessed when using the from_dump class method.
This is a lazy attribute. It loads the dictionary when needed. After that, it is kept in memory as a class attribute so that the dictionary is not loaded multiple times when using the from_dump class method to create a new Entry object.

title: str = title instance-attribute

The title of the Wiktionary page.

text: str = wikitext instance-attribute

The wikitext of the page.

status: str = status instance-attribute

The status of the parsing and extraction.

Some values are: 'OK', 'No content for {title} in exported page' or 'No proper wiki namespace found for {title}'

extracted_from: str = extracted_from instance-attribute

The source of the extraction.

The possible values are: 'from dump', 'from export' or None. A None value indicates that the instance was created directly from the constructor passing the wikitext and title of the page.

parsed: Wikicode property

The parsed wikitext of the page.

The wikitext is parsed using the mwparserfromhell library.

NS = '108' class-attribute instance-attribute

Class attribute: The namespace of the entry, set to '108'.

german: Wikicode = self._get_section_de() instance-attribute

The German section of the page.

pos: List[str] property

List of German Part Of Speech (POS).

The POS are extracted from the name of the flexion templates in the body.
The possible values are "Adjektiv", "Verb", "Adverb", "Gerundivum", or "Numerale".

flexion_tpls: List[Template] property

List of German flexion templates.

Templates are extracted from the body of the page.

Functions
from_export(title) classmethod

Create a class instance by fetching the wiki page online.

Parameters:

  • title (str) –

    The title of the Wiktionary page.

Returns:

  • Self

    An EntryBase or a subclass instance

Source code in de_wiktio\entry.py
@classmethod
def from_export(cls, title: str) -> Self:
    """Create a class instance by fetching the wiki page online.

    Args:
        title: The title of the Wiktionary page.

    Returns:
        An EntryBase or a subclass instance
    """
    fetched = PageExport(title)
    wikitext = fetched.wikitext

    if fetched.wikitext == '':
        status = f'No content for {title} in exported page'
    elif fetched.ns != cls.NS:
        status = f'No proper wiki namespace found for {title}'
        wikitext = ''
    else:
        status = 'OK'
    return cls(title, wikitext, status,'from export')
from_dump(title, dict_path=None) classmethod

Create a class instance by fetching the wikitext from a local dictionary.

During the session, only one dictionary is loaded. It is read from the pickle file 'wikidict_{cls.NS}.pkl', which is located in dict_path or, if dict_path is None, in the folder indicated in Settings.

Parameters:

  • title (str) –

    The title of the Wiktionary page to fetch.

  • dict_path (Optional[str], default: None ) –

    Path to the folder containing the dictionary. If None, the folder indicated in Settings will be used.

Returns:

  • Self

    An EntryBase or a subclass instance

Source code in de_wiktio\entry.py
@classmethod
def from_dump(cls, title: str, dict_path: Optional[str] = None) -> Self: 
    """
    Create a class instance by fetching the *wikitext* from local dictionary.

    During the session, only one dictionary is loaded. The dictionary is loaded from the pickle file `'wikidict_{*cls.NS*}.pkl'`, which is located in `dict_path`.  

    Args:
        title: The title of the Wiktionary page to fetch.
        dict_path: Path to the folder containing the dictionary. If `None`, the folder indicated in `Settings` will be used.

    Returns:
        An EntryBase or a subclass instance
    """
    wikidict = cls.get_wikidict(dict_path)
    wikitext = wikidict.get(title, '')
    status = 'OK' if wikitext != '' else f'No content for {title} in dump file'
    return cls(title, wikitext, status, 'from dump')
get_wikidict(dict_path=None) classmethod

Load the dictionary from the pickle file.

If the dictionary is already loaded, return the dictionary from memory. Per session, only one dictionary is loaded. The dictionary is loaded from the pickle file 'wikidict_{cls.NS}.pkl' in dict_path or in the folder indicated in Settings if dict_path is not provided.

Parameters:

  • dict_path (Optional[str], default: None ) –

    Path to the folder containing the dictionary. If None, the folder indicated in Settings will be used.

Returns:

  • Dict[str, str]

    A dictionary with page titles as keys and their corresponding wikitext as values.

Source code in de_wiktio\entry.py
@classmethod
def get_wikidict(cls, dict_path: Optional[str] = None) -> Dict[str, str]:
    """
    Load the dictionary from the pickle file. 

    If the dictionary is already loaded, return the dictionary from memory. Per session, only one dictionary is loaded. The dictionary is loaded from the pickle file 'wikidict_{*cls.NS*}.pkl' in `dict_path` or in the folder indicated in `Settings` if `dict_path` is not provided.

    Args:
        dict_path: Path to the folder containing the dictionary. If `None`, the folder indicated in `Settings` will be used.

    Returns:
        A dictionary with page titles as keys and their corresponding *wikitext* as values.
    """
    # if the dictionary is already loaded, return it
    if cls.WIKIDICT is not None:
        return cls.WIKIDICT

    # otherwise, load the dictionary
    if dict_path is None:
        cls.WIKIDICT = WikiDump.load_wikidict_by_ns(ns=cls.NS)
    else:
        _file = Path(dict_path) / f'wikidict_{cls.NS}.pkl'
        cls.WIKIDICT = WikiDump.load_wikidict_by_ns(file=_file, ns=cls.NS)

    return cls.WIKIDICT
print_sections_tree(section=None, level=2)

Print the headings tree.

Parameters:

  • section (Optional[Wikicode], default: None ) –

    The Wikicode section to start printing from. If not provided, prints the headings tree of the entire wikitext.

  • level (int, default: 2 ) –

    The initial heading level from where to start printing.

Source code in de_wiktio\entry.py
def print_sections_tree(self, section: Optional[Wikicode] = None, level: int = 2) -> None:
    """
    Print the headings tree.

    Args:
        section: The `Wikicode` section to start printing from. If not provided, prints the headings tree of the entire wikitext.
        level: The initial heading level from where to start printing.
    """
    if section is None:
        section = self.parsed

    headings = section.filter_headings()
    for heading in headings:
        if level <= heading.level:
            print(' ' * 8 * (heading.level - level), heading.level, heading.title)
inflections()

Retrieve a list of dictionaries from the inflection templates.

Returns:

  • List[Dict[str, str]]

    A list of dictionaries where each dictionary represents an inflection template.

Source code in de_wiktio\entry.py
def inflections(self) -> List[Dict[str, str]]:
    """Retrieve a list of dictionaries from the inflection templates.

    Returns:
        A list of dictionaries where each dictionary represents an inflection template.
    """
    return [Tools.template_to_dict(template) for template in self.flexion_tpls]

WordForm(wordform, entry=None)

A class representing a word form.

Future work
  • Add translations

Parameters:

  • wordform (Wikicode) –

    A Wikicode object containing the word form.

  • entry (Entry, default: None ) –

    The Entry object to which the word form belongs.

Methods:

  • inflections

    Retrieve a list of inflections of the word form from the main content page (ns = 0).

  • inflections_extended

    List of dictionaries with the inflections templates from the Flexion pages (ns = 108).

  • other_content_extract

    Extract other content such as Bedeutungen, Beispiele, Synonyme, or Sprichwörter.

Attributes:

  • wordform (Wikicode) –

    A Wikicode object containing the word form.

  • status (str) –

    The status of the word form.

  • entry (Entry) –

    The Entry object to which the word form belongs.

  • heading (Heading) –

    The heading of the word form.

  • pos (List[str]) –

    List of Part Of Speech (POS) or empty list if no POS are found.

  • wortart_tpls (List[Template]) –

    List of Wortart template objects. Returns an empty list if no Wortart template is found.

  • übersichten_tpls (List[Template]) –

    List of templates generating inflection tables (Flexionstabellen) in the main content page.

Source code in de_wiktio\entry.py
def __init__(self, wordform: Wikicode, entry: Optional[Entry] = None) -> None:
    """The WordForm constructor.

    Args:
        wordform: A Wikicode object containing the word form.
        entry: The `Entry` object to which the word form belongs, if any.
    """
    self.wordform: Wikicode = wordform
    """A Wikicode object containing the word form."""
    self.status: str = 'OK'
    """The status of the word form."""
    self.entry: Optional[Entry] = entry
    """The `Entry` object to which the word form belongs."""
Attributes
wordform: Wikicode = wordform instance-attribute

A Wikicode object containing the word form.

status: str = 'OK' instance-attribute

The status of the word form.

entry: Entry = entry instance-attribute

The Entry object to which the word form belongs.

heading: Heading property

The heading of the word form.

pos: List[str] property

List of Part Of Speech (POS) or empty list if no POS are found.

The POS are extracted from the Wortart templates of the word form.

wortart_tpls: List[Template] property

List of Wortart template objects. Returns an empty list if no Wortart template is found.

Note: In principle, one would expect only one Wortart template per word form, but in practice, there can be more than one.

übersichten_tpls: List[Template] property

List of templates generating inflection tables (Flexionstabellen) in the main content page.

These tables provide a brief overview (Übersicht) of the word form's inflections.
Full inflection tables can be found in the corresponding Flexionsseiten (Flexion wiki namespace).

Note: Most word forms have either none or only one Übersicht template, but there are cases where they have more than one, such as for 'Mars' and 'Partikel'.

Functions
inflections(all=False)

Retrieve a list of inflections of the word form from the main content page (ns = 0).

The inflections are extracted from the Übersicht templates from the main content page.

Parameters:

  • all (bool, default: False ) –

    If True, all keys of the inflection templates are returned; otherwise keys relating to the image ('Bild') and purely numeric (positional) keys are removed.

Returns:

  • List[Dict[str, str]]

    A list of dictionaries, where each dictionary represents an inflection table. The keys of the dictionaries are the parameter names of the Übersicht template, and the values are the corresponding parameter values.

Source code in de_wiktio\entry.py
def inflections(self, all: bool=False) -> List[Dict[str, str]]:
    """
    Retrieve a list of inflections of the word form from the main content page (ns = 0).

    The inflections are extracted from the *Übersicht* templates from the main content page.

    Args:
        all: If `True`, all keys of the inflections templates are returned, otherwise keys relating to the image ('Bild') are removed.

    Returns:
        A list of dictionaries, where each dictionary represents an inflection table. The keys of the dictionaries are the parameter names of the *Übersicht* template, and the values are the corresponding parameter values.  
    """
    flexions = [Tools.template_to_dict(template) for template in self.übersichten_tpls]
    if all:
        return flexions
    else:
        return [{k:v for k,v in flexion.items() 
            if 'Bild' not in k  
            and not k.isnumeric()} for flexion in flexions]
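The filtering step can be tried on a plain dictionary. The sketch below is not part of de_wiktio: `drop_image_and_positional` is a hypothetical helper, and the sample keys are made up to resemble Übersicht template parameters.

```python
from typing import Dict, List


def drop_image_and_positional(flexions: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Remove keys containing 'Bild' and purely numeric keys, as in the else branch above."""
    return [{k: v for k, v in flexion.items()
             if 'Bild' not in k and not k.isnumeric()}
            for flexion in flexions]


# Hypothetical sample resembling one converted template
raw = [{
    'Genus': 'n',
    'Nominativ Singular': 'Haus',
    'Bild': 'Haus.jpg',
    'Bildbreite': '200px',
    '1': 'positional value',
}]
filtered = drop_image_and_positional(raw)
```

Because the test is a substring match, any key that merely contains 'Bild' is dropped as well, not only the key named exactly 'Bild'.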
inflections_extended()

List of dictionaries with the inflections templates from the Flexion pages (ns = 108).

In general, these are small dictionaries that provide additional information for the construction of extended inflection tables.

Returns:

  • List[Dict[str, str]]

    A list of dictionaries, or an empty list if no inflection templates are found or the word form is not a verb or adjective.

Source code in de_wiktio\entry.py
def inflections_extended(self) -> List[Dict[str, str]]:
    """List of dictionaries with the inflections templates from the Flexion pages (ns = 108).

    In general, these are small dictionaries that provide additional information for the construction of extended inflection tables.

    Returns: 
        A list of dictionaries, or an empty list if no inflection templates are found or the word form is not a verb or adjective.
    """
    if self._flexionseite:
        return self._flexionseite.inflections()
    else:
        return []
other_content_extract(name, strip_code=True, strip_kw=None)

Extract other content such as Bedeutungen, Beispiele, Synonyme, or Sprichwörter.

Extracts other types of content which are located within the word form section of the page in separate paragraphs. The first line of the paragraph includes only a template without parameters, whose name is the type of content to extract. The content is extracted from the second line until the end of the paragraph.

Parameters:

  • name (str) –

    The name of the template to extract content from, e.g. "Bedeutungen", "Beispiele", "Synonyme", or "Sprichwörter", or any other template name that follows the same pattern.

  • strip_code (bool, default: True ) –

    Whether to strip wikitext code from the extracted content and return plain text.

  • strip_kw (Optional[Dict[str, str]], default: None ) –

    A dictionary of keyword arguments to pass to the strip_code method of mwparserfromhell.nodes.Wikicode objects.

Returns:

  • Optional[str]

    The extracted content, either as plain text or raw wikitext, or None if no matching paragraph is found.

Source code in de_wiktio\entry.py
def other_content_extract(self, name: str, strip_code: bool = True, strip_kw: Optional[Dict[str, str]] = None) -> Optional[str]:
    """Extract other content such as `Bedeutungen`, `Beispiele`, `Synonyme`, or `Sprichwörter`.

    Extracts other types of content which are located within the word form section of the page in separate paragraphs. The first line of the paragraph includes only a template without parameters, whose name is the type of content to extract. The content is extracted from the second line until the end of the paragraph.

    Args:
        name: The name of the template to extract content from, e.g. "Bedeutungen", "Beispiele", "Synonyme", or "Sprichwörter", or any other template name that follows the same pattern.
        strip_code: Whether to strip *wikitext* code from the extracted content and return plain text.
        strip_kw: A dictionary of keyword arguments to pass to `strip_code` method of [`mwparserfromhell.nodes.Wikicode`][mwparserfromhell.wikicode.Wikicode.strip_code] objects.

    Returns:
        The extracted content, either as plain text or raw *wikitext*.
    """
    text = str(self.wordform)
    # escape the name so that regex metacharacters in it are matched literally
    pattern = r'\n\n\{\{' + re.escape(name) + r'\}\}\n(.+?)\n\n'
    search = re.search(pattern, text, re.DOTALL)

    if search is None:
        return None  # no paragraph headed by the requested template was found

    content = search.group(1)

    if strip_code:
        if strip_kw is not None:
            content = mwparserfromhell.parse(content).strip_code(**strip_kw)
        else:
            content = mwparserfromhell.parse(content).strip_code()

    return content
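The paragraph-matching step can be illustrated without mwparserfromhell. This is a minimal sketch under stated assumptions: `extract_paragraph` is a hypothetical helper, the sample wikitext is invented, and `re.escape` is used as a precaution for template names that contain regex metacharacters.

```python
import re
from typing import Optional


def extract_paragraph(text: str, name: str) -> Optional[str]:
    """Return the lines following a bare {{name}} template, up to the next blank line."""
    # The template must be preceded by a blank line and sit alone on its line
    pattern = r'\n\n\{\{' + re.escape(name) + r'\}\}\n(.+?)\n\n'
    match = re.search(pattern, text, re.DOTALL)
    return match.group(1) if match else None


# Invented sample resembling a word form section
sample = (
    "=== Substantiv ===\n\n"
    "{{Bedeutungen}}\n"
    ":[1] erste Bedeutung\n"
    ":[2] zweite Bedeutung\n\n"
    "{{Beispiele}}\n"
    ":[1] Ein Satz.\n\n"
)
meanings = extract_paragraph(sample, 'Bedeutungen')
examples = extract_paragraph(sample, 'Beispiele')
missing = extract_paragraph(sample, 'Synonyme')
```

Since the pattern requires a blank line after the paragraph, a paragraph that ends the section without a trailing blank line would not match in this sketch.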

Tools

Collection of utility functions.

Methods:

  • template_to_dict

    Get a dictionary of parameters from a template object.

Functions
template_to_dict(template) staticmethod

Get a dictionary of parameters from a template object.

Although Template objects behave much like dictionaries, they return values as objects rather than strings. This function converts the parameter names and values to a dictionary of stripped strings.

Parameters:

  • template (Template) –

    A Template object.

Returns:

  • Dict[str, str]

    A dictionary of the template object.

Source code in de_wiktio\entry.py
@staticmethod
def template_to_dict(template: Template) -> Dict[str, str]:
   """Get a dictionary of parameters from a template object.

   Although `Template` objects behave much like dictionaries, they return values as objects rather than strings. This function converts the parameter names and values to a dictionary of stripped strings.

   Args:
       template: A `Template` object.

   Returns:
       A dictionary of the template object.
   """
   params = {str(p.name).strip():str(p.value).strip() 
             for p in template.params}
   return params
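The conversion can be tried without installing mwparserfromhell by using stand-in objects. In the sketch below, `FakeParam`, `FakeTemplate`, and `params_to_dict` are hypothetical names; the only assumption is that each parameter exposes `name` and `value` attributes that convert to strings, possibly with surrounding whitespace.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class FakeParam:
    """Stand-in for a template parameter (hypothetical)."""
    name: str
    value: str


@dataclass
class FakeTemplate:
    """Stand-in for a template holding a list of parameters (hypothetical)."""
    params: List[FakeParam]


def params_to_dict(template: FakeTemplate) -> Dict[str, str]:
    # Same comprehension as above: stringify, then strip whitespace
    return {str(p.name).strip(): str(p.value).strip() for p in template.params}


tpl = FakeTemplate([FakeParam('Genus ', ' n\n'), FakeParam('1', 'Haus')])
result = params_to_dict(tpl)
```

If a template repeats a parameter name, the later value overwrites the earlier one, as with any dictionary comprehension.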