entry
de_wiktio.entry
This module provides methods to parse wikitext and extract data from Wiktionary pages.
Classes:
- Entry – Entry class for parsing wikitext from main content pages (ns = 0) of the German Wiktionary.
- EntryFlexion – Entry class for parsing wikitext from Flexion pages (ns = 108).
- WordForm – A class representing a word form.
- Tools – Collection of utility functions.
Classes
Entry(title, wikitext, status='OK', extracted_from=None)
Bases: _EntryBase
Entry class for parsing wikitext from main content pages (ns = 0) of the German Wiktionary.
This class deals only with the German section of the page, i.e. the German-to-German dictionary. It therefore does not parse entries for other languages, such as English-to-German or French-to-German.
Parameters:
- title (str) – The title of the Wiktionary page.
- wikitext (str) – The raw wikitext of the page.
- status (str, default: 'OK') – The status of the entry.
- extracted_from (str, default: None) – The source of the extraction.
Methods:
- from_export – Create a class instance by fetching the wiki page online.
- from_dump – Create a class instance by fetching the wikitext from the local dictionary.
- get_wikidict – Load the dictionary from the pickle file.
- print_sections_tree – Print the headings tree.
Attributes:
- WIKIDICT (dict) – Class attribute: Dictionary of title-wikitext pairs.
- title (str) – The title of the Wiktionary page.
- text (str) – The wikitext of the page.
- status (str) – The status of the parsing and extraction.
- extracted_from (str) – The source of the extraction.
- parsed (Wikicode) – The parsed wikitext of the page.
- NS (str) – Class attribute: The namespace of the entry, set to '0'.
- german (Wikicode) – The German section of the page.
- wordforms (List[WordForm]) – List of German word forms.
Source code in de_wiktio\entry.py
Attributes
WIKIDICT: dict
class-attribute
instance-attribute
Class attribute: Dictionary of title-wikitext pairs.
To be accessed when using the from_dump class method.
This is a lazy attribute: the dictionary is loaded the first time it is needed and is then kept in memory as a class attribute, so it is not loaded again each time the from_dump class method creates a new Entry object.
title: str = title
instance-attribute
The title of the Wiktionary page.
text: str = wikitext
instance-attribute
The wikitext of the page.
status: str = status
instance-attribute
The status of the parsing and extraction.
Some values are: 'OK', 'No content for {title} in exported page' or 'No proper wiki namespace found for {title}'.
extracted_from: str = extracted_from
instance-attribute
The source of the extraction.
The possible values are 'from dump', 'from export' or None. A None value indicates that the instance was created directly through the constructor, by passing the title and wikitext of the page.
parsed: Wikicode
property
The parsed wikitext of the page.
The wikitext is parsed using the mwparserfromhell library.
NS: str = '0'
class-attribute
instance-attribute
Class attribute: The namespace of the entry, set to '0'.
german: Wikicode = self._get_section_de()
instance-attribute
The German section of the page.
wordforms: List[WordForm]
property
List of German word forms.
Functions
from_export(title)
classmethod
Create a class instance by fetching the wiki page online.
Parameters:
- title (str) – The title of the Wiktionary page.
Returns:
- Self – An EntryBase instance or an instance of a subclass.
from_dump(title, dict_path=None)
classmethod
Create a class instance by fetching the wikitext from the local dictionary.
Only one dictionary is loaded per session. It is loaded from the pickle file 'wikidict_{cls.NS}.pkl', which is located in dict_path.
Parameters:
- title (str) – The title of the Wiktionary page to fetch.
- dict_path (Optional[str], default: None) – Path to the folder containing the dictionary. If None, the folder indicated in Settings will be used.
Returns:
- Self – An EntryBase instance or an instance of a subclass.
get_wikidict(dict_path=None)
classmethod
Load the dictionary from the pickle file.
If the dictionary is already loaded, it is returned from memory; only one dictionary is loaded per session. Otherwise it is loaded from the pickle file 'wikidict_{cls.NS}.pkl' in dict_path, or in the folder indicated in Settings if dict_path is not provided.
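The load-once behaviour described above can be sketched with the standard library alone. This is not the code of de_wiktio: EntrySketch, DEFAULT_DICT_PATH (a stand-in for the folder taken from Settings) and demo_get_wikidict are hypothetical names used only for illustration, and the sketch ignores how subclasses with a different NS keep separate dictionaries.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Dict, Optional


class EntrySketch:
    """Hypothetical stand-in that mimics the documented caching behaviour."""

    NS = "0"                                   # namespace used in the file name
    WIKIDICT: Optional[Dict[str, str]] = None  # filled on first access
    DEFAULT_DICT_PATH = "."                    # assumption: replaces Settings

    @classmethod
    def get_wikidict(cls, dict_path: Optional[str] = None) -> Dict[str, str]:
        # Serve the dictionary from memory if it was already loaded.
        if cls.WIKIDICT is not None:
            return cls.WIKIDICT
        folder = Path(dict_path if dict_path is not None else cls.DEFAULT_DICT_PATH)
        with open(folder / f"wikidict_{cls.NS}.pkl", "rb") as fh:
            cls.WIKIDICT = pickle.load(fh)
        return cls.WIKIDICT


def demo_get_wikidict() -> Dict[str, str]:
    """Write a tiny pickle file to a temporary folder and load it twice."""
    with tempfile.TemporaryDirectory() as tmp:
        with open(Path(tmp) / "wikidict_0.pkl", "wb") as fh:
            pickle.dump({"Haus": "== Haus ({{Sprache|Deutsch}}) =="}, fh)
        first = EntrySketch.get_wikidict(tmp)
    # The folder no longer exists, so this call can only come from memory.
    second = EntrySketch.get_wikidict()
    assert first is second
    return second
```

Calling demo_get_wikidict() returns the same dictionary object on both lookups, which is the property that makes repeated from_dump calls cheap.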
Parameters:
- dict_path (Optional[str], default: None) – Path to the folder containing the dictionary. If None, the folder indicated in Settings will be used.
Returns:
print_sections_tree(section=None, level=2)
Print the headings tree.
Parameters:
- section (Optional[Wikicode], default: None) – The Wikicode section to start printing from. If not provided, the headings tree of the entire wikitext is printed.
- level (int, default: 2) – The heading level at which to start printing.
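To give an idea of what a headings tree looks like, here is a minimal sketch that works on raw wikitext with a regular expression. It is an illustration only: the documented method operates on Wikicode objects, its exact output format is not shown in this page, and sections_tree_lines and HEADING_RE are hypothetical names.

```python
import re
from typing import List

# A wikitext heading is a line wrapped in the same number of '=' on both sides.
HEADING_RE = re.compile(r"^(={2,6})[ \t]*(.*?)[ \t]*\1[ \t]*$", re.MULTILINE)


def sections_tree_lines(wikitext: str, level: int = 2) -> List[str]:
    """Return one indented line per heading at `level` or deeper."""
    lines = []
    for match in HEADING_RE.finditer(wikitext):
        depth = len(match.group(1))
        if depth >= level:
            # Indent by two spaces for every level below the starting one.
            lines.append("  " * (depth - level) + match.group(2))
    return lines
```

Printing "\n".join(sections_tree_lines(text)) gives a simple indented outline of the page, with level selecting the shallowest heading to include.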
EntryFlexion(title, wikitext, status='OK', extracted_from=None)
Bases: _EntryBase
Entry class for parsing wikitext from Flexion pages (ns = 108).
Flexion pages hold the complete inflection tables for verbs and adjectives. These pages are referred to as Flexionsseiten in the German Wiktionary. They extend the inflection tables of the main content pages.
This class deals only with the German section of the page, i.e. the German-to-German dictionary. It therefore does not parse entries for other languages, such as English-to-German or French-to-German.
Parameters:
- title (str) – The title of the Wiktionary page.
- wikitext (str) – The raw wikitext of the page.
- status (str, default: 'OK') – The status of the entry.
- extracted_from (Optional[str], default: None) – The source of the extraction.
Methods:
- from_export – Create a class instance by fetching the wiki page online.
- from_dump – Create a class instance by fetching the wikitext from the local dictionary.
- get_wikidict – Load the dictionary from the pickle file.
- print_sections_tree – Print the headings tree.
- inflections – Retrieve a list of dictionaries from the inflection templates.
Attributes:
- WIKIDICT (dict) – Class attribute: Dictionary of title-wikitext pairs.
- title (str) – The title of the Wiktionary page.
- text (str) – The wikitext of the page.
- status (str) – The status of the parsing and extraction.
- extracted_from (str) – The source of the extraction.
- parsed (Wikicode) – The parsed wikitext of the page.
- NS – Class attribute: The namespace of the entry, set to '108'.
- german (Wikicode) – The German section of the page.
- pos (List[str]) – List of German parts of speech (POS).
- flexion_tpls (List[Template]) – List of German flexion templates.
Attributes
WIKIDICT: dict
class-attribute
instance-attribute
Class attribute: Dictionary of title-wikitext pairs.
To be accessed when using the from_dump class method.
This is a lazy attribute: the dictionary is loaded the first time it is needed and is then kept in memory as a class attribute, so it is not loaded again each time the from_dump class method creates a new Entry object.
title: str = title
instance-attribute
The title of the Wiktionary page.
text: str = wikitext
instance-attribute
The wikitext of the page.
status: str = status
instance-attribute
The status of the parsing and extraction.
Some values are: 'OK', 'No content for {title} in exported page' or 'No proper wiki namespace found for {title}'.
extracted_from: str = extracted_from
instance-attribute
The source of the extraction.
The possible values are 'from dump', 'from export' or None. A None value indicates that the instance was created directly through the constructor, by passing the title and wikitext of the page.
parsed: Wikicode
property
The parsed wikitext of the page.
The wikitext is parsed using the mwparserfromhell library.
NS = '108'
class-attribute
instance-attribute
Class attribute: The namespace of the entry, set to '108'.
german: Wikicode = self._get_section_de()
instance-attribute
The German section of the page.
pos: List[str]
property
List of German parts of speech (POS).
The POS are extracted from the names of the flexion templates in the body.
The possible values are "Adjektiv", "Verb", "Adverb", "Gerundivum", or "Numerale".
flexion_tpls: List[Template]
property
List of German flexion templates.
Templates are extracted from the body of the page.
Functions
from_export(title)
classmethod
Create a class instance by fetching the wiki page online.
Parameters:
- title (str) – The title of the Wiktionary page.
Returns:
- Self – An EntryBase instance or an instance of a subclass.
from_dump(title, dict_path=None)
classmethod
Create a class instance by fetching the wikitext from the local dictionary.
Only one dictionary is loaded per session. It is loaded from the pickle file 'wikidict_{cls.NS}.pkl', which is located in dict_path.
Parameters:
- title (str) – The title of the Wiktionary page to fetch.
- dict_path (Optional[str], default: None) – Path to the folder containing the dictionary. If None, the folder indicated in Settings will be used.
Returns:
- Self – An EntryBase instance or an instance of a subclass.
get_wikidict(dict_path=None)
classmethod
Load the dictionary from the pickle file.
If the dictionary is already loaded, it is returned from memory; only one dictionary is loaded per session. Otherwise it is loaded from the pickle file 'wikidict_{cls.NS}.pkl' in dict_path, or in the folder indicated in Settings if dict_path is not provided.
Parameters:
- dict_path (Optional[str], default: None) – Path to the folder containing the dictionary. If None, the folder indicated in Settings will be used.
Returns:
print_sections_tree(section=None, level=2)
Print the headings tree.
Parameters:
- section (Optional[Wikicode], default: None) – The Wikicode section to start printing from. If not provided, the headings tree of the entire wikitext is printed.
- level (int, default: 2) – The heading level at which to start printing.
inflections()
Retrieve a list of dictionaries from the inflection templates.
Returns:
- List[Dict[str, str]] – A list of dictionaries where each dictionary represents an inflection template.
WordForm(wordform, entry=None)
A class representing a word form.
Future work
- Add translations
Parameters:
- wordform (Wikicode) – A Wikicode object containing the word form.
Methods:
- inflections – Retrieve a list of inflections of the word form from the main content page (ns = 0).
- inflections_extended – List of dictionaries with the inflection templates from the Flexion pages (ns = 108).
- other_content_extract – Extract other content such as Bedeutungen, Beispiele, Synonyme, or Sprichwörter.
Attributes:
- wordform (Wikicode) – A Wikicode object containing the word form.
- status (str) – The status of the word form.
- entry (Entry) – The Entry object to which the word form belongs.
- heading (Heading) – The heading of the word form.
- pos (List[str]) – List of parts of speech (POS), or an empty list if no POS are found.
- wortart_tpls (List[Template]) – List of Wortart template objects. Returns an empty list if no Wortart template is found.
- übersichten_tpls (List[Template]) – List of templates generating inflection tables (Flexionstabellen) in the main content page.
Attributes
wordform: Wikicode = wordform
instance-attribute
A Wikicode object containing the word form.
status: str = 'OK'
instance-attribute
The status of the word form.
entry: Entry = entry
instance-attribute
The Entry object to which the word form belongs.
heading: Heading
property
The heading of the word form.
pos: List[str]
property
List of parts of speech (POS), or an empty list if no POS are found.
The POS are extracted from the Wortart templates of the word form.
wortart_tpls: List[Template]
property
List of Wortart template objects. Returns an empty list if no Wortart template is found.
Note: In principle, one would expect only one Wortart template per word form, but in practice there can be more than one.
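As an illustration of how POS values relate to Wortart templates, the sketch below pulls the first argument out of every Wortart template in a heading string. It assumes headings of the form `=== {{Wortart|Substantiv|Deutsch}}, {{n}} ===` and uses a regular expression rather than the mwparserfromhell objects the class works with; pos_from_heading and WORTART_RE are hypothetical names.

```python
import re
from typing import List

# Capture the first argument of each {{Wortart|<POS>|<language>}} template.
WORTART_RE = re.compile(r"\{\{\s*Wortart\s*\|\s*([^|{}]+?)\s*(?:\|[^{}]*)?\}\}")


def pos_from_heading(heading: str) -> List[str]:
    """Return the POS named by each Wortart template, in order of appearance."""
    return WORTART_RE.findall(heading)
```

A heading that carries two Wortart templates yields two POS values, and a heading with none yields an empty list, which matches the behaviour described for pos and wortart_tpls.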
übersichten_tpls: List[Template]
property
List of templates generating inflection tables (Flexionstabellen) in the main content page.
These tables provide a brief overview (Übersicht) of the word form's inflections.
Full inflection tables can be found in the corresponding Flexionsseiten (Flexion wiki namespace).
Note: Most word forms have no Übersicht template or exactly one, but some have more than one, such as 'Mars' and 'Partikel'.
Functions
inflections(all=False)
Retrieve a list of inflections of the word form from the main content page (ns = 0).
The inflections are extracted from the Übersicht templates of the main content page.
Parameters:
- all (bool, default: False) – If True, all keys of the inflection templates are returned; otherwise, keys relating to the image ('Bild') are removed.
Returns:
- List[Dict[str, str]] – A list of dictionaries, where each dictionary represents an inflection table. The keys of the dictionaries are the parameter names of the Übersicht template, and the values are the corresponding parameter values.
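The effect of the all flag can be pictured with a small dictionary filter. This is a sketch under an assumption: image-related keys are taken to be those whose name starts with 'Bild', which may not be the exact rule the library applies. drop_image_keys and the sample parameters are hypothetical.

```python
from typing import Dict


def drop_image_keys(params: Dict[str, str], keep_all: bool = False) -> Dict[str, str]:
    """Return a copy of `params`, without image-related keys unless keep_all is set."""
    if keep_all:
        return dict(params)
    # Assumption: image parameters are recognised by the 'Bild' prefix.
    return {key: value for key, value in params.items() if not key.startswith("Bild")}


sample = {
    "Genus": "n",
    "Nominativ Singular": "Haus",
    "Nominativ Plural": "Häuser",
    "Bild 1": "Haus.jpg",
    "Bildbeschreibung 1": "ein Haus",
}
```

With the default setting, drop_image_keys(sample) keeps only the three grammatical entries; passing keep_all=True returns every key unchanged.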
inflections_extended()
List of dictionaries with the inflection templates from the Flexion pages (ns = 108).
In general, these are small dictionaries that provide additional information for the construction of extended inflection tables.
Returns:
- List[Dict[str, str]] – A list of dictionaries, or an empty list if no inflection templates are found or the word form is not a verb or adjective.
other_content_extract(name, strip_code=True, strip_kw=None)
Extract other content such as Bedeutungen, Beispiele, Synonyme, or Sprichwörter.
Extracts other types of content that are located in separate paragraphs within the word form section of the page. The first line of such a paragraph contains only a template without parameters, whose name is the type of content to extract. The content is extracted from the second line until the end of the paragraph.
Parameters:
- name (str) – The name of the template to extract content from, e.g. "Bedeutungen", "Beispiele", "Synonyme", "Sprichwörter", or any other template name that follows the same pattern.
- strip_code (bool, default: True) – Whether to strip wikitext code from the extracted content and return plain text.
- strip_kw (Optional[Dict[str, str]], default: None) – A dictionary of keyword arguments to pass to the strip_code method of mwparserfromhell.nodes.Wikicode objects.
Returns:
- str – The extracted content, either as plain text or raw wikitext.
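The paragraph rule described above can be sketched on plain strings. The sketch assumes that a paragraph ends at the first blank line and returns raw wikitext only, so it corresponds roughly to strip_code=False. It is not the library's implementation, and extract_block is a hypothetical name.

```python
def extract_block(section: str, name: str) -> str:
    """Return the lines after a bare {{name}} line, up to the next blank line."""
    marker = "{{" + name + "}}"
    collected = []
    inside = False
    for line in section.splitlines():
        if not inside:
            # The paragraph starts at a line holding only the bare template.
            inside = line.strip() == marker
        elif not line.strip():
            break  # assumption: a blank line closes the paragraph
        else:
            collected.append(line)
    return "\n".join(collected)
```

A name that does not occur in the section yields an empty string.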
Tools
Collection of utility functions.
Methods:
- template_to_dict – Get a dictionary of parameters from a template object.
Functions
template_to_dict(template)
staticmethod
Get a dictionary of parameters from a template object.
Although template objects offer much of the functionality of dictionaries, they return values as objects rather than strings. This function converts these objects to a dictionary of strings.
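The conversion can be sketched as a dictionary comprehension over the template's parameters. In mwparserfromhell, a Template exposes a params list whose items have name and value attributes that turn into text via str(). The stand-in classes below (Node, FakeParam, FakeTemplate) are hypothetical and exist only so the sketch runs without the library, and stripping surrounding whitespace is an assumption rather than documented behaviour.

```python
from dataclasses import dataclass
from typing import Dict, List


class Node:
    """Stand-in for a parsed node: not a string, but convertible with str()."""

    def __init__(self, text: str) -> None:
        self.text = text

    def __str__(self) -> str:
        return self.text


@dataclass
class FakeParam:
    name: Node
    value: Node


@dataclass
class FakeTemplate:
    params: List[FakeParam]


def template_to_dict_sketch(template) -> Dict[str, str]:
    """Map each parameter name to its value, both converted to plain strings."""
    return {str(p.name).strip(): str(p.value).strip() for p in template.params}
```

Because the function relies only on params, name and value, it should accept a real Template object as well as the stand-ins used here.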
Parameters:
- template (Template) – A Template object.
Returns: