Usage
Extracting German Wiktionary Data
There are two ways to fetch, parse, and extract wikitext content:
- By fetching content online:
from de_wiktio.entry import Entry
entry = Entry.from_export('stark')
- Or from your machine. This requires preprocessing the dump files from the German Wiktionary and storing the information locally in a dictionary. See "Working with dump files" below for instructions.
entry = Entry.from_dump('stark')
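For orientation, here is a minimal sketch of the kind of URL an online fetch could target. It assumes the public Special:Export endpoint of de.wiktionary.org, which the name from_export hints at but which is not confirmed here; export_url is a hypothetical helper, not part of de_wiktio.

```python
from urllib.parse import quote

# Assumption: the public Special:Export endpoint of the German Wiktionary.
EXPORT_BASE = "https://de.wiktionary.org/wiki/Special:Export/"


def export_url(title: str) -> str:
    """Build a Special:Export URL for a page title (hypothetical helper)."""
    # MediaWiki URLs use underscores for spaces; other characters are percent-encoded.
    return EXPORT_BASE + quote(title.replace(" ", "_"), safe="")


print(export_url("stark"))
print(export_url("stärk"))
```

The response from such a URL is an XML document in the same export format as the dump files described further down.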
Both methods return an Entry object. From it, you can:
- access the raw wikitext of the page.
print('type = ',type(entry),'\n')
print(entry.text[:500])
type = <class 'de_wiktio.entry.Entry'>
{{Siehe auch|[[stærk]], [[stärk]]}}
== stark ({{Sprache|Deutsch}}) ==
=== {{Wortart|Adjektiv|Deutsch}} ===
{{Deutsch Adjektiv Übersicht
|Positiv=stark
|Komparativ=stärker
|Superlativ=stärksten
|Bild 1=Weight lifting black and white.jpg|mini|1|eine ''starke'' [[Frau]] beim [[Gewichtheben]]
|Bild 2=Agonis Flexuosa - bark.jpg|mini|3|ein ''starker'' [[Baumstamm]]
|Bild 3=Snow pile (3123493946).jpg|mini|4|Es hat ''stark'' [[schneien|geschneit]].
}}
{{Worttrennung}}
:stark, {{Komp.}} stär·ker, {{Sup
- explore the headings tree.
# For the whole page:
entry.print_sections_tree()
2 stark ({{Sprache|Deutsch}})
3 {{Wortart|Adjektiv|Deutsch}}
4 {{Übersetzungen}}
2 stark ({{Sprache|Englisch}})
3 {{Wortart|Adjektiv|Englisch}}
4 {{Übersetzungen}}
3 {{Wortart|Adverb|Englisch}}
4 {{Übersetzungen}}
2 stark ({{Sprache|Schwedisch}})
3 {{Wortart|Adjektiv|Schwedisch}}
4 {{Übersetzungen}}
2 stark ({{Sprache|Deutsch}})
3 {{Wortart|Adjektiv|Deutsch}}
4 {{Übersetzungen}}
# For the German section:
entry.print_sections_tree(section=entry.german)
2 stark ({{Sprache|Deutsch}})
3 {{Wortart|Adjektiv|Deutsch}}
4 {{Übersetzungen}}
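The numbers in the tree are the heading levels, that is, the count of = signs around each wikitext heading. The sketch below shows how such a listing can be derived from raw wikitext with a regular expression. It is only an illustration under that assumption, not how de_wiktio builds its tree; heading_tree and HEADING_RE are hypothetical names.

```python
import re

# Matches a wikitext heading line such as "== stark ({{Sprache|Deutsch}}) ==".
# The backreference \1 requires the same number of "=" on both sides.
HEADING_RE = re.compile(r"^(={2,6})\s*(.*?)\s*\1\s*$")


def heading_tree(wikitext: str) -> list[tuple[int, str]]:
    """Return (level, title) pairs for every heading line in the wikitext."""
    headings = []
    for line in wikitext.splitlines():
        match = HEADING_RE.match(line)
        if match:
            headings.append((len(match.group(1)), match.group(2)))
    return headings


sample = """\
== stark ({{Sprache|Deutsch}}) ==
=== {{Wortart|Adjektiv|Deutsch}} ===
{{Worttrennung}}
:stark
==== {{Übersetzungen}} ====
"""

for level, title in heading_tree(sample):
    print(level, title)
```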
Entry objects extract additional information from the German section:
- The list of German word forms, using entry.wordforms, which returns a list of WordForm objects.
print(len(entry.wordforms))
1
From a WordForm object, you can extract:
- The part of speech
wordform = entry.wordforms[0]
print(f'{wordform.pos = }')
wordform.pos = ['Adjektiv']
- Word inflections for nouns, verbs, adjectives, and adverbs.
wordform = entry.wordforms[0]
inflections = wordform.inflections()
for flexion in inflections:
    for k, v in flexion.items():
        print(f'{k} = {v}')
    print()
Positiv = stark
Komparativ = stärker
Superlativ = stärksten
- Other content, such as 'Bedeutungen' (meanings), 'Beispiele' (examples), 'Synonyme' (synonyms), and 'Sprichwörter' (proverbs), among others.
for content_type in ['Bedeutungen', 'Beispiele', 'Synonyme', 'Sprichwörter']:
    print(content_type.center(20, '-'))
    content = wordform.other_content_extract(content_type)
    print(content[:150], '\n')
----Bedeutungen-----
[1] mit Kraft ausgestattet, von Kraft geprägt, zeugend
[2] hohe Leistung erbringend; sehr leistungsfähig
[3] äußeren Einflüssen, Belastungen standhalt
-----Beispiele------
[1] Er hat viele Muskeln – er ist stark.
[1] Es weht ein starker Wind.
[1] Ein starker Mann kann schwere Sachen tragen.
[1] „Indiz für den verbreitete
------Synonyme------
[1] kräftig, kraftvoll
[2] effizient, leistungsfähig, leistungsstark, wirksam
[3] belastbar, dick, fest, robust, stabil, widerstandsfähig
[4] ausgeprä
----Sprichwörter----
Was dich nicht umbringt, macht dich stärker
Der Starke ist am mächtigsten allein.Friedrich Schiller: Wilhelm Tell (1804)
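In the raw wikitext, each of these content types appears as a marker template such as {{Bedeutungen}} followed by lines that start with a colon. The sketch below pulls one such block out of a wikitext string. It assumes that a block ends at the first blank line or at the next template marker, and it does not strip links or markup the way the output above suggests other_content_extract does; extract_block is a hypothetical helper, not the library's implementation.

```python
def extract_block(wikitext: str, name: str) -> str:
    """Return the lines that follow the {{name}} marker, one entry per line."""
    marker = "{{" + name + "}}"
    collected = []
    inside = False
    for line in wikitext.splitlines():
        stripped = line.strip()
        if stripped == marker:
            inside = True
            continue
        if inside:
            # Assumption: a blank line or the next template marker ends the block.
            if not stripped or stripped.startswith("{{"):
                break
            collected.append(stripped.lstrip(":"))
    return "\n".join(collected)


sample = """\
{{Bedeutungen}}
:[1] mit Kraft ausgestattet
:[2] hohe Leistung erbringend

{{Synonyme}}
:[1] kräftig, kraftvoll
"""

print(extract_block(sample, "Bedeutungen"))
print(extract_block(sample, "Synonyme"))
```

A content type that is missing from the page yields an empty string in this sketch.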
Working with dump files
To work with a dump file, you need to create a dictionary of page title and wikitext pairs. To do this:
- Download and decompress the Wiktionary dump file.
- You can download the latest version here or refer to instructions for downloading specific versions in this Hands-on Guide.
- Specify the path to the decompressed file in XML_FILE.
- Specify the folder where the dictionary should be saved in DICT_PATH.
# Specify your own paths
XML_FILE = r'path\to\xml\dewiktionary-20241020-pages-articles-multistream.xml'
DICT_PATH = r'path\to\dict'
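If you prefer to decompress the downloaded .bz2 archive from Python rather than with an external tool, a minimal sketch using only the standard library could look like this. The function name decompress_bz2 and the example file names are assumptions for illustration.

```python
import bz2
import shutil
from pathlib import Path


def decompress_bz2(src: Path, dst: Path, chunk_size: int = 1024 * 1024) -> Path:
    """Stream-decompress a .bz2 file to dst without loading it all into memory."""
    with bz2.open(src, "rb") as compressed, open(dst, "wb") as out:
        shutil.copyfileobj(compressed, out, chunk_size)
    return dst


# Example call (hypothetical file names):
# decompress_bz2(Path("dump.xml.bz2"), Path("dump.xml"))
```

Copying in chunks matters here because the decompressed dump is far larger than the archive.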
The easiest way to get started is to set the path to the dictionary folder in Settings.
- This allows you to use the dump file data without repeatedly specifying the folder path.
from de_wiktio.settings import Settings
Settings.set(key='DICT_PATH', value=DICT_PATH)
The following code loads and parses the XML dump file, then saves the resulting dictionaries as pickle files in the specified folder.
To use the Entry.from_dump method, you need to create two dictionaries:
- one for the main content namespace (ns = '0')
- another for the Flexion namespace (ns = '108')
Grab a cup of coffee and wait: it might take a few minutes (between 4 and 5 minutes on my computer).
from de_wiktio.fetch import WikiDump
dump = WikiDump(XML_FILE)
_ = dump.create_dict_by_ns(ns='0')
_ = dump.create_dict_by_ns(ns='108')
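To give an idea of what building a per-namespace dictionary involves, here is a rough sketch that streams a MediaWiki export file with the standard library and keeps only the pages of one namespace. It is not the code behind create_dict_by_ns; pages_by_ns, local_name and the tiny SAMPLE document are stand-ins for illustration.

```python
import io
import pickle
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Drop the '{namespace-uri}' prefix that ElementTree puts on tags."""
    return tag.rsplit("}", 1)[-1]


def pages_by_ns(xml_source, ns: str) -> dict[str, str]:
    """Stream an export-format XML source and map title -> wikitext for one namespace."""
    pages = {}
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        fields = {local_name(child.tag): child for child in elem}
        if fields["ns"].text == ns:
            # The wikitext lives in <revision><text>.
            text = next(
                (c.text for c in fields["revision"] if local_name(c.tag) == "text"),
                None,
            )
            pages[fields["title"].text] = text or ""
        elem.clear()  # release the page's children; real dumps are very large
    return pages


# Tiny stand-in for a dump file, with one page in each namespace.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.11/">
  <page><title>stark</title><ns>0</ns><revision><text>== stark ==</text></revision></page>
  <page><title>Flexion:stark</title><ns>108</ns><revision><text>== stark (Deklination) ==</text></revision></page>
</mediawiki>"""

main = pages_by_ns(io.BytesIO(SAMPLE), ns="0")
flexion = pages_by_ns(io.BytesIO(SAMPLE), ns="108")
print(main)
print(flexion)

# The dictionaries are plain Python objects, so they survive a pickle round trip.
restored = pickle.loads(pickle.dumps(main))
```

Parsing incrementally with iterparse keeps memory use roughly proportional to the pages you keep rather than to the whole file.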
You are now ready to work with Entry objects using the from_dump class method.
- The first Entry created during the session loads the dictionary, so it takes longer (around 9 to 11 seconds on my computer).
- From the second Entry onwards, Entry.from_dump accesses the dictionary from memory, which is faster than the first entry creation and also faster than fetching the content online using from_export.
from de_wiktio.entry import Entry
# First entry
entry = Entry.from_dump('stark')
print(type(entry))
# Second entry
entry = Entry.from_dump('hoch')
print(type(entry))
<class 'de_wiktio.entry.Entry'>
<class 'de_wiktio.entry.Entry'>
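The difference between the first and later calls is the usual load-once pattern: the pickle file is read on first use and the resulting dictionary stays in memory. A minimal sketch of that pattern with functools.lru_cache follows. It is an assumption about the general idea, not a description of de_wiktio's internals; load_dict, wikitext_for and the demo file are hypothetical.

```python
import pickle
import tempfile
from functools import lru_cache
from pathlib import Path


@lru_cache(maxsize=None)
def load_dict(path: str) -> dict:
    """Unpickle the dictionary once; later calls with the same path reuse it."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


def wikitext_for(title: str, path: str) -> str:
    """Look up a page's wikitext in the (cached) dictionary."""
    return load_dict(path)[title]


# Demo with a tiny stand-in dictionary; a real one holds every page of the dump.
demo_path = Path(tempfile.mkdtemp()) / "demo_dict.pkl"
demo_path.write_bytes(pickle.dumps({"stark": "== stark ==", "hoch": "== hoch =="}))

print(wikitext_for("stark", str(demo_path)))  # first call reads the file
print(wikitext_for("hoch", str(demo_path)))   # second call reuses the cached dict
print(load_dict.cache_info())
```

With a dictionary the size of a full dump, the one-time unpickling dominates the cost, which matches the timings described above.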