Usage

Extracting German Wiktionary Data

There are two ways to fetch, parse, and extract wikitext content:

  • By fetching content online:
from de_wiktio.entry import Entry

entry = Entry.from_export('stark')
  • Or from a local copy on your machine. This requires preprocessing a German Wiktionary dump file and storing its contents locally in a dictionary. See "Working with dump files" below for instructions.
entry = Entry.from_dump('stark')

Both methods return an Entry object, from which:

  • you can access the raw wikitext from the page.
print('type = ',type(entry),'\n') 
print(entry.text[:500])
type =  <class 'de_wiktio.entry.Entry'> 

{{Siehe auch|[[stærk]], [[stärk]]}}
== stark ({{Sprache|Deutsch}}) ==
=== {{Wortart|Adjektiv|Deutsch}} ===

{{Deutsch Adjektiv Übersicht
|Positiv=stark
|Komparativ=stärker
|Superlativ=stärksten
|Bild 1=Weight lifting black and white.jpg|mini|1|eine ''starke'' [[Frau]] beim [[Gewichtheben]]
|Bild 2=Agonis Flexuosa - bark.jpg|mini|3|ein ''starker'' [[Baumstamm]]
|Bild 3=Snow pile (3123493946).jpg|mini|4|Es hat ''stark'' [[schneien|geschneit]].
}}

{{Worttrennung}}
:stark, {{Komp.}} stär·ker, {{Sup
  • explore the headings tree.
# For the whole page:
entry.print_sections_tree()
2  stark ({{Sprache|Deutsch}})
    3  {{Wortart|Adjektiv|Deutsch}}
        4  {{Übersetzungen}}
2  stark ({{Sprache|Englisch}})
    3  {{Wortart|Adjektiv|Englisch}}
        4  {{Übersetzungen}}
    3  {{Wortart|Adverb|Englisch}}
        4  {{Übersetzungen}}
2  stark ({{Sprache|Schwedisch}})
    3  {{Wortart|Adjektiv|Schwedisch}}
        4  {{Übersetzungen}}
2  stark ({{Sprache|Deutsch}})
    3  {{Wortart|Adjektiv|Deutsch}}
        4  {{Übersetzungen}}
# For the German section:
entry.print_sections_tree(section=entry.german) 
2  stark ({{Sprache|Deutsch}})
    3  {{Wortart|Adjektiv|Deutsch}}
        4  {{Übersetzungen}}
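
To make the tree output above easier to read, here is a minimal sketch of how such a headings tree can be derived from raw wikitext with the standard library alone. It is not the implementation behind print_sections_tree; the names HEADING_RE, headings, and format_tree are hypothetical, and the sample text is a shortened stand-in for a real page.

```python
import re

# Matches MediaWiki heading lines such as "== Title ==" or "=== Title ===".
# The number of "=" signs on each side is the heading level.
HEADING_RE = re.compile(r"^(={2,6})[ \t]*(.+?)[ \t]*\1[ \t]*$", re.MULTILINE)


def headings(wikitext):
    """Return (level, title) pairs for every heading line, in page order."""
    return [(len(m.group(1)), m.group(2)) for m in HEADING_RE.finditer(wikitext)]


def format_tree(wikitext, indent=4):
    """Render the headings as indented lines, one indent step per level below 2."""
    lines = []
    for level, title in headings(wikitext):
        lines.append(" " * indent * (level - 2) + f"{level}  {title}")
    return "\n".join(lines)


# Shortened sample wikitext (an assumption, not a full page).
sample = """== stark ({{Sprache|Deutsch}}) ==
=== {{Wortart|Adjektiv|Deutsch}} ===
Some body text.
==== {{Übersetzungen}} ====
"""

print(format_tree(sample))
```

The level number printed in front of each heading is simply the count of "=" signs, which is why language sections appear at level 2 and part-of-speech sections at level 3.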

Entry objects extract additional information from the German section:

  • The list of German word forms, available as entry.wordforms, which is a list of WordForm objects.

print(len(entry.wordforms))
1

From a WordForm object, you can extract:

  • The part of speech.
wordform = entry.wordforms[0]
print(f'{wordform.pos = }')
wordform.pos = ['Adjektiv']
  • Word inflections for nouns, verbs, adjectives, and adverbs.
wordform = entry.wordforms[0]
inflections = wordform.inflections()
for flexion in inflections:
    for k,v in flexion.items():
        print(f'{k} = {v}')
    print()
Positiv = stark
Komparativ = stärker
Superlativ = stärksten
  • And other content, such as:
    'Bedeutungen' (meanings), 'Beispiele' (examples), 'Synonyme' (synonyms), 'Sprichwörter' (proverbs), among others.
for content_type in ['Bedeutungen', 'Beispiele', 'Synonyme', 'Sprichwörter']:
    print(content_type.center(20, '-'))
    content = wordform.other_content_extract(content_type)
    print(content[:150], '\n')
----Bedeutungen-----
[1] mit Kraft ausgestattet, von Kraft geprägt, zeugend
[2] hohe Leistung erbringend; sehr leistungsfähig
[3] äußeren Einflüssen, Belastungen standhalt 

-----Beispiele------
[1] Er hat viele Muskeln – er ist stark.
[1] Es weht ein starker Wind.
[1] Ein starker Mann kann schwere Sachen tragen.
[1] „Indiz für den verbreitete 

------Synonyme------
[1] kräftig, kraftvoll
[2] effizient, leistungsfähig, leistungsstark, wirksam
[3] belastbar, dick, fest, robust, stabil, widerstandsfähig
[4] ausgeprä 

----Sprichwörter----
Was dich nicht umbringt, macht dich stärker
Der Starke ist am mächtigsten allein.Friedrich Schiller: Wilhelm Tell (1804) 
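
In the raw wikitext, blocks such as Bedeutungen or Synonyme appear as a marker line like {{Synonyme}} followed by lines that start with a colon (the {{Worttrennung}} block in the raw text shown earlier has the same shape). The sketch below shows one way such a block could be pulled out by hand. It is an illustration only, not how other_content_extract works internally: extract_block is a hypothetical helper, the sample is shortened, and no wiki markup is stripped.

```python
def extract_block(wikitext, name):
    """Return the colon-prefixed lines that follow the {{name}} marker line."""
    lines = wikitext.splitlines()
    try:
        start = lines.index("{{" + name + "}}") + 1
    except ValueError:
        return ""  # marker not present in this text
    collected = []
    for line in lines[start:]:
        if not line.startswith(":"):
            break  # the block ends at the first line without a leading ":"
        collected.append(line.lstrip(":").strip())
    return "\n".join(collected)


# Shortened sample wikitext (an assumption, not a full page).
sample = """{{Bedeutungen}}
:[1] mit Kraft ausgestattet
:[2] hohe Leistung erbringend

{{Synonyme}}
:[1] kräftig, kraftvoll
"""

print(extract_block(sample, "Synonyme"))
```

Real pages also contain links and nested templates inside these lines, so a hand-rolled approach like this needs extra cleanup that the sketch leaves out.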

Working with dump files

To work with a dump file, you need to create a dictionary that maps page titles to their wikitext. To do this:

  1. Download and decompress the Wiktionary dump file.
    • You can download the latest version here or refer to instructions for downloading specific versions in this Hands-on Guide.
  2. Specify the path to the decompressed file in XML_FILE.
  3. Specify the folder where the dictionary should be saved in DICT_PATH.

# Specify your own paths
XML_FILE = r'path\to\xml\dewiktionary-20241020-pages-articles-multistream.xml'
DICT_PATH = r'path\to\dict'
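
If you prefer to decompress the dump from Python rather than with an external tool, the following sketch uses only the standard library. It assumes the downloaded file is bz2-compressed (a .bz2 suffix on the file name above); decompress_bz2 is a hypothetical helper, not part of de_wiktio.

```python
import bz2
import shutil
from pathlib import Path


def decompress_bz2(src, dst, chunk_size=1024 * 1024):
    """Stream-decompress the .bz2 file at src into dst and return dst as a Path."""
    src, dst = Path(src), Path(dst)
    # Copy in chunks so the whole dump never has to fit in memory at once.
    with bz2.open(src, "rb") as compressed, open(dst, "wb") as out:
        shutil.copyfileobj(compressed, out, chunk_size)
    return dst


# Example call (assumes the compressed file sits next to the target path):
# decompress_bz2(XML_FILE + ".bz2", XML_FILE)
```

Streaming matters here because the decompressed dump is much larger than the download.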

The easiest way to get started is to set the path to the dictionary folder in Settings.

  • This allows you to use the dump file data without repeatedly specifying the folder path.
from de_wiktio.settings import Settings

Settings.set(key='DICT_PATH', value=DICT_PATH)

The code below loads and parses the XML dump file, then creates the dictionaries and saves them as pickle files in the specified folder.

To use the Entry.from_dump method, you need to create two dictionaries:

  • one for the main content namespace (ns = '0')
  • another for the Flexion namespace (ns = '108')

Grab a cup of coffee and wait: this might take a few minutes (between 4 and 5 minutes on my computer).

from de_wiktio.fetch import WikiDump

dump = WikiDump(XML_FILE)
_ = dump.create_dict_by_ns(ns='0')
_ = dump.create_dict_by_ns(ns='108')
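
For intuition about what this step produces, here is a rough standard-library sketch that builds a title-to-wikitext dictionary for one namespace from MediaWiki export XML. It is not the code behind create_dict_by_ns: build_dict_by_ns and local are hypothetical names, and the sample XML is a tiny made-up stand-in for the real dump.

```python
import io
import pickle
import xml.etree.ElementTree as ET


def local(tag):
    """Strip the '{namespace}' prefix that ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


def build_dict_by_ns(xml_source, ns):
    """Map page title -> wikitext for all pages whose <ns> value equals ns."""
    pages = {}
    for _, elem in ET.iterparse(xml_source, events=("end",)):
        if local(elem.tag) != "page":
            continue
        fields = {local(child.tag): child for child in elem}
        if fields["ns"].text == ns:
            text = next((e for e in fields["revision"] if local(e.tag) == "text"), None)
            pages[fields["title"].text] = (text.text or "") if text is not None else ""
        elem.clear()  # drop the children of pages that were already processed
    return pages


# Made-up sample in the shape of a MediaWiki export file.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.11/">
  <page><title>stark</title><ns>0</ns>
    <revision><text>== stark ==</text></revision></page>
  <page><title>Flexion:stark</title><ns>108</ns>
    <revision><text>== stark (Deklination) ==</text></revision></page>
</mediawiki>"""

main_pages = build_dict_by_ns(io.BytesIO(SAMPLE), ns="0")
print(main_pages)  # {'stark': '== stark =='}

# A dictionary like this can be pickled and restored unchanged.
restored = pickle.loads(pickle.dumps(main_pages))
print(restored == main_pages)  # True
```

Parsing incrementally with iterparse keeps memory use bounded, which is the main design concern when the input is a multi-gigabyte dump.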

You are now ready to work with Entry objects using the from_dump class method.

  • The first Entry created during a session loads the dictionary from disk, so it takes longer (around 9 to 11 seconds on my computer).
  • From the second Entry onwards, Entry.from_dump reads the dictionary from memory, which makes it faster than both the first call and fetching the content online with from_export.
from de_wiktio.entry import Entry
# First entry  
entry = Entry.from_dump('stark')
print(type(entry))

# Second entry
entry = Entry.from_dump('hoch')
print(type(entry))
<class 'de_wiktio.entry.Entry'>
<class 'de_wiktio.entry.Entry'>
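
The load-once behaviour described above is a common caching pattern. The sketch below illustrates it with functools.lru_cache and a small pickle file; it only shows the general idea and makes no claim about how de_wiktio caches its dictionaries. load_dict and demo.pkl are hypothetical names.

```python
import pickle
import tempfile
from functools import lru_cache
from pathlib import Path


@lru_cache(maxsize=None)
def load_dict(path):
    """Unpickle the dictionary at path; later calls with the same path reuse the result."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "demo.pkl"  # hypothetical file name
    path.write_bytes(pickle.dumps({"stark": "== stark =="}))
    first = load_dict(path)   # reads and unpickles the file
    second = load_dict(path)  # served from the in-memory cache

print(first is second)  # True
```

Only the first call pays the cost of reading and unpickling; every later lookup is a plain dictionary access in memory.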