From Special Export

Importing Packages

Python
import requests # to fetch info from URLs
import lxml.etree as ET # to parse XML documents

We will use the fetch function as described in our earlier tutorial on Special Exports, provided here for reference.

Python
def fetch(title):
    url = f'https://de.wiktionary.org/wiki/Spezial:Exportieren/{title}'
    resp = requests.get(url)
    resp.raise_for_status()
    return resp.text

Let us fetch the XML content for the page titled 'schön' as an example.

Python
xml_content = fetch('schön')
print(xml_content[:500])
print(f'{type(xml_content) = }')

XML
<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.11/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.11/ http://www.mediawiki.org/xml/export-0.11.xsd" version="0.11" xml:lang="de">
  <siteinfo>
    <sitename>Wiktionary</sitename>
    <dbname>dewiktionary</dbname>
    <base>https://de.wiktionary.org/wiki/Wiktionary:Hauptseite</base>
    <generator>MediaWiki 1.44.0-wmf.22</generator>
    <case>case-sensitive</case>
    <namesp
type(xml_content) = <class 'str'>

Parsing the XML content

Now that we have retrieved the XML content, we will use lxml.etree to parse it.

fetch returns an XML string, so we will parse it with the fromstring function. Later we will use the parse function to read XML from a file.

Python
# Parse the XML content into an ET Element
root = ET.fromstring(xml_content)

print(type(root)) # Output: <class 'lxml.etree._Element'>

Python Console Session
<class 'lxml.etree._Element'>
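
As a small aside on the two entry points mentioned above, the following sketch parses the same made-up document once with fromstring and once with parse. The sample bytes and variable names are illustrative assumptions, not part of the Wiktionary export. Note that parse returns an ElementTree rather than an Element, so getroot() is needed to reach the root.

```python
import io
import lxml.etree as ET

# A made-up document. Bytes are used because lxml refuses str input
# that carries an encoding declaration.
sample_bytes = b'<?xml version="1.0" encoding="UTF-8"?><entry><word>Haus</word></entry>'

# fromstring: parse in-memory data and get the root Element directly
root_from_bytes = ET.fromstring(sample_bytes)

# parse: read from a file name or file-like object and get an ElementTree
tree = ET.parse(io.BytesIO(sample_bytes))
root_from_file = tree.getroot()

print(root_from_bytes.tag, root_from_file.tag)
print(root_from_file[0].text)
```

The string returned by fetch in this tutorial can be passed to fromstring as is, because it does not start with an XML declaration.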

ET.fromstring returns an Element object. From an Element, you can read:

- its tag, the name in <tag_name> ... </tag_name>, using Element.tag
- its attributes, as in <tag_name attribut1="value1" attrib2="value2"> ..., using Element.attrib (a dict-like object)
- its text, as in <tag_name attribut1="value1"> some text </tag_name>, using Element.text

Let us create a dummy XML content to illustrate these:

Python
xml = """
<tag_name attribut1="value1" attrib2="value2"> some text </tag_name>
"""

element = ET.fromstring(xml)

print('tag'.center(20, '*'))
print(f'{element.tag = }')
print(f'{type(element.tag) = }')

print('attrib'.center(20, '*'))
print(f'{element.attrib = }')
print(f'{type(element.attrib) = }')

print('text'.center(20, '*'))
print(f'{element.text = }')
print(f'{type(element.text) = }')

Python Console Session
********tag*********
element.tag = 'tag_name'
type(element.tag) = <class 'str'>
*******attrib*******
element.attrib = {'attribut1': 'value1', 'attrib2': 'value2'}
type(element.attrib) = <class 'lxml.etree._Attrib'>
********text********
element.text = ' some text '
type(element.text) = <class 'str'>
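
In practice these three values are rarely printed raw. The sketch below, which reuses the dummy element from above, shows a few common ways to read them: Element.get for a single attribute, dict() for a plain copy of the attributes, and strip() to drop the whitespace that Element.text preserves. The attribute name no_such_attribute is made up for illustration.

```python
import lxml.etree as ET

element = ET.fromstring(
    '<tag_name attribut1="value1" attrib2="value2"> some text </tag_name>'
)

# Element.get reads one attribute and returns None (or a default) if it is absent
first = element.get('attribut1')
missing = element.get('no_such_attribute', 'fallback')

# Element.attrib is dict-like; dict() gives a plain copy
attributes = dict(element.attrib)

# Element.text keeps surrounding whitespace, so strip it when needed
text = element.text.strip()

print(first, missing, attributes, repr(text))
```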

Displaying XML Structure

XML data can often be large and complex, especially when deeply nested, which makes understanding its structure difficult.

To help with this, let us create a helper function that displays the XML tags in a tree-like format. Since we do not know in advance how deep the structure goes, recursion is a natural fit:

Python
def print_tags_tree(elem, level=0):
    # print indent, level and tag of the element
    print(' ' * 5 * level, level, elem.tag)
    for child in elem:
        # recursion to go as deep as possible
        print_tags_tree(child, level + 1)

Let us try it:

Python
print_tags_tree(root)

Python Console Session
 0 {http://www.mediawiki.org/xml/export-0.11/}mediawiki
      1 {http://www.mediawiki.org/xml/export-0.11/}siteinfo
           2 {http://www.mediawiki.org/xml/export-0.11/}sitename
           2 {http://www.mediawiki.org/xml/export-0.11/}dbname
           2 {http://www.mediawiki.org/xml/export-0.11/}base
           2 {http://www.mediawiki.org/xml/export-0.11/}generator
           2 {http://www.mediawiki.org/xml/export-0.11/}case
           2 {http://www.mediawiki.org/xml/export-0.11/}namespaces
                3 {http://www.mediawiki.org/xml/export-0.11/}namespace
                3 {http://www.mediawiki.org/xml/export-0.11/}namespace
                3 {http://www.mediawiki.org/xml/export-0.11/}namespace
                3 {http://www.mediawiki.org/xml/export-0.11/}namespace
                3 {http://www.mediawiki.org/xml/export-0.11/}namespace
                3 {http://www.mediawiki.org/xml/export-0.11/}namespace
                3 {http://www.mediawiki.org/xml/export-0.11/}namespace
                3 {http://www.mediawiki.org/xml/export-0.11/}namespace
                3 {http://www.mediawiki.org/xml/export-0.11/}namespace
                3 {http://www.mediawiki.org/xml/export-0.11/}namespace
                3 {http://www.mediawiki.org/xml/export-0.11/}namespace
                3 {http://www.mediawiki.org/xml/export-0.11/}namespace
                3 {http://www.mediawiki.org/xml/export-0.11/}namespace
                3 {http://www.mediawiki.org/xml/export-0.11/}namespace
                3 {http://www.mediawiki.org/xml/export-0.11/}namespace
                3 {http://www.mediawiki.org/xml/export-0.11/}namespace
                3 {http://www.mediawiki.org/xml/export-0.11/}namespace
                3 {http://www.mediawiki.org/xml/export-0.11/}namespace
                3 {http://www.mediawiki.org/xml/export-0.11/}namespace
                3 {http://www.mediawiki.org/xml/export-0.11/}namespace
                3 {http://www.mediawiki.org/xml/export-0.11/}namespace
                3 {http://www.mediawiki.org/xml/export-0.11/}namespace
                3 {http://www.mediawiki.org/xml/export-0.11/}namespace
                3 {http://www.mediawiki.org/xml/export-0.11/}namespace
                3 {http://www.mediawiki.org/xml/export-0.11/}namespace
                3 {http://www.mediawiki.org/xml/export-0.11/}namespace
                3 {http://www.mediawiki.org/xml/export-0.11/}namespace
                3 {http://www.mediawiki.org/xml/export-0.11/}namespace
                3 {http://www.mediawiki.org/xml/export-0.11/}namespace
                3 {http://www.mediawiki.org/xml/export-0.11/}namespace
                3 {http://www.mediawiki.org/xml/export-0.11/}namespace
                3 {http://www.mediawiki.org/xml/export-0.11/}namespace
      1 {http://www.mediawiki.org/xml/export-0.11/}page
           2 {http://www.mediawiki.org/xml/export-0.11/}title
           2 {http://www.mediawiki.org/xml/export-0.11/}ns
           2 {http://www.mediawiki.org/xml/export-0.11/}id
           2 {http://www.mediawiki.org/xml/export-0.11/}revision
                3 {http://www.mediawiki.org/xml/export-0.11/}id
                3 {http://www.mediawiki.org/xml/export-0.11/}parentid
                3 {http://www.mediawiki.org/xml/export-0.11/}timestamp
                3 {http://www.mediawiki.org/xml/export-0.11/}contributor
                     4 {http://www.mediawiki.org/xml/export-0.11/}username
                     4 {http://www.mediawiki.org/xml/export-0.11/}id
                3 {http://www.mediawiki.org/xml/export-0.11/}comment
                3 {http://www.mediawiki.org/xml/export-0.11/}origin
                3 {http://www.mediawiki.org/xml/export-0.11/}model
                3 {http://www.mediawiki.org/xml/export-0.11/}format
                3 {http://www.mediawiki.org/xml/export-0.11/}text
                3 {http://www.mediawiki.org/xml/export-0.11/}sha1

The output of print_tags_tree shows each tag in the form {namespace}tagname, which combines:

- the namespace URI (e.g., {http://www.mediawiki.org/xml/export-0.11/})
- the tag name (e.g., mediawiki, page)

Although knowing the namespace is important, as we will discover later, it makes the tree look very cluttered.

To address this, let us modify the helper function to allow printing only the tag names in the tree.

We will use the QName class, which splits the tag of an element into its tag name and its namespace. Here is an example:

Python
print(f'{root.tag=}')

# Using the ET class QName
root_name = ET.QName(root)
# only tag name
print(f'{root_name.localname=}')
# only namespace
print(f'{root_name.namespace=}')

Python Console Session
root.tag='{http://www.mediawiki.org/xml/export-0.11/}mediawiki'
root_name.localname='mediawiki'
root_name.namespace='http://www.mediawiki.org/xml/export-0.11/'
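
QName also accepts a tag string instead of an element, and it can build the {namespace}tagname form from its two parts. The sketch below illustrates both directions; the URI http://example.org/ns is a made-up placeholder.

```python
import lxml.etree as ET

URI = 'http://example.org/ns'  # made-up namespace URI

# Split a tag string written in {namespace}tagname form
qname = ET.QName('{http://example.org/ns}page')
print(qname.namespace, qname.localname)

# Build the {namespace}tagname form from its two parts
full_tag = ET.QName(URI, 'page').text
print(full_tag)

# A tag without a namespace has namespace None
plain = ET.QName('page')
print(plain.namespace, plain.localname)
```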

Now, let us modify the print_tags_tree function to provide the option of printing only tag names when the only_tagnames parameter is set to True.

Python
def print_tags_tree(elem, level=0, only_tagnames=False):
    tagname = ET.QName(elem).localname if only_tagnames else elem.tag

    print(' ' * 5 * level, level, tagname)
    for child in elem:
        print_tags_tree(child, level + 1, only_tagnames)

print_tags_tree(root, only_tagnames=True)

Python Console Session
 0 mediawiki
      1 siteinfo
           2 sitename
           2 dbname
           2 base
           2 generator
           2 case
           2 namespaces
                3 namespace
                3 namespace
                3 namespace
                3 namespace
                3 namespace
                3 namespace
                3 namespace
                3 namespace
                3 namespace
                3 namespace
                3 namespace
                3 namespace
                3 namespace
                3 namespace
                3 namespace
                3 namespace
                3 namespace
                3 namespace
                3 namespace
                3 namespace
                3 namespace
                3 namespace
                3 namespace
                3 namespace
                3 namespace
                3 namespace
                3 namespace
                3 namespace
                3 namespace
                3 namespace
                3 namespace
                3 namespace
      1 page
           2 title
           2 ns
           2 id
           2 revision
                3 id
                3 parentid
                3 timestamp
                3 contributor
                     4 username
                     4 id
                3 comment
                3 origin
                3 model
                3 format
                3 text
                3 sha1

Better! This gives us a clear view of the structure:

The root element is mediawiki.

mediawiki has two children:
- siteinfo
  - Contains information about the wiki site (e.g., its sitename and its Wiki namespaces).
- page
  - Contains the most important information for this project.
  - page has four direct children: title, ns, id, and revision.
    - title contains the title of the page
      - For example, schön, Flexion:schön.
    - ns contains the Wiki namespace, which should not be confused with XML namespaces!
      - For example, 0, 108.
    - id is the unique identifier of the page
      - For example, 2930, 21734.
    - revision contains a revision of the page.
      - Every edit to a wiki page creates a new revision. The XML extraction methods covered here retrieve only the latest one, so we have a single revision element.
      - The raw wikitext is located here in the child element text.
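
The long run of identical namespace lines in the tree could be condensed. The following is a minimal sketch of one way to do that, not part of the original helper: the hypothetical function tag_summary collapses sibling elements that share a tag into one line with a count. It assumes that repeated siblings have the same shape, since it only descends into the first of them, and that the tree contains no comments or processing instructions.

```python
from collections import Counter

import lxml.etree as ET

def tag_summary(elem, level=0):
    """Return indented lines; repeated sibling tags are collapsed with a count."""
    lines = [f"{' ' * 5 * level}{level} {ET.QName(elem).localname}"]
    counts = Counter(child.tag for child in elem)
    seen = set()
    for child in elem:
        if child.tag in seen:
            continue  # siblings with this tag were already summarized
        seen.add(child.tag)
        child_lines = tag_summary(child, level + 1)
        if counts[child.tag] > 1:
            child_lines[0] += f" (x{counts[child.tag]})"
        lines.extend(child_lines)
    return lines

# A made-up document with three identical siblings
sample_tree = ET.fromstring("<a><b/><b/><b/><c><d/></c></a>")
summary = tag_summary(sample_tree)
print("\n".join(summary))
```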

The main goal of this section is to extract the page, title, ns, and text elements.

But first, let us briefly discuss XML namespaces. If you are already familiar with XML namespaces, feel free to skip this part.

XML Namespaces Overview

In XML, tag names and attributes are user-defined, which can lead to name conflicts when combining data from different XML files. To avoid these conflicts, XML uses a system of namespaces and prefixes. Each namespace is typically defined using a URI.

Namespaces are often declared in the root element of the XML, but they can also be declared in child elements. To identify the namespaces in an XML document, look for attributes named xmlns or xmlns:prefix.

- xmlns without a prefix: This denotes the default namespace, which applies to the element where it appears and to all its descendants (unless overridden).
- xmlns:prefix: This is a prefixed namespace. It applies only to elements that explicitly use the prefix.
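
To make the distinction concrete, here is a small sketch with a made-up document (the element names and both example.org URIs are invented for illustration). The unprefixed book element lands in the default namespace, while m:album lands in the namespace bound to the prefix m.

```python
import lxml.etree as ET

library_xml = """
<library xmlns="http://example.org/books" xmlns:m="http://example.org/music">
  <book>Faust</book>
  <m:album>Winterreise</m:album>
</library>
"""
library_root = ET.fromstring(library_xml)

# Each child tag is reported with the namespace it belongs to
child_tags = [child.tag for child in library_root]
for tag in child_tags:
    print(tag)
```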

For example, in our Wiki XML, we see two namespaces defined at the root element:

XML
<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.11/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 

This gives us:

- Default namespace: xmlns="http://www.mediawiki.org/xml/export-0.11/", which applies to all elements without prefixes.
- Prefixed namespace: xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" with the prefix xsi.

Instead of reading the declarations by eye, we can use the nsmap property of an element, which returns a dictionary mapping each prefix to its namespace URI. The default namespace appears under the key None.

Python
NAMESPACES = root.nsmap
for key, namespace in NAMESPACES.items():
    print('prefix:', key,'=> namespace-URI:', namespace)

Python Console Session
prefix: None => namespace-URI: http://www.mediawiki.org/xml/export-0.11/
prefix: xsi => namespace-URI: http://www.w3.org/2001/XMLSchema-instance

Extracting Data

`element.find`

To extract elements from the XML, we will use the find method, which searches for the first child element with a specified tag name or path.

Note that the following code will fail to find the page element and will return None. Every element in this document belongs to the default namespace, so the bare name 'page' does not match; lxml needs to be told which namespace the tag is in.

Python
# This will not work, because it lacks the required namespace
page = root.find('page')
print(page) # None

Python Console Session
None

You can specify the namespace in two ways:

- Using the full {namespace}tagname notation
- Passing a namespace dictionary as an argument to find

The following code will successfully retrieve the page element using each method:

Python
# Full notation
page = root.find('{http://www.mediawiki.org/xml/export-0.11/}page')
print('Full notation:', page)

# Namespace dictionary
# NAMESPACES = {None: 'http://www.mediawiki.org/xml/export-0.11/'}
NAMESPACES = root.nsmap
page = root.find('page', NAMESPACES)
print('Namespace dictionary:', page)

Python Console Session
Full notation: <Element {http://www.mediawiki.org/xml/export-0.11/}page at 0x7f9674a49340>
Namespace dictionary: <Element {http://www.mediawiki.org/xml/export-0.11/}page at 0x7f9674a49340>

We will stick to the namespace dictionary to avoid writing the namespace URI each time we call find.
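
A namespace dictionary does not have to come from nsmap. As a hedged variation on the second method, the sketch below binds the MediaWiki namespace URI to an explicit prefix and uses that prefix in the path. The prefix mw and the shortened sample document are assumptions made for this example; only the URI has to match the one declared in the document.

```python
import lxml.etree as ET

sample = """
<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.11/">
  <page><title>schön</title></page>
</mediawiki>
"""
sample_root = ET.fromstring(sample)

# 'mw' is an arbitrary prefix chosen here; only the URI must match the document
PREFIXED = {'mw': 'http://www.mediawiki.org/xml/export-0.11/'}

page_elem = sample_root.find('mw:page', PREFIXED)
title_elem = sample_root.find('mw:page/mw:title', PREFIXED)
print(title_elem.text)
```

With a prefixed dictionary, every step of a path needs the prefix, which is more verbose but makes the namespace visible at the call site.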

Now that we have successfully located the page element, we can retrieve its child elements ns and title.

Python
# Accessing Wiki namespace
ns = page.find('ns', NAMESPACES)
print(ns)
print(ns.text) # '0'

# Accessing the title of the page
title = page.find('title', NAMESPACES)
print(title)
print(title.text) # schön

Python Console Session
<Element {http://www.mediawiki.org/xml/export-0.11/}ns at 0x7f9674a361c0>
0
<Element {http://www.mediawiki.org/xml/export-0.11/}title at 0x7f9674a36180>
schön
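
Because find returns None when nothing matches, calling .text on the result can raise an AttributeError. The sketch below shows two ways to guard against that on a shortened, made-up export: an explicit None check, and findtext, which returns the text directly and accepts a default. The tag name missing_tag is hypothetical.

```python
import lxml.etree as ET

sample = """
<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.11/">
  <page><title>schön</title><ns>0</ns></page>
</mediawiki>
"""
sample_root = ET.fromstring(sample)
ns_map = sample_root.nsmap
sample_page = sample_root.find('page', ns_map)

# findtext returns the text of the first match directly
title_text = sample_page.findtext('title', namespaces=ns_map)

# find returns None when nothing matches, so check before using .text
missing = sample_page.find('missing_tag', ns_map)
missing_text = missing.text if missing is not None else None

# findtext accepts a default for the missing case
missing_default = sample_page.findtext('missing_tag', default='', namespaces=ns_map)

print(title_text, missing_text, repr(missing_default))
```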

Finally, we want to retrieve the main content, or wikitext, which is stored in the text element.

Note that we cannot use page.find('text', NAMESPACES) directly because text is not a direct child of page; it is nested under revision.

Python
# Print the tree of page, to find the path 
print_tags_tree(page, only_tagnames=True)

Python Console Session
 0 page
      1 title
      1 ns
      1 id
      1 revision
           2 id
           2 parentid
           2 timestamp
           2 contributor
                3 username
                3 id
           2 comment
           2 origin
           2 model
           2 format
           2 text
           2 sha1

Fortunately, the find method allows us to specify the path to a nested tag. In this case, we specify the path revision/text from page to text:

Python
wikitext = page.find('revision/text', NAMESPACES)
print(wikitext)
# Let's print the first 300 characters of the wikitext
print(wikitext.text[:300])

Python Console Session
<Element {http://www.mediawiki.org/xml/export-0.11/}text at 0x7f9674a43680>
{{Siehe auch|[[schon]]}}
{{Wort der Woche|26|2007}}
== schön ({{Sprache|Deutsch}}) ==
=== {{Wortart|Adjektiv|Deutsch}} ===

{{Deutsch Adjektiv Übersicht
|Positiv=schön
|Komparativ=schöner
|Superlativ=schönsten
|Bild 1=Jaguar E-type (serie III).jpg|mini|1|ein ''schönes'' [[Auto]]
|Bild 2=12er Anitra 

We are done; we have just retrieved the wikitext content from the XML string.
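
To contrast the different kinds of path, the following sketch runs three searches on a shortened, made-up page element: a bare tag name (direct children only), the explicit path used above, and the .// form, which looks for the first matching descendant at any depth.

```python
import lxml.etree as ET

sample = """
<page xmlns="http://www.mediawiki.org/xml/export-0.11/">
  <title>schön</title>
  <revision><id>1</id><text>{{Wortart|Adjektiv|Deutsch}}</text></revision>
</page>
"""
sample_page = ET.fromstring(sample)
ns_map = sample_page.nsmap

direct = sample_page.find('text', ns_map)            # direct children only: no match
by_path = sample_page.find('revision/text', ns_map)  # explicit path from page
anywhere = sample_page.find('.//text', ns_map)       # first descendant at any depth

print(direct)
print(by_path.text)
print(anywhere.text)
```

The explicit path is the safer choice when the same tag name can occur at several levels, as id does in this export.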

Before moving on to the next section, let us recap what we have learned by wrapping the steps in functions.

Python
import requests
import lxml.etree as ET

def fetch(title):
    url = f'https://de.wiktionary.org/wiki/Spezial:Exportieren/{title}'
    resp = requests.get(url)
    resp.raise_for_status()
    return resp.text

def fetch_wikitext(title):
    xml_content = fetch(title)
    root = ET.fromstring(xml_content)
    namespaces = root.nsmap
    page = root.find('page', namespaces)
    wikitext = page.find('revision/text', namespaces)
    return wikitext.text

# let us try it
print(fetch_wikitext('schön')[:5000])

Python Console Session
{{Siehe auch|[[schon]]}}
{{Wort der Woche|26|2007}}
== schön ({{Sprache|Deutsch}}) ==
=== {{Wortart|Adjektiv|Deutsch}} ===

{{Deutsch Adjektiv Übersicht
|Positiv=schön
|Komparativ=schöner
|Superlativ=schönsten
|Bild 1=Jaguar E-type (serie III).jpg|mini|1|ein ''schönes'' [[Auto]]
|Bild 2=12er Anitra US5 Kiel2009.jpg|mini|1|eine ''schöne'' [[Yacht]]
|Bild 3=Gedeon Burkhard Berlinale 2008.jpg|mini|1|ein ''schöner'' [[Mann]]
}}

{{Worttrennung}}
:schön, {{Komp.}} schö·ner, {{Sup.}} am schöns·ten

{{Aussprache}}
:{{IPA}} {{Lautschrift|ʃøːn}}
:{{Hörbeispiele}} {{Audio|De-schön.ogg}}, {{Audio|De-schön fcm.ogg}}, {{Audio|De-schön2.ogg}}, {{Audio|De-at-schön.ogg|spr=at}}
:{{Reime}} {{Reim|øːn|Deutsch}}

{{Bedeutungen}}
:[1] ästhetisch, eine angenehme Wirkung auf die Sinne habend: zum Beispiel ein gutes Aussehen habend, sich gut anhörend
:[2] {{K|allgemein}} angenehm, gut, anständig
:[3] {{K|umgangssprachlich}} verstärkend im Sinne von [[beträchtlich]]
:[4] {{K|umgangssprachlich}} zustimmende Antwort auf eine Frage
:[5] {{K|umgangssprachlich}} so, wie es sich gehört
:[6] in festen Wendungen mit verschwommener Bedeutung [1, 2]

{{Herkunft}}
:{{goh.}} ''scōni'' „ansehnlich, glänzend, rein, herrlich“, {{gmh.}} ''schœn, schœne'' auch „schonend, freundlich“, {{mlg.}} ''schön, schöne'', {{dum.}} ''scōne''<ref>{{Lit-Pfeifer: Etymologisches Wörterbuch|A=6}}, Seite 1236.</ref>

{{Synonyme}}
:[1] [[ansprechend]], [[anziehend]], [[ästhetisch]], [[attraktiv]], [[hübsch]], [[dekorativ]]
:[2] [[angenehm]], [[gut]]
:[3] [[besonders]], [[beträchtlich]], [[sehr]], [[überaus]]
:[4] [[einverstanden]], [[okay]]
:[5] [[ordnungsgemäß]]

{{Gegenwörter}}
:[1] [[hässlich]], [[unschön]]
:[2] [[schlecht]], [[unangenehm]]
:[3] [[nicht]]
:[4] [[nein]]
:[5] {{österr.|:}} [[schirch]]

{{Unterbegriffe}}
:[1] [[formschön]]
:[1, 2] [[wunderschön]]

{{Beispiele}}
:[1] Sie hat ''schönes'' Haar. Das Musikstück ist ''schön.''
:[1] Sie sang ''schön,'' ''schöner'' als gewöhnlich, weil die Instrumentalisten ihr so vertraut waren. Am ''schönsten'' sang sie, als Viktor am Klavier saß.
:[1] „Nik Wallenda, Urenkel eines deutschen Zirkusakrobaten, hat als erster Mensch die Niagarafälle an ihrer ''schönsten'' und gefährlichsten Stelle überquert.“<ref>{{Per-Zeit Online|Online=https://www.zeit.de/zustimmung?url=https%3A%2F%2Fwww.zeit.de%2Fgesellschaft%2Fzeitgeschehen%2F2012-06%2Fakrobat-niagarafaelle-balance |Autor=Zeit Online |Titel= Zu Fuß übers große Wasser |Tag=16 |Monat= 06|Jahr= 2012|zugriff=2020-04-03}}</ref>
:[1, 2] Sille ist ''schön.''
:[1, 2] „Die Herzen bebten über die Kühnheit des jungen, ''schönen'', wagemutigen Paares.“<ref>{{Literatur | Autor= Karl May| Titel= Winnetou IV | Verlag= Neues Leben | Ort= Berlin | Jahr= 1993 [1910] }}, Seite 429.</ref>
:[2] Das hat er aber ''schön'' gemacht. Wir hatten ''schöne'' Ferientage. Es wäre ''schön,'' wenn wir uns wieder treffen. Es war ''schön'' von ihm, seiner Frau Blumen zu schenken.
:[2] Das ist ja eine ''schöne'' Geschichte! Oder anders gesagt: Das ist aber wirklich schlimm!
:[2] Du bist mir ja ein ''schöner'' Freund! Oder anders gesagt: Du bist wahrlich ein schlechter Freund!
:[3] Da wird sie ganz ''schön'' staunen. Also, da wird sie aber überrascht sein.
:[3] Das wird eine ''schöne'' Stange Geld kosten. Also, das wird wohl ziemlich teuer werden.
:[4] Lass uns doch mal wieder im Kino einen Film ansehen! – ''Schön,'' dann komm!
:[5] So, jetzt gehen wir ''schön'' ins Bett.
:[5] ''Schön'' aufpassen, wenn du über die Straße gehst!

{{Redewendungen}}
:[[das schöne Geschlecht|das ''schöne'' Geschlecht]] – die Frauen in ihrer Gesamtheit
:[[die schönen Künste|die ''schönen'' Künste]] – Dichtung, Musik, Malerei, Bildhauerei
:[[eine schöne Leich]] – …
:[[jemandem schöne Augen machen|jemandem ''schöne'' Augen machen]]
:[[schöne Worte machen|''schöne'' Worte machen]] – schmeicheln
:[[das ist zu schön, um wahr zu sein|Das ist zu ''schön'', um wahr zu sein]] – …
:[[wie es so schön heißt|wie es so ''schön'' heißt]] – …
:[[immer schön der Reihe nach|immer ''schön'' der Reihe nach]] – …
:[[schön ist, was gefällt|''Schön'' ist, was gefällt]]. – …
:[[schön und gut|''schön'' und gut]] – Zustimmung zu einem Argument, gefolgt von „Aber …“

{{Sprichwörter}}
:[1] [[aus einem schönen Morgen wird selten ein schöner Tag, aus einem schönen Mädchen wird meistens ein Schlumpersack|Aus einem ''schönen'' Morgen wird selten ein ''schöner'' Tag, aus einem ''schönen'' Mädchen wird meistens ein Schlumpersack]].

{{Charakteristische Wortkombinationen}}
:[1] ''schöne'' [[Auge]]n, ''schöne'' [[Bescherung]], ''schöne'' [[Tag]]e, ''schönes'' [[Wetter]], [[traumhaft]] ''schön'', [[atemberaubend]] ''schön''
:[3] ganz ''schön'' [[frech]], [[bitte]] ''schön'', [[danke]] ''schön'', [[recht]] ''schönen'' [[Dank]], ''schöne'' [[Gruß|Grüße]] [[ausrichten]]

{{Wortbildungen}}
:[[beschönigen]], [[bildschön]], [[Bitteschön]], [[Dankeschön]], [[formschön]], [[schönen]], [[schönfärben]], [[Schöngeist]], [[Schönheit]], [[Schönling]], [[schönmachen]], [[schönred
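
fetch_wikitext assumes that the export always contains a page element with a revision/text child. As a hedged sketch of a more defensive variant, the hypothetical helper extract_wikitext below separates parsing from fetching and returns None when there is no page. The assumption here is that an export can come back without a page element, for example for a title that does not exist; the two sample strings are made up so that the example runs without network access.

```python
import lxml.etree as ET

def extract_wikitext(xml_content):
    """Return the wikitext of the first page in an export, or None if absent."""
    root = ET.fromstring(xml_content)
    namespaces = root.nsmap
    page = root.find('page', namespaces)
    if page is None:
        return None
    # findtext returns None when revision/text is missing
    return page.findtext('revision/text', namespaces=namespaces)

with_page = """
<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.11/">
  <page><title>schön</title><revision><text>== schön ==</text></revision></page>
</mediawiki>
"""
without_page = '<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.11/"/>'

print(extract_wikitext(with_page))
print(extract_wikitext(without_page))
```

Keeping the parsing step free of network calls also makes it easy to test against saved XML.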
