From Special Export

We will reuse the fetch function from our earlier tutorial on Special Export; it is reproduced here for reference.

The fetch() function
Python
import requests

def fetch(title):
    # Construct the URL for the XML export of the given page title
    url = f'https://de.wiktionary.org/wiki/Spezial:Exportieren/{title}'

    # Send a GET request
    resp = requests.get(url)

    # Check if the request was successful, and raise an error if not
    resp.raise_for_status()

    # Return the raw XML content as bytes
    return resp.content

Let's fetch the XML content for the page titled 'schön' as an example.

Python
xml_content = fetch('schön')
print(xml_content[:500])
XML
b'<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.11/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.11/ http://www.mediawiki.org/xml/export-0.11.xsd" version="0.11" xml:lang="de">\n  <siteinfo>\n    <sitename>Wiktionary</sitename>\n    <dbname>dewiktionary</dbname>\n    <base>https://de.wiktionary.org/wiki/Wiktionary:Hauptseite</base>\n    <generator>MediaWiki 1.44.0-wmf.4</generator>\n    <case>case-sensitive</case>\n    <namespa'
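
The title in this example contains a non-ASCII character. The call above works as written, so the character is evidently percent-encoded somewhere along the way (requests normally does this itself). The sketch below only makes the encoded URL visible, using the standard library; the helper name export_url is hypothetical and not part of the tutorial.

```python
from urllib.parse import quote

BASE_URL = 'https://de.wiktionary.org/wiki/Spezial:Exportieren/'

def export_url(title):
    # Percent-encode the title as UTF-8; ':' and '/' are left readable
    return BASE_URL + quote(title, safe=':/')

print(export_url('schön'))
print(export_url('Flexion:schön'))
```

Printing the URL this way can help when a request fails and you want to paste the exact address into a browser.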

Parsing the XML content

Now that we have retrieved the XML content, we will use lxml.etree to parse it.

fetch returns the XML as a bytes object held in memory, so we will parse it with the fromstring function. Later we will use the parse function to parse an XML file.

Python
import lxml.etree as ET

# Parse the XML content into an ET Element
root = ET.fromstring(xml_content)

print(type(root)) # Output: <class 'lxml.etree._Element'>
Python Console Session
<class 'lxml.etree._Element'>
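
To make the difference between fromstring and parse concrete, here is a minimal sketch on a tiny hand-written document. An in-memory io.BytesIO object stands in for a real file, and the import falls back to the standard library's ElementTree on the assumption that it behaves the same for these particular calls.

```python
import io

try:
    import lxml.etree as ET
except ImportError:
    # Assumption: the standard library behaves the same for the calls below
    import xml.etree.ElementTree as ET

# Hand-written sample, UTF-8 encoded ('\xc3\xb6' is the letter ö)
xml_bytes = b'<page><title>sch\xc3\xb6n</title></page>'

# fromstring: the XML is already in memory; returns the root Element
root_a = ET.fromstring(xml_bytes)

# parse: takes a file path or file-like object; returns an ElementTree
tree = ET.parse(io.BytesIO(xml_bytes))
root_b = tree.getroot()

print(root_a.find('title').text, root_b.find('title').text)
```

Note that parse returns a tree object, so an extra getroot() call is needed to arrive at the same kind of Element that fromstring returns directly.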

ET.fromstring returns an Element object with several useful properties.
From an Element object, you can extract:

  • its tag <tag_name> ... </tag_name>, using Element.tag
  • its attributes <tag_name attribut1="value1" attrib2="value2"> ..., using Element.attrib
  • its text <tag_name attribut1="value1"> some text </tag_name>, using Element.text

Let us create a small dummy XML document to illustrate these:

Python
xml = """
<tag_name attribut1="value1" attrib2="value2"> some text </tag_name>
"""

element = ET.fromstring(xml)

print('tag'.center(20, '*'))
print(f'{element.tag = }')
print(f'{type(element.tag) = }')

print('attrib'.center(20, '*'))
print(f'{element.attrib = }')
print(f'{type(element.attrib) = }')

print('text'.center(20, '*'))
print(f'{element.text = }')
print(f'{type(element.text) = }')
Python Console Session
********tag*********
element.tag = 'tag_name'
type(element.tag) = <class 'str'>
*******attrib*******
element.attrib = {'attribut1': 'value1', 'attrib2': 'value2'}
type(element.attrib) = <class 'lxml.etree._Attrib'>
********text********
element.text = ' some text '
type(element.text) = <class 'str'>
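
In day-to-day use, a few access patterns come up repeatedly: reading one attribute, supplying a default for a missing one, and trimming the whitespace around the text. The sketch below shows them on a made-up element; the attribute names lang and pos are invented for the illustration, and the import fallback assumes the standard library behaves the same for these calls.

```python
try:
    import lxml.etree as ET
except ImportError:
    # Assumption: the standard library behaves the same for the calls below
    import xml.etree.ElementTree as ET

element = ET.fromstring('<entry lang="de" pos="Adjektiv">  schön  </entry>')

# attrib supports dict-style access
print(element.attrib['lang'])

# get() returns a default instead of raising KeyError
print(element.get('pos'))
print(element.get('missing', 'n/a'))

# text keeps the surrounding whitespace, so strip it when needed
print(repr(element.text.strip()))

# An element without any text has text set to None, not ''
empty = ET.fromstring('<empty/>')
print(empty.text)
```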

Displaying XML Structure

XML data can often be large and complex, especially when deeply nested, which makes understanding its structure difficult.

To help with this, let us create a helper function to display the XML tags in a tree-like format. Since we do not know how deep the XML structure might go, the best strategy here is to use recursion as follows:

Python
def print_tags_tree(elem, level=0):
    # print indent, level and tag of the element
    print(' ' * 5 * level, level, elem.tag)
    for child in elem:
        # recursion to go as deep as possible
        print_tags_tree(child, level + 1)

Let us try it:

Python
print_tags_tree(root)
Python Console Session
 0 {http://www.mediawiki.org/xml/export-0.11/}mediawiki
      1 {http://www.mediawiki.org/xml/export-0.11/}siteinfo
           2 {http://www.mediawiki.org/xml/export-0.11/}sitename
           2 {http://www.mediawiki.org/xml/export-0.11/}dbname
           2 {http://www.mediawiki.org/xml/export-0.11/}base
           2 {http://www.mediawiki.org/xml/export-0.11/}generator
           2 {http://www.mediawiki.org/xml/export-0.11/}case
           2 {http://www.mediawiki.org/xml/export-0.11/}namespaces
                3 {http://www.mediawiki.org/xml/export-0.11/}namespace
                3 {http://www.mediawiki.org/xml/export-0.11/}namespace
                3 {http://www.mediawiki.org/xml/export-0.11/}namespace
                3 {http://www.mediawiki.org/xml/export-0.11/}namespace
                3 {http://www.mediawiki.org/xml/export-0.11/}namespace
                3 {http://www.mediawiki.org/xml/export-0.11/}namespace
                3 {http://www.mediawiki.org/xml/export-0.11/}namespace
                3 {http://www.mediawiki.org/xml/export-0.11/}namespace
                3 {http://www.mediawiki.org/xml/export-0.11/}namespace
                3 {http://www.mediawiki.org/xml/export-0.11/}namespace
                3 {http://www.mediawiki.org/xml/export-0.11/}namespace
                3 {http://www.mediawiki.org/xml/export-0.11/}namespace
                3 {http://www.mediawiki.org/xml/export-0.11/}namespace
                3 {http://www.mediawiki.org/xml/export-0.11/}namespace
                3 {http://www.mediawiki.org/xml/export-0.11/}namespace
                3 {http://www.mediawiki.org/xml/export-0.11/}namespace
                3 {http://www.mediawiki.org/xml/export-0.11/}namespace
                3 {http://www.mediawiki.org/xml/export-0.11/}namespace
                3 {http://www.mediawiki.org/xml/export-0.11/}namespace
                3 {http://www.mediawiki.org/xml/export-0.11/}namespace
                3 {http://www.mediawiki.org/xml/export-0.11/}namespace
                3 {http://www.mediawiki.org/xml/export-0.11/}namespace
                3 {http://www.mediawiki.org/xml/export-0.11/}namespace
                3 {http://www.mediawiki.org/xml/export-0.11/}namespace
                3 {http://www.mediawiki.org/xml/export-0.11/}namespace
                3 {http://www.mediawiki.org/xml/export-0.11/}namespace
                3 {http://www.mediawiki.org/xml/export-0.11/}namespace
                3 {http://www.mediawiki.org/xml/export-0.11/}namespace
                3 {http://www.mediawiki.org/xml/export-0.11/}namespace
                3 {http://www.mediawiki.org/xml/export-0.11/}namespace
                3 {http://www.mediawiki.org/xml/export-0.11/}namespace
                3 {http://www.mediawiki.org/xml/export-0.11/}namespace
      1 {http://www.mediawiki.org/xml/export-0.11/}page
           2 {http://www.mediawiki.org/xml/export-0.11/}title
           2 {http://www.mediawiki.org/xml/export-0.11/}ns
           2 {http://www.mediawiki.org/xml/export-0.11/}id
           2 {http://www.mediawiki.org/xml/export-0.11/}revision
                3 {http://www.mediawiki.org/xml/export-0.11/}id
                3 {http://www.mediawiki.org/xml/export-0.11/}parentid
                3 {http://www.mediawiki.org/xml/export-0.11/}timestamp
                3 {http://www.mediawiki.org/xml/export-0.11/}contributor
                     4 {http://www.mediawiki.org/xml/export-0.11/}username
                     4 {http://www.mediawiki.org/xml/export-0.11/}id
                3 {http://www.mediawiki.org/xml/export-0.11/}comment
                3 {http://www.mediawiki.org/xml/export-0.11/}origin
                3 {http://www.mediawiki.org/xml/export-0.11/}model
                3 {http://www.mediawiki.org/xml/export-0.11/}format
                3 {http://www.mediawiki.org/xml/export-0.11/}text
                3 {http://www.mediawiki.org/xml/export-0.11/}sha1

The output of print_tags_tree shows tags in a format that combines:

  • The namespace URI (e.g., {http://www.mediawiki.org/xml/export-0.11/})
  • The tag name (e.g., mediawiki, page)

Although knowing the namespace is important, as we will discover later, it makes the tree look very cluttered.

To address this, let us modify the helper function to allow printing only the tag names in the tree.

We will use the QName class, which splits the tag of an element into its tag name (localname) and its namespace. Here is an example using QName:

Python
print(f'{root.tag=}')

# Wrap the element in a QName object
root_name = ET.QName(root)
# only tag name
print(f'{root_name.localname=}')
# only namespace
print(f'{root_name.namespace=}')
Python Console Session
root.tag='{http://www.mediawiki.org/xml/export-0.11/}mediawiki'
root_name.localname='mediawiki'
root_name.namespace='http://www.mediawiki.org/xml/export-0.11/'
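
For intuition, the same split can be done by hand, because the tag is just a string of the form {namespace}localname (often called Clark notation). The helper split_tag below is a hypothetical illustration of that format, not a replacement for QName.

```python
def split_tag(tag):
    # '{uri}local' -> ('uri', 'local'); a tag without namespace -> (None, tag)
    if tag.startswith('{'):
        uri, _, local = tag[1:].partition('}')
        return uri, local
    return None, tag

print(split_tag('{http://www.mediawiki.org/xml/export-0.11/}mediawiki'))
print(split_tag('page'))
```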

Now, let us modify the print_tags_tree function to provide the option of printing only tag names when the only_tagnames parameter is set to True.

Python
def print_tags_tree(elem, level=0, only_tagnames=False):
    tagname = ET.QName(elem).localname if only_tagnames else elem.tag

    print(' ' * 5 * level, level, tagname)
    for child in elem:
        print_tags_tree(child, level + 1, only_tagnames)

print_tags_tree(root, only_tagnames=True)
Python Console Session
 0 mediawiki
      1 siteinfo
           2 sitename
           2 dbname
           2 base
           2 generator
           2 case
           2 namespaces
                3 namespace
                3 namespace
                3 namespace
                3 namespace
                3 namespace
                3 namespace
                3 namespace
                3 namespace
                3 namespace
                3 namespace
                3 namespace
                3 namespace
                3 namespace
                3 namespace
                3 namespace
                3 namespace
                3 namespace
                3 namespace
                3 namespace
                3 namespace
                3 namespace
                3 namespace
                3 namespace
                3 namespace
                3 namespace
                3 namespace
                3 namespace
                3 namespace
                3 namespace
                3 namespace
                3 namespace
                3 namespace
      1 page
           2 title
           2 ns
           2 id
           2 revision
                3 id
                3 parentid
                3 timestamp
                3 contributor
                     4 username
                     4 id
                3 comment
                3 origin
                3 model
                3 format
                3 text
                3 sha1

Better! This gives us a clear view of the structure:

The root element is mediawiki.

  • mediawiki has two children:
    • siteinfo
      • Contains information about the wiki site (e.g., its sitename and the list of Wiki namespaces).
    • page
      • Contains the most important information for this project.
      • page has four direct children: title, ns, id, and revision.
        • title contains the title of the page
          • For example, schön, Flexion:schön.
        • ns contains the Wiki namespace, which should not be confused with XML namespaces!
          • For example, 0, 108.
        • id is the unique identifier of the page
          • For example, 2930, 21734.
        • revision contains a revision of the page.
          • Each time a wiki page is modified, a revision element is added. In the XML extraction methods covered here, only the latest revision is retrieved, so we have only one revision element.
          • The raw wikitext is located here in the child element text.
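
When a tree contains many repeated tags, as with the namespace elements above, a count per tag name can be a more compact way to check the structure. The following is a minimal sketch on a hand-written sample that only imitates the shape of the export (it is not real export output); the import fallback assumes the standard library behaves the same for these calls.

```python
from collections import Counter

try:
    import lxml.etree as ET
except ImportError:
    # Assumption: the standard library behaves the same for the calls below
    import xml.etree.ElementTree as ET

# Hand-written sample imitating the shape of the export
sample = b'''<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.11/">
  <siteinfo>
    <namespaces>
      <namespace key="0"/>
      <namespace key="1"/>
      <namespace key="108"/>
    </namespaces>
  </siteinfo>
  <page>
    <title>sch\xc3\xb6n</title>
    <revision><text>...</text></revision>
  </page>
</mediawiki>'''

sample_root = ET.fromstring(sample)

# iter() walks the element itself and all descendants in document order
counts = Counter(
    elem.tag.rpartition('}')[2]      # keep only the part after '}'
    for elem in sample_root.iter()
    if isinstance(elem.tag, str)     # skip nodes whose tag is not a string, such as comments
)

for name, n in counts.most_common():
    print(n, name)
```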

The main goal of this section is to extract the page, title, ns, and text elements.

But first, let us briefly discuss XML namespaces. If you are already familiar with XML namespaces, feel free to skip this part.

XML Namespaces Overview

In XML, tag names and attributes are user-defined, which can lead to name conflicts when combining data from different XML files. To avoid these conflicts, XML uses a system of namespaces and prefixes. Each namespace is typically defined using a URI.

Namespaces are often declared in the root element of the XML, but they can also be declared in child elements. To identify namespaces in an XML document, look for attributes named xmlns or of the form xmlns:prefix.

  • xmlns without a prefix: This denotes the default namespace, applying to the element where it appears and all its descendants (unless overridden).
  • xmlns:prefix: This is a prefixed namespace. It applies only to elements that explicitly use the prefix.
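
The effect of both kinds of declaration can be seen by parsing a small document and printing the resulting tags. The example.org URIs and the element names below are made up for this sketch, and the import fallback assumes the standard library behaves the same for these calls.

```python
try:
    import lxml.etree as ET
except ImportError:
    # Assumption: the standard library behaves the same for the calls below
    import xml.etree.ElementTree as ET

# Made-up document with one default and one prefixed namespace
xml = b'''<library xmlns="http://example.org/books"
         xmlns:m="http://example.org/meta">
  <book m:id="42">
    <title>Faust</title>
    <m:note>classic</m:note>
  </book>
</library>'''

library = ET.fromstring(xml)

# Unprefixed elements land in the default namespace,
# the m:-prefixed element in the other one
tags = [elem.tag for elem in library.iter()]
for tag in tags:
    print(tag)

# Prefixed attributes carry their namespace in the key as well
book = library[0]
print(dict(book.attrib))
```

Only the m:note element and the m:id attribute end up in the http://example.org/meta namespace; every unprefixed element inherits the default one.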

For example, in our Wiki XML, we see two namespaces defined at the root element:

XML
<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.11/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 

This gives us:

  • Default Namespace: xmlns="http://www.mediawiki.org/xml/export-0.11/", which applies to all elements without prefixes.
  • Prefixed Namespace: xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" with the prefix xsi.

An even simpler approach is to use the nsmap property of the element, which is a dictionary mapping prefixes to their respective URIs. The default namespace is stored under the key None.

Python
NAMESPACES = root.nsmap
for key, namespace in NAMESPACES.items():
    print('key:', key, '=> namespace:', namespace)
Python Console Session
key: None => namespace: http://www.mediawiki.org/xml/export-0.11/
key: xsi => namespace: http://www.w3.org/2001/XMLSchema-instance
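
nsmap is specific to lxml. As a hedged alternative that should also work with the standard library's ElementTree, namespace declarations can be collected while parsing by asking iterparse for 'start-ns' events. The sketch below uses a shortened, hand-written root element, not the full export.

```python
import io

try:
    import lxml.etree as ET
except ImportError:
    # Assumption: the standard library behaves the same for the calls below
    import xml.etree.ElementTree as ET

# Shortened, hand-written stand-in for the export
xml_bytes = (
    b'<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.11/" '
    b'xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">'
    b'<page/></mediawiki>'
)

namespaces = {}
for _event, (prefix, uri) in ET.iterparse(io.BytesIO(xml_bytes), events=('start-ns',)):
    # Normalize the prefix of the default namespace to an empty string
    namespaces[prefix or ''] = uri

for prefix, uri in namespaces.items():
    print(repr(prefix), '=>', uri)
```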

Extracting Data

The find Method

To extract elements from the XML, we will use the find method, which returns the first child element matching a given tag name or path, or None if nothing matches.

Note that the following code will fail to find the page element and will return None. Every element in this document belongs to the default namespace, so the full name of the element is {http://www.mediawiki.org/xml/export-0.11/}page. A bare 'page' asks for an element named page that belongs to no namespace, and there is none.

Python
# This will not work, because it lacks the required namespace
page = root.find('page')
print(page) # None
Python Console Session
None

You can specify the namespace in two ways:

  • Using the full {namespace}tagname notation
  • Passing a namespace dictionary as an argument to find

The following code will successfully retrieve the page element using each method:

Python
# Full notation 
page = root.find('{http://www.mediawiki.org/xml/export-0.11/}page')
print('Full notation:', page)

# Namespace dictionary
NAMESPACES = {None: 'http://www.mediawiki.org/xml/export-0.11/'}
page = root.find('page', NAMESPACES)
print('Namespace dictionary:', page)
Python Console Session
Full notation: <Element {http://www.mediawiki.org/xml/export-0.11/}page at 0x1a68d0eb440>
Namespace dictionary: <Element {http://www.mediawiki.org/xml/export-0.11/}page at 0x1a68d0eb440>

We will stick to the namespace dictionary to avoid writing the URL each time we need to use the find method.
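
A variation on the namespace dictionary is to bind the URI to a prefix of our own choosing and to use that prefix in the path. The prefix mw below is arbitrary and does not appear in the document itself. This form is sketched here because XPath expressions have no notion of a default namespace, so a prefix is what you would need there; the import fallback assumes the standard library behaves the same for these calls.

```python
try:
    import lxml.etree as ET
except ImportError:
    # Assumption: the standard library behaves the same for the calls below
    import xml.etree.ElementTree as ET

MW_URI = 'http://www.mediawiki.org/xml/export-0.11/'

# Hand-written stand-in for the export
xml_bytes = (
    f'<mediawiki xmlns="{MW_URI}"><page><title>schön</title></page></mediawiki>'
).encode('utf-8')
doc = ET.fromstring(xml_bytes)

# 'mw' is an arbitrary prefix chosen for this sketch
NS = {'mw': MW_URI}

page_elem = doc.find('mw:page', NS)
# Every step of a path needs the prefix, not just the first one
title_elem = doc.find('mw:page/mw:title', NS)

print(page_elem.tag)
print(title_elem.text)
print(doc.find('page'))  # still None without a namespace
```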

Now that we have successfully located the page element, we can retrieve its child elements ns and title.

Python
# Accessing Wiki namespace
ns = page.find('ns', NAMESPACES)
print(ns)
print(ns.text) # '0'

# Accessing the title of the page
title = page.find('title', NAMESPACES)
print(title)
print(title.text) # schön
Python Console Session
<Element {http://www.mediawiki.org/xml/export-0.11/}ns at 0x1a68d990740>
0
<Element {http://www.mediawiki.org/xml/export-0.11/}title at 0x1a68d992980>
schön
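
When only the text of a child is needed, findtext combines find and .text in one call and accepts a default for missing elements. Here is a minimal sketch on a hand-written page element; the prefix mw and the child name missing are invented for the illustration, and the import fallback assumes the standard library behaves the same for these calls.

```python
try:
    import lxml.etree as ET
except ImportError:
    # Assumption: the standard library behaves the same for the calls below
    import xml.etree.ElementTree as ET

MW_URI = 'http://www.mediawiki.org/xml/export-0.11/'
NS = {'mw': MW_URI}  # 'mw' is an arbitrary prefix chosen for this sketch

# Hand-written stand-in for a page element
sample_page = ET.fromstring(
    f'<page xmlns="{MW_URI}"><title>schön</title><ns>0</ns></page>'.encode('utf-8')
)

# findtext returns the text of the first match directly
page_title = sample_page.findtext('mw:title', namespaces=NS)

# The text is always a string, so convert it when a number is wanted
wiki_ns = int(sample_page.findtext('mw:ns', namespaces=NS))

# A default avoids handling None for children that are absent
absent = sample_page.findtext('mw:missing', default='', namespaces=NS)

print(page_title, wiki_ns, repr(absent))
```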

Finally, we want to retrieve the main content, or wikitext, which is stored in the text element.

Note that we cannot use page.find('text', NAMESPACES) directly because text is not a direct child of page; it is nested under revision.

Python
# Print the tree of page, to find the path 
print_tags_tree(page, only_tagnames=True)
Python Console Session
 0 page
      1 title
      1 ns
      1 id
      1 revision
           2 id
           2 parentid
           2 timestamp
           2 contributor
                3 username
                3 id
           2 comment
           2 origin
           2 model
           2 format
           2 text
           2 sha1

Fortunately, the find method allows us to specify the path to a nested tag. In this case, we specify the path revision/text from page to text:

Python
wikitext = page.find('revision/text', NAMESPACES)
print(wikitext)
# Let's print the first 300 characters of the wikitext
print(wikitext.text[:300])
Python Console Session
<Element {http://www.mediawiki.org/xml/export-0.11/}text at 0x1a68d973c00>
{{Siehe auch|[[schon]]}}
{{Wort der Woche|26|2007}}
== schön ({{Sprache|Deutsch}}) ==
=== {{Wortart|Adjektiv|Deutsch}} ===

{{Deutsch Adjektiv Übersicht
|Positiv=schön
|Komparativ=schöner
|Superlativ=schönsten
|Bild 1=Jaguar E-type (serie III).jpg|mini|1|ein ''schönes'' [[Auto]]
|Bild 2=12er Anitra 

We are done; we have just retrieved the wikitext content from the XML string.
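
As a recap, the steps of this section can be bundled into one function. This is a sketch under stated assumptions rather than a definitive implementation: the name extract_page and the prefix mw are hypothetical, the namespace URI is hard-coded to export version 0.11 as seen above, and returning None when no page element is present is our own design choice. The import fallback assumes the standard library behaves the same for these calls.

```python
try:
    import lxml.etree as ET
except ImportError:
    # Assumption: the standard library behaves the same for the calls below
    import xml.etree.ElementTree as ET

# Assumption: the export uses schema version 0.11, as in the output above
MW_NS = {'mw': 'http://www.mediawiki.org/xml/export-0.11/'}

def extract_page(xml_bytes):
    """Return title, Wiki namespace and wikitext of the first page, or None."""
    root = ET.fromstring(xml_bytes)
    page = root.find('mw:page', MW_NS)
    if page is None:
        # Assumption: an export may come back without any <page> element
        return None
    return {
        'title': page.findtext('mw:title', namespaces=MW_NS),
        'ns': int(page.findtext('mw:ns', namespaces=MW_NS)),
        'wikitext': page.findtext('mw:revision/mw:text', default='', namespaces=MW_NS),
    }

# Hand-written sample imitating the shape of the export
sample = (
    '<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.11/">'
    '<page><title>schön</title><ns>0</ns>'
    '<revision><text>== schön ({{Sprache|Deutsch}}) ==</text></revision>'
    '</page></mediawiki>'
).encode('utf-8')

info = extract_page(sample)
print(info['title'], info['ns'])
print(info['wikitext'])
```

With the fetch function from the beginning of this section, the call would then read extract_page(fetch('schön')).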