From Special Export
We will use the `fetch` function described in our earlier tutorial on Special Export, provided here for reference.
The `fetch()` function
Let's fetch the XML content for the page titled 'schön' as an example.
Parsing the XML content
Now that we have retrieved the XML content, we will use `lxml.etree` to parse it.
To parse an XML string, which is what `fetch` returns, we will use the `fromstring` function. Later we will use the `parse` function to parse an XML file.
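A minimal sketch of this step. The XML string is a short hand-written stand-in for what `fetch` returns (an assumption, not real export output), and the import falls back to the standard library's `xml.etree.ElementTree`, which offers the same `fromstring` call, in case `lxml` is not installed.

```python
try:
    from lxml import etree as ET
except ImportError:  # stdlib fallback; same API for the calls used here
    import xml.etree.ElementTree as ET

# Hand-written stand-in for the string returned by fetch (an assumption,
# much shorter than a real export).
xml_content = (
    '<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.11/">'
    "<page><title>schön</title><ns>0</ns></page>"
    "</mediawiki>"
)

root = ET.fromstring(xml_content)  # parse the string into an Element
print(root.tag)  # the namespace URI in braces, followed by the tag name
```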
`ET.fromstring` returns an `Element` object with several useful properties.
From an `Element` object, you can extract:
- its tag `<tag_name> ... </tag_name>`, using `Element.tag`
- its attributes `<tag_name attrib1="value1" attrib2="value2"> ...`, using `Element.attrib`
- its text `<tag_name attrib1="value1"> some text </tag_name>`, using `Element.text`
Let us create some dummy XML content to illustrate these:
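A minimal sketch with made-up content: the tag names, attributes, and text below are invented for illustration and do not come from the Wiki export.

```python
try:
    from lxml import etree as ET
except ImportError:  # stdlib fallback; same API for the calls used here
    import xml.etree.ElementTree as ET

# Invented dummy document (not from the Wiki export).
xml_dummy = (
    "<root>"
    '<child id="1" lang="de">Hallo</child>'
    '<child id="2">Welt</child>'
    "</root>"
)

root = ET.fromstring(xml_dummy)
for child in root:  # iterating an Element yields its direct children
    # tag is a string, attrib behaves like a dict, text is the inner text
    print(child.tag, dict(child.attrib), child.text)
```

Here `attrib` is converted with `dict()` only to get a plain dictionary when printing.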
Displaying XML Structure
XML data can often be large and complex, especially when deeply nested, which makes understanding its structure difficult.
To help with this, let us create a helper function to display the XML tags in a tree-like format. Since we do not know how deep the XML structure might go, the best strategy here is to use recursion as follows:
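One way such a recursive helper could look. The signature, the `level` parameter, and the four-space indentation are assumptions made for this sketch.

```python
try:
    from lxml import etree as ET
except ImportError:  # stdlib fallback; same API for the calls used here
    import xml.etree.ElementTree as ET


def print_tags_tree(elem, level=0):
    """Print elem's tag, then recurse into its children one level deeper."""
    print("    " * level + elem.tag)
    for child in elem:
        print_tags_tree(child, level + 1)


# Tiny invented document, just to show the indentation.
demo = ET.fromstring("<a><b><c/></b><d/></a>")
print_tags_tree(demo)
```

The recursion stops on its own at elements that have no children, so no explicit depth limit is needed.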
Let us try it:
The output of `print_tags_tree` shows tags in a format that combines:
- The namespace URI (e.g., `{http://www.mediawiki.org/xml/export-0.11/}`)
- The tag name (e.g., `mediawiki`, `page`)
Although knowing the namespace is important, as we will discover later, it makes the tree look very cluttered.
To address this, let us modify the helper function to allow printing only the tag names in the tree.
We will use the `QName` class, which splits the tag of an element into its tag name and its namespace. Here is an example using `QName`:
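A minimal sketch, assuming a shortened hand-written document in place of the fetched XML. `QName` is specific to `lxml`, so this block needs `lxml` installed.

```python
from lxml import etree as ET

# Shortened stand-in for the fetched XML (an assumption).
root = ET.fromstring(
    '<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.11/">'
    "<page/></mediawiki>"
)

qname = ET.QName(root)   # QName accepts an Element directly
print(qname.namespace)   # the namespace URI, without braces
print(qname.localname)   # the bare tag name
print(qname.text)        # the combined {namespace}tagname form
```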
Now, let us modify the `print_tags_tree` function to provide the option of printing only tag names when the `only_tagnames` parameter is set to `True`.
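A possible version of the modified helper. Only the `only_tagnames` parameter is named in the text. The `level` parameter, the indentation width, and the sample document are assumptions of this sketch.

```python
from lxml import etree as ET


def print_tags_tree(elem, level=0, only_tagnames=False):
    """Print the tag tree; drop the namespace part when only_tagnames is True."""
    tag = ET.QName(elem).localname if only_tagnames else elem.tag
    print("    " * level + tag)
    for child in elem:
        print_tags_tree(child, level + 1, only_tagnames)


# Shortened stand-in for the fetched XML (an assumption).
root = ET.fromstring(
    '<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.11/">'
    "<page><title>schön</title></page></mediawiki>"
)
print_tags_tree(root, only_tagnames=True)
```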
Better! This gives us a clear view of the structure:
- The root element is `mediawiki`.
- `mediawiki` has two children:
    - `siteinfo`: Contains information about the domain (e.g., its sitename, Wiki namespace).
    - `page`: Contains the most important information for this project.
- `page` has four direct children: `title`, `ns`, `id`, and `revision`.
    - `title` contains the title of the page.
        - For example, `schön`, `Flexion:schön`.
    - `ns` contains the Wiki namespace, which should not be confused with XML namespaces!
        - For example, `0`, `108`.
    - `id` is the unique identifier of the page.
        - For example, `2930`, `21734`.
    - `revision` contains a revision of the page.
        - Each time a wiki page is modified, a `revision` element is added. In the XML extraction methods covered here, only the latest revision is retrieved, so we have only one `revision` element.
        - The raw wikitext is located here in the child element `text`.
The main goal of this section is to extract the `page`, `title`, `ns`, and `text` elements.
But first, let us briefly discuss XML namespaces. If you are already familiar with XML namespaces, feel free to skip this part.
XML Namespaces Overview
In XML, tag names and attributes are user-defined, which can lead to name conflicts when combining data from different XML files. To avoid these conflicts, XML uses a system of namespaces and prefixes. Each namespace is typically defined using a URI.
Namespaces are often declared in the root element of the XML, but not always: they can also be declared in child elements. To identify namespaces in an XML document, look for attributes named `xmlns` or `xmlns:prefix`.
- `xmlns` without a prefix: This denotes the default namespace, applying to the element where it appears and all its unprefixed descendants (unless overridden).
- `xmlns:prefix`: This is a prefixed namespace. It applies only to elements that explicitly use the prefix.
For example, in our Wiki XML, we see two namespaces defined at the root element:
This gives us:
- Default Namespace: `xmlns="http://www.mediawiki.org/xml/export-0.11/"`, which applies to all elements without prefixes.
- Prefixed Namespace: `xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"` with the prefix `xsi`.
An even simpler approach is to use the `nsmap` attribute, which provides a dictionary mapping prefixes to their respective URIs. The default namespace is stored under the key `None`.
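A minimal sketch of reading `nsmap`, assuming a root element that carries only the two namespace declarations discussed above (the real root element has further attributes). `nsmap` is specific to `lxml`.

```python
from lxml import etree as ET

# Trimmed stand-in for the root element of the export (an assumption).
root = ET.fromstring(
    '<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.11/" '
    'xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">'
    "<page/></mediawiki>"
)

# nsmap is a plain attribute (no call parentheses); it maps each prefix
# in scope to its URI, with None as the key of the default namespace.
for prefix, uri in root.nsmap.items():
    print(prefix, uri)
```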
Extracting Data
The `find` Method
To extract elements from the XML, we will use the `find` method, which returns the first subelement matching a specified tag name or path.
Note that the following code will fail to find the `page` element and will return `None`. This is because `lxml` matches elements by their fully qualified name (namespace plus tag name), and `page` here belongs to the document's default namespace.
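A minimal sketch of the failing lookup, using a shortened hand-written document in place of the fetched XML (an assumption). The import falls back to the standard library's `ElementTree`, which behaves the same way here.

```python
try:
    from lxml import etree as ET
except ImportError:  # stdlib fallback; same API for the calls used here
    import xml.etree.ElementTree as ET

# Shortened stand-in for the fetched XML (an assumption).
root = ET.fromstring(
    '<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.11/">'
    "<page><title>schön</title></page></mediawiki>"
)

# A bare 'page' only matches an element that is in no namespace at all.
page = root.find("page")
print(page)  # None
```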
You can specify the namespace in two ways:
- Using the full `{namespace}tagname` notation
- Passing a namespace dictionary as an argument to `find`

The following code will successfully retrieve the `page` element using each method:
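A minimal sketch of both options on a shortened stand-in document. The `mw` prefix and the `MW_URI` constant are arbitrary choices for this sketch; the tutorial's own `NAMESPACES` dictionary may be defined differently.

```python
try:
    from lxml import etree as ET
except ImportError:  # stdlib fallback; same API for the calls used here
    import xml.etree.ElementTree as ET

MW_URI = "http://www.mediawiki.org/xml/export-0.11/"  # name chosen for this sketch

# Shortened stand-in for the fetched XML (an assumption).
root = ET.fromstring(
    '<mediawiki xmlns="' + MW_URI + '">'
    "<page><title>schön</title></page></mediawiki>"
)

# Option 1: spell out the namespace in braces before the tag name.
page_full = root.find("{" + MW_URI + "}page")

# Option 2: map a prefix to the URI and use prefix:tagname in the path.
NAMESPACES = {"mw": MW_URI}
page_dict = root.find("mw:page", NAMESPACES)

print(page_full.tag)
print(page_dict.tag)
```

The prefix used in the dictionary does not have to match any prefix in the document; only the URI it maps to matters.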
We will stick to the namespace dictionary to avoid writing the URL each time we need to use the `find` method.
Now that we have successfully located the `page` element, we can retrieve its child elements `ns` and `title`.
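A minimal sketch of this step, again on a hand-written stand-in document and with an explicit `mw` prefix (both are assumptions of the sketch).

```python
try:
    from lxml import etree as ET
except ImportError:  # stdlib fallback; same API for the calls used here
    import xml.etree.ElementTree as ET

NAMESPACES = {"mw": "http://www.mediawiki.org/xml/export-0.11/"}

# Shortened stand-in for the fetched XML (an assumption).
root = ET.fromstring(
    '<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.11/">'
    "<page><title>schön</title><ns>0</ns><id>2930</id></page>"
    "</mediawiki>"
)

page = root.find("mw:page", NAMESPACES)
ns = page.find("mw:ns", NAMESPACES)        # direct child of page
title = page.find("mw:title", NAMESPACES)  # direct child of page

print(ns.text)     # element text is always a string, here '0'
print(title.text)
```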
Finally, we want to retrieve the main content, or wikitext, which is stored in the `text` element.
Note that we cannot use `page.find('text', NAMESPACES)` directly because `text` is not a direct child of `page`; it is nested under `revision`.
Fortunately, the `find` method allows us to specify the path to a nested tag. In this case, we specify the path `revision/text` from `page` to `text`:
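A minimal sketch contrasting the direct lookup with the path lookup. The document and its wikitext are invented stand-ins, and the explicit `mw` prefix is an assumption of the sketch, so the path is written `mw:revision/mw:text`.

```python
try:
    from lxml import etree as ET
except ImportError:  # stdlib fallback; same API for the calls used here
    import xml.etree.ElementTree as ET

NAMESPACES = {"mw": "http://www.mediawiki.org/xml/export-0.11/"}

# Shortened stand-in for the fetched XML; the wikitext is invented.
root = ET.fromstring(
    '<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.11/">'
    "<page><title>schön</title>"
    "<revision><id>1</id><text>== schön ==</text></revision>"
    "</page></mediawiki>"
)
page = root.find("mw:page", NAMESPACES)

# text is a grandchild of page, so a direct lookup finds nothing.
direct = page.find("mw:text", NAMESPACES)
print(direct)  # None

# Each step of the path needs its own prefix.
text_elem = page.find("mw:revision/mw:text", NAMESPACES)
wikitext = text_elem.text
print(wikitext)
```

Calling `.text` on the result of the direct lookup would raise an `AttributeError`, because that result is `None`.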
We are done; we have just retrieved the wikitext content from the XML string.