From Dump File
As in the previous section, we begin by importing the `lxml.etree` module.
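A one-line sketch, assuming the `ET` alias that the rest of this section relies on:

```python
from lxml import etree as ET
```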
Setting Up Paths
To follow along in this section:

- You will need to download and decompress the Wiktionary dump file.
- Once you have done that, specify the path to the decompressed file in `XML_FILE`.
- By the end of this section, we will save our result as a dictionary and store it locally.
- Therefore, do not forget to specify the folder where the dictionary should be saved in `DICT_PATH`.
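A sketch of that setup; both values are placeholders, and the exact file name depends on the dump you downloaded:

```python
# Placeholder paths: adjust both to your own setup.
XML_FILE = "data/enwiktionary-latest-pages-articles.xml"  # decompressed dump file
DICT_PATH = "data"  # folder where the pickled dictionary will be saved
```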
Parsing the XML File
Since we are working with a file, we cannot use the `ET.fromstring` function to parse the XML content. Instead, we must use the `ET.parse` function.
Note that this process can take some time. On my computer, it takes approximately 42 seconds to load the entire XML tree.
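A sketch of the parsing step; the timing code is illustrative and just one way to measure the load:

```python
import time

start = time.time()

# Parse the file from disk and grab the root element.
tree = ET.parse(XML_FILE)
root = tree.getroot()

print(f"Loaded in {time.time() - start:.0f} s")
```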
The parser returns an `ElementTree` object. We use the `getroot()` method to access the root `Element`.
Displaying the XML Structure
The XML structure of the dump file is quite large, so printing the entire tree would not only be inefficient but also quite overwhelming. To make it more manageable, let us modify our `print_tags_tree` function.
We will add options to limit the number of children displayed for the root element and to control the depth of the tree.
Here is our updated `print_tags_tree` function.
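A sketch of one possible implementation; the parameter names `max_children` and `max_depth` are assumptions, chosen to match the two options described above:

```python
def print_tags_tree(element, level=0, max_children=None, max_depth=None):
    # Print the tag without its namespace prefix ('{...}tag') to keep the output readable.
    print('    ' * level + element.tag.split('}')[-1])
    # Stop descending once the requested depth is reached.
    if max_depth is not None and level >= max_depth:
        return
    children = list(element)
    # Limit how many direct children of the root element are shown.
    if max_children is not None and level == 0:
        children = children[:max_children]
    for child in children:
        print_tags_tree(child, level + 1, max_children, max_depth)
```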
To display only the first 5 direct children of the root element and limit the tree to the first level:
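```python
# Assumes the print_tags_tree sketch above.
print_tags_tree(root, max_children=5, max_depth=1)
```

In a MediaWiki export, the root `mediawiki` element begins with a single `siteinfo` child followed by the `page` elements, so this should print `mediawiki`, `siteinfo`, and four `page` tags.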
To view the first 3 children of the root element and display two levels of the tree:
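```python
# Again assuming the sketch above.
print_tags_tree(root, max_children=3, max_depth=2)
```

The second level should reveal the children of `siteinfo` (such as `sitename` and `namespaces`) and of each `page` (such as `title`, `ns`, `id`, and `revision`).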
Extracting Data
As in the previous section, we are interested in extracting the `page`, `title`, `ns`, and `text` tags.
The main difference in structure here is that we now have multiple `page` elements, and we want to extract all of them.
We cannot use `find`, because it returns only the first `page`. Instead, we can use the `findall` method, which returns a list of all `page` elements.
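A sketch of that lookup; deriving the namespace from the root's own tag avoids hard-coding the export version in the URI:

```python
# Tags in the dump are namespaced (e.g. '{http://www.mediawiki.org/xml/export-0.11/}page'),
# so we reuse the namespace prefix from the root's own tag.
ns = root.tag[:root.tag.index('}') + 1]

pages = root.findall(f"{ns}page")
print(len(pages))
```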
Notice that the latest dump file version contains more than one million pages, and retrieving them all takes approximately 45 seconds.
Since retrieving all pages is time-consuming, we will store the relevant information locally in a dictionary and save it as a pickle file for quicker access in the future.
We will create a dictionary, `dict_0`, using page titles as keys and their wikitext as values. Additionally, we will restrict the pages we store to those within the main Wiki namespace (`'0'`). We will discuss Wiki namespaces further when we parse wikitext.
This process may take a couple of minutes!
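A sketch of the loop; aside from `dict_0`, the names are assumptions, and the `revision/text` path follows the nesting of a MediaWiki export:

```python
dict_0 = {}

for page in pages:
    # Keep only entries in the main Wiki namespace ('0').
    if page.find(f"{ns}ns").text == '0':
        title = page.find(f"{ns}title").text
        # The wikitext itself sits in the 'text' tag nested inside 'revision'.
        dict_0[title] = page.find(f"{ns}revision/{ns}text").text
```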
To check that our dictionary is correctly populated, let us print out part of the wikitext for a sample page:
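```python
# 'dictionary' is only an illustrative title; any main-namespace entry works.
print(dict_0['dictionary'][:500])
```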
Saving the Dictionary Locally
Once the dictionary is built, we save it locally using the `pickle` module, which allows us to store the dictionary in a serialized format. This way, we will not need to parse the XML file again in the future.
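A sketch of the save step; the file name `dict_0.pickle` is an assumption:

```python
import os
import pickle

# Serialize the dictionary so we never have to parse the XML again.
with open(os.path.join(DICT_PATH, 'dict_0.pickle'), 'wb') as file:
    pickle.dump(dict_0, file)
```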
Loading the Dictionary
The next time you need to retrieve wikitext, simply load the dictionary from the pickle file and look up the page title you need!
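A sketch mirroring the save step, with the same assumed file name and sample title:

```python
import os
import pickle

DICT_PATH = "data"  # the same placeholder folder used when saving

with open(os.path.join(DICT_PATH, 'dict_0.pickle'), 'rb') as file:
    dict_0 = pickle.load(file)

# Look up any page by its title, e.g. the illustrative sample from before.
print(dict_0['dictionary'][:500])
```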
And we are done! Now we can retrieve any wikitext by the page title.
Next, we will cover how to parse wikitext.