A Hands-On Workshop on Parsing Wikitext¶
PyCon Austria
6 and 7 April 2025 | Eisenstadt
In this workshop, you will learn how to fetch, parse, and extract data from the German Wiktionary to gather linguistic information such as parts of speech, inflections, example sentences, and definitions. You will also learn where to find the data and how to process it using HTTP requests, as well as XML and Wikitext parsers.
Fetching XML data
- Special Exports - go to Google Colab
- Dump files - go to guide
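As a quick preview of the Special Export approach, the sketch below fetches the XML export of a single German Wiktionary page with requests. This is not the official workshop notebook; the page title "Haus" and the variable names are placeholders.

```python
# Minimal sketch: fetch the XML export of one page from the German Wiktionary.
# "Haus" is just an example page title.
import requests

title = "Haus"
url = f"https://de.wiktionary.org/wiki/Special:Export/{title}"

response = requests.get(url, timeout=30)
response.raise_for_status()

xml_text = response.text   # XML document containing the page's wikitext
print(xml_text[:500])      # preview the first 500 characters
```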
Parsing XML
- Parsing XML from Special Export - go to Google Colab
- Parsing XML from Dump file - go to Google Colab
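Once you have the export, the page's wikitext sits inside a text element of the XML document. The following is a minimal sketch with lxml, assuming `xml_text` holds the XML string from the previous step; it matches on local tag names so it works regardless of the export namespace version.

```python
# Minimal sketch: pull the wikitext out of a Special:Export XML document.
from lxml import etree

root = etree.fromstring(xml_text.encode("utf-8"))

# Find the <text> element that holds the page's wikitext.
text_elements = root.xpath("//*[local-name()='text']")

wikitext = ""
if text_elements and text_elements[0].text:
    wikitext = text_elements[0].text

print(wikitext[:500])
```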
Parsing Wikitext
- Wiki Basics - go to guide
- Parsing Wikitext - go to Google Colab
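With the wikitext in hand, mwparserfromhell can turn it into a node tree for extracting templates, links, and plain text. The following is a rough sketch, assuming `wikitext` holds the page source extracted above:

```python
# Minimal sketch: inspect the wikitext of a Wiktionary entry.
import mwparserfromhell

parsed = mwparserfromhell.parse(wikitext)

# List the templates used on the page, e.g. part-of-speech or inflection templates.
for template in parsed.filter_templates():
    print(template.name.strip_code().strip())

# Plain text with markup stripped, useful for definitions and example sentences.
print(parsed.strip_code()[:500])
```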
Using Google Colab¶
In this workshop, we will use Google Colab, a free, cloud-based platform for running Python code in a Jupyter Notebook environment.
Follow these steps to get started:
- Make sure you are signed in with your Google account.
- Ensure that you have an active Internet connection on your device.
- You can follow along and run the code on a tablet, but if you would like to edit the code during the workshop, a laptop will give you a smoother experience.
- To access the workshop material, simply click the links in the table of contents above.
Running on Your Own Machine¶
If you prefer not to use Google Colab but still want to participate in the workshop, you can download the source code and run it on your local machine.
To get started, make sure you install the necessary dependencies for the workshop:
requests, lxml, and mwparserfromhell
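If you are setting up a local environment, these can typically be installed in one step with pip (adapt this to your preferred environment or virtual environment workflow):

```text
pip install requests lxml mwparserfromhell
```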
Download Materials of the Workshop¶
You can download a zip file containing all the workshop materials, including:
- Jupyter Notebooks
- Python files
- Sample data