fetch

de_wiktio.fetch

This module provides methods to fetch and parse XML files from the Wiktionary domain.

Classes:

  • WikiDump

    This class provides methods to parse and process the XML dump file. It also creates and loads dictionaries of title-wikitext pairs.

  • PageExport

    This class provides methods to fetch and parse the XML content of a Wiktionary page and to extract the wikitext using the export tool (Spezial:Exportieren).

Functions:

  • fetch_page_Action_API

    Fetch online and return the XML content of a Wiktionary page for the given title using the Action API.

  • print_tags_tree

    Print the tree structure of an XML element, with options for customization.

Classes

WikiDump(xml_path=None)

This class provides methods to parse and process the XML dump file. It also creates and loads dictionaries of title-wikitext pairs.

Parameters:

  • xml_path (str, default: None ) –

    Path to the XML dump file to be processed. If None, the path indicated in Settings will be used.

Methods:

  • pages_by_ns

    Retrieve pages matching the Wiki namespace ns.

  • create_dict_by_ns

    Create a dictionary with titles as keys and the corresponding wikitext as values and save it to a pickle file.

  • load_wikidict_by_ns

    Load a dictionary with page titles as keys and their corresponding wikitext as values from a pickle file.

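Attributes:

  • settings (Settings) –

    The Settings object.

  • xml_path (Path) –

    Path to the XML dump file to be processed.

  • tree (_ElementTree) –

    The lxml tree object from the XML file.

  • root (Element) –

    The root element of the tree.

  • namespaces (Dict[str, str]) –

    Dictionary of XML namespaces of the root element.

  • pages (List[Element]) –

    List of all page elements from the XML file.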

Source code in de_wiktio\fetch.py
def __init__(self, xml_path: str = None):
    """
    WikiDump object constructor.

    Args:
        xml_path: Path to the XML dump file to be processed. If `None`, the path indicated in Settings will be used. 
    """
    if xml_path is None:
        xml_path = WikiDump.settings.get('XML_FILE')
        if xml_path is None:
            raise ValueError("Path not provided. Please provide a valid path to the XML file or set a valid XML_FILE in Settings")

    if not Path(xml_path).exists():
        raise FileNotFoundError(f"File not found: {xml_path}. Please provide a valid path or set a valid XML_FILE in Settings") 

    self.xml_path = Path(xml_path)

    # Instance attributes docstring 
    self.xml_path: Path
    "Path to the XML dump file to be processed."
Attributes
settings: Settings = Settings() class-attribute instance-attribute

The Settings object.

xml_path: Path instance-attribute

Path to the XML dump file to be processed.

tree: ET._ElementTree property

The lxml tree object from the XML file.

Lazy evaluation. This is a time-consuming operation, so it is only computed when needed.

root: ET.Element property

The root element of the tree.

Lazy evaluation. This is a time-consuming operation, so it is only computed when needed.

namespaces: Dict[str, str] property

Dictionary of XML namespaces of the root element.

pages: List[ET.Element] property

List of all page elements from the XML file.

This includes all pages from all wiki namespaces.

Lazy evaluation. This is a time-consuming operation, so it is only computed when needed.
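
The listing on this page does not show how the lazy properties are implemented. One common way to get this behavior is `functools.cached_property`; the sketch below assumes that approach. The class name `LazyDump` and the `parse_count` counter are hypothetical, and the standard-library `xml.etree.ElementTree` stands in for lxml so the example runs without extra dependencies.

```python
import tempfile
import xml.etree.ElementTree as ET
from functools import cached_property
from pathlib import Path


class LazyDump:
    """Hypothetical stand-in for a dump wrapper; not the de_wiktio class."""

    def __init__(self, xml_path):
        self.xml_path = Path(xml_path)
        self.parse_count = 0  # only here to make the laziness observable

    @cached_property
    def tree(self):
        # Parsing happens on first access only; the result is then cached.
        self.parse_count += 1
        return ET.parse(self.xml_path)

    @cached_property
    def root(self):
        return self.tree.getroot()


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "mini.xml"
    path.write_text("<mediawiki><page/></mediawiki>", encoding="utf-8")
    dump = LazyDump(path)
    count_before = dump.parse_count  # nothing parsed yet
    root_tag = dump.root.tag         # triggers the single parse
    _ = dump.root                    # served from the cache
    count_after = dump.parse_count
```

Creating the object is cheap in this sketch; the file is read once, on the first access to `tree` or `root`.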

Functions
pages_by_ns(ns)

Retrieve pages matching the Wiki namespace ns.

Parameters:

  • ns (str) –

    The Wiki namespace identifier to filter pages (e.g., '0' for content pages, '108' for Flexion pages).

Returns:

  • List[Element]

    A list of page elements.

Source code in de_wiktio\fetch.py
def pages_by_ns(self, ns: str) -> List[ET.Element]:
    """
    Retrieve pages matching the Wiki namespace `ns`. 

    Args:
        ns: The Wiki namespace identifier to filter pages (e.g., '0' for content pages, '108' for Flexion pages).

    Returns:
        A list of page elements.
    """
    elements = []
    for p in self.pages:
        element = p.find('ns', namespaces=self.namespaces)
        # Skip pages without an <ns> child instead of failing on None.
        if element is not None and element.text == ns:
            elements.append(p)
    return elements
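
As a rough illustration of the filtering step, the sketch below runs the same comparison on a tiny in-memory document. It is a minimal sketch under stated assumptions: the sample XML, the namespace URI version, and the helper name `filter_pages_by_ns` are made up for the example, and the standard-library `xml.etree.ElementTree` is used in place of lxml.

```python
import xml.etree.ElementTree as ET

# Assumed export namespace URI; real dumps declare their own schema version.
EXPORT_NS = "http://www.mediawiki.org/xml/export-0.11/"

SAMPLE = (
    '<mediawiki xmlns="' + EXPORT_NS + '">'
    "<page><title>Haus</title><ns>0</ns></page>"
    "<page><title>Flexion:Haus</title><ns>108</ns></page>"
    "<page><title>Hilfe:Start</title><ns>12</ns></page>"
    "</mediawiki>"
)

root = ET.fromstring(SAMPLE)
# The standard library maps the default namespace with the '' key (Python 3.8+);
# lxml's nsmap uses None for the same purpose.
namespaces = {"": EXPORT_NS}


def filter_pages_by_ns(pages, ns):
    """Return the page elements whose <ns> child text equals `ns`."""
    selected = []
    for page in pages:
        ns_elem = page.find("ns", namespaces)
        if ns_elem is not None and ns_elem.text == ns:
            selected.append(page)
    return selected


all_pages = root.findall("page", namespaces)
flexion_titles = [
    p.find("title", namespaces).text for p in filter_pages_by_ns(all_pages, "108")
]
```

Note that `ns` is compared as a string, so `'108'` matches while the integer `108` would not.
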
create_dict_by_ns(ns, dict_path=None)

Create a dictionary with titles as keys and the corresponding wikitext as values and save it to a pickle file.

Parameters:

  • ns (str) –

    The Wiki namespace identifier to filter pages (e.g., '0' for content pages, '108' for Flexion pages).

  • dict_path (str, default: None ) –

    The folder in which the dictionary is saved as 'wikidict_{ns}.pkl'. If not provided, the folder indicated in Settings is used.

Returns:

  • Dict[str, str]

    A dictionary with page titles as keys and their corresponding wikitext as values.

Source code in de_wiktio\fetch.py
def create_dict_by_ns(self, ns: str, dict_path: str = None) -> Dict[str, str]:
    """
    Create a dictionary with titles as keys and the corresponding *wikitext* as values and save it to a pickle file.

    Args:
        ns: The Wiki namespace identifier to filter pages (e.g., `'0'` for content pages, `'108'` for Flexion pages).
        dict_path: The folder in which the dictionary is saved as 'wikidict_{ns}.pkl'. If not provided, the folder indicated in Settings is used.

    Returns:
        A dictionary with page titles as keys and their corresponding *wikitext* as values.
    """
    if dict_path is None:
        dict_path = WikiDump.settings.get('DICT_PATH')
        if dict_path is None:
            raise ValueError("Path not provided. Please provide a valid path to the dictionary or set a valid DICT_PATH in Settings")

    dict_path = Path(dict_path)

    if not dict_path.exists():
        raise FileNotFoundError(f"Folder not found: {dict_path}. Please provide a valid path or set a valid DICT_PATH in Settings")

    pages = self.pages_by_ns(ns)
    dic = dict()
    for p in pages:
        title = p.find('title', namespaces=self.namespaces)
        wikitext = p.find('revision/text', namespaces=self.namespaces)
        dic[title.text] = wikitext.text

    dict_file = dict_path / f'wikidict_{ns}.pkl'

    with open(dict_file, 'wb') as f:
        pickle.dump(dic, f)
    return dic
load_wikidict_by_ns(file=None, ns='0') classmethod

Load a dictionary with page titles as keys and their corresponding wikitext as values from a pickle file.

Parameters:

  • file (str, default: None ) –

    The path to the pickle file. If None, the file 'wikidict_{ns}.pkl' in the folder indicated in Settings will be used.

  • ns (str, default: '0' ) –

    The Wiki namespace identifier to filter pages (e.g., '0' for content pages, '108' for Flexion pages).

Returns:

  • Dict[str, str]

    A dictionary with page titles as keys and their corresponding wikitext as values.

Raises:

  • FileNotFoundError

    If the file does not exist.

Source code in de_wiktio\fetch.py
@classmethod
def load_wikidict_by_ns(cls, file: str = None, ns: str = '0') -> Dict[str, str]:
    """
    Load a dictionary with page titles as keys and their corresponding *wikitext* as values from a pickle file.

    Args:
        file: The path to the pickle file. If `None`, the file 'wikidict_{ns}.pkl' in the folder indicated in Settings will be used.
        ns: The Wiki namespace identifier to filter pages (e.g., `'0'` for content pages, `'108'` for Flexion pages).

    Returns:
        A dictionary with page titles as keys and their corresponding *wikitext* as values.

    Raises:
        FileNotFoundError: If the file does not exist.
    """
    if file is None:
        dict_path = cls.settings.get('DICT_PATH')
        if dict_path is None:
            raise ValueError("Path not provided. Please provide a valid path to the dictionary or set a valid DICT_PATH in Settings")
        file = Path(dict_path) / f'wikidict_{ns}.pkl'
    else:
        file = Path(file)

    if not file.exists():
        raise FileNotFoundError(f"The file {file} does not exist. Please create it first using the 'create_dict_by_ns' method.")

    with open(file, 'rb') as f:
        dic = pickle.load(f)
    return dic
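
The save and load steps of `create_dict_by_ns` and `load_wikidict_by_ns` boil down to a pickle round trip. The following is a minimal sketch of that round trip using a temporary folder; the sample dictionary contents are invented for illustration and no Settings object is involved.

```python
import pickle
import tempfile
from pathlib import Path

# Invented title -> wikitext pairs standing in for real dump content.
wikidict = {"Haus": "== Haus ({{Sprache|Deutsch}}) ==", "Baum": "== Baum =="}
ns = "0"

with tempfile.TemporaryDirectory() as folder:
    # Same file-name pattern as in the methods above.
    dict_file = Path(folder) / f"wikidict_{ns}.pkl"

    with open(dict_file, "wb") as f:
        pickle.dump(wikidict, f)

    with open(dict_file, "rb") as f:
        loaded = pickle.load(f)

    file_name = dict_file.name
```

Because `pickle.load` can execute arbitrary code while deserializing, it is safest to load only pickle files you created yourself.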

PageExport(title)

This class provides methods to fetch and parse the XML content of a Wiktionary page and to extract the wikitext using the export tool (Spezial:Exportieren).

Parameters:

  • title (str) –

    The title of the Wiktionary page to fetch.

Raises:

  • RequestException

    If the request fails.

Methods:

  • fetch

    Fetch and return the XML content of a Wiktionary page using the export tool.

Attributes:

  • title (str) –

    The title of the Wiktionary page to fetch.

  • xml (bytes) –

    The XML content of the requested Wiktionary page.

  • root (Element) –

    The root element of the tree.

  • namespaces (Dict[str, str]) –

    Dictionary of XML namespaces of the root element.

  • page (Element) –

    The page element.

  • wikitext (str) –

    The wikitext of the page as a string.

  • ns (str) –

    The Wiki namespace of the page as a string.

Source code in de_wiktio\fetch.py
def __init__(self, title: str) -> None:
    """
    Initialize the PageExport class.

    Args:
        title: The title of the Wiktionary page to fetch.

    Raises:
        requests.exceptions.RequestException: If the request fails.
    """
    self.title: str = title
    self.xml = self.fetch()
    self.root = ET.fromstring(self.xml)
    self.namespaces = self.root.nsmap 

    # Instance attributes docstring 
    self.title: str
    "The title of the Wiktionary page to fetch."
    self.xml: bytes
    "The XML content of the requested Wiktionary page."
    self.root: ET.Element
    "The root element of the tree."
    self.namespaces: Dict[str, str]
    "Dictionary of XML namespaces of the root element."
Attributes
title: str instance-attribute

The title of the Wiktionary page to fetch.

xml: bytes instance-attribute

The XML content of the requested Wiktionary page.

root: ET.Element instance-attribute

The root element of the tree.

namespaces: Dict[str, str] instance-attribute

Dictionary of XML namespaces of the root element.

page: ET.Element property

The page element.

wikitext: str property

The wikitext of the page as a string.

If not found, an empty string is returned.

ns: str property

The Wiki namespace of the page as a string.

If not found, an empty string is returned.
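
The source of the `wikitext` and `ns` properties is not shown on this page. The sketch below illustrates one way the documented empty-string fallback could work. The helper `extract_field`, the sample XML, and the namespace URI version are assumptions, and the standard-library parser is used instead of lxml.

```python
import xml.etree.ElementTree as ET

EXPORT_NS = "http://www.mediawiki.org/xml/export-0.11/"  # assumed schema version
NAMESPACES = {"": EXPORT_NS}  # '' maps the default namespace (Python 3.8+)


def extract_field(xml_bytes, path):
    """Return the text at `path` below the first <page>, or '' if missing."""
    root = ET.fromstring(xml_bytes)
    elem = root.find("page/" + path, NAMESPACES)
    if elem is None or elem.text is None:
        return ""
    return elem.text


sample = (
    '<mediawiki xmlns="' + EXPORT_NS + '"><page><title>Haus</title><ns>0</ns>'
    "<revision><text>== Haus ==</text></revision></page></mediawiki>"
).encode("utf-8")
# An export response without any <page> element.
empty = ('<mediawiki xmlns="' + EXPORT_NS + '"/>').encode("utf-8")

wikitext = extract_field(sample, "revision/text")
ns = extract_field(sample, "ns")
missing = extract_field(empty, "revision/text")
```

Returning an empty string rather than raising lets callers treat a missing page and an empty page the same way.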

Functions
fetch()

Fetch and return the XML content of a Wiktionary page using the export tool.

The XML data is retrieved using the following URL:
https://de.wiktionary.org/wiki/Spezial:Exportieren/{self.title}

Returns:

  • bytes

    The XML content of the requested Wiktionary page (response.content).

Raises:

  • RequestException

    If the request fails.

Source code in de_wiktio\fetch.py
def fetch(self) -> bytes:
    """
    Fetch and return the XML content of a Wiktionary page using the export tool.

    The XML data is retrieved using the following URL:
    `https://de.wiktionary.org/wiki/Spezial:Exportieren/{self.title}`


    Returns:
        The XML content of the requested Wiktionary page (`response.content`).

    Raises:
        requests.exceptions.RequestException: If the request fails.
    """
    url = f'https://de.wiktionary.org/wiki/Spezial:Exportieren/{self.title}'
    resp = requests.get(url)
    resp.raise_for_status()
    return resp.content 
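
The method above interpolates the title into the URL directly. Titles containing characters such as `#` or `?` change the meaning of a URL when left unescaped, so a caller may prefer to percent-encode the title first. The helper `export_url` below is a hypothetical sketch of that idea and is not part of `de_wiktio`; it only builds the URL and performs no request.

```python
from urllib.parse import quote

EXPORT_BASE = "https://de.wiktionary.org/wiki/Spezial:Exportieren/"


def export_url(title):
    """Build the export URL, percent-encoding characters that are unsafe in a path."""
    # Keep ':' and '/' readable; other reserved or non-ASCII characters are encoded.
    return EXPORT_BASE + quote(title, safe=":/")


plain = export_url("Haus")
umlaut = export_url("Äpfel")
flexion = export_url("Flexion:Haus")
spaced = export_url("zu Hause")
```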

Functions

fetch_page_Action_API(title)

Fetch online and return the XML content of a Wiktionary page for the given title using the Action API.

The XML data is retrieved from base URL:
https://de.wiktionary.org/w/api.php

Parameters:

  • title (str) –

    The title of the Wiktionary page to fetch.

Returns:

  • bytes

    The XML content of the requested Wiktionary page.

Source code in de_wiktio\fetch.py
def fetch_page_Action_API(title: str) -> bytes:
    """Fetch online and return the XML content of a Wiktionary page for the given title using the Action API.

    The XML data is retrieved from base URL:
    `https://de.wiktionary.org/w/api.php`

    Args:
        title: The title of the Wiktionary page to fetch.

    Returns:
        bytes: The XML content of the requested Wiktionary page.
    """
    url = "https://de.wiktionary.org/w/api.php"

    params = {
        "titles": title,
        "action": "query",
        "export": 1,
        "exportnowrap": 1
    }

    resp = requests.get(url=url, params=params)
    return resp.content
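
To see which request these parameters produce without touching the network, the query string can be built by hand. This is a small sketch using `urllib.parse.urlencode`; the helper name `action_api_url` is hypothetical, and the exact URL that `requests` sends may be encoded slightly differently for titles with special characters.

```python
from urllib.parse import urlencode

API_URL = "https://de.wiktionary.org/w/api.php"


def action_api_url(title):
    """Return the full request URL for the export query of a single title."""
    params = {
        "titles": title,
        "action": "query",
        "export": 1,
        "exportnowrap": 1,
    }
    # urlencode keeps the dictionary's insertion order and stringifies the integers.
    return API_URL + "?" + urlencode(params)


url = action_api_url("Haus")
```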

print_tags_tree(elem, only_tagnames=False, print_attributes=False, print_text=False, max_children=5, max_level=5, _level=0)

Print the tree structure of an XML element, with options for customization.

Parameters:

  • elem (Element) –

    The XML Element object whose tree structure is to be printed.

  • only_tagnames (bool, default: False ) –

    If True, print only the tag name without the namespace.

  • print_attributes (bool, default: False ) –

    If True, print the attributes of each element.

  • print_text (bool, default: False ) –

    If True, print the text content of each element.

  • max_children (int, default: 5 ) –

    The maximum number of children to print for the root element.

  • max_level (int, default: 5 ) –

    The maximum depth of the tree to print.

  • _level (int, default: 0 ) –

    The current recursion level. This is used internally and should not be set by the user.

Source code in de_wiktio\fetch.py
def print_tags_tree(
    elem: ET.Element,
    only_tagnames: bool = False,
    print_attributes: bool = False,
    print_text: bool = False,
    max_children: int = 5,
    max_level: int = 5,
    _level: int = 0,
) -> None:
    """Print the tree structure of an XML element, with options for customization.

    Args:
        elem: The XML `Element` object whose tree structure is to be printed.
        only_tagnames: If True, print only the tag name without the namespace.
        print_attributes: If True, print the attributes of each element.
        print_text: If True, print the text content of each element.
        max_children: The maximum number of children to print for the root element.
        max_level: The maximum depth of the tree to print.
        _level: The current recursion level. This is used internally and should not be set by the user.
    """
    tagname = ET.QName(elem).localname if only_tagnames else elem.tag
    print(" " * 5 * _level, _level, tagname)

    if print_attributes:
        for attr in elem.attrib:
            print(" " * 5 * (_level + 1), attr, "=", elem.attrib[attr])

    if print_text:
        if elem.text is not None and elem.text.strip():
            print(" " * 5 * (_level + 1), elem.text)

    # Restrict depth
    if _level + 1 <= max_level:
        for child_index, child in enumerate(elem):
            print_tags_tree(child,
                print_attributes=print_attributes,
                print_text=print_text,
                only_tagnames=only_tagnames,
                max_children=max_children,
                max_level=max_level,
                _level=_level + 1)
            # Limit number of children of the root element
            if _level == 0 and child_index == max_children - 1:
                break
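
The effect of `max_children` and `max_level` is easiest to see on a small tree. The sketch below is a simplified variant that collects lines instead of printing them, so the result can be inspected. The function name `tag_tree_lines` and the sample XML are hypothetical, the attribute and text options are left out, the line layout is slightly simplified, and the standard-library parser replaces lxml.

```python
import xml.etree.ElementTree as ET


def tag_tree_lines(elem, max_children=5, max_level=5, _level=0):
    """Collect 'level tag' lines, indented five spaces per level."""
    lines = [" " * 5 * _level + f"{_level} {elem.tag}"]
    # Restrict depth, as in the function above.
    if _level + 1 <= max_level:
        for child_index, child in enumerate(elem):
            lines.extend(tag_tree_lines(child, max_children, max_level, _level + 1))
            # Only the children of the root element are capped.
            if _level == 0 and child_index == max_children - 1:
                break
    return lines


root = ET.fromstring(
    "<mediawiki><siteinfo><sitename/></siteinfo>"
    "<page><title/><revision><text/></revision></page>"
    "<page><title/></page></mediawiki>"
)
lines = tag_tree_lines(root, max_children=2, max_level=2)
```

With `max_children=2` the second `page` element is skipped, and with `max_level=2` the `text` element at level 3 is not listed.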