fetch

`de_wiktio.fetch` ¶

This module provides methods to fetch and parse XML files from the Wiktionary domain.

Classes:

WikiDump –

This class provides methods to parse and process the XML dump file. It also creates and loads dictionaries of title-wikitext pairs.
PageExport –

This class provides methods to fetch and parse the XML content of a Wiktionary page and to extract the wikitext using the export tool (Spezial:Exportieren).

Functions:

fetch_page_Action_API –

Fetch online and return the XML content of a Wiktionary page for the given title using the Action API.
print_tags_tree –

Print the tree structure of an XML element, with options for customization.

Classes¶

`WikiDump(xml_path=None)` ¶

This class provides methods to parse and process the XML dump file. It also creates and loads dictionaries of title-wikitext pairs.

Parameters:

xml_path (str, default: None ) –

Path to the XML dump file to be processed. If None, the path indicated in Settings will be used.
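
The fallback described above can be sketched as follows. This is a minimal illustration, not the package's own code: `resolve_xml_path` is a hypothetical helper, and a plain dict stands in for the Settings object (assumed here only to offer a `get` method).

```python
import tempfile
from pathlib import Path


def resolve_xml_path(xml_path, settings):
    """Return the dump path, falling back to settings['XML_FILE'] (hypothetical helper)."""
    if xml_path is None:
        xml_path = settings.get("XML_FILE")
        if xml_path is None:
            raise ValueError("No path given and XML_FILE is not set")
    path = Path(xml_path)
    if not path.exists():
        raise FileNotFoundError(f"File not found: {path}")
    return path


# A temporary file stands in for a real dump.
with tempfile.NamedTemporaryFile(suffix=".xml") as tmp:
    settings = {"XML_FILE": tmp.name}
    from_settings = resolve_xml_path(None, settings)  # path taken from settings
    explicit = resolve_xml_path(tmp.name, {})         # path passed directly
```

Both calls resolve to the same `Path`; calling the helper with neither an argument nor an `XML_FILE` entry raises `ValueError`.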

Methods:

pages_by_ns –

Retrieve pages matching the Wiki namespace ns.
create_dict_by_ns –

Create a dictionary with titles as keys and the corresponding wikitext as values, and save it to a pickle file.
load_wikidict_by_ns –

Load a dictionary with page titles as keys and their corresponding wikitext as values from a pickle file.

Attributes:

settings (Settings) –

The Settings object.
xml_path (Path) –

Path to the XML dump file to be processed.
tree (_ElementTree) –

The lxml tree object from the XML file.
root (Element) –

The root element of the tree.
namespaces (Dict[str, str]) –

Dictionary of XML namespaces of the root element.
pages (List[Element]) –

List of all page elements from the XML file.

Source code in de_wiktio\fetch.py

def __init__(self, xml_path: str = None):
    """
    WikiDump object constructor.

    Args:
        xml_path: Path to the XML dump file to be processed. If `None`, the path indicated in Settings will be used. 
    """
    if xml_path is None:
        xml_path = WikiDump.settings.get('XML_FILE')
        if xml_path is None:
            raise ValueError("Path not provided. Please provide a valid path to the XML file or set a valid XML_FILE in Settings")

    if not Path(xml_path).exists():
        raise FileNotFoundError(f"File not found: {xml_path}. Please provide a valid path or set a valid XML_FILE in Settings") 

    self.xml_path = Path(xml_path)

    # Instance attributes docstring 
    self.xml_path: Path
    "Path to the XML dump file to be processed."

Attributes¶

`settings: Settings = Settings()` `class-attribute` `instance-attribute` ¶

The Settings object.

`xml_path: Path` `instance-attribute` ¶

Path to the XML dump file to be processed.

`tree: ET._ElementTree` `property` ¶

The lxml tree object from the XML file.

Lazily evaluated: this is a time-consuming operation, so it is computed only when needed.

`root: ET.Element` `property` ¶

The root element of the tree.

Lazily evaluated: this is a time-consuming operation, so it is computed only when needed.

`namespaces: Dict[str, str]` `property` ¶

Dictionary of XML namespaces of the root element.

`pages: List[ET.Element]` `property` ¶

List of all page elements from the XML file.

This includes all pages from all Wiki namespaces.
Lazily evaluated: this is a time-consuming operation, so it is computed only when needed.
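
The page does not show how the lazy properties are implemented. One common way to get this behaviour in Python is `functools.cached_property`, sketched below with the standard library's `xml.etree.ElementTree` standing in for lxml. The class `LazyDump` and its `parse_count` counter are hypothetical and exist only to make the single parse visible.

```python
import xml.etree.ElementTree as ET
from functools import cached_property


class LazyDump:
    """Hypothetical stand-in that parses its XML only on first access."""

    def __init__(self, xml_text):
        self.xml_text = xml_text
        self.parse_count = 0  # counts how often the XML was actually parsed

    @cached_property
    def root(self):
        self.parse_count += 1
        return ET.fromstring(self.xml_text)


dump = LazyDump("<mediawiki><page/><page/></mediawiki>")
count_before = dump.parse_count  # nothing has been parsed yet
first = dump.root                # triggers the parse
second = dump.root               # served from the cache
```

After both accesses `parse_count` is 1 and `first` and `second` are the same object.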

Functions¶

`pages_by_ns(ns)` ¶

Retrieve pages matching the Wiki namespace ns.

Parameters:

ns (str) –

The Wiki namespace identifier to filter pages (e.g., '0' for content pages, '108' for Flexion pages).

Returns:

List[Element] –

A list of page elements.

Source code in de_wiktio\fetch.py

def pages_by_ns(self, ns: str) -> List[ET.Element]:
    """
    Retrieve pages matching the Wiki namespace `ns`. 

    Args:
        ns: The Wiki namespace identifier to filter pages (e.g., '0' for content pages, '108' for Flexion pages).

    Returns:
        A list of page elements.
    """
    elements = []
    for p in self.pages:
        element = p.find('ns', namespaces=self.namespaces)
        # Skip pages without an <ns> child instead of failing on None
        if element is not None and element.text == ns:
            elements.append(p)
    return elements
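
The filtering step can be tried without a full dump. The sketch below uses the standard library's `xml.etree.ElementTree` instead of lxml and a made-up three-page sample; the `mw` prefix, the `SAMPLE` string and the export schema version in the namespace URI are assumptions chosen for illustration.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI; real dumps declare their own export schema version.
NS = {"mw": "http://www.mediawiki.org/xml/export-0.11/"}

SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.11/">
  <page><title>Haus</title><ns>0</ns><revision><text>== Haus ==</text></revision></page>
  <page><title>Flexion:Haus</title><ns>108</ns><revision><text>{{Deklination}}</text></revision></page>
  <page><title>Baum</title><ns>0</ns><revision><text>== Baum ==</text></revision></page>
</mediawiki>"""


def pages_by_ns(root, ns):
    """Return the <page> elements whose <ns> child has the text `ns`."""
    matches = []
    for page in root.findall("mw:page", NS):
        ns_elem = page.find("mw:ns", NS)
        if ns_elem is not None and ns_elem.text == ns:
            matches.append(page)
    return matches


root = ET.fromstring(SAMPLE)
titles = [p.find("mw:title", NS).text for p in pages_by_ns(root, "0")]
print(titles)  # ['Haus', 'Baum']
```

Note that the namespace is compared as a string, so `'0'` matches while the integer `0` would not.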

`create_dict_by_ns(ns, dict_path=None)` ¶

Create a dictionary with titles as keys and the corresponding wikitext as values, and save it to a pickle file.

Parameters:

ns (str) –

The Wiki namespace identifier to filter pages (e.g., '0' for content pages, '108' for Flexion pages).
dict_path (str, default: None ) –

The folder in which the dictionary is saved as 'wikidict_{ns}.pkl'. If not provided, the folder indicated by DICT_PATH in Settings is used.

Returns:

Dict[str, str] –

A dictionary with page titles as keys and their corresponding wikitext as values.

Source code in de_wiktio\fetch.py

def create_dict_by_ns(self, ns: str, dict_path: str = None) -> Dict[str, str]:
    """
    Create a dictionary with titles as keys and the corresponding *wikitext* as values, and save it to a pickle file.

    Args:
        ns: The Wiki namespace identifier to filter pages (e.g., `'0'` for content pages, `'108'` for Flexion pages).
        dict_path: The folder in which the dictionary is saved as 'wikidict_{ns}.pkl'. If not provided, the folder indicated by DICT_PATH in Settings is used.

    Returns:
        A dictionary with page titles as keys and their corresponding *wikitext* as values.
    """
    if dict_path is None:
        dict_path = WikiDump.settings.get('DICT_PATH')
        if dict_path is None:
            raise ValueError("Path not provided. Please provide a valid path to the dictionary or set a valid DICT_PATH in Settings")

    dict_path = Path(dict_path)

    if not dict_path.exists():
        raise FileNotFoundError(f"Folder not found: {dict_path}. Please provide a valid path or set a valid DICT_PATH in Settings")

    pages = self.pages_by_ns(ns)
    dic = dict()
    for p in pages:
        title = p.find('title', namespaces=self.namespaces)
        wikitext = p.find('revision/text', namespaces=self.namespaces)
        dic[title.text] = wikitext.text

    dict_file = dict_path / f'wikidict_{ns}.pkl'

    with open(dict_file, 'wb') as f:
        pickle.dump(dic, f)
    return dic
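
The same build-and-save flow, reduced to a self-contained sketch. It assumes the standard library's `xml.etree.ElementTree` in place of lxml, a made-up sample export, and a temporary folder in place of `DICT_PATH`; `build_wikidict` is a hypothetical name.

```python
import pickle
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path

# Assumed namespace URI; real dumps declare their own export schema version.
NS = {"mw": "http://www.mediawiki.org/xml/export-0.11/"}

SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.11/">
  <page><title>Haus</title><ns>0</ns><revision><text>== Haus ==</text></revision></page>
  <page><title>Flexion:Haus</title><ns>108</ns><revision><text>{{Deklination}}</text></revision></page>
  <page><title>Baum</title><ns>0</ns><revision><text>== Baum ==</text></revision></page>
</mediawiki>"""


def build_wikidict(root, ns):
    """Map page titles to wikitext for all pages in Wiki namespace `ns`."""
    wikidict = {}
    for page in root.findall("mw:page", NS):
        if page.findtext("mw:ns", default="", namespaces=NS) != ns:
            continue
        title = page.findtext("mw:title", namespaces=NS)
        wikidict[title] = page.findtext("mw:revision/mw:text", default="", namespaces=NS)
    return wikidict


wikidict = build_wikidict(ET.fromstring(SAMPLE), "0")

with tempfile.TemporaryDirectory() as folder:
    dict_file = Path(folder) / "wikidict_0.pkl"
    with open(dict_file, "wb") as f:
        pickle.dump(wikidict, f)
    # Read the file back to confirm the round trip.
    with open(dict_file, "rb") as f:
        restored = pickle.load(f)
```

Unlike the method above, the sketch stores an empty string when a page has no text, which keeps the values uniformly `str`.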

`load_wikidict_by_ns(file=None, ns='0')` `classmethod` ¶

Load a dictionary with page titles as keys and their corresponding wikitext as values from a pickle file.

Parameters:

file (str, default: None ) –

The path to the pickle file. If None, the file 'wikidict_{ns}.pkl' in the folder indicated in Settings will be used.
ns (str, default: '0' ) –

The Wiki namespace identifier used to select the default file 'wikidict_{ns}.pkl' (e.g., '0' for content pages, '108' for Flexion pages). Ignored when file is given.

Returns:

Dict[str, str] –

A dictionary with page titles as keys and their corresponding wikitext as values.

Raises:

FileNotFoundError –

If the file does not exist.

Source code in de_wiktio\fetch.py

@classmethod
def load_wikidict_by_ns(cls, file: str = None, ns: str = '0') -> Dict[str, str]:
    """
    Load a dictionary with page titles as keys and their corresponding *wikitext* as values from a pickle file.

    Args:
        file: The path to the pickle file. If `None`, the file 'wikidict_{ns}.pkl' in the folder indicated in Settings will be used.
        ns: The Wiki namespace identifier used to select the default file 'wikidict_{ns}.pkl' (e.g., `'0'` for content pages, `'108'` for Flexion pages). Ignored when `file` is given.

    Returns:
        A dictionary with page titles as keys and their corresponding *wikitext* as values.

    Raises:
        FileNotFoundError: If the file does not exist.
    """
    if file is None:
        dict_path = cls.settings.get('DICT_PATH')
        if dict_path is None:
            raise ValueError("Path not provided. Please provide a valid path to the dictionary or set a valid DICT_PATH in Settings")
        file = Path(dict_path) / f'wikidict_{ns}.pkl'
    else:
        file = Path(file)

    if not file.exists():
        raise FileNotFoundError(f"The file {file} does not exist. Please create it first using the 'create_dict_by_ns' method.")

    with open(file, 'rb') as f:
        dic = pickle.load(f)
    return dic
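
A short sketch of the loading behaviour, including the missing-file case. `load_wikidict` is a hypothetical stand-in for the classmethod that takes the folder explicitly instead of reading it from Settings.

```python
import pickle
import tempfile
from pathlib import Path


def load_wikidict(folder, ns="0"):
    """Load 'wikidict_{ns}.pkl' from `folder` (hypothetical stand-in)."""
    file = Path(folder) / f"wikidict_{ns}.pkl"
    if not file.exists():
        raise FileNotFoundError(f"The file {file} does not exist.")
    with open(file, "rb") as f:
        return pickle.load(f)


with tempfile.TemporaryDirectory() as folder:
    # Only the namespace-0 dictionary is written.
    with open(Path(folder) / "wikidict_0.pkl", "wb") as f:
        pickle.dump({"Haus": "== Haus =="}, f)

    loaded = load_wikidict(folder)  # ns defaults to '0'
    try:
        load_wikidict(folder, ns="108")
        missing_raised = False
    except FileNotFoundError:
        missing_raised = True
```

Because unpickling can execute arbitrary code, only load dictionary files that you created yourself or otherwise trust.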

`PageExport(title)` ¶

This class provides methods to fetch and parse the XML content of a Wiktionary page and to extract the wikitext using the export tool (Spezial:Exportieren).

Parameters:

title (str) –

The title of the Wiktionary page to fetch.

Raises:

RequestException –

If the request fails.

Methods:

fetch –

Fetch and return the XML content of a Wiktionary page using the export tool.

Attributes:

title (str) –

The title of the Wiktionary page to fetch.
xml (bytes) –

The XML content of the requested Wiktionary page.
root (Element) –

The root element of the tree.
namespaces (Dict[str, str]) –

Dictionary of XML namespaces of the root element.
page (Element) –

The page element.
wikitext (str) –

The wikitext of the page as a string.
ns (str) –

The Wiki namespace of the page as a string.

Source code in de_wiktio\fetch.py

def __init__(self, title: str) -> None:
    """
    Initialize the PageExport class.

    Args:
        title: The title of the Wiktionary page to fetch.

    Raises:
        requests.exceptions.RequestException: If the request fails.
    """
    self.title: str = title
    self.xml = self.fetch()
    self.root = ET.fromstring(self.xml)
    self.namespaces = self.root.nsmap 

    # Instance attributes docstring 
    self.title: str
    "The title of the Wiktionary page to fetch."
    self.xml: bytes
    "The XML content of the requested Wiktionary page."
    self.root: ET.Element
    "The root element of the tree."
    self.namespaces: Dict[str, str]
    "Dictionary of XML namespaces of the root element."

Attributes¶

`title: str` `instance-attribute` ¶

The title of the Wiktionary page to fetch.

`xml: bytes` `instance-attribute` ¶

The XML content of the requested Wiktionary page.

`root: ET.Element` `instance-attribute` ¶

The root element of the tree.

`namespaces: Dict[str, str]` `instance-attribute` ¶

Dictionary of XML namespaces of the root element.

`page: ET.Element` `property` ¶

The page element.

`wikitext: str` `property` ¶

The wikitext of the page as a string.

If not found, an empty string is returned.

`ns: str` `property` ¶

The Wiki namespace of the page as a string.

If not found, an empty string is returned.
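
The empty-string fallback of `wikitext` and `ns` can be sketched as below. The example uses the standard library's `xml.etree.ElementTree` rather than lxml; the function name `wikitext_and_ns`, the sample bytes and the namespace URI are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI; the export tool declares its own schema version.
NS = {"mw": "http://www.mediawiki.org/xml/export-0.11/"}

EXPORT_XML = (
    b'<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.11/">'
    b"<page><title>Haus</title><ns>0</ns>"
    b"<revision><text>== Haus ==</text></revision></page>"
    b"</mediawiki>"
)
# An export with no <page> element, used here to exercise the fallback.
EMPTY_XML = b'<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.11/"></mediawiki>'


def wikitext_and_ns(xml):
    """Return (wikitext, ns) of the first page, or empty strings if absent."""
    page = ET.fromstring(xml).find("mw:page", NS)
    if page is None:
        return "", ""
    wikitext = page.findtext("mw:revision/mw:text", default="", namespaces=NS)
    ns = page.findtext("mw:ns", default="", namespaces=NS)
    return wikitext, ns


found = wikitext_and_ns(EXPORT_XML)
absent = wikitext_and_ns(EMPTY_XML)
```

Returning `""` instead of `None` lets callers apply string methods to the result without a prior check.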

Functions¶

`fetch()` ¶

Fetch and return the XML content of a Wiktionary page using the export tool.

The XML data is retrieved using the following URL:
https://de.wiktionary.org/wiki/Spezial:Exportieren/{self.title}

Returns:

bytes –

The XML content of the requested Wiktionary page (`response.content`).

Raises:

RequestException –

If the request fails.

Source code in de_wiktio\fetch.py

def fetch(self) -> bytes:
    """
    Fetch and return the XML content of a Wiktionary page using the export tool.

    The XML data is retrieved using the following URL:
    `https://de.wiktionary.org/wiki/Spezial:Exportieren/{self.title}`


    Returns:
        The XML content of the requested Wiktionary page (`response.content`).

    Raises:
        requests.exceptions.RequestException: If the request fails.
    """
    url = f'https://de.wiktionary.org/wiki/Spezial:Exportieren/{self.title}'
    resp = requests.get(url)
    resp.raise_for_status()
    return resp.content 
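
The method interpolates the title into the URL as-is. A title containing `?` or `#` would then be read as the start of a query string or fragment. The sketch below shows one way to build the export URL defensively with `urllib.parse.quote`; the helper `export_url` is hypothetical and not part of the module, and it sends no request.

```python
from urllib.parse import quote

EXPORT_BASE = "https://de.wiktionary.org/wiki/Spezial:Exportieren/"


def export_url(title):
    """Build the export URL for `title` with the title percent-encoded."""
    # MediaWiki treats spaces and underscores in titles as equivalent.
    path = title.replace(" ", "_")
    # Keep '/' and ':' readable (e.g. 'Flexion:Haus'); encode everything else.
    return EXPORT_BASE + quote(path, safe="/:")


plain = export_url("Flexion:Haus")
umlaut = export_url("Öl")
question = export_url("was ist das?")
```

Here `umlaut` ends in `%C3%96l` and `question` ends in `was_ist_das%3F`, while `plain` keeps the title unchanged.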

Functions¶

`fetch_page_Action_API(title)` ¶

Fetch online and return the XML content of a Wiktionary page for the given title using the Action API.

The XML data is retrieved from base URL:
https://de.wiktionary.org/w/api.php

Parameters:

title (str) –

The title of the Wiktionary page to fetch.

Returns:

bytes –

The XML content of the requested Wiktionary page.

Source code in de_wiktio\fetch.py

def fetch_page_Action_API(title: str) -> bytes:
    """Fetch online and return the XML content of a Wiktionary page for the given title using the Action API.

    The XML data is retrieved from base URL:
    `https://de.wiktionary.org/w/api.php`

    Args:
        title: The title of the Wiktionary page to fetch.

    Returns:
        bytes: The XML content of the requested Wiktionary page.
    """
    url = "https://de.wiktionary.org/w/api.php"

    params = {
        "titles": title,
        "action": "query",
        "export": 1,
        "exportnowrap": 1
    }

    resp = requests.get(url=url, params=params)
    return resp.content
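
To see which request the parameters above produce, the query string can be assembled offline. This sketch uses `urllib.parse.urlencode` instead of `requests` and sends nothing; `action_api_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

API_URL = "https://de.wiktionary.org/w/api.php"


def action_api_url(title):
    """Return the full Action API URL that exports `title` as bare XML."""
    params = {
        "action": "query",
        "titles": title,
        "export": 1,
        "exportnowrap": 1,  # return the export XML without the API result wrapper
    }
    return f"{API_URL}?{urlencode(params)}"


url = action_api_url("Haus")
print(url)
# https://de.wiktionary.org/w/api.php?action=query&titles=Haus&export=1&exportnowrap=1
```

Unlike `PageExport.fetch`, the function above does not call `raise_for_status()`, so an HTTP error response would be returned as bytes rather than raised.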

`print_tags_tree(elem, only_tagnames=False, print_attributes=False, print_text=False, max_children=5, max_level=5, _level=0)` ¶

Print the tree structure of an XML element, with options for customization.

Parameters:

elem (Element) –

The XML Element object whose tree structure is to be printed.
only_tagnames (bool, default: False ) –

If True, print only the tag name without the namespace.
print_attributes (bool, default: False ) –

If True, print the attributes of each element.
print_text (bool, default: False ) –

If True, print the text content of each element.
max_children (int, default: 5 ) –

The maximum number of children to print for the root element.
max_level (int, default: 5 ) –

The maximum depth of the tree to print.
_level (int, default: 0 ) –

The current recursion level. This is used internally and should not be set by the user.

Source code in de_wiktio\fetch.py

def print_tags_tree(
                    elem: ET.Element,
                    only_tagnames: bool = False,
                    print_attributes: bool = False,
                    print_text: bool = False,
                    max_children: int = 5,
                    max_level: int = 5,
                    _level: int = 0
                    ) -> None:
    """Print the tree structure of an XML element, with options for customization.

    Args:
        elem: The XML `Element` object whose tree structure is to be printed.
        only_tagnames: If True, print only the tag name without the namespace.
        print_attributes: If True, print the attributes of each element.
        print_text: If True, print the text content of each element.
        max_children: The maximum number of children to print for the root element.
        max_level: The maximum depth of the tree to print.
        _level: The current recursion level. This is used internally and should not be set by the user.
    """
    tagname = ET.QName(elem).localname if only_tagnames else elem.tag
    print(" " * 5 * _level, _level, tagname)

    if print_attributes:
        for attr in elem.attrib:
            print(" " * 5 * (_level + 1), attr, "=", elem.attrib[attr])

    if print_text:
        if elem.text is not None and elem.text.strip():
            print(" " * 5 * (_level + 1), elem.text)

    # Restrict depth
    if _level + 1 <= max_level:
        for child_index, child in enumerate(elem):
            print_tags_tree(child,
                print_attributes=print_attributes,
                print_text=print_text,
                only_tagnames=only_tagnames,
                max_children=max_children,
                max_level=max_level,
                _level=_level + 1)
            # Limit number of children of the root element
            if _level == 0 and child_index == max_children - 1:
                break
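
The interplay of `max_level` and `max_children` is easiest to see on a tiny tree. The sketch below mirrors the recursion with the standard library's `xml.etree.ElementTree` and returns the lines instead of printing them; `tag_tree_lines` is a hypothetical simplification that always strips the namespace and uses a two-space indent.

```python
import xml.etree.ElementTree as ET


def tag_tree_lines(elem, max_children=5, max_level=5, _level=0):
    """Return one line per visited element, indented by depth."""
    local = elem.tag.rsplit("}", 1)[-1]  # drop a '{namespace}' prefix if present
    lines = [f"{'  ' * _level}{_level} {local}"]
    # Restrict depth
    if _level + 1 <= max_level:
        for index, child in enumerate(elem):
            lines.extend(tag_tree_lines(child, max_children, max_level, _level + 1))
            # Limit the number of children of the root element only
            if _level == 0 and index == max_children - 1:
                break
    return lines


root = ET.fromstring("<a><b><c/></b><d/><e/></a>")
lines = tag_tree_lines(root, max_children=2, max_level=1)
print("\n".join(lines))
# 0 a
#   1 b
#   1 d
```

With `max_level=1` the grandchild `c` is skipped, and with `max_children=2` the third child `e` of the root is skipped.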
