fetch
de_wiktio.fetch
This module provides methods to fetch and parse XML files from the Wiktionary domain.
Classes:
- WikiDump – This class provides methods to parse and process the XML dump file. It also creates and loads dictionaries of title-wikitext pairs.
- PageExport – This class provides methods to fetch and parse the XML content of a Wiktionary page and to extract the wikitext using the export tool (Spezial:Exportieren).
Functions:
- fetch_page_Action_API – Fetch online and return the XML content of a Wiktionary page for the given title using the Action API.
- print_tags_tree – Print the tree structure of an XML element, with options for customization.
Classes
WikiDump(xml_path=None)
This class provides methods to parse and process the XML dump file. It also creates and loads dictionaries of title-wikitext pairs.
Parameters:
- xml_path (str, default: None) – Path to the XML dump file to be processed. If None, the path indicated in Settings will be used.
Methods:
- pages_by_ns – Retrieve pages matching the Wiki namespace ns.
- create_dict_by_ns – Create a dictionary with titles as keys and the corresponding wikitext as values, and save it to a pickle file.
- load_wikidict_by_ns – Load a dictionary with page titles as keys and their corresponding wikitext as values from a pickle file.
Attributes:
- settings (Settings) – The Settings object.
- xml_path (Path) – Path to the XML dump file to be processed.
- tree (_ElementTree) – The lxml tree object from the XML file.
- root (Element) – The root element of the tree.
- namespaces (Dict[str, str]) – Dictionary of XML namespaces of the root element.
- pages (List[Element]) – List of all page elements from the XML file.
Source code in de_wiktio\fetch.py
Attributes
settings: Settings = Settings() (class-attribute, instance-attribute)
The Settings object.
xml_path: Path (instance-attribute)
Path to the XML dump file to be processed.
tree: ET._ElementTree (property)
The lxml tree object from the XML file.
Lazy evaluation. This is a time-consuming operation, so it is only computed when needed.
root: ET.Element (property)
The root element of the tree.
Lazy evaluation. This is a time-consuming operation, so it is only computed when needed.
namespaces: Dict[str, str] (property)
Dictionary of XML namespaces of the root element.
pages: List[ET.Element] (property)
List of all page elements from the XML file.
This includes all pages from all wiki namespaces.
Lazy evaluation. This is a time-consuming operation, so it is only computed when needed.
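The lazy evaluation described for tree, root and pages can be sketched with functools.cached_property. This is a minimal illustration under assumptions, not the module's implementation: the class name LazyDump and the parse_count counter are hypothetical, and the standard-library xml.etree.ElementTree stands in for lxml so the snippet runs without extra dependencies.

```python
import io
import xml.etree.ElementTree as ET
from functools import cached_property

SAMPLE = (
    b"<mediawiki>"
    b"<page><title>Haus</title></page>"
    b"<page><title>Baum</title></page>"
    b"</mediawiki>"
)


class LazyDump:
    """Hypothetical stand-in for a dump reader with lazily parsed attributes."""

    def __init__(self, source):
        self.source = source    # file path or file-like object
        self.parse_count = 0    # counts how often the XML is actually parsed

    @cached_property
    def tree(self):
        # Parsing is the expensive step, so it runs only on first access.
        self.parse_count += 1
        return ET.parse(self.source)

    @cached_property
    def root(self):
        return self.tree.getroot()

    @cached_property
    def pages(self):
        # All <page> elements, regardless of wiki namespace.
        return self.root.findall("page")


dump = LazyDump(io.BytesIO(SAMPLE))
parsed_before_access = dump.parse_count
titles = [page.findtext("title") for page in dump.pages]
root_again = dump.root          # reuses the cached tree, no second parse
```

Creating the object costs nothing; the file is parsed once, on the first access to any attribute that depends on the tree.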
Functions
pages_by_ns(ns)
Retrieve pages matching the Wiki namespace ns.
Parameters:
- ns (str) – The Wiki namespace identifier to filter pages (e.g., '0' for content pages, '108' for Flexion pages).
Returns:
- The page elements whose Wiki namespace matches ns.
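Filtering pages by namespace can be sketched as follows. The helper name filter_pages_by_ns, the sample XML, and the export schema URI are assumptions made for illustration; real dumps declare their own schema version, and the module itself works on lxml elements rather than the standard-library ones used here.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI; check the root element of the actual dump.
NS = {"mw": "http://www.mediawiki.org/xml/export-0.11/"}

SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.11/">
  <page><title>Haus</title><ns>0</ns></page>
  <page><title>Flexion:Haus</title><ns>108</ns></page>
  <page><title>Baum</title><ns>0</ns></page>
</mediawiki>"""


def filter_pages_by_ns(root, ns):
    """Return the <page> elements whose <ns> child equals ns (a string)."""
    return [
        page
        for page in root.findall("mw:page", NS)
        if page.findtext("mw:ns", default="", namespaces=NS) == ns
    ]


root = ET.fromstring(SAMPLE)
content_titles = [
    p.findtext("mw:title", namespaces=NS) for p in filter_pages_by_ns(root, "0")
]
flexion_titles = [
    p.findtext("mw:title", namespaces=NS) for p in filter_pages_by_ns(root, "108")
]
```

The namespace identifier is compared as a string, which matches the documented parameter type.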
create_dict_by_ns(ns, dict_path=None)
Create a dictionary with titles as keys and the corresponding wikitext as values, and save it to a pickle file.
Parameters:
- ns (str) – The Wiki namespace identifier to filter pages (e.g., '0' for content pages, '108' for Flexion pages).
- dict_path (str, default: None) – The path where the dictionary should be saved. If not provided, the dictionary will be saved as 'wikidict_{ns}.pkl' in the folder indicated in Settings.
Returns:
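A minimal sketch of building and pickling a title-to-wikitext dictionary is shown below. The function name build_wikidict and the sample XML are hypothetical, the XML namespace is omitted for brevity, and returning the dictionary is a choice made for this sketch rather than documented behaviour.

```python
import pickle
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path

SAMPLE = """<mediawiki>
  <page><title>Haus</title><ns>0</ns>
    <revision><text>== Haus ==</text></revision></page>
  <page><title>Flexion:Haus</title><ns>108</ns>
    <revision><text>{{Deklination}}</text></revision></page>
</mediawiki>"""


def build_wikidict(root, ns, dict_path):
    """Map page titles to wikitext for one namespace and pickle the mapping."""
    wikidict = {
        page.findtext("title"): page.findtext("revision/text", default="")
        for page in root.iter("page")
        if page.findtext("ns") == ns
    }
    with open(dict_path, "wb") as f:
        pickle.dump(wikidict, f)
    return wikidict


root = ET.fromstring(SAMPLE)
out_file = Path(tempfile.mkdtemp()) / "wikidict_0.pkl"  # mirrors the documented default name
result = build_wikidict(root, "0", out_file)
```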
load_wikidict_by_ns(file=None, ns='0') (classmethod)
Load a dictionary with page titles as keys and their corresponding wikitext as values from a pickle file.
Parameters:
- file (str, default: None) – The path to the pickle file. If None, the file 'wikidict_{ns}.pkl' in the folder indicated in Settings will be used.
- ns (str, default: '0') – The Wiki namespace identifier to filter pages (e.g., '0' for content pages, '108' for Flexion pages).
Returns:
- The dictionary of page titles and wikitext loaded from the pickle file.
Raises:
- FileNotFoundError – If the file does not exist.
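The default-path logic and the FileNotFoundError case can be sketched as follows. The function load_wikidict and its folder argument are hypothetical; folder stands in for the directory that the module reads from Settings.

```python
import pickle
import tempfile
from pathlib import Path


def load_wikidict(folder, file=None, ns="0"):
    """Load a title -> wikitext dictionary from a pickle file.

    `folder` is an assumption standing in for the directory configured in Settings.
    """
    path = Path(file) if file is not None else Path(folder) / f"wikidict_{ns}.pkl"
    if not path.is_file():
        raise FileNotFoundError(f"No pickled dictionary at {path}")
    with open(path, "rb") as f:
        return pickle.load(f)


# Prepare a pickle for namespace 108 only.
folder = Path(tempfile.mkdtemp())
with open(folder / "wikidict_108.pkl", "wb") as f:
    pickle.dump({"Flexion:Haus": "{{Deklination}}"}, f)

loaded = load_wikidict(folder, ns="108")

try:
    load_wikidict(folder, ns="0")   # wikidict_0.pkl was never written
    missing_raised = False
except FileNotFoundError:
    missing_raised = True
```

Pickle files can execute arbitrary code when loaded, so only load dictionaries that were created locally or come from a trusted source.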
PageExport(title)
This class provides methods to fetch and parse the XML content of a Wiktionary page and to extract the wikitext using the export tool (Spezial:Exportieren).
Parameters:
- title (str) – The title of the Wiktionary page to fetch.
Raises:
- RequestException – If the request fails.
Methods:
- fetch – Fetch and return the XML content of a Wiktionary page using the export tool.
Attributes:
- title (str) – The title of the Wiktionary page to fetch.
- xml (bytes) – The XML content of the requested Wiktionary page.
- root (Element) – The root element of the tree.
- namespaces (Dict[str, str]) – Dictionary of XML namespaces of the root element.
- page (Element) – The page element.
- wikitext (str) – The wikitext of the page as a string.
- ns (str) – The Wiki namespace of the page as a string.
Attributes
title: str (instance-attribute)
The title of the Wiktionary page to fetch.
xml: bytes (instance-attribute)
The XML content of the requested Wiktionary page.
root: ET.Element (instance-attribute)
The root element of the tree.
namespaces: Dict[str, str] (instance-attribute)
Dictionary of XML namespaces of the root element.
page: ET.Element (property)
The page element.
wikitext: str (property)
The wikitext of the page as a string.
If not found, an empty string is returned.
ns: str (property)
The Wiki namespace of the page as a string.
If not found, an empty string is returned.
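The empty-string fallback for wikitext and ns can be sketched with findtext and a default value. The helper page_fields, the sample XML, and the export schema URI are assumptions for illustration; the standard-library parser is used here in place of lxml.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI; the module reads the namespaces from the root element.
NS = {"mw": "http://www.mediawiki.org/xml/export-0.11/"}

XML = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.11/">
  <page>
    <title>Haus</title>
    <ns>0</ns>
    <revision><text>== Haus ({{Sprache|Deutsch}}) ==</text></revision>
  </page>
</mediawiki>"""

EMPTY_XML = b'<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.11/"/>'


def page_fields(xml_bytes):
    """Return (wikitext, ns) for the first page, using '' when an element is missing."""
    root = ET.fromstring(xml_bytes)
    page = root.find("mw:page", NS)
    if page is None:
        return "", ""
    wikitext = page.findtext("mw:revision/mw:text", default="", namespaces=NS)
    ns = page.findtext("mw:ns", default="", namespaces=NS)
    return wikitext, ns


wikitext, ns = page_fields(XML)
missing = page_fields(EMPTY_XML)   # export of a page that does not exist
```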
Functions
fetch()
Fetch and return the XML content of a Wiktionary page using the export tool.
The XML data is retrieved using the following URL:
https://de.wiktionary.org/wiki/Spezial:Exportieren/{self.title}
Returns:
- bytes – The XML content of the requested Wiktionary page (response.content).
Raises:
- RequestException – If the request fails.
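Building the export URL for a title can be sketched as below. The helper export_url is hypothetical, and the explicit percent-encoding is an assumption; an HTTP client may apply the same encoding on its own. The request itself is left out so the snippet stays offline.

```python
from urllib.parse import quote

EXPORT_URL = "https://de.wiktionary.org/wiki/Spezial:Exportieren/{title}"


def export_url(title):
    """Build the export URL for a page title, percent-encoding unsafe characters."""
    # Keep ':' readable so namespace prefixes such as 'Flexion:' stay intact.
    return EXPORT_URL.format(title=quote(title, safe=":"))


url_plain = export_url("Haus")
url_umlaut = export_url("Häuser")
url_flexion = export_url("Flexion:Haus")
```

Non-ASCII characters are encoded as UTF-8 bytes, so titles with umlauts produce a valid URL.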
Functions
fetch_page_Action_API(title)
Fetch online and return the XML content of a Wiktionary page for the given title using the Action API.
The XML data is retrieved from the base URL:
https://de.wiktionary.org/w/api.php
Parameters:
- title (str) – The title of the Wiktionary page to fetch.
Returns:
- bytes – The XML content of the requested Wiktionary page.
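One way to request export XML through the Action API is a query with the export and exportnowrap parameters. The sketch below only builds the URL; the helper action_api_url is hypothetical, and the parameters this module actually sends are not shown on this page, so treat the parameter set as an assumption.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

API_URL = "https://de.wiktionary.org/w/api.php"


def action_api_url(title):
    """Build a query URL asking the Action API for a page's export XML."""
    params = {
        "action": "query",
        "titles": title,
        "export": 1,         # include the export XML of the requested pages
        "exportnowrap": 1,   # return the bare export XML, without the API wrapper
    }
    return f"{API_URL}?{urlencode(params)}"


url = action_api_url("Haus")
query = parse_qs(urlsplit(url).query)   # decoded parameters, for inspection
```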
print_tags_tree(elem, only_tagnames=False, print_attributes=False, print_text=False, max_children=5, max_level=5, _level=0)
Print the tree structure of an XML element, with options for customization.
Parameters:
- elem (Element) – The XML Element object whose tree structure is to be printed.
- only_tagnames (bool, default: False) – If True, print only the tag name without the namespace.
- print_attributes (bool, default: False) – If True, print the attributes of each element.
- print_text (bool, default: False) – If True, print the text content of each element.
- max_children (int, default: 5) – The maximum number of children to print for the root element.
- max_level (int, default: 5) – The maximum depth of the tree to print.
- _level (int, default: 0) – The current recursion level. This is used internally and should not be set by the user.
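The recursive traversal behind such a tree printer can be sketched as follows. The function tags_tree_lines is hypothetical and simplified: it returns the lines instead of printing them, omits the attribute and text options, and applies max_children at every level, which is an assumption about the intended behaviour.

```python
import xml.etree.ElementTree as ET


def tags_tree_lines(elem, only_tagnames=False, max_children=5, max_level=5, _level=0):
    """Return indented tag names for elem and its descendants, one per line."""
    tag = elem.tag
    if only_tagnames and tag.startswith("{"):
        tag = tag.split("}", 1)[1]       # drop the '{namespace-uri}' prefix
    lines = ["  " * _level + tag]
    if _level >= max_level:
        return lines                      # do not descend below max_level
    children = list(elem)
    for child in children[:max_children]:
        lines += tags_tree_lines(
            child, only_tagnames, max_children, max_level, _level + 1
        )
    if len(children) > max_children:
        lines.append("  " * (_level + 1) + "...")   # marks omitted siblings
    return lines


root = ET.fromstring(
    '<mediawiki xmlns="http://example.org/ns">'
    "<page><title>A</title><ns>0</ns></page><page/><page/>"
    "</mediawiki>"
)
lines = tags_tree_lines(root, only_tagnames=True, max_children=2)
print("\n".join(lines))
```

Limiting both depth and breadth keeps the output readable for dumps that contain many thousands of page elements.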