Wiki Dump Extractor Reference

This section contains the API reference for the Wiki Dump Extractor package.

Main Extractor

class wiki_dump_extractor.wiki_dump_extractor.ExtractorBase[source]

Bases: ABC

extract_disambiguation_page_titles(output_file: str | Path, page_limit: int = None)[source]

Extract disambiguation pages from the dump file.

Parameters

output_file : str

Path where to save the output Avro file.

page_limit : int, optional

Maximum number of pages to extract, by default None

extract_pages_to_avro(output_file: str | Path, redirects_db_path: str | Path | None = None, ignored_fields: List[str] = None, fields: List[str] = None, batch_size: int = 10000, page_limit: int = None, codec: str = 'zstandard', page_filter: Callable[[Page], bool] | None = None)[source]

Convert the XML dump file to an Avro file.

Parameters

output_file : str

Path where to save the output Avro file.

redirects_db_path : str | Path, optional

Path where to save the redirects LMDB database.

ignored_fields : List[str], optional

Fields to ignore, by default ["text"]

batch_size : int, optional

Number of pages per batch, by default 10_000

page_limit : int, optional

Maximum number of pages to extract, by default None

codec : str, optional

Codec to use for compression, by default "zstandard"

page_filter : Callable[[Page], bool], optional

A function that takes a Page object and returns a boolean. If the function returns False, the page will not be included in the Avro file.
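A page_filter is just a predicate over Page objects. A minimal sketch, using a SimpleNamespace as a hypothetical stand-in for a real Page instance:

```python
from types import SimpleNamespace

def keep_non_redirects(page) -> bool:
    # Example page_filter: keep only non-redirect pages with non-empty text.
    return page.redirect_title is None and bool(page.text.strip())

# SimpleNamespace stands in for a real Page object here:
article = SimpleNamespace(redirect_title=None, text="Some article text.")
redirect = SimpleNamespace(redirect_title="Other page", text="")
```

Any callable with this shape can be passed as page_filter.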

iter_page_batches(batch_size: int, page_limit: int | None = None, page_filter: Callable[[Page], bool] | None = None) Iterator[List[Page]][source]

Iterate over pages in batches.

Each yielded batch is a list of Page objects with fields title, page_id, timestamp, redirect_title, revision_id, and text.

This method iterates over the pages in the dump file and yields them in batches. If a page limit is provided, iteration stops once the specified number of pages has been returned.

Parameters

batch_size : int

The number of pages per batch.

page_limit : int | None, optional

The maximum number of pages to return.

page_filter : Callable[[Page], bool], optional

A function that takes a Page object and returns a boolean. If the function returns False, the page will not be included in the batch.

Returns

Iterator[list[Page]]

An iterator over lists of pages.
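The batching semantics can be sketched in plain Python with itertools.islice (a generic illustration, not the package's implementation):

```python
from itertools import islice

def batched_sketch(pages, batch_size):
    # Yield successive lists of up to batch_size items, mirroring the
    # semantics of iter_page_batches (a final short batch is allowed).
    iterator = iter(pages)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

batches = list(batched_sketch(range(7), 3))
```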

iter_pages(page_limit: int | None = None, page_filter: Callable[[Page], bool] | None = None) Iterator[Page][source]

Iterate over all pages in the dump file.

The returned elements are Page objects with fields title, page_id, timestamp, redirect_title, revision_id, and text.

process_page_batches_in_parallel(process_fn: Callable[[List[Page], int], Any], num_workers: int, batch_size: int, page_limit: int | None = None, page_filter: Callable[[Page], bool] | None = None, ordered_results: bool = False)[source]

Apply a function to each batch of pages in parallel.

This method is useful for speeding up the extraction of pages from the dump file.

Parameters

process_fn : Callable[[List[Page], int], Any]

A function that takes a list of Page objects and the batch index, and returns any value.

num_workers : int

The number of workers to use.

batch_size : int

The number of pages per batch.

page_limit : int | None, optional

The maximum number of pages to return.

page_filter : Callable[[Page], bool], optional

A function that takes a Page object and returns a boolean. If the function returns False, the page will not be included in the batch.
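The process_fn callback shape (a list of pages plus the batch index) can be illustrated with a thread-based sketch; the real method manages its own worker pool, and plain lists stand in for Page batches here:

```python
from concurrent.futures import ThreadPoolExecutor

def count_batch(batch, batch_index):
    # Matches the process_fn signature: a list of pages and the batch index.
    return batch_index, len(batch)

# Plain lists stand in for batches of Page objects:
batches = [["A", "B"], ["C"], ["D", "E", "F"]]
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(count_batch, batches, range(len(batches))))
```

Executor.map keeps results in submission order, which corresponds to ordered_results=True.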

async process_pages_async(process_fn, num_workers=5, page_limit=None, page_filter=None, ordered=False) AsyncGenerator[Tuple[Page, Any], None][source]
class wiki_dump_extractor.wiki_dump_extractor.Page(title: str = '', text: str = '', page_id: int = 0, timestamp: datetime = None, redirect_title: str | None = None, revision_id: str = '')[source]

Bases: object

Represents a page in the Wikipedia dump.

Attributes

page_id: int

The ID of the page.

title: str

The title of the page.

timestamp: datetime

The timestamp of the page.

redirect_title: str | None

The title of the page if it is a redirect.

revision_id: str

The ID of the revision.

text: str

The text of the page.

classmethod from_xml(elem: Element, namespace: str) Page[source]
classmethod get_avro_schema(ignored_fields=None, fields=None) dict[source]
get_wikipedia_url() str[source]
page_id: int = 0
redirect_title: str | None = None
revision_id: str = ''
text: str = ''
timestamp: datetime = None
title: str = ''
to_dict() dict[source]
class wiki_dump_extractor.wiki_dump_extractor.WikiAvroDumpExtractor(file_path: str, index_dir: str | Path | None = None)[source]

Bases: ExtractorBase

extract_pages_titles_to_new_dump(page_titles: List[str], output_avro_file: str | Path, replace_file: bool = True, batch_size: int = 1000, redirects_env: Environment | None = None, ignore_titles_not_found: bool = False)[source]

Extract a list of pages by their titles to a new Avro file.

Parameters

page_titles : List[str]

The titles of the pages to extract.

output_avro_file : str

Path where to save the output Avro file.

replace_file : bool, optional

Whether to replace the output file if it exists.

batch_size : int, optional

Size of batches to process at once.

redirects_env : lmdb.Environment, optional

LMDB environment containing redirects mapping.

ignore_titles_not_found : bool, optional

Whether to ignore titles that are not found in the index.

get_page_batch_by_title(titles: List[str], redirects_env: Environment | None = None, ignore_titles_not_found: bool = False) List[Page][source]

Get a batch of pages by their titles.

Parameters

titles : List[str]

The titles of the pages to get.

redirects_env : lmdb.Environment, optional

LMDB environment containing redirects mapping.

ignore_titles_not_found : bool, optional

Whether to ignore titles that are not found in the index.

get_page_by_title(title: str) Page[source]

Get a page by its title.

Parameters

title : str

The title of the page to get.

index_pages(index_dir: str | Path)[source]

Index the pages in the Avro file.

Parameters

index_dir : str | Path

Path of the directory where to save the index.

iter_pages_by_title(titles: List[str]) Generator[Page, None, None][source]

Iterate over pages matching the given titles.

Parameters

titles : List[str]

The titles of the pages to iterate over.

class wiki_dump_extractor.wiki_dump_extractor.WikiXmlDumpExtractor(file_path: str | Path)[source]

Bases: ExtractorBase

A class for extracting pages from a MediaWiki XML dump file. This class provides functionality to parse and extract pages from MediaWiki XML dump files, which can be either uncompressed (.xml) or bzip2 compressed (.xml.bz2). It handles XML namespace detection automatically and provides iterators for processing pages individually or in batches.

Parameters

file_path : str | Path

Path to the MediaWiki XML dump file (.xml or .xml.bz2)

Examples

>>> extractor = WikiXmlDumpExtractor("dump.xml.bz2")
>>> for page in extractor.iter_pages():
...     print(page.title)
>>> # Process pages in batches
>>> for batch in extractor.iter_page_batches(batch_size=100):
...     process_batch(batch)
extract_pages_to_new_xml(output_file: str | Path, limit: int | None = 50)[source]

Create a smaller XML dump file by extracting a limited number of pages.

This is useful for debugging, testing, creating examples, etc.

Parameters

output_file : str | Path

Path where to save the output XML file. Can be a .xml or .xml.bz2 file.

limit : int | None, optional

Maximum number of pages to extract, by default 50

Utility Modules

Date Utilities

class wiki_dump_extractor.date_utils.DashYMDFormat[source]

Bases: DateFormat

Format for YYYY-MM-DD dates.

classmethod match_to_date(match: Match) Date[source]

Convert a regex match to a datetime object.

Parameters

match : re.Match

The regex match object containing the date information

Returns

datetime

The parsed datetime object

Raises

ValueError

If the match cannot be converted to a valid datetime

name: ClassVar[str] = 'DASH_YMD'
pattern: ClassVar[Pattern] = re.compile('\\b(\\d{1,4})[-/](\\d{1,2})[-/](\\d{1,2})(?:\\s+(BC|BCE))?\\b', re.IGNORECASE)
class wiki_dump_extractor.date_utils.Date(year: int, month: int | None = None, day: int | None = None, is_approximate: bool = False)[source]

Bases: object

day: int | None
is_approximate: bool
month: int | None
to_dict() Dict[source]
to_string() str[source]
validate()[source]
year: int
class wiki_dump_extractor.date_utils.DateFormat[source]

Bases: ABC

Base class for all date format detectors.

classmethod convert_month_to_number(month_name: str) int[source]

Convert month name to its numerical representation.

Raises

ValueError

If the month name is not recognized

classmethod list_dates(text: str) bool[source]

Check if the text contains any dates detected by regex patterns.

abstractmethod classmethod match_to_date(match: Match) datetime[source]

Convert a regex match to a datetime object.

Parameters

match : re.Match

The regex match object containing the date information

Returns

datetime

The parsed datetime object

Raises

ValueError

If the match cannot be converted to a valid datetime

name: ClassVar[str]
pattern: ClassVar[Pattern]
class wiki_dump_extractor.date_utils.DateRange(start: wiki_dump_extractor.date_utils.Date, end: wiki_dump_extractor.date_utils.Date)[source]

Bases: object

end: Date
classmethod from_parsed_string(date: str) DateRange[source]

Parse a string representation of a date or date range into a DateRange object.

Examples:

1810 -> ~1810/01/01 - ~1810/12/31
1810-1812 -> ~1810/01/01 - ~1812/12/31
1810/1812 -> ~1810/01/01 - ~1812/12/31
1810/03/05 -> 1810/03/05 - 1810/03/05
1810/03 -> ~1810/03/01 - ~1810/03/31
1810/03 - 1812/05 -> ~1810/03/01 - ~1812/05/31
1810/03/05 - 1812/05/07 -> 1810/03/05 - 1812/05/07
1611/1612 - 1615/1617 -> ~1611/01/01 - ~1617/12/31
1930s - 1940s -> ~1930/01/01 - ~1949/12/31
1930s -> ~1930/01/01 - ~1939/12/31

start: Date
to_dict() Dict[source]
to_string() str[source]
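A rough sketch of the year and decade cases listed in from_parsed_string (illustrative only; the real parser covers many more forms, such as month and day ranges):

```python
def parse_year_or_decade_sketch(s):
    # Return (start_year, end_year) for inputs like "1810",
    # "1810-1812" or "1930s" (a decade spans ten years).
    s = s.strip()
    if s.endswith("s") and s[:-1].isdigit():
        start = int(s[:-1])
        return start, start + 9
    if "-" in s:
        first, last = s.split("-", 1)
        return int(first), int(last)
    return int(s), int(s)
```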
class wiki_dump_extractor.date_utils.DayMonthYearFormat[source]

Bases: DateFormat

Format for DD Month YYYY dates.

classmethod match_to_date(match: Match) Date[source]

Convert a regex match to a datetime object.

Parameters

match : re.Match

The regex match object containing the date information

Returns

datetime

The parsed datetime object

Raises

ValueError

If the match cannot be converted to a valid datetime

name: ClassVar[str] = 'DAY_MONTH_YEAR'
pattern: ClassVar[Pattern] = re.compile(re_dmy, re.IGNORECASE | re.VERBOSE)
re_dmy = "\\b\n        (\\d{1,2})                          # Day (1-2 digits)\n        \\s+\n        (January|February|March|April|May|June|July|August|September|October|November|December|Jan|Feb|Mar|Apr|Jun|Jul|Aug|Sep|Oct|Nov|Dec)               # Month (provided externally)\n        [,\\s]+\n        (?:AD\\s*)?\n        (\\d{1,4})                         # Year (1-4 digits)\n        (?:\\s+(BC|BCE))?                    # Optional ' BC'\n        \\b\n    "
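Since the full re_dmy pattern string is listed above, it can be compiled with the same flags the class uses and tried on a sample sentence:

```python
import re

# Reproduction of the re_dmy string documented above:
re_dmy = r"""\b
    (\d{1,2})                          # Day (1-2 digits)
    \s+
    (January|February|March|April|May|June|July|August|September|October|November|December|Jan|Feb|Mar|Apr|Jun|Jul|Aug|Sep|Oct|Nov|Dec)
    [,\s]+
    (?:AD\s*)?
    (\d{1,4})                         # Year (1-4 digits)
    (?:\s+(BC|BCE))?                  # Optional ' BC'
    \b
"""
DMY_RE = re.compile(re_dmy, re.IGNORECASE | re.VERBOSE)

match = DMY_RE.search("The Bastille fell on 14 July 1789 in Paris.")
```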
class wiki_dump_extractor.date_utils.DetectedDate(date: wiki_dump_extractor.date_utils.Date, format: str, date_str: str)[source]

Bases: object

date: Date
date_str: str
format: str
to_dict() Dict[source]
class wiki_dump_extractor.date_utils.MonthDayYearFormat[source]

Bases: DateFormat

Format for Month DD YYYY dates.

classmethod match_to_date(match: Match) Date[source]

Convert a regex match to a datetime object.

Parameters

match : re.Match

The regex match object containing the date information

Returns

datetime

The parsed datetime object

Raises

ValueError

If the match cannot be converted to a valid datetime

name: ClassVar[str] = 'MONTH_DAY_YEAR'
pattern: ClassVar[Pattern] = re.compile(re_mdy, re.IGNORECASE | re.VERBOSE)
re_mdy = "\n        \\b\n        (January|February|March|April|May|June|July|August|September|October|November|December|Jan|Feb|Mar|Apr|Jun|Jul|Aug|Sep|Oct|Nov|Dec)        # Month name\n        \\s+\n        (\\d{1,2})                # Day (1 or 2 digits)\n        (?:st|nd|rd|th)?           # Optional ordinal suffix\n        [,\\s]+\n        (?:AD\\s*)?\n        (\\d{1,4})                # Year (1 to 4 digits)\n        (?:\\s+(BC|BCE))?           # Optional ' BC'\n        \\b\n    "
class wiki_dump_extractor.date_utils.MonthYearFormat[source]

Bases: DateFormat

Format for Month YYYY dates.

classmethod match_to_date(match: Match) Date[source]

Convert a regex match to a datetime object.

Parameters

match : re.Match

The regex match object containing the date information

Returns

datetime

The parsed datetime object

Raises

ValueError

If the match cannot be converted to a valid datetime

name: ClassVar[str] = 'MONTH_YEAR'
pattern: ClassVar[Pattern] = re.compile('\\b(January|February|March|April|May|June|July|August|September|October|November|December|Jan|Feb|Mar|Apr|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\\s*(?:AD\\s*)?(\\d{2,4})(?:\\s+(BC|BCE))?\\b', re.IGNORECASE)
class wiki_dump_extractor.date_utils.SlashDMYMDYFormat[source]

Bases: DateFormat

Format for DD/MM/YYYY or MM/DD/YYYY dates.

classmethod match_to_date(match: Match) Date[source]

Convert a regex match to a datetime object.

Parameters

match : re.Match

The regex match object containing the date information

Returns

datetime

The parsed datetime object

Raises

ValueError

If the match cannot be converted to a valid datetime

name: ClassVar[str] = 'SLASH_DMY_MDY'
pattern: ClassVar[Pattern] = re.compile('\\B[^|](\\d{1,2})[-/](\\d{1,2})[-/](\\d{1,4})(?:\\s+(BC|BCE))?\\b', re.IGNORECASE)
class wiki_dump_extractor.date_utils.WikiDateFormat[source]

Bases: DateFormat

Format for {{Birth date|YYYY|MM|DD|…}}.

classmethod match_to_date(match: Match) datetime[source]

Convert a regex match to a datetime object.

Parameters

match : re.Match

The regex match object containing the date information

Returns

datetime

The parsed datetime object

Raises

ValueError

If the match cannot be converted to a valid datetime

name: ClassVar[str] = 'WIKI_BIRTH_DATE'
pattern: ClassVar[Pattern] = re.compile('{{[^|]*\\|(\\d{1,4})\\|(\\d{1,2})\\|(\\d{1,2}).*}}', re.IGNORECASE)
class wiki_dump_extractor.date_utils.WrittenDateFormat[source]

Bases: DateFormat

Format for Month the day, year dates.

classmethod match_to_date(match: Match) datetime[source]

Convert a regex match to a datetime object.

Parameters

match : re.Match

The regex match object containing the date information

Returns

datetime

The parsed datetime object

Raises

ValueError

If the match cannot be converted to a valid datetime

name: ClassVar[str] = 'WRITTEN_DATE'
pattern: ClassVar[Pattern] = re.compile('\\b(January|February|March|April|May|June|July|August|September|October|November|December|Jan|Feb|Mar|Apr|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\\s+the\\s+(?:(\\d{1,2})(?:st|nd|rd|th)?|([a-z]+))[,\\s]+(\\d{1,4})(?:\\s+(BC|BCE))?\\b', re.IGNORECASE)
class wiki_dump_extractor.date_utils.YearFormat[source]

Bases: DateFormat

Format for YYYY dates.

classmethod match_to_date(match: Match) Date[source]

Convert a regex match to a datetime object.

Parameters

match : re.Match

The regex match object containing the date information

Returns

datetime

The parsed datetime object

Raises

ValueError

If the match cannot be converted to a valid datetime

name: ClassVar[str] = 'YEAR'
pattern: ClassVar[Pattern] = re.compile('\\b(?:c\\.|in|from|to)\\s*(?:AD\\s*)?(\\d{1,4})(?:\\s*(BC|BCE))?[\\s,\\.,\\)]', re.IGNORECASE)
wiki_dump_extractor.date_utils.extract_dates(text: str) List[Dict][source]

Extract dates from text with context information.

Parameters

text : str

The text to extract dates from.

Returns

List[Dict]

A list of dictionaries, each containing:

- 'date_str': the original date string found
- 'format': the name of the date format
- 'datetime': the parsed datetime object (if parsing was successful)

Page Utilities

class wiki_dump_extractor.page_utils.Section(level: int, title: str, text: str = '', children: List[ForwardRef('Section')] = <factory>, parent: Optional[ForwardRef('Section')] = None)[source]

Bases: object

all_subsections_text_dict(text_dict: dict | None = None) dict[source]

Recursively collect text from a section and all its subsections.

Args:

text_dict: Dictionary to store section titles and texts

Returns:

Dictionary mapping section titles to their text content

children: List[Section]
classmethod from_page_section_texts(texts: List[str]) Section[source]

Build a tree of Section objects from a list of section texts.

classmethod from_page_text(text: str) Section[source]

Build a tree of Section objects from a page text.

from_single_section_text() Section[source]

Parse a heading string of the form '== Title ==' or '=== Title ===' and return a Section with the appropriate level and title.

get_section_text_by_title(title: str) str[source]
level: int
parent: Section | None = None
text: str = ''
title: str
property title_with_parents
to_dict()[source]

Convert the Section to a dictionary representation.

with_cleaned_text()[source]
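The recursive collection performed by all_subsections_text_dict can be sketched with a minimal stand-in dataclass (SectionSketch is hypothetical, not the package's class):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class SectionSketch:
    # Hypothetical stand-in for the package's Section dataclass.
    level: int
    title: str
    text: str = ""
    children: List["SectionSketch"] = field(default_factory=list)

    def all_subsections_text_dict(self, text_dict=None):
        # Recursively map each section title to its text content.
        if text_dict is None:
            text_dict = {}
        text_dict[self.title] = self.text
        for child in self.children:
            child.all_subsections_text_dict(text_dict)
        return text_dict

root = SectionSketch(1, "History", "Intro text.",
                     children=[SectionSketch(2, "Origins", "Early days.")])
```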
wiki_dump_extractor.page_utils.extract_categories(text: str) List[str][source]

Extract categories from Wikipedia text.

Parameters

text : str

The Wikipedia page text to extract categories from.

Returns

List[str]

A list of category names with spaces normalized and sorted alphabetically.
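A minimal sketch of this kind of category extraction (illustrative; the package's implementation may differ):

```python
import re

def extract_categories_sketch(text):
    # Find [[Category:Name]] links, normalize underscores to spaces,
    # and return the names sorted alphabetically.
    names = re.findall(r"\[\[Category:([^\]|]+)", text)
    return sorted(name.replace("_", " ").strip() for name in names)
```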

wiki_dump_extractor.page_utils.extract_filenames(wiki_text)[source]

Extract the filename from a MediaWiki file link using regular expressions.

Args:

wiki_text (str): The MediaWiki file link text

Yields:

str: Each extracted filename found in the text

wiki_dump_extractor.page_utils.extract_geospatial_coordinates(text: str) Tuple[float, float] | None[source]

Return geographical coordinates (latitude, longitude) from Wikipedia page text.

Parameters

text : str

The Wikipedia page text to extract coordinates from.

Returns

tuple[float, float] | None

The geographical coordinates (latitude, longitude) or None if no coordinates are found.
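A simplified sketch of coordinate extraction for the degrees-minutes {{coord}} form (real pages use many more template variants, such as decimal degrees or degrees-minutes-seconds, which this does not handle):

```python
import re

def parse_coord_sketch(text):
    # Handle only the {{coord|D|M|N/S|D|M|E/W|...}} degrees-minutes form.
    m = re.search(
        r"\{\{coord\|(\d+)\|(\d+)\|([NS])\|(\d+)\|(\d+)\|([EW])",
        text, re.IGNORECASE,
    )
    if m is None:
        return None
    lat = int(m.group(1)) + int(m.group(2)) / 60
    lon = int(m.group(4)) + int(m.group(5)) / 60
    if m.group(3).upper() == "S":
        lat = -lat
    if m.group(6).upper() == "W":
        lon = -lon
    return lat, lon
```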

wiki_dump_extractor.page_utils.extract_infobox_category(text: str) str | None[source]

Extract the broad category from the infobox of a Wikipedia page.

Parameters

text : str

The Wikipedia page text to extract the infobox category from.

Extract all links of the form [[true page|text]] into a dict of the form {text: true page}.

wiki_dump_extractor.page_utils.get_short_description(text: str) str[source]

Return the short description of the page.

wiki_dump_extractor.page_utils.parse_infobox(page_text: str) Tuple[dict, str][source]

Parse the infobox from a Wikipedia page text.

Example of an infobox. The parser recognizes the "{{Infobox" pattern, then parses the fields starting with "|" as key-value pairs.

{{Infobox military conflict
| conflict = First Battle of the Marne
| partof = the [[Western Front (World War I)|Western Front]] of [[World War I]]
| image = German soldiers Battle of Marne WWI.jpg
| image_size = 300
| caption = German soldiers (wearing distinctive [[pickelhaube]])
| date = 5–14 September 1914
| place = [[Marne River]] near [[Brasles]], east of Paris
| coordinates = {{coord|49|1|N|3|23|E|region:FR_type:event|display= inline}}
| result = Allied victory
| territory = German advance to Paris repulsed
}}

Parameters

page_text : str

The Wikipedia page text to extract the infobox from.

Returns

Tuple[dict, str]

The infobox as a dictionary, and the remaining page text.
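Splitting the infobox into key-value pairs has to skip "|" characters inside nested templates and links; a minimal sketch of that top-level split (not the package's implementation):

```python
def parse_infobox_fields_sketch(infobox_text):
    # Split the infobox body on '|' at nesting depth zero, tracking
    # {{ }} and [[ ]] so nested templates and links stay intact.
    body = infobox_text.strip()[2:-2]  # drop outer {{ and }}
    depth, parts, current, i = 0, [], [], 0
    while i < len(body):
        pair = body[i:i + 2]
        if pair in ("{{", "[["):
            depth += 1
            current.append(pair)
            i += 2
        elif pair in ("}}", "]]"):
            depth -= 1
            current.append(pair)
            i += 2
        elif body[i] == "|" and depth == 0:
            parts.append("".join(current))
            current = []
            i += 1
        else:
            current.append(body[i])
            i += 1
    parts.append("".join(current))
    fields = {}
    for part in parts[1:]:  # parts[0] is the "Infobox ..." header
        if "=" in part:
            key, _, value = part.partition("=")
            fields[key.strip()] = value.strip()
    return fields
```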

wiki_dump_extractor.page_utils.remove_appendix_sections(text: str) str[source]

Remove sections like References, Notes, etc. from the text.

wiki_dump_extractor.page_utils.remove_comments_and_citations(text: str) str[source]

Return the text without comments and citations.

wiki_dump_extractor.page_utils.replace_nsbp_by_spaces(text: str) str[source]

Replace non-breaking spaces in the text with regular spaces.

wiki_dump_extractor.page_utils.replace_titles_with_section_headers(text)[source]

Download Utilities

wiki_dump_extractor.download_utils.download_file(url, filepath, replace=False)[source]

Download a web file to a filepath, with the option to skip.

LLM Utilities

SQL Extractor

class wiki_dump_extractor.wiki_sql_extractor.WikiSqlExtractor(file_path)[source]

Bases: object

Extracts data from a Wikipedia SQL dump.

Parameters

file_path : str or Path

The path to the Wikipedia SQL dump file.

Examples

>>> extractor = WikiSqlExtractor("enwiki-20240301-page.sql.gz")
>>> df = extractor.to_pandas_dataframe(columns=[...])
to_pandas_dataframe(columns=None, max_rows=None, row_filter=None)[source]

Reads the data from the database and returns a pandas DataFrame.

This is optimized for low memory consumption.

Parameters

columns : list, optional

A list of columns to include in the DataFrame.

max_rows : int, optional

The maximum number of rows to read from the database.

row_filter : callable, optional

A function that takes a record and returns True if the record should be included in the DataFrame.
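Rows in a MediaWiki SQL dump arrive as large INSERT statements; a naive sketch of pulling value tuples out of one such line (it ignores escaping and commas inside quoted string values, which a real parser must handle):

```python
import re

def iter_insert_rows_sketch(sql_line):
    # Yield raw value tuples from a dump INSERT statement. Naive:
    # assumes no parentheses or commas inside quoted string values.
    for group in re.findall(r"\((.*?)\)", sql_line):
        yield tuple(value.strip().strip("'") for value in group.split(","))

rows = list(iter_insert_rows_sketch(
    "INSERT INTO `page` VALUES (1,'Main_Page',0),(2,'Sandbox',0);"
))
```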