Wiki Dump Extractor Reference

This section contains the API reference for the Wiki Dump Extractor package.

Main Extractor

class wiki_dump_extractor.wiki_dump_extractor.ExtractorBase[source]

Bases: ABC

extract_disambiguation_page_titles(output_file: str | Path, page_limit: int = None)[source]

Extract disambiguation pages from the dump file.

Parameters

output_file : str

Path where to save the output Avro file.

page_limit : int, optional

Maximum number of pages to extract, by default None

extract_pages_to_avro(output_file: str | Path, redirects_db_path: str | Path | None = None, ignored_fields: List[str] = None, fields: List[str] = None, batch_size: int = 10000, page_limit: int = None, codec: str = 'zstandard', page_filter: Callable[[Page], bool] | None = None)[source]

Convert the XML dump file to an Avro file.

Parameters

output_file : str

Path where to save the output Avro file.

redirects_db_path : str | Path, optional

Path where to save the redirects LMDB database.

ignored_fields : List[str], optional

Fields to ignore, by default ["text"]

batch_size : int, optional

Number of pages per batch, by default 10_000

page_limit : int, optional

Maximum number of pages to extract, by default None

codec : str, optional

Codec to use for compression, by default "zstandard"

page_filter : Callable[[Page], bool], optional

A function that takes a Page object and returns a boolean. If the function returns False, the page will not be included in the Avro file.
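A page_filter is just a predicate over Page objects. A minimal sketch, using a SimpleNamespace as a hypothetical stand-in for a real Page instance:

```python
from types import SimpleNamespace

def keep_non_redirects(page) -> bool:
    # Example page_filter: keep only non-redirect pages with non-empty text.
    return page.redirect_title is None and bool(page.text.strip())

# SimpleNamespace stands in for a real Page object here:
article = SimpleNamespace(redirect_title=None, text="Some article text.")
redirect = SimpleNamespace(redirect_title="Other page", text="")
```

Any callable with this shape can be passed as page_filter.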

iter_page_batches(batch_size: int, page_limit: int | None = None, page_filter: Callable[[Page], bool] | None = None) Iterator[List[Page]][source]

Iterate over pages in batches.

Each yielded batch is a list of Page objects with fields title, page_id, timestamp, redirect_title, revision_id, and text.

This method iterates over the pages in the dump file and yields them in batches. If a page limit is provided, iteration stops once the specified number of pages has been returned.

Parameters

batch_size : int

The number of pages per batch.

page_limit : int | None, optional

The maximum number of pages to return.

page_filter : Callable[[Page], bool], optional

A function that takes a Page object and returns a boolean. If the function returns False, the page will not be included in the batch.

Returns

Iterator[list[Page]]

An iterator over lists of pages.
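The batching semantics can be sketched in plain Python with itertools.islice (a generic illustration, not the package's implementation):

```python
from itertools import islice

def batched_sketch(pages, batch_size):
    # Yield successive lists of up to batch_size items, mirroring the
    # semantics of iter_page_batches (a final short batch is allowed).
    iterator = iter(pages)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

batches = list(batched_sketch(range(7), 3))
```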

iter_pages(page_limit: int | None = None, page_filter: Callable[[Page], bool] | None = None) Iterator[Page][source]

Iterate over all pages in the dump file.

The returned elements are Page objects with fields title, page_id, timestamp, redirect_title, revision_id, and text.

process_page_batches_in_parallel(process_fn: Callable[[List[Page], int], Any], num_workers: int, batch_size: int, page_limit: int | None = None, page_filter: Callable[[Page], bool] | None = None, ordered_results: bool = False)[source]

Apply a function to each batch of pages in parallel.

This method is useful for speeding up the extraction of pages from the dump file.

Parameters

process_fn : Callable[[List[Page], int], Any]

A function that takes a list of Page objects and the batch index, and returns any value.

num_workers : int

The number of workers to use.

batch_size : int

The number of pages per batch.

page_limit : int | None, optional

The maximum number of pages to return.

page_filter : Callable[[Page], bool], optional

A function that takes a Page object and returns a boolean. If the function returns False, the page will not be included in the batch.
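The process_fn callback shape (a list of pages plus the batch index) can be illustrated with a thread-based sketch; the real method manages its own worker pool, and plain lists stand in for Page batches here:

```python
from concurrent.futures import ThreadPoolExecutor

def count_batch(batch, batch_index):
    # Matches the process_fn signature: a list of pages and the batch index.
    return batch_index, len(batch)

# Plain lists stand in for batches of Page objects:
batches = [["A", "B"], ["C"], ["D", "E", "F"]]
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(count_batch, batches, range(len(batches))))
```

Executor.map keeps results in submission order, which corresponds to ordered_results=True.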

async process_pages_async(process_fn, num_workers=5, page_limit=None, page_filter=None, ordered=False) AsyncGenerator[Tuple[Page, Any], None][source]
class wiki_dump_extractor.wiki_dump_extractor.Page(title: str = '', text: str = '', page_id: int = 0, timestamp: datetime = None, redirect_title: str | None = None, revision_id: str = '')[source]

Bases: object

Represents a page in the Wikipedia dump.

Attributes

page_id: int

The ID of the page.

title: str

The title of the page.

timestamp: datetime

The timestamp of the page.

redirect_title: str | None

The title of the page if it is a redirect.

revision_id: str

The ID of the revision.

text: str

The text of the page.

classmethod from_xml(elem: Element, namespace: str) Page[source]
classmethod get_avro_schema(ignored_fields=None, fields=None) dict[source]
get_wikipedia_url() str[source]
page_id: int = 0
redirect_title: str | None = None
revision_id: str = ''
text: str = ''
timestamp: datetime = None
title: str = ''
to_dict() dict[source]
class wiki_dump_extractor.wiki_dump_extractor.WikiAvroDumpExtractor(file_path: str, index_dir: str | Path | None = None)[source]

Bases: ExtractorBase

extract_pages_titles_to_new_dump(page_titles: List[str], output_avro_file: str | Path, replace_file: bool = True, batch_size: int = 1000, redirects_env: Environment | None = None, ignore_titles_not_found: bool = False)[source]

Extract a list of pages by their titles to a new Avro file.

Parameters

page_titles : List[str]

The titles of the pages to extract.

output_avro_file : str

Path where to save the output Avro file.

replace_file : bool, optional

Whether to replace the output file if it exists.

batch_size : int, optional

Size of batches to process at once.

redirects_env : lmdb.Environment, optional

LMDB environment containing redirects mapping.

ignore_titles_not_found : bool, optional

Whether to ignore titles that are not found in the index.

get_page_batch_by_title(titles: List[str], redirects_env: Environment | None = None, ignore_titles_not_found: bool = False) List[Page][source]

Get a batch of pages by their titles.

Parameters

titles : List[str]

The titles of the pages to get.

redirects_env : lmdb.Environment, optional

LMDB environment containing redirects mapping.

ignore_titles_not_found : bool, optional

Whether to ignore titles that are not found in the index.

get_page_by_title(title: str) Page[source]

Get a page by its title.

Parameters

title : str

The title of the page to get.

index_pages(index_dir: str | Path)[source]

Index the pages in the Avro file.

Parameters

index_dir : str | Path

Path of the directory where to save the index.

iter_pages_by_title(titles: List[str]) Generator[Page, None, None][source]

Iterate over pages matching the given titles.

Parameters

titles : List[str]

The titles of the pages to iterate over.

class wiki_dump_extractor.wiki_dump_extractor.WikiXmlDumpExtractor(file_path: str | Path)[source]

Bases: ExtractorBase

A class for extracting pages from a MediaWiki XML dump file. This class provides functionality to parse and extract pages from MediaWiki XML dump files, which can be either uncompressed (.xml) or bzip2 compressed (.xml.bz2). It handles XML namespace detection automatically and provides iterators for processing pages individually or in batches.

Parameters

file_path : str | Path

Path to the MediaWiki XML dump file (.xml or .xml.bz2)

Examples

>>> extractor = WikiXmlDumpExtractor("dump.xml.bz2")
>>> for page in extractor.iter_pages():
...     print(page.title)
>>> # Process pages in batches
>>> for batch in extractor.iter_page_batches(batch_size=100):
...     process_batch(batch)
extract_pages_to_new_xml(output_file: str | Path, limit: int | None = 50)[source]

Create a smaller XML dump file by extracting a limited number of pages.

This is useful for debugging, testing, creating examples, etc.

Parameters

output_file : str | Path

Path where to save the output XML file. Can be a .xml or .xml.bz2 file.

limit : int | None, optional

Maximum number of pages to extract, by default 50

Utility Modules

Date Utilities

class wiki_dump_extractor.date_utils.DashYMDFormat[source]

Bases: DateFormat

Format for YYYY-MM-DD dates.

classmethod match_to_date(match: Match) Date[source]

Convert a regex match to a datetime object.

Parameters

match : re.Match

The regex match object containing the date information

Returns

datetime

The parsed datetime object

Raises

ValueError

If the match cannot be converted to a valid datetime

name: ClassVar[str] = 'DASH_YMD'
pattern: ClassVar[Pattern] = re.compile('\\b(\\d{1,4})[-/](\\d{1,2})[-/](\\d{1,2})(?:\\s+(BC|BCE))?\\b', re.IGNORECASE)
class wiki_dump_extractor.date_utils.Date(year: int, month: int | None = None, day: int | None = None, is_approximate: bool = False)[source]

Bases: object

day: int | None
is_approximate: bool
month: int | None
to_dict() Dict[source]
to_string() str[source]
validate()[source]
year: int
class wiki_dump_extractor.date_utils.DateFormat[source]

Bases: ABC

Base class for all date format detectors.

classmethod convert_month_to_number(month_name: str) int[source]

Convert month name to its numerical representation.

Raises

ValueError

If the month name is not recognized

classmethod list_dates(text: str) bool[source]

Check if the text contains any dates detected by regex patterns.

abstractmethod classmethod match_to_date(match: Match) datetime[source]

Convert a regex match to a datetime object.

Parameters

match : re.Match

The regex match object containing the date information

Returns

datetime

The parsed datetime object

Raises

ValueError

If the match cannot be converted to a valid datetime

name: ClassVar[str]
pattern: ClassVar[Pattern]
class wiki_dump_extractor.date_utils.DateRange(start: wiki_dump_extractor.date_utils.Date, end: wiki_dump_extractor.date_utils.Date)[source]

Bases: object

end: Date
classmethod from_parsed_string(date: str) DateRange[source]

Parse a string representation of a date or date range into a DateRange object.

Examples:

1810 -> ~1810/01/01 - ~1810/12/31
1810-1812 -> ~1810/01/01 - ~1812/12/31
1810/1812 -> ~1810/01/01 - ~1812/12/31
1810/03/05 -> 1810/03/05 - 1810/03/05
1810/03 -> ~1810/03/01 - ~1810/03/31
1810/03 - 1812/05 -> ~1810/03/01 - ~1812/05/31
1810/03/05 - 1812/05/07 -> 1810/03/05 - 1812/05/07
1611/1612 - 1615/1617 -> ~1611/01/01 - ~1617/12/31
1930s - 1940s -> ~1930/01/01 - ~1949/12/31
1930s -> ~1930/01/01 - ~1939/12/31

start: Date
to_dict() Dict[source]
to_string() str[source]
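A rough sketch of the year and decade cases listed in from_parsed_string (illustrative only; the real parser covers many more forms, such as month and day ranges):

```python
def parse_year_or_decade_sketch(s):
    # Return (start_year, end_year) for inputs like "1810",
    # "1810-1812" or "1930s" (a decade spans ten years).
    s = s.strip()
    if s.endswith("s") and s[:-1].isdigit():
        start = int(s[:-1])
        return start, start + 9
    if "-" in s:
        first, last = s.split("-", 1)
        return int(first), int(last)
    return int(s), int(s)
```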
class wiki_dump_extractor.date_utils.DayMonthYearFormat[source]

Bases: DateFormat

Format for DD Month YYYY dates.

classmethod match_to_date(match: Match) Date[source]

Convert a regex match to a datetime object.

Parameters

match : re.Match

The regex match object containing the date information

Returns

datetime

The parsed datetime object

Raises

ValueError

If the match cannot be converted to a valid datetime

name: ClassVar[str] = 'DAY_MONTH_YEAR'
pattern: ClassVar[Pattern] = re.compile(re_dmy, re.IGNORECASE | re.VERBOSE)
re_dmy = "\\b\n        (\\d{1,2})                          # Day (1-2 digits)\n        \\s+\n        (January|February|March|April|May|June|July|August|September|October|November|December|Jan|Feb|Mar|Apr|Jun|Jul|Aug|Sep|Oct|Nov|Dec)               # Month (provided externally)\n        [,\\s]+\n        (?:AD\\s*)?\n        (\\d{1,4})                         # Year (1-4 digits)\n        (?:\\s+(BC|BCE))?                    # Optional ' BC'\n        \\b\n    "
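Since the full re_dmy pattern string is listed above, it can be compiled with the same flags the class uses and tried on a sample sentence:

```python
import re

# Reproduction of the re_dmy string documented above:
re_dmy = r"""\b
    (\d{1,2})                          # Day (1-2 digits)
    \s+
    (January|February|March|April|May|June|July|August|September|October|November|December|Jan|Feb|Mar|Apr|Jun|Jul|Aug|Sep|Oct|Nov|Dec)
    [,\s]+
    (?:AD\s*)?
    (\d{1,4})                         # Year (1-4 digits)
    (?:\s+(BC|BCE))?                  # Optional ' BC'
    \b
"""
DMY_RE = re.compile(re_dmy, re.IGNORECASE | re.VERBOSE)

match = DMY_RE.search("The Bastille fell on 14 July 1789 in Paris.")
```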
class wiki_dump_extractor.date_utils.DetectedDate(date: wiki_dump_extractor.date_utils.Date, format: str, date_str: str)[source]

Bases: object

date: Date
date_str: str
format: str
to_dict() Dict[source]
class wiki_dump_extractor.date_utils.MonthDayYearFormat[source]

Bases: DateFormat

Format for Month DD YYYY dates.

classmethod match_to_date(match: Match) Date[source]

Convert a regex match to a datetime object.

Parameters

match : re.Match

The regex match object containing the date information

Returns

datetime

The parsed datetime object

Raises

ValueError

If the match cannot be converted to a valid datetime

name: ClassVar[str] = 'MONTH_DAY_YEAR'
pattern: ClassVar[Pattern] = re.compile(re_mdy, re.IGNORECASE | re.VERBOSE)
re_mdy = "\n        \\b\n        (January|February|March|April|May|June|July|August|September|October|November|December|Jan|Feb|Mar|Apr|Jun|Jul|Aug|Sep|Oct|Nov|Dec)        # Month name\n        \\s+\n        (\\d{1,2})                # Day (1 or 2 digits)\n        (?:st|nd|rd|th)?           # Optional ordinal suffix\n        [,\\s]+\n        (?:AD\\s*)?\n        (\\d{1,4})                # Year (1 to 4 digits)\n        (?:\\s+(BC|BCE))?           # Optional ' BC'\n        \\b\n    "
class wiki_dump_extractor.date_utils.MonthYearFormat[source]

Bases: DateFormat

Format for Month YYYY dates.

classmethod match_to_date(match: Match) Date[source]

Convert a regex match to a datetime object.

Parameters

match : re.Match

The regex match object containing the date information

Returns

datetime

The parsed datetime object

Raises

ValueError

If the match cannot be converted to a valid datetime

name: ClassVar[str] = 'MONTH_YEAR'
pattern: ClassVar[Pattern] = re.compile('\\b(January|February|March|April|May|June|July|August|September|October|November|December|Jan|Feb|Mar|Apr|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\\s*(?:AD\\s*)?(\\d{2,4})(?:\\s+(BC|BCE))?\\b', re.IGNORECASE)
class wiki_dump_extractor.date_utils.SlashDMYMDYFormat[source]

Bases: DateFormat

Format for DD/MM/YYYY or MM/DD/YYYY dates.

classmethod match_to_date(match: Match) Date[source]

Convert a regex match to a datetime object.

Parameters

match : re.Match

The regex match object containing the date information

Returns

datetime

The parsed datetime object

Raises

ValueError

If the match cannot be converted to a valid datetime

name: ClassVar[str] = 'SLASH_DMY_MDY'
pattern: ClassVar[Pattern] = re.compile('\\B[^|](\\d{1,2})[-/](\\d{1,2})[-/](\\d{1,4})(?:\\s+(BC|BCE))?\\b', re.IGNORECASE)
class wiki_dump_extractor.date_utils.WikiDateFormat[source]

Bases: DateFormat

Format for {{Birth date|YYYY|MM|DD|…}}.

classmethod match_to_date(match: Match) datetime[source]

Convert a regex match to a datetime object.

Parameters

match : re.Match

The regex match object containing the date information

Returns

datetime

The parsed datetime object

Raises

ValueError

If the match cannot be converted to a valid datetime

name: ClassVar[str] = 'WIKI_BIRTH_DATE'
pattern: ClassVar[Pattern] = re.compile('{{[^|]*\\|(\\d{1,4})\\|(\\d{1,2})\\|(\\d{1,2}).*}}', re.IGNORECASE)
class wiki_dump_extractor.date_utils.WrittenDateFormat[source]

Bases: DateFormat

Format for Month the day, year dates.

classmethod match_to_date(match: Match) datetime[source]

Convert a regex match to a datetime object.

Parameters

match : re.Match

The regex match object containing the date information

Returns

datetime

The parsed datetime object

Raises

ValueError

If the match cannot be converted to a valid datetime

name: ClassVar[str] = 'WRITTEN_DATE'
pattern: ClassVar[Pattern] = re.compile('\\b(January|February|March|April|May|June|July|August|September|October|November|December|Jan|Feb|Mar|Apr|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\\s+the\\s+(?:(\\d{1,2})(?:st|nd|rd|th)?|([a-z]+))[,\\s]+(\\d{1,4})(?:\\s+(BC|BCE))?\\b', re.IGNORECASE)
class wiki_dump_extractor.date_utils.YearFormat[source]

Bases: DateFormat

Format for YYYY dates.

classmethod match_to_date(match: Match) Date[source]

Convert a regex match to a datetime object.

Parameters

match : re.Match

The regex match object containing the date information

Returns

datetime

The parsed datetime object

Raises

ValueError

If the match cannot be converted to a valid datetime

name: ClassVar[str] = 'YEAR'
pattern: ClassVar[Pattern] = re.compile('\\b(?:c\\.|in|from|to)\\s*(?:AD\\s*)?(\\d{1,4})(?:\\s*(BC|BCE))?[\\s,\\.,\\)]', re.IGNORECASE)
wiki_dump_extractor.date_utils.extract_dates(text: str) List[Dict][source]

Extract dates from text with context information.

Parameters

text : str

The text to extract dates from.

Returns

List[Dict]

A list of dictionaries, each containing:

- 'date_str': the original date string found
- 'format': the name of the date format
- 'datetime': the parsed datetime object (if parsing was successful)

Page Utilities

class wiki_dump_extractor.page_utils.Section(level: int, title: str, text: str = '', children: List[ForwardRef('Section')] = <factory>, parent: Optional[ForwardRef('Section')] = None)[source]

Bases: object

all_subsections_text_dict(text_dict: dict | None = None) dict[source]

Recursively collect text from a section and all its subsections.

Args:

text_dict: Dictionary to store section titles and texts

Returns:

Dictionary mapping section titles to their text content

children: List[Section]
classmethod from_page_section_texts(texts: List[str]) Section[source]

Build a tree of Section objects from a list of section texts.

classmethod from_page_text(text: str) Section[source]

Build a tree of Section objects from a page text.

from_single_section_text() Section[source]

Parse a heading string of the form '== Title ==' or '=== Title ===' and return a Section with the appropriate level and title.

get_section_text_by_title(title: str) str[source]
level: int
parent: Section | None = None
text: str = ''
title: str
property title_with_parents
to_dict()[source]

Convert the Section to a dictionary representation.

with_cleaned_text()[source]
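The recursive collection performed by all_subsections_text_dict can be sketched with a minimal stand-in dataclass (SectionSketch is hypothetical, not the package's class):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class SectionSketch:
    # Hypothetical stand-in for the package's Section dataclass.
    level: int
    title: str
    text: str = ""
    children: List["SectionSketch"] = field(default_factory=list)

    def all_subsections_text_dict(self, text_dict=None):
        # Recursively map each section title to its text content.
        if text_dict is None:
            text_dict = {}
        text_dict[self.title] = self.text
        for child in self.children:
            child.all_subsections_text_dict(text_dict)
        return text_dict

root = SectionSketch(1, "History", "Intro text.",
                     children=[SectionSketch(2, "Origins", "Early days.")])
```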
wiki_dump_extractor.page_utils.extract_categories(text: str) List[str][source]

Extract categories from Wikipedia text.

Parameters

text : str

The Wikipedia page text to extract categories from.

Returns

List[str]

A list of category names with spaces normalized and sorted alphabetically.
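A minimal sketch of this kind of category extraction (illustrative; the package's implementation may differ):

```python
import re

def extract_categories_sketch(text):
    # Find [[Category:Name]] links, normalize underscores to spaces,
    # and return the names sorted alphabetically.
    names = re.findall(r"\[\[Category:([^\]|]+)", text)
    return sorted(name.replace("_", " ").strip() for name in names)
```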

wiki_dump_extractor.page_utils.extract_filenames(wiki_text)[source]

Extract the filename from a MediaWiki file link using regular expressions.

Args:

wiki_text (str): The MediaWiki file link text

Yields:

str: Each extracted filename found in the text

wiki_dump_extractor.page_utils.extract_geospatial_coordinates(text: str) Tuple[float, float] | None[source]

Return geographical coordinates (latitude, longitude) from Wikipedia page text.

Parameters

text : str

The Wikipedia page text to extract coordinates from.

Returns

tuple[float, float] | None

The geographical coordinates (latitude, longitude) or None if no coordinates are found.
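A simplified sketch of coordinate extraction for the degrees-minutes {{coord}} form (real pages use many more template variants, such as decimal degrees or degrees-minutes-seconds, which this does not handle):

```python
import re

def parse_coord_sketch(text):
    # Handle only the {{coord|D|M|N/S|D|M|E/W|...}} degrees-minutes form.
    m = re.search(
        r"\{\{coord\|(\d+)\|(\d+)\|([NS])\|(\d+)\|(\d+)\|([EW])",
        text, re.IGNORECASE,
    )
    if m is None:
        return None
    lat = int(m.group(1)) + int(m.group(2)) / 60
    lon = int(m.group(4)) + int(m.group(5)) / 60
    if m.group(3).upper() == "S":
        lat = -lat
    if m.group(6).upper() == "W":
        lon = -lon
    return lat, lon
```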

wiki_dump_extractor.page_utils.extract_infobox_category(text: str) str | None[source]

Extract the broad category from the infobox of a Wikipedia page.

Parameters

text : str

The Wikipedia page text to extract the infobox category from.

Extract all links of the form [[true page|text]] into a dict of the form {text: true page}.

wiki_dump_extractor.page_utils.get_short_description(text: str) str[source]

Return the short description of the page.

wiki_dump_extractor.page_utils.parse_infobox(page_text: str) Tuple[dict, str][source]

Parse the infobox from a Wikipedia page text.

Example of an infobox. The parser recognizes the "{{Infobox" pattern, then parses the fields starting with "|" as key-value pairs.

{{Infobox military conflict
| conflict = First Battle of the Marne
| partof = the [[Western Front (World War I)|Western Front]] of [[World War I]]
| image = German soldiers Battle of Marne WWI.jpg
| image_size = 300
| caption = German soldiers (wearing distinctive [[pickelhaube]])
| date = 5–14 September 1914
| place = [[Marne River]] near [[Brasles]], east of Paris
| coordinates = {{coord|49|1|N|3|23|E|region:FR_type:event|display= inline}}
| result = Allied victory
| territory = German advance to Paris repulsed
}}

Parameters

page_text : str

The Wikipedia page text to extract the infobox from.

Returns

Tuple[dict, str]

The infobox as a dictionary, and the remaining page text.
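Splitting the infobox into key-value pairs has to skip "|" characters inside nested templates and links; a minimal sketch of that top-level split (not the package's implementation):

```python
def parse_infobox_fields_sketch(infobox_text):
    # Split the infobox body on '|' at nesting depth zero, tracking
    # {{ }} and [[ ]] so nested templates and links stay intact.
    body = infobox_text.strip()[2:-2]  # drop outer {{ and }}
    depth, parts, current, i = 0, [], [], 0
    while i < len(body):
        pair = body[i:i + 2]
        if pair in ("{{", "[["):
            depth += 1
            current.append(pair)
            i += 2
        elif pair in ("}}", "]]"):
            depth -= 1
            current.append(pair)
            i += 2
        elif body[i] == "|" and depth == 0:
            parts.append("".join(current))
            current = []
            i += 1
        else:
            current.append(body[i])
            i += 1
    parts.append("".join(current))
    fields = {}
    for part in parts[1:]:  # parts[0] is the "Infobox ..." header
        if "=" in part:
            key, _, value = part.partition("=")
            fields[key.strip()] = value.strip()
    return fields
```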

wiki_dump_extractor.page_utils.remove_appendix_sections(text: str) str[source]

Remove sections like References, Notes, etc. from the text.

wiki_dump_extractor.page_utils.remove_comments_and_citations(text: str) str[source]

Return the text without comments and citations.

wiki_dump_extractor.page_utils.replace_nsbp_by_spaces(text: str) str[source]

Replace non-breaking spaces in the text with regular spaces.

wiki_dump_extractor.page_utils.replace_titles_with_section_headers(text)[source]

Download Utilities

wiki_dump_extractor.download_utils.download_file(url, filepath, replace=False)[source]

Download a web file to a filepath, with the option to skip.

LLM Utilities

SQL Extractor

class wiki_dump_extractor.wiki_sql_extractor.WikiSqlExtractor(file_path)[source]

Bases: object

Extracts data from a Wikipedia SQL dump.

Parameters

file_path : str or Path

The path to the Wikipedia SQL dump file.

Examples

>>> extractor = WikiSqlExtractor("enwiki-20240301-page.sql.gz")
>>> df = extractor.to_pandas_dataframe(columns=[...])
to_pandas_dataframe(columns=None, max_rows=None, row_filter=None)[source]

Reads the data from the database and returns a pandas DataFrame.

This is optimized for low memory consumption.

Parameters

columns : list, optional

A list of columns to include in the DataFrame.

max_rows : int, optional

The maximum number of rows to read from the database.

row_filter : callable, optional

A function that takes a record and returns True if the record should be included in the DataFrame.
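Rows in a MediaWiki SQL dump arrive as large INSERT statements; a naive sketch of pulling value tuples out of one such line (it ignores escaping and commas inside quoted string values, which a real parser must handle):

```python
import re

def iter_insert_rows_sketch(sql_line):
    # Yield raw value tuples from a dump INSERT statement. Naive:
    # assumes no parentheses or commas inside quoted string values.
    for group in re.findall(r"\((.*?)\)", sql_line):
        yield tuple(value.strip().strip("'") for value in group.split(","))

rows = list(iter_insert_rows_sketch(
    "INSERT INTO `page` VALUES (1,'Main_Page',0),(2,'Sandbox',0);"
))
```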