Wiki Dump Extractor Reference¶
This section contains the API reference for the Wiki Dump Extractor package.
Main Extractor¶
- class wiki_dump_extractor.wiki_dump_extractor.ExtractorBase[source]¶
Bases:
ABC
- extract_disambiguation_page_titles(output_file: str | Path, page_limit: int = None)[source]¶
Extract disambiguation pages from the dump file.
Parameters¶
- output_file : str
Path where to save the output Avro file.
- page_limit : int, optional
Maximum number of pages to extract, by default None
- extract_pages_to_avro(output_file: str | Path, redirects_db_path: str | Path | None = None, ignored_fields: List[str] = None, fields: List[str] = None, batch_size: int = 10000, page_limit: int = None, codec: str = 'zstandard', page_filter: Callable[[Page], bool] | None = None)[source]¶
Convert the XML dump file to an Avro file.
Parameters¶
- output_file : str
Path where to save the output Avro file.
- redirects_db_path : str | Path, optional
Path where to save the redirects LMDB database.
- ignored_fields : List[str], optional
Fields to ignore, by default [“text”]
- batch_size : int, optional
Number of pages per batch, by default 10_000
- page_limit : int, optional
Maximum number of pages to extract, by default None
- codec : str, optional
Codec to use for compression, by default “zstandard”
- page_filter : Callable[[Page], bool], optional
A function that takes a Page object and returns a boolean. If the function returns False, the page will not be included in the Avro file.
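A minimal sketch of what a page_filter callable could look like. The Page dataclass below is only a stand-in that mirrors some of the documented Page fields, and is_article with its title-prefix heuristic is a hypothetical example, not part of the package.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Page:
    """Stand-in mirroring some documented Page fields (not the package's class)."""
    title: str = ""
    text: str = ""
    page_id: int = 0
    redirect_title: Optional[str] = None


def is_article(page: Page) -> bool:
    """Return True for pages that are neither redirects nor namespace-prefixed."""
    if page.redirect_title is not None:
        return False
    # Hypothetical heuristic: skip titles such as "Category:..." or "Template:..."
    return not page.title.startswith(("Category:", "Template:", "File:"))


pages = [
    Page(title="Marne River", text="..."),
    Page(title="Category:Rivers of France"),
    Page(title="Marne", redirect_title="Marne River"),
]
kept = [page.title for page in pages if is_article(page)]
print(kept)  # ['Marne River']
```

A function of this shape would be passed as page_filter=is_article to extract_pages_to_avro, iter_pages, or iter_page_batches.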
- iter_page_batches(batch_size: int, page_limit: int | None = None, page_filter: Callable[[Page], bool] | None = None) Iterator[List[Page]] [source]¶
Iterate over pages in batches.
Each return is a list of Page objects with fields title, page_id, timestamp, redirect_title, revision_id, and text.
This method iterates over the pages in the dump file and yields batches of pages. If a page limit is provided, iteration stops once that many pages have been returned.
Parameters¶
- batch_size : int
The number of pages per batch.
- page_limit : int | None, optional
The maximum number of pages to return.
- page_filter : Callable[[Page], bool], optional
A function that takes a Page object and returns a boolean. If the function returns False, the page will not be included in the batch.
Returns¶
- Iterator[list[Page]]
An iterator over lists of pages.
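The batching behaviour described above can be sketched with the standard library alone. This is an illustration of the semantics, not the package's implementation: iter_batches is a hypothetical helper, integers stand in for Page objects, and it assumes the limit counts items that pass the filter.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List, Optional


def iter_batches(
    items: Iterable[int],
    batch_size: int,
    limit: Optional[int] = None,
    keep: Optional[Callable[[int], bool]] = None,
) -> Iterator[List[int]]:
    """Yield lists of at most batch_size kept items, stopping after limit items."""
    kept = (item for item in items if keep is None or keep(item))
    limited = islice(kept, limit)  # limit=None means no cap
    while True:
        batch = list(islice(limited, batch_size))
        if not batch:
            return
        yield batch


batches = list(
    iter_batches(range(10), batch_size=3, limit=4, keep=lambda n: n % 2 == 0)
)
print(batches)  # [[0, 2, 4], [6]]
```

The last batch can be shorter than batch_size when the limit or the end of the input is reached.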
- iter_pages(page_limit: int | None = None, page_filter: Callable[[Page], bool] | None = None) Iterator[Page] [source]¶
Iterate over all pages in the dump file.
The returned elements are Page objects with fields title, page_id, timestamp, redirect_title, revision_id, and text.
- process_page_batches_in_parallel(process_fn: Callable[[List[Page], int], Any], num_workers: int, batch_size: int, page_limit: int | None = None, page_filter: Callable[[Page], bool] | None = None, ordered_results: bool = False)[source]¶
Apply a function to each batch of pages in parallel.
This method is useful for speeding up the extraction of pages from the dump file.
Parameters¶
- process_fn : Callable[[List[Page], int], Any]
A function that takes a list of Page objects and an integer and returns any value.
- num_workers : int
The number of workers to use.
- batch_size : int
The number of pages per batch.
- page_limit : int | None, optional
The maximum number of pages to return.
- page_filter : Callable[[Page], bool], optional
A function that takes a Page object and returns a boolean. If the function returns False, the page will not be included in the batch.
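A rough sketch of the shape a process_fn could take. It assumes the integer argument is a batch index, which the reference does not state, and it uses a standard-library thread pool with plain strings in place of Page objects purely for illustration; the package's own worker mechanism may differ.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple


def count_words(batch: List[str], batch_index: int) -> Tuple[int, int]:
    """Return the (assumed) batch index and the total word count of the batch."""
    return batch_index, sum(len(text.split()) for text in batch)


# Strings stand in for page texts; each inner list is one batch.
batches = [["a b c", "d e"], ["f"], ["g h", "i j k l"]]

with ThreadPoolExecutor(max_workers=2) as pool:
    # Executor.map returns results in input order, whichever worker finishes first.
    results = list(pool.map(count_words, batches, range(len(batches))))

print(results)  # [(0, 5), (1, 1), (2, 6)]
```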
- class wiki_dump_extractor.wiki_dump_extractor.Page(title: str = '', text: str = '', page_id: int = 0, timestamp: datetime = None, redirect_title: str | None = None, revision_id: str = '')[source]¶
Bases:
object
Represents a page in the Wikipedia dump.
Attributes¶
- page_id: int
The ID of the page.
- title: str
The title of the page.
- timestamp: datetime
The timestamp of the page.
- redirect_title: str | None
The title of the redirect target if the page is a redirect, otherwise None.
- revision_id: str
The ID of the revision.
- text: str
The text of the page.
- page_id: int = 0¶
- redirect_title: str | None = None¶
- revision_id: str = ''¶
- text: str = ''¶
- timestamp: datetime = None¶
- title: str = ''¶
- class wiki_dump_extractor.wiki_dump_extractor.WikiAvroDumpExtractor(file_path: str, index_dir: str | Path | None = None)[source]¶
Bases:
ExtractorBase
- extract_pages_titles_to_new_dump(page_titles: List[str], output_avro_file: str | Path, replace_file: bool = True, batch_size: int = 1000, redirects_env: Environment | None = None, ignore_titles_not_found: bool = False)[source]¶
Extract a list of pages by their titles to a new Avro file.
Parameters¶
- page_titles : List[str]
The titles of the pages to extract.
- output_avro_file : str
Path where to save the output Avro file.
- replace_file : bool, optional
Whether to replace the output file if it exists.
- batch_size : int, optional
Size of batches to process at once.
- redirects_env : lmdb.Environment, optional
LMDB environment containing redirects mapping.
- ignore_titles_not_found : bool, optional
Whether to ignore titles that are not found in the index.
- get_page_batch_by_title(titles: List[str], redirects_env: Environment | None = None, ignore_titles_not_found: bool = False) List[Page] [source]¶
Get a batch of pages by their titles.
Parameters¶
- titles : List[str]
The titles of the pages to get.
- redirects_env : lmdb.Environment, optional
LMDB environment containing redirects mapping.
- ignore_titles_not_found : bool, optional
Whether to ignore titles that are not found in the index.
- get_page_by_title(title: str) Page [source]¶
Get a page by its title.
Parameters¶
- title : str
The title of the page to get.
- class wiki_dump_extractor.wiki_dump_extractor.WikiXmlDumpExtractor(file_path: str | Path)[source]¶
Bases:
ExtractorBase
A class for extracting pages from a MediaWiki XML dump file. This class provides functionality to parse and extract pages from MediaWiki XML dump files, which can be either uncompressed (.xml) or bzip2 compressed (.xml.bz2). It handles the XML namespace detection automatically and provides iterators for processing pages individually or in batches.
Parameters¶
- file_path : str | Path
Path to the MediaWiki XML dump file (.xml or .xml.bz2)
Examples¶
>>> extractor = WikiXmlDumpExtractor("dump.xml.bz2")
>>> for page in extractor.iter_pages():
...     print(page.title)
>>> # Process pages in batches
>>> for batch in extractor.iter_page_batches(batch_size=100):
...     process_batch(batch)
- extract_pages_to_new_xml(output_file: str | Path, limit: int | None = 50)[source]¶
Create a smaller XML dump file by extracting a limited number of pages.
This is useful for debugging, testing, creating examples, etc.
Parameters¶
- output_file : str | Path
Path where to save the output XML file. Can be a .xml or .xml.bz2 file.
- limit : int | None, optional
Maximum number of pages to extract, by default 50
Utility Modules¶
Date Utilities¶
- class wiki_dump_extractor.date_utils.DashYMDFormat[source]¶
Bases:
DateFormat
Format for YYYY-MM-DD dates.
- classmethod match_to_date(match: Match) Date [source]¶
Convert a regex match to a datetime object.
Parameters¶
- match : re.Match
The regex match object containing the date information
Returns¶
- datetime
The parsed datetime object
Raises¶
- ValueError
If the match cannot be converted to a valid datetime
- name: ClassVar[str] = 'DASH_YMD'¶
- pattern: ClassVar[Pattern] = re.compile('\\b(\\d{1,4})[-/](\\d{1,2})[-/](\\d{1,2})(?:\\s+(BC|BCE))?\\b', re.IGNORECASE)¶
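As a small illustration, the DASH_YMD pattern shown above can be tried directly with the standard re module. The sample sentences are made up, and the conversion of the captured groups into a Date object is left out because it belongs to match_to_date.

```python
import re

# The pattern exactly as documented for DashYMDFormat.pattern.
DASH_YMD = re.compile(
    r"\b(\d{1,4})[-/](\d{1,2})[-/](\d{1,2})(?:\s+(BC|BCE))?\b",
    re.IGNORECASE,
)

parsed = []
for text in ["Born 1810-03-05 in Poland", "Died 44/03/15 BC"]:
    match = DASH_YMD.search(text)
    year, month, day, era = match.groups()
    # era is None when no BC/BCE suffix follows the date.
    parsed.append((int(year), int(month), int(day), era))

print(parsed)  # [(1810, 3, 5, None), (44, 3, 15, 'BC')]
```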
- class wiki_dump_extractor.date_utils.Date(year: int, month: int | None = None, day: int | None = None, is_approximate: bool = False)[source]¶
Bases:
object
- day: int | None¶
- is_approximate: bool¶
- month: int | None¶
- year: int¶
- class wiki_dump_extractor.date_utils.DateFormat[source]¶
Bases:
ABC
Base class for all date format detectors.
- classmethod convert_month_to_number(month_name: str) int [source]¶
Convert month name to its numerical representation.
Raises¶
- ValueError
If the month name is not recognized
- classmethod list_dates(text: str) bool [source]¶
Check if the text contains any dates detected by regex patterns.
- abstractmethod classmethod match_to_date(match: Match) datetime [source]¶
Convert a regex match to a datetime object.
Parameters¶
- match : re.Match
The regex match object containing the date information
Returns¶
- datetime
The parsed datetime object
Raises¶
- ValueError
If the match cannot be converted to a valid datetime
- name: ClassVar[str]¶
- pattern: ClassVar[Pattern]¶
- class wiki_dump_extractor.date_utils.DateRange(start: wiki_dump_extractor.date_utils.Date, end: wiki_dump_extractor.date_utils.Date)[source]¶
Bases:
object
- classmethod from_parsed_string(date: str) DateRange [source]¶
Parse a string representation of a date or date range into a DateRange object.
Examples:
1810 -> ~1810/01/01 - ~1810/12/31
1810-1812 -> ~1810/01/01 - ~1812/12/31
1810/1812 -> ~1810/01/01 - ~1812/12/31
1810/03/05 -> 1810/03/05 - 1810/03/05
1810/03 -> ~1810/03/01 - ~1810/03/31
1810/03 - 1812/05 -> ~1810/03/01 - ~1812/05/31
1810/03/05 - 1812/05/07 -> 1810/03/05 - 1812/05/07
1611/1612 - 1615/1617 -> ~1611/01/01 - ~1617/12/31
1930s - 1940s -> ~1930/01/01 - ~1949/12/31
1930s -> ~1930/01/01 - ~1939/12/31
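The widening of a partial date into a start and end day can be sketched as follows. This is a simplified illustration, not the package's parser: expand is a hypothetical helper that handles only the single forms YYYY, YYYY/MM and YYYY/MM/DD (not ranges or decades), and it assumes the ~ prefix in the examples corresponds to an approximate date.

```python
import calendar
from typing import Tuple

YMD = Tuple[int, int, int]


def expand(date: str) -> Tuple[YMD, YMD, bool]:
    """Return (start, end, is_approximate) for YYYY, YYYY/MM or YYYY/MM/DD."""
    parts = [int(part) for part in date.strip().split("/")]
    if len(parts) == 1:
        (year,) = parts
        return (year, 1, 1), (year, 12, 31), True
    if len(parts) == 2:
        year, month = parts
        if not 1 <= month <= 12:
            raise ValueError(f"not a month: {month}")
        # monthrange returns (weekday of first day, number of days in month).
        last_day = calendar.monthrange(year, month)[1]
        return (year, month, 1), (year, month, last_day), True
    year, month, day = parts
    return (year, month, day), (year, month, day), False


print(expand("1810"))        # ((1810, 1, 1), (1810, 12, 31), True)
print(expand("1810/03"))     # ((1810, 3, 1), (1810, 3, 31), True)
print(expand("1810/03/05"))  # ((1810, 3, 5), (1810, 3, 5), False)
```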
- class wiki_dump_extractor.date_utils.DayMonthYearFormat[source]¶
Bases:
DateFormat
Format for DD Month YYYY dates.
- classmethod match_to_date(match: Match) Date [source]¶
Convert a regex match to a datetime object.
Parameters¶
- match : re.Match
The regex match object containing the date information
Returns¶
- datetime
The parsed datetime object
Raises¶
- ValueError
If the match cannot be converted to a valid datetime
- name: ClassVar[str] = 'DAY_MONTH_YEAR'¶
- pattern: ClassVar[Pattern] = re.compile("\\b\n (\\d{1,2}) # Day (1-2 digits)\n \\s+\n (January|February|March|April|May|June|July|August|September|October|November|December|Jan|Feb|Mar|Apr|Jun|J, re.IGNORECASE|re.VERBOSE)¶
- re_dmy = "\\b\n (\\d{1,2}) # Day (1-2 digits)\n \\s+\n (January|February|March|April|May|June|July|August|September|October|November|December|Jan|Feb|Mar|Apr|Jun|Jul|Aug|Sep|Oct|Nov|Dec) # Month (provided externally)\n [,\\s]+\n (?:AD\\s*)?\n (\\d{1,4}) # Year (1-4 digits)\n (?:\\s+(BC|BCE))? # Optional ' BC'\n \\b\n "¶
- class wiki_dump_extractor.date_utils.DetectedDate(date: wiki_dump_extractor.date_utils.Date, format: str, date_str: str)[source]¶
Bases:
object
- date_str: str¶
- format: str¶
- class wiki_dump_extractor.date_utils.MonthDayYearFormat[source]¶
Bases:
DateFormat
Format for Month DD YYYY dates.
- classmethod match_to_date(match: Match) Date [source]¶
Convert a regex match to a datetime object.
Parameters¶
- match : re.Match
The regex match object containing the date information
Returns¶
- datetime
The parsed datetime object
Raises¶
- ValueError
If the match cannot be converted to a valid datetime
- name: ClassVar[str] = 'MONTH_DAY_YEAR'¶
- pattern: ClassVar[Pattern] = re.compile("\n \\b\n (January|February|March|April|May|June|July|August|September|October|November|December|Jan|Feb|Mar|Apr|Jun|Jul|Aug|Sep|Oct|Nov|Dec) # Month name\n \\s+\n (, re.IGNORECASE|re.VERBOSE)¶
- re_mdy = "\n \\b\n (January|February|March|April|May|June|July|August|September|October|November|December|Jan|Feb|Mar|Apr|Jun|Jul|Aug|Sep|Oct|Nov|Dec) # Month name\n \\s+\n (\\d{1,2}) # Day (1 or 2 digits)\n (?:st|nd|rd|th)? # Optional ordinal suffix\n [,\\s]+\n (?:AD\\s*)?\n (\\d{1,4}) # Year (1 to 4 digits)\n (?:\\s+(BC|BCE))? # Optional ' BC'\n \\b\n "¶
- class wiki_dump_extractor.date_utils.MonthYearFormat[source]¶
Bases:
DateFormat
Format for Month YYYY dates.
- classmethod match_to_date(match: Match) Date [source]¶
Convert a regex match to a datetime object.
Parameters¶
- match : re.Match
The regex match object containing the date information
Returns¶
- datetime
The parsed datetime object
Raises¶
- ValueError
If the match cannot be converted to a valid datetime
- name: ClassVar[str] = 'MONTH_YEAR'¶
- pattern: ClassVar[Pattern] = re.compile('\\b(January|February|March|April|May|June|July|August|September|October|November|December|Jan|Feb|Mar|Apr|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\\s*(?:AD\\s*)?(\\d{2,4})(?:\\s+(BC|BCE))?\\b', re.IGNORECASE)¶
- class wiki_dump_extractor.date_utils.SlashDMYMDYFormat[source]¶
Bases:
DateFormat
Format for DD/MM/YYYY or MM/DD/YYYY dates.
- classmethod match_to_date(match: Match) Date [source]¶
Convert a regex match to a datetime object.
Parameters¶
- match : re.Match
The regex match object containing the date information
Returns¶
- datetime
The parsed datetime object
Raises¶
- ValueError
If the match cannot be converted to a valid datetime
- name: ClassVar[str] = 'SLASH_DMY_MDY'¶
- pattern: ClassVar[Pattern] = re.compile('\\B[^|](\\d{1,2})[-/](\\d{1,2})[-/](\\d{1,4})(?:\\s+(BC|BCE))?\\b', re.IGNORECASE)¶
- class wiki_dump_extractor.date_utils.WikiDateFormat[source]¶
Bases:
DateFormat
Format for {{Birth date|YYYY|MM|DD|…}}.
- classmethod match_to_date(match: Match) datetime [source]¶
Convert a regex match to a datetime object.
Parameters¶
- match : re.Match
The regex match object containing the date information
Returns¶
- datetime
The parsed datetime object
Raises¶
- ValueError
If the match cannot be converted to a valid datetime
- name: ClassVar[str] = 'WIKI_BIRTH_DATE'¶
- pattern: ClassVar[Pattern] = re.compile('{{[^|]*\\|(\\d{1,4})\\|(\\d{1,2})\\|(\\d{1,2}).*}}', re.IGNORECASE)¶
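A short check of the WIKI_BIRTH_DATE pattern shown above against made-up template text. It also shows one property of the pattern as written: the trailing `.*` is greedy, so on a line with several templates the match runs to the last closing braces.

```python
import re

# The pattern exactly as documented for WikiDateFormat.pattern.
WIKI_BIRTH_DATE = re.compile(
    r"{{[^|]*\|(\d{1,4})\|(\d{1,2})\|(\d{1,2}).*}}",
    re.IGNORECASE,
)

single = WIKI_BIRTH_DATE.search("born {{Birth date|1879|3|14|df=y}} in Ulm")
print(single.groups())  # ('1879', '3', '14')

# Two templates on one line: the greedy ".*" extends the match to the last "}}".
line = "{{Birth date|1879|3|14}} and {{Death date|1955|4|18}}"
double = WIKI_BIRTH_DATE.search(line)
print(double.group(0) == line)  # True
print(double.groups())          # ('1879', '3', '14')
```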
- class wiki_dump_extractor.date_utils.WrittenDateFormat[source]¶
Bases:
DateFormat
Format for Month the day, year dates.
- classmethod match_to_date(match: Match) datetime [source]¶
Convert a regex match to a datetime object.
Parameters¶
- match : re.Match
The regex match object containing the date information
Returns¶
- datetime
The parsed datetime object
Raises¶
- ValueError
If the match cannot be converted to a valid datetime
- name: ClassVar[str] = 'WRITTEN_DATE'¶
- pattern: ClassVar[Pattern] = re.compile('\\b(January|February|March|April|May|June|July|August|September|October|November|December|Jan|Feb|Mar|Apr|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\\s+the\\s+(?:(\\d{1,2})(?:st|nd|rd|th)?|([a-z]+))[,\\s]+(\\d{1,4, re.IGNORECASE)¶
- class wiki_dump_extractor.date_utils.YearFormat[source]¶
Bases:
DateFormat
Format for YYYY dates.
- classmethod match_to_date(match: Match) Date [source]¶
Convert a regex match to a datetime object.
Parameters¶
- match : re.Match
The regex match object containing the date information
Returns¶
- datetime
The parsed datetime object
Raises¶
- ValueError
If the match cannot be converted to a valid datetime
- name: ClassVar[str] = 'YEAR'¶
- pattern: ClassVar[Pattern] = re.compile('\\b(?:c\\.|in|from|to)\\s*(?:AD\\s*)?(\\d{1,4})(?:\\s*(BC|BCE))?[\\s,\\.,\\)]', re.IGNORECASE)¶
- wiki_dump_extractor.date_utils.extract_dates(text: str) List[Dict] [source]¶
Extract dates from text with context information.
Parameters¶
- text : str
The text to extract dates from.
Returns¶
- List[Dict]
A list of dictionaries containing:
- ‘date_str’: The original date string found
- ‘format’: The name of the date format
- ‘datetime’: The parsed datetime object (if parsing was successful)
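To make the shape of these dictionaries concrete, here is a reduced sketch that runs two of the documented patterns over a made-up sentence. find_date_strings is a hypothetical helper, not the package's extract_dates; it fills in only the ‘date_str’ and ‘format’ keys and omits the parsed date.

```python
import re
from typing import Dict, List

# Two of the documented patterns, keyed by their documented format names.
FORMATS = {
    "DASH_YMD": re.compile(
        r"\b(\d{1,4})[-/](\d{1,2})[-/](\d{1,2})(?:\s+(BC|BCE))?\b",
        re.IGNORECASE,
    ),
    "MONTH_YEAR": re.compile(
        r"\b(January|February|March|April|May|June|July|August|September"
        r"|October|November|December|Jan|Feb|Mar|Apr|Jun|Jul|Aug|Sep|Oct"
        r"|Nov|Dec)\s*(?:AD\s*)?(\d{2,4})(?:\s+(BC|BCE))?\b",
        re.IGNORECASE,
    ),
}


def find_date_strings(text: str) -> List[Dict[str, str]]:
    """Collect the matched substring and format name for every pattern hit."""
    found = []
    for name, pattern in FORMATS.items():
        for match in pattern.finditer(text):
            found.append({"date_str": match.group(0), "format": name})
    return found


hits = find_date_strings("Signed 1810-03-05 and ratified in March 1811.")
print(hits)
```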
Page Utilities¶
- class wiki_dump_extractor.page_utils.Section(level: int, title: str, text: str = '', children: List[ForwardRef('Section')] = <factory>, parent: Optional[ForwardRef('Section')] = None)[source]¶
Bases:
object
- all_subsections_text_dict(text_dict: dict | None = None) dict [source]¶
Recursively collect text from a section and all its subsections.
- Args:
text_dict: Dictionary to store section titles and texts
- Returns:
Dictionary mapping section titles to their text content
- classmethod from_page_section_texts(texts: List[str]) Section [source]¶
Build a tree of Section objects from a list of section texts.
- classmethod from_page_text(text: str) Section [source]¶
Build a tree of Section objects from a page text.
- from_single_section_text() Section [source]¶
Parse a heading string of the form ‘== Title ==’ or ‘=== Title ===’ and return a Section with the appropriate level and title.
- level: int¶
- text: str = ''¶
- title: str¶
- property title_with_parents¶
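A minimal sketch of parsing a wikitext heading line of the form described for from_single_section_text. parse_heading is a hypothetical helper, and it assumes the level equals the number of equals signs on each side, which the reference does not specify.

```python
import re
from typing import Tuple

# Matching runs of "=" on both sides, with the title in between.
HEADING = re.compile(r"^(={2,6})\s*(.*?)\s*\1\s*$")


def parse_heading(line: str) -> Tuple[int, str]:
    """Return (level, title) for a heading such as '== Title =='."""
    match = HEADING.match(line.strip())
    if match is None:
        raise ValueError(f"not a heading: {line!r}")
    # Assumption: level is the count of '=' characters on one side.
    return len(match.group(1)), match.group(2)


print(parse_heading("== History =="))       # (2, 'History')
print(parse_heading("=== Early life ==="))  # (3, 'Early life')
```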
- wiki_dump_extractor.page_utils.extract_categories(text: str) List[str] [source]¶
Extract categories from Wikipedia text.
Parameters¶
- text : str
The Wikipedia page text to extract categories from.
Returns¶
- List[str]
A list of category names with spaces normalized and sorted alphabetically.
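The behaviour described above (category names with normalized spaces, sorted alphabetically) can be approximated with a regular expression. This is a sketch under assumptions, not the package's code: find_categories and its pattern are hypothetical, and sort keys after a pipe are simply discarded.

```python
import re
from typing import List

# [[Category:Name]] or [[Category:Name|sort key]]
CATEGORY = re.compile(
    r"\[\[\s*Category\s*:\s*([^\]|]+?)\s*(?:\|[^\]]*)?\]\]",
    re.IGNORECASE,
)


def find_categories(text: str) -> List[str]:
    """Return category names with runs of whitespace collapsed, sorted."""
    names = [" ".join(match.group(1).split()) for match in CATEGORY.finditer(text)]
    return sorted(names)


sample = (
    "[[Category:Rivers  of France]]\n"
    "[[Category:Battles of World War I|Marne]]"
)
print(find_categories(sample))
# ['Battles of World War I', 'Rivers of France']
```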
- wiki_dump_extractor.page_utils.extract_filenames(wiki_text)[source]¶
Extract filenames from MediaWiki file links using regular expressions.
- Args:
wiki_text (str): Text containing MediaWiki file links
- Yields:
str: Each extracted filename found in the text
- wiki_dump_extractor.page_utils.extract_geospatial_coordinates(text: str) Tuple[float, float] | None [source]¶
Return geographical coordinates (latitude, longitude) from Wikipedia page text.
Parameters¶
- text : str
The wikipedia page text to extract coordinates from.
Returns¶
- tuple[float, float] | None
The geographical coordinates (latitude, longitude) or None if no coordinates are found.
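As a worked example of the arithmetic involved, a coord template such as {{coord|49|1|N|3|23|E|...}} gives degrees, minutes and a hemisphere letter for each axis. The hypothetical helper below converts those to signed decimal degrees; it covers only the degrees-and-minutes form and says nothing about how the package parses the template.

```python
def dms_to_decimal(degrees: int, minutes: int, hemisphere: str) -> float:
    """Convert degrees and minutes to signed decimal degrees."""
    value = degrees + minutes / 60
    # South latitudes and west longitudes are negative.
    return -value if hemisphere in ("S", "W") else value


latitude = dms_to_decimal(49, 1, "N")
longitude = dms_to_decimal(3, 23, "E")
print(round(latitude, 4), round(longitude, 4))  # 49.0167 3.3833
```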
- wiki_dump_extractor.page_utils.extract_infobox_category(text: str) str | None [source]¶
Extract the broad category from the infobox of a Wikipedia page.
Parameters¶
- text : str
The wikipedia page text to extract the infobox category from.
- wiki_dump_extractor.page_utils.extract_links(wiki_text)[source]¶
Extract all the links of the form [[true page|text]] in a dict of the form {text: true page}
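The {text: true page} mapping can be illustrated with a short regular-expression sketch. piped_links is a hypothetical helper that handles only piped links; how the package treats unpiped links such as [[World War I]] is not stated here, so the sketch skips them.

```python
import re
from typing import Dict

# [[target page|displayed text]]
PIPED_LINK = re.compile(r"\[\[([^\]|]+)\|([^\]]+)\]\]")


def piped_links(wiki_text: str) -> Dict[str, str]:
    """Map each displayed text to the page it links to."""
    return {label: target for target, label in PIPED_LINK.findall(wiki_text)}


sample = "the [[Western Front (World War I)|Western Front]] of [[World War I]]"
print(piped_links(sample))
# {'Western Front': 'Western Front (World War I)'}
```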
- wiki_dump_extractor.page_utils.get_short_description(text: str) str [source]¶
Return the short description of the page.
- wiki_dump_extractor.page_utils.parse_infobox(page_text: str) Tuple[dict, str] [source]¶
Parse the infobox from a Wikipedia page text.
Example of an infobox. This function recognizes the “{{Infobox” pattern, then parses the fields starting with “|” as key-value pairs.
{{Infobox military conflict
| conflict = First Battle of the Marne
| partof = the [[Western Front (World War I)|Western Front]] of [[World War I]]
| image = German soldiers Battle of Marne WWI.jpg
| image_size = 300
| caption = German soldiers (wearing distinctive [[pickelhaube]
| date = 5–14 September 1914
| place = [[Marne River]] near [[Brasles]], east of Paris
| coordinates = {{coord|49|1|N|3|23|E|region:FR_type:event|display= inline}}
| result = Allied victory
| territory = German advance to Paris repulsed
}}
Parameters¶
- page_text : str
The wikipedia page text to extract the infobox from.
Returns¶
- dict
The infobox as a dictionary.
- wiki_dump_extractor.page_utils.remove_appendix_sections(text: str) str [source]¶
Remove sections like References, Notes, etc. from the text.
- wiki_dump_extractor.page_utils.remove_comments_and_citations(text: str) str [source]¶
Return the text without comments and citations.
Download Utilities¶
LLM Utilities¶
SQL Extractor¶
- class wiki_dump_extractor.wiki_sql_extractor.WikiSqlExtractor(file_path)[source]¶
Bases:
object
Extracts data from a Wikipedia SQL dump.
Parameters¶
- file_path : str or Path
The path to the Wikipedia SQL dump file.
Examples¶
>>> extractor = WikiSqlExtractor("enwiki-20240301-pages-articles-multistream.xml.bz2")
>>> df = extractor.to_pandas_dataframe(columns=[...])
- to_pandas_dataframe(columns=None, max_rows=None, row_filter=None)[source]¶
Reads the data from the database and returns a pandas DataFrame.
This is optimized for memory consumption.
Parameters¶
- row_filter : callable, optional
A function that takes a record and returns True if the record should be included in the DataFrame.
- columns : list, optional
A list of columns to include in the DataFrame.
- max_rows : int, optional
The maximum number of rows to read from the database.