Utilities
This module implements a set of utilities for extracting labeling events, text and features from the command line. When the articlequality Python package is installed, an articlequality utility should be available from the command line. Run articlequality -h for more information:
Article Quality CLI
articlequality
$ articlequality -h
This script provides access to a set of utilities for extracting features
and building article quality classifiers.
* extract_labelings -- Gathers quality labeling events from XML dumps
* extract_text -- Gathers text for each labeling observation from XML dumps
* extract_features -- Extracts feature_lists for observations
* fetch_item_info -- Gets interesting statements from wikidata items
* fetch_text -- Gathers text for each labeling observation from a MediaWiki
API
Usage:
articlequality (-h | --help)
articlequality <utility> [-h | --help]
Options:
-h | --help Prints this documentation
<utility> The name of the utility to run
Sub-utilities
extract_from_text
$ articlequality extract_from_text -h
Extracts dependents from a labeling doc containing text and a
`wp10` label and writes a new set of labeling docs that is
compatible as observations for `revscoring`'s cv_train and tune utilities.
Input: { ... "wp10": ..., "text": ..., ... }
Output: { ... "wp10": ..., "cache": ..., ... }
Usage:
extract_from_text <dependent>...
[--input=<path>]
[--output=<path>]
[--extractors=<num>]
[--verbose]
[--debug]
Options:
-h --help Print this documentation
<dependent> Classpath to a single dependent or list of
dependent values to solve
--input=<path> Path to a file containing observations
[default: <stdin>]
--output=<path> Path to a file to write new observations to
[default: <stdout>]
--extractors=<num> The number of parallel extractors to
start [default: <cpu count>]
--verbose Print dots and stuff to stderr
--debug Print debug logs
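The Input/Output shapes shown in the help text suggest that each observation is a JSON document with a `wp10` label and a `text` field. The following is a minimal sketch of preparing such input, assuming observations are stored one JSON object per line (an assumption; the help text does not spell out the framing). The helper name `write_observations` and the sample labels and text are hypothetical.

```python
import io
import json


def write_observations(observations, out):
    """Write each observation as one JSON document per line.

    Line-delimited JSON is an assumption about the expected input framing.
    """
    for obs in observations:
        out.write(json.dumps(obs) + "\n")


# Hypothetical sample data; the labels and text are made up for illustration.
sample = [
    {"wp10": "Stub", "text": "A very short article."},
    {"wp10": "FA", "text": "A long, well-referenced article."},
]

buffer = io.StringIO()
write_observations(sample, buffer)
lines = buffer.getvalue().splitlines()

# Each line parses back into a document with the two expected fields.
for line in lines:
    doc = json.loads(line)
    print(sorted(doc.keys()))
```

In practice the buffer would be replaced by a file handle whose path is passed to `--input`, or by a pipe feeding the utility's standard input.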
extract_labelings
$ articlequality extract_labelings -h
Extracts labels from an XML dump and writes out labeled observations for
each change in assessment class. Will match extraction method to the dump.
Usage:
extract_labelings <dump-file>... [--extractor=<name>] [--threads=<num>]
[--output=<path>] [--verbose]
[--debug]
extract_labelings -h | --help
Options:
-h --help Show this screen.
<dump-file> An XML dump file to process
--extractor=<name> The dbname of the wiki extractor to use
(e.g. 'enwiki') [default: <match>]
--threads=<num> If a collection of files is provided, how many
processor threads should be prepared?
[default: <cpu_count>]
--output=<path> The path to a file to dump observations to
[default: <stdout>]
--verbose Prints dots to <stderr>
--debug Print debug level logging
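When the utility is driven from a script, it can help to assemble the argument list programmatically. The sketch below only builds and prints a command line from the options documented above; it does not run anything. The function name `build_extract_labelings_cmd` and the file names are hypothetical.

```python
import shlex


def build_extract_labelings_cmd(dump_files, extractor=None, threads=None,
                                output=None, verbose=False):
    """Build an argv list for `articlequality extract_labelings`.

    Only flags shown in the usage text are emitted; omitted options fall
    back to the utility's own defaults.
    """
    if not dump_files:
        raise ValueError("at least one dump file is required")
    cmd = ["articlequality", "extract_labelings", *dump_files]
    if extractor is not None:
        cmd.append(f"--extractor={extractor}")
    if threads is not None:
        cmd.append(f"--threads={threads}")
    if output is not None:
        cmd.append(f"--output={output}")
    if verbose:
        cmd.append("--verbose")
    return cmd


# Hypothetical dump and output file names.
cmd = build_extract_labelings_cmd(
    ["dump1.xml.bz2", "dump2.xml.bz2"],
    extractor="enwiki",
    threads=2,
    output="labelings.json",
)
print(shlex.join(cmd))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` on a machine where the articlequality package is installed.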
extract_text
$ articlequality extract_text -h
Extracts text & metadata for labelings using XML dumps.
Usage:
extract_text <dump-file>... [--labelings=<path>] [--output=<path>]
[--threads=<num>] [--verbose]
extract_text -h | --help
Options:
-h --help Show this screen.
<dump-file> An XML dump file to process
--labelings=<path> The path to a file containing labeling events.
[default: <stdin>]
--output=<path> The path to a file to dump observations to
[default: <stdout>]
--threads=<num> If a collection of files is provided, how many
processor threads should be prepared?
[default: <cpu_count>]
--verbose Prints dots to <stderr>
fetch_text
$ articlequality fetch_text -h
Fetches text & metadata for labelings using a MediaWiki API.
Usage:
fetch_text --api-host=<url> [--labelings=<path>] [--output=<path>]
[--verbose]
Options:
-h --help Show this documentation.
--api-host=<url> The hostname of a MediaWiki e.g.
"https://en.wikipedia.org"
--labelings=<path> Path to a file containing observations with extracted
labels. [default: <stdin>]
--output=<path> Path to a file to write new observations
(with text) out to. [default: <stdout>]
--verbose Prints dots and stuff to stderr
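Because extract_labelings writes to standard output by default and fetch_text reads labelings from standard input by default, the two can plausibly be chained with a pipe. The sketch below shows one way to do that from Python. The helper names, the dump file name, and the assumption that the two utilities compose this way are illustrative and not taken from the package itself.

```python
import subprocess


def labelings_to_text_cmds(dump_files, api_host):
    """Return the argv lists for the two stages of the pipeline."""
    extract = ["articlequality", "extract_labelings", *dump_files]
    fetch = ["articlequality", "fetch_text", f"--api-host={api_host}"]
    return extract, fetch


def run_pipeline(dump_files, api_host, output_path):
    """Pipe extract_labelings into fetch_text and save the result.

    Requires the articlequality package to be installed; not called here.
    """
    extract_cmd, fetch_cmd = labelings_to_text_cmds(dump_files, api_host)
    with open(output_path, "w") as out:
        extract = subprocess.Popen(extract_cmd, stdout=subprocess.PIPE)
        fetch = subprocess.Popen(fetch_cmd, stdin=extract.stdout, stdout=out)
        # Close our copy of the pipe so the first stage sees a broken pipe
        # if the second stage exits early.
        extract.stdout.close()
        fetch_rc = fetch.wait()
        extract_rc = extract.wait()
    if extract_rc != 0 or fetch_rc != 0:
        raise RuntimeError(
            f"pipeline failed (extract={extract_rc}, fetch={fetch_rc})"
        )


# Hypothetical dump file name; the API host is the example from the help text.
extract_cmd, fetch_cmd = labelings_to_text_cmds(
    ["enwiki-dump.xml.bz2"], "https://en.wikipedia.org"
)
print(extract_cmd)
print(fetch_cmd)
```

Writing the intermediate labelings to a file and passing it via `--labelings` is an equally valid choice and makes it easier to inspect or reuse the extracted labels.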
score
$ articlequality score -h
Applies a scoring model to a chunk of text.
Usage:
score <model-file> [<text>]
score -h | --help
Options:
-h --help Prints this documentation
<model-file> The path to a scorer_model file to use
<text> The path to a file containing text to score
[default: <stdin>]