Utilities

This module implements a set of utilities for extracting labeling events, text, and features from the command line. When the articlequality Python package is installed, an articlequality utility should be available from the command line. Run articlequality -h for more information:

Article Quality CLI

articlequality

$ articlequality -h

This script provides access to a set of utilities for extracting features
and building article quality classifiers.

* extract_labelings -- Gathers quality labeling events from XML dumps
* extract_text -- Gathers text for each labeling observation from XML dumps
* extract_features -- Extracts feature_lists for observations
* fetch_item_info -- Gets interesting statements from wikidata items
* fetch_text -- Gathers text for each labeling observation from a MediaWiki
                API

Usage:
    articlequality (-h | --help)
    articlequality <utility> [-h | --help]

Options:
    -h | --help  Prints this documentation
    <utility>    The name of the utility to run

Sub-utilities

extract_from_text

$ articlequality extract_from_text -h

Extracts dependents from a labeling doc containing text and a
`wp10` label, and writes a new set of labeling docs that are
compatible as observations for `revscoring`'s cv_train and tune utilities.

Input: { ... "wp10": ..., "text": ..., ... }

Output: { ... "wp10": ..., "cache": ..., ... }


Usage:
    extract_from_text <dependent>...
                      [--input=<path>]
                      [--output=<path>]
                      [--extractors=<num>]
                      [--verbose]
                      [--debug]

Options:
    -h --help               Print this documentation
    <dependent>             Classpath to a single dependent or list of
                            dependent values to solve
    --input=<path>          Path to a file containing observations
                            [default: <stdin>]
    --output=<path>         Path to a file to write new observations to
                            [default: <stdout>]
    --extractors=<num>      The number of parallel extractors to
                            start [default: <cpu count>]
    --verbose               Print dots and stuff to stderr
    --debug                 Print debug logs
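
The Input and Output shapes above can be checked before starting a long extraction run. The following is a minimal sketch, assuming observations are stored as one JSON document per line; read_observations and its required parameter are hypothetical names used for illustration and are not part of articlequality.

```python
import json


def read_observations(lines, required=("wp10", "text")):
    """Parse one JSON document per line and check the required fields.

    Assumption: observations are newline-delimited JSON. The helper name
    and the `required` parameter are hypothetical, for illustration only.
    """
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between documents
        doc = json.loads(line)
        missing = [field for field in required if field not in doc]
        if missing:
            raise ValueError(
                "line {0} is missing {1}".format(line_number, ", ".join(missing))
            )
        yield doc


if __name__ == "__main__":
    # Made-up sample data; the label values are placeholders.
    sample = [
        '{"wp10": "stub", "text": "A short article."}',
        '{"wp10": "fa", "text": "A long, well-referenced article."}',
    ]
    for doc in read_observations(sample):
        print(doc["wp10"], len(doc["text"]))
```

Passing required=("wp10", "cache") would run the same check against the output side of the utility.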

extract_labelings

$ articlequality extract_labelings -h

Extracts labels from an XML dump and writes out labeled observations for
each change in assessment class.  Will match extraction method to the dump.

Usage:
    extract_labelings <dump-file>... [--extractor=<name>] [--threads=<num>]
                                     [--output=<path>] [--verbose]
                                     [--debug]
    extract_labelings -h | --help

Options:
    -h --help           Show this screen.
    <dump-file>         An XML dump file to process
    --extractor=<name>  The dbname of the wiki extractor to use
                        (e.g. 'enwiki')  [default: <match>]
    --threads=<num>     If a collection of files is provided, how many
                        processor threads should be prepared?
                        [default: <cpu_count>]
    --output=<path>     The path to a file to dump observations to
                        [default: <stdout>]
    --verbose           Prints dots to <stderr>
    --debug             Print debug level logging
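
The phrase "each change in assessment class" can be illustrated with a small sketch. This is not the package's extractor, only the general idea: walk a page's label history in order and emit an observation whenever the label differs from the previous one. The label_changes name, the (timestamp, label) input shape, and the "timestamp" field are assumptions made for this example; only the "wp10" field comes from the documentation above.

```python
def label_changes(history):
    """Yield one observation per change in assessment class.

    `history` is an iterable of (timestamp, label) pairs in chronological
    order. This input shape and the "timestamp" output field are
    hypothetical; only "wp10" is taken from the documented formats.
    """
    previous = None
    for timestamp, label in history:
        if label != previous:
            # The label changed (or this is the first one seen): record it.
            yield {"timestamp": timestamp, "wp10": label}
            previous = label


if __name__ == "__main__":
    # Made-up history: repeated labels produce no new observation.
    history = [
        ("2010-01-01", "stub"),
        ("2011-01-01", "stub"),
        ("2012-01-01", "start"),
        ("2013-01-01", "c"),
        ("2014-01-01", "c"),
    ]
    for observation in label_changes(history):
        print(observation)
```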

extract_text

$ articlequality extract_text -h

Extracts text & metadata for labelings using XML dumps.

Usage:
    extract_text <dump-file>... [--labelings=<path>] [--output=<path>]
                                [--threads=<num>] [--verbose]
    extract_text -h | --help

Options:
    -h --help           Show this screen.
    <dump-file>         An XML dump file to process
    --labelings=<path>  The path to a file containing labeling events.
                        [default: <stdin>]
    --output=<path>     The path to a file to dump observations to
                        [default: <stdout>]
    --threads=<num>     If a collection of files is provided, how many
                        processor threads should be prepared?
                        [default: <cpu_count>]
    --verbose           Prints dots to <stderr>
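
Because extract_labelings, extract_text and extract_from_text all default to <stdin> and <stdout>, they can in principle be chained with pipes. The sketch below only builds and prints such a pipeline and does not run anything. It assumes each utility's output is accepted as-is by the next one; the dump file name and the dependent classpath are placeholders, and build_pipeline and render are hypothetical helpers.

```python
import shlex


def build_pipeline(dump_files, dependent):
    """Return the argument lists for a three-stage extraction pipeline.

    Only sub-utilities and positional arguments shown in the help text are
    used. `dump_files` and `dependent` are supplied by the caller.
    """
    labelings = ["articlequality", "extract_labelings", *dump_files]
    text = ["articlequality", "extract_text", *dump_files]
    features = ["articlequality", "extract_from_text", dependent]
    return [labelings, text, features]


def render(pipeline):
    """Render the stages as a single shell pipeline string."""
    return " | ".join(shlex.join(argv) for argv in pipeline)


if __name__ == "__main__":
    # Placeholder file name and classpath, not real resources.
    print(render(build_pipeline(["dump.xml.bz2"], "my_package.my_features")))
```

Building argument lists first and quoting them with shlex.join keeps file names containing spaces intact when the command is pasted into a shell.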

fetch_text

$ articlequality fetch_text -h

Fetches text & metadata for labelings using a MediaWiki API.

Usage:
    fetch_text --api-host=<url> [--labelings=<path>] [--output=<path>]
                                [--verbose]

Options:
    -h --help           Show this documentation.
    --api-host=<url>    The hostname of a MediaWiki e.g.
                        "https://en.wikipedia.org"
    --labelings=<path>  Path to a file containing observations with
                        extracted labels. [default: <stdin>]
    --output=<path>     Path to a file to write new observations
                        (with text) out to. [default: <stdout>]
    --verbose           Prints dots and stuff to stderr

score

$ articlequality score -h

Applies a scoring model to a chunk of text.

Usage:
    score <model-file> [<text>]
    score -h | --help

Options:
    -h --help     Prints this documentation
    <model-file>  The path to a scorer_model file to use
    <text>        The path to a file containing text to score
                  [default: <stdin>]
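
Since <text> defaults to <stdin>, the score utility can be driven from another program by piping text to it. The following is a minimal sketch, assuming the articlequality command is on the PATH; score_text is a hypothetical wrapper, and the output is returned as raw text because its format is not described here.

```python
import subprocess


def score_text(model_file, text, run=subprocess.run):
    """Pipe `text` to `articlequality score <model-file>` and return stdout.

    Hypothetical wrapper for illustration. `run` is injectable so the
    function can be exercised without the package installed.
    """
    completed = run(
        ["articlequality", "score", model_file],
        input=text,           # sent to the utility on stdin
        capture_output=True,  # collect stdout and stderr
        text=True,            # exchange str rather than bytes
        check=True,           # raise if the utility exits non-zero
    )
    return completed.stdout
```

A call such as score_text("path/to/model_file", article_text) would return whatever the utility prints; both argument values here are placeholders.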