Wiki-Class: Wikipedia article quality classification
A library for performing automatic detection of assessment classes of Wikipedia articles.
Compatible with Python 3.x only. Sorry.
- Install: pip install wikiclass
- Models: https://github.com/halfak/Wiki-Class/tree/master/models
- Repo: https://github.com/halfak/Wiki-Class
Basic usage
If you want to detect some assessment classes, you’re going to need a model. You can download a prebuilt model or build one yourself.
Model from file
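from wikiclass.models import RFTextModel

# Load a prebuilt model from disk; the file is closed once loading finishes
with open("enwiki.rf_text.model", "rb") as f:
    model = RFTextModel.from_file(f)

# classify() returns the predicted class and the per-class probabilities
assessment, probabilities = model.classify("Some article text")
print("I'm about {0}% sure that this should be classified {1}"
      .format(probabilities[assessment] * 100, assessment))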
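Because `probabilities` is indexed by assessment class above, the whole distribution can also be ranked to see the runner-up classes. The following is a minimal sketch that assumes `probabilities` behaves like a dict mapping class names to floats. The `rank_assessments` helper and the sample values are hypothetical and not part of wikiclass.

```python
def rank_assessments(probabilities):
    """Return (assessment, probability) pairs, most likely first."""
    return sorted(probabilities.items(), key=lambda pair: pair[1], reverse=True)


# Made-up values standing in for the second item returned by model.classify()
sample_probabilities = {"Stub": 0.10, "Start": 0.55, "C": 0.25, "B": 0.10}

ranking = rank_assessments(sample_probabilities)
for assessment, probability in ranking:
    print("{0}: {1:.0f}%".format(assessment, probability * 100))
```

Showing the second and third candidates is useful when the top probability is low, since adjacent quality classes are easy to confuse.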
Model building
from wikiclass import assessments  # assumed location of the WP10 scale used below
from wikiclass.models import RFTextModel

# Gather a training and test set
train_set = [
    ("Stub", "Some article text"),
    ("Start", "Some more article text<ref>news</ref>.")
    # ...
]
test_set = [
    ("C", "The Lorem Ipsum dolored the sit amet."),
    ("FA", "'''Lorem Ipsum''', sit amet the dolor amer. {{InfoBox|...}}")
    # ...
]

# Train a model
model = RFTextModel.train(
    train_set,
    assessments=assessments.WP10
)

# Run the test set & print the results
results = model.test(test_set)
print(results)

# Write the model to disk for reuse.
with open("enwiki.rf_text.model", "wb") as f:
    model.to_file(f)
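The training and test sets above are written out by hand. With a larger pool of labeled articles, one would typically split it at random instead. Here is a minimal sketch using only the standard library; `split_observations`, its parameters, and the sample data are hypothetical and not part of wikiclass.

```python
import random


def split_observations(observations, test_fraction=0.2, seed=0):
    """Shuffle (assessment, text) pairs and split them into train and test lists."""
    shuffled = list(observations)
    # A seeded generator keeps the split reproducible between runs
    random.Random(seed).shuffle(shuffled)
    test_size = int(len(shuffled) * test_fraction)
    return shuffled[test_size:], shuffled[:test_size]


labeled = [
    ("Stub", "Some article text"),
    ("Start", "Some more article text<ref>news</ref>."),
    ("C", "The Lorem Ipsum dolored the sit amet."),
    ("FA", "'''Lorem Ipsum''', sit amet the dolor amer. {{InfoBox|...}}"),
    ("Stub", "Another short article."),
]

train_set, test_set = split_observations(labeled, test_fraction=0.4)
print(len(train_set), "training pairs,", len(test_set), "test pairs")
```

The resulting lists have the same shape as the hand-written `train_set` and `test_set`, so they could be passed to `RFTextModel.train()` and `model.test()` in the same way.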
Modules
- wikiclass.models
  A set of classification models that can be trained and used to classify articles.
  RFTextModel: A random forest classifier that extracts features from article text.
- wikiclass.features
  A set of feature extractors used to organize a set of features for use in model training and classification.
  WikitextAndInfonoise: A text feature extractor that gathers wiki markup features and an information-based measure.
- wikiclass.languages
  Some FeatureExtractors require information about the language being processed. This module contains basic language info for common languages.
  get(): Gets a Language based on a name. Currently supported languages include: "English"
  register(): Registers a new Language for access from get().
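To illustrate how a feature extractor might lean on language info, here is a minimal sketch of a name-based registry feeding a small markup feature counter. It does not reproduce wikiclass internals: the `Language` class shown here, `extract_features`, the tiny stopword list, and the content-word ratio (a crude stand-in for an information-based measure) are all assumptions made for illustration.

```python
import re


class Language:
    """Hypothetical language record: a name plus a set of stopwords."""

    def __init__(self, name, stopwords):
        self.name = name
        self.stopwords = frozenset(stopwords)


_languages = {}


def register(language):
    """Make a language retrievable through get()."""
    _languages[language.name.lower()] = language


def get(name):
    """Return the registered language with the given name (case-insensitive)."""
    try:
        return _languages[name.lower()]
    except KeyError:
        raise KeyError("Unsupported language: {0!r}".format(name)) from None


# A deliberately tiny stopword list, for illustration only
register(Language("English", ["a", "an", "and", "in", "is", "of", "the", "to"]))


def extract_features(text, language):
    """Count a few markup features and the share of non-stopword words."""
    # Drop tags and link/template delimiters before tokenizing
    prose = re.sub(r"<[^>]*>|\{\{|\}\}|\[\[|\]\]", " ", text)
    words = re.findall(r"[a-z]+", prose.lower())
    content_words = [w for w in words if w not in language.stopwords]
    return {
        "ref_count": len(re.findall(r"<ref\b", text)),
        "template_count": text.count("{{"),
        "wikilink_count": text.count("[["),
        "content_word_ratio": len(content_words) / len(words) if words else 0.0,
    }


features = extract_features(
    "The [[cat]] is in the hat.<ref>news</ref> {{InfoBox|...}}",
    get("English"),
)
print(features)
```

Keeping language data behind `get()` and `register()` means an extractor only needs a language name, and support for a new language can be added without touching the extractor itself.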
Authors
- Aaron Halfaker
- Morten Warncke-Wang