Metadata-Version: 2.0
Name: soft404
Version: 0.1.1
Summary: A classifier for detecting soft 404 pages
Home-page: https://github.com/TeamHG-Memex/soft404
Author: Konstantin Lopuhin
Author-email: kostia.lopuhin@gmail.com
License: MIT
Platform: UNKNOWN
Classifier: Development Status :: 3 - Alpha
Classifier: Topic :: Internet :: WWW/HTTP :: Indexing/Search
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 2
Classifier: Programming Language :: Python :: 2.7
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.3
Classifier: Programming Language :: Python :: 3.4
Classifier: Programming Language :: Python :: 3.5
Requires-Dist: lxml
Requires-Dist: numpy
Requires-Dist: parsel
Requires-Dist: scikit-learn
Requires-Dist: scipy
Requires-Dist: six
Requires-Dist: webstruct (>=0.3)

soft404: a classifier for detecting soft 404 pages
==================================================

A "soft" 404 page is a page that is served with a 200 status code,
but whose content says that the requested page is not available.

.. contents::


Installation
------------

::

    pip install soft404


Usage
-----

::

    >>> from soft404 import Soft404Classifier
    >>> clf = Soft404Classifier()
    >>> clf.predict('<h1>Page not found</h1>')
    0.9736860086882132
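
The value returned by ``predict`` in the example above appears to be a
probability-like score that the page is a soft 404. A minimal sketch of turning
that score into a yes/no decision follows. The ``is_soft_404`` helper and the
``0.5`` threshold are illustrative assumptions, not part of the package::

    def is_soft_404(html, predict, threshold=0.5):
        """Return True if the score from ``predict(html)`` reaches ``threshold``.

        ``predict`` is any callable that maps an HTML string to a score
        in [0, 1], for example ``Soft404Classifier().predict``.
        The helper name and the default threshold are assumptions
        made for this sketch.
        """
        return predict(html) >= threshold

    # Possible usage, with soft404 installed:
    #
    #     from soft404 import Soft404Classifier
    #     clf = Soft404Classifier()
    #     is_soft_404('<h1>Page not found</h1>', clf.predict)

Passing ``predict`` in as a callable keeps the helper independent of how the
classifier is constructed. A suitable threshold depends on how costly false
positives are for a given crawl.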


Development
-----------

The classifier is trained on 120k pages from 25k domains, with a 404 page ratio of about 1/3.
With 10-fold cross-validation, F1 is 0.963 ± 0.012, and ROC AUC is 0.992 ± 0.004.


Getting data for training
+++++++++++++++++++++++++

Install dev requirements::

    pip install -r requirements_dev.txt

Run the crawler for a while (results will appear in the ``pages.jl.gz`` file)::

    cd crawler
    scrapy crawl spider -o gzip:pages.jl -s JOBDIR=job
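
The ``.jl.gz`` extension suggests gzip-compressed JSON lines, with one crawled
item per line. Under that assumption, a quick sanity check of the crawl output
could look like the sketch below (Python 3). ``count_items`` is a hypothetical
helper, and no particular item fields are assumed::

    import gzip
    import json


    def count_items(path):
        """Count JSON records in a gzip-compressed JSON lines file."""
        n_items = 0
        with gzip.open(path, 'rt', encoding='utf-8') as f:
            for line in f:
                if line.strip():
                    # Raises ValueError if the record is not valid JSON.
                    json.loads(line)
                    n_items += 1
        return n_items

    # Possible usage, after the crawler has been stopped:
    #
    #     print(count_items('pages.jl.gz'))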


Training
++++++++

First, extract text and structure from the HTML::

    ./soft404/convert_to_text.py pages.jl.gz items

This will produce two files, ``items.meta.jl.gz`` and ``items.items.jl.gz``.
Next, train the classifier::

    ./soft404/train.py items

The vectorizer takes a while to run, but its result is cached (the name of
the cache file is printed on the next run).
If you are happy with the results, save the classifier::

    ./soft404/train.py items --save soft404/clf.joblib


License
-------

License is MIT.


