Metadata-Version: 2.1
Name: nnanno
Version: 0.0.2
Summary: Sample and annotate Newspaper Navigator data for computer vision tasks
Home-page: https://github.com/Living-with-machines/nnanno/tree/master/
Author: Daniel van Strien
Author-email: daniel.van-strien@bl.uk
License: MIT License
Description: # nnanno
        > Sample, annotate and apply computer vision models to the Newspaper Navigator dataset  
        
        
        [![CI](https://github.com/Living-with-machines/nnanno/actions/workflows/main.yml/badge.svg)](https://github.com/Living-with-machines/nnanno/actions/workflows/main.yml)
        
        ## tl;dr
        
        `nnanno` is a modest collection of tools to help work with the delightful [Newspaper Navigator](https://news-navigator.labs.loc.gov/) data. 
        
        ## nnanno
        
        [Newspaper Navigator](https://news-navigator.labs.loc.gov/) is a project which extracted visual content (pictures, maps etc.) from the Library of Congress [Chronicling America](https://chroniclingamerica.loc.gov/) digitised newspaper collection.
        
        Newspaper Navigator has released data in a number of formats including `json` files which contain a range of metadata about the newspaper from which the image was taken from. `nnanno` was thrown together to help work with this collection as part of the preparation of some example datasets used in a series of Programming Historian tutorials (currently under review). Since this code was developed using the wonderful [nbdev](nbdev.fast.ai/) library it is possible to install it for your own use. 
        
        ## What nnanno tries to help with
        
        nnanno doesn't to provide an end-to-end to end software for using machine learning with the Newspaper Navigator data. Instead it is a minimal collection of code that *may* help you if you want to work with the Newspaper Navigator data.
        
        There are three particular areas where nnanno tries to help a little:
        - sampling subsets from the full Newspaper Navigator data
        - annotating this sample with additional labels using [label studio](https://labelstud.io)
        - inference (experimental 😬) running inference i.e making new predictions with a machine learning model on the newspaper navigator dataset using IIIF.
        
        ### Disclaimer
        
        This code was written mainly to help develop some example datasets for a series of Programming Historian tutorials. It has some tests but there are likely bugs and issues with the code. The code in this repository was all written in notebooks, some people  hate notebooks. Those people will probably hate this too 🤷‍♂️
        
        If you want to work with the full Newspaper Navigator data you will likely be better of accessing it via the provided S3 bucket see [news-navigator.labs.loc.gov/]() for more information.
        
        ## nbdev notes
        This code was written using `nbdev`. This is a tool that helps use Jupyter notebooks for developing Python libraries. Inside the documentation you will see code cells followed by output. This is generated from a Jupyter notebook and shows the actual output of the code rather than something that has been copied and pasted for example:
        
        ```python
        import datetime
        print(datetime.date.today())
        ```
        
            2021-02-02
        
        
        Is evaluated as Python code. This also means all of the documentation and examples can be opened in notebooks and the code inspected, changed and run. 
        
        ## Install
        
        You can install `nnanno` via GitHub:
        
        `pip install git+https://github.com/Living-with-machines/nnanno`
        
        This will install all of the code for sampling from Newspaper Navigator data. 
        
        ## Optional extra packages 
        If you want to use all of the parts of `nnanno` you'll need to install some extra packages. 
        
        ### label studio 
        If you also want to use `label-studio` for annotating you will need to install this too. There are various ways in which you may want to install label-studio. See the label studio [documentation](https://labelstud.io/) for various options for setting this up. 
        
        ### fastai
        If you want to use the experimental inference functionality you will need to install fastai. See the [fastai docs](https://docs.fast.ai/#Installing) for options for doing this. 
        
        ## Documentation
        
        The [documentation](https://living-with-machines.github.io/nnanno/) can be viewed as rendered pages with the option of opening as notebooks in Google Colab. 
        
        ## Illustrations in advertising: an 'end-to-end' example of using nnanno 
        
        You can find an 'end-to-end' example in [examples folder of the documentation](https://living-with-machines.github.io/nnanno/intro.html). 
        
        This example goes through the process of sampling, annotating, training a model and predicting this against the newspaper navigator data. 
        
        ## Functionality 
        The three main areas of `nnanno` are shown below. The examples in the documentation and in the "end-to-end" example shows this functionality in more detail. 
        
        ### Creating samples
        
        `nnanno` can be used to create samples from the Newspaper Navigator data:
        
        ```python
        from nnanno.sample import *
        ```
        
        ```python
        sampler = nnSampler()
        df = sampler.create_sample(1,
                                   'photos',
                                   start_year=1850, 
                                   end_year=1855, 
                                   step=1)
        ```
        
        This returns a dataframe containing samples from the Newspaper Navigator data (loaded via JSON) into a Pandas DataFrame. 
        
        ```python
        df.columns
        ```
        
        
        
        
            Index(['filepath', 'pub_date', 'page_seq_num', 'edition_seq_num', 'batch',
                   'lccn', 'box', 'score', 'ocr', 'place_of_publication',
                   'geographic_coverage', 'name', 'publisher', 'url', 'page_url'],
                  dtype='object')
        
        
        
        ### Annotation
        The annotation part of nnanno is mainly a little bit of documentation and a few functions to help setup annotation of a sample from Newspaper Navigator using IIIF urls and the [label studio](https://labelstud.io/) annotation tool. Since we can annotate via IIIF this offers a way of annotating without having to download large amounts of data locally. 
        
        ```python
        from nnanno.annotate import create_label_studio_json
        ```
        
        ### Inference
        
        The inference section of nnanno *attempts* to show one possible way to use IIIF to run inference against samples of Newspaper Navigator using a trained [fastai](https://docs.fast.ai/) model. This will allow you to make predictions against larger parts of the Newspaper Navigator data. 
        
        ```python
        from nnanno.inference import *
        ```
        
        ```python
        from fastai.vision.all import *
        dls = ImageDataLoaders.from_csv('../ph/ads/', 
                                        'ads_upsampled.csv',
                                        folder='images', 
                                        fn_col='file', 
                                        label_col='label',
                                        item_tfms=Resize(64,ResizeMethod.Squish),
                                        num_workers=0)
        learn = cnn_learner(dls, resnet18, metrics=F1Score())
        learn.fit(1)
        ```
        
        
        <table border="1" class="dataframe">
          <thead>
            <tr style="text-align: left;">
              <th>epoch</th>
              <th>train_loss</th>
              <th>valid_loss</th>
              <th>f1_score</th>
              <th>time</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td>0</td>
              <td>0.924742</td>
              <td>0.963872</td>
              <td>0.645570</td>
              <td>00:12</td>
            </tr>
          </tbody>
        </table>
        
        
        With a trained fastai model we can predict on a sample from Newspaper Navigator
        
        ```python
        predictor = nnPredict(learn, try_gpu=False)
        ```
        
        ```python
        predictor.predict_sample('ads','testinference',0.01,end_year=1850)
        ```
        
            
        
        
        This returns a `json` file for each year from the sample containing the original newspaper navigator data plus the predictions from your model
        
        ```python
        df = pd.read_json('testinference/1850.json')
        ```
        
        We can access the 'decoded' predictions
        
        ```python
        df['pred_decoded'].value_counts()
        ```
        
        
        
        
            text-only        50
            illustrations    38
            Name: pred_decoded, dtype: int64
        
        
        
        or work with the probabilities directly
        
        ```python
        df.iloc[:5,-2]
        ```
        
        
        
        
            0    0.152435
            1    0.027183
            2    0.096673
            3    0.395800
            4    0.957422
            Name: illustrations_prob, dtype: float64
        
        
        
        ### Acknowledgment 
        
        > This work was support by [Living with Machines](livingwithmachines.ac.uk/). This project, funded by the UK Research and Innovation (UKRI) Strategic Priority Fund, is a multidisciplinary collaboration delivered by the Arts and Humanities Research Council (AHRC), with The Alan Turing Institute, the British Library and the Universities of Cambridge, East Anglia, Exeter, and Queen Mary University of London.
        
Keywords: computer-vision,digital-humanities
Platform: UNKNOWN
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: Natural Language :: English
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: License :: OSI Approved :: MIT License
Requires-Python: >=3.6
Description-Content-Type: text/markdown
