Metadata-Version: 2.1
Name: vila
Version: 0.0.1
Summary: UNKNOWN
Home-page: https://github.com/allenai/vila
Author: Zejiang Shen
License: Apache-2.0
Platform: UNKNOWN
Requires-Python: >=3.6
Description-Content-Type: text/markdown
Requires-Dist: torch (>=1.5)
Requires-Dist: torchvision (>=0.6.0)
Requires-Dist: pandas
Requires-Dist: pytest
Requires-Dist: pysbd
Requires-Dist: layoutparser (>=0.2)
Requires-Dist: transformers (==4.4.2)
Requires-Dist: datasets
Requires-Dist: pdfplumber
Requires-Dist: pdf2image
Requires-Dist: tqdm
Requires-Dist: scikit-learn

# VILA🌴 <br> Incorporating **VI**sual **LA**yout Structures for Scientific Text Classification

<a href="https://arxiv.org/abs/2106.00676"><img src="https://img.shields.io/badge/arXiv-2106.00676-b31b1b.svg" title="Layout Parser Paper"></a>
<a href="https://github.com/Layout-Parser/layout-parser/blob/master/LICENSE"><img src="https://img.shields.io/pypi/l/layoutparser" title="Layout Parser uses Apache 2 License"></a>

## Motivation 

Scientific papers typically organize contents in visual groups like text blocks or lines, and text within each group usually have the same semantics. We explore different approaches for injecting the group structure into the text classifiers, and build models that improves the accuracy or efficiency of the scientific text classification task. 

![tease](./.github/github-teaser.png)

## Installation 

After cloning the github repo, you can either install the `vila` library or just install the dependencies: 
```bash 
git clone git@github.com:allenai/VILA.git
cd VILA 
conda create -n vila python=3.6
pip install -e . # Install the `vila` library 
pip install -r requirements.txt # Only install the dependencies 
```

We tested the code and trained the models using `Python≥3.6`, `PyTorch==1.7.1`, and `transformers==4.4.2`. 

## Usage 

### Directory Structure 

```
VILA
├─ checkpoints  # For all trained weights 
│  └─ grotoap2  # For each dataset                                 
│     ├─ baseline  # For the experiment type, e.g., baseline, ivila, hvila, ...
│     │  └─ bert-base-uncased  # For the used base model, e.g., bert-base-uncased. 
│     │     ├─ checkpoint-199999                                
│     │     ├─ checkpoint-299999                                 
│     │     ├─ all_results.json                                       
│     │     └─ pytorch_model.bin                         
│     └─ ivila-BLK-row                           
│        └─ microsoft-layoutlm-base-uncased 
└─ data                                       
   ├─ docbank
   ├─ ...
   └─ grotoap2                                 

```

Note:
- We will provide the download links to the datasets very soon. 

### Training 

All training scripts are in the `./scripts` folder. 

1. Train the baseline models
    ```bash
    cd scripts
    # bash train_baseline.sh [dataset-name] [base-model-name]
    bash train_baseline.sh grotoap2 bert-base-uncased
    bash train_baseline.sh docbank microsoft/layoutlm-base-uncased
    ```
2. Train the I-VILA models
    ```bash
    cd scripts
    # bash train_ivila.sh [dataset-name] [how-to-obtain-layout-indicators] [used-special-token] [base-model-name]
    bash train_ivila.sh grotoap2 row BLK microsoft/layoutlm-base-uncased 
      # Row is an alias for textline 
    bash train_ivila.sh docbank block SEP bert-base-uncased
      # We can also use the default special tokens like SEP 
    bash train_ivila.sh s2-vl sentence BLK roberta-base 
      # We can also extract the sentence breaks using spacy and use them as indicators.
    ```
3. Train the H-VILA models
    ```bash
    cd tools
    python create_hvila_model_base_weights.py 

    cd ../scripts
    # bash train_hvila.sh \
    #  [dataset-name] \
    #  [H-VILA-names] \
    #  [Group-Encoder-Output-Aggregation-Function] \
    #  [How-to-Obtain-Bounding-Box] \
    #  [Use-textline-or-block-as-the-group]

    bash train_hvila.sh \
      grotoap2 \
      weak-strong-layoutlm \
      average \
      first \
      row 
    ```
    In the above example, we use the:

    1. *average* of the group encoder outputs for all tokens as the group representation 
    2. the bounding box of the *first* token as the group's bounding box
    3. textline (or row) to construct the groups

### Evaluation Toolkit 

The evaluation toolkit can generate a detailed report for the prediction accuracy (marco F1 scores) and Visual Layout consistency (group entropy) for the prediction files `test_predictions.csv` produced by the training scripts. 

1. Generate reports for a group of experiments for a specific dataset
  ```bash
    cd tools
    python generate-eval.py --dataset_name grotoap2 --experiment_name baseline
    # It will create a _reports folder in ../checkpoints/grotoap2/baseline and store the 
    # scores in the report.csv file. 
  ```
2. Generate reports for all experiments for a specific dataset
  ```bash
    cd tools
    python generate-eval.py --dataset_name grotoap2
    # It will create reports for all experiments in the ../checkpoints/grotoap2/ folder
    # Also it will aggregate all the results and save them in ../checkpoints/grotoap2/_reports 
  ```
3. Generate reports for per-class accuracy
  ```bash
    cd tools
    python generate-eval.py --dataset_name grotoap2 --experiment_name baseline --store_per_class
    # In additiona to the report.csv file, it will also generate a report_per_class.csv
    # table in the corresponding folder. 
  ```

**Note**: this evaluation toolkits might take a long time to run as calculing the group entropy may take long. 

### Model Inference/Prediction 

Please refer to the example code below

```python
import layoutparser as lp # For visualization 

from vila.pdftools.pdf_extractor import PDFExtractor
from vila.predictors import SimplePDFPredictor
# Choose from SimplePDFPredictor,
# LayoutIndicatorPDFPredictor, 
# and HierarchicalPDFDataPreprocessor

pdf_extractor = PDFExtractor("pdfplumber")
page_tokens, page_images = pdf_extractor.load_tokens_and_image(f"path-to-your.pdf")

pdf_predictor = SimplePDFPredictor.from_pretrained("path-to-the-trained-weights")

for idx, page_token in enumerate(page_tokens):
    pdf_data = page_token.to_pagedata().to_dict()
    predicted_tokens = pdf_predictor.predict(pdf_data)
    lp.draw_box(page_images[idx], predicted_tokens)
```

## Citation

```
@article{Shen2021IncorporatingVL,
  title={Incorporating Visual Layout Structures for Scientific Text Classification},
  author={Zejiang Shen and Kyle Lo and Lucy Lu Wang and Bailey Kuehl and Daniel S. Weld and Doug Downey},
  journal={ArXiv},
  year={2021},
  volume={abs/2106.00676},
  url={https://arxiv.org/abs/2106.00676}
}
```



