Metadata-Version: 2.3
Name: similarius
Version: 0.0.2
Summary: Compare web pages and evaluate their level of similarity.
License: BSD-2-Clause
Keywords: web similarity,web comparison
Author: David Cruciani
Author-email: david.cruciani@circl.lu
Maintainer: Alexandre Dulaunoy
Maintainer-email: a@foo.be
Requires-Python: >=3.8,<4.0
Classifier: License :: OSI Approved :: BSD License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Requires-Dist: beautifulsoup4 (>=4.11.1,<5.0.0)
Requires-Dist: lxml (>=4.9.2,<6.0.0)
Requires-Dist: nltk (>=3.8.1,<4.0.0)
Requires-Dist: requests (>=2.28.2,<3.0.0)
Requires-Dist: scikit-learn (>=1.2.0,<2.0.0)
Project-URL: Repository, https://github.com/ail-project/Similarius
Description-Content-Type: text/markdown

# Similarius

Similarius is a Python library to compare web pages and evaluate their level of similarity.

The tool can be used as a stand-alone tool or to feed other systems.



# Requirements

- Python 3.8+
- [Requests](https://github.com/psf/requests)
- [Scikit-learn](https://github.com/scikit-learn/scikit-learn)
- [Beautifulsoup4](https://pypi.org/project/beautifulsoup4/)
- [nltk](https://github.com/nltk/nltk)
- [lxml](https://github.com/lxml/lxml)



# Installation

## Source install

**Similarius** can be installed from source with [Poetry](https://python-poetry.org/). If you don't have Poetry installed, you can install it with `curl -sSL https://install.python-poetry.org | python3 -`.

~~~bash
$ poetry install
$ poetry shell
$ similarius -h
~~~

## pip installation

~~~bash
$ pip3 install similarius
~~~



# Usage

~~~bash
dacru@dacru:~/git/Similarius/similarius$ similarius --help
usage: similarius.py [-h] [-o ORIGINAL] [-w WEBSITE [WEBSITE ...]]

optional arguments:
  -h, --help            show this help message and exit
  -o ORIGINAL, --original ORIGINAL
                        Website to compare
  -w WEBSITE [WEBSITE ...], --website WEBSITE [WEBSITE ...]
                        Website to compare
~~~



# Usage example

~~~bash
dacru@dacru:~/git/Similarius/similarius$ similarius -o circl.lu -w europa.eu circl.eu circl.lu
~~~



# Use as a library

~~~python
import argparse
import sys

from similarius import get_website, extract_text_ressource, sk_similarity, ressource_difference, ratio

parser = argparse.ArgumentParser()
parser.add_argument("-w", "--website", nargs="+", help="Website to compare")
parser.add_argument("-o", "--original", help="Website to compare")
args = parser.parse_args()

# Original
original = get_website(args.original)

if not original:
    print("[-] The original website is unreachable...")
    sys.exit(1)

original_text, original_ressource = extract_text_ressource(original.text)

for website in args.website:
    print(f"\n********** {args.original} <-> {website} **********")

    # Compare
    compare = get_website(website)

    if not compare:
        print(f"[-] {website} is unreachable...")
        continue

    compare_text, compare_ressource = extract_text_ressource(compare.text)

    # Calculate
    sim = str(sk_similarity(compare_text, original_text))
    print(f"\nSimilarity: {sim}")

    ressource_diff = ressource_difference(original_ressource, compare_ressource)
    print(f"Ressource Difference: {ressource_diff}")

    ratio_compare = ratio(ressource_diff, sim)
    print(f"Ratio: {ratio_compare}")
~~~
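
To give an intuition for what a text similarity score between two pages measures, here is a minimal, standard-library-only sketch. It is not how Similarius computes its score (the library relies on scikit-learn and nltk); `tokenize` and `cosine_similarity_text` are hypothetical names introduced only for this illustration, which compares raw word counts with cosine similarity.

~~~python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity_text(text_a, text_b):
    """Return the cosine similarity (0.0 to 1.0) of two texts' word counts.

    Illustrative only: this is an assumption-based sketch, not the
    implementation behind similarius.sk_similarity.
    """
    counts_a = Counter(tokenize(text_a))
    counts_b = Counter(tokenize(text_b))

    # Dot product over the words that appear in both texts.
    shared_words = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[word] * counts_b[word] for word in shared_words)

    norm_a = math.sqrt(sum(count * count for count in counts_a.values()))
    norm_b = math.sqrt(sum(count * count for count in counts_b.values()))

    # A text without any token cannot be compared; treat it as dissimilar.
    if norm_a == 0 or norm_b == 0:
        return 0.0

    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    page_a = "Incident response and threat intelligence"
    page_b = "Threat intelligence sharing platform"
    print(f"Similarity: {cosine_similarity_text(page_a, page_b):.2f}")
~~~

A score close to 1 means the two texts use nearly the same vocabulary in similar proportions, while a score of 0 means they share no words.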



# Acknowledgment

![](./img/cef.png)

The project has been co-funded by CEF-TC-2020-2 - 2020-EU-IA-0260 - JTAN - Joint Threat Analysis Network.

