Metadata-Version: 2.1
Name: xml-miner
Version: 0.0.4
Summary: data mining tool, to mine data from batch of xml files
Home-page: https://github.com/tilaboy/xml-miner
Author: Chao Li
Author-email: chaoli.job@gmail.com
License: MIT license
Keywords: data mining,xml
Platform: UNKNOWN
Classifier: Development Status :: 2 - Pre-Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Natural Language :: English
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7

XML/TRXML Selector
==================

Description
-----------

This package provides two scripts: ``mine-xml`` and
``mine-trxml``.

``mine-xml`` selects tags from xml/mxml files and saves the
selected values to a file.

``mine-trxml`` selects fields from trxml/mtrxml files and saves
the selected values to a file.

Status
------

.. image:: https://travis-ci.org/tilaboy/xml-miner.svg?branch=master
    :target: https://travis-ci.org/tilaboy/xml-miner

.. image:: https://readthedocs.org/projects/xml-miner/badge/?version=latest
    :target: https://xml-miner.readthedocs.io/en/latest/?badge=latest
    :alt: Documentation Status

.. image:: https://pyup.io/repos/github/tilaboy/xml-miner/shield.svg
    :target: https://pyup.io/repos/github/tilaboy/xml-miner/
    :alt: Updates

Requirements
------------

Python 3.6+

Installation
------------

::

    pip install xml-miner


Usage
-----

Use xml selector script
~~~~~~~~~~~~~~~~~~~~~~~

The xml selector supports:
^^^^^^^^^^^^^^^^^^^^^^^^^^

-  one or more tagnames:

   -  the selector can be a single tagname, e.g. ``name``
   -  or comma-separated tagnames, e.g. ``langskill,compskill,softskill``

-  multiple sources:

   -  e.g. select from an xml directory, xml files, an mxml file, or
      directly from an annotation server

examples:
^^^^^^^^^

::

    # select from xml directory
    mine-xml --source tests/xmls/ --selector name --output_file name.tsv
    mine-xml --source tests/xmls/ --selector langskill,compskill,softskill --output_file skill.tsv --with_field_name

    # select from xml file or mxml file
    mine-xml --source tests/sample.mxml --selector experience --output_file experience.tsv

    # select directly from annotation server
    mine-xml --source localhost:50249 --selector name --output_file name.tsv --query "set Data2018"
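
To make the tag selection concrete, here is a minimal Python sketch of the idea. It is not the package's implementation: the function ``select_tags`` and the sample document are hypothetical. It only assumes that a selector is a comma-separated list of tag names and that the text of every matching tag is collected, optionally together with the tag name.

```python
import xml.etree.ElementTree as ET


def select_tags(xml_text, selector, with_field_name=False):
    """Return one row per element whose tag name appears in the selector.

    Hypothetical helper for illustration only, not part of xml-miner.
    """
    # A selector such as "langskill,compskill" becomes a list of tag names.
    tag_names = [name.strip() for name in selector.split(",") if name.strip()]
    root = ET.fromstring(xml_text)
    rows = []
    # iter() walks the whole tree in document order.
    for element in root.iter():
        if element.tag in tag_names:
            value = "".join(element.itertext()).strip()
            rows.append((value, element.tag) if with_field_name else (value,))
    return rows


# Made-up document; real xml/mxml files may be structured differently.
SAMPLE = """<doc>
  <name>Chao Li</name>
  <compskill>java</compskill>
  <langskill>dutch</langskill>
</doc>"""

print(select_tags(SAMPLE, "name"))
print(select_tags(SAMPLE, "compskill,langskill", with_field_name=True))
```

With the field name switched on, each row carries the tag it came from, which is what allows several tag names to share one output file.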

Use trxml selector script
~~~~~~~~~~~~~~~~~~~~~~~~~

The trxml selector supports:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

-  one or more selectors:

   -  the selector can be a single field, e.g. ``name.0.name``
   -  or comma-separated fields, e.g. ``name.0.name,address.0.address``

-  single or multiple items:

   -  select a field from one item, e.g. ``experienceitem.3.experience``
   -  or select the field value of all items, e.g. ``experienceitem.experience``
      (or ``experienceitem.*.experience``)

-  multiple sources:

   -  e.g. select from a trxml directory, trxml files, or an mtrxml file

examples:
^^^^^^^^^

::

    # one selector, single item
    mine-trxml --source tests/trxmls/ --selector name.0.name --output_file name.tsv

    # one selector, multiple items
    mine-trxml --source tests/sample.mxml --selector experienceitem.experience --output_file experience.tsv

    # multiple selectors, single item
    mine-trxml --source tests/trxmls/ --selector name.0.name,address.0.address,phone.0.phone --output_file personal.tsv

    # multiple selectors, multiple items
    mine-trxml --source tests/sample.mxml  --itemgroup experienceitem --fields experience,experiencedate --output_file experience.tsv
    mine-trxml --source tests/sample.mxml  --selector experienceitem.*.experience,experienceitem.*.experiencedate --output_file experience.tsv
    mine-trxml --source tests/sample.mxml  --selector experienceitem.experience,experienceitem.experiencedate --output_file experience.tsv
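
The selector forms used above can be read as ``itemgroup.index.field``, where the index is either a number, ``*``, or left out. The following Python sketch shows one way such a string could be split into its parts. The names ``FieldSelector`` and ``parse_selector`` are hypothetical and do not come from the package; treating a missing index the same as ``*`` is an assumption based on the examples in this section.

```python
from typing import List, NamedTuple, Optional


class FieldSelector(NamedTuple):
    """One parsed selector; index None stands for "all items"."""
    itemgroup: str
    index: Optional[int]
    field: str


def parse_selector(selector: str) -> List[FieldSelector]:
    """Split a comma-separated selector string into its parts.

    Hypothetical helper for illustration only, not part of xml-miner.
    """
    parsed = []
    for part in selector.split(","):
        pieces = part.strip().split(".")
        if len(pieces) == 2:
            # "experienceitem.experience": no index, so all items.
            itemgroup, field = pieces
            index = None
        elif len(pieces) == 3:
            # "name.0.name" or "experienceitem.*.experience".
            itemgroup, raw_index, field = pieces
            index = None if raw_index == "*" else int(raw_index)
        else:
            raise ValueError("unsupported selector: %r" % part)
        parsed.append(FieldSelector(itemgroup, index, field))
    return parsed


print(parse_selector("name.0.name,address.0.address"))
print(parse_selector("experienceitem.experience,experienceitem.*.experiencedate"))
```

Under this reading, a selector with a numeric index addresses a single item, while the other two forms address every item in the itemgroup.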

Development
-----------

To install the package and its dependencies, run the following from the
project root directory:

::

    python setup.py install

To work on the code and develop the package, run the following from the
project root directory:

::

    python setup.py develop

To run unit tests, execute the following from the project root
directory:

::

    python setup.py test

Selector and output details
---------------------------

-  mine-xml:

   input: documents, selector(s), output

   output:

   -  default (parameter ``with_field_name`` not set):
      ``filename, field_value``

   e.g. select all names with selector ``name``

   +------------+-----------+
   | filename   | value     |
   +============+===========+
   | xxxx       | Chao Li   |
   +------------+-----------+

   -  parameter ``with_field_name`` set:
      ``filename, field_value, field_name``

   e.g. select skills with selector ``compskill,langskill,otherskill``

   +------------+---------+-------------+
   | filename   | value   | field       |
   +============+=========+=============+
   | xxxx       | java    | compskill   |
   +------------+---------+-------------+
   | xxxx       | dutch   | langskill   |
   +------------+---------+-------------+

-  mine-trxml

   -  input, one of:

      -  documents, selector(s), output
      -  documents, itemgroup, fields, output

   -  single selector, single item (``name.0.name``): filename, field

   +------------+---------------+
   | filename   | name.0.name   |
   +============+===============+
   | xxxx       | Chao Li       |
   +------------+---------------+

   -  single selector, multiple items (``skill.*.skill``): filename, item\_index, field

   +------------+---------------+---------+
   | filename   | item\_index   | field   |
   +============+===============+=========+
   | xxxx       | 0             | java    |
   +------------+---------------+---------+
   | xxxx       | 1             | dutch   |
   +------------+---------------+---------+

   -  multiple selectors, single item: filename, field1, field2 ...

   each selector points to a field of a specific item with a numeric
   index, e.g. ``name.0.lastname,name.0.firstname,address.0.country``

   +------------+-------------------+--------------------+---------------------+
   | filename   | name.0.lastname   | name.0.firstname   | address.0.country   |
   +============+===================+====================+=====================+
   | xxxx       | Li                | Chao               | China               |
   +------------+-------------------+--------------------+---------------------+
   | xxxx       | Lee               | Richard            | USA                 |
   +------------+-------------------+--------------------+---------------------+

   -  multiple selectors, multiple items: filename, item\_index, field1, field2 ...

   each selector points to a field from all items in an itemgroup, e.g.
   ``skill.skill,skill.type,skill.date``

   +------------+---------------+---------+-------------+-------------+
   | filename   | item\_index   | skill   | type        | date        |
   +============+===============+=========+=============+=============+
   | xxxx       | 0             | java    | compskill   | 2001-2005   |
   +------------+---------------+---------+-------------+-------------+
   | xxxx       | 1             | dutch   | langskill   | 2002-       |
   +------------+---------------+---------+-------------+-------------+
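
The multi-item layout can be sketched in a few lines of Python: one output row per item, led by the filename and the item index. This is only an illustration under stated assumptions. The helpers ``multi_item_rows`` and ``to_tsv`` are hypothetical, the items are given as plain dictionaries rather than parsed from a trxml file, and tab-separated output is assumed from the ``.tsv`` file names used in the examples.

```python
import csv
import io


def multi_item_rows(filename, items, fields):
    """Yield one row per item: filename, item index, then each field value.

    Hypothetical helper for illustration only, not part of xml-miner.
    """
    for index, item in enumerate(items):
        # A field missing from an item becomes an empty cell.
        yield [filename, index] + [item.get(field, "") for field in fields]


def to_tsv(header, rows):
    """Serialize a header and rows as tab-separated text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up items mirroring the table above.
skills = [
    {"skill": "java", "type": "compskill", "date": "2001-2005"},
    {"skill": "dutch", "type": "langskill", "date": "2002-"},
]
fields = ["skill", "type", "date"]
table = to_tsv(["filename", "item_index"] + fields,
               multi_item_rows("xxxx", skills, fields))
print(table)
```

Keeping the item index in its own column means that values belonging to the same item stay on the same row, even when a document has several items in the itemgroup.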


0.0.4 (2019-09-11)
==================
- bug fix: always read input as UTF-8, and do not continue reading if decoding one document fails

0.0.3 (2019-08-11)
==================
- expand miner.py module to generate matched phrases per doc

0.0.2 (2019-08-09)
==================

- added support for CI


0.0.1 (2019-08-09)
==================

- add two scripts: ``mine-xml`` and ``mine-trxml``


0.0.0 (2019-08-06)
==================

- add the first version of mine_xml and mine_trxml


