Metadata-Version: 2.1
Name: cwb-ccc
Version: 0.9.2
Summary: CWB wrapper to extract concordances and collocates
Home-page: https://gitlab.cs.fau.de/pheinrich/ccc
Author: Philipp Heinrich
Author-email: philipp.heinrich@fau.de
License: UNKNOWN
Description: # Collocation and Concordance Computation #
        
        ## Introduction ##
        This module is a wrapper around the [IMS Open Corpus Workbench
        (CWB)](http://cwb.sourceforge.net/).  It requires CWB version 3.4.16
        or newer for anchored queries.  The main purpose of the module is to
        extract concordance lines and to calculate collocates, as well as to
        extract the results of queries with more than two anchors.
        
        ## Installation ##
        The recommended way to install the module is to clone the repository
        and use `setup.py`.
        
            python3 setup.py install
        
        Alternatively, you can just install the requirements and make sure the
        `ccc` subfolder can be found by Python by including it in your
        `PYTHONPATH`.
        
        
        ## Usage ##
        
        ### CWBEngine ###
        All methods rely on the `CWBEngine` from `ccc.cwb`, which you first
        have to initialize with your system specific settings:
        
        ```python
        from ccc.cwb import CWBEngine
        
        engine = CWBEngine(
        	corpus_name="EXAMPLE_CORPUS",
        	registry_path="/path/to/your/cwb/registry"
        )
        ```
        
        NB: this will raise a KeyError if the named corpus is not in the
        specified registry.
        
        You can use the `cqp_bin` parameter to point the engine to a specific
        version of `cqp` (this is also helpful if `cqp` is not in your `PATH`):
        
        ```python
        engine = CWBEngine(
        	corpus_name="EXAMPLE_CORPUS",
        	registry_path="/path/to/your/cwb/registry", 
        	cqp_bin="/usr/local/cwb-3.4.16/bin/cqp"
        )
        ```
        
        If you are using macros and wordlists, you have to store them in a
        separate folder (with subfolders `wordlists` and `macros`).  Make sure
        you specify this folder via `lib_path` when initializing the
        engine:
        
        ```python
        engine = CWBEngine(
        	corpus_name="EXAMPLE_CORPUS", 
        	registry_path="/path/to/your/cwb/registry",
        	lib_path="/path/to/your/lib/"
        )
        ```
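        
        The expected layout of that folder can be sketched with the standard
        library. This is only an illustration: the root location is a
        temporary directory chosen for the example, and only the subfolder
        names `wordlists` and `macros` come from the description above.
        
        ```python
        import tempfile
        from pathlib import Path
        
        # Hypothetical root folder for this example; in practice this is the
        # folder you pass to the engine as `lib_path`.
        lib_path = Path(tempfile.mkdtemp()) / "lib"
        
        # Create the two subfolders named above.
        for subfolder in ("wordlists", "macros"):
            (lib_path / subfolder).mkdir(parents=True, exist_ok=True)
        
        created = sorted(p.name for p in lib_path.iterdir())
        print(created)
        ```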
        
        
        ### Concordancing ###
        
        You can use the `Concordance` class from `ccc.concordances` for
        concordancing. The concordancer has to be initialized with the engine
        and accepts valid CQP queries:
        	
        ```python
        from ccc.concordances import Concordance
        
        # initialize the concordancer with the engine
        concordance = Concordance(engine)
        
        # extract concordance lines
        concordance.query('[lemma="Angela"] [lemma="Merkel"]')
        ```
        
        The result is a dictionary that maps the _cpos_ of each match to its
        concordance line. Each concordance line is formatted as a
        `pandas.DataFrame` with the _cpos_ of each token as index:
        
        | **cpos**  | word    | match | offset |
        |-----------|---------|-------|--------|
        | 188530363 | ,       | False | -5     |
        | 188530364 | dass    | False | -4     |
        | 188530365 | die     | False | -3     |
        | 188530366 | Tage    | False | -2     |
        | 188530367 | von     | False | -1     |
        | 188530368 | Angela  | True  | 0      |
        | 188530369 | Merkel  | True  | 0      |
        | 188530370 | gezählt | False | 1      |
        | 188530371 | sind    | False | 2      |
        | 188530372 | .       | False | 3      |
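        
        The `match` and `offset` columns can be read as follows: tokens inside
        the match have offset 0, and context tokens are numbered by their
        distance from the match start (to the left) or from the match end (to
        the right). The sketch below reproduces both columns of the table in
        plain Python. It only illustrates the table and is not the module's
        implementation; the function name `annotate_line` is made up for this
        example.
        
        ```python
        def annotate_line(context_start, match, matchend, context_end):
            """Return (cpos, is_match, offset) for every position of a line."""
            rows = []
            for cpos in range(context_start, context_end + 1):
                if cpos < match:
                    # left context: negative distance to the match start
                    rows.append((cpos, False, cpos - match))
                elif cpos > matchend:
                    # right context: positive distance to the match end
                    rows.append((cpos, False, cpos - matchend))
                else:
                    # token inside the match
                    rows.append((cpos, True, 0))
            return rows
        
        # corpus positions taken from the table above
        rows = annotate_line(188530363, 188530368, 188530369, 188530372)
        for row in rows:
            print(row)
        ```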
        
        Queries _must not_ end with a "within" clause.  If you want to
        restrict your concordance lines by a structural attribute, use the
        `s_break` parameter (defaults to "text"). The default context window
        is 20 tokens to the left and 20 tokens to the right of the query match
        and matchend, respectively.
        
        ```python
        concordance = Concordance(engine, context=50, s_break='s')
        concordance.query('[lemma="Angela"] [lemma="Merkel"]')
        ```
        
        Further parameters for the `Concordance` class are `order` (one of
        "random", "first", or "last"), `cut_off` (for the number of
        concordance lines to extract), and `p_show` (a `list` of additional
        p-attributes besides the primary word layer to show, e.g. "lemma" or
        "pos"; these will be added as additional columns).
        
        ### Anchored Queries ###
        
        `Concordance` detects anchored queries by default. The following query
        ```python
        concordance.query(
        	'@0[lemma="Angela"]? @1[lemma="Merkel"] '
        	'[word="\\("] @2[lemma="CDU"] [word="\\)"]'
        )
        ```
        will thus return `DataFrame`s with an additional column indicating the
        anchor positions:
        
        | **cpos**  | word       | match | offset | anchor |
        |-----------|------------|-------|--------|--------|
        | 298906425 | auch       | False | -5     | None   |
        | 298906426 | das        | False | -4     | None   |
        | 298906427 | Handy      | False | -3     | None   |
        | 298906428 | von        | False | -2     | None   |
        | 298906429 | Kanzlerin  | False | -1     | None   |
        | 298906430 | Angela     | True  | 0      | 0      |
        | 298906431 | Merkel     | True  | 0      | 1      |
        | 298906432 | (          | True  | 0      | None   |
        | 298906433 | CDU        | True  | 0      | 2      |
        | 298906434 | )          | True  | 0      | None   |
        | 298906435 | sowie      | False | 1      | None   |
        | 298906436 | ihres      | False | 2      | None   |
        | 298906437 | Vorgängers | False | 3      | None   |
        | 298906438 | Gerhard    | False | 4      | None   |
        | 298906439 | Schröder   | False | 5      | None   |
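        
        To pick out the anchored tokens from such a line, you can filter on
        the `anchor` column. The sketch below works on plain tuples copied
        from the table above rather than on the `DataFrame` itself, so that it
        runs without a corpus; with a real result, the same selection would be
        made on the `anchor` column of the `DataFrame`.
        
        ```python
        # (cpos, word, anchor) for the match region of the table above
        rows = [
            (298906430, "Angela", 0),
            (298906431, "Merkel", 1),
            (298906432, "(", None),
            (298906433, "CDU", 2),
            (298906434, ")", None),
        ]
        
        # map each anchor number to the cpos and word it points at
        anchors = {
            anchor: (cpos, word)
            for cpos, word, anchor in rows
            if anchor is not None
        }
        print(anchors)
        ```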
        
        
        ### Argument Queries ###
        Argument queries are anchored queries with additional information. (1)
        Each anchor can be modified by an offset (usually used to capture
        underspecified tokens near an anchor point). (2) Anchors can be mapped
        to external identifiers for further logical processing, and (3) be
        given a clear name:
        
        
        | anchor | offset | idx  | clear name |
        |--------|--------|------|------------|
        | 0      | 0      | None | None       |
        | 1      | -1     | None | None       |
        | 2      | 0      | None | None       |
        | 3      | -1     | None | None       |
        
        
        Furthermore, several anchor queries can be combined to form regions,
        which in turn can be mapped to identifiers and be given a clear name:
        
        | start | end | idx | clear name |
        |-------|-----|-----|------------|
        | 0     | 1   | "0" | "person X" |
        | 2     | 3   | "1" | "person Y" |
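        
        How the two tables interact can be sketched as follows. The sketch
        assumes that an offset is simply added to the corpus position of its
        anchor and that a region spans the corrected start and end positions
        inclusively. Both are assumptions made for illustration, as are the
        function name `resolve_regions` and the example positions.
        
        ```python
        # anchor -> offset, as in the anchor table above
        offsets = {0: 0, 1: -1, 2: 0, 3: -1}
        
        # (start anchor, end anchor, idx, clear name), as in the region table
        regions = [
            (0, 1, "0", "person X"),
            (2, 3, "1", "person Y"),
        ]
        
        def resolve_regions(anchor_cpos, offsets, regions):
            """Map each region idx to an inclusive (start, end) cpos span."""
            # assumption: the offset is added to the anchor's corpus position
            corrected = {
                anchor: cpos + offsets.get(anchor, 0)
                for anchor, cpos in anchor_cpos.items()
            }
            return {
                idx: (corrected[start], corrected[end])
                for start, end, idx, _name in regions
            }
        
        # made-up anchor positions for a single query match
        anchor_cpos = {0: 100, 1: 103, 2: 104, 3: 107}
        spans = resolve_regions(anchor_cpos, offsets, regions)
        print(spans)
        ```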
        
        
        Example: Given the definition of anchors and regions above, the
        following complex query extracts corpus positions where there's some
        correlation between "person X" (the region from anchor 0 to anchor 1)
        and "person Y" (anchor 2 to 3):
        
        ```python
        query = (
        	"<np> []* /ap[]* [lemma = $nouns_similarity] "
        	"[]*</np> \"between\" @0:[::](<np>[pos_simple=\"D|A\"]* "
        	"([pos_simple=\"Z|P\" | lemma = $nouns_person_common | "
        	"lemma = $nouns_person_origin | lemma = $nouns_person_support | "
        	"lemma = $nouns_person_negative | "
        	"lemma = $nouns_person_profession] |/region[ner])+ "
        	"[]*</np>)+@1:[::] \"and\" @2:[::](<np>[pos_simple=\"D|A\"]* "
        	"([pos_simple=\"Z|P\" | lemma = $nouns_person_common | "
        	"lemma = $nouns_person_origin | lemma = $nouns_person_support | "
        	"lemma = $nouns_person_negative | "
        	"lemma = $nouns_person_profession] | /region[ner])+ "
        	"[]*</np>) (/region[np] | <vp>[lemma!=\"be\"]</vp> | "
        	"/region[pp] |/be_ap[])* @3:[::]"
        )
        ```
        
        NB: the set of identifiers defined in the table of anchors and in the
        table of regions, respectively, should not overlap.
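        
        A quick way to check this before running a query is to compare the
        two sets of identifiers. In the sketch below, `overlapping_ids` is a
        hypothetical helper and not part of the module, and the clashing
        identifiers are made up for the example.
        
        ```python
        def overlapping_ids(anchor_ids, region_ids):
            """Return identifiers used in both tables (None is ignored)."""
            anchors = {i for i in anchor_ids if i is not None}
            regions = {i for i in region_ids if i is not None}
            return anchors & regions
        
        # identifiers as in the tables above: anchors carry no idx
        ok = overlapping_ids([None, None, None, None], ["0", "1"])
        
        # a clashing setup: one anchor reuses the region identifier "1"
        clash = overlapping_ids([None, None, "1", None], ["0", "1"])
        
        print(ok, clash)
        ```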
        
        It is customary to store these queries in JSON query files such as the
        [example](tests/gold/query-example.json). You can directly process
        these files using the `process_argmin_file` function from `ccc.argmin`:
        
        ```python
        from ccc.argmin import process_argmin_file
        
        # process the query file
        query_path = "tests/gold/query-example.json"
        result = process_argmin_file(engine, query_path)
        ```
        
        The result is a `dict` with the same keys as specified in the query
        file as well as an entry "result" with the following keys:
        
        - "nr_matches": the number of query matches in the corpus.
        - "matches": the actual concordance lines as returned from
          `Concordance().query()` (see above) converted to a `dict`. An
          additional entry "holes" contains a mapping from the idx specified
          in the anchor and region tables to the tokens or token sequences,
          respectively, for each concordance line (default: lemma layer).
        - "holes": a global list of all tokens of the entities specified in
          the "idx" columns (default: lemma layer).
        
        
        ## Acknowledgements ##
        The module relies on several other Python modules (see the
        requirements).  Special thanks to Yannick Versley and Jorg Asmussen
        for the implementation of
        [cwb-python](https://pypi.org/project/cwb-python/).
        
        This work has been funded by the Deutsche Forschungsgemeinschaft (DFG)
        within the project "Reconstructing Arguments from Noisy Text", grant
        number 377333057, as part of the Priority Program "Robust
        Argumentation Machines (RATIO)" (SPP-1999).
        
Platform: UNKNOWN
Classifier: Development Status :: 3 - Alpha
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: GNU General Public License v3 or later (GPLv3+)
Classifier: Operating System :: OS Independent
Requires-Python: >=3.5
Description-Content-Type: text/markdown
