Metadata-Version: 2.4
Name: xbrl-core
Version: 0.1.2
Summary: Lightweight XBRL 2.1 / iXBRL 1.1 parser and structured data extraction library
Project-URL: Homepage, https://github.com/youseiushida/xbrl-core
Project-URL: Repository, https://github.com/youseiushida/xbrl-core
Project-URL: Issues, https://github.com/youseiushida/xbrl-core
Author: nezow
License: AGPL-3.0-or-later
License-File: LICENSE
Keywords: edinet,financial,ixbrl,sec,tdnet,xbrl,xbrl-parser
Classifier: Development Status :: 3 - Alpha
Classifier: License :: OSI Approved :: GNU Affero General Public License v3 or later (AGPLv3+)
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Office/Business :: Financial
Classifier: Topic :: Text Processing :: Markup :: XML
Classifier: Typing :: Typed
Requires-Python: >=3.12
Requires-Dist: lxml>=5.0
Provides-Extra: all
Requires-Dist: openpyxl>=3.0; extra == 'all'
Requires-Dist: pandas>=2.0; extra == 'all'
Requires-Dist: pyarrow>=14.0; extra == 'all'
Requires-Dist: rich>=13.0; extra == 'all'
Provides-Extra: analysis
Requires-Dist: pandas>=2.0; extra == 'analysis'
Requires-Dist: pyarrow>=14.0; extra == 'analysis'
Provides-Extra: display
Requires-Dist: rich>=13.0; extra == 'display'
Provides-Extra: excel
Requires-Dist: openpyxl>=3.0; extra == 'excel'
Requires-Dist: pandas>=2.0; extra == 'excel'
Description-Content-Type: text/markdown

# xbrl-core — Lightweight XBRL 2.1 / iXBRL 1.1 Parser for Python

[![PyPI version](https://img.shields.io/pypi/v/xbrl-core.svg)](https://pypi.org/project/xbrl-core/)
[![Python](https://img.shields.io/pypi/pyversions/xbrl-core.svg)](https://pypi.org/project/xbrl-core/)
[![Context7 Indexed](https://img.shields.io/badge/Context7-Indexed-047857)](https://context7.com/youseiushida/xbrl-core)
[![Context7 llms.txt](https://img.shields.io/badge/Context7-llms.txt-047857)](https://context7.com/youseiushida/xbrl-core/llms.txt)

**xbrl-core** is a pure-Python parser and structured data extraction library for [XBRL 2.1](https://www.xbrl.org/Specification/XBRL-2.1/REC-2003-12-31/XBRL-2.1-REC-2003-12-31+corrected-errata-2013-02-20.html) instance documents and [iXBRL (Inline XBRL)](https://www.xbrl.org/Specification/inlineXBRL-part1/REC-2013-11-18/inlineXBRL-part1-REC-2013-11-18.html) documents. It supports fact extraction, context/unit structuring, all five linkbase types (presentation, calculation, definition, label, reference), XSD schema parsing, calculation validation, text block extraction, pandas/DataFrame conversion, and Rich/HTML rendering. The only required dependency is [lxml](https://lxml.de/).

[GitHub Repository](https://github.com/youseiushida/xbrl-core)

## Installation

```sh
pip install xbrl-core
```

Optional dependencies:

```sh
# pandas + pyarrow (DataFrame conversion, Parquet export)
pip install 'xbrl-core[analysis]'

# Rich terminal display
pip install 'xbrl-core[display]'

# Excel export (pandas + openpyxl)
pip install 'xbrl-core[excel]'

# Everything
pip install 'xbrl-core[all]'
```

## Quick Start

```python
from xbrl_core import parse_xbrl_facts, structure_contexts, build_line_items

# 1. Parse an XBRL instance document
with open("instance.xbrl", "rb") as f:
    parsed = parse_xbrl_facts(f.read(), source_path="instance.xbrl")

print(f"Facts: {parsed.fact_count}")

# 2. Structure contexts and build typed LineItems
ctx_map = structure_contexts(parsed.contexts)
items = build_line_items(parsed.facts, ctx_map)

for item in items[:5]:
    print(item.local_name, item.value, item.period)
```

## Parsing

### XBRL Instance

`parse_xbrl_facts()` takes raw bytes and returns a `ParsedXBRL` containing facts, contexts, units, schema refs, footnote links, and ignored elements.

```python
from xbrl_core import parse_xbrl_facts

parsed = parse_xbrl_facts(xbrl_bytes, source_path="example.xbrl")

# Extracted data
parsed.facts             # tuple[RawFact, ...]
parsed.contexts          # tuple[RawContext, ...]
parsed.units             # tuple[RawUnit, ...]
parsed.schema_refs       # tuple[RawSchemaRef, ...]
parsed.footnote_links    # tuple[RawFootnoteLink, ...]
parsed.ignored_elements  # tuple[IgnoredElement, ...]
parsed.fact_count        # int
```

### iXBRL (Inline XBRL)

`parse_ixbrl_facts()` parses iXBRL (XHTML-embedded XBRL) documents. The output is the same `ParsedXBRL` type, so downstream pipelines work identically.

```python
from xbrl_core import parse_ixbrl_facts

parsed = parse_ixbrl_facts(ixbrl_bytes, source_path="report.htm")

for fact in parsed.facts[:5]:
    print(fact.local_name, fact.value_raw)
```
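
To make the effect of the iXBRL `scale` and `sign` attributes concrete, here is a minimal standalone sketch. It is not xbrl-core's implementation: `apply_scale_and_sign` is a hypothetical helper, and it assumes `numdotdecimal`-style text (comma grouping, dot as decimal point).

```python
from decimal import Decimal


def apply_scale_and_sign(display_text: str, scale: int = 0, sign: str = "") -> Decimal:
    """Hypothetical helper: convert displayed text to the reported value."""
    # numdotdecimal-style input: "," groups digits, "." is the decimal point.
    cleaned = display_text.replace(",", "").strip()
    value = Decimal(cleaned)
    # scale=6 means the document shows the number in millions.
    value = value * (Decimal(10) ** scale)
    # sign="-" marks a negative value whose minus sign is not part of the text.
    return -value if sign == "-" else value


reported = apply_scale_and_sign("1,234", scale=6)        # shown in millions
loss = apply_scale_and_sign("12.5", scale=3, sign="-")   # shown in thousands, negative
```

Using `Decimal` rather than `float` keeps the scaled value exact, which matters when the result later feeds a calculation check.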

iXBRL format attributes (`ixt:numdotdecimal`, `ixt:numcommadecimal`, etc.) and `scale`/`sign` attributes are automatically applied. Custom formats can be registered:

```python
from xbrl_core import FormatRegistry, parse_ixbrl_facts

registry = FormatRegistry()
registry.register("dateyearmonthdaycjk", my_cjk_date_func)

parsed = parse_ixbrl_facts(ixbrl_bytes, format_registry=registry)
```
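
The `my_cjk_date_func` callable above is left undefined. One possible shape is sketched below, under the assumption (not confirmed by this README) that a registered format function receives the displayed text as `str` and returns the normalized value as `str`.

```python
import datetime
import re
import unicodedata

_CJK_DATE = re.compile(r"^\s*(\d{4})年\s*(\d{1,2})月\s*(\d{1,2})日\s*$")


def my_cjk_date_func(text: str) -> str:
    """Convert a date such as '2024年3月31日' to ISO form '2024-03-31'.

    Assumed signature (str -> str); check what FormatRegistry.register expects.
    """
    # NFKC folds full-width digits (e.g. '２０２４') into ASCII digits.
    normalized = unicodedata.normalize("NFKC", text)
    match = _CJK_DATE.match(normalized)
    if match is None:
        raise ValueError(f"not a CJK year-month-day date: {text!r}")
    year, month, day = (int(group) for group in match.groups())
    # datetime.date validates the calendar date (rejects month 13, day 32, ...).
    return datetime.date(year, month, day).isoformat()


iso_date = my_cjk_date_func("2024年3月31日")
```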

### IXDS (Inline XBRL Document Set)

Multiple iXBRL files from a single filing can be merged:

```python
from xbrl_core import parse_ixbrl_facts, merge_ixbrl_results

# ixbrl_files: an iterable of raw bytes, one item per iXBRL file in the filing
results = [parse_ixbrl_facts(data) for data in ixbrl_files]
merged = merge_ixbrl_results(results)
```

### Strict / Lenient Mode

Both parsers accept a `strict` parameter. When `strict=True` (default), spec violations raise `XbrlParseError`. When `strict=False`, violations emit warnings and are recorded in `ignored_elements`.

```python
parsed = parse_xbrl_facts(xbrl_bytes, strict=False)
for elem in parsed.ignored_elements:
    print(elem.reason, elem.source_line)
```

## Context Structuring

`structure_contexts()` converts raw context XML fragments into typed `StructuredContext` objects with period, entity, and dimension information.

```python
from xbrl_core import structure_contexts, ContextCollection

ctx_map = structure_contexts(parsed.contexts)

# Direct dict access
ctx = ctx_map["CurrentYearInstant"]
print(ctx.period)       # InstantPeriod(instant=datetime.date(2024, 3, 31))
print(ctx.entity_id)    # "E00001"
print(ctx.dimensions)   # tuple[DimensionMember, ...]

# ContextCollection for filtering
coll = ContextCollection(ctx_map)
coll.filter_instant()                   # instant contexts only
coll.filter_duration()                  # duration contexts only
coll.filter_no_dimensions()             # no dimension members
coll.filter_by_dimension(axis="{ns}ProductAxis", member="{ns}SegmentA")

coll.latest_instant_period              # most recent InstantPeriod
coll.unique_duration_periods            # unique DurationPeriods, sorted
```
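
For readers unfamiliar with what a raw context contains, the following standalone sketch reads the entity identifier and period out of a single `xbrli:context` element using only the standard library. It is an illustration of the idea, not how `structure_contexts()` is implemented; `read_context` and the sample XML are hypothetical, and dimensions and `xbrli:forever` periods are deliberately left out.

```python
import xml.etree.ElementTree as ET
from datetime import date

XBRLI = "{http://www.xbrl.org/2003/instance}"

CONTEXT_XML = """
<xbrli:context xmlns:xbrli="http://www.xbrl.org/2003/instance" id="CurrentYearInstant">
  <xbrli:entity>
    <xbrli:identifier scheme="http://example.com/ids">E00001</xbrli:identifier>
  </xbrli:entity>
  <xbrli:period>
    <xbrli:instant>2024-03-31</xbrli:instant>
  </xbrli:period>
</xbrli:context>
"""


def read_context(xml_text: str):
    """Return (context id, entity id, period tuple) for one xbrli:context."""
    ctx = ET.fromstring(xml_text.strip())
    entity_id = ctx.findtext(f"{XBRLI}entity/{XBRLI}identifier", default="").strip()
    period = ctx.find(f"{XBRLI}period")
    instant = period.findtext(f"{XBRLI}instant")
    if instant is not None:
        # Instant contexts describe a point in time (balance sheet items).
        return ctx.get("id"), entity_id, ("instant", date.fromisoformat(instant.strip()))
    start = period.findtext(f"{XBRLI}startDate")
    end = period.findtext(f"{XBRLI}endDate")
    if start is None or end is None:
        raise ValueError("unsupported period (for example xbrli:forever)")
    # Duration contexts describe a span of time (income statement items).
    return (ctx.get("id"), entity_id,
            ("duration", date.fromisoformat(start.strip()), date.fromisoformat(end.strip())))


context_id, entity_id, period = read_context(CONTEXT_XML)
```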

## Unit Structuring

`structure_units()` converts raw unit XML fragments into typed `StructuredUnit` objects.

```python
from xbrl_core import structure_units

unit_map = structure_units(parsed.units)

unit = unit_map["JPY"]
print(unit.is_monetary)     # True
print(unit.currency_code)   # "JPY"

unit = unit_map["pure"]
print(unit.is_pure)         # True

unit = unit_map["JPYPerShare"]
print(unit.is_per_share)    # True
```
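
The three unit kinds shown above can be told apart from the unit's measures alone. The sketch below is a simplified, standard-library illustration of that classification, not the logic inside `structure_units()`. `classify_unit` is a hypothetical helper, and it compares prefixed names such as `iso4217:JPY` as plain strings instead of resolving them against namespace declarations.

```python
import xml.etree.ElementTree as ET

XBRLI = "{http://www.xbrl.org/2003/instance}"
NS_DECL = 'xmlns:xbrli="http://www.xbrl.org/2003/instance"'


def _measures(parent, path):
    """Collect the text of every xbrli:measure found under `path`."""
    return [(m.text or "").strip() for m in parent.findall(path)]


def classify_unit(xml_text: str) -> str:
    """Return 'monetary', 'pure', 'per_share', or 'other' (simplified)."""
    unit = ET.fromstring(xml_text.strip())
    divide = unit.find(f"{XBRLI}divide")
    if divide is not None:
        numerator = _measures(divide, f"{XBRLI}unitNumerator/{XBRLI}measure")
        denominator = _measures(divide, f"{XBRLI}unitDenominator/{XBRLI}measure")
        is_money = len(numerator) == 1 and numerator[0].startswith("iso4217:")
        if is_money and denominator == ["xbrli:shares"]:
            return "per_share"
        return "other"
    measures = _measures(unit, f"{XBRLI}measure")
    if measures == ["xbrli:pure"]:
        return "pure"
    if len(measures) == 1 and measures[0].startswith("iso4217:"):
        return "monetary"
    return "other"


JPY = f'<xbrli:unit {NS_DECL} id="JPY"><xbrli:measure>iso4217:JPY</xbrli:measure></xbrli:unit>'
PURE = f'<xbrli:unit {NS_DECL} id="pure"><xbrli:measure>xbrli:pure</xbrli:measure></xbrli:unit>'
JPY_PER_SHARE = (
    f'<xbrli:unit {NS_DECL} id="JPYPerShare"><xbrli:divide>'
    "<xbrli:unitNumerator><xbrli:measure>iso4217:JPY</xbrli:measure></xbrli:unitNumerator>"
    "<xbrli:unitDenominator><xbrli:measure>xbrli:shares</xbrli:measure></xbrli:unitDenominator>"
    "</xbrli:divide></xbrli:unit>"
)

kinds = [classify_unit(x) for x in (JPY, PURE, JPY_PER_SHARE)]
```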

## Building LineItems

`build_line_items()` merges `RawFact` + `StructuredContext` + optional `LabelResolver` into fully typed `LineItem` objects.

```python
from xbrl_core import build_line_items

items = build_line_items(parsed.facts, ctx_map, langs=("en", "ja"))

for item in items:
    print(item.local_name)    # "NetSales"
    print(item.value)         # Decimal('1234567890')
    print(item.period)        # InstantPeriod / DurationPeriod
    print(item.entity_id)     # "E00001"
    print(item.dimensions)    # tuple[DimensionMember, ...]
    print(item.label("en"))   # "Net sales"
    print(item.label("ja"))   # "売上高"
```

## Linkbase Parsing

### Presentation Linkbase

```python
from xbrl_core import parse_presentation_linkbase, merge_presentation_trees

trees = parse_presentation_linkbase(pre_xml_bytes)

for role_uri, tree in trees.items():
    # Flatten the tree (depth-first)
    for node in tree.flatten(skip_abstract=True, skip_dimension=True):
        print("  " * node.depth + node.concept)

    # Get only the line-items subtree
    for node in tree.line_items_roots():
        print(node.concept, node.order)

# Merge multiple presentation linkbases
# (trees_a and trees_b are results of separate parse_presentation_linkbase() calls)
merged = merge_presentation_trees(trees_a, trees_b)
```
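
As a mental model for depth-first flattening with a `skip_abstract` option, here is a small standalone sketch. The `Node` class and `flatten` function are stand-ins written for this illustration; they are not xbrl-core's types, and the real tree nodes may carry different fields.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a presentation tree node."""
    concept: str
    order: float = 1.0
    is_abstract: bool = False
    children: list = field(default_factory=list)


def flatten(node, depth=0, skip_abstract=False):
    """Yield (depth, concept) pairs in depth-first, arc-order sequence."""
    if not (skip_abstract and node.is_abstract):
        yield depth, node.concept
    # Siblings are visited by their presentation order attribute.
    for child in sorted(node.children, key=lambda c: c.order):
        yield from flatten(child, depth + 1, skip_abstract)


root = Node("BalanceSheetAbstract", is_abstract=True, children=[
    Node("Liabilities", order=2.0),
    Node("Assets", order=1.0, children=[Node("CashAndDeposits")]),
])

for depth, concept in flatten(root, skip_abstract=True):
    print("  " * depth + concept)
```

Skipped abstract nodes keep their place in the depth count here, so the indentation of their children is unchanged. Whether the library behaves the same way is not stated in this README.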

### Calculation Linkbase

```python
from xbrl_core import parse_calculation_linkbase

calc_lb = parse_calculation_linkbase(cal_xml_bytes)

for role_uri in calc_lb.role_uris:
    tree = calc_lb.get_tree(role_uri)
    for arc in tree.arcs:
        sign = "+" if arc.weight > 0 else "-"  # weights are not limited to +1 / -1
        print(f"  {arc.parent} {sign}-> {arc.child}")

# Query relationships
calc_lb.children_of("GrossProfit")        # child arcs
calc_lb.parent_of("NetSales")             # parent arcs
calc_lb.ancestors_of("NetSales", role_uri=role_uri)  # root-ward chain within one role
```

### Definition Linkbase

```python
from xbrl_core import parse_definition_linkbase

def_trees = parse_definition_linkbase(def_xml_bytes)

for role_uri, tree in def_trees.items():
    for hc in tree.hypercubes:
        print(f"Table: {hc.table_concept}")
        for axis in hc.axes:
            print(f"  Axis: {axis.axis_concept}")
            if axis.domain:
                print(f"  Domain: {axis.domain.concept}")
```

### Label Linkbase

```python
from xbrl_core import parse_label_linkbase

labels = parse_label_linkbase(lab_xml_bytes)

for lab in labels:
    print(f"{lab.concept_name} [{lab.lang}] = {lab.text}")
```

### Reference Linkbase

```python
from xbrl_core import parse_reference_linkbase

refs = parse_reference_linkbase(ref_xml_bytes)

for ref in refs:
    print(f"{ref.concept_name}: {ref.role}")
    for part in ref.parts:
        print(f"  {part.local_name} = {part.value}")
```

### Footnotes

```python
from xbrl_core import parse_footnote_links

footnote_map = parse_footnote_links(parsed.footnote_links)

notes = footnote_map.get("IdFact1234")
for n in notes or ():  # tolerate a fact ID that has no footnotes
    print(n.text, n.lang)

print(footnote_map.fact_ids)       # Fact IDs with footnotes
print(len(footnote_map))           # number of Facts with footnotes
```

## Schema Parsing

```python
from xbrl_core import parse_xsd_elements

elements = parse_xsd_elements(xsd_bytes)

elem = elements["NetSales"]
print(elem.period_type)         # "duration"
print(elem.balance)             # "credit"
print(elem.abstract)            # False
print(elem.type_name)           # "xbrli:monetaryItemType"
print(elem.substitution_group)  # "xbrli:item"
```

## Calculation Validation

Validates summation-item relationships per XBRL 2.1 section 5.2.5.2, with decimals-based rounding tolerance.

```python
from xbrl_core import validate_calculations, parse_calculation_linkbase

calc_lb = parse_calculation_linkbase(cal_xml_bytes)
result = validate_calculations(items, calc_lb)

print(result)           # "Calculation validation: PASS (checked=42, passed=42, errors=0, skipped=3)"
print(result.is_valid)  # True

for issue in result.issues:
    print(issue.parent_concept, issue.expected, issue.actual, issue.severity)
```
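
The idea behind a decimals-based tolerance can be sketched in a few lines of standard-library Python. This is a simplified reading of the summation check, not the code behind `validate_calculations()`: the helper names are hypothetical, and the choice of `ROUND_HALF_UP` is an assumption made for the example.

```python
from decimal import Decimal, ROUND_HALF_UP


def round_to_decimals(value: Decimal, decimals: int) -> Decimal:
    """Round to `decimals` places; negative values round to tens, thousands, ..."""
    quantum = Decimal(1).scaleb(-decimals)  # decimals=-3 -> 1E+3, decimals=2 -> 0.01
    return value.quantize(quantum, rounding=ROUND_HALF_UP)


def is_consistent(parent: Decimal, parent_decimals: int, children) -> bool:
    """children: iterable of (value, weight, decimals) tuples."""
    total = sum(
        (Decimal(weight) * round_to_decimals(value, decimals)
         for value, weight, decimals in children),
        Decimal(0),
    )
    # Compare both sides at the parent's reported accuracy.
    return round_to_decimals(total, parent_decimals) == round_to_decimals(parent, parent_decimals)


# 250.4 + 149.4 = 399.8, which rounds to 400 at decimals=0.
within_tolerance = is_consistent(
    Decimal("400"), 0,
    [(Decimal("250.4"), 1, 1), (Decimal("149.4"), 1, 1)],
)

# NetSales - CostOfSales compared with GrossProfit reported to the nearest million.
gross_profit_ok = is_consistent(
    Decimal("400000000"), -6,
    [(Decimal("1234567000"), 1, -3), (Decimal("834567000"), -1, -3)],
)
```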

## Text Block Extraction

Extracts `textBlockItemType` facts (e.g. MD&A, risk factors, notes) from filings.

```python
from xbrl_core import extract_text_blocks, clean_html

blocks = extract_text_blocks(parsed.facts, ctx_map)

for block in blocks:
    print(block.concept)         # "BusinessRisksTextBlock"
    print(block.period)          # DurationPeriod(...)
    plain = clean_html(block.html)
    print(plain[:200])
```

`clean_html()` converts HTML fragments to plain text, preserving table structure with tabs and newlines — useful as preprocessing for LLM / RAG pipelines.
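
To show what "preserving table structure with tabs and newlines" can look like, here is a rough standard-library approximation. It is not the implementation of `clean_html()`; `html_to_text` is a hypothetical function, and it ignores nested tables, `colspan`, and empty trailing cells.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text, emitting a tab after each cell and a newline after each row/block."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "br":
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self.parts.append("\t")
        elif tag in ("tr", "p", "div"):
            self.parts.append("\n")

    def handle_data(self, data):
        # Collapse runs of whitespace inside text nodes.
        self.parts.append(re.sub(r"\s+", " ", data))


def html_to_text(html: str) -> str:
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    lines = []
    for line in "".join(parser.parts).split("\n"):
        cells = [cell.strip() for cell in line.split("\t")]
        line = "\t".join(cells).rstrip("\t")
        if line:
            lines.append(line)
    return "\n".join(lines)


sample = (
    "<p>Segment results</p>"
    "<table><tr><th>Segment</th><th>Sales</th></tr>"
    "<tr><td>A</td><td>100</td></tr></table>"
)
text = html_to_text(sample)
```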

## DataFrame Conversion

Requires `pip install 'xbrl-core[analysis]'`.

```python
from xbrl_core import line_items_to_dataframe, to_csv, to_parquet

df = line_items_to_dataframe(items, label_lang="en")
print(df[["local_name", "label", "value", "period_end"]].head())

# Export
to_csv(df, "output.csv")
to_parquet(df, "output.parquet")
```

Excel export requires `pip install 'xbrl-core[excel]'`:

```python
from xbrl_core import to_excel

to_excel(df, "output.xlsx", sheet_name="BalanceSheet")
```

## Display

### Rich (Terminal)

Requires `pip install 'xbrl-core[display]'`.

```python
from rich.console import Console
from xbrl_core import render_statement

table = render_statement(items, title="Balance Sheet", label_lang="en")
Console().print(table)
```

### Hierarchical Display

Use `DisplayHint` with presentation tree data for indented financial statements:

```python
from xbrl_core import (
    build_display_rows,
    render_hierarchical_statement,
    DisplayHint,
)

hints = [
    DisplayHint(concept="AssetsAbstract", depth=0, is_abstract=True, label="Assets"),
    DisplayHint(concept="CashAndDeposits", depth=1),
    DisplayHint(concept="TotalAssets", depth=0, is_total=True),
]

# Rich Table
table = render_hierarchical_statement(items, hints=hints, title="BS")

# Or get raw DisplayRow objects
rows = build_display_rows(items, hints=hints)
```

### HTML (Jupyter)

```python
from xbrl_core import to_html

# hints: the DisplayHint list built in the Hierarchical Display example
html = to_html(items, hints=hints, title="Balance Sheet")
```

## Label Resolution

`LabelResolver` is a Protocol — implement it to inject taxonomy labels into `build_line_items()`.

```python
from xbrl_core import LabelResolver, LabelInfo, LabelSource, build_line_items

class MyResolver:
    def resolve(self, concept_qname, lang, role):
        # Look up label from your taxonomy data
        return LabelInfo(text="Net sales", role=role, lang=lang, source=LabelSource.STANDARD)

    def resolve_batch(self, concept_qnames, lang, role):
        return {qn: self.resolve(qn, lang, role) for qn in concept_qnames}

items = build_line_items(parsed.facts, ctx_map, resolver=MyResolver(), langs=("en",))
```

## Error Handling

All errors inherit from `XbrlError` and carry a structured error code and context.

```python
from xbrl_core import XbrlError, XbrlParseError, XbrlValidationError, parse_xbrl_facts

try:
    parsed = parse_xbrl_facts(bad_bytes)
except XbrlParseError as e:
    print(e.code)     # "XBRL_PARSE_001"
    print(e.context)  # {"source_path": "..."}
```

| Error code prefix | Exception class | Description |
|:---|:---|:---|
| `XBRL_PARSE_xxx` | `XbrlParseError` | XML/XBRL parse errors |
| `XBRL_CTX_xxx` | `XbrlParseError` | Context structuring errors |
| `XBRL_UNIT_xxx` | `XbrlParseError` | Unit structuring errors |
| `XBRL_LINK_xxx` | `XbrlParseError` | Linkbase parse errors |
| `XBRL_IXBRL_xxx` | `XbrlParseError` | iXBRL parse errors |
| `XBRL_VAL_xxx` | `XbrlValidationError` | Validation errors |

`XbrlWarning` (a `UserWarning` subclass) is emitted for non-fatal issues.
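
Because the warning type is a `UserWarning` subclass, the standard `warnings` module can capture it or escalate it to an exception. The sketch below uses a stand-in class, `DemoWarning`, and a stand-in function, `lenient_parse`, so that it runs without xbrl-core installed. In real code, `XbrlWarning` and a parser call with `strict=False` would presumably take their places.

```python
import warnings


class DemoWarning(UserWarning):
    """Stand-in for XbrlWarning (hypothetical, for illustration only)."""


def lenient_parse():
    """Stand-in for a lenient parse that reports a non-fatal issue."""
    warnings.warn("duplicate context id ignored", DemoWarning)
    return "parsed"


# 1. Capture warnings for logging instead of printing them to stderr.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = lenient_parse()
messages = [str(w.message) for w in caught if issubclass(w.category, DemoWarning)]

# 2. Escalate the same warnings to exceptions, e.g. in a CI check.
escalated = None
with warnings.catch_warnings():
    warnings.simplefilter("error", DemoWarning)
    try:
        lenient_parse()
    except DemoWarning as exc:
        escalated = str(exc)
```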

### Customizing Error / Warning Classes

The linkbase parsers, `parse_footnote_links()`, and `structure_units()` accept optional `error_class` and `warning_class` parameters (`parse_reference_linkbase()` accepts `error_class` only). This lets downstream libraries substitute their own exception and warning types, which is useful when wrapping xbrl-core in a domain-specific package (e.g. an EDINET library).

```python
from xbrl_core import XbrlParseError, XbrlWarning, parse_calculation_linkbase

class EdinetParseError(XbrlParseError):
    """EDINET-specific parse error."""

class EdinetWarning(XbrlWarning):
    """EDINET-specific warning."""

lb = parse_calculation_linkbase(
    xml_bytes,
    error_class=EdinetParseError,
    warning_class=EdinetWarning,
)
```

Supported by: `parse_calculation_linkbase`, `parse_definition_linkbase`, `parse_presentation_linkbase`, `parse_label_linkbase`, `parse_reference_linkbase` (`error_class` only), `parse_footnote_links`, `structure_units`.

### Customizing Concept Extraction

Linkbase parsers extract concept local names from `xlink:href` fragments. The default logic handles standard XBRL taxonomy patterns (`{prefix}_{YYYY-MM-DD}.xsd#prefix_ConceptName`), but jurisdiction-specific taxonomies may use different naming conventions.

All linkbase parsers accept a `concept_extractor` parameter (`Callable[[str], str | None]`) to override this logic:

```python
import re
from xbrl_core import ConceptExtractor, parse_label_linkbase

def edinet_concept_extractor(href: str) -> str | None:
    """EDINET Strategy 2: extract local name by backward _[A-Z] scan."""
    if "#" not in href:
        return None
    fragment = href.rsplit("#", 1)[1]
    m = re.search(r"_([A-Z][A-Za-z0-9]*)$", fragment)
    return m.group(1) if m else fragment

labels = parse_label_linkbase(xml_bytes, concept_extractor=edinet_concept_extractor)
```

Supported by: `parse_calculation_linkbase`, `parse_definition_linkbase`, `parse_presentation_linkbase`, `parse_label_linkbase`, `parse_reference_linkbase`.
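
For comparison, the standard pattern quoted above (`{prefix}_{YYYY-MM-DD}.xsd#prefix_ConceptName`) could be handled roughly as follows. This is a sketch of one way to match that pattern, not xbrl-core's default extractor; `concept_from_href` is a hypothetical name, and the sample URLs are made up.

```python
import re

# File names of the form "<prefix>_<YYYY-MM-DD>.xsd".
_SCHEMA_FILE = re.compile(r"(?P<prefix>.+)_\d{4}-\d{2}-\d{2}\.xsd$")


def concept_from_href(href: str):
    """Return the concept local name from an xlink:href, or None without a fragment."""
    if "#" not in href:
        return None
    path, fragment = href.rsplit("#", 1)
    filename = path.rsplit("/", 1)[-1]
    match = _SCHEMA_FILE.match(filename)
    if match and fragment.startswith(match.group("prefix") + "_"):
        # Strip "<prefix>_" so only the concept name remains.
        return fragment[len(match.group("prefix")) + 1:]
    # Fall back to the raw fragment when the file name does not fit the pattern.
    return fragment or None


name = concept_from_href("http://example.com/taxonomy/acme_cor_2024-01-01.xsd#acme_cor_NetSales")
```

A function with this shape (`str` in, `str | None` out) matches the `concept_extractor` callable type given above, so it could be passed to any of the listed parsers.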

## Requirements

Python 3.12+. The only required dependency is `lxml >= 5.0`.
