Open data

Freely available entomological datasets published under CC BY 4.0.

Dataset
v1.2 · 2026-04-28 · CC BY 4.0


Hymenoptera Family Matrix — Open Dataset

Štrunc, Vladimír · EntoLibry s.r.o. · ORCID 0009-0006-1012-0948

A taxonomic skeleton of order Hymenoptera at family level, cross-referenced with GBIF taxon keys and English and Czech common names. Covers 352 taxa: 2 suborders, 2 infraorders, 26 superfamilies, 98 families, 179 subfamilies, and 45 tribes.

Keywords: Hymenoptera, taxonomy, GBIF, entomology, open dataset

Cite as:
Štrunc, V. (2026). Hymenoptera Family Matrix — Open Dataset, v1.2. EntoLibry s.r.o. [Dataset]. Zenodo. https://doi.org/10.5281/zenodo.19850164

DOI: 10.5281/zenodo.19850163 (concept DOI — always resolves to latest version)

The Hymenoptera Family Matrix

A single structured dataset encoding the morphology, ecology, behaviour, biogeography, and identification pathways of every family of sawflies, wasps, bees, and ants on Earth.


What it is

The Hymenoptera Family Matrix is a hand-curated research spreadsheet covering the entire order Hymenoptera — from the most basal sawfly relicts to the hyper-diverse parasitoid wasps, the social insects, and every bee family on the planet. It currently holds 352 taxa × 164 data columns, organised across 21 worksheets, and backed by 37 primary literature sources, 687 audit-log entries, and a full provenance-tracking system.

Every row represents a taxonomic unit — 2 suborders, 2 infraorders, 26 superfamilies, 98 families, 179 subfamilies, and 45 tribes — treated at the resolution that real-world identification demands.

The freely available Zenodo edition provides the taxonomic backbone: hierarchy, authorities, GBIF cross-references, and bilingual common names. The sections below describe the extended matrix, which powers our e-book series.


What it encodes

The 164 columns are organised into thematic blocks, each designed to answer a different kind of question.

Block A — Taxonomy (14 columns). The full classificatory hierarchy from suborder to tribe, following Aguiar et al. (2013) and Zhang (2025). Every family-level row carries a verified GBIF taxon key. Common names are provided in English (324 of 352 taxa) and in Czech (28 taxa, sourced exclusively from attested literature: Macek 2010, Macek 2020, Přidal 2001).

Block B — Morphology (43 columns). Body size ranges and categories. Head orientation, ocelli, clypeus form, mandible type, malar space ratio. Five antennal characters (type, segment count, insertion, club, elbow). Ten thorax characters including pronotum shape, notauli, propodeum fusion, cenchri, pubescence, and three specialist traits added in v2.0 (mesopleural suture, sternaulus, acropleuron). Nine wing characters, including cell count, venation reduction, the diagnostic 2m-cu vein, marginal vein morphology, hamuli count, wing fringe, areolet presence, and pterostigma shape. Tarsal formula, tibial spurs, strigilis, pollen-collecting apparatus (corbicula vs. scopa), and trochantellus. Petiole segmentation, metasomal shape, petiole node shape (Formicidae), and the A3–A4 constriction diagnostic for Ponerinae.

Every morphological character is defined in a 130-row legend with explicit state coding, and governed by 35 dependency rules (e.g., corbicula/scopa is only scored for Apoidea bees; cenchri only for Symphyta; areolet only for Ichneumonidae).

Block C — Immature stages (6 columns). Larval type (eruciform, apodous, hypermetamorphic), leg and proleg count, stemmata, pupal type, and cocoon presence — the characters that separate a sawfly caterpillar from a butterfly caterpillar at a glance.

Block D — Ecology and hosts (25 columns). Adult and larval feeding guilds. Ten atomised host-taxon columns broken out by host group (Lepidoptera, Coleoptera, Hemiptera, Aphidoidea, Hymenoptera, Diptera, Araneae, plants, other invertebrates) plus a free-text original field for audit. Host range, parasitoid strategy (idiobiont/koinobiont), ecto- vs. endoparasitism, gall formation, nesting behaviour, Michener sociality scale, caste systems, and pollination role.

Block E — Appearance and diagnostics (6 columns). Wing pattern, body colour, ovipositor length class, metasomal banding, and ocular hair presence — the “what does it look like?” characters atomised from the original diagnostic free-text and kept separately for quick visual sorting.

Block F — Identification metadata (5 columns). Difficulty rating, confusion groups (which families get mixed up with which), diagnostic life stage, genital necessity, and a four-tier identification level.

Block G — Biogeography and phylogeny (16 columns). Presence/absence across nine zoogeographic regions (Palaearctic, W/E Palaearctic, Nearctic, Neotropical, Afrotropical, Madagascar, Oriental, Australasian). Distribution type. First fossil appearance, crown-group and stem-group ages in Ma. Sister group. Phylogenetic position. Key phylogenomic study.

Block H — Applied and synoptic indices (26 columns). Thermal preference, climate sensitivity, population trend. Economic importance, pest status, biocontrol use. Five binary syndrome flags (aculeate, parasitoid, eusocial, pollinator, gall-former). Informal grouping (bees, ants, social wasps, solitary wasps, parasitoids, sawflies, gall wasps, chalcidoids, cuckoo wasps). Size class, trophic guild, nesting guild, Gondwanan lineage flag, invasive risk, economic use, recommended identification entry point, diel activity, human-contact frequency, optics needed, teaching tier, field season, habitat type, specimen preparation, Czech occurrence, and Czech field-guide coverage.

Block I — Publication layer (7 columns). Every row carries a unique page anchor, a WordPress URL slug, a brief Czech annotation (1–2 sentences), an English diagnostic description, a key-character summary, a confusion-group note, and a biological anomaly flag (68 taxa with notable exceptions to family-level generalisations). These columns are what our e-books are built from.

Provenance tracking (5 columns). Each data block is independently scored as P (primary — explicitly authored), I (inherited from the parent taxon), or C (conflict — requires expert verification). The current matrix has 145 primary rows and 205 inherited rows for the publication block alone, with every inherited text transparently prefixed (“Subfamily of X. …” or “Tribe X (Family:Subfamily). …”).


How it is built

The matrix follows a publication philosophy adapted from a parallel Coleoptera project and refined across nine audit rounds.

Hierarchical inheritance. Subfamily and tribe rows inherit their parent family’s data by default, but can be promoted to primary status when their biology diverges enough to warrant a distinct description. The provenance system makes the boundary between authored and derived content visible at all times.
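
The inheritance rule can be pictured with a small sketch. It is only an illustration under stated assumptions: the `Taxon` class, its field names, and the description texts are hypothetical placeholders, not the matrix's actual schema or content. Only the P/I codes and the "Subfamily of X." prefix come from the description above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Taxon:
    # Hypothetical structure; the real matrix is a spreadsheet, not a class.
    name: str
    rank: str
    parent: Optional["Taxon"] = None
    description: Optional[str] = None  # None means "not authored at this rank"


def resolve_description(taxon):
    """Return (provenance, text): 'P' if authored on the row, 'I' if inherited."""
    if taxon.description is not None:
        return "P", taxon.description
    ancestor = taxon.parent
    while ancestor is not None:
        if ancestor.description is not None:
            # Inherited text is prefixed so its origin stays visible.
            prefix = f"{taxon.rank.capitalize()} of {ancestor.name}. "
            return "I", prefix + ancestor.description
        ancestor = ancestor.parent
    raise ValueError(f"no authored description found above {taxon.name}")


# Placeholder texts for illustration only.
apidae = Taxon("Apidae", "family", description="Placeholder family text.")
apinae = Taxon("Apinae", "subfamily", parent=apidae)
nomadinae = Taxon("Nomadinae", "subfamily", parent=apidae,
                  description="Placeholder text authored for this subfamily.")

inherited = resolve_description(apinae)
promoted = resolve_description(nomadinae)
```

Here `inherited` carries provenance `I` with the parent's text behind a visible prefix, while `promoted` carries `P` because the row has its own authored text.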

Atomised characters. Wherever earlier versions used free-text diagnostic fields, versions from 2.0 onward break them into scored, machine-queryable columns. Host associations, for example, moved from a single “HOST_TAXON” string to ten binary columns by host group, enabling queries like “which families parasitise both Lepidoptera and Coleoptera?” without parsing text.
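
A query of that kind might look like the sketch below. The column names `HOST_LEPIDOPTERA` and `HOST_COLEOPTERA`, the `FAMILY` key, and the three rows are assumptions made up for illustration; they are not values taken from the matrix.

```python
# Hypothetical rows: column names and 0/1 values are placeholders, not matrix data.
rows = [
    {"FAMILY": "Family_A", "HOST_LEPIDOPTERA": 1, "HOST_COLEOPTERA": 1},
    {"FAMILY": "Family_B", "HOST_LEPIDOPTERA": 1, "HOST_COLEOPTERA": 0},
    {"FAMILY": "Family_C", "HOST_LEPIDOPTERA": 0, "HOST_COLEOPTERA": 1},
]


def families_with_all_hosts(rows, host_columns):
    """Return families whose rows are scored 1 in every listed host column."""
    return [
        row["FAMILY"]
        for row in rows
        if all(row.get(column) == 1 for column in host_columns)
    ]


both = families_with_all_hosts(rows, ["HOST_LEPIDOPTERA", "HOST_COLEOPTERA"])
```

With binary columns the question reduces to a plain boolean filter; with a single free-text string the same query would need pattern matching and would be sensitive to spelling variants.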

Character dependencies. A formal dependency table (35 rules) defines when a character is inapplicable — corbicula cannot be scored for Ichneumonidae, petiole node shape is only meaningful for Formicidae, the 2m-cu vein matters only when wing venation is not strongly reduced. This prevents false zeroes from polluting comparative analyses.
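
One way to picture a dependency rule is as a predicate that decides whether a character can be scored at all. The sketch below is an assumption about how such rules could be applied, not the matrix's own implementation; the column names and the two rules shown are hypothetical stand-ins for the 35 real rules.

```python
NOT_APPLICABLE = None  # kept distinct from 0, which means "scored as absent"

# Hypothetical rules: character name -> predicate saying when it may be scored.
DEPENDENCY_RULES = {
    "PETIOLE_NODE_SHAPE": lambda row: row["FAMILY"] == "Formicidae",
    "CENCHRI": lambda row: row["SUBORDER"] == "Symphyta",
}


def apply_dependencies(row, rules=DEPENDENCY_RULES):
    """Return a copy of the row with inapplicable characters blanked out."""
    cleaned = dict(row)
    for character, applies in rules.items():
        if character in cleaned and not applies(row):
            cleaned[character] = NOT_APPLICABLE
    return cleaned


# Placeholder row: the zeroes stand for values entered where no score is meaningful.
ichneumonid = {
    "FAMILY": "Ichneumonidae",
    "SUBORDER": "Apocrita",
    "PETIOLE_NODE_SHAPE": 0,
    "CENCHRI": 0,
}
cleaned = apply_dependencies(ichneumonid)
```

After the call both characters are `None` rather than `0`, so a later comparison can skip them instead of counting them as shared absences.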

Discrimination metrics. An IFEDQ++ (Inter-Family Effective Discrimination Quality) metric evaluates every morphological character’s ability to separate family pairs across all 4,753 possible pairwise comparisons among the 98 families (98 × 97 / 2). The metric applies block capping (e.g., five biogeographic columns together contribute at most 1.5 points rather than 5), entropy weighting, and polymorphism discounting, so the matrix knows which characters actually work for building keys, not just which characters vary.
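
The pair count and the idea of block capping can be checked with a few lines. This is not the IFEDQ++ formula, which is not given here; the functions `separated_pairs` and `capped_block_score`, the one-point-per-column scoring, and the toy states are all assumptions for illustration. Entropy weighting and polymorphism discounting are left out.

```python
from itertools import combinations
from math import comb

# Number of unordered family pairs among 98 families.
pair_count = comb(98, 2)


def separated_pairs(states):
    """Count pairs whose scored states differ; pairs with an unscored side are skipped."""
    return sum(
        1
        for a, b in combinations(states.values(), 2)
        if a is not None and b is not None and a != b
    )


def capped_block_score(column_scores, cap):
    """Sum the scores of one thematic block, but never exceed the block's cap."""
    return min(sum(column_scores), cap)


# Toy data: four placeholder families scored for one hypothetical character.
toy_states = {"Family_A": 0, "Family_B": 1, "Family_C": 1, "Family_D": None}
toy_separated = separated_pairs(toy_states)

# Five columns that each separate a pair (assumed 1 point each), capped at 1.5.
biogeographic_score = capped_block_score([1.0] * 5, cap=1.5)
```

The cap keeps a block of correlated columns from dominating a comparison simply because it has many columns.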

Audit trail. 687 structured log entries and 51 recommendation records document every editorial decision, from GBIF ID corrections to source-conflict resolution. A separate 971-row citation audit traces every data point back to its literature source.


What it enables

The extended matrix is not published as a raw file. It is the engine behind a series of derived products.

Identification keys. The Key_nodes sheet defines 45 decision-tree nodes across two MVP keys (a general Hymenoptera entry key and a Chalcidoidea micro-key), each node linking a plain-language question to a matrix predicate (e.g., “PROPODEUM_FUSION=1”) and photographic support. Six strategic entry points let users start from informal groups (“Is it a bee?”), body size, trophic guild, nesting behaviour, economic relevance, or morphology.
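
A key node of this kind can be sketched as a predicate that splits candidate taxa into a yes branch and a no branch. The sketch assumes predicates are single `COLUMN=VALUE` equalities like the example above; the real predicate grammar may be richer. The function names and the two rows are hypothetical.

```python
def parse_predicate(predicate):
    """Split a 'COLUMN=VALUE' string; anything else is rejected in this sketch."""
    column, separator, value = predicate.partition("=")
    if not separator or not column.strip() or not value.strip():
        raise ValueError(f"unsupported predicate: {predicate!r}")
    return column.strip(), value.strip()


def matches(row, predicate):
    """Compare the row's value, as text, with the value named in the predicate."""
    column, expected = parse_predicate(predicate)
    return str(row.get(column)) == expected


def apply_node(rows, predicate):
    """Split candidate rows into those that satisfy the predicate and those that do not."""
    yes_branch = [row for row in rows if matches(row, predicate)]
    no_branch = [row for row in rows if not matches(row, predicate)]
    return yes_branch, no_branch


# Placeholder rows, not matrix data.
candidates = [
    {"TAXON": "Taxon_A", "PROPODEUM_FUSION": 1},
    {"TAXON": "Taxon_B", "PROPODEUM_FUSION": 0},
]
yes_branch, no_branch = apply_node(candidates, "PROPODEUM_FUSION=1")
```

In this simplified version a row with no value in the column falls into the no branch; a fuller treatment would probably keep unscored or inapplicable taxa in both branches so that they are not eliminated by a question that cannot be answered for them.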

Layperson observables. Fifteen visual markers, including metallic sheen, bumblebee habitus, honeybee lookalikes, wasp waist, caterpillar-like larvae, velvet ants, paper nests, ant mounds, micro-hymenoptera, visible ovipositors, painful stings, and plant galls, are each mapped to the families that exhibit them, with explicit confusion warnings (e.g., hoverflies mimicking honeybees have only two wings, whereas winged Hymenoptera have four).

Teaching pathways. A 172-row study-path table organises the 98 families into a five-semester learning curriculum, sequenced by teaching tier, required optics, field season, and specimen preparation method.

Page architecture. A 375-row mapping table assigns every taxon a page anchor and URL slug, ready for deployment as a web reference or e-book chapter structure.


Source integrity

The matrix is grounded in 37 cited references spanning foundational taxonomic works (Goulet & Huber 1993, Bolton 2003, Michener 2007), modern phylogenomics (Peters et al. 2017, Blaimer et al. 2023), regional faunas (Macek et al. 2010, 2020), and applied entomology databases (GBIF, IUCN, CABI, AntCat). A dedicated Garanti_znaku (character guarantor) sheet assigns every morphological character to a responsible literature source with year, reference ID, and status (classic / modern / modern_revision / novel).

Classification follows Aguiar et al. (2013) with updates from Zhang (2025). It deliberately does not adopt the Burks et al. (2022) reclassification of Chalcidoidea, so Pteromalidae is treated as a single family. This is a documented decision, not an oversight, and the matrix is designed to accommodate the split in a future version.

The open taxonomic backbone (13 columns, 352 taxa) is freely available on Zenodo under CC BY 4.0. The extended matrix is the source behind our Hymenoptera e-book series — illustrated guides, field keys, and study pathways built from 164 columns of structured data.