generate publication pages from the metadata pool
Michał Szczepanik f5c1ee04ce Make (some) prefixes configurable
The Research Information schema should be the same between deployments,
with the exception of prefixes used for identifiers. This makes them
configurable for Project and Publication processing.

We have three kinds of prefixes:

- root namespace (ROOTNS): used in object PIDs, the code uses that
  information to decide if / where the object belongs in the generated
  website (e.g. trr379root, xyzrins)

- schema prefix (SPFX): used for types defined by the schema, the code
  uses those to determine whether it's dealing with an object of a given
  schema type (e.g. trr379ri, xyzri)

- class prefix (CPFX): used at the beginning of type names defined by
  the schema but after the actual schema prefix, used by the code
  together with SPFX and for the same purpose (e.g. TRR379, XYZ)

The SPFX and CPFX always go together (e.g. trr379ri:TRR379Person) and one is
a variation of the other (uppercase without "ri"), but it's better to be
explicit.
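
To make the three prefixes concrete, here is a minimal sketch of how they could be grouped and checked. It is not the code from project.py or publication.py: the class name `Prefixes`, its method names, and the assumption that a type is written as `SPFX:CPFX` followed by the class name are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prefixes:
    """Deployment-specific identifier prefixes (hypothetical container)."""

    rootns: str  # root namespace used in object PIDs, e.g. "trr379root"
    spfx: str    # schema prefix used for schema types, e.g. "trr379ri"
    cpfx: str    # class prefix at the start of type names, e.g. "TRR379"

    def in_root_namespace(self, pid: str) -> bool:
        """True if the PID is a CURIE in this deployment's root namespace."""
        return pid.startswith(f"{self.rootns}:")

    def is_type(self, schema_type: str, class_name: str) -> bool:
        """True if schema_type names the given schema class, e.g. 'Person'.

        Assumes types look like '<SPFX>:<CPFX><ClassName>'.
        """
        return schema_type == f"{self.spfx}:{self.cpfx}{class_name}"


# Two deployments that share the schema but differ in prefixes.
trr = Prefixes(rootns="trr379root", spfx="trr379ri", cpfx="TRR379")
xyz = Prefixes(rootns="xyzrins", spfx="xyzri", cpfx="XYZ")
```

Keeping SPFX and CPFX as separate fields, rather than deriving one from the other, matches the preference for being explicit.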

Perhaps we could get rid of these checks altogether, or perhaps not.
For example, referencing something in a Hugo taxonomy leads to creation
of a page, which may not be desired (hence checking if something is in
the root namespace).
2026-02-25 20:24:01 +01:00
filters Add docstring and help to the join_association filter 2026-02-23 15:00:07 +01:00
.gitignore Add parsing of person records 2026-01-23 17:10:25 +01:00
.python-version One has to start somewhere 2025-11-17 13:06:59 +01:00
person.py Treat person's projects as a set 2026-02-24 19:35:45 +01:00
project.py Make (some) prefixes configurable 2026-02-25 20:24:01 +01:00
publication.py Make (some) prefixes configurable 2026-02-25 20:24:01 +01:00
pyproject.toml Stop print debugging, remove icecream dependency & update others 2026-02-23 18:36:09 +01:00
README.md Rewrite README 2026-02-19 13:48:15 +01:00
uv.lock Stop print debugging, remove icecream dependency & update others 2026-02-23 18:36:09 +01:00

From knowledge pool to a website

Overview

This is a collection of scripts for generating Hugo Markdown pages from the data retrieved from the TRR379 Knowledge Pooling Tool, enriching it with external metadata where possible or necessary.

The scripts are divided into two categories:

  • formatters are at the top level; they accept JSON-lines data conforming to the Research Information schema (with optional additional fields) from files or standard input and write files in the output directory (file names are determined from PIDs);

  • filters are in the filters/ directory; they read schema-compliant JSON-lines input from a file or standard input and write to a file or standard output; they may query external services or read additional files.
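
The filter contract described above (JSON-lines in, JSON-lines out, with either side being a file or a standard stream) can be sketched as follows. This is an illustration only: the scripts in this repository use Click for argument handling, whereas the sketch uses a small standard-library stand-in (`open_text`), and the `add_marker` transformation and `x_seen` field are hypothetical.

```python
import json
import sys
from typing import Callable, Iterable, Iterator, TextIO


def open_text(path: str, mode: str) -> TextIO:
    """Open a text file, treating '-' as stdin (read) or stdout (write)."""
    if path == "-":
        return sys.stdin if mode == "r" else sys.stdout
    return open(path, mode, encoding="utf-8")


def filter_lines(
    lines: Iterable[str], transform: Callable[[dict], dict]
) -> Iterator[str]:
    """Apply transform to every JSON-lines record, skipping blank lines."""
    for line in lines:
        if line.strip():
            yield json.dumps(transform(json.loads(line)))


def add_marker(record: dict) -> dict:
    """Hypothetical transformation: add a custom x_ field to the record."""
    return {**record, "x_seen": True}


# Usage: python sketch.py INPUT OUTPUT  (either may be "-")
if __name__ == "__main__" and len(sys.argv) == 3:
    src, dst = open_text(sys.argv[1], "r"), open_text(sys.argv[2], "w")
    for out_line in filter_lines(src, add_marker):
        dst.write(out_line + "\n")
```

Because each record is processed independently, filters written this way can be chained with shell pipes without holding the whole data set in memory.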

Data retrieval from the Pool and some transformations are out of scope for this repository; for those, use the dump things client (dtc) and query-rse-group (qrg).

Usage example

We recommend using uv run and running from the root directory of this project. This is an example pipeline for creating contributor pages:

dtc read-pages https://pool.v0.trr379.de/api/public/records/p/TRR379Project > /tmp/projects.jsonl
dtc read-pages https://pool.v0.trr379.de/api/public/records/p/TRR379Person \
| qrg inline-records --api-url https://pool.v0.trr379.de/api/ -c public -p delegated_by \
| uv run filters/join-association.py --inline --pop --field-name x_associated_projects - /tmp/projects.jsonl associated_with \
| uv run filters/infer-site.py - $EXTRA_DATA_DIR/v2.2-2026-01-29-ror-data.parquet - \
| uv run person.py - $HUGO_SOURCES_DIR/content/contributors

Line by line, this does the following:

  • save all project records in a file for future reference (project records contain roles)
  • load all person records (replace read-pages with get-records to process a single record)
  • inline delegated_by property
  • do a join-like operation with the previously saved Project records, inserting a custom field x_associated_projects (i.e. not a schema-compliant property of Person), created as an inverse of Project's associated_with
  • add information about "Sites", which requires cross-referencing the ROR records (additional information) stored in the parquet file from another repository
  • format the records into Hugo Markdown files, writing to the given directory

Note the usage of - to denote stdin / stdout (provided by Click's File arguments).
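
The join-like step in the walkthrough above can be illustrated with a short sketch. It is not the implementation of filters/join-association.py and ignores the --inline and --pop options; it assumes, for illustration only, that every record has a `pid` field and that a project's `associated_with` is a flat list of person PIDs.

```python
from collections import defaultdict


def join_association(
    persons, projects, field="x_associated_projects", via="associated_with"
):
    """Attach to each person the PIDs of the projects that reference them."""
    # Build the inverse mapping: person PID -> PIDs of referencing projects.
    inverse = defaultdict(list)
    for project in projects:
        for person_pid in project.get(via, []):
            inverse[person_pid].append(project["pid"])
    # Return copies of the person records with the custom field added.
    return [{**person, field: inverse.get(person["pid"], [])} for person in persons]
```

Building the inverse mapping once keeps the join linear in the number of records, instead of scanning all projects for every person.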

Caching

We attempt to be reasonably efficient when interacting with external services. Web requests to doi.org are cached (with a TTL of 2 hours) in .cache. A copy of the SPDX license list is downloaded and saved in the same directory the first time it is needed. The cache can be safely removed to force re-retrieval of information.
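
As an illustration of the caching behaviour, here is a minimal file-based sketch using only the standard library. It is not how the scripts implement their cache (they may rely on a dedicated library); `cached_fetch`, its parameters, and the on-disk layout are hypothetical, and only the 2-hour TTL and the .cache directory come from the description above.

```python
import hashlib
import json
import time
from pathlib import Path
from typing import Callable

TTL_SECONDS = 2 * 60 * 60  # two hours, as described above


def cached_fetch(
    url: str,
    fetch: Callable[[str], dict],
    cache_dir: Path = Path(".cache"),
    ttl: float = TTL_SECONDS,
) -> dict:
    """Return fetch(url), reusing a cached copy younger than ttl seconds."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    # One JSON file per URL, named after a hash of the URL.
    entry = cache_dir / (hashlib.sha256(url.encode()).hexdigest() + ".json")
    if entry.exists() and time.time() - entry.stat().st_mtime < ttl:
        return json.loads(entry.read_text(encoding="utf-8"))
    data = fetch(url)
    entry.write_text(json.dumps(data), encoding="utf-8")
    return data
```

With this layout, deleting the cache directory simply causes the next call to fetch again, which mirrors the "safe to remove" property of the real cache.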