pool-publication-page/README.md
Michał Szczepanik 014df14b76 Rewrite README
Update README to reflect the extended scope and the new, pipe-based
approach.
2026-02-19 13:48:15 +01:00

# From knowledge pool to a website
## Overview
This is a collection of scripts for generating Hugo Markdown pages
from the data retrieved from the TRR379 Knowledge Pooling Tool,
enriching it with external metadata where possible or necessary.
The scripts are divided into two categories:
- *formatters* are at the top level; they accept JSON-lines data
conforming to the Research information schema (with optional
additional fields) from files or standard input and write files in
the output directory (file names are determined from PIDs);
- *filters* are in the `filters/` directory; they read
schema-compliant JSON-lines input from file or standard input and
write to file or standard output; they may make queries to external
services or read additional files.
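
The two script shapes described above can be sketched with the standard library alone. This is a minimal illustration, not the repository's code: the function names `read_jsonl`, `run_filter`, and `run_formatter`, the `pid` and `name` keys, and the rule for turning a PID into a file name are all assumptions made for the example.

```python
import json
import re
from pathlib import Path
from typing import Callable, Iterable, Iterator, TextIO


def read_jsonl(stream: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per non-empty line of JSON-lines input."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


def run_filter(
    infile: Iterable[str], outfile: TextIO, transform: Callable[[dict], dict]
) -> None:
    """Filter shape: records in, (possibly enriched) records out."""
    for record in read_jsonl(infile):
        outfile.write(json.dumps(transform(record)) + "\n")


def run_formatter(infile: Iterable[str], outdir: Path) -> list[Path]:
    """Formatter shape: records in, one Markdown file per record out."""
    outdir.mkdir(parents=True, exist_ok=True)
    written = []
    for record in read_jsonl(infile):
        # Hypothetical naming rule: replace characters that are unsafe in
        # file names with '-'. The real scripts may derive names differently.
        name = re.sub(r"[^A-Za-z0-9_-]+", "-", record["pid"]).strip("-")
        path = outdir / f"{name}.md"
        # A JSON string is also a valid YAML scalar, so it is safe in front matter.
        title = json.dumps(record.get("name", name))
        path.write_text(f"---\ntitle: {title}\n---\n", encoding="utf-8")
        written.append(path)
    return written
```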
Data retrieval from the Pool and some transformations are out of scope
for this repository; for those, use the [dump things
client](https://hub.psychoinformatics.de/datalink/dump-things-pyclient)
(`dtc`) and
[query-rse-group](https://hub.psychoinformatics.de/datalink/query-rse-group)
(`qrg`).
## Usage example
We recommend using `uv run` and running from the root directory of
this project. This is an example pipeline for creating contributor
pages; it expects the `EXTRA_DATA_DIR` and `HUGO_SOURCES_DIR`
environment variables to be set:
```
dtc read-pages https://pool.v0.trr379.de/api/public/records/p/TRR379Project > /tmp/projects.jsonl
dtc read-pages https://pool.v0.trr379.de/api/public/records/p/TRR379Person \
  | qrg inline-records --api-url https://pool.v0.trr379.de/api/ -c public -p delegated_by \
  | uv run filters/join-association.py --inline --pop --field-name x_associated_projects - /tmp/projects.jsonl associated_with \
  | uv run filters/infer-site.py - "$EXTRA_DATA_DIR/v2.2-2026-01-29-ror-data.parquet" - \
  | uv run person.py - "$HUGO_SOURCES_DIR/content/contributors"
```
Line by line, this does the following:
- save all project records in a file for future reference (project
records contain roles)
- load all person records (replace with `get-records` to process a
single record)
- inline the `delegated_by` property
- do a join-like operation with the previously saved Project records,
inserting a custom field (i.e. not a schema-compliant property of
Person) `x_associated_projects` created as the inverse of Project's
`associated_with`
- add information about "Sites", which requires cross-referencing the
ROR records (additional information) stored in the parquet file
[from another repository](https://hub.trr379.de/q02/ror-data-copy)
- format the records into Hugo Markdown files, writing to the given
directory
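
The join-like step in the list above can be illustrated as an inverse lookup. This is a rough sketch under simplifying assumptions, not the logic of `filters/join-association.py`: the helper `join_inverse` is hypothetical, records are assumed to carry a `pid` key, and `associated_with` is treated as a flat list of person PIDs (the real Project records also carry roles). The effects of `--inline` and `--pop` are not modeled.

```python
from collections import defaultdict
from typing import Iterable, Iterator


def join_inverse(
    records: Iterable[dict],
    others: Iterable[dict],
    other_field: str,
    new_field: str,
) -> Iterator[dict]:
    """Attach to each record the PIDs of `others` that point at it.

    `others` reference records through `other_field`; the result is stored
    on each record under `new_field` (an empty list if nothing points at it).
    """
    referenced_by = defaultdict(list)
    for other in others:
        for target_pid in other.get(other_field, []):
            referenced_by[target_pid].append(other["pid"])
    for record in records:
        # Copy the record so that the input is left unmodified.
        yield {**record, new_field: list(referenced_by.get(record["pid"], []))}


# Example with made-up PIDs:
projects = [
    {"pid": "ex:p1", "associated_with": ["ex:alice", "ex:bob"]},
    {"pid": "ex:p2", "associated_with": ["ex:alice"]},
]
persons = [{"pid": "ex:alice"}, {"pid": "ex:carol"}]
joined = list(join_inverse(persons, projects, "associated_with", "x_associated_projects"))
```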
Note the usage of `-` to denote stdin / stdout (provided by [Click's
File
arguments](https://click.palletsprojects.com/en/stable/handling-files/)).
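
For readers unfamiliar with the `-` convention, the behavior can be approximated in plain Python. The scripts rely on Click for this; the helper below, `open_or_std`, is a hypothetical standard-library stand-in written only to show the idea.

```python
import sys
from contextlib import contextmanager


@contextmanager
def open_or_std(path: str, mode: str = "r"):
    """Open `path`, treating "-" as stdin (reading) or stdout (writing)."""
    if path == "-":
        # The standard streams are handed out as they are and never closed here.
        yield sys.stdout if "w" in mode else sys.stdin
    else:
        with open(path, mode, encoding="utf-8") as handle:
            yield handle
```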
## Caching
We attempt to be reasonably efficient when interacting with external
services. Web requests to doi.org are cached in `.cache` with a TTL of
2 hours. A copy of the SPDX license list is downloaded and saved in
the same directory the first time it is needed. The cache can be
safely removed to force the information to be retrieved again.
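
To make the TTL behavior concrete, here is a small file-based cache sketch. It is an illustration only and does not reflect how the scripts actually cache requests: the function `cached_fetch`, the on-disk layout, and the injectable `fetch` and `now` parameters are assumptions chosen for the example.

```python
import hashlib
import json
import time
from pathlib import Path

TTL_SECONDS = 2 * 60 * 60  # two hours, matching the TTL described above


def cached_fetch(key, fetch, cache_dir=Path(".cache"), ttl=TTL_SECONDS, now=time.time):
    """Return fetch(key), reusing a stored result younger than `ttl` seconds."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Hash the key so that DOIs containing '/' map to safe file names.
    path = cache_dir / (hashlib.sha256(key.encode("utf-8")).hexdigest() + ".json")
    if path.exists():
        entry = json.loads(path.read_text(encoding="utf-8"))
        if now() - entry["stored_at"] < ttl:
            return entry["value"]
    # Missing or expired entry: fetch again and overwrite the stored copy.
    value = fetch(key)
    path.write_text(json.dumps({"stored_at": now(), "value": value}), encoding="utf-8")
    return value
```

In this sketch, deleting the cache directory simply causes the next call to fetch again, which mirrors the note that the cache can be removed safely.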