# From knowledge pool to a website

## Overview

This is a collection of scripts for generating Hugo Markdown pages
from the data retrieved from the TRR379 Knowledge Pooling Tool,
enriching it with external metadata where possible or necessary.

The scripts are divided into two categories:

- *formatters* are at the top level; they accept JSON-lines data
  conforming to the Research information schema (with optional
  additional fields) from files or standard input and write files in
  the output directory (file names are determined from PIDs);

- *filters* are in the `filters/` directory; they read
  schema-compliant JSON-lines input from file or standard input and
  write to file or standard output; they may make queries to external
  services or read additional files.

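To make the filter contract concrete, here is a minimal sketch of a JSON-lines filter. It is not taken from this repository: `run_filter`, `add_marker` and the `x_seen` field are hypothetical names, and the records are made-up examples. It only shows the general shape (one JSON record per line in, one per line out).

```python
import io
import json
from typing import Callable, Iterable, TextIO


def run_filter(
    source: Iterable[str],
    sink: TextIO,
    transform: Callable[[dict], dict],
) -> int:
    """Apply `transform` to each JSON-lines record; return the record count."""
    count = 0
    for line in source:
        if not line.strip():
            continue  # tolerate blank lines between records
        record = transform(json.loads(line))
        sink.write(json.dumps(record) + "\n")
        count += 1
    return count


def add_marker(record: dict) -> dict:
    # Hypothetical enrichment: add a custom (non-schema) field.
    return {**record, "x_seen": True}


# Demo on in-memory streams; a real filter would use sys.stdin / sys.stdout.
demo_in = io.StringIO('{"pid": "ex:alice"}\n{"pid": "ex:bob"}\n')
demo_out = io.StringIO()
processed = run_filter(demo_in, demo_out, add_marker)
```

Because the function only needs an iterable of lines and a writable stream, the same logic works on files, pipes, or in-memory buffers, which is what makes such scripts easy to chain.
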
Data retrieval from the Pool and some transformations are out of scope
for this repository; for those, use the [dump things
client](https://hub.psychoinformatics.de/datalink/dump-things-pyclient)
(`dtc`) and
[query-rse-group](https://hub.psychoinformatics.de/datalink/query-rse-group)
(`qrg`).

## Usage example

We recommend using `uv run` and running from the root directory of
this project. This is an example pipeline for creating contributor pages:

```
dtc read-pages https://pool.v0.trr379.de/api/public/records/p/TRR379Project > /tmp/projects.jsonl
dtc read-pages https://pool.v0.trr379.de/api/public/records/p/TRR379Person |
  qrg inline-records --api-url https://pool.v0.trr379.de/api/ -c public -p delegated_by |
  uv run filters/join-association.py --inline --pop --field-name x_associated_projects - /tmp/projects.jsonl associated_with |
  uv run filters/infer-site.py - "$EXTRA_DATA_DIR/v2.2-2026-01-29-ror-data.parquet" - |
  uv run person.py - "$HUGO_SOURCES_DIR/content/contributors"
```

Line by line, this does the following:

- save all project records in a file for future reference (project
  records contain roles)
- load all person records (replace with `get-records` to process
  a single record)
- inline the `delegated_by` property
- do a join-like operation with the previously saved Project records,
  inserting a custom field (i.e. not a schema-compliant property of
  Person) `x_associated_projects` created as an inverse of Project's
  `associated_with`
- add information about "Sites", which requires cross-referencing the
  ROR records (additional information) stored in the parquet file
  [from another repository](https://hub.trr379.de/q02/ror-data-copy)
- format the records into Hugo Markdown files, writing to the given
  directory

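The join-like step can be pictured with the following sketch. It is an illustration only, not the code of `filters/join-association.py`. It assumes that every record carries a `pid` and that a Project's `associated_with` is a list of person PIDs; the real record shapes, and the effect of `--inline` and `--pop`, may differ. The function name `invert_association` and the sample records are hypothetical.

```python
from collections import defaultdict


def invert_association(persons, projects, prop="associated_with",
                       field_name="x_associated_projects"):
    """Attach to each person the PIDs of projects that reference them."""
    # Build the inverse index: person PID -> list of project PIDs.
    inverse = defaultdict(list)
    for project in projects:
        for person_pid in project.get(prop, []):
            inverse[person_pid].append(project["pid"])

    # Yield copies so the input records are left untouched.
    for person in persons:
        yield {**person, field_name: inverse.get(person["pid"], [])}


# Made-up records, assuming the shapes described above.
projects = [
    {"pid": "ex:projA", "associated_with": ["ex:alice", "ex:bob"]},
    {"pid": "ex:projB", "associated_with": ["ex:alice"]},
]
persons = [{"pid": "ex:alice"}, {"pid": "ex:bob"}, {"pid": "ex:carol"}]

joined = list(invert_association(persons, projects))
```

Building the inverse index once keeps the operation linear in the number of records, and person records can then be streamed through one at a time.
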
Note the usage of `-` to denote stdin / stdout (provided by [Click's
File arguments](https://click.palletsprojects.com/en/stable/handling-files/)).

## Caching

We attempt to be reasonably efficient when interacting with external
services. Web requests to doi.org are cached (with a TTL of 2
hours) in `.cache`. A copy of the SPDX license list is
downloaded and saved in the same directory when it first needs to be
accessed. The cache can be safely removed to force re-retrieval of
information.

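The TTL behaviour can be sketched as follows. This is not the caching mechanism used by the scripts, which is not described here. `cached_fetch`, the hashed file names, the `fetch` callback and the example URL are all hypothetical, and only the standard library is used.

```python
import hashlib
import tempfile
import time
from pathlib import Path

TTL_SECONDS = 2 * 60 * 60  # two hours, matching the TTL mentioned above


def cached_fetch(url, fetch, cache_dir, ttl=TTL_SECONDS, now=None):
    """Return the body for `url`, calling `fetch(url)` only on a miss or expiry."""
    now = time.time() if now is None else now
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Hash the URL to get a file-system-safe name (an assumed naming scheme).
    path = cache_dir / hashlib.sha256(url.encode("utf-8")).hexdigest()
    if path.exists() and now - path.stat().st_mtime < ttl:
        return path.read_text(encoding="utf-8")
    body = fetch(url)
    path.write_text(body, encoding="utf-8")
    return body


calls = []


def fake_fetch(url):
    # Stand-in for a real web request; records how often it is called.
    calls.append(url)
    return f"response {len(calls)}"


EXAMPLE_URL = "https://doi.org/10.1234/example"  # made-up DOI, for illustration

with tempfile.TemporaryDirectory() as tmp:
    start = time.time()
    first = cached_fetch(EXAMPLE_URL, fake_fetch, tmp, now=start)
    second = cached_fetch(EXAMPLE_URL, fake_fetch, tmp, now=start + 60)
    expired = cached_fetch(EXAMPLE_URL, fake_fetch, tmp, now=start + TTL_SECONDS + 60)
```

In this sketch, deleting the cache directory simply turns every lookup into a miss, which mirrors why removing `.cache` is safe.
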