Change workflow and update to match new schema #3

msz merged 0 commits from updates into main 2026-02-11 16:22:09 +00:00

The idea is roughly as follows:

  • most of the data will come from `dtc read-pages | qrg inline-records`
  • if we need enrichment (e.g. reading from extra files), this will be done in a preceding step
This removes all query components, anticipating the fact that we will
use external scripts to query / filter / inline / extend the records.

Using click's File argument allows us to read from file or standard
input ("-") (1). This should mesh well with our other work.

(1) https://click.palletsprojects.com/en/stable/handling-files/
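
A minimal sketch of the file-or-stdin pattern described above. The
command name `list_pids`, the one-JSON-object-per-line input format, and
the "pid" key are assumptions made for illustration, not what the script
in this PR actually does.

```python
import json

import click


@click.command()
@click.argument("records", type=click.File("r"))
def list_pids(records):
    """Print the pid of every record in RECORDS (a path, or "-" for stdin)."""
    for line in records:
        if line.strip():
            # Assumption: one JSON object per line, each with a "pid" key.
            click.echo(json.loads(line)["pid"])
```

With click.File, passing "-" as the argument makes click hand the
function standard input instead of opening a file, so the same command
can sit at the end of a pipe or read a saved dump.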
Because the data model has no "affiliation" and the website template
does have one (shown under the person's name), we read the names of the
organizations listed as Person.delegated_by.

One weak spot is that we check the delegation object against a hardcoded
CURIE (trr379ri:TRR379Organization).

Another weak spot is that we currently do not have an "Employer" role in
the pool, and we would ideally use that plus the object type to craft
the affiliation.

One final weak spot is that the organizations in the pool are currently
only ROR organizations (so big ones, rather than labs or institutes) -
but this is a good start and we have ways to move forward once more
granular affiliation becomes needed.
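
A rough sketch of the affiliation lookup described above. It assumes
that delegated_by is a list of inlined organization dicts carrying
"schema_type" and "name" keys; those key names and the function name are
assumptions, only the CURIE comes from the text.

```python
# Hardcoded CURIE named in the text (the first "weak spot").
ORG_TYPE = "trr379ri:TRR379Organization"


def affiliation_names(person):
    """Return the names of organizations found in person["delegated_by"]."""
    names = []
    for org in person.get("delegated_by") or []:
        # Entries that were not inlined are plain pid strings; skip them.
        if not isinstance(org, dict):
            continue
        # Assumption: the object type is stored under "schema_type".
        if org.get("schema_type") == ORG_TYPE and org.get("name"):
            names.append(org["name"])
    return names
```

Once an "Employer" role exists in the pool, the type check could be
combined with a role check in the same loop.
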
Some information about a given object will be available only from other
objects. For example, to match Person to Project, we must look through
Projects and their associated_with. A Person can be associated with
multiple projects. For this, we need some kind of join / merge.

This implementation is inspired by query_rse_group and might eventually
migrate there, but for starters it can be kept here.
Because projects link to persons (not vice versa), this means that we
need to combine information from Person and Project records. Since
Person's projects are not part of the schema, we will assume that we are
using an additional "x_associated_projects" property.
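
A minimal sketch of such a join, assuming that records are dicts with a
"pid" key and that Project.associated_with is a list of person pids (both
are assumptions; the function name is hypothetical as well):

```python
from collections import defaultdict


def attach_projects(persons, projects):
    """Return copies of the person records with "x_associated_projects" set."""
    projects_by_person = defaultdict(list)
    for project in projects:
        # Assumption: associated_with holds person pids, not inlined records.
        for person_pid in project.get("associated_with") or []:
            projects_by_person[person_pid].append(project["pid"])
    return [
        {**person, "x_associated_projects": projects_by_person.get(person["pid"], [])}
        for person in persons
    ]
```

Building the pid-to-projects index first means each list is walked once,
and a Person associated with several projects simply collects several
entries.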

The role detection currently works with role pids (not inlined roles)
because it is (sort of) easier to map the pids to the role taxonomy
currently used in the website. There is definitely room for improvement:
for example, this script does not currently recognize the LOC relator
"rtm", and the doctoral-researcher role is not commonly used in the
pool. This will require convergence between the pool, website taxonomy,
and this code.
This can be done more exhaustively, probably as a separate "filter".
We recognize ORCID by schema type or by its creator pid; it could be
done more nicely if we worked with an inlined creator, but for some
reason the ROR record for ORCID is currently missing.
This adds an "infer-site" filter, which combines the pool information
with information from the ROR database (data dump). The external ROR
information is used to provide information about related organizations,
in addition to parent organizations. A duckdb dependency is added so
that the data dump can be stored in parquet format, which improves load
speed and reduces the dump size.

Only the seven main TRR sites are hardcoded; an organization (present as
delegated_by) counts as a site if it is that site, has the site as a
parent or related organization, or has the site as its parent
organization's parent or related organization.

For example, this filter recognizes ror:03f6n9m15 University Hospital
Frankfurt as the "frankfurt" site because it is related to ror:04cvxnb49
Goethe University Frankfurt (which is the "frankfurt" site).
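
The rule above can be sketched without duckdb by using a small in-memory
dict in place of the ROR data dump. The dict layout, the function name,
and the "ex:some-lab" entry are hypothetical; the two ror: identifiers
are the ones from the example.

```python
# One of the seven hardcoded sites, taken from the example in the text.
SITES = {"ror:04cvxnb49": "frankfurt"}

# Hypothetical stand-in for the ROR dump: org -> linked orgs by relation.
ROR_LINKS = {
    "ror:03f6n9m15": {"parent": [], "related": ["ror:04cvxnb49"]},
    "ex:some-lab": {"parent": ["ror:03f6n9m15"], "related": []},  # made up
}


def _linked(org, links):
    """Parent and related organizations of org (empty if org is unknown)."""
    entry = links.get(org, {})
    return entry.get("parent", []) + entry.get("related", [])


def infer_site(org, links=ROR_LINKS, sites=SITES):
    """Return the site name for org, or None if no rule matches."""
    if org in sites:  # the organization is the site itself
        return sites[org]
    for candidate in _linked(org, links):  # site is a parent or related org
        if candidate in sites:
            return sites[candidate]
    for parent in links.get(org, {}).get("parent", []):
        for candidate in _linked(parent, links):  # parent's parent or related
            if candidate in sites:
                return sites[candidate]
    return None
```

Note that the second hop is only followed through parents, matching the
wording of the rule; related organizations of a related organization are
not considered.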

After implementing this, I realized that this could be simplified by
inverting the lookup: start not from any given organization in the
Person record, but from the list of sites. A mapping of related, child,
child-child, and child-related organizations could be created upfront,
producing a much smaller org-to-site lookup table. This can, however, be
done in the future, without changing the interface much.
This follows the convention of the trr379root: prefix designating
records which should end up having a page on the website. Person records
with other prefixes may include, e.g., co-authors of a paper with no
affiliation with the TRR.
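
As a sketch, the prefix convention amounts to a one-line filter. It
assumes that each record is a dict whose "pid" holds the prefixed
identifier; the function name is hypothetical.

```python
PAGE_PREFIX = "trr379root:"  # prefix named in the text


def records_with_pages(records):
    """Keep only the records that should end up with a page on the website."""
    # Assumption: the prefixed identifier is stored under "pid".
    return [r for r in records if str(r.get("pid", "")).startswith(PAGE_PREFIX)]
```
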
This eliminates the pool-querying functions from the script for
processing publications, and changes its API so that querying can be
done with dump-things-pyclient and query-rse-group.

Publication property access is updated to match the updated schema.

This is functional, but essentially a work in progress - the next step
should be factoring out the enrichment functions which use doi.org. Some
code is left commented out.
If we expect the contributor records to be inlined (i.e. be a dict), this
causes problems when an entry is not inlined (because the pid is not in
the pool) and remains a pid.

Ignoring such entries seems an inconsequential change for now, but needs
to be reconsidered in the future.
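
One possible shape for this, sketched under the assumption that the
contributor entries live in a list under a "contributors" key (the key
and the function name are hypothetical). Returning the skipped pids
alongside the inlined records keeps them visible for when this is
reconsidered.

```python
def split_contributors(publication):
    """Separate inlined contributor records (dicts) from leftover pids."""
    inlined, skipped = [], []
    # Assumption: contributor entries are stored under "contributors".
    for entry in publication.get("contributors") or []:
        if isinstance(entry, dict):
            inlined.append(entry)
        else:
            # Not inlined because the pid was not found in the pool.
            skipped.append(entry)
    return inlined, skipped
```
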
This is an attempt at isolating enrichment (via doi.org queries) from
formatting markdown pages.

Further optimizations will, no doubt, be needed, but the functionality
is maintained (as in: the output produced seems to be maintained).
msz changed title from WIP: Change workflow and update to match new schema to Change workflow and update to match new schema 2026-02-11 16:15:34 +00:00
Now that we are dealing with publications and persons, there is nothing
"main" about the publication processing. We may bring back "main" and
introduce a subcommand API in the future, though.
msz manually merged commit 0abe8fdaef into main 2026-02-11 16:22:09 +00:00
Reference
q02/pool-publication-page!3