q02/pool-publication-page

Change workflow and update to match new schema #3

Manually merged

msz merged 0 commits from updates into main

2026-02-11 16:22:09 +00:00

msz commented

2026-02-03 17:49:30 +00:00

Member

The idea is roughly as follows:

- most of the data will come from `dtc read-pages | qrg inline-records`
- if we need enrichment (e.g. reading from extra files), this will be done in a preceding step

msz added 1 commit

2026-02-03 17:49:31 +00:00

Trim person-processing script ff8513bae2

This removes all query components, anticipating the fact that we will
use external scripts to query / filter / inline / extend the records.

Using click's File argument allows us to read from file or standard
input ("-") (1). This should mesh well with our other work.

(1) https://click.palletsprojects.com/en/stable/handling-files/
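
A minimal sketch of the file-or-stdin pattern described in this commit message, using `click.File`. The command name `count_records`, the argument name `records`, and the JSON-lines input format are assumptions made for illustration; the actual script may differ.

```python
import json

import click


@click.command()
@click.argument("records", type=click.File("r"), default="-")
def count_records(records):
    """Count records read from a file path or from standard input ("-")."""
    # Assumption: one JSON object per non-empty line.
    parsed = [json.loads(line) for line in records if line.strip()]
    click.echo(len(parsed))
```

Passing `-` (or omitting the argument, since `-` is the default here) makes click read from standard input, so the command can sit at the end of a shell pipeline.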

msz added 1 commit

2026-02-04 12:09:19 +00:00

Read "affiliation" from "delegated_by" c7ad859951

Because the data model has no "affiliation" and the website template
does have one (shown under person's name), we read the names of the
organizations listed as Person.delegated_by.

One weak spot is that we check the delegation object against a hardcoded
curie (trr379ri:TRR379Organization).

Another weak spot is that we currently do not have an "Employer" role in
the pool, and we would ideally use that plus object type to craft
affiliation.

One final weak spot is that the organizations in the pool are currently
only ROR organizations (so big ones, rather than labs or institutes) -
but this is a good start and we have ways to move forward once more
granular affiliation becomes needed.
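
A rough sketch of reading affiliation names from `delegated_by`, assuming the organizations are already inlined as dicts. The `name` key and the function name `affiliations` are hypothetical, and the check against the hardcoded curie is left out because the commit message does not say how it is applied.

```python
def affiliations(person: dict) -> list[str]:
    """Collect organization names from a Person's delegated_by entries."""
    names = []
    for org in person.get("delegated_by", []):
        # Entries that are not inlined (plain pids) carry no name; skip them.
        if isinstance(org, dict) and org.get("name"):
            names.append(org["name"])
    return names
```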

msz added 1 commit

2026-02-04 17:06:00 +00:00

Add a basic join implementation a02e1d52e0

Some information about a given object will be available only from other
objects. For example, to match Person to Project, we must look through
Projects and their associated_with. A Person can be associated with
multiple projects. For this, we need some kind of join / merge.

This implementation is inspired by query_rse_group and might eventually
migrate there, but for starters it can be kept here.
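
The join described here could look roughly like the sketch below. It is not the implementation from this commit: it assumes that records are dicts with a `pid` key and that `associated_with` holds plain Person pids. The output property name `x_associated_projects` is the one mentioned in a later commit of this pull request.

```python
from collections import defaultdict


def join_projects(persons: list[dict], projects: list[dict]) -> list[dict]:
    """Attach to each Person the pids of Projects that list it in associated_with."""
    projects_by_person = defaultdict(list)
    for project in projects:
        for person_pid in project.get("associated_with", []):
            projects_by_person[person_pid].append(project["pid"])
    # Return copies so the input records stay untouched.
    return [
        {**person, "x_associated_projects": projects_by_person.get(person["pid"], [])}
        for person in persons
    ]
```

Indexing the projects once keeps the join at one pass over each list, and a Person linked from several projects simply gets several entries.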

msz added 3 commits

2026-02-04 18:34:54 +00:00

Retrieve projects and roles 3c6b84c9cc

Because projects link to persons (not vice versa), this means that we
need to combine information from Person and Project records. Since
Person's projects are not part of the schema, we will assume that we are
using an additional "x_associated_projects" property.

The role detection currently works with role pids (not inlined roles)
because it is (sort of) easier to map the pids to the role taxonomy
currently used in the website. There is definitely room for improvement:
for example this script does not currently recognize LOC relators "rtm",
and the doctoral-researcher role is not commonly used in the pool. This
will require convergence between the pool, website taxonomy, and this
code.

Remove unused function for processing sites 4c75acab48

This can be done more exhaustively, probably as a separate "filter".

Add a rudimentary function to process ORCID 765fbbac08

We recognize ORCID by schema type or by its creator pid; it could be
done more nicely if we worked with an inlined creator, but for some
reason the ror record for ORCID is currently missing.

msz added 1 commit

2026-02-06 18:04:23 +00:00

Add site information bf70a92f7c

This adds an "infer-site" filter, which combines the pool information
with information from the ror database (data dump). The external ror
information is used to provide information about related organizations,
in addition to parent organizations. A duckdb dependency is used to
improve load speed and reduce data dump size by using parquet format.

Only the seven main TRR sites are hardcoded; an organization (present as
delegated_by) counts as site if it is that site, has the site as parent
or related organization, or has the site as its parent organization's
parent or related organization.

For example, this filter recognizes ror:03f6n9m15 University Hospital
Frankfurt as "frankfurt" site because it is related to ror:04cvxnb49
Goethe University Frankfurt (which is the "frankfurt" site).

After implementing this, I realized that this could be simplified by
inverting the lookup: start not from any given organization in the
Person record, but from the list of sites. A mapping of related, child,
child-child, and child-related organizations could be created upfront,
producing a much smaller org-to-site lookup table. This can, however, be
done in the future, without changing the interface much.
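
One possible shape for the inverted lookup, as a sketch under assumptions: `children` and `related` are hypothetical dicts derived from the ror dump, the "related" relationship is treated as symmetric, and the `ex:` identifier is made up for the example.

```python
def build_org_to_site(
    sites: dict[str, str],
    children: dict[str, list[str]],
    related: dict[str, list[str]],
) -> dict[str, str]:
    """Map every organization within reach of a site to that site's name."""
    org_to_site: dict[str, str] = {}
    for site, name in sites.items():
        kids = children.get(site, [])
        rel = related.get(site, [])
        members = {site, *kids, *rel}
        # Second level: children of children and children of related orgs.
        for org in [*kids, *rel]:
            members.update(children.get(org, []))
        for org in members:
            # setdefault keeps the first site if an org is reachable from two.
            org_to_site.setdefault(org, name)
    return org_to_site
```

With such a table built once, deciding the site for an organization in a Person record becomes a single dictionary lookup.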

msz added 1 commit

2026-02-09 12:34:57 +00:00

Filter Person records by PID ee42f1f289

This follows the convention of the trr379root: prefix designating records
which should end up having a page on the website. Person records with
other prefixes may include, e.g., co-authors of a paper with no
affiliation with the TRR.
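
The prefix check might be as small as the sketch below, assuming each record is a dict with a `pid` key. The function name `gets_page` is hypothetical.

```python
PAGE_PREFIX = "trr379root:"


def gets_page(record: dict) -> bool:
    """True if the record's pid marks it as having a page on the website."""
    return str(record.get("pid", "")).startswith(PAGE_PREFIX)
```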

msz added 2 commits

2026-02-09 19:33:17 +00:00

Trim and update publication processing script 3383886bdb

This eliminates the pool-querying functions from the script for
processing publications, and changes its API so that querying can be
done with dump-things-pyclient and query-rse-group.

Publication property access is updated to match the updated schema.

This is functional, but essentially a work in progress - the next step
should be factoring out the enrichment functions which use doi.org. Some
code is left commented out.

Ignore non-inlined contributor records dc1fd8e822

If we expect the contributor records to be inlined (i.e. to be a dict), this
causes problems when an entry is not inlined (because the pid is not in
the pool) and remains a pid.

Ignoring such entries seems an inconsequential change for now, but needs
to be reconsidered in the future.
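
A sketch of the skip described above, assuming contributor entries sit in a list under a hypothetical `contributors` key and that a non-inlined entry is a plain pid string:

```python
def inlined_contributors(publication: dict) -> list[dict]:
    """Keep only contributor entries that were inlined as dicts."""
    return [
        entry
        for entry in publication.get("contributors", [])
        if isinstance(entry, dict)  # a bare pid (str) means the record was not in the pool
    ]
```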

msz added 1 commit

2026-02-11 15:03:26 +00:00

Refactor publication generation: separate doi.org interactions 56399bead9

This is an attempt at isolating enrichment (via doi.org queries) from
formatting markdown pages.

Further optimizations will, no doubt, be needed, but the functionality
is maintained (as in: the output produced seems to be maintained).

msz changed title from ~~WIP: Change workflow and update to match new schema~~ to Change workflow and update to match new schema

2026-02-11 16:15:34 +00:00

msz added 1 commit

2026-02-11 16:18:48 +00:00

Rename script for publications d8ba1b9f26

Now that we are dealing with publications and persons, there is nothing
"main" about the publication processing. We may bring back "main" and
introduce a subcommand API in the future, though.

msz manually merged commit 0abe8fdaef into main

2026-02-11 16:22:09 +00:00
