Text Service Infrastructure

Pillar 1: Decentralized Text APIs, Canonical Text Service (CTS) or Distributed Text Service (DTS)

The plain text inventory format is
URN [TAB] Title [TAB] Year [TAB] Author [TAB] Copyright restricted [TAB] lang [NL]
Convenient example links for most API request features for each data set are available here
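
A minimal sketch of reading an inventory in the format above, assuming one record per line with exactly six tab-separated fields and no header row. The `InventoryEntry` class, its field names, and the sample record are illustrative and not part of the service. How the "Copyright restricted" column is encoded is not stated here, so the value is kept as raw text.

```python
from dataclasses import dataclass
from typing import Iterator


@dataclass
class InventoryEntry:
    urn: str
    title: str
    year: str
    author: str
    copyright_restricted: str  # raw text; the encoding of this column is an open assumption
    lang: str


def parse_inventory(text: str) -> Iterator[InventoryEntry]:
    """Yield one entry per non-empty line of a tab-separated inventory."""
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) != 6:
            raise ValueError(f"line {line_no}: expected 6 fields, got {len(fields)}")
        yield InventoryEntry(*fields)


# Made-up sample record, for illustration only.
sample = "urn:cts:demo:group.work:\tExample Title\t1900\tExample Author\tfalse\teng\n"
for entry in parse_inventory(sample):
    print(entry.urn, entry.lang)
```

Raising on a wrong field count, rather than padding or truncating, makes a malformed inventory line visible early.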

Public Data Instances

Namespace	State	Access Endpoint (Inventory)	Inventory Structure (Treemap)	Lang	Copyright Restricted (*)	Content	Data Provenance	Downloadable Resources
ancJewLit	Stable	📚	🖌	heb,arc	Free	Classical Jewish sources	AncJewLit GitHub	DB+SRC VRT Word2Vec Model Current Database
dhd	Stable	📚	🖌	deu	Free	Abstracts of DHD conference published by Verband Digital Humanities im deutschsprachigen Raum e.V.	DHd-Verband GitHub	DB+SRC VRT Word2Vec Model Current Database
dsb	WIP, URNs not persistent		🖌	dsb(,deu,hsb)	Mixed	Lower Sorbian text corpus	Serbski Institute Operator / Admin
edh	Stable	📚	🖌	Multi	Free	The Epigraphic Database Heidelberg contains the texts of Latin and bilingual (i.e. Latin-Greek) inscriptions of the Roman Empire	Epigraphic Database Heidelberg	DB+SRC VRT Word2Vec Model Current Database
folgershakespeare	Stable	📚	🖌	eng	Free	All Shakespeare's works	Folger Shakespeare Library	DB+SRC VRT Word2Vec Model Current Database
gps4	Stable	📚	🖌	deu	Free	German Political Speeches corpus compiled by Adrien Barbaresi (6'685 documents)	German Political Speeches Corpus and Visualization	DB+SRC VRT Word2Vec Model Current Database
gwtc	Stable	📚	🖌	eng	Fully	The Game Walkthrough Corpus reduced to 8'616 Neoseeker walkthroughs.	Game Walkthrough Corpus
humboldtdigital	Stable	📚	🖌	deu	Free	Diaries, letters, documents, research contributions, and chronology entries of the Edition Humboldt Digital (Version 11.0.1)	TELOTA (BBAW) GitHub	DB+SRC VRT Word2Vec Model Current Database
jeanpaulbriefe	Stable	📚	🖌	deu	Free	Data of the digital letter edition "Jean Paul – Sämtliche Briefe digital" (https://www.jeanpaul-edition.de)	TELOTA (BBAW) GitHub	DB+SRC VRT Word2Vec Model Current Database
kant	Stable	📚	🖌	deu,lat,fra	Free	Kant’s gesammelte Schriften. Neuedition der Abteilung I (https://kant-digital.bbaw.de/)	TELOTA (BBAW) GitHub	DB+SRC VRT Word2Vec Model Current Database
lebenswelten	Stable	📚	🖌	deu	Free	Lebenswelten, Erfahrungsräume, politische Horizonte (Lehndorff family, 18th-20th century) and Neuzeitlich-bäuerliche Lebenswelten (East Prussian estate archives)	TELOTA (BBAW) GitHub	DB+SRC VRT Word2Vec Model Current Database
openarabicpe	Stable	📚	🖌	ara	Free	Open Arabic Periodical Editions (Muqtabas, Manar, Ustadh, Haqaiq, Lughat, Zuhur)	OpenArabicPE GitHub	DB+SRC VRT Word2Vec Model Current Database
pbc	Stable	📚	🖌	Multi	Free	20 copyright-free parallel bible translations	Parallel Bible Corpus	DB+SRC VRT Word2Vec Model Current Database
pcp	Stable	📚	🖌	fra	Free	Chrétien de Troyes's Le Chevalier de la Charrette (Lancelot, ca. 1180)	The Princeton Charrette Project	DB+SRC VRT Word2Vec Model Current Database
textgrid	Stable	📚	🖌	deu	Free	Textgrid	The Digital Library in Textgrid	DB+SRC VRT Word2Vec Model Current Database
tgap	Stable	📚	🖌	eng,cat	Free	Thomas Gray Archive Poems	Thomas Gray Archive	DB+SRC VRT Word2Vec Model Current Database
voth	Stable	📚	🖌	Multi	Free	David Boder: Voices of the Holocaust	David Boder: Voices of the Holocaust	DB+SRC VRT Word2Vec Model Current Database

(*) The current list of requests that are available for copyright restricted texts is documented here.

Online Tools

Namespace Resolver provides endpoint URLs based on URN namespaces
Text API Explorer provides convenient access and example requests for available CTS instances
Openleaf is a free ebook-style reader

Resources and Source Code

Source Code Repositories (Git hosted via Bitbucket.org)

Suggested Citation

Jochen Tiepmar. 2025. Text Service Infrastructure. URL https://urncts.eu, requested on

Programming Interfaces

Python	from pythoncts import *
	print(cts_passage("urn:cts:dhd:2022.scholger_walter_datenschutz_in_der_wissenschaftlichen_praxis:1.1-1.2"))
JavaScript	<head><script src="cts.js"></script></head>
	<body><script type="module">
	document.write(await cts_passage("urn:cts:dhd:2022.scholger_walter_datenschutz_in_der_wissenschaftlichen_praxis:1.1-1.2"))
	</script></body>
Verticalized Text Corpus Workbench Sketch-Engine (WIP)	python3 .\cts2vrt.py tgap
	# Import in IWS Workbench
	cwb-encode -d /var/www/cwb/test/ -xsBC9 -c utf8 -f tgap.vrt -R /var/www/cwb/testregistry/test -P pos -P lemma -S edition:0+urn+title+lang -S s:0+n
	# Manual Indexing in CWB
	cwb-makeall -r /var/www/cwb/testregistry -V test
	cwb-huffcode -A -r /var/www/cwb/testregistry test
	cwb-compress-rdx -A -r /var/www/cwb/testregistry test
	# CWB Terminal & Hello World
	cqp -e
	set Registry "/var/www/cwb/testregistry";
	info TEST;
Word Embeddings Word2Vec Word Vectors Language Models (WIP)	# Build the model for a given namespace (e.g. "tgap")
	python3 cts2vec.py tgap
	# Calculate cosine similarity between word1 (e.g. "the") and word2 (e.g. "and") for the compiled model (e.g. "tgap")
	python3 cosinesim.py tgap the and
	>> Cosine similarity between 'the' and 'and' - CBOW : 0.99955016
Canonical Text Service	# Each data instance supports CTS requests on the following endpoint
	/cts/?request=[REQUEST NAME]
	# Parameters can be added as additional GET parameters, joined with the ampersand symbol &
	/cts/?request=[REQUEST NAME]&urn=[URN]
Distributed Text Service (WIP)	# Each data instance supports the following DTS requests.
	Entry Point: /dts/
	Collection Endpoint: /dts/collection
	Collection Endpoint: /dts/collection/?id=[document level incomplete URN]
	Document Endpoint: /dts/document/?id=[URN]
	Navigation Endpoint: /dts/navigation/?id=[URN]
Decentralized Structured Text Search (WIP)	Exact search
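
The CTS and DTS URL patterns listed above can be assembled programmatically. The following is a minimal sketch, assuming a hypothetical base URL (`https://example.org`); the helper names `cts_url` and `dts_url` are illustrative and not part of `pythoncts`. `GetPassage` is a request name from the CTS specification; which request names a given instance accepts is best checked in the Text API Explorer.

```python
from typing import Optional
from urllib.parse import urlencode

# Hypothetical base URL; each data instance has its own endpoint.
BASE_URL = "https://example.org"


def cts_url(request: str, base_url: str = BASE_URL, **params: str) -> str:
    """Build /cts/?request=...&key=value from a request name and GET parameters."""
    query = {"request": request, **params}
    # safe=":" keeps the colons of a URN readable instead of percent-encoding them
    return f"{base_url}/cts/?{urlencode(query, safe=':')}"


def dts_url(endpoint: str = "", identifier: Optional[str] = None,
            base_url: str = BASE_URL) -> str:
    """Build a DTS URL; endpoint is '', 'collection', 'document' or 'navigation'."""
    path = f"{base_url}/dts/{endpoint}"
    if identifier is None:
        return path
    return f"{path}/?{urlencode({'id': identifier}, safe=':')}"


urn = "urn:cts:dhd:2022.scholger_walter_datenschutz_in_der_wissenschaftlichen_praxis:1.1-1.2"
print(cts_url("GetPassage", urn=urn))
print(dts_url("document", urn))
```

Using `urlencode` rather than string concatenation ensures that any characters in a URN or parameter value that are not URL-safe are escaped.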

Selected Collections in E-Book Style

Author	Hans Christian Andersen | Wilhelm Busch | Johann Wolfgang von Goethe | Grimm's Fairy Tales | Franz Kafka | Karl May | Friedrich Schiller | Shakespeare

Pillar 2: Text Mining

Text Mining Instances

Text Mining Corpora	Source Document List	Time Series	Type*
ancJewLit	Full	No	default
dhd	Full	Yes	default
dsb Casnik Subcorpus	Full Casnik	Yes	custom
folgershakespeare	Full	No	default
edh	Full	No	default
gps4	Full	Yes	default
jeanpaulbriefe	Full	Yes	default
humboldtdigital Chronology Letter	Full Chronology Letter	Yes	default
gwtc	Full	No	default
kant	Full	No	default
lebenswelten	Full	Yes	default
openarabicpe	Full	No	default
pbc French English German	Full .fra. .eng. .deu.	Yes	default
pcp	Full	No	default
textgrid	Full	No	default
tgap	Full	No	default
voth English German	Full .eng. .deu.	No	default

(*) Type custom means that the text miner uses corpus-specific features (e.g. specific text annotation). Type default means that the default CTM source code is used without further programming: cloning the repo and running the setup script results in an identical application on your server. Diachronic features require CTS-side publication dates, which may not be available for every document or data set. Subcorpus setups require a file named "urnlist.txt" as provided in the table.
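
A minimal sketch of preparing a subcorpus list, under two loudly stated assumptions that are not confirmed here: that "urnlist.txt" contains one URN per line, and that labels such as `.eng.` in the table act as substring filters on document URNs. The function names and the sample URNs are made up for illustration.

```python
from pathlib import Path
from typing import Iterable, List


def select_urns(urns: Iterable[str], pattern: str) -> List[str]:
    """Return the URNs that contain the given substring, in input order."""
    return [urn for urn in urns if pattern in urn]


def write_urnlist(urns: Iterable[str], path: str = "urnlist.txt") -> None:
    """Write one URN per line (assumed layout of urnlist.txt)."""
    Path(path).write_text("".join(f"{urn}\n" for urn in urns), encoding="utf-8")


# Made-up URNs for illustration; real ones come from an instance's inventory.
all_urns = [
    "urn:cts:demo:bible.genesis.eng.v1:",
    "urn:cts:demo:bible.genesis.deu.v1:",
    "urn:cts:demo:bible.exodus.eng.v1:",
]
english = select_urns(all_urns, ".eng.")
print(english)
```

Calling `write_urnlist(english)` would then create the file in the current directory.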

Source Code Resources

Source Code and Installation (Git hosted via Bitbucket.org)

Suggested Citation

Jochen Tiepmar. Text Service Infrastructure. URL https://urncts.eu, requested on

Impressum and Data Protection Policy

This is a non-commercial academic research and data webservice.

Impressum
Dr. Jochen Tiepmar, c/o IP-Management #48412, Ludwig-Erhard-Str. 18, 20459 Hamburg, Germany.
Preferred contact: email to tiepilab at gmx.de or the usual academic communication channels.

Data Protection Policy
No user data is collected besides IP access logs that are stored by the Apache server software. These access logs are deleted automatically. Data sets are provided according to their public license or prior individual agreements. Tools may include third-party software under publicly available licenses (namely plotly.js and cytoscape.js).

If you use this service, make sure that you do not violate the license of the individual data set or document edition. For instance, do not build commercial software based on data sets that prohibit commercial use. Whether or not specific commercial use cases and tools allow the usage of specific non-commercial data sets has to be decided by the person responsible for the use case or tool.

Cooperation and Service Contracts

I am open to cooperation in academic projects and also work as a private service provider. If you need a custom text analysis infrastructure that may include CTS, DTS, custom text mining tools, IWS corpus workbench, regular versioned updates, custom tools and programming interfaces, as well as data hosting and more, feel free to contact me for a service contract. If you just want your TEI/XML text corpus added to the infrastructure and use the already available generic tools and interfaces, it is free of charge: just let me know and I am more than happy to expand this infrastructure with interesting data sets.

State of Infrastructure

Availability and Software Versions (versionsoftware.php): 🖌 📚
Database Versions (versiondata.php): 🖌 📚

FAQ

What is this?

tl;dr: This is a service-independent, decentralized text data service that includes the protocols Canonical Text Service, Distributed Text Service and a growing number of other text API requests.
The Canonical Text Services protocol defines the interaction between a client and a server that provides identification of texts and retrieval of canonically cited passages of texts. The official specifications by David Neel Smith and Christopher Blackwell can be found here. Distributed Text Services (DTS) is a more abstract API protocol that aims to make served content (including but not limited to text data) interoperable at the metadata level within the tools that use DTS. This content can be referred to by a CTS URN or a website URL (or, as indicated in the specifications, any text string).
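
To illustrate how a canonical citation is structured, here is a minimal sketch that splits a CTS URN of the form `urn:cts:<namespace>:<work>[:<passage>]` into its components. The `CtsUrn` type and both function names are illustrative; the sketch ignores subreferences and does not validate the individual components.

```python
from typing import NamedTuple, Optional, Tuple


class CtsUrn(NamedTuple):
    namespace: str
    work: str
    passage: Optional[str]  # None when the URN refers to a whole work


def parse_cts_urn(urn: str) -> CtsUrn:
    """Split 'urn:cts:<namespace>:<work>[:<passage>]' into its components."""
    parts = urn.split(":")
    if len(parts) not in (4, 5) or parts[:2] != ["urn", "cts"]:
        raise ValueError(f"not a CTS URN: {urn!r}")
    passage = parts[4] if len(parts) == 5 and parts[4] else None
    return CtsUrn(namespace=parts[2], work=parts[3], passage=passage)


def passage_range(passage: str) -> Tuple[str, str]:
    """Return (start, end) for a range like '1.1-1.2'; end equals start for a single reference."""
    start, _, end = passage.partition("-")
    return start, end or start


urn = parse_cts_urn(
    "urn:cts:dhd:2022.scholger_walter_datenschutz_in_der_wissenschaftlichen_praxis:1.1-1.2"
)
print(urn.namespace, passage_range(urn.passage))
```

The namespace component (here `dhd`) is the part a namespace resolver would map to a server endpoint.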

Is the implementation feature complete?

CTS: yes, aside from possible bugs and things that I missed.
DTS: Work is in Progress.
Future work includes a lot of additional features that extend the usual protocols (e.g. license management on passage request level, domain-specific language models and more). See this dissertation for more information about what is planned.

How do data requests work for texts that are copyright restricted?

To avoid the necessity of user management, randomized copyright tokens are generated for each relevant data set. These copyright tokens can be used to allow access, similar to a key that opens a locked door. You can request such a key if (and only if!) you qualify for access, e.g. by affiliation or ownership. Restricted content is served if a valid token is provided as in this example request (just for demo purposes, the document is copyright free). Copyright tokens are randomized each Sunday at midnight to avoid harm from inadvertently published keys, so plan your workflows accordingly.
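
For planning long-running jobs around the weekly token rotation, a minimal sketch follows. It assumes that "Sunday at midnight" means 00:00 at the start of Sunday and that the caller passes times in the server's timezone; neither is confirmed here. The function names are illustrative.

```python
from datetime import datetime, timedelta


def next_token_rotation(now: datetime) -> datetime:
    """Return the first Sunday 00:00 strictly after `now` (same timezone as `now`)."""
    days_ahead = (6 - now.weekday()) % 7  # Monday is 0, Sunday is 6
    candidate = (now + timedelta(days=days_ahead)).replace(
        hour=0, minute=0, second=0, microsecond=0
    )
    if candidate <= now:
        candidate += timedelta(days=7)  # already past this week's rotation
    return candidate


def token_survives(job_start: datetime, job_duration: timedelta) -> bool:
    """True if a job started at `job_start` finishes before the next rotation."""
    return job_start + job_duration < next_token_rotation(job_start)


start = datetime(2025, 1, 4, 23, 0)  # a Saturday evening
print(next_token_rotation(start), token_survives(start, timedelta(hours=2)))
```

A job that would cross the rotation can either be started later or be split so that a fresh token is requested for the second part.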

Can I host my data on my own server?

Each prepared data set is available via Zenodo. You can download the database and source code onto your server. If you want your data to be available in the programming interfaces, you need to tell me the server address so I can register it for the namespace resolver. If your text corpus is not yet prepared and I can help, please get in touch.

How about data persistency and versioning? Can I reliably cite text passages via URNs or can the text content change?

CTS URNs are meant to be persistent references. However, mistakes and improvements happen, and structural markup can change while documents are still being edited. There is no clear solution for this problem, but some kind of versioning will be implemented (e.g. numbered updates). Text corpora that are still being worked on are marked as WIP in the State column of the table above. Generally, CTS URNs of stable corpora can be considered safe for citation purposes.

How reliable is this service? Will you monetize the tools and services once people depend on it?

The server is financed privately and I am using these webservices for my own programming work and research. The software is open source and can be recreated by anyone. It is planned to implement a data cloning mechanism, which will allow decentralized distributed backups for texts once they are part of the infrastructure; this will eliminate any dependency on individual servers as it will allow anyone to mix and host their own data instances. Monetizing this service will not be necessary and would be counterproductive for me personally because it would undermine the reliability of my research output.
Each copyright-free data instance is also provided as an SQL database and PHP source code package via Zenodo, enabling you to set it up yourself.

The source code of the analysis tools is also as free as possible to avoid building gateway demos designed to lead users into commercial software. However, commercial software can and might be developed in the scope of this project if fitting use cases arise.