---
title: "How Scholarly Identifiers Are Defined"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{How Scholarly Identifiers Are Defined}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

# Introduction

This vignette explains how common scholarly identifiers are formally defined,
what their structural components are, and what it means for them to be *valid*
in a programmatic context.

When working with identifiers in R, it is essential to distinguish between:

- **Structural validity** (does it match the formal grammar?)
- **Checksum validity** (does the control digit verify?)
- **Registry validity** (does the identifier actually exist?)

The functions in `scholid` operate at the **structural level**. The regexes
shown below describe the structural form that an identifier must match.

---

# DOI (Digital Object Identifier)

**Governing body:** International DOI Foundation  
**Standard:** ISO 26324

## Structure

A DOI has two parts:

```
prefix/suffix
```

### Prefix
- Always begins with `10.`
- Followed by a registrant code (in practice 4–9 digits)

Example:
```
10.1000
10.1038
```

### Suffix
- Assigned by the registrant
- May contain almost any printable character
- Has no globally fixed grammar
- Case-insensitive for ASCII letters (`10.1000/ABC` and `10.1000/abc` identify the same object)

Example:
```
10.1000/182
10.1038/s41586-020-2649-2
```

## Important Properties

- No checksum.
- The suffix is opaque.
- Structural validation cannot confirm existence.
- DOI resolution requires registry lookup (e.g., via doi.org).

## Structural Regex

A commonly used structural regex, deliberately permissive about the suffix:

```
^10\.\d{4,9}/\S+$
```

This checks that:
- The string starts with `10.`
- A registrant code of 4–9 digits follows
- A slash separates prefix and suffix
- The suffix is one or more non-whitespace characters
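
As a minimal sketch, the pattern can be applied with base R's `grepl()`. The helper name `looks_like_doi` is hypothetical and is not part of `scholid`; `perl = TRUE` is assumed so that `\d` and `\S` are interpreted as PCRE classes.

```r
# Hypothetical helper (not part of scholid): structural DOI check only.
# It says nothing about whether the DOI is registered.
looks_like_doi <- function(x) {
  grepl("^10\\.\\d{4,9}/\\S+$", x, perl = TRUE)
}

looks_like_doi(c(
  "10.1000/182",                # TRUE
  "10.1038/s41586-020-2649-2",  # TRUE
  "11.1000/182",                # FALSE: prefix does not start with "10."
  "10.12/abc"                   # FALSE: registrant code shorter than 4 digits
))
```

Because the suffix is opaque, the sketch accepts any non-whitespace suffix and makes no attempt to normalise case.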

---

# ORCID

**Governing body:** ORCID, Inc.  
**Standard basis:** ISO 7064 (checksum algorithm)

## Structure

An ORCID iD consists of 16 characters, conventionally displayed in four hyphen-separated groups:

```
0000-0002-1825-0097
```

### Components

- 16 characters total: 15 digits followed by one check character
- Grouped as 4-4-4-4
- The final character is the check character
- The check character is a digit or `X` (representing the value 10)

Internally (without hyphens):

```
0000000218250097
```

## Checksum

The check character is computed with the ISO 7064 MOD 11-2 algorithm.  
A structurally correct ORCID iD is still invalid if its check character does not match.
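
The following is a minimal sketch of the MOD 11-2 computation in base R, for a single identifier at a time. The function names `orcid_check_char` and `orcid_checksum_ok` are hypothetical and are not part of `scholid`.

```r
# Hypothetical helpers (not part of scholid); each expects a single string.

# Compute the check character from the first 15 digits (ISO 7064 MOD 11-2).
orcid_check_char <- function(base15) {
  digits <- as.integer(strsplit(base15, "")[[1]])
  total <- 0
  for (d in digits) {
    total <- (total + d) * 2
  }
  result <- (12 - total %% 11) %% 11
  if (result == 10) "X" else as.character(result)
}

# Compare the computed check character with the one supplied.
orcid_checksum_ok <- function(x) {
  compact <- gsub("-", "", x, fixed = TRUE)
  if (!grepl("^\\d{15}[\\dX]$", compact, perl = TRUE)) return(FALSE)
  orcid_check_char(substr(compact, 1, 15)) == substr(compact, 16, 16)
}

orcid_checksum_ok("0000-0002-1825-0097")  # TRUE
orcid_checksum_ok("0000-0002-1825-0098")  # FALSE: structurally fine, wrong check character
```

The second call shows the distinction drawn in the introduction: the string matches the grammar but fails the checksum.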

## Structural Regex

Hyphenated form:

```
^\d{4}-\d{4}-\d{4}-\d{3}[\dX]$
```

Unhyphenated internal form:

```
^\d{15}[\dX]$
```
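
As an illustration, the two patterns can be combined to convert the hyphenated form into the compact one. This is a sketch under the assumption that non-matching input should become `NA`; the names `orcid_hyphenated`, `orcid_compact`, and `orcid_to_compact` are hypothetical and are not part of `scholid`.

```r
# Patterns from this section, written as R strings.
orcid_hyphenated <- "^\\d{4}-\\d{4}-\\d{4}-\\d{3}[\\dX]$"
orcid_compact    <- "^\\d{15}[\\dX]$"

# Hypothetical helper (not part of scholid): strip hyphens from inputs that
# match the hyphenated pattern; anything else becomes NA.
orcid_to_compact <- function(x) {
  ok <- grepl(orcid_hyphenated, x, perl = TRUE)
  out <- gsub("-", "", x, fixed = TRUE)
  out[!ok] <- NA_character_
  out
}

orcid_to_compact(c("0000-0002-1825-0097", "0000-0002-1825-009"))
# "0000000218250097" NA
```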

---

# ISBN (International Standard Book Number)

**Governing body:** International ISBN Agency  
**Standard:** ISO 2108

## Two Forms

### ISBN-10
- 9 digits + checksum digit
- Check digit may be `X`

Examples (both are structurally valid, but only the first has a correct check digit):
```
0306406152
030640615X
```

### ISBN-13
- 13 digits, the last of which is a check digit
- Begins with the prefix 978 or 979
- The check digit uses the EAN-13 algorithm

Example:
```
9780306406157
```

## Structural Regex

ISBN-10:

```
^\d{9}[\dX]$
```

ISBN-13:

```
^\d{13}$
```
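
The regexes above cover structure only. As a hedged sketch of the next level, the two checksum rules can be written in base R as follows. ISBN-10 uses weights 10 down to 1 modulo 11, and ISBN-13 uses alternating weights 1 and 3 modulo 10. The function names are hypothetical and are not part of `scholid`, and each function expects a single unhyphenated string.

```r
# Hypothetical helpers (not part of scholid); each expects a single string.

# ISBN-10: weights 10..1, "X" counts as 10, weighted sum divisible by 11.
isbn10_checksum_ok <- function(x) {
  if (!grepl("^\\d{9}[\\dX]$", x, perl = TRUE)) return(FALSE)
  values <- match(strsplit(x, "")[[1]], c(0:9, "X")) - 1L
  sum(values * 10:1) %% 11 == 0
}

# ISBN-13 (EAN-13): weights 1, 3, 1, 3, ..., weighted sum divisible by 10.
isbn13_checksum_ok <- function(x) {
  if (!grepl("^\\d{13}$", x, perl = TRUE)) return(FALSE)
  digits <- as.integer(strsplit(x, "")[[1]])
  weights <- rep(c(1L, 3L), length.out = 13)
  sum(digits * weights) %% 10 == 0
}

isbn10_checksum_ok("0306406152")     # TRUE
isbn10_checksum_ok("030640615X")     # FALSE: matches the regex, fails the checksum
isbn13_checksum_ok("9780306406157")  # TRUE
```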

---

# ISSN (International Standard Serial Number)

**Governing body:** ISSN International Centre  
**Standard:** ISO 3297

## Structure

An ISSN has 8 characters, displayed with a hyphen after the fourth:

```
1234-567X
```

### Components

- 7 digits
- 1 checksum digit (0–9 or X)
- Canonical display includes a hyphen after 4 digits

Compact form (without the hyphen):

```
1234567X
```

## Structural Regex

Hyphenated:

```
^\d{4}-\d{3}[\dX]$
```

Compact form:

```
^\d{7}[\dX]$
```
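
For comparison with the structural patterns, here is a minimal sketch of the ISSN check digit rule in base R: the eight characters are weighted 8 down to 1, `X` counts as 10, and the weighted sum must be divisible by 11. The function name `issn_checksum_ok` is hypothetical and is not part of `scholid`.

```r
# Hypothetical helper (not part of scholid); expects a single string in
# hyphenated or compact form.
issn_checksum_ok <- function(x) {
  if (!grepl("^\\d{4}-?\\d{3}[\\dX]$", x, perl = TRUE)) return(FALSE)
  compact <- sub("-", "", x, fixed = TRUE)
  values <- match(strsplit(compact, "")[[1]], c(0:9, "X")) - 1L
  sum(values * 8:1) %% 11 == 0
}

issn_checksum_ok("1234-5679")  # TRUE
issn_checksum_ok("1234-567X")  # FALSE: placeholder is structurally valid only
```

The placeholder `1234-567X` used in this section matches the grammar, but the check character that corresponds to `1234-567` is `9`.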

---

# arXiv Identifier

**Authority:** arXiv (Cornell University)

## Two Formats

### Modern (since April 2007)

```
YYMM.NNNN
YYMM.NNNNN
```

Optional version suffix:

```
YYMM.NNNNv2
```

Components:
- Two-digit year and two-digit month (`YYMM`)
- Dot
- 4-digit sequence number (5 digits from January 2015 onward)
- Optional version `vN`

Structural regex:

```
^\d{4}\.\d{4,5}(v\d+)?$
```
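
To show how the components map onto the pattern, the following base R sketch splits a modern identifier into its parts. The function name `parse_arxiv_modern` and the shape of the returned list are assumptions made for illustration; they are not part of `scholid`.

```r
# Hypothetical helper (not part of scholid); expects a single string.
# Returns NULL when the input does not match the modern pattern.
parse_arxiv_modern <- function(x) {
  if (!grepl("^\\d{4}\\.\\d{4,5}(v\\d+)?$", x, perl = TRUE)) return(NULL)
  rest <- substr(x, 6, nchar(x))  # everything after the dot
  has_version <- grepl("v\\d+$", rest, perl = TRUE)
  list(
    yymm    = substr(x, 1, 4),
    number  = sub("v\\d+$", "", rest, perl = TRUE),
    version = if (has_version) {
      as.integer(sub("^\\d+v", "", rest, perl = TRUE))
    } else {
      NA_integer_
    }
  )
}

parse_arxiv_modern("0704.0001v2")  # yymm "0704", number "0001", version 2
parse_arxiv_modern("1501.00001")   # yymm "1501", number "00001", version NA
```

The sketch does not check that the month is between 01 and 12, since the structural regex does not either.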

---

### Legacy (before April 2007)

```
archive/YYMMNNN
```

Example:
```
hep-th/9901001
```

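Structural regex (this form does not cover legacy identifiers that carry a subject class, such as `math.GT/0309136`):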

```
^[a-z\-]+/\d{7}(v\d+)?$
```
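
A corresponding sketch for the legacy form, again in base R. The name `parse_arxiv_legacy` and the returned fields are hypothetical and are not part of `scholid`; `perl = TRUE` is assumed so that `\-` inside the character class is read as a literal hyphen.

```r
# Hypothetical helper (not part of scholid); expects a single string.
# Returns NULL when the input does not match the legacy pattern.
parse_arxiv_legacy <- function(x) {
  if (!grepl("^[a-z\\-]+/\\d{7}(v\\d+)?$", x, perl = TRUE)) return(NULL)
  parts <- strsplit(x, "/", fixed = TRUE)[[1]]
  id <- parts[2]
  list(
    archive = parts[1],
    yymm    = substr(id, 1, 4),
    number  = substr(id, 5, 7),
    # Characters after position 8 (the "v") hold the version number, if any.
    version = if (nchar(id) > 7) as.integer(substr(id, 9, nchar(id))) else NA_integer_
  )
}

parse_arxiv_legacy("hep-th/9901001")    # archive "hep-th", yymm "9901", number "001"
parse_arxiv_legacy("hep-th/9901001v3")  # same, with version 3
```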

---

# PMID (PubMed Identifier)

**Authority:** U.S. National Library of Medicine

## Structure

- Positive integer, written with decimal digits only
- Variable length
- No checksum

Example:

```
12345678
```

Structural regex:

```
^\d+$
```

---

# PMCID (PubMed Central Identifier)

**Authority:** PubMed Central

## Structure

```
PMC1234567
```

Components:
- Literal prefix `PMC`
- One or more digits

Structural regex:

```
^PMC\d+$
```
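
Because several of these patterns overlap (any all-digit string also matches the PMID pattern), a structural match does not by itself determine the identifier type. The sketch below collects a subset of the regexes from this vignette and reports every type a string could structurally be. The names `patterns` and `candidate_types` are hypothetical and are not part of `scholid`.

```r
# Regexes from this vignette, written as R strings (hyphenated forms only
# for ORCID and ISSN, modern form only for arXiv).
patterns <- c(
  doi    = "^10\\.\\d{4,9}/\\S+$",
  orcid  = "^\\d{4}-\\d{4}-\\d{4}-\\d{3}[\\dX]$",
  isbn10 = "^\\d{9}[\\dX]$",
  isbn13 = "^\\d{13}$",
  issn   = "^\\d{4}-\\d{3}[\\dX]$",
  arxiv  = "^\\d{4}\\.\\d{4,5}(v\\d+)?$",
  pmid   = "^\\d+$",
  pmcid  = "^PMC\\d+$"
)

# Hypothetical helper (not part of scholid): names of all patterns that a
# single string matches.
candidate_types <- function(x) {
  hits <- vapply(patterns, function(p) grepl(p, x, perl = TRUE), logical(1))
  names(patterns)[hits]
}

candidate_types("PMC1234567")     # "pmcid"
candidate_types("12345678")       # "pmid"
candidate_types("9780306406157")  # "isbn13" "pmid"
```

The last call returns two candidates, which is why context or a checksum test is usually needed to disambiguate all-digit identifiers.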