PFIF 1.1 Specification

September 5, 2005
Ka-Ping Yee, Kieran Lal, Jonathan Plax

URL of this specification: http://zesty.ca/pfif/1.1
FAQ, examples, and other information on PFIF: http://zesty.ca/pfif
Editor: Ka-Ping Yee <ping@zesty.ca>

This document is licensed under the GNU Free Documentation License 1.2.

  1. Abstract
  2. Design Principles
  3. Data Life Cycle
  4. Data Model
    1. PERSON records
    2. NOTE records
  5. XML Format Specification
  6. Atom Feed Specifications
    1. Atom Person Feeds
    2. Atom Note Feeds
  7. RSS Feed Specifications
    1. RSS Person Feeds
    2. RSS Note Feeds
  8. Suggested Relational Database Schema
  9. Acknowledgements

1. Abstract

This document defines the People Finder Interchange Format, which encompasses both a data model and an XML-based exchange format for sharing data about people who are missing or displaced by natural or human-made disasters. The data model is first described in a manner independent of implementation style (object-oriented, relational, or XML). Then the PFIF proper is specified by an XML Schema. This document also provides a recommended schema for handling PFIF data in a relational database, though such implementation decisions are ultimately up to application developers.

2. Design Principles

  1. The purpose of PFIF is to bring people and data together. The design aims to promote convergence: convergence of people who seek the same person, convergence of information about a person obtained from various sources, convergence of duplicated data, and ultimately convergence of missing people with their loved ones.
  2. Data is fundamentally divided into two types: data that is fixed and data that changes over time.
  3. Data is permanent. Once records are created, they do not change. Incoming updates are timestamped and added to the pool of knowledge; they do not replace or destroy existing data. This approach improves the resilience of the system and allows applications to freely aggregate data from different sources without the complexity of resolving conflicting changes.
  4. Data should be traceable. Since data comes from sources of unknown reliability and accountability, information on the origins of data should be maintained, to help users ascertain its trustworthiness.
  5. Each record belongs to a home repository, which is the (PFIF or non-PFIF) repository where the record was first entered. The record may be copied to other places, but the home repository remains the authority on the record.
  6. Each aggregator of data has its own perspective on the world. It is not possible to dictate truths about all data from a single central authority.
  7. It should be possible to keep track of multiple records that refer to the same person. But, by the preceding principle, each aggregator makes its own decisions about which records to merge; there is no central authority.
  8. It should be possible to resolve multiple copies of the same record that have been imported via different data paths.
  9. All dates and times must be in UTC, never in a local time zone, because data records will be transmitted among many different time zones. This format uses dates in the RFC 3339 format, with only UTC allowed. Front-ends can convert dates and times to the local time zone for display.
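
As an illustration of the last principle, the following Python sketch converts a time-zone-aware local time to the "yyyy-mm-ddThh:mm:ssZ" form used throughout this specification, and parses it back. The function names to_pfif_date and from_pfif_date are hypothetical and not part of PFIF; rejecting naive datetimes is a choice made for this sketch.

```python
from datetime import datetime, timedelta, timezone

PFIF_FORMAT = "%Y-%m-%dT%H:%M:%SZ"

def to_pfif_date(dt: datetime) -> str:
    """Convert a time-zone-aware datetime to a UTC timestamp string."""
    if dt.tzinfo is None:
        # Assumption of this sketch: refuse to guess the zone of a naive value.
        raise ValueError("naive datetime: the time zone is unknown")
    return dt.astimezone(timezone.utc).strftime(PFIF_FORMAT)

def from_pfif_date(text: str) -> datetime:
    """Parse a "yyyy-mm-ddThh:mm:ssZ" string into an aware UTC datetime."""
    return datetime.strptime(text, PFIF_FORMAT).replace(tzinfo=timezone.utc)

# A time entered in a zone five hours behind UTC is stored as UTC.
local = datetime(2005, 9, 5, 14, 30, 0, tzinfo=timezone(timedelta(hours=-5)))
stamp = to_pfif_date(local)  # "2005-09-05T19:30:00Z"
```

A front-end would call astimezone() on the parsed value only at display time.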

3. Data Life Cycle

Each PFIF repository may contain original records and clone records. Original records are records residing in their home repository; clone records are copies of records that belong to other repositories. Here is a diagram describing the life of a PFIF record as it is created and then travels to other repositories.

                     .---------------------.
                     | 1. real-world facts |
                     '---------------------'
                          |            |
       entered by a human |            | entered by a human
   into a PFIF repository |            | into a non-PFIF repository
                          |            |
 entry_date, source_date, |            |
  source_name, source_url |            |
are set by the repository |            |
                          v            v
.-----------------------------.    .------------------------------.
| 2a. original PFIF record    |    | 2b. original non-PFIF record |
| in record's home repository |    | in record's home repository  |
'-----------------------------'    '------------------------------'
                          |            |
       exported as a PFIF |            | parsed and converted to the PFIF
         document or feed |            | data model by a human or program
                          |            |
                          |            | source_date, source_name, source_url
                          |            | are set by the human or program
                          v            v
                        .----------------.
       .--------------> | 3. PFIF record |
       |                '----------------'
       |                        |
       |                        | loaded into a PFIF repository
       |                        |
       |                        | entry_date is set to date/time of import
       |                        v
       |     .--------------------------------------.
       |     | 4. clone record in a PFIF repository |
       |     '--------------------------------------'
       |                        |
       |                        | exported as a PFIF document or feed
       '------------------------'

PFIF is based on a "post-only" philosophy. After a record has been stored for the first time in PFIF, it is only copied from place to place, not changed. The one exception is the entry_date field, which indicates when a record entered a receiving repository. No other fields change when PFIF records are transferred between repositories, and after a record enters a repository, none of its fields change, not even entry_date.
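
The import step (3 to 4 in the diagram) might be sketched in Python as follows. The PersonRecord class carries only a few of the fields and, like import_record, is a hypothetical name chosen for this example. The frozen dataclass is one way to model the rule that a stored record never changes.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class PersonRecord:
    person_record_id: str
    entry_date: str
    source_date: str
    source_name: str
    first_name: str
    last_name: str

def import_record(incoming: PersonRecord, now_utc: str) -> PersonRecord:
    """Return the clone to store: a copy that differs only in entry_date."""
    return replace(incoming, entry_date=now_utc)

original = PersonRecord(
    person_record_id="example.org/123",
    entry_date="2005-09-04T08:00:00Z",
    source_date="2005-09-04T08:00:00Z",
    source_name="Example Registry",
    first_name="KATE",
    last_name="SMITH",
)
clone = import_record(original, "2005-09-05T19:30:00Z")
```

The clone keeps the person_record_id, source_date, and source_name of the original, so the home repository stays identifiable after any number of hops.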

4. Data Model

There are two types of records: person records hold static information, and note records hold information that changes. Each note record belongs to a particular person, and a person record may have any number of associated note records. Once a record is created, it is never changed. To indicate that data about a particular person has changed, add a timestamped note record associated with that person record.

person records may be created both by those who seek a missing person and by those who have information on a missing person. The person record for a person is the point of convergence for all parties; the note records on that person are the growing pool of shared knowledge.

4.1. person records

A person record contains 17 fields. There may be multiple person records for the same person. In fact, any given application that imports data from multiple sources is likely to have multiple person records for the same person. It is up to the application to associate such records (see the Database Schema section). It is recommended that applications keep copies of all the records, and separately keep track of which records correspond to the same person.

Static Tracking Information About the Record Itself (8 fields)

Meta-information like this is essential because it allows people to trace and ascertain the reliability of the data they are looking at, which was a big problem with survivor databases for September 11.

person_record_id (string):
Unique identifier for this record, which consists of a domain name followed by a slash and a local identifier. The domain name identifies this record's home repository, which is the authority for this record. The format of the local identifier is up to the home repository. When the person_record_id begins with a domain other than the application's own domain, it means this record is a clone of a record from another source.
entry_date (string in the form "yyyy-mm-ddThh:mm:ssZ"):
Date in UTC that this copy of this record was stored. A PFIF repository must guarantee that this value increases monotonically as records are added, so that a client can update a copy of a repository by querying for all records with an entry_date greater than or equal to the entry_date of the last received record.
author_name (string):
The full name of the person who entered this record.
author_email (string):
The preferred contact e-mail address of the person who entered this record.
author_phone (string):
The preferred contact phone number of the person who entered this record.
source_name (string):
The name of the home repository of this record.
source_date (string in the form "yyyy-mm-ddThh:mm:ssZ"):
The date in UTC that the original copy of this record was created in its home repository.
source_url (string):
The URL to this record in its home repository (as specific as possible, down to the URL of the individual record).
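
The structure of person_record_id described above (domain name, slash, local identifier) can be checked with a few lines of Python. This is a minimal sketch: split_record_id and is_clone are hypothetical names, and comparing domain names without regard to case is an assumption of the sketch that this specification does not state.

```python
from typing import Tuple

def split_record_id(record_id: str) -> Tuple[str, str]:
    """Split "domain/local-id" at the first slash."""
    domain, slash, local_id = record_id.partition("/")
    if not slash or not domain or not local_id:
        raise ValueError("expected 'domain/local-id', got %r" % record_id)
    return domain, local_id

def is_clone(record_id: str, own_domain: str) -> bool:
    """True if the record's home repository is some other domain."""
    domain, _ = split_record_id(record_id)
    # Assumption: domain names are compared case-insensitively.
    return domain.lower() != own_domain.lower()
```

Because the split happens at the first slash only, a home repository remains free to use slashes inside its local identifiers.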

Static Identifying Information About a Missing Person (9 fields)

These fields are specifically for identifying the person and should be for data that never changes. These are the fields to search on. Insisting on all capitals and no accents is ugly, but it makes searches more likely to converge on the correct record. The other field is a very crude way to import foreign data, but the formatting guidelines should make it possible to extract the data again if there is a desperate need to do so. For other, free-form text was chosen instead of XML to make it easy for an application to display other directly in the UI.
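
One possible way to bring a name into the "all capitals, no accents" form is sketched below in Python. The function name normalize_name is hypothetical. Collapsing runs of whitespace is an extra step assumed here, not required by this specification, and characters that have no accent-free decomposition are left unchanged.

```python
import unicodedata

def normalize_name(text: str) -> str:
    """Upper-case the text and drop combining accent marks."""
    decomposed = unicodedata.normalize("NFD", text)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    # Assumption of this sketch: collapse repeated whitespace as well.
    return " ".join(without_accents.upper().split())
```

Applying the same function to both stored values and search queries is what makes searches converge on the same record.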

first_name (string, all capitals, no accents):
First name of the person sought or found, optionally followed by a space and any middle names or middle initials.
last_name (string, all capitals, no accents):
Last name of the person sought or found.
home_city (string, city name, all capitals, no accents):
Home city of the person sought or found.
home_state (string, two-letter postal abbreviation):
Home state of the person sought or found.
home_neighborhood (string, all capitals, no accents):
Name of the home neighborhood of the person sought or found.
home_street (string, all capitals, no accents):
Street name (no number) of the home address of the person sought or found.
home_zip (integer):
Zip code of the home address of the person sought or found.
photo_url (string):
URL to an image of an identifying photograph of the person sought or found.
other (large string):
Free-form text containing any other static data fields brought in from other sources. (Non-static data imported from other sources should go into a note record.) Short fields should be on a single line with the field name, a colon, and the field value. Long fields can be given as a line with the field name and a colon, then text indented on the following lines. When a record is converted from some other form to PFIF by a machine process, the field "automated-pfif-author" should be present and should name the program that produced the PFIF. The "automated-pfif-author" field is not added when records are exported from a PFIF repository. A description of the person in free-form text can also go here, with the field name "description". For example, a program that scrapes a record from a non-PFIF format that includes a free-form text field might produce an other field like this:
description:
    Dark hair, in her late thirties.
    Also goes by the names "Kate" or "Katie".
automated-pfif-author: ScrapeMatic 0.5
Field names for data fields imported from other applications should begin with the domain name and a slash. For example, if a birthdate is imported from an ICRC record, it might look like this:
icrc.org/birthdate: 1976-02-26
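
The formatting guidelines for the other field are meant to make extraction possible. A minimal Python sketch of such an extraction follows; parse_other is a hypothetical name, and the handling of blank lines and of lines without a colon (both skipped) is an assumption, since this specification does not define it.

```python
from typing import Dict, Optional

def parse_other(text: str) -> Dict[str, str]:
    """Extract "name: value" fields and indented long fields."""
    fields: Dict[str, str] = {}
    current: Optional[str] = None
    for line in text.splitlines():
        if not line.strip():
            continue  # assumption: blank lines carry no data
        if line[0] in " \t":
            # Indented line: continuation of the most recent field.
            if current is not None:
                previous = fields[current]
                fields[current] = (previous + "\n" if previous else "") + line.strip()
            continue
        name, colon, value = line.partition(":")
        if not colon:
            current = None  # assumption: skip lines that are not fields
            continue
        current = name.strip()
        fields[current] = value.strip()
    return fields

example = """description:
    Dark hair, in her late thirties.
    Also goes by the names "Kate" or "Katie".
automated-pfif-author: ScrapeMatic 0.5
icrc.org/birthdate: 1976-02-26
"""
parsed = parse_other(example)
```

Splitting at the first colon only means that values containing colons, such as URLs, stay intact.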

4.2. note records

Each note record belongs to exactly one person record. There may be any number of note records associated with a particular person record. (See below for implementation notes. A database might implement this by including a foreign key, person_record_id, that refers to the person record. An object-oriented representation might implement this by embedding a list of note objects within the person object.)

Not being able to remove or update records was a huge problem with September 11 survivor databases. note records resolve this problem while avoiding the problem of synchronizing conflicting changes. Every note has a timestamp and information on the author of the note. Applications can use the timestamp to determine the most recent value of a given field. Users can use the author information to ascertain the reliability of a given field.

Information About a Missing Person That Changes Over Time (11 fields)

The found, email_of_found_person, phone_of_found_person and last_known_location fields store data that changes over time. When these fields are present in a note record, the record is specifying new values for these fields, and the source_date field indicates the date that the new values took effect. So, for example, an application that wants to display the most recent known location can look for the note with the latest source_date that has a non-empty last_known_location field.
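
The lookup described in the preceding paragraph might be sketched in Python as follows, with each note represented as a plain dictionary. The name latest_value is hypothetical. The sketch compares source_date values as strings, which orders them chronologically only because every timestamp has the same fixed-width UTC form.

```python
from typing import Iterable, Mapping, Optional

def latest_value(notes: Iterable[Mapping[str, str]], field: str) -> Optional[str]:
    """Return the field's value from the note with the latest source_date."""
    best = None
    for note in notes:
        if not note.get(field):
            continue  # this note does not specify a value for the field
        if best is None or note["source_date"] > best["source_date"]:
            best = note
    return best[field] if best is not None else None

notes = [
    {"source_date": "2005-09-03T12:00:00Z",
     "last_known_location": "Superdome, New Orleans, LA"},
    {"source_date": "2005-09-05T09:00:00Z", "found": "true"},
    {"source_date": "2005-09-04T18:30:00Z",
     "last_known_location": "Astrodome, Houston, TX"},
]
location = latest_value(notes, "last_known_location")
```

Note that the most recent note overall (the one setting found) is ignored when looking up last_known_location, because it leaves that field empty.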

note_record_id (string):
Unique identifier for this record, which consists of a domain name followed by a slash and a local identifier. The domain name identifies this record's home repository, which is the authority for this record. The format of the local identifier is up to the home repository. When the note_record_id begins with a domain other than the application's own domain, it means this record is a clone of a record from another source.
entry_date (string in the form "yyyy-mm-ddThh:mm:ssZ"):
Date in UTC that this copy of this record was stored. A PFIF repository must guarantee that this value increases monotonically as records are added, so that a client can update a copy of a repository by querying for all records with an entry_date greater than or equal to the entry_date of the last received record.
author_name (string):
The full name of the person who entered this note.
author_email (string):
The preferred contact e-mail address of the person who entered this note.
author_phone (string):
The preferred contact phone number of the person who entered this note.
source_date (string in the form "yyyy-mm-ddThh:mm:ssZ"):
The date in UTC that the original copy of this note was created in its home repository. In most cases, notes should be sorted by this field for display.
found (boolean string, "true" or "false"):
This value is "true" if the missing person has been personally contacted or seen, or "false" otherwise. The text field of this note should describe HOW and WHEN the person was contacted or seen.
email_of_found_person (string):
The preferred contact e-mail address of the FOUND person. This field is present ONLY if the person has been FOUND. The text field of this note should describe HOW the person's contact information was determined.
phone_of_found_person (string):
The preferred contact phone number of the FOUND person. This field is present ONLY if the person has been FOUND. The text field of this note should describe HOW the person's contact information was determined.
last_known_location (string):
A free-form description of the last known location of the person being sought, including the city, state, and as much detail as possible. The text field of this note should describe HOW the person's location was determined.
text (large string):
Free-form text description of the person's current condition, situation and location details, where they were last seen, corrections to other information, etc.
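
The entry_date guarantee given for both record types supports incremental updates. The Python sketch below shows one way a client might use it. Here fetch_since stands in for whatever query interface a repository offers (this specification does not define one), and sync and the id_field parameter are likewise hypothetical. Because the query uses "greater than or equal to", the client receives some records twice and has to skip the ones it already holds.

```python
from typing import Dict, List, Optional

Record = Dict[str, str]

def fetch_since(repository: List[Record], last_entry_date: Optional[str]) -> List[Record]:
    """Stand-in for a repository query: entry_date >= last_entry_date."""
    return [
        record for record in repository
        if last_entry_date is None or record["entry_date"] >= last_entry_date
    ]

def sync(local: Dict[str, Record], repository: List[Record],
         last_entry_date: Optional[str],
         id_field: str = "person_record_id") -> Optional[str]:
    """Copy unseen records into local; return the new high-water mark."""
    newest = last_entry_date
    for record in fetch_since(repository, last_entry_date):
        # Records are never replaced: keep the copy already stored.
        local.setdefault(record[id_field], record)
        if newest is None or record["entry_date"] > newest:
            newest = record["entry_date"]
    return newest
```

Using "greater than or equal to" rather than "greater than" means a record stored in the same second as the last received one is not missed.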

5. XML Format Specification

The XML Namespace for PFIF is: http://zesty.ca/pfif/1.1

The XML Schema for PFIF is located at:

The MIME type for a PFIF document is:

The XML Schema is a straightforward translation of the data model into two complex types, Person and Note, with sub-elements corresponding to the fields of the record. A valid PFIF document consists of a single pfif element containing one or more person elements, each of which contains zero or more note elements.

The entry_date and source_date fields have the XML Schema datatype dateTime. The source_url and photo_url fields have the datatype anyURI. The found field has the datatype boolean. The home_zip field has the datatype integer. All other fields have the datatype string.

In a person element, the fields person_record_id, first_name, and last_name are mandatory. All other fields are optional.

In a note element, the fields note_record_id, author_name, source_date, and text are mandatory. All other fields are optional.

In a PFIF XML document, the fields must occur in the order they are listed in the Data Model section of this specification.
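
The mandatory-field and ordering rules above can be applied when generating a document. The following Python sketch uses the standard xml.etree.ElementTree module. It takes the namespace URI from the feed examples in this specification, and it makes two assumptions that the schema itself should confirm: field elements are namespace-qualified, and note elements follow the person's own fields. The name build_pfif and the "notes" dictionary key are hypothetical, and all values are expected to be strings already in their XML form (for example "true", not a Python boolean).

```python
import xml.etree.ElementTree as ET

PFIF_NS = "http://zesty.ca/pfif/1.1"
PERSON_FIELDS = [
    "person_record_id", "entry_date", "author_name", "author_email",
    "author_phone", "source_name", "source_date", "source_url",
    "first_name", "last_name", "home_city", "home_state",
    "home_neighborhood", "home_street", "home_zip", "photo_url", "other",
]
NOTE_FIELDS = [
    "note_record_id", "entry_date", "author_name", "author_email",
    "author_phone", "source_date", "found", "email_of_found_person",
    "phone_of_found_person", "last_known_location", "text",
]
PERSON_REQUIRED = {"person_record_id", "first_name", "last_name"}
NOTE_REQUIRED = {"note_record_id", "author_name", "source_date", "text"}

def _tag(name):
    return "{%s}%s" % (PFIF_NS, name)

def _add_fields(parent, record, order, required):
    missing = required - set(record)
    if missing:
        raise ValueError("missing mandatory fields: %s" % sorted(missing))
    for name in order:  # emit fields in Data Model order
        if name in record:
            ET.SubElement(parent, _tag(name)).text = record[name]

def build_pfif(persons):
    """Build a pfif element from a list of person dictionaries."""
    ET.register_namespace("pfif", PFIF_NS)
    root = ET.Element(_tag("pfif"))
    for person in persons:
        person_el = ET.SubElement(root, _tag("person"))
        _add_fields(person_el, person, PERSON_FIELDS, PERSON_REQUIRED)
        for note in person.get("notes", []):  # "notes" key is hypothetical
            note_el = ET.SubElement(person_el, _tag("note"))
            _add_fields(note_el, note, NOTE_FIELDS, NOTE_REQUIRED)
    return root
```

Calling ET.tostring(build_pfif(persons), encoding="unicode") then yields the serialized document.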

6. Atom Feed Specifications

PFIF XML documents can be embedded into Atom 1.0 feeds. The PFIF document should be embedded using an XML namespace and inserted as an immediate child of the entry element.

Atom 1.0 defines a top-level feed element that contains any number of entry elements. The top-level element should declare the PFIF namespace. The recommended prefix is pfif, so the top-level element should look like this:

<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:pfif="http://zesty.ca/pfif/1.1">
...
</feed>

The rest of this section offers recommendations on how applications should populate the standard Atom elements so that the feed will make sense to existing feed-reading software. Nonetheless, the embedded PFIF document takes precedence over any redundant information that appears in Atom elements.

Two kinds of PFIF Atom feeds are defined here: person feeds in which each item is a person, and note feeds in which each item is a note. A person feed is roughly analogous to a blog feed containing blog entries; a note feed is roughly analogous to a comment feed on a particular blog entry. For example, one application might subscribe to a person feed in order to aggregate missing person records from other databases; another application might subscribe to a note feed in order to display a stream of notes with updates about a particular person.

6.1. Atom Person Feeds

An Atom person feed provides at least the following elements within the feed element:

id
This element should contain a unique URI associated with this feed. This might be the URL to the website that corresponds to the database or service providing this feed.
title
This element should contain the name of this feed. This should include the title of the database or service providing this feed.
subtitle
This element should contain a phrase or sentence describing this feed. This would be the place to explain how this feed is produced, for example: "Scraped daily by FooMatic 2.3 from http://www.familylinks.icrc.org/".
updated
This element should contain the date and time in UTC that this feed was last updated, given in "yyyy-mm-ddThh:mm:ssZ" format.
link
This element should contain a URL from which this feed can be retrieved. This element should have a rel attribute whose value is self.

An Atom person feed provides at least the following elements within each entry element:

pfif:pfif
This element is the top-level pfif element of the PFIF document. It contains a single pfif:person element, which contains elements for the fields of the person record and may contain zero or more pfif:note elements. A service wishing to provide a complete export would include all the note records associated with the person here.
id
This element should contain a URI string consisting of the scheme "pfif:" followed by the value of the person_record_id field.
title
This element should contain the value of the first_name field, followed by a space and the value of the last_name field in the person record.
author
This element should contain a name element containing the value of the author_name field and an email element containing the value of the author_email field in the person record.
updated
This element should contain the value of the source_date field in the person record.
content
This element should contain a human-readable HTML formatting of the information in the person record. It is up to the application to decide how to format the content.
source
This element should contain a copy of the title element of this feed. This element may also contain copies of any other child elements of the feed element.
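
The mapping from person fields to the standard Atom elements of an entry might look like the following Python sketch. It leaves out the embedded pfif:pfif element and the content element, and it assumes that the record carries the optional author_name, author_email, and source_date fields. The function name person_entry is hypothetical.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"

def _atom(name):
    return "{%s}%s" % (ATOM_NS, name)

def person_entry(person, feed_title):
    """Build the standard Atom elements of an entry for one person record."""
    entry = ET.Element(_atom("entry"))
    ET.SubElement(entry, _atom("id")).text = "pfif:" + person["person_record_id"]
    ET.SubElement(entry, _atom("title")).text = (
        person["first_name"] + " " + person["last_name"]
    )
    author = ET.SubElement(entry, _atom("author"))
    ET.SubElement(author, _atom("name")).text = person["author_name"]
    ET.SubElement(author, _atom("email")).text = person["author_email"]
    ET.SubElement(entry, _atom("updated")).text = person["source_date"]
    # The source element carries a copy of the feed's title.
    source = ET.SubElement(entry, _atom("source"))
    ET.SubElement(source, _atom("title")).text = feed_title
    return entry
```

Since the embedded PFIF document takes precedence, these elements only need to be good enough for ordinary feed readers.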

6.2. Atom Note Feeds

An Atom note feed provides at least the following elements within the feed element:

id
This element should contain a unique URI associated with this feed. This might be the URL to the website that corresponds to the database or service providing this feed.
title
This element should contain the name of this feed. This should include the title of the database or service providing this feed, followed by a more specific title that describes how the notes were selected from the database or service. For example, for a note feed about a particular person, the title could be the title of the service followed by the first name and last name of the person in question.
subtitle
This element should contain a phrase or sentence describing this feed. This would be the place to explain how this feed is produced, for example: "Exported by CiviCRM 1.1, http://www.example.org/."
updated
This element should contain the date and time in UTC that this feed was last updated, given in "yyyy-mm-ddThh:mm:ssZ" format.
link
This element should contain a URL from which this feed can be retrieved. This element should have a rel attribute whose value is self.

An Atom note feed provides at least the following elements within each entry element:

pfif:pfif
This element is the top-level pfif element of the PFIF document. It contains a single pfif:person element. In a note feed, the person element would only need to contain the mandatory fields person_record_id, first_name, and last_name, and a single pfif:note element containing the note.
id
This element should contain a URI string consisting of the scheme "pfif:" followed by the value of the note_record_id field.
title
This element should contain an excerpt of the text field.
author
This element should contain a name element containing the value of the author_name field and an email element containing the value of the author_email field in the note record.
updated
This element should contain the value of the source_date field in the note record.
content
This element should contain an HTML formatting of the text field in the note record. It is up to the application to decide how to format the content.

7. RSS Feed Specifications

PFIF XML documents can be embedded into RSS 2.0 feeds. (In RSS 2.0 terminology, this section defines an RSS 2.0 module.) The PFIF document should be specified using an XML namespace and embedded as an immediate child of the item element.

RSS 2.0 defines two main elements, channel and item, that are enclosed in a top-level rss element. The top-level element should declare the PFIF namespace. The recommended prefix is pfif, so the top-level element should look like this:

<rss version="2.0" xmlns:pfif="http://zesty.ca/pfif/1.1">
...
</rss>

The rest of this section offers recommendations on how applications should populate the standard RSS elements so that the feed will make sense to existing feed-reading software. Nonetheless, the embedded PFIF document takes precedence over any redundant information that appears in RSS elements.

As in the preceding section, two kinds of PFIF RSS feeds are defined here: person feeds in which each item is a person, and note feeds in which each item is a note.

7.1. RSS Person Feeds

An RSS person feed provides at least the following elements within the channel element:

title
This element should contain the name of this feed, which should include the title of the database or service providing this feed.
description
This element should contain a phrase or sentence describing this feed. This is the place to explain how this feed is produced, for example: "Scraped daily by FooMatic 2.3 from http://www.familylinks.icrc.org/".
lastBuildDate
This element should contain the date and time in UTC that this feed was last updated, given in RFC 822 date format, for example: "Sat, 07 Sep 2002 00:00:01 GMT".
link
This element should contain a URL to the website that corresponds to the database or service providing this feed.

An RSS person feed provides at least the following elements within each item element:

pfif:pfif
This element is the top-level pfif element of the PFIF document. It contains a single pfif:person element, which contains elements for the fields of the person record and may contain zero or more pfif:note elements. A service wishing to provide a complete export would include all the note records associated with the person here.
guid
This element should contain the value of the person_record_id field.
title
This element should contain the value of the first_name field, followed by a space and the value of the last_name field.
author
This element should contain the value of the author_email field, followed by a space and the value of the author_name field enclosed in parentheses.
pubDate
This element should contain the date in the source_date field in the person record, converted to RFC 822 date format, for example: "Sat, 07 Sep 2002 00:00:01 GMT". The timezone MUST be GMT and the year MUST have four digits.
description
This element should contain a human-readable HTML formatting of the information in the person record. It is up to the application to decide how to format the description.
source
This element should contain the value of the source_name field.
link
This element should contain the value of the source_url field.
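
The pubDate and author conversions above can be done with the Python standard library, as in this sketch. The function names rss_pub_date and rss_author are hypothetical. Passing usegmt=True to email.utils.format_datetime produces the literal "GMT" zone and a four-digit year.

```python
from datetime import datetime, timezone
from email.utils import format_datetime

def rss_pub_date(source_date: str) -> str:
    """Convert "yyyy-mm-ddThh:mm:ssZ" to an RFC 822 style date in GMT."""
    parsed = datetime.strptime(source_date, "%Y-%m-%dT%H:%M:%SZ")
    return format_datetime(parsed.replace(tzinfo=timezone.utc), usegmt=True)

def rss_author(author_email: str, author_name: str) -> str:
    """Format the author element: e-mail address, space, name in parentheses."""
    return "%s (%s)" % (author_email, author_name)

pub_date = rss_pub_date("2002-09-07T00:00:01Z")
# pub_date == "Sat, 07 Sep 2002 00:00:01 GMT"
```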

7.2. RSS Note Feeds

An RSS note feed provides at least the following elements within the channel element:

title
This element should contain the name of this feed. This should include the title of the database or service providing this feed, followed by a more specific title that describes how the notes were selected from the database or service. For example, for a note feed about a particular person, the title could be the title of the service followed by the first name and last name of the person in question.
description
This element should contain a phrase or sentence describing the feed. This is the place to explain how the feed is produced, for example: "Scraped daily by FooMatic 2.3 from http://www.familylinks.icrc.org/".
lastBuildDate
This element should contain the date and time in UTC that this feed was last updated, given in RFC 822 date format, for example: "Sat, 07 Sep 2002 00:00:01 GMT".
link
This element should contain a URL to the website that corresponds to the database or service providing this feed. For a note feed about a particular person, this link could point to the web page for that person's record.

An RSS note feed provides at least the following elements within each item element:

pfif:pfif
This element is the top-level pfif element of the PFIF document. It contains a single pfif:person element. In a note feed, the person element would only need to contain the mandatory fields person_record_id, first_name, and last_name, and a single pfif:note element containing the note.
guid
This element should contain the value of the note_record_id field.
author
This element should contain the value of the author_email field, followed by a space and the value of the author_name field enclosed in parentheses.
pubDate
This element should contain the date in the source_date field in the note record, converted to RFC 822 date format, for example: "Sat, 07 Sep 2002 00:00:01 GMT". The timezone MUST be GMT and the year MUST have four digits.
description
This element should contain an HTML formatting of the text field in the note record. It is up to the application to decide how to format the description.

8. Suggested Relational Database Schema

This section suggests a possible relational database schema for storing PFIF data. The exact details of a database design are up to each application; this is one possible starting point.

A relational database could store PFIF records in two tables, person and note, for the two types of records. Rows would only be added to these tables; rows would never be modified or removed. To record the fact that data is changed, a timestamped row is added to the note table.

The data model does not specify how the application should link each note record to a person record, or how the application should indicate that multiple person records refer to the same person. The following is a suggested database schema that addresses these two issues.

PERSON table:
     string      person_record_id           primary key
     datetime    entry_date
     string      author_name
     string      author_email
     string      author_phone
     string      source_name
     datetime    source_date
     string      source_url
     string      first_name
     string      last_name
     string      home_city
     string      home_state
     string      home_neighborhood
     string      home_street
     integer     home_zip
     string      photo_url
     text        other

NOTE table:
     string      note_record_id             primary key
     string      person_record_id           foreign key not null
     string      linked_person_record_id    foreign key or null
     datetime    entry_date
     string      author_name
     string      author_email
     string      author_phone
     datetime    source_date
     boolean     found
     string      email_of_found_person
     string      phone_of_found_person
     string      last_known_location
     text        text

This suggested schema defines the person table to match the person record in the PFIF data model exactly. The note table contains the fields of the note record in the data model plus two extra fields. The first extra field, person_record_id, links each note to a person in the person table. The second extra field, linked_person_record_id, allows the application to indicate that another person record refers to the same person.

To link a foreign person record with a local person record, the application adds a note associated with the local person record, with a linked_person_record_id field containing the person_record_id of the foreign record. The other fields of the note describe the circumstances of the decision to merge: source_date indicates the date of the decision, text gives the reason for the decision, and author_name names the person, program, or other entity that made the decision. This specification does not dictate how an application would decide whether to merge two records; a merge could be initiated by a human operator or by a software algorithm that looks for records with similar data. Recording the merge decision in a note record makes it possible to back out of a bad merge decision, and recording the name of the person or program in the author_name field makes it possible to track down the cause of an incorrect merge.

When displaying a person record, the application can then look for all the non-empty linked_person_record_id fields among the notes that belong to that person record, and display all the linked records or a merged view of the linked records.
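
As a rough illustration, the suggested schema and the lookup of linked records could be realized with Python's built-in sqlite3 module as shown below. The column types are SQLite approximations of the types listed above (dates are kept as "yyyy-mm-ddThh:mm:ssZ" text), and linked_person_ids is a hypothetical helper, not part of this specification.

```python
import sqlite3

SCHEMA = """
CREATE TABLE person (
    person_record_id  TEXT PRIMARY KEY,
    entry_date TEXT, author_name TEXT, author_email TEXT, author_phone TEXT,
    source_name TEXT, source_date TEXT, source_url TEXT,
    first_name TEXT, last_name TEXT, home_city TEXT, home_state TEXT,
    home_neighborhood TEXT, home_street TEXT, home_zip INTEGER,
    photo_url TEXT, other TEXT
);
CREATE TABLE note (
    note_record_id          TEXT PRIMARY KEY,
    person_record_id        TEXT NOT NULL REFERENCES person(person_record_id),
    linked_person_record_id TEXT REFERENCES person(person_record_id),
    entry_date TEXT, author_name TEXT, author_email TEXT, author_phone TEXT,
    source_date TEXT, found INTEGER, email_of_found_person TEXT,
    phone_of_found_person TEXT, last_known_location TEXT, text TEXT
);
"""

def linked_person_ids(conn, person_record_id):
    """Return the records linked to a person, oldest merge decision first."""
    rows = conn.execute(
        "SELECT linked_person_record_id FROM note "
        "WHERE person_record_id = ? "
        "AND linked_person_record_id IS NOT NULL "
        "AND linked_person_record_id != '' "
        "ORDER BY source_date",
        (person_record_id,),
    )
    return [row[0] for row in rows]

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
for record_id in ("local.example/1", "other.example/77"):
    conn.execute(
        "INSERT INTO person (person_record_id, first_name, last_name) "
        "VALUES (?, 'KATE', 'SMITH')",
        (record_id,),
    )
# Merging is recorded by adding a note; no existing row is modified.
conn.execute(
    "INSERT INTO note (note_record_id, person_record_id, "
    "linked_person_record_id, author_name, source_date, text) "
    "VALUES (?, ?, ?, ?, ?, ?)",
    ("local.example/n1", "local.example/1", "other.example/77",
     "MergeBot 0.1", "2005-09-05T19:30:00Z", "Same name and home city."),
)
linked = linked_person_ids(conn, "local.example/1")
```

Backing out of the merge would likewise be a matter of interpretation by the application, for example by adding a later note, rather than deleting the row.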

9. Acknowledgements

Luke Blanshard, Tony Chang, Josh Kleinpeter, Kieran Lal, Jonathan Plax, Gabe Wachob, and Ka-Ping Yee contributed to the design of PFIF. The initial data model on which this specification is based is due to the CiviCRM team, David Geilhufe, and Kieran Lal. Jonathan Plax drafted the initial XML Schema document for PFIF.