Complete User Guide (Draft 4.1.0)

This document is an incomplete draft, and is under active revision. Please check back for updates. The Cypher Developers mailing list provides a forum for discussion and questions.

This file is also available in PDF form.

Table of Contents

1 Introduction

1.1 What is Cypher?

1.2 Fun and Easy

2 Installation

2.1 Requirements

2.2 Set up

3 Getting Started

3.1 Startup Properties

3.2 Output and Reports

3.3 Running the Web Service

3.4 API

4 Hello World Example

4.1 The Dataset

4.1.1 Pattern files and index.xml

4.1.2 Con Element (The Word/Phrase List)

4.1.3 Pattern Element (The Grammar)

4.1.4 Container Element

4.1.5 mso:Sense Class (The Vocabulary)

4.1.6 Frame Semantics (The Dictionary)

5 Additional Topics

5.1 Paraphrases

6 FAQ

Introduction

What is Cypher?

Welcome to Cypher! Cypher is one of the first software programs available that generates the RDF graph and SeRQL query representations of a natural language input. The Cypher framework provides a set of robust definition languages, which can be used to extend and create grammars and lexicons. The Cypher specifications are designed to allow a novice to quickly and easily build transcoders for processing highly complex sentences and phrases of any natural language, and to cover any vocabulary.

Cypher is a 100% Pure Java application and runs on any machine that supports Java. The specification languages used to describe the grammar, lexicon, and framenet are all XML.

Example Input/Output using English Language Package

FOAF Triples

Input 1: The Semantic Web is one of Chris Walker's hobby interests
Input Type: English Declarative Clause
Abbreviated Output:

mso:CWalker foaf:interest mso:SemanticWeb

Output Type: .rdf File
Dataset: Download (coming soon)

Input 2: Chris Walker's interest
Input Type: Relational Noun Phrase
Abbreviated Output:

select distinct node0 from
{mso:CWalker} foaf:interest {node0}

Output Type: .serql File
Dataset: Download (coming soon)
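
How Inputs 1 and 2 relate can be illustrated with a small Python sketch. It is not part of Cypher and does not use Sesame: it assumes the triple from Input 1 sits in an in-memory set and treats the Input 2 query as a plain subject/predicate pattern. The names TRIPLES and match are hypothetical.

```python
# Toy illustration only: a set of (subject, predicate, object) tuples
# stands in for an RDF repository.
TRIPLES = {
    ("mso:CWalker", "foaf:interest", "mso:SemanticWeb"),  # output of Input 1
}


def match(triples, subject, predicate):
    """Return the distinct objects of triples with the given subject and predicate."""
    return sorted({o for s, p, o in triples if s == subject and p == predicate})


# Mirrors: select distinct node0 from {mso:CWalker} foaf:interest {node0}
print(match(TRIPLES, "mso:CWalker", "foaf:interest"))  # prints ['mso:SemanticWeb']
```

In a real deployment the .serql output would be evaluated by an RDF repository rather than by code like this.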


Frame Triples

Input 3: The Terminal stars Tom Hanks and Catherine Zeta-Jones
Input Type: English Declarative Clause
Abbreviated Output:

mso:frame0 rdf:type mso:MovieRoleFrame
mso:frame0 mso:actor mso:TJeffreyHanks-Actor
mso:frame0 mso:actor mso:CZetaJones-Actress
mso:frame0 mso:movie mso:TheTerminal-Movie

Output Type: .rdf File
Dataset: Download (coming soon)

Input 4: Actresses who played in movies with Tom Hanks
Input Type: English Anaphora Noun Phrase
Abbreviated Output:

select distinct node0 from
{mso:frame0} rdf:type {mso:MovieRoleFrame}
{mso:frame0} mso:actor {mso:TomHanks-Actor}
{mso:frame0} mso:actor {node0}
{mso:frame0} mso:movie {node1}
{node0} rdf:type {mso:Actress}
{node1} rdf:type {mso:Movie}

Output Type: .serql File
Dataset: Download (coming soon)

Fun and Easy

Only three steps are required to produce the above output: 1) describe the phrase structure of each sentence, 2) enter the words in Cypher's lexicon (dictionary), and 3) create the RDF schema (if needed).

Cypher is designed to allow developers to build on the work of others. All Cypher data constructs are pluggable and reusable objects. Once you've described a PresentTenseVerbPhrase, it can be used by you or by others to create more complex phrases. A third-party lexicon can be added to your existing Cypher installation to instantly add new words to its vocabulary without any further modification to Cypher's grammar. Cypher's vocabulary affects the amount and quality of the semantic output.

Programming with Cypher is fun, and because you need to learn only three XML elements to write a grammar (<con/>, <pattern/>, and <mso:sense/>), Cypher has a smooth learning curve.

This technology is the result of over five years of research and development. The work has turned out to be, in our opinion, a novel and extremely elegant yet powerful approach to a very old problem in computer science. Our vision for Cypher is to create an open framework to facilitate the development of datasets which cover the majority of phrase patterns and vocabularies for the majority of languages; this is a gargantuan but finite undertaking, and it all begins with you. So without further ado, we present Cypher. Enjoy.

Installation

Requirements

Set up

Getting Started

Cypher conforms to a process called Transcography. The main focus of Transcography is to provide a minimalist framework for NLP, i.e. it is averse to first-order logic, predicate calculus, neural nets, etc. It produces quality output by resolving ambiguity at multiple levels of processing, and by allowing the interaction of processes to improve parse judgments. Transcography organizes and modulates the tasks in NLP in such a way as to “corner the beast” of ambiguity almost solely into a lexicon which is based on frame semantics. Once the phrase grammar has been produced for the input, five main techniques are used to produce the semantic output: Symbolic Reference, Node Expansion, Subcategorization, Identity Transfer, and Inference.

Symbolic Reference – Each constituent of a phrase must resolve to a concept, referenced either by description (e.g. the blue bird) or by unique identifier (e.g. Henry Ford). Cypher, therefore, produces either a URI or a BNode which represents the phrase, plus a set of triples representing the description given by the phrase.

Node Expansion – The set of triples produced by each child node of a phrase is included in the parent phrase’s output.

Subcategorization – Transcography conforms to the theory that verbs and other units of language subcategorize for their arguments, and that this information is specified in the lexicon. Cypher accommodates this with the mso:Sense class’ properties: mso:allows, mso:restricts, mso:requires.

Identity Transfer – The human language processor produces semantic output by consulting a dictionary and retrieving an entry for each word encountered in the input. The entry contains the description of an anonymous entity, and this description is transferred to the instance concept. Cypher accommodates this with the mso:Sense class’ property mso:ref.

Inference – Each phrase and clause in natural language expresses information not explicit in the phrase. The human language processor makes use of a dictionary which provides a semantic map, linking the explicit description provided by the phrase to implicit descriptions inferred from the phrase. Cypher accommodates this by using Frame Semantics.
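
The first two techniques can be pictured with a short Python sketch. It is an illustration under stated assumptions, not Cypher's implementation: the PhraseNode class, the new_bnode helper, the example sentence "Henry Ford owns the blue bird", and the terms mso:owns, mso:Bird, mso:color and mso:Blue are all hypothetical.

```python
from dataclasses import dataclass, field
from itertools import count

_bnode_ids = count()


def new_bnode():
    """Mint a fresh blank-node label such as '_:b0' (hypothetical naming scheme)."""
    return f"_:b{next(_bnode_ids)}"


@dataclass
class PhraseNode:
    ref: str                                       # URI or BNode standing for the phrase
    triples: list = field(default_factory=list)    # description given by this phrase
    children: list = field(default_factory=list)   # child phrase nodes

    def expand(self):
        """Node Expansion: own triples followed by the triples of every child."""
        out = list(self.triples)
        for child in self.children:
            out.extend(child.expand())
        return out


# Symbolic Reference by unique identifier: resolves to a URI, no description needed.
henry = PhraseNode(ref="mso:HenryFord")

# Symbolic Reference by description: resolves to a BNode plus describing triples.
bird_ref = new_bnode()
bird = PhraseNode(
    ref=bird_ref,
    triples=[(bird_ref, "rdf:type", "mso:Bird"), (bird_ref, "mso:color", "mso:Blue")],
)

# The parent clause contributes its own triple and inherits those of its children.
clause = PhraseNode(
    ref=new_bnode(),
    triples=[(henry.ref, "mso:owns", bird.ref)],
    children=[henry, bird],
)

for triple in clause.expand():
    print(triple)
```

The recursion in expand is the point of the sketch: whatever a child phrase produces is carried up into the output of the phrase that contains it.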

Startup Properties

The program startup properties are located in <cypher home>/data/configuration/config.ini. Following is a description of each property:

 

Property: rdf.repository.*
Description: Properties of this pattern specify a configuration string for a Sesame Repository. This allows the use of SAIL and Repository implementations from third parties. The syntax is:

        <classname>,<args>…

where <classname> is the fully qualified name of a class in the classpath that implements the org.openrdf.repository.Repository interface. The .jar file containing this class should be placed in the <cypher home>/lib directory. <args> is a list of Strings which will be passed (in the order in which they are listed) to the constructor of the Repository class.
Value: string
Required: N/A

Property: rdf.repository.output
Description: The repository used to store semantic output
Value: string
Required: YES

Property: rdf.repository.lexicon
Description: The repository used to store the lexical data
Value: string
Required: YES

Property: rdf.repository.dictonary
Description: The repository used to store the framenet data
Value: string
Required: YES

Property: rdf.repository.internal
Description: The repository used for internal processing
Value: string
Required: YES

Property: cypher.language
Description: The directory under <cypher home>/data/language containing Cypher’s language definition files
Value: string
Required: YES

Property: cypher.input.files
Description: Whether the files in cypher.input.dir will be loaded as input
Value: boolean
Required: YES

Property: cypher.input.dir
Description: The absolute path to the directory containing input
Value: path string
Required: YES

Property: cypher.output.dir
Description: The absolute path to the directory to which the output is written. If the Cypher web service is running, then set the cypher.output.dir to the <cypher-ws home>/out directory, where <cypher-ws home> is the location where the cypher-ws.war file was deflated.
Value: path string
Required: YES

Property: cypher.output.format
Description: The RDF format for output. Options are rdfxml, turtle, n3, ntriples, trix, trig
Value: string
Required: NO

Property: cypher.output.commit
Description: Whether to commit RDF output to the rdf.repository.output repository
Value: boolean
Required: NO

Property: cypher.output.entailments
Description: Whether to output the frame entailments
Value: boolean
Required: NO

Property: cypher.report
Description: Whether to write reports of the transcoding. Generally, the more reports that are turned on, the greater the burden on processing due to I/O. This option, and all those below it, are false by default.
Value: boolean
Required: NO

Property: cypher.report.debug
Description: Whether to output the pattern matching debug info
Value: boolean
Required: NO

Property: cypher.report.grammar
Description: Whether to output the grammar parse info
Value: boolean
Required: NO

Property: cypher.report.rdf
Description: Whether to output any RDF statements that result from transcoding
Value: boolean
Required: NO

Property: cypher.report.sparql
Description: Whether to output the SPARQL form of the query resulting from transcoding
Value: boolean
Required: NO

Property: cypher.report.serql
Description: Whether to output the SeRQL form of the query resulting from transcoding
Value: boolean
Required: NO

Property: cypher.report.failed
Description: Whether to output the lists of inputs which failed to transcode
Value: boolean
Required: NO

Property: cypher.report.lexical
Description: Whether to output the list of words which failed lexical lookup
Value: boolean
Required: NO

Property: cypher.report.query-rdf
Description: Whether to output the RDF representation of the query that results from transcoding
Value: boolean
Required: NO

Property: cypher.report.html
Description: Whether to output HTML reports. The starting point is the index.html file located at the root of the cypher.output.dir directory. Enabling this option slows down parse time considerably.
Value: boolean
Required: NO

Property: cypher.report.parapharse
Description: Whether to output paraphrases of input
Value: boolean
Required: NO

Property: cypher.report.refresh
Description: The refresh interval for the HTML interface; the format is minute:hour
Value: string
Required: YES

Property: cypher.report.bulk
Description: Whether to write reports to one file. If true, reports will be written to the <cypher.output.dir>/bulk directory
Value: boolean
Required: NO

 

Property: cypher.http.base
Description: Used by the Cypher web service. This should be the URL of the Cypher service (e.g. http://localhost:8080/cypher-ws). If the Cypher web service is running, then set the cypher.output.dir to the <cypher-ws home>/out directory, where <cypher-ws home> is the location where the cypher-ws.war file was deflated. This is also used as the namespace for all URIs minted during transcoding.
Value: path string

Property: logger.config.file
Description: The name of the file containing the Log4j configuration. The file name must be relative to the <cypher home>/data/configuration directory.
Value: string
Required: YES
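
As a rough aid for checking a configuration before starting Cypher, the Python sketch below reads property text, reports which required properties are missing, and splits an rdf.repository.* value into its class name and arguments. It rests on assumptions this guide does not confirm: that config.ini holds one key=value pair per line and that lines starting with # are comments. The sample values, including the class com.example.MyRepository, are hypothetical.

```python
# Properties listed as required in the property descriptions above.
REQUIRED = (
    "rdf.repository.output",
    "rdf.repository.lexicon",
    "rdf.repository.dictonary",  # spelled as it appears in this guide
    "rdf.repository.internal",
    "cypher.language",
    "cypher.input.files",
    "cypher.input.dir",
    "cypher.output.dir",
    "cypher.report.refresh",
    "logger.config.file",
)

# Hypothetical sample; the key=value syntax and all values are assumptions.
SAMPLE = """
# repository used to store semantic output
rdf.repository.output = com.example.MyRepository,/var/cypher/output
cypher.language = english
cypher.input.files = true
"""


def parse_properties(text):
    """Parse key=value lines into a dict, skipping blank and '#' comment lines."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def missing_required(props):
    """Return the required property names that are absent or empty, in table order."""
    return [name for name in REQUIRED if not props.get(name)]


def parse_repository_spec(spec):
    """Split '<classname>,<args>...' into (classname, [args]) in listed order."""
    classname, *args = [part.strip() for part in spec.split(",")]
    return classname, args


props = parse_properties(SAMPLE)
print(missing_required(props))
print(parse_repository_spec(props["rdf.repository.output"]))
```

Because the value is split on commas, this sketch cannot represent an argument that itself contains a comma; how Cypher treats that case is not stated here.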

Output and Reports 

Cypher recursively loads input from the cypher.input.dir directory and writes the parser results to the cypher.output.dir directory, both specified in the startup properties file. The format is as follows:

·         input dir

-         file-a.txt

-         file-b.html

·         output dir

-         file-a

§  The first sentence

§  1

§  query.serql

§  query.sparql

§  statements.rdf

§  grammar.xml

§  2

§  n

§  transcode.failed.txt

§  report.csv

§  Another sentence