Complete User Guide (Draft 4.1.0)
This document is an incomplete draft, and is under active revision. Please check back for updates. The Cypher Developers mailing list provides a forum for discussion and questions.
This file is also available in PDF form.
Table of Contents
4.1.1 Pattern files and index.xml
4.1.2 Con Element (The Word/Phrase List)
4.1.3 Pattern Element (The Grammar)
4.1.5 mso:Sense Class (The Vocabulary)
4.1.6 Frame Semantics (The Dictionary)
Welcome to Cypher! Cypher is one of the first software programs available that generates the RDF graph and SeRQL query representations of a natural language input. The Cypher framework provides a set of robust definition languages, which can be used to extend and create grammars and lexicons. The Cypher specifications are designed to allow a novice to quickly and easily build transcoders for processing highly complex sentences and phrases of any natural language, and to cover any vocabulary.
Cypher is a 100% Pure Java application and runs on any machine that supports Java. The specification languages used to describe the grammar, lexicon, and framenet are all XML.

| Input | Text | Input Type | Abbreviated Output | Output Type | Dataset |
|---|---|---|---|---|---|
| 1 | The Semantic Web is one of Chris Walker's hobby interests | English Declarative Clause | mso:CWalker foaf:interest mso:SemanticWeb | .rdf File | Download (coming soon) |
| 2 | Chris Walker's interest | Relational Noun Phrase | select distinct node0 from | .serql File | Download (coming soon) |
| 3 | The Terminal stars Tom Hanks and Catherine Zeta-Jones | English Declarative Clause | mso:frame0 rdf:type mso:MovieRoleFrame | .rdf File | Download (coming soon) |
| 4 | Actresses who played in movies with Tom Hanks | English Anaphora Noun Phrase | select distinct node0 from | .serql File | Download (coming soon) |
Only three steps are required to produce the above output: 1) describe the phrase structure of each sentence, 2) enter the words in Cypher's lexicon (dictionary), and 3) create the RDF schema (if needed). Cypher is designed to allow developers to build on the work of others. All Cypher data constructs are pluggable and reusable objects. Once you've described a PresentTenseVerbPhrase, it can be used by you or by others to create more complex phrases. A lexicon developed by a third party can be added to your existing Cypher installation to instantly add new words to its vocabulary without any further modification to Cypher's grammar. Cypher's vocabulary affects the amount and quality of the semantic output. Programming with Cypher is fun, and the learning curve is smooth, because you need to learn only three XML elements to write a grammar: <con/>, <pattern/>, and <mso:sense/>.
This technology is the result of over five years of research and development. The work has turned out to be, in our opinion, a novel and extremely elegant yet powerful approach to a very old problem in computer science. Our vision for Cypher is to create an open framework to facilitate the development of datasets which cover the majority of phrase patterns and vocabularies for the majority of languages; this is a gargantuan but finite undertaking, and it all begins with you. So without further ado, we present Cypher. Enjoy.
Cypher conforms to a process called Transcography. The main focus of Transcography is to provide a minimalist framework for NLP, i.e. it is averse to first-order logic, predicate calculus, neural nets, etc. It produces quality output by resolving ambiguity at multiple levels of processing, and by allowing the interaction of processes to improve parse judgments. Transcography organizes and modulates the tasks in NLP in such a way as to “corner the beast” of ambiguity almost solely into a lexicon which is based on frame semantics. Once the phrase grammar has been produced for the input, five main techniques are used to produce the semantic output: Symbolic Reference, Node Expansion, Subcategorization, Identity Transfer, and Inference.
Symbolic Reference – Each constituent of a phrase must resolve to a concept, referenced either by description (e.g. the blue bird) or by unique identifier (e.g. Henry Ford). Cypher, therefore, produces either a URI or a BNode which represents the phrase, plus a set of triples representing the description given by the phrase.
Node Expansion – The set of triples produced by each child node of a phrase is included in the parent phrase’s output.
Subcategorization – Transcography conforms to the theory that verbs and other units of language subcategorize for their arguments, and that this information is specified in the lexicon. Cypher accommodates this with three properties of the mso:Sense class: mso:allows, mso:restricts, and mso:requires.
Identity Transfer – The human language processor produces semantic output by consulting a dictionary and retrieving an entry for each word encountered in the input. The entry contains the description of an anonymous entity, and this description is transferred to the instance concept. Cypher accommodates this with the mso:ref property of the mso:Sense class.
Inference – Each phrase and clause in natural language expresses information that is not explicit in the phrase. The human language processor makes use of a dictionary which provides a semantic map, linking the explicit description provided by the phrase to implicit descriptions inferred from the phrase. Cypher accommodates this by using Frame Semantics.
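The first two techniques can be pictured with a small sketch. This is not Cypher's implementation (Cypher is a Java application); it is a minimal Python illustration, and the PhraseNode class, the new_bnode helper, and the ex: vocabulary are all hypothetical names chosen for the example. Each node carries a reference that is either a URI or a BNode (Symbolic Reference), and a parent's output includes the triples of its children (Node Expansion).

```python
from dataclasses import dataclass, field
from itertools import count

# Hypothetical types for illustration; none of these names come from Cypher.
Triple = tuple[str, str, str]
_bnode_ids = count()


def new_bnode() -> str:
    """Mint a fresh blank-node label such as '_:b0'."""
    return f"_:b{next(_bnode_ids)}"


@dataclass
class PhraseNode:
    """A node in a parsed phrase tree."""
    # Symbolic Reference: a URI (unique identifier) or a BNode (description).
    ref: str
    # Triples that this node itself contributes.
    own_triples: list[Triple] = field(default_factory=list)
    children: list["PhraseNode"] = field(default_factory=list)

    def triples(self) -> list[Triple]:
        """Node Expansion: own triples plus every child's expanded triples."""
        result = list(self.own_triples)
        for child in self.children:
            result.extend(child.triples())
        return result


# "the blue bird" is referenced by description, so it gets a BNode.
bird_ref = new_bnode()
bird = PhraseNode(
    bird_ref,
    [(bird_ref, "rdf:type", "ex:Bird"), (bird_ref, "ex:color", "ex:Blue")],
)
# "Henry Ford" is referenced by unique identifier, so it gets a URI.
owner = PhraseNode("ex:HenryFord")
# A parent phrase that relates the two; its output includes its children's triples.
clause = PhraseNode(new_bnode(), [(owner.ref, "ex:owns", bird.ref)], [owner, bird])

for triple in clause.triples():
    print(triple)
```

The parent contributes one triple of its own and inherits two from the bird node, so the description given by a child phrase is never lost as the tree is walked upward.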
The program startup properties are located in <cypher home>/data/configuration/config.ini. The following is a description of each property:
| Property Name | Description | Value | Required |
|---|---|---|---|
| rdf.repository.* | Properties of this pattern specify a configuration string for a Sesame Repository. This allows for the use of SAIL and Repository implementations from third parties. The syntax is: <classname>,<args>… where <classname> is the fully qualified name of a class in the classpath which is an implementation of org.openrdf.repository.Repository. The .jar file containing this class should be placed in the <cypher home>/lib directory. The <args> is a list of Strings which will be passed (in the order in which they are listed) to the constructor of the Repository class. | string | N/A |
| rdf.repository.output | The repository used to store semantic output | string | YES |
| rdf.repository.lexicon | The repository used to store the lexical data | string | YES |
| rdf.repository.dictonary | The repository used to store the framenet data | string | YES |
| rdf.repository.internal | The repository used for internal processing | string | YES |
| cypher.language | The directory under <cypher home>/data/language containing Cypher’s language definition files | string | YES |
| cypher.input.files | Whether the files in cypher.input.dir will be loaded as input | boolean | YES |
| cypher.input.dir | The absolute path to the directory containing input | path string | YES |
| cypher.output.dir | The absolute path to the directory to which the output is written. If the Cypher web service is running, set cypher.output.dir to the <cypher-ws home>/out directory, where <cypher-ws home> is the location where the cypher-ws.war file was unpacked. | path string | YES |
| cypher.output.format | The RDF format for output. Options are rdfxml, turtle, n3, ntriples, trix, trig | string | NO |
| cypher.output.commit | Whether to commit RDF output to the rdf.repository.output repository | boolean | NO |
| cypher.output.entailments | Whether to output the frame entailments | boolean | NO |
| cypher.report | Whether to write reports of the transcoding. Generally, the more reports that are turned on, the greater the processing burden due to I/O. This option, and all those below it, are false by default. | boolean | NO |
| cypher.report.debug | Whether to output the pattern matching debug info | boolean | NO |
| cypher.report.grammar | Whether to output the grammar parse info | boolean | NO |
| cypher.report.rdf | Whether to output any RDF statements that result from transcoding | boolean | NO |
| cypher.report.sparql | Whether to output the SPARQL form of the query resulting from transcoding | boolean | NO |
| cypher.report.serql | Whether to output the SeRQL form of the query resulting from transcoding | boolean | NO |
| cypher.report.failed | Whether to output the lists of inputs which failed to transcode | boolean | NO |
| cypher.report.lexical | Whether to output the list of words which failed lexical lookup | boolean | NO |
| cypher.report.query-rdf | Whether to output the RDF representation of the query that results from transcoding | boolean | NO |
| cypher.report.html | Whether to output HTML reports. The starting point is the index.html file located at the root of the cypher.output.dir directory. Enabling this option slows down parse time considerably. | boolean | NO |
| cypher.report.parapharse | Whether to output paraphrases of input | boolean | NO |
| cypher.report.refresh | The refresh interval for the HTML interface; the format is minute:hour | string | YES |
| cypher.report.bulk | Whether to write reports to one file. If true, reports will be written to the <cypher.output.dir>/bulk directory | boolean | NO |
| cypher.http.base | Used by the Cypher web service. This should be the URL of the Cypher service (e.g. http://localhost:8080/cypher-ws). If the Cypher web service is running, set cypher.output.dir to the <cypher-ws home>/out directory, where <cypher-ws home> is the location where the cypher-ws.war file was unpacked. This is also used as the namespace for all URIs minted during transcoding. | | |
| logger.config.file | The name of the file containing the Log4j configuration. The file name must be relative to the <cypher home>/data/configuration directory | string | YES |
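As a rough aid for checking a configuration before starting Cypher, the sketch below parses property text, reports which required properties are absent, and splits an rdf.repository.* value into its class name and constructor arguments. It rests on several assumptions: that config.ini uses one key = value pair per line with # comments (the guide does not show the file's syntax), and that the sample values, including the class name com.example.MyRepository and the language directory english, are invented placeholders. The function names are hypothetical and are not part of Cypher.

```python
# Keys marked as required in the property table of this guide,
# spelled exactly as the guide lists them.
REQUIRED_KEYS = [
    "rdf.repository.output",
    "rdf.repository.lexicon",
    "rdf.repository.dictonary",
    "rdf.repository.internal",
    "cypher.language",
    "cypher.input.files",
    "cypher.input.dir",
    "cypher.output.dir",
    "cypher.report.refresh",
    "logger.config.file",
]


def parse_properties(text: str) -> dict[str, str]:
    """Parse 'key = value' lines, skipping blank lines and '#' comments.

    Assumption: config.ini follows this Java-properties-like layout.
    """
    props: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:  # ignore lines that contain no '=' at all
            props[key.strip()] = value.strip()
    return props


def missing_required(props: dict[str, str]) -> list[str]:
    """Return the required keys that are absent from props."""
    return [key for key in REQUIRED_KEYS if key not in props]


def split_repository_spec(spec: str) -> tuple[str, list[str]]:
    """Split '<classname>,<args>...' into the class name and its arguments."""
    classname, *args = [part.strip() for part in spec.split(",")]
    return classname, args


# Placeholder values for illustration only.
SAMPLE = """
# hypothetical, incomplete configuration
rdf.repository.output = com.example.MyRepository,/tmp/output-store
cypher.language = english
cypher.input.files = true
cypher.output.format = turtle
"""

props = parse_properties(SAMPLE)
print(split_repository_spec(props["rdf.repository.output"]))
print("missing:", missing_required(props))
```

Because the arguments are kept in the order in which they are listed, they can be handed to a constructor positionally, which matches the description of <args> in the table.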
Cypher recursively loads input from the cypher.input.dir directory and writes the parser results to the cypher.output.dir directory specified in the startup properties file. The layout is as follows:
· input dir
    - file-a.txt
    - file-b.html
· output dir
    - file-a
        § The first sentence
        § 1
        § query.serql
        § query.sparql
        § statements.rdf
        § grammar.xml
        § 2
        § n
        § transcode.failed.txt
        § report.csv
        § Another sentence
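The listing shows the input file file-a.txt producing an output folder named file-a. A minimal sketch of that naming rule follows, assuming the folder name is simply the input file name without its extension. How Cypher names folders for files found in subdirectories is not stated here, so mirroring the relative path is a guess, and output_dir_for is a hypothetical helper rather than part of Cypher.

```python
from pathlib import PurePosixPath


def output_dir_for(input_file: str, input_root: str, output_root: str) -> PurePosixPath:
    """Map an input file to the output folder that would hold its results.

    Assumption: the folder is the input path relative to input_root,
    with the file extension removed.
    """
    relative = PurePosixPath(input_file).relative_to(input_root)
    return PurePosixPath(output_root) / relative.with_suffix("")


# Placeholder paths for illustration only.
print(output_dir_for("/data/in/file-a.txt", "/data/in", "/data/out"))
print(output_dir_for("/data/in/sub/file-b.html", "/data/in", "/data/out"))
```

Pure path objects are used so the mapping can be checked without touching the file system.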