How does Khemeia work?

The software works by detecting the structure and patterns that the eye sees when viewing a page. People create documents primarily using visual logic (even if they do not use styles or style sheets), and Khemeia interprets this visual logic.

For each publication type, rules are configured according to the precise requirements of the data. During processing, the input file is segmented automatically (for example, into articles for a magazine or newspaper) and then segmented further according to the metadata requirements specified by the customer (that is, each article is itself divided into its component elements).
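
The first segmentation level described above can be illustrated with a minimal sketch. It is not Khemeia's implementation: the `Block` and `Article` types, the `segment_articles` function and the 18-point title threshold are all hypothetical, and the sketch assumes the page text has already been extracted into blocks that carry a font size.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    """One extracted text block (hypothetical structure for this sketch)."""
    text: str
    font_size: float  # in points
    page: int


@dataclass
class Article:
    title: str
    body: list = field(default_factory=list)


def segment_articles(blocks, title_size=18.0):
    """Start a new article whenever a block is set in a title-sized font.

    The threshold is an assumed rule; a real configuration would be tuned
    per publication type.
    """
    articles = []
    for block in blocks:
        if block.font_size >= title_size:
            articles.append(Article(title=block.text))
        elif articles:
            articles[-1].body.append(block.text)
        # Blocks that appear before the first title are ignored here.
    return articles


blocks = [
    Block("City Council Approves Budget", 24.0, 1),
    Block("The council voted on Tuesday.", 10.0, 1),
    Block("Further debate is expected.", 10.0, 2),
    Block("Local Team Wins Final", 24.0, 2),
    Block("A late goal decided the match.", 10.0, 2),
]
articles = segment_articles(blocks)
for article in articles:
    print(article.title, len(article.body))
```

Note that the first article continues across a page boundary: the rule keys on visual cues (font size) rather than on page breaks, which is the kind of visual logic the text describes.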

Typical items identified by Khemeia include:

  • Page numbers, section numbers, titles, headings, notes, footnotes, end notes, “see” references, bullet points, numbered lists, images, captions, etc.
  • Styles on the basis of their:
    • Position
    • Fonts and their formatting
    • Co-ordinates on the page
  • Tables
  • Specific categories of term (from which metadata is generated)
  • Metadata for defined elements

… with the output transformed into, for example, XML, PDF or HTML.
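
As a rough illustration of identifying items by position, font and page co-ordinates and then emitting XML, the sketch below labels a few text blocks and serialises them with Python's standard `xml.etree.ElementTree`. The rules, thresholds, element names and the block dictionary layout are assumptions made for this example only; Khemeia's actual rules are configured per publication type and are not shown in this document.

```python
import xml.etree.ElementTree as ET

PAGE_HEIGHT = 842.0  # assumed A4 page height in points, y measured from the top


def classify(block):
    """Label a block from its font and vertical position (illustrative rules)."""
    text, size, bold, y = block["text"], block["size"], block["bold"], block["y"]
    if y > PAGE_HEIGHT - 40 and text.strip().isdigit():
        return "page-number"  # a bare number in the bottom margin
    if size >= 16:
        return "title"
    if bold and size >= 12:
        return "heading"
    if size <= 8:
        return "footnote"
    return "paragraph"


def to_xml(blocks):
    """Wrap each block in an element named after its label."""
    root = ET.Element("page")
    for block in blocks:
        element = ET.SubElement(root, classify(block))
        element.text = block["text"]
    return ET.tostring(root, encoding="unicode")


sample = [
    {"text": "Annual Report", "size": 20, "bold": True, "y": 60},
    {"text": "1. Overview", "size": 13, "bold": True, "y": 120},
    {"text": "Revenue grew during the year.", "size": 10, "bold": False, "y": 150},
    {"text": "1 Figures are unaudited.", "size": 7, "bold": False, "y": 790},
    {"text": "12", "size": 9, "bold": False, "y": 820},
]
xml_out = to_xml(sample)
print(xml_out)
```

The order of the checks matters: the page-number test runs first so that a small number in the bottom margin is not mistaken for a footnote or paragraph.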

Features of Khemeia include:

  • Comprehensive metadata generation without any limitation – the user interface allows the creation of as many tags/element types as required
  • Hierarchical and interlinked metadata - not simply flat structured XML
  • Outputs: valid XML validated against customer DTDs or XML schemas, HTML, any database or search format
  • Benchmarked processing capacity of 2 to 10 million characters per hour (subject to page complexity, images, etc.)

Example - Magazines and Newspapers

  • A magazine consisting of multiple PDF pages; Khemeia segments the data by article, recognising the beginning and end of each, and then segments each article further into, for example, title, sub-title, author and date. This segmentation also handles images, captions, photo credits and text boxes.

Example - Books

  • A book consisting of multiple chapters; Khemeia segments the data by chapter, recognising the beginning and end of each, and then within the chapter segments further into title, paragraphs, page numbers, footnotes, images, captions, etc.

Example - Legal Documents

  • Legal documents consisting of, for example, a PDF with a hundred case documents; Khemeia segments the data case by case, and then within each case segments further into name of the court, case number, date, jurisdiction, presiding judge, plaintiff/appellant, defendant/respondent, judgment, etc.
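
The case-by-case split and the field extraction within each case can be sketched as follows. This is an illustration under stated assumptions, not Khemeia's method: it works on already-extracted plain text, and the line formats ("IN THE ... COURT", "Case No.", "Date:") and all function names are hypothetical. Real case documents vary widely, and would need rules configured for the actual layout.

```python
import re

# Assumed convention: each case opens with a line such as "IN THE HIGH COURT".
CASE_START = re.compile(r"^IN THE .+ COURT", re.MULTILINE)

# Assumed field formats, one per line within a case.
FIELDS = {
    "court": re.compile(r"^IN THE (.+)$", re.MULTILINE),
    "case_number": re.compile(r"^Case No\.\s*(\S+)", re.MULTILINE),
    "date": re.compile(r"^Date:\s*(.+)$", re.MULTILINE),
}


def split_cases(text):
    """Cut the text at every line that looks like the start of a case."""
    starts = [match.start() for match in CASE_START.finditer(text)]
    ends = starts[1:] + [len(text)]
    return [text[a:b].strip() for a, b in zip(starts, ends)]


def extract_fields(case_text):
    """Return the fields found in one case; missing fields are omitted."""
    found = {}
    for name, pattern in FIELDS.items():
        match = pattern.search(case_text)
        if match:
            found[name] = match.group(1).strip()
    return found


text = """IN THE HIGH COURT
Case No. 2020/123
Date: 1 March 2020
Judgment text one.
IN THE DISTRICT COURT
Case No. 2021/456
Date: 5 June 2021
Judgment text two.
"""
cases = split_cases(text)
records = [extract_fields(case) for case in cases]
print(records)
```

Fields such as presiding judge or plaintiff/appellant would follow the same pattern, with one rule per field.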

Example - Scientific Journal Abstracts

  • A scientific journal consisting of a number of articles; Khemeia segments the data into articles, and then within each article identifies the required information for the abstract, for example, title, author, publisher, date, subject, classification, description, edition, abstract, identifier, format, extent, language, availability, location, source, coverage, rights, terms of use, citation, etc.