Tags
Steamship uses Tags to represent all commentary about content, for example:
- The intent of a chat message
- The embedding of a sentence
- The sentiment of a phrase
- The markdown semantics of a region of text
- Identified object regions of an image
Steamship Files and Blocks contain content. Tags hold all data about the content.
The full Tag PyDoc spec is here, but it’s useful to look at a summarized version:
class Tag:
    """Subset of the Tag object -- within the context of a Block"""
    # What the tag is
    kind: str
    name: Optional[str]
    value: Optional[Dict]
    # The span of text the tag is commenting upon.
    # Indices are relative to the block's text.
    start_idx: Optional[int]  # Start inclusive
    end_idx: Optional[int]  # End exclusive
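To make the index convention concrete, here is a minimal, self-contained sketch. It uses a plain dataclass as a stand-in for the summarized Tag (it is not the Steamship class), and covered_text is a hypothetical helper introduced only for illustration. Because the start is inclusive and the end is exclusive, the pair maps directly onto a Python slice.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class SketchTag:
    """Stand-in for the summarized Tag; not the real Steamship class."""
    kind: str
    name: Optional[str] = None
    value: Optional[Dict] = None
    start_idx: Optional[int] = None  # Start inclusive
    end_idx: Optional[int] = None  # End exclusive


def covered_text(block_text: str, tag: SketchTag) -> str:
    """Return the span of block_text that the tag comments upon (hypothetical helper)."""
    if tag.start_idx is None or tag.end_idx is None:
        # Assumption for this sketch: without indices the tag covers the whole text.
        return block_text
    return block_text[tag.start_idx:tag.end_idx]


block_text = "Donald Duck lives in Duckburg."
tag = SketchTag(
    kind="entity",
    name="person",
    value={"canonical": "Donald Duck"},
    start_idx=0,
    end_idx=11,
)
span = covered_text(block_text, tag)
print(span)
```

The exclusive end index means end_idx - start_idx is always the length of the covered span.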
This design results in a flexible data storage scheme that can be adapted to a number of different scenarios. We optimize our data storage so that you can query over tags and their contents.
Ways to use Tags
Here are a few examples to help you think of how tags are used.
The start_idx and end_idx have been left out of the pseudo-code below.
- An entity
Tag(kind="entity", name="person", value={"canonical": "Donald Duck"})
- A part of speech
Tag(kind="part-of-speech", name="adj")
- An embedding
Tag(kind="embedding", name="my-embedder", value={ "vector-value": [0, 0, 0, 1, 0 .. 0] })
- A summarization
Tag(kind="generation", name="summary", value={ "string-value": "... summary of the span covered by this tag ..." })
- A dictionary lookup
Tag(kind="token", name="ce-dict", value={ "chinese": "你好", "pinyin": "nǐhǎo", "english": "hello" })
Tag Schemas
Steamship brings together many models under one roof using tags as the common representation for interoperation. But a common representation alone doesn’t fully solve the model interop problem: how do we make sure all models use the same tags?
Where possible, we use a common schema for the kind, name, and value properties of a tag.
If all sentiment models produce tags with kind sentiment and a range of names [positive, neutral, negative], for example, then we can swap them in and out as needed.
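The swap-in, swap-out idea can be sketched as follows. Tags are shown as plain dicts, and both "models" are hypothetical toys written for this illustration, not Steamship plugins. The point is that the consumer, read_sentiment, depends only on the shared kind and names, so it works unchanged with either producer.

```python
from typing import Dict, List, Optional

# The shared schema: one kind, three agreed-upon names.
SENTIMENT_NAMES = {"positive", "neutral", "negative"}


def keyword_model(text: str) -> List[Dict]:
    """Hypothetical sentiment model that looks for a single keyword."""
    name = "positive" if "love" in text.lower() else "neutral"
    return [{"kind": "sentiment", "name": name}]


def punctuation_model(text: str) -> List[Dict]:
    """A second hypothetical model that also emits an unrelated tag."""
    name = "positive" if text.endswith("!") else "neutral"
    return [{"kind": "token", "name": "demo"}, {"kind": "sentiment", "name": name}]


def read_sentiment(tags: List[Dict]) -> Optional[str]:
    """Return the first sentiment name found, ignoring all other tags."""
    for tag in tags:
        if tag.get("kind") == "sentiment" and tag.get("name") in SENTIMENT_NAMES:
            return tag["name"]
    return None


text = "I love this!"
results = [read_sentiment(model(text)) for model in (keyword_model, punctuation_model)]
print(results)
```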
Our ongoing work toward a common schema can be found in the tag_constants.py file on GitHub. There you will find Python Enum classes that define common tags across our plugins.
- TagKind class contains suggested values for the kind field of a Tag.
- TagValue class contains suggested keys for the value dictionary of a Tag.
- The following classes contain suggested values for the name field:
  - DocTag for document semantics (HTML, Markdown, OCR, etc)
  - EmotionTag for emotion tags
  - EntityTag for entity tags
  - GenerationTag for models which generate new data from the covered span as input
  - IntentTag for intent classification
  - SentimentTag for sentiment classification
  - RoleTag for role classification
These constants are not required, but using them increases the chance that what you build will interoperate cleanly with everyone else who uses them.
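One way such constants can coexist with plain strings is sketched below. It assumes the constants are string-valued Enum classes. SketchTagKind and SketchEntityTag are hypothetical stand-ins defined here for illustration, and their values are assumptions rather than a copy of tag_constants.py. A Python Enum that mixes in str compares equal to its underlying string, so a tag built from constants matches one built from the equivalent strings.

```python
from enum import Enum


class SketchTagKind(str, Enum):
    """Hypothetical stand-in for a kind enum; values are assumed."""
    ENTITY = "entity"
    SENTIMENT = "sentiment"


class SketchEntityTag(str, Enum):
    """Hypothetical stand-in for an entity name enum; values are assumed."""
    PERSON = "person"


# The same tag written twice: once with constants, once with plain strings.
tag_from_constants = {"kind": SketchTagKind.ENTITY, "name": SketchEntityTag.PERSON}
tag_from_strings = {"kind": "entity", "name": "person"}

# str-mixin enum members compare equal to their string values.
same = tag_from_constants == tag_from_strings
print(same)
```

The benefit of the constant is that a misspelled member name fails loudly with an AttributeError, while a misspelled string silently produces a tag nobody else recognizes.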
Here is what some of the above tag examples would look like using these enum classes. Notice how each one combines standard constants with “magic strings”, depending on whether a constant exists for that concept.
- An entity
from steamship.data import TagKind, EntityTag
Tag(kind=TagKind.ENTITY, name=EntityTag.PERSON, value={"canonical": "Donald Duck"})
- A part of speech
from steamship.data import TagKind
Tag(kind=TagKind.PART_OF_SPEECH, name="adj")
- An embedding
from steamship.data import TagKind, TagValue
Tag(kind=TagKind.EMBEDDING, name="my-embedder", value={ TagValue.VECTOR_VALUE: [0, 0, 0, 1, 0 .. 0] })
- A summarization
from steamship.data import TagKind, GenerationTag, TagValue
Tag(kind=TagKind.GENERATION, name=GenerationTag.SUMMARY, value={ TagValue.STRING_VALUE: "... summary of the span covered by this tag ..." })
- A dictionary lookup
from steamship.data import TagKind
Tag(kind=TagKind.TOKEN, name="ce-dict", value={ "chinese": "你好", "pinyin": "nǐhǎo", "english": "hello" })
Block and File Tags
The text above discusses tags on spans of text. But Steamship actually supports two types of tags: File Tags and Block Tags.
File Tags annotate a File object itself:
- They are attached to the File object (file.tags)
- Their block_id, start_idx, and end_idx are always null.
- They are referenced via the filetag keyword in our query system.
Block Tags annotate text within a Block object:
- They are attached to the Block object (block.tags)
- Their start_idx and end_idx fields are either both null or both non-null. If both are null, the Tag is assumed to apply to the whole Block. When non-null, they represent offsets into the text that is spanned by that block.
- They are referenced via the blocktag keyword in our query system.
Notes:
- It is impossible for a tag to cover text spanning multiple blocks.
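The rules for File Tags and Block Tags can be expressed as a small consistency check. This is a sketch under stated assumptions, not Steamship's own validation: is_valid_file_tag and resolve_block_span are hypothetical helpers, and the bounds check is one interpretation of the note that a tag cannot cover text spanning multiple blocks.

```python
from typing import Optional, Tuple


def is_valid_file_tag(
    block_id: Optional[str], start_idx: Optional[int], end_idx: Optional[int]
) -> bool:
    """File Tags never point into a block, so all three fields must be null."""
    return block_id is None and start_idx is None and end_idx is None


def resolve_block_span(
    block_text: str, start_idx: Optional[int], end_idx: Optional[int]
) -> Tuple[int, int]:
    """Return the (start, end) offsets a Block Tag covers within its own block."""
    if (start_idx is None) != (end_idx is None):
        raise ValueError("start_idx and end_idx must be both null or both non-null")
    if start_idx is None:
        # Both null: the tag applies to the whole block.
        return 0, len(block_text)
    if not 0 <= start_idx <= end_idx <= len(block_text):
        # Assumption: offsets are relative to this block only, so a span
        # may not reach past the end of the block's text.
        raise ValueError("span must lie within the block's text")
    return start_idx, end_idx


whole = resolve_block_span("Hello world", None, None)
part = resolve_block_span("Hello world", 0, 5)
print(whole, part)
```

Content that logically spans two blocks would therefore need one tag per block, each with offsets relative to its own block's text.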