METADATA and INFORMATION MANAGEMENT
Metadata is not an answer to the question “How do I manage my company’s information?”; rather, it is a means of automating an information management plan – but you need the plan first. Moreover, metadata is a cost, and a sound management plan ensures that the cost of creating it is exceeded by the benefits it brings. There are plenty of examples of systems which impose a metadata burden on users for which the costs are never recouped. This paper is a first step towards answering the cost/benefit question, focussing on how to identify what metadata is needed, and then steering between the errors of metadata that is too simplistic to be useful and metadata that is too complicated to provide commensurate benefit.
Metadata is typically defined as “data about data”, a singularly unhelpful definition. Rather than criticise this definition from first principles, the concern here is to say something useful about metadata as it is used in Information Management. In this context, metadata relates a subject item to the business processes and procedures that manage that item. What follows teases out what that relationship entails, and therefore what constitutes “good metadata”.
The first section below works through some basic concepts – how Metadata relates to Process, what is a taxonomy and why is it useful, and how do we measure our need for metadata. This is followed by the main part of this paper, “The MUST of Metadata”. This outlines the aims of the basic management processes – Management, Usage Control, Search and Trust – and points to some metadata metrics which may be used to match metadata complexity to the business requirement.
But to start with, we must be clear what is being managed. A subject item is a collection of information which we want to treat as a single item for the purposes of managing it; a mathom¹ to put away in our attic. A mathom might be a single physical item, like the garden gnome from granny; a bundled collection such as the box of photos from Aunt Ivy; or a loose collection that is best kept together, such as the poles, skis and skiwear from last year’s trip. Similarly a subject item might be a simple document in a single file, the collection of images showing a product test, or a complex report with multiple views generated from a set of XML pages. A subject item is a single unit for the purposes of keeping track of it or for controlling access to it, since all the component parts are needed to understand the subject it describes. Information Management is about the way The Business works with the subject item. Information Management processes are best considered separately from the IT mechanisms that keep the components together, and which are properly the subject of Data Management.
Given we have a subject item, what do we mean by “manage it”? In practice, this comes down to four requirements:
Being able to find it among the thousands of other things on the computer;
Being able to use it, which means we need permissions – security and copyright – to read it;
Being able to trust it: “is it the latest version?”, “is this the authentic original?”;
Being able to reference it, so we can find it again or tell someone else about it.
The metadata problem is to identify what information will enable us to do all that, and the answer depends on the size and complexity of business processes relating to each of the four requirements.
Metadata and Process
We start with a deliberately behaviourist definition: The meaning of information is given by the behaviour of the systems that process it. This definition allows us to treat both people and computers in the same way without having to account for vague concepts like “understanding”: focusing on behaviour avoids the need to ask “what is in the computer’s head?”.
A process is simply doing one thing after another, except when it comes to decision points, when it is doing one thing or another. In general, metadata is either a memory of what step in the process the subject item has got to, or a marker for which branch it should follow at a decision point. For example, in document metadata, the status “draft” remembers which step was reached in a document approval cycle, while the metadata value “business record” is used to automatically select the sub-process “archive it for future reference”.
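As a minimal sketch of this idea, the Python fragment below treats one metadata field as a memory of the approval step and another as a branch marker. The field names (status, record_type), the list of steps and the function names are illustrative assumptions, not taken from any particular system.

```python
# Illustrative only: field names and step names are assumptions.
APPROVAL_STEPS = ["draft", "in review", "approved"]

def advance(item):
    """Move the item to the next approval step; 'status' remembers where it is."""
    position = APPROVAL_STEPS.index(item["status"])
    if position < len(APPROVAL_STEPS) - 1:
        item["status"] = APPROVAL_STEPS[position + 1]
    return item

def on_approval(item):
    """Decision point: 'record_type' marks which branch the item follows."""
    if item["status"] != "approved":
        return "keep working"
    if item["record_type"] == "business record":
        return "archive it for future reference"
    return "no further action"

doc = {"id": "DOC-1", "status": "draft", "record_type": "business record"}
advance(doc)  # draft -> in review
advance(doc)  # in review -> approved
print(on_approval(doc))
```

Neither function needs to "understand" the document: the behaviour is driven entirely by the two metadata values.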
Although outside the scope of this note, two other types of metadata require a brief mention. Firstly, audit data records are used to show which process steps were taken and when: for example, the date a document was approved and the approver. Audit data may be part of trust data, but may also have other purposes such as proving compliance, for which the subject is the business itself rather than the actual subject items. Secondly, technical metadata is data used by software when accessing the subject item; for example, image size is used to render a list of pixels as a rectangular picture. Technical metadata is properly the subject of software technology.
One further source of confusion is the way some systems store subject data as “metadata”; for example, a design management system might store the material for a part as metadata to the part’s CAD file. Here the design management system is really a hybrid, being part design database and part information management system. Such metadata is “subject data which is not stored in the physical subject item”. Metadata records often expand to meet business purposes for which no other convenient repository exists. This is rarely appropriate, as it can add process-specific metadata to subject items that are not subject to that process, and consequently lead to a habit of not-filling-in-metadata, which can then spill over to infect useful metadata.
Metadata and Taxonomy
An automated process trips up if the data it uses comes with synonyms or alternative spellings. The solution is often a controlled vocabulary, at its simplest a drop-down list. However, a controlled vocabulary is no use unless the users know what its terms mean.
A taxonomy is a system of definitions, for which the taxonomy class names provide a controlled vocabulary.
In a taxonomy of things, a definition first identifies what sort of thing something is (the superclass it belongs to) and second, how it differs from other similar things (what subclass it belongs to). For example, the class of “animal” might be divided into subclasses of “mammal”, “bird”, “fish” and “insect”, with rules such as: “a bird has two legs and flies” and “fish swim”. Of course, a “good” taxonomy has the “right” criteria, and does not classify bats as birds nor penguins as fish. In a taxonomy, it is the definitions that are important, not the labels used for the classes.
A hierarchical taxonomy has layers which divide the subjects into smaller, more precise classes: Animal – Bird – Perching Bird – Thrush. A hierarchical set of terms is not a taxonomy unless it is supported by a complete set of definitions at every level; the terms are the cues – the shortcuts – for someone who understands the definitions. A taxonomy does not have to be hierarchical: when supported by definitions, a list can constitute a simple taxonomy. However, hierarchical taxonomies, if used correctly, have a particular advantage.
As one goes down a taxonomy hierarchy layer by layer, each layer adds more detailed definitions – which means the user’s total knowledge must increase. However, in many cases, the lower levels of detail are not needed by all users. For example, in an emergency a fire crew may need to know that a type 123 fire engine comes with a ladder and 5,000 litres of water, but this level of detail is not needed by the policeman or the ambulance crew, who simply care that a fire engine is coming. Hierarchy can be used as a tool for matching detail to user.
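The use of hierarchy to match detail to user can be sketched as follows. This is a minimal illustration, assuming a simple tree of classes, each carrying its own definition; the class names and the wording of the definitions are invented for the example.

```python
from dataclasses import dataclass, field

@dataclass
class TaxonomyClass:
    name: str
    definition: str
    subclasses: list = field(default_factory=list)

def path_to(root, target, trail=()):
    """Return the chain of classes from the root down to the target, or None."""
    trail = trail + (root,)
    if root.name == target:
        return trail
    for sub in root.subclasses:
        found = path_to(sub, target, trail)
        if found:
            return found
    return None

def describe(root, target, depth):
    """Give definitions only down to the depth a particular user needs."""
    trail = path_to(root, target)
    if trail is None:
        raise KeyError(target)
    return [f"{c.name}: {c.definition}" for c in trail[:depth]]

# Hypothetical definitions, for illustration only.
type_123 = TaxonomyClass("type 123 fire engine",
                         "fire engine with a ladder and 5,000 litres of water")
fire_engine = TaxonomyClass("fire engine",
                            "emergency vehicle equipped to fight fires", [type_123])
vehicle = TaxonomyClass("emergency vehicle",
                        "vehicle dispatched to an incident", [fire_engine])

print(describe(vehicle, "type 123 fire engine", depth=2))  # police, ambulance crew
print(describe(vehicle, "type 123 fire engine", depth=3))  # fire crew
```

The same item is classified once, but each user community reads only as far down the hierarchy as its own process requires.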
There is one important complication. A taxonomy may not be driven by the characteristics of the thing classified but by the process in which it is used. For example, being “an authoriser” is not a characteristic of a human being; it is a role assigned by a process. In such a case, the taxonomy will only be correctly applied by someone who knows the process. Too often when filling in a form, one is stymied when it asks an incomprehensible question such as “Are you resident, domiciled, habitually resident or habitually domiciled?” – the answer being used at a decision point in the UK tax assessment process for people working overseas (to be fair, in this case, the tax form comes with detailed decision criteria).
Metrics for Metadata
A metric for metadata is an evaluation of how “big” a metadata problem is, and therefore how complicated the solution needs to be. This section develops what the “bigness” questions are before applying them to the MUST of Metadata. Such metrics are currently quite immature, and the discussion is illustrative.
The three basic metrics for metadata are:
Size – how many processes, steps in a process, different types of user, different types of subject item;
Complexity – how complicated are the rules relating process to a subject item;
Coupling – what other subject items or process steps does the user need to know about.
A key aim of Information Management is to minimise the effort needed to find, access and manage the subject items. In practice this is often a trade-off between size and complexity. A list of, say, half-a-dozen options might be presented as a simple list, perhaps with the most used options at the top. A list of a hundred or so items might be represented as an ordered list – that is, a list plus the added complexity of a rule ordering the list. A telephone directory uses further rules: for example, it splits out three separate listings for personal name, business name and business type, and it abbreviates “Saint” to “St.”, so that “St” comes between “Sag” and “Saj” rather than between “Sr” and “Su”. To use a telephone directory, the user must first learn the rules by which the directory is arranged.
Properly, size measures the number of metadata terms needed multiplied by a factor for how much knowledge is required to use each term. For an identifier, such as a file name, there may be millions of names, but, once recognised as a file name, there is little further to know. On the other hand there may be very few business roles in the accounts department, but routing a document to the right person may require substantial knowledge about each role. The curse of automated telephone answering systems is not that the caller does not know what they want to say, but that they do not know how the called company organizes itself and so what each option actually applies to.
Complexity is a measure of the number of rules needed. Alphabetical ordering is an example of a simple rule. Rules may include meta rules – rules that decide which rule should be invoked; such a rule requires more knowledge (is bigger) than a simple rule. For example, in a telephone directory, one needs to invoke the meta-rule “select personal, business or business type listing” before the “search listing alphabetically” rule.
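The telephone directory rules can be sketched in a few lines of Python. This is only an illustration: the listing labels, the sample names and the function names are assumptions, and a real directory has many more filing rules than the single “St.” rule shown.

```python
def choose_listing(entry_kind):
    """Meta-rule: decide which listing, and so which set of rules, applies."""
    listings = {
        "person": "personal names",
        "business": "business names",
        "trade": "business types",
    }
    return listings[entry_kind]

def sort_key(name):
    """Ordering rule: file 'St.' or 'St' as if it were spelled 'Saint'."""
    words = name.split()
    if words and words[0].rstrip(".").lower() == "st":
        words[0] = "Saint"
    return " ".join(words).lower()

# Hypothetical business names.
names = ["Sajid Ltd", "St. Mary's Clinic", "Sagar Stores", "Stone & Co"]

print(choose_listing("business"))
print(sorted(names, key=sort_key))
```

A user who knows only plain alphabetical order would look for “St. Mary's Clinic” near “Stone & Co” and fail to find it, which is the sense in which each added rule adds to the knowledge the user must hold.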
As an example, access control rules rapidly become very complex: in a cross-institution medical project, a researcher may have the right to access a particular file type under the project rules but not under the normal rules for the institution, and a meta-rule must be added to decide which set of rules takes precedence. Further, multiple projects may run across multiple institutions, with a researcher taking different roles in each project, and so may have access to a particular file when working on one project but not when working on another. Ensuring that complex combinations of rules do not accidentally give undesired results is challenging.
The metric for complexity is also related to the size of the user community. For example, if the user community is large and diverse, rules on access controls must be taught both to the administrators (who allocate access privileges to users) and to the information originators, who set the required privileges when they create a subject item. Unless the administrators understand the whole user community, and in the same way as the originators, there will be a mismatch between users and what they can read, potentially enabling inappropriate access and preventing appropriate access (leaks and blockages). Unfortunately, eliciting the knowledge needed to determine size and complexity is as yet a specialist discipline, and converting these to hard metrics is a specialist human factors problem.
Coupling occurs when there is a linkage between two or more elements in the business process. An example at the simplest level is where one document is used as a reference for another, so when the referenced document is updated, the citing document may also need to be changed. At a more complex level, responsibilities in an engineering project may be split between the technical lead, who is responsible for project quality, and the project accountant, who tracks cost and time. However, quality, cost and time are often tightly coupled, leading to conflicts between the people in the two roles. In information management terms, updating the project bar chart is not something one person can do in isolation, at least, not if they want the project to succeed.
Very complex couplings occur in Product Configuration Management, where a change in the design of a single part can have knock-on effects on every assembly that uses the part, changing not only assembly design but also the cost and schedule of the whole project. This is why Product Configuration Management should be separated out as an Information Management specialism in its own right.
Metrics on coupling are also a work in progress, and working out how an Information Management system should deal with it is as yet a matter of tact and judgement. Deciding what sort of coupling relationships should be made explicit in metadata requires a fine judgement covering what is really in scope, what is really possible and how reliable such information must be. For example, a system which reports if a document has been superseded may be more risky than one reporting nothing at all, since in the latter case the user would know to check, whereas in the former the user may simply assume that a null return is the same as the document being up-to-date.
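One way to reduce the risk in the superseded-document example is to make the lookup distinguish “confirmed current” from “nothing is known”. The sketch below is a minimal illustration under that assumption; the register, the document identifiers and the names used are all hypothetical.

```python
from enum import Enum

class Supersession(Enum):
    CURRENT = "confirmed current"
    SUPERSEDED = "superseded"
    UNKNOWN = "not recorded, check manually"

# Hypothetical register: document id -> id of its replacement,
# or None where the document has been confirmed as current.
REGISTER = {"SPEC-001": "SPEC-002", "SPEC-002": None}

def supersession_status(doc_id):
    """Distinguish 'no replacement exists' from 'nothing is known'."""
    if doc_id not in REGISTER:
        return Supersession.UNKNOWN
    if REGISTER[doc_id] is None:
        return Supersession.CURRENT
    return Supersession.SUPERSEDED

for doc_id in ("SPEC-001", "SPEC-002", "SPEC-999"):
    print(doc_id, supersession_status(doc_id).value)
```

The three-valued answer costs more to maintain than a simple flag, which is exactly the trade-off between reliability and effort discussed above.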
When the business process changes, it may break old couplings and create new, so consequently users must relearn the process and any taxonomy it uses. Splitting the project manager role into technical lead and project accountant means that anyone wishing to influence the project must now know which of two people to talk to and how conflicts between the roles play out. Change also means that the old metadata taxonomy is wrong with respect to the new process, and new metadata will be wrong if the project is still working to the old process – and this is why telephone numbers remain important, since they provide a route that bypasses the Information Management system to enable users to find out what is really going on.
THE MUST OF METADATA
So, now we know what we are managing – a subject item; why we are managing it – to meet four business process requirements; the method we will use – hierarchical taxonomies of metadata; and we have some idea of how metrics impact the design. Can we therefore formulate generic guidance on what metadata is needed?
First, let us recast the aims into objectives – move from the business processes we want to support to definitions we can use to check we are doing the right thing. This recasting comes with a convenient acronym “MUST”:
M – Management information, such as subject item identifier;
U – Usage control, such as access permissions or copyright owner;
S – Search terms by which the subject item is found;
T – Trust, demonstrated perhaps by approval or audit trails.
The “MUST of Metadata” is a reminder both that metadata is a necessity and that it must not be a burden. The MUST categories themselves are not a taxonomy – the item version might be classified both as management and trust metadata. MUST is a heuristic, a reminder of what to think about rather than a check list to impose.
The art of metadata – and currently it is more art than science – is to match the complexity of the metadata with the “bigness” of the problem that the metadata is solving. To convert the art to a science we need good metrics that will match metadata to the corpus of subject items and the user community. Lacking such metrics, the notes below are finger-posts to the direction of travel, rather than a detailed map of the whole journey.
The key management metadata item is the subject item identifier, as this allows all the Information Management processes to reference that item.
Many data management systems allocate each subject item its own identifier, typically a unique sequence of letters and digits. This is a “small problem” solution – it assumes one, static organization where the data management system is the only system. Moreover, the meaning of the identifier is embedded within the data management system – it is only that software that can find the subject item using that identifier. That “unique” string at best means nothing in any other system, and at worst, points to something else.
Such a “small problem” approach starts to break down when embedded in a bigger world – in business co-operations, mergers, or even just increased co-operation between departments. In a bigger world, information can travel in four directions:
peer-to-peer – as when two organisations contribute to the same document or design;
one-process-to-the-next – for example from design to manufacture and on to support, each with their own dedicated software tool set;
general-to-detail – where an overall goal is met by multiple specialists, each with their specialist software;
across time – be that simple recall of stored data, replacement of the data management tool or restructuring of the business (mergers, acquisitions, rationalisations).
The bigger world problem requires a subject identifier that persists across gaps in systems, so that a reference from A to B in one system persists as a reference when they get to the next system.
A huge world example is provided by aerospace part numbering. The aerospace industry has a substantial number of manufacturers, and each manufacturer will have hundreds of other companies in its supply chain, with some producing parts for many manufacturers (e.g. specialist suppliers of rivets). Moreover, there are a large number of airlines, each supported by in-house and contract maintenance organizations, some of which provide maintenance services to many airlines. Since putting the wrong part on an aeroplane can prove fatal, the industry is adopting a complex bar-coding system. An aerospace part number is made up of three elements: the part number given by the manufacturer, the id of the manufacturer (the context for the part number), and the id of the organization allocating the manufacturer’s identifier – this last avoids confusion between manufacturers in different countries but with similar names. Fortunately the complexity ends there, as there are only three organizations supplying manufacturer ids: CAGE, DUNS, and their European equivalent. Internet addresses use a similar structure, as in email@example.com, with domain name allocation currently coordinated by ICANN.
Here a complex metadata solution is needed because the domain is large (many parts, many companies) and coupled (part numbers being used across multiple processes). However, by standardising to three elements, the complexity is considerably reduced – the user need only learn one convention for part numbering.
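The three-element convention can be sketched as a simple composite key. This is an illustration only: the scheme labels, manufacturer id and part number below are invented, and the real bar-code formats are not reproduced here.

```python
from typing import NamedTuple

class PartId(NamedTuple):
    issuing_scheme: str   # who allocated the manufacturer id
    manufacturer_id: str  # context for the part number
    part_number: str      # number given by the manufacturer

# Two manufacturers that happen to hold the same id under different schemes.
a = PartId("SCHEME-A", "12345", "RIVET-7")
b = PartId("SCHEME-B", "12345", "RIVET-7")

# The full identifier keeps them apart; the local part number alone would not.
catalogue = {a: "rivet from manufacturer 1", b: "rivet from manufacturer 2"}
print(a == b)
print(a.part_number == b.part_number)
print(len(catalogue))
```

Because the identifier carries its own context, it can be passed from one system to another without relying on the software that first allocated it.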
Other metadata in the management bucket is there to link the subject item to the organizational processes that decide its fate. These might include life-cycle management, to ensure documents used are current; records management, to ensure that legally required records are tracked with audit trails; or custodian management, to ensure that someone remains responsible for an item after the company structure has changed.
The archetypal usage control is access management – making sure anyone who is allowed to see an item can see it, while ensuring anyone who should not does not. A small organization can use a simple, two step strategy:
Set up group G to have permission to access the item;
Make the person a member of group G.
At the other end of the access control spectrum is policy-based access control:
Allocate each person a set of “identity tokens” through a secure process;
Define a basic access permission using a combination of identity tokens (a policy);
Define what happens when several policies apply (a policy of policies).
Here a policy is typically a logical expression that combines identity tokens to infer permissions. For example, the accessor must work for department A in business role B and work on project C. In the end the policy must resolve to TRUE (allow access) or FALSE (deny access); consequently policy systems are typically logic based. A policy of policies defines what happens when two or more policies might apply – for example, legal access may trump commercial confidentiality.
An essential point here is that, as the business evolves or a subject item moves on to a new process step, access permissions may change over time. Ideally the system should be able to dynamically change existing policies or add new policies as processes and business structures change, with permissions decided at the point when the item is accessed. For mechanisms, see, for example, the OASIS standards SAML and XACML, or see PONDER 2 for a logic that also includes obligations such as the obligation to notify central administration should a sensitive file be accessed.
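The policy-based approach described above can be sketched in Python. This is a toy illustration and not SAML, XACML or PONDER 2 syntax: the token strings, the item flags and the choice of “first applicable policy wins” as the policy of policies are all assumptions made for the example.

```python
PERMIT, DENY = True, False

def legal_access(tokens, item):
    """Legal staff may read items placed under legal hold."""
    if item.get("legal_hold") and "role:legal" in tokens:
        return PERMIT
    return None  # policy does not apply

def commercial_confidentiality(tokens, item):
    """Commercially confidential items are closed to people outside project C."""
    if item.get("commercial_in_confidence") and "project:C" not in tokens:
        return DENY
    return None

def project_access(tokens, item):
    """Department A, business role B and project C are all required."""
    if {"dept:A", "role:B", "project:C"} <= tokens:
        return PERMIT
    return None

# Policy of policies: earlier policies take precedence, so legal access
# trumps commercial confidentiality.
POLICIES_IN_PRECEDENCE_ORDER = [legal_access, commercial_confidentiality, project_access]

def may_access(tokens, item):
    """Decide at the moment of access; deny if no policy applies."""
    for policy in POLICIES_IN_PRECEDENCE_ORDER:
        decision = policy(tokens, item)
        if decision is not None:
            return decision
    return DENY

item = {"legal_hold": True, "commercial_in_confidence": True}
print(may_access({"role:legal"}, item))
print(may_access({"dept:A", "role:B", "project:C"}, item))
print(may_access({"dept:A"}, item))
```

Because the decision is computed when the item is accessed, changing a person's tokens or reordering the policy list takes effect without touching the metadata on each subject item.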
Other usage controls typically centre round copyright and intellectual property. For example, is the user of a website either a personal subscriber or do they work for a company that subscribes? Have they paid to see a film a set number of times? Can the user copy an image to use in a Web Page?
The aim of someone searching is to find the information they want without having to weed through a great deal of irrelevant information. However from an Information Management systems perspective, search is a three way trade-off between search success, populating metadata and computational cost. These issues are explored in more detail below.
Search success can be defined in terms of game theory, where you win if you get the information you need, and lose when you have to wade through irrelevant results. More precisely, the metric Search Success is a combination of four terms:
(Probability of finding valued information) multiplied by (Information Value) MINUS
(Probability of missing valued information) multiplied by the (Cost of Missing) MINUS
(Probability of including irrelevant information) multiplied by the (Cost of Weeding)
[PLUS (Probability of excluding irrelevant information) multiplied by the (Cost of Weeding)]
This metric comes with three value functions: Information Value, Cost of Missing and Cost of Weeding. These value functions depend on the user aims:
For a pub quiz enquiry (e.g. Which film actor was born Archibald Leach?) the Information Value is not high and the Cost of Missing a single source is low since many separate sources can provide the answer (Cary Grant). Here the cost of weeding is relatively high in so far as the answer is needed quickly. Note: In the case of Internet search, the number of irrelevant things ignored would overwhelm all the other terms – which would make the metric useless – and so the last term is ignored.
In a search for legal evidence across a business, the cost of missing a vital piece of evidence will be high, and the searcher will be willing to accept a much higher cost when weeding out irrelevancies.
That is, if you want to know what sort of searches would be valued, you must ask the users. These should include the users on the periphery, such as the legal department, because although they may be only occasional users, their searches may have high business value.
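As a rough sketch, the Search Success metric above can be written as a small function. The parameter names are assumptions, the optional fourth term defaults to zero to reflect the Internet search case, and the numbers in the two calls are invented purely to contrast the pub quiz and legal evidence examples.

```python
def search_success(p_find, information_value,
                   p_miss, cost_of_missing,
                   p_irrelevant, cost_of_weeding,
                   p_excluded=0.0):
    """Combine the four terms of the Search Success metric."""
    return (p_find * information_value
            - p_miss * cost_of_missing
            - p_irrelevant * cost_of_weeding
            + p_excluded * cost_of_weeding)  # optional last term

# Made-up values: low value and low cost of missing, weeding relatively costly.
pub_quiz = search_success(0.9, 1.0, 0.1, 0.1, 0.5, 2.0)

# Made-up values: missing evidence is very costly, weeding is tolerated.
legal_search = search_success(0.9, 100.0, 0.1, 500.0, 0.8, 1.0)

print(pub_quiz, legal_search)
```

The formula is the same in both cases; what changes is the three value functions, which is why they have to be elicited from the users.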
Populating Search Metadata
The hard problem with search metadata is getting it populated, fully populated and correctly populated. It needs to be populated to be usable, but if it is not fully populated or correctly populated users may not trust the metadata and instead process data manually – which makes the metadata pointless.
It may be hard to get the subject item creators to populate metadata when that data is used in someone else’s process – a process they may not understand or care about. Metadata entry can be enforced, but at the risk of making it unreliable because time-stressed workers will aim to meet their deadlines rather than unstated, unexamined quality criteria. Search data is particularly problematic because the search criteria used by the people that create the data may differ from those of the target audience and will almost certainly differ from those of the specialist investigator.
Automatic population may be proposed as an alternative. Entity extraction searches a document for recognisable entities such as place names, dates or names of people; however, this is not as straightforward as it sounds. Consider, for example, “Theresa May may make hay in Hay next May”, a short sentence, evidently about the prime minister visiting a book festival, but where the words “may” and “hay” are used in multiple different ways. Semantic analysis uses the structure of language to provide some disambiguation, but is not yet entirely reliable. “Big Data” approaches rely on machine learning, but machine learning relies on there being a significant corpus of examples to train the algorithm, and consequently the needs of rarely used processes and small, specialist sub-populations of users may get lost in the noise.
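The difficulty can be seen with a deliberately naive extractor. The sketch below looks each word up in a tiny hand-made gazetteer, ignoring case and context; it is an assumption-laden toy and does not represent how any real entity extraction product works.

```python
import re

# Hypothetical two-entry gazetteer: word -> entity type.
GAZETTEER = {"may": "MONTH", "hay": "PLACE"}

def naive_entities(text):
    """Tag every word found in the gazetteer, regardless of how it is used."""
    words = re.findall(r"[A-Za-z]+", text)
    return [(w, GAZETTEER[w.lower()]) for w in words if w.lower() in GAZETTEER]

tags = naive_entities("Theresa May may make hay in Hay next May")
print(tags)
```

Five words are tagged, but only the last two tags (the place “Hay” and the month “May”) are right: the surname, the verb and the crop are all mislabelled, and metadata populated this way would mislead any process that relied on it.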
As a result of these limitations, and reinforced by a history of failures of “magic bullet” applications, in many businesses search still relies on full text searches, though possibly using extensive indexes to give the needed response times (these indexes are Data Management artefacts in their own right rather than metadata for a specific item). Building and maintaining such indexes may involve a substantial computational cost.
This does not mean that there is no hope that metadata will be properly populated. Where the metadata is important for the users’ own processes, they will make sure it is correctly populated. Where users are properly trained and motivated – where they see their work as being of value to others – then there is a hope of metadata being well populated. Perhaps the most effective populators are professional curators of data, such as librarians, although many accounting systems see them as a cost rather than a benefit.
The sources of trust may be as simple as the name of the person who wrote the document or as complex as a list of approvals together with links showing that the approvers had the roles required by the process and that they met the qualifications required at the time of the approval – although the latter is getting into the realms of an integrated process management system.
Trust typically focuses on personal integrity, on technical integrity or on legal integrity. Metadata for personal integrity depends on recording who did what action and providing any necessary links to show that they had the right to take that action. Technical integrity typically focuses on showing that the data is unchanged (e.g. checksum) although more advanced approaches include validation properties which show that the software is interpreting the data correctly – e.g. a 3D shape has the same volume (See LOTAR, EN 9300/AS 9300). Legal integrity involves showing that the information is what it purports to be, and may use technical means such as digital signatures. However, evidential weight is a matter of showing that the data can be trusted as evidence in a court of law, which is more a matter of demonstrating that the organisation’s processes are sound and being used correctly, with audit trails linking the subject item to the processes (See BIP 0008).
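The simplest form of technical integrity metadata, a checksum, can be sketched with Python's standard hashlib module. The choice of SHA-256, the sample content and the function names are assumptions for illustration; the standards cited above define their own requirements.

```python
import hashlib

def checksum(data: bytes) -> str:
    """Return the SHA-256 digest of the content as a hexadecimal string."""
    return hashlib.sha256(data).hexdigest()

def is_unchanged(data: bytes, recorded_checksum: str) -> bool:
    """Compare the content as it is now with the checksum held as metadata."""
    return checksum(data) == recorded_checksum

# Hypothetical subject item; the checksum is recorded when it is approved.
original = b"Approved design report, issue 3"
recorded = checksum(original)

print(is_unchanged(original, recorded))
print(is_unchanged(original + b" (edited)", recorded))
```

A matching checksum shows only that the bytes are unchanged; it says nothing about who approved the item or whether the software reading it interprets it correctly, which is why the other forms of trust metadata are still needed.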
There is a relationship between provenance and trust, but the two are not the same. Provenance shows where information came from, but not whether the sources, processes or people on the way are trustworthy. The W3C standard for Provenance provides a starting point for those interested in the details.
Metadata provides a summary of the business knowledge needed to manage a subject item. The subject of that knowledge is how the business operates rather than the information contained in the subject item – it is about who wants to know that information, whether they can access it and how much they can trust it. The acronym MUST – Management, Usage control, Search and Trust – is a heuristic, pointing to the processes to focus on.
Metrics tell us how large and complex the business is, and therefore how extensive and complex the metadata should be – too simple and the metadata will not do the job, too complex and it becomes a burden and an unnecessary expense. These metrics measure knowledge, something that is hard to quantify; consequently, the metrics are currently pointers to the approach needed, rather than hard and fast rules.
This paper is a starting point for Metadata and Information Management. TheSeanBarker Ltd is in the process of developing more extensive detail. Future papers will consider how to deal with business and software change, with information management metrics, and with particular types and collections of metadata, such as Dublin Core.
¹ See “The Hobbit” by J.R.R. Tolkien for this usage of this Old English word.