Aleph’s dataset creation service produces bespoke, structured datasets where none currently exist, helping to fill knowledge gaps and measure the hard-to-quantify.
We start by mapping out the analysis that the data will support, and by auditing existing datasets to understand what the client already holds, what is readily available, and what could be obtained from other sources. Where novel structured data must be generated, we assess how feasible different approaches would be, which enables us to conduct a cost-benefit analysis of creating the new dataset. Finally, we consider how these disparate sources of data might be adapted, augmented or combined to meet the client’s requirement.
The approaches we have used to create data include ontology and schema design, web scraping, crowdsourcing, human-machine labelling, data enrichment and machine learning-based data synthesis. We can deliver data in a wide range of formats to suit our clients’ requirements.
01
Our client, a government research agency, required a gold-standard labelled dataset on which to train domain-specific natural language processing (NLP) tools.
We employed an innovative hybrid approach to create entity and relationship text annotations at scale. This approach combined state-of-the-art automated information extraction techniques with expert human judgements and crowdsourced annotations to produce high-confidence annotations of open-source text data. The resulting dataset was distributed among the relevant research community and used to train the next generation of NLP tools for this area of government.
02
Working on behalf of our client in the insurance industry, Aleph augmented an existing dataset with values to predict the financial damage caused by terrorist events.
To add the previously unavailable data, we developed a novel modelling approach that took existing data and a small number of expert-provided values and predicted the new values, together with probabilistic ranges. The resulting augmented dataset was used by our client to advise its own clients on the likely requirement for coverage.
03
For a prominent think tank, Aleph produced a unique dataset that aggregated a series of indicators to help chart diplomatic tensions between a nation and the international community on a specific issue.
This dataset drew on a novel approach to classifying rhetoric as reported in the media, and combined the results with other data and dynamically elicited expert judgements to produce an index. The community studying this issue used the dataset to track diplomatic tensions over time, supporting more objective analysis and policy advice.