We turn text into insights

Services

We organize the flow of unstructured text into searchable and analyzable data.

Extraction

We build focused crawlers, near real-time crawlers and web scrapers, or use available APIs, to fetch, clean and deduplicate relevant data for you.

Annotation

We build or adapt content annotation tools and manage the annotation process to prepare data for your machine learning projects.

Analysis

We create processing pipelines to extract insights from your texts and export them to the format or data store of your choice.

Retrieval

We configure and integrate search servers for search, analysis and visualisation.

Cases

Media Intelligence

Each industry has a set of specific websites publishing news, events and other business-related content. The amount of such content is often overwhelming. To track product launches, changes in corporate structure and similar events in a timely manner, an automated monitoring system is needed. Such a tool should be capable of timely information gathering and analysis, turning a torrent of news into a well-organized set of facts.

Take the pharmaceutical industry as an example. A number of highly specialised industry news sites publish information about new drug releases and their trials. Alongside this information, a very different but still pharma-related type of article is published: reports about management changes, company mergers and acquisitions. Both types of articles are highly relevant for anyone following the state of the pharma industry. Yet keeping track of such diverse information is difficult.

To address issues like this, TokenMill builds media intelligence systems capable of automatically classifying articles by type. In this case two classes are introduced: business events and drug production events. For articles in the drug production class we then run a text analysis engine to identify drug names and ingredients, disease names and symptoms, and drug testing stages. For the business articles we detect company names, locations and person names. Once all this information is extracted, business insights about pharma products and corporate events can be generated.
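The two-stage flow described above (classify the article first, then run class-specific extraction) can be sketched roughly as follows. This is a minimal illustration, not TokenMill's actual system: the keyword lists, the gazetteers and the function names `classify`, `extract` and `analyse` are all hypothetical, and a production pipeline would more likely use trained classifiers and entity recognisers. Person name detection is left out because it cannot be done sensibly with a fixed word list.

```python
import re

# Hypothetical keyword lists; a real system would use a trained classifier.
CLASS_KEYWORDS = {
    "drug_production": {"trial", "drug", "phase", "dosage", "approval"},
    "business": {"merger", "acquisition", "ceo", "appointed", "acquires"},
}

# Hypothetical gazetteers standing in for trained entity recognisers.
GAZETTEERS = {
    "drug_production": {
        "drug": ["aspirin", "ibuprofen"],
        "disease": ["migraine", "arthritis"],
        "trial_stage": ["phase i", "phase ii", "phase iii"],
    },
    "business": {
        "company": ["acme pharma", "examplecorp"],
        "location": ["vilnius", "london"],
    },
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def classify(text):
    """Pick the class whose keywords occur most often in the text."""
    tokens = tokenize(text)
    scores = {
        label: sum(token in keywords for token in tokens)
        for label, keywords in CLASS_KEYWORDS.items()
    }
    return max(scores, key=scores.get)


def extract(text, label):
    """Look up the gazetteer terms of the given class as whole words."""
    lowered = text.lower()
    found = {}
    for entity_type, terms in GAZETTEERS[label].items():
        matches = [
            term for term in terms
            if re.search(r"\b" + re.escape(term) + r"\b", lowered)
        ]
        if matches:
            found[entity_type] = matches
    return found


def analyse(text):
    """Classify an article, then run the extraction for its class only."""
    label = classify(text)
    return {"class": label, "entities": extract(text, label)}


if __name__ == "__main__":
    print(analyse("Ibuprofen enters Phase III trial for migraine treatment."))
    print(analyse("Acme Pharma acquires ExampleCorp; new CEO appointed in London."))
```

Routing on the class keeps each extractor small and domain specific: the drug vocabulary is never applied to a merger announcement, and vice versa.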

Enterprise Semantic Search

Search is not a trivial task of matching a user’s keywords against the set of available documents. The main problem search has to solve is that of relevance: how to determine which documents best match the user’s information needs? Web pages are hyperlinked, and there are lots of them. This information can be used to devise algorithms which work well for web content. Yet enterprise search, which deals with the documents used and produced by an organization's internal systems, cannot rely on such techniques.
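When link structure is not available, relevance has to be estimated from the content of the documents themselves. One classic content-only baseline is TF-IDF scoring, sketched below. This is an assumption chosen for illustration, not a description of how TokenMill's search solutions work, and it is a purely lexical technique rather than a semantic one. The function names and the sample documents are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank(query, documents):
    """Return indices of matching documents, best TF-IDF score first."""
    doc_tokens = [tokenize(doc) for doc in documents]
    n_docs = len(documents)

    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in doc_tokens:
        df.update(set(tokens))

    query_terms = tokenize(query)
    scored = []
    for idx, tokens in enumerate(doc_tokens):
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if term in tf:
                # Rare terms get a higher weight than common ones.
                idf = math.log(1 + n_docs / df[term])
                score += (tf[term] / len(tokens)) * idf
        if score > 0:
            scored.append((score, idx))

    # Highest score first; ties are broken by original document order.
    return [idx for score, idx in sorted(scored, key=lambda p: (-p[0], p[1]))]


if __name__ == "__main__":
    docs = [
        "Food safety regulation for dairy products",
        "Annual leave policy for employees",
        "Veterinary inspection regulation and food labelling rules",
    ]
    print(rank("food regulation", docs))
```

A baseline like this only sees shared words, so a query and a document that use different terms for the same concept will not match.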

Text Understanding

Increasingly, organisations take a data-driven approach to decision making. Written text is the prime means of conveying information, and the amount of text relevant to well-informed decision making is growing at an explosive rate. Reading, let alone analysing, this torrent of data therefore becomes a resource-intensive task which, if automated, could lead to significant savings and new business opportunities.

For example, the commodities mining industry publishes reports on, say, the details of gold extraction. These include mine layout descriptions, water consumption rates, depth, commodity densities, regulations applicable in the given situation and many other details, all provided as descriptive text. The facts in those reports and their interrelations directly affect the value of the mine. Similarly, medical reports on the treatment of various diseases contain descriptions of symptoms, drugs and disease stages. All of this is presented in an unstructured narrative form. Yet when it comes to processing large numbers of such documents with the aim of extracting common patterns, manual analysis by human beings falls short.

Language might appear too nuanced and too complicated for a machine to understand. At TokenMill we build domain-specific text understanding systems capable of extracting meaningful patterns. We combine artificial intelligence and rule-based approaches to extract facts, relations and patterns from the documents in a given domain. Thus the unmanageable flow of textual information is turned into valuable insights.

Content Annotation

Unstructured data, be it text, audio or video, needs to be annotated manually: descriptive tags are attached to the whole content, individual parts of it are marked as belonging to a certain type, and the things, persons, organizations or locations mentioned in a text are labeled. To produce quality annotations, a whole range of issues needs to be solved.

Different annotation tasks are best done with different tools: annotating each word in a text is very different from labeling a whole article as belonging to a certain class. What constitutes an element to be annotated is another issue. Does the expression “Vilnius - London flight” contain mentions of two locations, or is it a single entity, a route? If one is tasked with annotating opinions, then coming up with a definitive judgment might not be so straightforward.
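The “Vilnius - London flight” question becomes concrete once annotations are written down as character spans. The sketch below shows the two competing readings side by side. The `Span` structure, the `LOCATION` and `ROUTE` labels and the helper names are hypothetical and chosen only for illustration; real annotation tools use their own formats.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """A labeled stretch of text, given as [start, end) character offsets."""
    start: int
    end: int
    label: str


def span_for(text, phrase, label):
    """Build a span for the first occurrence of phrase in text."""
    start = text.index(phrase)
    return Span(start, start + len(phrase), label)


def covered_text(text, spans):
    """Return the surface strings that the spans point at."""
    return [text[span.start:span.end] for span in spans]


text = "Vilnius - London flight"

# Reading 1: the expression mentions two separate locations.
two_locations = [
    span_for(text, "Vilnius", "LOCATION"),
    span_for(text, "London", "LOCATION"),
]

# Reading 2: the expression is a single entity, a route.
single_route = [span_for(text, "Vilnius - London", "ROUTE")]

if __name__ == "__main__":
    print(two_locations, covered_text(text, two_locations))
    print(single_route, covered_text(text, single_route))
```

Both readings are valid annotations of the same text, which is why the choice between them has to be fixed before annotation starts rather than left to each annotator.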

To overcome these difficulties, at TokenMill we employ different annotation tools for different tasks. We use annotation guidelines to help annotators tag elements of the text consistently. As a result we build quality corpora of annotated texts, which we can then use for machine learning tasks or as a high-quality reference point for algorithm evaluation. In cases where no automation is possible, we provide clients with annotated texts as the final result of unstructured data analysis.

Novel cases we’d love to work on

Natural language generation is an area we’d very much like to expand into. Our current work aims at making the machine understand human language. We’d very much like to master the opposite process: making the machine turn numbers and other structured data into a flow of well-written text.

Take the simple example of a weather report. On a given date we might have the following weather data points: temperature = +32 °C, precipitation = 0%, humidity = 30%, wind = 4 m/s. These can be taken as the input to a text generation system, which would produce something like “It’s very hot outside, stay in the shade as much as possible. Leave your umbrella at home and pack a bottle of water instead.” Text like that could be generated with slight variations for each user depending on their preferences, and each slight change in the weather data points would yield slightly different text. Or take massively multiplayer online games (fantasy sports is another case), with hundreds if not thousands of battles and encounters happening in their vast universes. Battle reports could be generated for fans and spectators of such games. Finance, engineering and other disciplines surely have cases where a narrative conveys a message better than a graph.
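The weather example can be sketched as a small rule-and-template generator. This is only one simple way to approach the task, under assumed thresholds: the cut-off values, the wording of the alternative sentences and the function name `weather_report` are all hypothetical. Only the sentences for the hot, dry case come from the example above.

```python
def weather_report(temperature_c, precipitation_pct, humidity_pct, wind_ms):
    """Turn four weather data points into a short piece of advice.

    All thresholds below are illustrative assumptions.
    """
    hot = temperature_c >= 30
    sentences = []

    # Temperature decides the opening sentence.
    if hot:
        sentences.append(
            "It's very hot outside, stay in the shade as much as possible."
        )
    elif temperature_c <= 0:
        sentences.append("It's freezing outside, dress warmly.")
    else:
        sentences.append("The temperature is mild today.")

    # Precipitation decides the umbrella advice.
    if precipitation_pct < 20:
        advice = "Leave your umbrella at home"
        if hot:
            advice += " and pack a bottle of water instead"
        sentences.append(advice + ".")
    else:
        sentences.append("Take an umbrella with you.")

    # Humidity and wind add optional remarks.
    if humidity_pct >= 80 and temperature_c >= 25:
        sentences.append("The air will feel muggy.")
    if wind_ms >= 10:
        sentences.append("Expect strong wind.")

    return " ".join(sentences)


if __name__ == "__main__":
    print(weather_report(32, 0, 30, 4))
    print(weather_report(12, 60, 85, 12))
```

Because each data point controls its own sentence, a small change in the input changes only part of the output, which matches the idea that slightly different data yields slightly different text.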

Clients

Glaucusis Solutions

NLP pipeline with focused crawler, financial instruments related event detection and time series database population.

State Consumer Rights Protection Authority

Semantic search engine for consumer rights protection legislation.

Venture Radar

NLP pipeline with focused crawler and venture capital funding event detection.

Weborama

NLP library used as part of Weborama’s media monitoring package.

Startup

NLP application for scientific article analysis to identify concepts and discourse structure.

Kaunas University of Technology

Infrastructure for Lithuanian language processing: focused crawler, named entity recognition, text classification, clustering, deduplication, text similarity estimation and sentiment analysis.

State Food and Veterinary Service

Semantic search engine for food and veterinary regulations.

Social Artisan

NLP pipeline with focused web and social media crawler, named entity recognition, sentiment analysis and article classification.

ROI: Recruit

NLP pipeline with focused crawler, job advertisement identification and contact person recognition.

Startup

Pharmaceutical news monitoring system with domain specific named entity recognition and article storyline detection.

Startup

NLP pipeline to analyze scientific articles about microelectromechanical systems (MEMS) and populate MEMS ontologies.

Startup

NLP pipeline to detect commodities mining industry events.

Criminal Police (Lithuania)

NLP pipeline to monitor certain types of web pages for signs of criminal activities.

Lithuanian Agency for Science, Innovation and Technology (MITA)

Open source projects: a word stemmer for the Lithuanian language and a page function identification algorithm.
Research into the classification of customer care messages.

Technologies we use

We have solid experience with open source technologies covering the whole cycle of text processing needs.

About

TokenMill is a specialised technology consultancy with a focus on knowledge extraction from unstructured data.

We work with our clients to rapidly develop deep and lasting insights for their business from both private and public sources, be it information published on the web or internal documents.

Founded in 2010, TokenMill has worked with a variety of clients, from startups to corporations to government agencies.

As a result we have diverse and deep experience in helping you to manage and organize your unstructured information assets.

Our Team

Žygimantas Medelis

CEO

Donatas Remeika

Chief Text Miller

Dainius Jocas

Text Miller

Contacts

Phone: +370 699 77035

E-mail: info@tokenmill.lt

Registration address: Lvovo 13-12, Vilnius, Lithuania, LT-09313

Correspondence address: Aušros Vartų g. 12, Vilnius, Lithuania, LT-01303

Company Code: 302561138

VAT: LT100005720211