Unlocking insights through integrated text processing


TokenMill is a specialist software development consultancy focusing on integrated text processing: crawling, retrieval, natural language processing (NLP), search, and natural language generation (NLG). Since 2010 we have built significant expertise in these areas by helping our clients develop internal and external solutions in domains such as finance, business intelligence, and e-commerce.


We organize the flow of unstructured text into searchable and analyzable data.


We build focused crawlers, near-real-time crawlers, and web scrapers, or use APIs, to fetch, clean, and deduplicate relevant data for you.
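As an illustration, the deduplication step can be as simple as hashing normalized text and keeping the first copy seen. This is a minimal sketch with hypothetical function names; real pipelines typically add near-duplicate detection (e.g. MinHash or shingling) on top of exact matching:

```python
import hashlib

def normalize(text: str) -> str:
    # Collapse whitespace and lowercase so trivially different copies match.
    return " ".join(text.lower().split())

def deduplicate(documents: list[str]) -> list[str]:
    """Keep only the first occurrence of each distinct document."""
    seen: set[str] = set()
    unique: list[str] = []
    for doc in documents:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique
```

Hashing the normalized form rather than the raw text means that casing and whitespace differences, common when the same article is scraped from several sources, do not produce spurious duplicates.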


We build or adapt content annotation tools and manage the annotation process to prepare data for your machine learning projects.


We create processing pipelines to extract insights from your texts and export them to the format or data store of your choice.


We configure and integrate search servers for search, analysis and visualisation.


Media Intelligence

Each industry has a set of specific web sites which publish news, events, and other business-related content. The amount of such content is often overwhelming. To track product launches, changes in corporate structure, and other similar events in a timely manner, an automated monitoring system is needed. A media intelligence tool gives you the capability of timely information gathering and analysis, turning a stream of news into a well-organized set of facts.

Let us take the pharmaceuticals industry as an example. There is a list of highly specific industry news sites publishing information about new drug releases and their trials. Alongside this information a very different, but still pharma-related, type of article is published - reports about management changes, company mergers, and acquisitions. Both types of articles are very relevant for anyone following the state of the pharma industry. Yet keeping track of such diverse information is difficult.

To address issues like this, TokenMill builds media intelligence systems capable of automatically classifying articles by type. In this case two classes are introduced: business events and drug production events. For the drug-production class of articles we then run a text analysis engine to identify drug ingredients and their names, disease names and symptoms, and drug testing stages. For the business articles we detect company names, locations, and person names. Once all this information is extracted, business insights about pharma products and corporate events can be generated.
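The two-stage flow described above - classify first, then extract entities for the chosen class - can be sketched in miniature. The keyword lists and gazetteer here are toy assumptions; a production system would use trained classification and NER models:

```python
import re

# Toy term lists standing in for a trained classifier's learned features.
BUSINESS_TERMS = {"merger", "acquisition", "ceo", "appointed"}
DRUG_TERMS = {"trial", "dosage", "compound", "fda", "approval"}

def classify(article: str) -> str:
    """Assign an article to one of the two classes by keyword overlap."""
    words = set(re.findall(r"[a-z]+", article.lower()))
    business = len(words & BUSINESS_TERMS)
    drug = len(words & DRUG_TERMS)
    return "drug-production" if drug >= business else "business"

def extract_entities(article: str, gazetteer: dict[str, str]) -> list[tuple[str, str]]:
    """Find known names in the text; a real system would use a trained NER model."""
    lowered = article.lower()
    return [(name, label) for name, label in gazetteer.items()
            if name.lower() in lowered]
```

Running the extraction stage only on articles of the relevant class keeps each analysis engine focused on the vocabulary it was built for.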

Enterprise Semantic Search

Search is not a trivial task of matching a user’s keywords with a set of available documents. The main problem search has to solve is relevance - how to determine which documents best match users' information needs? We can help you build enterprise-wide search tools tailored to your business needs.
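To make the relevance problem concrete, here is a toy TF-IDF ranker: documents that mention the query terms often, and in terms that are rare across the collection, score higher. This is an illustration only - in practice relevance is handled by a configured search server, not hand-rolled scoring:

```python
import math
from collections import Counter

def tokenize(text: str) -> list[str]:
    return text.lower().split()

def rank(query: str, documents: list[str]) -> list[int]:
    """Return document indices ordered from most to least relevant (TF-IDF)."""
    tokenized = [tokenize(d) for d in documents]
    n = len(documents)

    def idf(term: str) -> float:
        # Rare terms carry more signal; smoothing avoids division by zero.
        df = sum(1 for doc in tokenized if term in doc)
        return math.log((n + 1) / (df + 1)) + 1

    scores = []
    for i, doc in enumerate(tokenized):
        counts = Counter(doc)
        score = sum(counts[t] / len(doc) * idf(t) for t in tokenize(query))
        scores.append((score, i))
    return [i for _, i in sorted(scores, reverse=True)]
```

Production search servers layer much more on top of this - field boosts, phrase proximity, synonyms, language analysis - which is exactly the configuration and integration work described above.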

Text Understanding

Organisations use data-driven approaches to decision making. Written text is the prime means of conveying information, and the amount of text relevant to well-informed decision making is increasing at an explosive rate. Therefore, reading - let alone analysing - all this data becomes a resource-intensive task which, if automated, could lead to significant savings and new business opportunities.

For example, the gold mining industry publishes reports on gold extraction details. These include mine layout descriptions, water consumption rates, depth, commodity densities, applicable regulations, and many other details - all provided in descriptive text. The facts in those reports and their interrelations directly impact the market's valuation of mining companies. Similarly, medical reports on the treatment of various diseases contain descriptions of symptoms, drugs, and disease stages, all presented in an unstructured narrative form. Yet when it comes to processing vast numbers of such documents with the aim of extracting common patterns, manual analysis by human beings falls short.

Language might appear too nuanced and too complicated for a machine to understand. At TokenMill we build domain-specific text understanding systems capable of extracting meaningful patterns. We combine artificial intelligence tools and semantic rules to extract facts, relationships, and patterns from documents in a given domain. Thus the unmanageable flow of textual information is turned into valuable insights.
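The semantic-rule side of such a system can be pictured as patterns that turn narrative statements into structured facts. The patterns and field names below are hypothetical, chosen to echo the mining-report example; a real system combines many such rules with trained models:

```python
import re

# Hypothetical extraction rules for the mining-report example.
PATTERNS = {
    "mine_depth_m": re.compile(
        r"depth of (\d+(?:\.\d+)?)\s*(?:m|metres|meters)\b", re.IGNORECASE),
    "water_use_m3": re.compile(
        r"(\d+(?:\.\d+)?)\s*cubic metres of water", re.IGNORECASE),
}

def extract_facts(report: str) -> dict[str, float]:
    """Scan a report for known fact patterns and return them as structured data."""
    facts: dict[str, float] = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(report)
        if match:
            facts[name] = float(match.group(1))
    return facts
```

Once facts are structured like this, they can be compared, aggregated, and fed into downstream analysis - the step where narrative text becomes an analyzable dataset.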

Content Annotation

Unstructured data - be it text, audio, or video - needs to be annotated manually: descriptive tags attached to the content as a whole, individual parts of it marked as belonging to a certain type, and persons, organizations, or locations mentioned in the text labeled. To produce quality annotations, a whole range of issues needs to be solved.

Different annotation tasks are best done with the help of different tools - annotating each word in a text is very different from labeling a whole article as belonging to a certain class. What constitutes an element to be annotated is another issue. Does the expression “Vilnius - London flight” mention two locations, or is it a single entity - a route? And if one is tasked with annotating opinions, coming up with a definitive judgment might not be so straightforward.

To overcome those difficulties, at TokenMill we employ different annotation tools for different tasks. We use annotation guidelines to help annotators tag elements of the text consistently. As a result we build quality corpora of annotated texts, which we can then use for machine learning tasks or as a high-quality reference point for algorithm evaluation. In cases where no automation is possible, we provide clients with annotated texts as the final result of unstructured data analysis.
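Annotation consistency can be made measurable. The sketch below - a hypothetical span representation, not any particular tool's format - records each annotation as character offsets plus a label, and scores how closely two annotators agree (exact-match F1):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Span:
    start: int   # character offset where the annotation begins
    end: int     # character offset just past its end
    label: str   # e.g. "LOCATION" or "ROUTE" for the flight example

def agreement(ann_a: set[Span], ann_b: set[Span]) -> float:
    """Exact-match F1 between two annotators' span sets (1.0 = full agreement)."""
    if not ann_a and not ann_b:
        return 1.0
    overlap = len(ann_a & ann_b)
    return 2 * overlap / (len(ann_a) + len(ann_b))
```

A low agreement score is usually a signal that the guidelines need sharpening - for instance, deciding up front whether “Vilnius - London flight” is two locations or one route.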

Novel cases we’d love to work on

Natural Language Generation is an area we are very interested in expanding into. Our current work aims at making machines understand human language. We’d very much like to master the opposite process - making machines turn numbers and other structured data into a flow of well-written text.

Take a simple example of a weather report. On a given date we might have the following weather data points: temperature = +32C, precipitation = 0%, humidity = 30%, wind = 4 m/s. This can be taken as input by a text generation system, which would produce something like “It’s very hot outside, stay in the shade as much as possible. Leave your umbrella at home and pack a bottle of water instead.” Text like that, with slight variations, could be generated for each user depending on his or her preferences, and each slight change in the weather data points would yield slightly different text.

Or take massively multiplayer online games (fantasy sports is another case) with hundreds if not thousands of battles and encounters happening in their vast universes. Battle reports could be generated for fans and spectators of such games. Finance, engineering, and other disciplines have cases where a narrative conveys a message better than a chart or a table.
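The weather example can be sketched as a template-based generator - the simplest form of NLG, shown here purely as an illustration with made-up thresholds:

```python
def weather_report(temperature_c: float, precipitation_pct: float,
                   humidity_pct: float, wind_ms: float) -> str:
    """Turn weather data points into a short narrative report."""
    sentences = []
    if temperature_c >= 30:
        sentences.append("It's very hot outside, stay in the shade as much as possible.")
    elif temperature_c <= 0:
        sentences.append("It's freezing, dress warmly.")
    else:
        sentences.append("Temperatures are mild today.")
    if precipitation_pct == 0:
        sentences.append("Leave your umbrella at home and pack a bottle of water instead.")
    elif precipitation_pct > 50:
        sentences.append("Rain is likely, take an umbrella.")
    if wind_ms > 10:
        sentences.append("Expect strong winds.")
    return " ".join(sentences)
```

Feeding in the data points from the example (32, 0, 30, 4) reproduces the sample report; change any input and the narrative changes with it, which is exactly the per-user, per-datapoint variation described above.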


Venture Radar (UK)

NLP pipeline with crawler and venture capital funding event detection.

Weborama (FR)

NLP library used as part of Weborama’s media monitoring package.

SaasMAX (US)

Custom company web page crawl to extract information about company activities.

ROI: Recruit (SE)

NLP pipeline with crawler, job advertisement identification and contact person recognition.

Orbit Financial Technology (UK)

NLP pipeline with crawler. Event detection related to financial instruments. Timeseries database population.

State Consumer Rights Protection Authority (LT)

Semantic search engine for customer rights protection legislation.

Kaunas University of Technology (LT)

Infrastructure for Lithuanian language processing: crawler, named entity recognition, text classification, clustering, deduplication, text similarity estimation and sentiment analysis.

State Food and Veterinary Service (LT)

Semantic search engine for food and veterinary regulations.

Social Artisan (UK)

NLP pipeline with web and social media crawler, named entity recognition, sentiment analysis and article classification.

Lithuanian Criminal Police Department (LT)

NLP pipeline to monitor certain types of web pages for signs of criminal activities.

Lithuanian Agency for Science, Innovation and Technology (MITA)

Open Source projects: word stemmer for Lithuanian language and page function identification algorithm.
Research into customer care messages classification.

Contact Us

TokenMill UAB

Phone: +370 699 77035

E-mail: info@tokenmill.lt

Registered address: Lvovo 13-12, Vilnius, Lithuania, LT-09313

Correspondence address: Pylimo 5-4, Vilnius, Lithuania, LT-01117

Company Code: 302561138

VAT: LT100005720211