Introduction

Recently, we were tagging a lot of texts with spaCy in order to extract organization names. We faced a problem: many entities tagged by spaCy were not valid organization names at all. This wasn't really a fault of spaCy itself: at first sight, all the extracted entities did look like organization names. The results could have been better if we had trained the spaCy models further, but that would have required a large corpus of properly labeled data with proper context. So we needed a simpler way to filter out the wrong data.

The approach we decided to try was based on a simple consideration: perhaps, if we looked into the texts, we would find a list of words that are used more frequently alongside organization names, for example, words like corporation, ltd, co., LLC, foundation, group, etc. What we needed was:

  1. A big enough list of those words for better data quality
  2. To be sure that these words really occur in the real world

Collecting Word Statistics

To test our hypothesis, we collected frequencies of individual words used alongside a few manually labeled company names. We should mention that we actually tried two approaches. The first was to directly count the number of mentions of a specific word around a specific named entity. However, after we built a model on this kind of data, it turned out that it did not perform well: it barely passed the 50% precision threshold, and recall dropped to nearly 40%. So we modified the approach, and it showed noticeably better results.


To improve our approach, we took the distances between words into account. Here's how we did it: we still collect the raw frequency of a word, but we also collect its weighted frequency (weight), defined in the following way:

weight = exp(-dist² / α)

where dist is an integer distance measured in the number of words between a word and a named entity, and α is a parameter that controls how quickly a word's weight decays as the distance grows. After experimenting with a small set of examples, we chose α to be 15.
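To get a feel for how quickly the weight decays, here is a quick check of the same weighting function on a few distances (values rounded to three decimal places):

```python
import numpy as np

def dist_weight(dist, alpha = 15):
    # weight = exp(-dist^2 / alpha), as in the formula above
    return np.exp(-(dist ** 2) / alpha)

weights = [round(float(dist_weight(d)), 3) for d in (0, 1, 3, 5, 10)]
# distances 0, 1, 3, 5, 10 -> weights 1.0, 0.936, 0.549, 0.189, 0.001
```

A word ten positions away contributes almost nothing, which is why a hard cutoff at larger distances costs very little.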

Therefore, if a word often appears close to a named entity, its weighted frequency will be relatively high; otherwise, it will be low. We also decided to ignore words that are more than 30 words away from the named entity, since they are unlikely to be related to it. Here is the Python code with which we collected stats for all individual words. It could be slightly improved, but we decided it was good enough for one-time use.

import re
import numpy as np

from bs4 import BeautifulSoup
from IPython import display

def dist_weight(x, alpha = 15):
    """you may choose different alpha and see what happens.
    """

    return np.exp(-(x ** 2) / alpha)

word_re = re.compile(r"(?:^|[-\\|,.()]|\s+)([a-zA-Z][a-zA-Z]+)") # regex for finding words (two or more Latin letters) in text

def append_words_stats(words_stats, text, ne):
    words = [(text[match.span(1)[0] : match.span(1)[1]].lower(),
              match.span(1)[0],
              match.span(1)[1],
              idx) for idx, match in enumerate(word_re.finditer(text))] # find all words in text and their position
   
    for ne_match in re.finditer(re.escape(ne[0].lower()), text.lower()): # for all positions of the named entity in text
        ne_start = ne_match.span(0)[0]
        ne_word_num = None

        # There were cases when the named entity was present in a text but was not a word in terms of the above regex.
        # Since we are not interested in absolute precision, we decided to bypass this case in a simple way.
       
        for word in words:
            if word[1] <= ne_start < word[2]:
                ne_word_num = word[3]
                break
               
        if ne_word_num is not None: # a plain truthiness check would wrongly skip the word at index 0
            for word in words:
                dist = abs(ne_word_num - word[3])
                if dist > 30:
                    continue
                   
                if word[0] not in words_stats:
                    words_stats[word[0]] = {
                        'org' : 0,
                        'org_f' : 0,
                        'non_org' : 0,
                        'non_org_f' : 0,

                    }

                ne_type = ("non_" if ne[1] == 0 else "") + "org"
                words_stats[word[0]][ne_type] += dist_weight(dist)
                words_stats[word[0]][ne_type + "_f"] += 1
           
words_stats = {}

for idx, filename in enumerate(filenames):
    with open(filename, "r", encoding = 'utf-8') as file:
        soup = BeautifulSoup(file.read(), "html.parser")
       
        for p in soup.find_all("p"):
            for ne in nes:
                if ne[0].lower() in p.text.lower():
                    append_words_stats(words_stats, p.text, ne)
           
    display.clear_output(wait = True)
    print("{0} of {1} files passed.".format(idx + 1, len(filenames)))
    print("{0} words found.".format(len(words_stats)))

Note that nes here is a list of two-element tuples, where the first element is the named entity text itself, and the second is a binary value denoting whether the named entity is an organization name (we labeled those manually). filenames is a list of the files that we saved while searching for each named entity.
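For clarity, here is a hypothetical example of the shape these two inputs are assumed to have (the entities and file names below are made up for illustration):

```python
# (named entity text, 1 if it is an organization else 0)
nes = [("Siemens AG", 1), ("John Smith", 0)]

# saved search-result pages, one or more per named entity
filenames = ["pages/siemens_ag_0.html", "pages/john_smith_0.html"]
```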

After all the files were processed, we converted the collected stats into a pandas DataFrame, words_stats_df (see the code below). Then we created two new columns with the average weights of words per 1000 usages alongside names of organizations and non-organizations. Then, for each word, we calculated a value words_stats_df["ir"], our curiosity rate. The curiosity rate shows how many times closer, on average, a word was to organizations than to non-organizations. Of course, a lot of the data contains zeros, so inf or zero can appear when calculating the new columns. For this reason, we passed the calculated value through a sigmoid function with an offset. Thus, if a word is roughly equally likely to be used with names of organizations and non-organizations, its curiosity rate will be close to 0.5. The interesting words are those with a curiosity rate close to one or zero.

import pandas as pd

words_stats_df = pd.DataFrame\
    .from_dict(words_stats,
               orient = "index",
               columns = ['org', 'org_f', 'non_org', 'non_org_f'])\
    .reset_index()\
    .rename(columns = {'index' : 'word'})

words_stats_df["org_n"] = (words_stats_df["org"] / words_stats_df["org_f"]) * 1000                 
words_stats_df["non_org_n"] = (words_stats_df["non_org"] / words_stats_df["non_org_f"]) * 1000

def sigmoid(x, alpha = 1, offset = 0):
    return 1 / (1 + np.exp(-alpha * (x - offset)))

words_stats_df["ir"] = ((words_stats_df["org"] * words_stats_df["non_org_f"]) / \
                      (words_stats_df["org_f"] * words_stats_df["non_org"]))   \
                      .apply(lambda x: sigmoid(x, 5, 1))

words_stats_df.sort_values("ir", ascending = False, inplace = True)
Data collected

(We collected a few additional stats for better analysis)
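As a sanity check, here is the curiosity rate computed on toy numbers (the stats below are made up for illustration):

```python
import numpy as np

def sigmoid(x, alpha = 1, offset = 0):
    return 1 / (1 + np.exp(-alpha * (x - offset)))

# a word seen near organizations with total weight 8.0 over 10 mentions,
# and near non-organizations with total weight 2.0 over 10 mentions
ratio = (8.0 * 10) / (10 * 2.0)   # ratio of average weights = 4.0
ir = sigmoid(ratio, 5, 1)         # close to 1: an "organization" word

# a perfectly neutral word has ratio 1.0 and lands exactly on 0.5
neutral = sigmoid(1.0, 5, 1)
```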

And here is some progress: we can see that words such as subsidiary, merger, corporation, legal, ltd, etc. are indeed at the top of the list. That means we are going in the right direction. Now, having this data, we needed to select a list of words that are frequently used alongside organization names. A small note: this list should consist of unbiased words, i.e. words not related to a specific company's products or activities. Otherwise, these words would not generalize to other organizations.

After briefly going through this dataset, we chose a list of 100 words that we considered characteristic of organizations, and a list of 26 words that we considered characteristic of non-organizations:

orgs_selected_words = [
    'merger', 'conglomerate', 'corp', 'subsidiary', 'corporation', 'stock', 'founder', 'merged', 'acquired', 'subsidiaries', 'legal', 'exchange', 'ltd', 'group', 'co', 'largest', 'renamed', 'revenues', 'supplier', 'venture', 'member', 'interest', 'owns', 'property', 'country', 'inc', 'firm', 'industries', 'division', 'partnership', 'shares', 'owned', 'operations', 'name', 'investment', 'facilities', 'changed', 'manufactured', 'general', 'revenue', 'ownership', 'management', 'cash', 'meeting', 'ranked', 'separated', 'shareholder', 'interests', 'affiliates', 'engaged', 'parent', 'reserved', 'rights', 'patents', 'capitalization', 'enlarge', 'complaining', 'alleged', 'proceed', 'anticipates', 'mergers', 'acquirer', 'wholly', 'demerged', 'merge', 'handing', 'european', 'nasdaq', 'german', 'purchased', 'france', 'biggest', 'investments', 'commission', 'europe', 'managed', 'assets', 'fund', 'senior', 'deal', 'funds', 'traded', 'acquisitions', 'charges', 'subsequent', 'wealth', 'hired', 'leverage', 'journal', 'early', 'bank', 'working', 'ordered', 'world', 'employee', 'contact', 'share', 'firms', 'shortage', 'founded',
]

non_orgs_selected_words = [
    'consumer', 'home', 'buy', 'testing', 'reports', 'offering', 'offer', 'offers', 'special', 'reality', 'followed', 'failed', 'businesses', 'community', 'school', 'purchases', 'complex', 'detailed', 'buying', 'newer', 'events', 'enabled', 'alternative', 'advance', 'upcoming', 'releases',
]

selected_words = orgs_selected_words + non_orgs_selected_words

Collecting Train/Test Data

Now that we had a list of selected words, we needed to collect train and test data for our final model. Here are a few code samples, again in Python, that we used for collecting the data.

Preparing DataFrame for data collection:


ws_df = pd.DataFrame()

ws_df["ne_id"] = 0.
ws_df["is_org"] = 0.

for selected_word in selected_words:
    ws_df[selected_word + "_w"] = 0.
    ws_df[selected_word + "_f"] = 0.

for ne in nes:
    ws_df.loc[ne[2]] = 0
    ws_df.at[ne[2], "is_org"] = ne[1]

ws_df["ne_id"] = ws_df.index # the id column is used later to split by named entity

Here nes is a list of three-element tuples: the first and second elements are the same as above, and the third one is a unique named entity id.
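As before, a hypothetical example of the assumed shape:

```python
# (named entity text, is_org flag, unique named entity id)
nes = [("Siemens AG", 1, 0), ("John Smith", 0, 1)]
```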

We adapted the code from the first code snippet to collect the data, so here are the only things that changed:

def append_selected_words_stats(ws_df, text, selected_words, ne):
    words = [(text[match.span(1)[0] : match.span(1)[1]].lower(),
              match.span(1)[0],
              match.span(1)[1],
              idx,
              1 if text[match.span(1)[0] : match.span(1)[1]].lower() in selected_words else 0)
             for idx, match in enumerate(word_re.finditer(text))]

    # ...................

                if dist > 30:
                    continue

                ws_df.at[ne[2], word[0] + "_w"] += dist_weight(dist)
                ws_df.at[ne[2], word[0] + "_f"] += 1

def collect_words_stats(ws_df, filenames, nes):
    # ....................
                        append_selected_words_stats(ws_df, p.text, selected_words, ne)

        display.clear_output(wait = True)
        print("{0} of {1} files passed.".format(idx + 1, len(filenames)))

collect_words_stats(ws_df, train_filenames, nes)

After the data was collected, we normalized it across the types of statistics:

def preprocess_ws(ws_df):
    for stat_type in ["w", "f"]:
        stat_type_cols = [selected_word + "_" + stat_type for selected_word in selected_words]

        s = ws_df[stat_type_cols].sum(axis = 1)

        for stat_type_col in stat_type_cols:
            ws_df[stat_type_col] /= s
           
        ws_df.fillna(0, inplace = True)
       
    print(ws_df.values.shape)

preprocess_ws(ws_df)
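On a toy frame, this normalization turns raw per-word stats into shares that sum to one within each row, with all-zero rows kept at zero (the column names below are hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({"ltd_w": [2.0, 0.0], "corp_w": [6.0, 0.0]})

cols = ["ltd_w", "corp_w"]
s = toy[cols].sum(axis = 1)   # row sums: 8.0 and 0.0
for col in cols:
    toy[col] /= s             # 0/0 produces NaN for the all-zero row
toy.fillna(0, inplace = True) # which fillna then resets to zero

# first row becomes 0.25 / 0.75; the all-zero row stays all zeros
```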

Then we split the data into train/test sets:

from sklearn.model_selection import train_test_split

train_nes, test_nes = train_test_split(nes, test_size = 0.25)

train_nes = [ne[2] for ne in train_nes]
test_nes  = [ne[2] for ne in test_nes]

train_ws_df = ws_df[ws_df["ne_id"].isin(train_nes)]
test_ws_df = ws_df[ws_df["ne_id"].isin(test_nes)]

Designing a Model

Actually, before trying a machine learning approach, we had tried a simple binary regression model with manually calculated weights (using linear regression analysis formulas). But that model did not give good results, so we decided to try a more complex one: neural networks.

First, we prepared data for the model:

cols_order = [selected_word + "_w" for selected_word in selected_words] + \
             [selected_word + "_f" for selected_word in selected_words]

x_train = train_ws_df[cols_order].values
x_test = test_ws_df[cols_order].values

# The model below has a single sigmoid output, so the targets are
# the binary labels themselves (one-hot encoding is not needed here).
y_train = train_ws_df["is_org"].values.reshape((-1, 1))
y_test = test_ws_df["is_org"].values.reshape((-1, 1))

Then, we designed the model (we compiled it with an SGD optimizer and a binary_crossentropy loss function):

from keras.models import Model
from keras.layers import Input, Dense

input = Input((x_train.shape[1],))

x = input

x = Dense(100, activation = 'relu')(x)
x = Dense(75, activation = 'sigmoid')(x)

output = Dense(1, activation = 'sigmoid')(x)

model = Model([input], [output])

model.summary()

"""
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_1 (InputLayer) (None, 252) 0
_________________________________________________________________
dense_1 (Dense) (None, 100) 25300
_________________________________________________________________
dense_2 (Dense) (None, 75) 7575
_________________________________________________________________
dense_3 (Dense) (None, 1) 76
=================================================================
Total params: 32,951
Trainable params: 32,951
Non-trainable params: 0
_________________________________________________________________
"""

In fact, we tried a few model architectures, but judging from the training results, there was some threshold on the test data that the model could not pass. We chose the architecture with the least overfitting.

 

Training graph (model accuracy over epochs)

We spotted an interesting thing while training the model: regardless of the architecture, the model wasn’t converging with a constant learning rate. It was converging only if we had first trained it for a few epochs with a low learning rate (epochs 1–100), then for a few epochs with a high learning rate (epochs 100–200), and then again with a low learning rate (epochs 200–300). If we kept a constantly low or high learning rate from the beginning, the model accuracy was stable and no better than random guessing.
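The low-high-low schedule described above can be written as a simple epoch-indexed function; the concrete rate values here are illustrative assumptions, and with Keras it would be plugged in via a LearningRateScheduler callback:

```python
# low (epochs 0-99), high (100-199), low again (200-299)
def lr_schedule(epoch):
    if epoch < 100:
        return 0.001
    if epoch < 200:
        return 0.01
    return 0.001

# with Keras:
# from keras.callbacks import LearningRateScheduler
# model.fit(..., callbacks = [LearningRateScheduler(lr_schedule)], epochs = 300)
```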

Analyzing Results

There are a few major reasons why we did not achieve really high precision or recall with this model. Here is a list of those that became obvious after a brief analysis of the data and results:

  1. Small dictionary of selected words. There were organizations alongside which a selected word was used only a few times, even though that word was used quite frequently alongside other organization names. Thus, a bigger dictionary of selected words might help. It should also include more general words for non-organizations.
  2. We were focusing on US-based organizations only, so we used only English words and search terms. However, there were a few organizations (nearly 7% of all) that had none of the selected words anywhere around their names, because the pages returned for those organizations were mostly not in English. This factor can be eliminated by expanding the language scope with translations of the selected words, or with the different words used instead in other languages.
  3. More labeled data. We labeled only ~1600 records, ~750 of which were organization names. Considering the various ways an organization name can appear in a text fragment, this number may be insufficient to achieve high model quality.
  4. Unbalanced data set. We wanted to select words that are not directly related to an organization's activities and products. This, however, did not turn out to be our best decision. For example, we randomly chose a large company that was mostly dealing with banks and investments; to get better results from our model, we would have needed to include a few words related to this activity.
  5. More data in general. This one is doubtful, since we collected train/test data from about 5.5 GB of HTML pages, which seems like a lot. But there are only ~8 HTML pages per named entity, due to the number of search results that Bing returns by default.

Still, at the price of a lower recall for non-organizations (nearly 70% on the test set, which most likely can be improved by solving the problems mentioned above), we got a precision of 75%. This is much better than what we would have achieved with this approach alone, without the machine learning model (47% accuracy).

Conclusion

The results show that the described approach can be applied to named entity classification or filtering. This might be useful when the model used for named entity tagging cannot deal with complex cases such as HTML pages, where not all named entities have a proper text context around them.

Read also the first part of this research: Searching for Related Organizations Addresses with Web-Search, spaCy, and RegExes.

Author:
Volodymyr Sendetskyi, Data Scientist at MindCraft
Information Technology & Data Science