Version 1.0
4th August, 2021
Definition
The sector classification uses the Standard Industrial Classification (SIC) published by the Office for National Statistics (ONS).
There are 4 levels of SIC: Section > Division > Group > Class
This app can currently detect sectors down to the Group level.
In total there are 21 Sections, 99 Divisions and 250+ Groups.
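As an illustration of how the levels nest, the sketch below splits a Group code such as A011 into its Section, Division and Group parts. It assumes the code format used in this document (one Section letter followed by three digits, of which the first two are the Division); the names SicGroup and parse_sic_group are hypothetical and not part of the app.

```python
import re
from typing import NamedTuple


class SicGroup(NamedTuple):
    section: str   # e.g. "A"
    division: str  # e.g. "01"
    group: str     # e.g. "011"


# Assumed format: Section letter (A to U) + 2-digit Division + 1 Group digit.
_GROUP_CODE = re.compile(r"([A-U])(\d{2})(\d)")


def parse_sic_group(code: str) -> SicGroup:
    """Split a Group code like 'A011' into its hierarchy levels."""
    match = _GROUP_CODE.fullmatch(code.strip().upper())
    if match is None:
        raise ValueError(f"not a SIC Group code: {code!r}")
    section, division, group_digit = match.groups()
    return SicGroup(section, division, division + group_digit)


print(parse_sic_group("A011"))
```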
Methodology
1. Supervised machine learning classification
To provide training and validation data sets, legal consultants labelled regulations with the SIC codes that each regulation applies to. The data and labels are then used to train the classifiers with a supervised machine learning approach, in which the paragraph text is combined with the document title to enrich the context available within each paragraph. The feature vectors that represent the textual content are produced by a BERT-based, state-of-the-art natural language processing sentence vectoriser, and a Multi-Layer Perceptron (MLP) classifier performs the classification.
For the SIC Groups that are described by a large number of regulations, we trained a series of binary MLP classifiers, one for each SIC group. This approach was able to predict 174 out of 265 SIC groups.
We measure the accuracy of this methodology in 2 ways: at the model level and at the document level.
(1) Model level:
For each SIC group, we measure the accuracy of its binary model. For example, 98% accuracy for A011 means that the A011 (Growing of non-perennial crops) binary classifier is 98% accurate in predicting whether or not a paragraph applies to A011. The A012 (Growing of perennial crops) binary classifier has its own accuracy, and so on.
The average accuracy of SIC Group prediction at the model level is 96% (averaging the performance of the single SIC group models).
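The model-level figure can be read as an unweighted average of per-model accuracies. The sketch below shows that arithmetic on made-up predictions; the function names and the sample data are illustrative assumptions, not the evaluation code or results behind the 96% figure.

```python
from statistics import mean
from typing import Dict, List, Tuple


def accuracy(predicted: List[bool], actual: List[bool]) -> float:
    """Share of paragraphs where the binary prediction matches the label."""
    if not actual or len(predicted) != len(actual):
        raise ValueError("predicted and actual must be non-empty and equal length")
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


def model_level_accuracy(results: Dict[str, Tuple[List[bool], List[bool]]]) -> float:
    """Average the accuracy of each single SIC group model (unweighted)."""
    return mean(accuracy(pred, act) for pred, act in results.values())


# Hypothetical validation output: group -> (predictions, labels).
results = {
    "A011": ([True, False, True, True], [True, False, True, False]),
    "A012": ([False, False], [False, False]),
}
print(model_level_accuracy(results))
```

Because the average is unweighted, a model validated on few paragraphs counts as much as one validated on many.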
(2) Document level:
For each regulation document, we run the text through the series of binary models, and each model indicates whether it considers the document to apply to its sector. Accuracy at the document level is generally lower than at the model level because several sectors usually apply to a single document.
The methodology assigns a set of SIC groups to each document by aggregating the SIC group predictions from its sections. The accuracy of SIC group prediction at the document level is 84%.
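A minimal sketch of the document-level step follows. It assumes the aggregation is a simple union of per-paragraph predictions, which this document does not confirm, and it uses keyword functions as hypothetical stand-ins for the trained binary MLP models. The title is prepended to each paragraph, as described in the supervised methodology.

```python
from typing import Callable, Dict, Iterable, Set

# A binary model answers: does this text apply to my SIC group?
BinaryModel = Callable[[str], bool]


def predict_document_groups(
    title: str, paragraphs: Iterable[str], models: Dict[str, BinaryModel]
) -> Set[str]:
    """Run every paragraph through every binary model and union the hits."""
    groups: Set[str] = set()
    for paragraph in paragraphs:
        text = f"{title}. {paragraph}"  # enrich the paragraph with the title
        for group, model in models.items():
            if model(text):
                groups.add(group)
    return groups


# Hypothetical keyword stand-ins for the trained classifiers.
models: Dict[str, BinaryModel] = {
    "A011": lambda text: "cereal" in text.lower(),
    "A012": lambda text: "orchard" in text.lower(),
}

print(predict_document_groups(
    "Farm rules", ["Cereal growers must register.", "General provisions."], models
))
```

A union favours recall: one false positive from any model on any paragraph adds a wrong group to the document, which is one way document-level accuracy can fall below model-level accuracy.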
2. Unsupervised Machine Learning
For SIC Groups that are not mentioned in enough regulations to train a supervised machine learning model, we used an anchor-based approach: cosine similarity against vectors generated from the ONS SIC is used to identify paragraphs related to a specific SIC group. Post-validation was performed, and only SIC groups with accuracy higher than 80% were kept.
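The sketch below illustrates the cosine-similarity matching on toy 2-dimensional vectors. In practice the vectors would come from a sentence vectoriser; the anchor values, the function names and the 0.7 similarity threshold are assumptions made for illustration (the threshold is unrelated to the 80% post-validation accuracy cut-off).

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def matching_groups(
    paragraph_vector: Sequence[float],
    anchors: Dict[str, Sequence[float]],
    threshold: float = 0.7,  # hypothetical cut-off
) -> List[str]:
    """Return the SIC groups whose anchor vector is close to the paragraph."""
    return sorted(
        group
        for group, anchor in anchors.items()
        if cosine_similarity(paragraph_vector, anchor) >= threshold
    )


# Toy anchor vectors standing in for vectors built from the ONS SIC.
anchors = {"A011": [1.0, 0.0], "B051": [0.0, 1.0]}
print(matching_groups([0.9, 0.1], anchors))
```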
3. Rule-based
For regulations that are not sector-specific or that have broad application (e.g. consumer protection, workers' welfare, etc.), we used a mapping from policy topics to sectors to assign the applicable sectors to the regulations.
Most regulations on legislation.gov.uk, published by the National Archives, carry one or more Subjects highlighting the key subject matter of the regulation. Some regulation subjects are sector-specific, for example Agriculture or Contamination of food, and they are used to further refine the sector application of a regulation.
The Policy Topics are generated by our proprietary topic modelling algorithms, using the Subjects of regulations listed on legislation.gov.uk from the National Archives as the base. Some policy topics are sector-specific, for example Financial markets or Construction, and they are used to further refine the sector application of a regulation.
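A rule-based assignment of this kind can be pictured as a lookup table. The sketch below is illustrative only: the table contents, the choice of SIC Sections as the target level and the names TOPIC_TO_SECTIONS and sectors_for_topics are assumptions, not the app's actual mapping.

```python
from typing import Dict, Iterable, Set

# Hypothetical mapping from a Subject or Policy Topic to SIC Sections.
TOPIC_TO_SECTIONS: Dict[str, Set[str]] = {
    "Agriculture": {"A"},
    "Construction": {"F"},
    "Financial markets": {"K"},
}


def sectors_for_topics(topics: Iterable[str]) -> Set[str]:
    """Union the sectors mapped to each topic; unmapped topics add nothing."""
    sectors: Set[str] = set()
    for topic in topics:
        sectors |= TOPIC_TO_SECTIONS.get(topic, set())
    return sectors


print(sectors_for_topics(["Agriculture", "Construction"]))
```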
Excluded SICs
This application excludes the 3 SIC sections below because they are not business related:
- O: Public administration and defence; compulsory social security
- T: Activities of households as employers; undifferentiated goods-and-services-producing activities of households for own use
- U: Activities of extraterritorial organizations and bodies
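The exclusion can be applied as a simple filter on the Section letter. This is a sketch assuming Group codes start with their Section letter, as in A011; the names EXCLUDED_SECTIONS and drop_excluded are hypothetical.

```python
from typing import Iterable, List

# Sections excluded because they are not business related.
EXCLUDED_SECTIONS = frozenset({"O", "T", "U"})


def drop_excluded(groups: Iterable[str]) -> List[str]:
    """Keep only the codes whose Section letter is not excluded."""
    return [g for g in groups if g[:1].upper() not in EXCLUDED_SECTIONS]


print(drop_excluded(["A011", "O841", "U990", "F411"]))
```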