30/11/2022
What I Learnt
I learnt about Big Data structures within companies, such as the use of Data Warehouses and Data Lakes, programming platforms, environment setup, containerization of programs, and the importance of clear communication between Data Scientists and non-tech-savvy people. I also learnt, while creating my model, about common issues when training and testing models.
Data Structures Within Corporations
Based on my short experience during work experience, I learnt that data is tracked and placed into a storage server of some sort; depending on the company there may be one or several. Data Scientists then code and define models on another machine, often a large-scale server (which becomes useful later), using a variety of IDEs such as JupyterLab, Visual Studio Code and even Neovim. They test their code with a local sample they have acquired, to reduce load on the Big Data platform; once they have confirmed the code works, they pull the data from the Big Data platform and train their model using the power of the server they are coding on. Using the server, their models are able to train faster than on a conventional computer.
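To make that workflow concrete, here is a minimal Python sketch of the "develop on a sample, then run on the full data" idea. The file names (transactions_sample.csv, transactions_full.csv) and the amount column are hypothetical, just for illustration.

    # Develop and verify the pipeline on a small local sample first,
    # keeping load off the Big Data platform.
    import pandas as pd

    def preprocess(df):
        # Example preprocessing step: drop missing rows and keep
        # only positive transaction amounts.
        df = df.dropna()
        return df[df["amount"] > 0]

    # Check the code works on the small sample...
    sample = pd.read_csv("transactions_sample.csv")
    clean_sample = preprocess(sample)

    # ...then run the same function against the full dataset on the server.
    full = pd.read_csv("transactions_full.csv")
    clean_full = preprocess(full)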
Depending on the server and the size of the dataset, however, issues can arise when working with large datasets. For example, if your dataset is larger than the memory on the machine you are training on, or larger than your allowed quota, preprocessing can take an extraordinary amount of time: the server has to modify one part of the data, write it to swap, and repeat until it is all done. This wastes a lot of CPU cycles and time, which is why systems such as Apache Spark are used.
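One common workaround before reaching for Spark is to stream the data in chunks so that only a small piece is ever held in memory at once. A minimal pandas sketch, assuming a hypothetical transactions_full.csv with an amount column:

    # Process a dataset that does not fit in memory by streaming it in chunks
    # instead of loading it all at once.
    import pandas as pd

    total = 0.0
    rows = 0
    # Read 1,000,000 rows at a time so only one chunk is ever held in memory.
    for chunk in pd.read_csv("transactions_full.csv", chunksize=1_000_000):
        total += chunk["amount"].sum()
        rows += len(chunk)

    print("mean amount:", total / rows)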
What is Apache Spark
Apache Spark is a framework that allows several computers to pool their memory and CPU cores for the mass processing of data across machines. Each machine, or Node as Apache refers to them, stores a part of the dataset in memory; a machine connected to the cluster tells the cluster manager to do something to the data, and the cluster manager tells all the nodes to perform that operation. Programs can communicate with Spark in a variety of ways: through Python, R, Scala or Java. This allows the pre-training code to run faster and more effectively. Once all the transformations have been performed on the data, the program/application receives the output, which it then uses for training.
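As a rough illustration of how an application talks to Spark from Python, here is a minimal PySpark sketch; the file name transactions.csv and the amount/category columns are hypothetical.

    # Minimal PySpark example: transformations are sent to the cluster manager,
    # which distributes the work across the nodes holding the data in memory.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    # Connect to the cluster (or run locally if no cluster manager is configured).
    spark = SparkSession.builder.appName("preprocessing-example").getOrCreate()

    # Each node reads and holds a portion of the dataset in memory.
    df = spark.read.csv("transactions.csv", header=True, inferSchema=True)

    # Define transformations; nothing runs until an action is called.
    summary = (
        df.filter(F.col("amount") > 0)
          .groupBy("category")
          .agg(F.avg("amount").alias("avg_amount"))
    )

    # collect() triggers the distributed computation; the application
    # receives the output, which could then be used for training.
    result = summary.collect()
    spark.stop()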
What Other Systems are in Place
Another extremely popular system in Australia comes from a company called H2O. H2O has created a platform to which you can give data and it will create rudimentary models with extremely high accuracy.
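From what I understand of H2O's Python interface, an AutoML run looks roughly like the sketch below; the file fraud.csv and the is_fraud target column are hypothetical, and exact arguments may vary between versions.

    # Rough H2O AutoML sketch: hand the platform a dataset and a target column
    # and let it train and rank a set of models automatically.
    import h2o
    from h2o.automl import H2OAutoML

    h2o.init()
    frame = h2o.import_file("fraud.csv")              # hypothetical dataset
    frame["is_fraud"] = frame["is_fraud"].asfactor()  # treat target as categorical
    train, test = frame.split_frame(ratios=[0.8], seed=42)

    aml = H2OAutoML(max_models=10, seed=42)
    aml.train(y="is_fraud", training_frame=train)

    print(aml.leaderboard)              # models ranked by performance
    preds = aml.leader.predict(test)    # predictions from the best model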
My Model
During my time I created a credit fraud model to detect instances of fraud.
I experienced issues with overfitting: in most cases the model would simply say there were no cases of fraud. Because fraud makes up only a tiny fraction of transactions, this still gave a high accuracy, yet the model barely managed to pick out 50% of the fraud cases, making it a relatively poor model for detecting fraud. Features such as the time of the transaction (unix_time), age, city population, latitude and longitude were important within the model.
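The sketch below, my own scikit-learn illustration rather than the exact code I wrote, shows why accuracy is misleading on imbalanced fraud data and why recall on the fraud class is the number to watch; the file and column names are hypothetical.

    # Accuracy vs recall on imbalanced fraud data.
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score, recall_score
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("fraud.csv")  # hypothetical dataset
    features = ["unix_time", "age", "city_pop", "lat", "long"]
    X, y = df[features], df["is_fraud"]

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42
    )

    # class_weight="balanced" penalises missing the rare fraud class more heavily.
    model = RandomForestClassifier(class_weight="balanced", random_state=42)
    model.fit(X_train, y_train)
    preds = model.predict(X_test)

    # Accuracy looks high even for a model that predicts "no fraud" every time;
    # recall on the fraud class shows how many fraud cases were actually caught.
    print("accuracy:", accuracy_score(y_test, preds))
    print("fraud recall:", recall_score(y_test, preds))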