Pawel Pralat

Enhancing expressive power of GNN

Mitacs Elevate Award with Mastercard (in progress, 2025-27), $120,000

Graph Neural Networks (GNNs) are a rapidly evolving area in machine learning, AI, network theory and network-based technologies, due to their ability to model complex relationships in graph-structured data. Enhancing the expressive power of GNNs is crucial for network-based technologies, particularly at Mastercard, whose datasets are structured as graphs, reflecting relationships between users, transactions, merchants, and financial entities.

The primary objective of this project is to develop a GNN architecture that enhances the expressiveness of GNNs while maintaining scalability, enabling it to be deployed on large networks with millions of nodes. By focusing on expressiveness, we intend to create a model that captures intricate patterns within graph data, essential for tasks like anomaly detection, which is a significant concern for Mastercard in detecting fraud in financial transactions. In addition to expressiveness, a key challenge in GNNs is the over-smoothing phenomenon. Our architecture will incorporate mechanisms to mitigate this issue, allowing distinct node representations to be preserved over layers. Furthermore, our design aims to effectively capture long-range dependencies, which is crucial in many graph-based applications. In addition to theoretical advancements, the practical application of our model is particularly important as it equips Mastercard with more powerful tools for complex graph-based tasks. Our architecture is intended for deployment within Mastercard products, resulting in enhanced performance across a wide range of Mastercard's data analysis tools, leading to more accurate fraud detection and predictions, personalized customer experiences, and ultimately improved services and products. Additionally, it contributes to the broader advancement of network and data analysis systems.

Anomaly Detection with Noisy Labels in Graphs

Mitacs Elevate Award with Mastercard (in progress, 2025-27), $120,000

The proposed research aims to address the challenge of detecting anomalies in noisily labelled graph data using Graph Neural Networks (GNNs), specifically in the context of Mastercard's transaction network. GNNs are particularly well-suited for this task as they can effectively capture the graph structure of transactions, thereby revealing insights that are not feasible to obtain with tabular data. This has the potential to enhance our ability to identify fraudulent activity in the network. Fraudulent transactions, which make up a small fraction of the overall data, often deviate significantly from normal behaviour, making them detectable as anomalies. As a significant amount of fraudulent activity tends to go unreported in reality, one of the major challenges is to develop robust GNNs that can perform anomaly detection on noisily labelled graphs. These noisy labels may lead models to learn incorrect patterns, overfit to noise, and disrupt the message-passing mechanisms that GNNs rely on.

Graph-Driven Strategic Intelligence: Innovations in Forecasting, Delisting, and Marketing Optimization

Mitacs Accelerate Umbrella Award with Unilever Canada (completed, 2024)

Unilever is one of the world's largest consumer packaged goods companies. Unilever is motivated by the goal of making sustainable living a common practice. Unilever is known for its great brands and the belief that doing business the right way drives superior performance. Unilever stands out as a leading company in the effective utilization of advanced data science and AI technologies to enhance its marketing performance. Numerous AI algorithms and models are operational at Unilever, aiding our business units in gaining precise insights into marketing data and facilitating informed decision-making. To accelerate advanced AI and analytical algorithms at Unilever, this project has been defined with the goal of investigating and executing the following market analysis use cases utilizing a semantic graph structure.
Objective 1: Unilever's Precision Forecasting: Innovating for Enhanced Accuracy and Efficiency
Objective 2: Transparency in Motion: Unilever's Smart Delisting Implementation Tool
By leveraging semantic graph structures for precision forecasting, smart delisting, and LLMs, our project aims to cultivate a dynamic and adaptable business intelligence framework, providing our company with the agility to navigate market complexities, foster innovation, and attain sustained success.

Classical and Structural Embeddings of Networks and Their Applications in Various ML Tasks

Mitacs Globalink Research Award (completed, 2024)

Users on social networks such as Twitter interact with each other without much knowledge of the real identity behind the accounts they interact with. This anonymity has created a perfect environment for bot accounts to influence the network by mimicking real-user behaviour. Although not all bot accounts have malicious intent, identifying bot accounts in general is an important and difficult task. In particular, recent developments in Large Language Models (LLMs) such as GPT-4 will make it more difficult to distinguish human-generated from bot-generated language. Our goal is to investigate the predictive power of features that can be extracted from the underlying network structure. We plan to explore and compare two classes of embedding algorithms that can be used to take advantage of the information that network structure provides. The first class consists of classical embedding techniques, which focus on learning proximity information. The second class consists of structural embedding algorithms, which capture the local structure of node neighbourhoods. Our hypothesis is that structural embeddings have higher predictive power when it comes to bot detection and other related ML tasks.

Modelling and Mining Complex Networks as Hypergraphs

NSERC Alliance International Collaboration Grant with SGH Warsaw School of Economics (completed, 2022-24)

Hypergraphs have recently emerged as a useful representation for complex networks since they allow capturing more general relations than graphs. Research on the generalization of various graph-based tools and techniques is booming and, in parallel, new software packages are being developed. Having said that, the theory and tools are still not sufficiently developed to allow most problems to be tackled directly within this context. Hypergraphs are of particular interest in the field of knowledge discovery where most problems currently modelled as graphs would be more accurately modelled as hypergraphs. Those in the knowledge discovery field are particularly interested in the generalization of the concepts of modularity and diffusion to hypergraphs. Such generalizations require a firm theoretical base on which to develop these concepts. Unfortunately, although hypergraphs were formally defined in the 1960s (and various realizations of hypergraphs were studied long before that), the general formal theory is not as mature as required for the applications of interest to practitioners. The main goal of this collaboration is to develop formal theory, in conjunction with development of concrete applications. Given that different research groups in various parts of the world have recently started pursuing results in this direction, the proposal seems particularly well-timed as an occasion for the cross-pollination of research directions, the establishment of new collaborations and the exchange of tools and techniques.

Detecting and Responding to Hostile Information Activities: unsupervised methods for measuring the quality of graph embeddings

Canadian Department of National Defence project with Patagona Technologies (completed, 2021-22)

The rise in online organized disinformation campaigns presents a significant challenge to Canadian national security. State and non-state hostile actors manipulate users on social media platforms to advance their interests. Patagona Technologies is a Toronto-based software development company started by two Ryerson alumni. We are currently working with the Canadian Department of National Defence to address the challenges posed by online hostile actors by analysing the structure and content of social networks.

Graph embeddings allow discrete mathematical graphs such as social networks to be mapped to sets of vectors of real values in a way that preserves certain properties of the original graph. Graph embeddings present a convenient representation for developing machine learning models to analyse graphs. Unfortunately, the quality of these models is only as good as the quality of the embeddings, so a method for evaluating the quality of embeddings is required.

This research project will expand on prior work done by Dr. Pralat on unsupervised methods for measuring the quality of graph embeddings. Specifically, this work will involve studying unsupervised methods for measuring the quality of structural graph embeddings (for example, struc2vec and GraphWave), as well as extending the random graph models used by these methods to multi-relational graphs.

COVID-19: Agent-based framework for modelling pandemics in urban environment

NSERC Alliance project with Security Compass, complemented with SOSCIP COVID-19 Response Program (completed, 2020-21)

The development of the COVID-19 pandemic raises important questions on optimal policy design for managing and controlling the number of people affected. In order to answer these questions, one needs to better understand the determinants of pandemic dynamics. Indeed, the development of epidemics depends on various factors including the intensity and frequency of social contacts and the amount of care and protection applied during those contacts. In particular, one area where the disease can be transmitted is the urban space of a large city such as Toronto.

The goal of the project is to create an agent-based framework for building virtual models of an urban area. This framework will be used as a virtual laboratory for testing various scenarios and their implications for the development of pandemics. In order for conclusions to be reliable, the models (known in the literature as synthetic population models or digital twins) have to be at full scale, with the number of agents comparable to the population of the city. This, in turn, requires implementations ready to be run in a large-scale distributed computing environment in the cloud (for example, using the Elastic Compute Cloud (EC2) service at Amazon Web Services (AWS)), as the algorithms behind the engine need high-performance computing power.

The framework will allow us to evaluate different COVID-19 mitigation policy designs. This includes possible decisions such as decreasing proneness to wearing masks, closing down some non-essential, high-contact social network nodes (for example, hairdressers), limiting the number of people attending social gatherings at the same time, or reducing the number of people on streets altogether by promoting actions such as #stayathome.

Embedding Complex Networks

Mitacs Research Training Award (RTA) (completed, 2020)

The goal of many machine learning applications is to make predictions or discover new patterns using graph-structured data as feature information. In order to extract useful structural information from graphs, one might want to try to embed them in a geometric space by assigning coordinates to each node such that nearby nodes are more likely to share an edge than those far from each other. There are many embedding algorithms (based on techniques from linear algebra, random walks, or deep learning) and the list constantly grows. Moreover, many of these algorithms have various parameters that can be carefully tuned to generate embeddings in some multidimensional spaces, possibly in different dimensions. The main question we try to answer in this research project is: How do we evaluate these embeddings? Which one is the best and should be used for a given task at hand?

In order to answer these questions, we propose a general framework that assigns a divergence score to each embedding which, in an unsupervised learning fashion, distinguishes good from bad embeddings. In order to benchmark embeddings, we generalize the Chung-Lu random graph model to incorporate geometry. The goal of this research project is to carry out a detailed, comprehensive study of the best graph embedding algorithms (Node2Vec, VERSE, LINE, GraphSAGE, DeepWalk, Struc2Vec, HOPE, SDNE, GCN, HARP) by performing a series of tests for a given application at hand as well as using our framework.
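
For readers unfamiliar with the Chung-Lu model mentioned above, the following is a minimal sketch of the classical (non-geometric) version only: each pair of distinct nodes i, j is joined independently with probability min(1, w_i w_j / W), where w is a vector of target expected degrees and W is their sum. The geometric generalization and the divergence score developed in the project are not reproduced here; the function names and the example weights are illustrative assumptions.

```python
import random
from itertools import combinations


def chung_lu_graph(weights, seed=None):
    """Sample a simple graph from the classical Chung-Lu model.

    weights[i] is the target (approximate) expected degree of node i.
    Each pair {i, j} with i != j becomes an edge independently with
    probability min(1, weights[i] * weights[j] / sum(weights)).
    """
    rng = random.Random(seed)
    total = sum(weights)
    edges = set()
    if total == 0:
        return edges  # no weight at all means no edges
    for i, j in combinations(range(len(weights)), 2):
        p = min(1.0, weights[i] * weights[j] / total)
        if rng.random() < p:
            edges.add((i, j))
    return edges


def degrees(n, edges):
    """Return the degree of every node in an edge set on nodes 0..n-1."""
    deg = [0] * n
    for i, j in edges:
        deg[i] += 1
        deg[j] += 1
    return deg


if __name__ == "__main__":
    # Hypothetical weight sequence: a few hubs and many low-degree nodes.
    w = [20, 15, 10] + [3] * 97
    g = chung_lu_graph(w, seed=1)
    print(len(g), degrees(len(w), g)[:3])
```

Because self-loops are excluded in this sketch, the realized expected degree of node i is slightly below w_i; other variants of the model handle this differently.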

Improved transductive regression using interconnected data

Mitacs Accelerate project with Tesseraqt Optimization Inc (completed, 2019)

The explosion of labelled data from personal phones, apps, and sensors has enabled powerful supervised machine learning algorithms. However, in many real-world learning applications, the quantity of unlabelled data far exceeds that of good-quality labelled data. In situations such as these, where the unlabelled data is available to the learning algorithm, transductive (as opposed to inductive) learning can be useful. Most studies of transductive learning methods focus on transductive classification problems and label propagation. Extensions to transductive regression have been basic and relatively unexplored. This project aims to develop novel and improved transductive learning algorithms on hypergraphs to be applied in regression problems. Unlike graphs, hypergraphs can robustly incorporate data that involves relationships that are richer and more complex than pairwise. Representing these complex relationships in a pairwise fashion inevitably results in loss of information and ambiguity, which can degrade the performance of learning algorithms. This project will explore the use of hypergraphs to unify and integrate sparsely labelled data with other, complex data sources with the aim of improving state-of-the-art transductive regression methods. The scope of the project will be to design and benchmark these new algorithms as well as study the characteristics of the related data sets that are most suitable for this type of approach.
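
The loss of information caused by a pairwise representation can be seen in a small example. The sketch below is only an illustration, not the project's method: it replaces every hyperedge by all pairs of its vertices (often called the 2-section or clique expansion) and shows that two different hypergraphs collapse to the same graph. The function name `two_section` and the toy data are assumptions made for this example.

```python
from itertools import combinations


def two_section(hyperedges):
    """Replace each hyperedge by all pairs of its vertices.

    Returns the set of resulting pairwise edges (as sorted tuples).
    """
    edges = set()
    for he in hyperedges:
        for u, v in combinations(sorted(he), 2):
            edges.add((u, v))
    return edges


# One three-way relationship, e.g. a single message shared by a, b and c.
h1 = [{"a", "b", "c"}]
# Three separate pairwise relationships among the same people.
h2 = [{"a", "b"}, {"b", "c"}, {"a", "c"}]

print(two_section(h1) == two_section(h2))  # prints True
```

Once the data is stored as the graph on the right-hand side of the comparison, a learning algorithm can no longer tell which of the two situations produced it.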

Quantum-Resistance and Efficiency of Communications in Connected and Autonomous Vehicles

NSERC Collaborative Research and Development + Fields CQAM project with NXM Technologies Inc. (completed, 2018-22)

NXM Labs is making secure Internet-of-Things (IoT) devices for connected and autonomous vehicles. These devices are small and contain sensitive information, and it is therefore critical to ensure these devices remain secure for the lifetime of the car. In their solution, NXM also uses a blockchain-based authentication mechanism. Both IoT and blockchain technologies use encryption techniques for confidentiality and integrity of the information, but the currently deployed encryption standards are vulnerable to future quantum computers. Therefore, this project aims at designing techniques to make IoT and blockchain quantum-resistant using a combination of classical and quantum-resistant cryptographic approaches. As an IoT world becomes closer to reality, ensuring the security of its communication and data becomes increasingly important. Generations of devices will have been made and be in use while quantum computers become more sophisticated and powerful. As such, protecting IoT devices against future quantum attacks in the present is a priority. Another aspect of this project focuses on more effective and efficient strategies for communications among connected vehicles, which will be developed alongside the quantum-resistant technologies and will ensure the future security and efficiency of NXM's technologies.

Online Detection of Users' Anomalous Activities on Confidential File Sharing Platform

NSERC Engage project with TitanFile (completed, 2018-19)

TitanFile is a platform for exchanging confidential documents with authorized parties outside of the organization. The external sharing of confidential documents raises the risk of data exfiltration to unauthorized parties, whether maliciously or unintentionally due to human error. To mitigate this risk, we ask the following question: "is it possible to detect user activities that have symptoms of fraudulent or unintended behaviour in real time based on the user's activity pattern and available metadata describing the transaction?"

The goal of the project is to construct a system for online detection of malicious, spurious or erroneous activity of TitanFile's users. The targeted activities to be detected include: data theft, data exfiltration, external correspondence and file sharing. Those events can occur as a result of user mistakes or intentional actions.

The result of the project will be an analytical model that will generate typical profiles across customers and identify anomalies in their behaviour. Sample activities include: employees sending company data to unauthorized persons, accidentally communicating with the wrong person and sending confidential data to personal accounts. The system will be designed to work in real time, analyzing log streams of TitanFile: for each logged event the classifier will output a binary decision along with a confidence level. The solution will be provided with a Python module API for easy integration with TitanFile's existing infrastructure.

Automatic Personality Insights from Speech

Ontario Centres of Excellence, Voucher for Innovation and Productivity with IMC Business Architecture (completed, 2018-19)

IMC Business Architecture (IMCBA) has been developing a suite of innovative computational tools for automatic assessment of personality traits based on analysis of recordings of human speech. At the current stage, a prototype system has been developed that is able to estimate, by means of analysis of samples of recorded speech, customer time thinking styles according to the MindTime personality classification system. Based on numerous positive results obtained so far, IMCBA believes that the precision of the results generated by their system will be significantly improved by analyzing the acoustic layer of speech samples along with their semantic content. The ability to derive insights on emotional states and personality traits from the spoken word has been hypothesized and investigated in psychology for a long time. Whereas earlier works were mainly concerned with personality perception by humans in human-to-human interactions, recent research has focused on automatic extraction and quantification of personality cues from speech by computerized systems. In both cases, significant correlation between certain properties of a speech signal, known as prosodic features, and personality scores has been observed. Incorporating this type of analysis into IMCBA's analytical suite will allow the company to improve the market value of its offering and gain an advantage over competitors.

Agent-based simulation modelling of out-of-home advertising viewing opportunity

Ontario Centres of Excellence, Voucher for Innovation and Productivity with Environics Analytics (completed, 2018)

Many Canadian and American firms and agencies that use out-of-home (OOH) advertising to promote their products are interested in targeting their campaigns towards specific consumer segments. A successful advertising campaign requires a certain level of 'opportunity-to-see' frequency of exposure for the brands and products to be remembered by potential customers, as well as the right level of customer reach. Currently the out-of-home industry, especially in Canada, is seriously out of date and increasingly recognized as such.

The goal of the project is to create an integrated multi-source simulation model that will allow Environics Analytics to design revolutionary and highly targeted optimal marketing campaigns for their customers. This, in turn, will allow for a better differentiation of promotional actions and so be more valuable for the customers, the advertising firms. Companies planning their marketing actions along with Environics Analytics will be better able to target desired socioeconomic groups and segments and to understand and predict the response and impacts of their advertising campaigns.

Cognitive Claims AI

NSERC Engage project with IMC Business Architecture (completed, 2017)

IMC is developing a tool for the Property & Casualty (P&C) insurance industry. P&C companies are seeking ways to minimize their cost of claims (currently running at some 60% of premiums) as well as increase their rate of fraud detection, which they estimate at 10% of actual fraud.

IMC believes it is possible to improve this performance by using available data to predict the behaviour of the claimant and treat them appropriately. The most useful data would be the statement of claim by the customer; however, as this is in the form of a telephone interview, it is not readily usable for algorithmic modelling. Therefore, IMC is seeking to find a way of turning the content of the statement of claim into data usable for predictive modelling of the claim outcome. IMC has identified a number of tools for converting natural language data into numerical scores, but the considered prediction problems are non-standard and so require development of novel approaches to data modelling. The opportunity to collaborate with Dr. Pralat in researching the potential for such algorithms will allow IMC to determine if this tool represents a viable opportunity.

Modelling of Homophily and Social Influence via Random Graphs

NSERC Engage project with Alcatel-Lucent (completed, 2016-17)

The proliferation of cellular usage has given rise to massive amounts of data that, through data mining and analytics, promise to reveal a wealth of information on how agents interact with one another and affect one another's preferences. For example, cellular devices frequently communicate with cell towers, from which agent locations, and hence, agent activity profiles, are readily available. The company aims to understand the interconnections between agent profiles and, in particular, how these profiles co-evolve over time.

It is through the lens of social learning that we propose to model and derive value from agent profiles. The first step is to understand the social environment of the agents, which is both shaped by the agents and influences the agents to adopt new behaviours. So, any relevant theory of social learning must account for at least two interrelated factors: network change as a result of agent attributes, and attribute updating as a result of network position. Two leading hypotheses in this area are that network ties are formed and deleted based on similarity or differences in agent attributes (homophily), and that certain attributes are likely to diffuse through existing network ties (social influence). This project aims to determine whether or not homophily and social influence are good models of networks described by agent location data and then use the resultant models to develop scalable analytics algorithms.

Hypergraph Theory and Applications

Project with The Tutte Institute for Mathematics and Computing (completed, 2015-16)

Myriad problems can be described in hypergraph terms; however, the theory and tools are not sufficiently developed to allow most problems to be tackled directly within this context. Hypergraphs are of particular interest in the field of knowledge discovery where most problems currently modelled as graphs would be more accurately modelled as hypergraphs. Those in the knowledge discovery field are particularly interested in the generalization of the concepts of modularity and diffusion to hypergraphs. Such generalizations require a firm theoretical base on which to develop these concepts. Unfortunately, although hypergraphs were formally defined in the 1960s (and various realizations of hypergraphs were studied long before that), the general formal theory is not as mature as required for the applications of interest to the TIMC. The TIMC wishes to encourage the development of this formal theory, in conjunction with development of concrete applications.

Relationship Mapping Analytics for Fundraising and Sales Prospect Research

NSERC Engage project with Charter Press Ltd. (completed, 2015-16)

Third Sector Publishing (TSP) has been successful selling CharityCAN subscriptions to fundraising organizations across Canada. This has been the result of incorporating a large volume of data from different sources that prospect researchers find useful as they attempt to identify potential donors for their organizations. For example, the Canadian data that TSP licenses from Blackbaud, Inc. includes over 7.3 million donation records, that is, records of which individuals, foundations, and companies made donations to which organizations.

As well as licensed data, there is an abundance of publicly available data that will be useful to CharityCAN subscribers. TSP will be able to extract this data from websites through automated extraction processes. For example, most law firms in Canada create and freely post biographies of their lawyers. TSP will be able to add these biographies to the growing volume of useful data that can already be found on the CharityCAN platform.

The challenge for CharityCAN is connecting this growing number of data points. Relationship mapping refers to the identification of relationships among individuals. Relationship mapping becomes particularly useful when it can predict the strength or weakness of any relationship. CharityCAN requires sophisticated machine learning algorithms and data mining tools that will identify relationships among individuals, private-sector companies, and non-profit institutions, and then these algorithms should be able to predict the strength (or lack thereof) of these relationships. As a result, various (complex) networks could be formed and, with this in hand, some hybrid clustering methods could be used to extract groups of users that are potentially of interest to the subscribing institution or company (for example, for personalized and targeted solicitation).

Web Visitor Engagement Measurement and Maximization

Ontario Centres of Excellence Talent Edge Fellowship Program with The Globe and Mail (completed, 2014-15)

A very important measure of how well a news website is doing in providing content, as well as how attractive it is to advertisers, is how engaged its visitors are with the site. News websites need to maximize visitor engagement; however, they do not currently have an accurate way to measure engagement. Ideally they could measure a visitor's time spent looking at their website, but the web analytics software available in the marketplace falls short in its ability to do this accurately, as it always misses the last page of a visit, and it includes time that it should not (for example, when a visitor has physically walked away from their computer). The Globe and Mail is seeking a machine learning and big data solution to help them accurately measure engagement, and to then optimize for it. They need tools that help them optimize the selection of articles promoted on their section homepages at any given time, as well as their ordering, such that engagement is maximized.
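
The measurement gap described above follows from a common way of estimating time on page from pageview logs: taking the difference between consecutive pageview timestamps within a visit. The sketch below is a simplified illustration under that assumption, not The Globe and Mail's or any vendor's actual method; the log format and the function name are hypothetical.

```python
def time_on_pages(pageviews):
    """Estimate seconds spent on each page of one visit.

    pageviews is a list of (timestamp_in_seconds, url) tuples sorted by time.
    The time on a page is taken as the gap until the next pageview, so the
    last page has no successor and its duration is unknown (None).
    """
    durations = []
    for (t, url), (t_next, _) in zip(pageviews, pageviews[1:]):
        durations.append((url, t_next - t))
    if pageviews:
        durations.append((pageviews[-1][1], None))
    return durations


# Hypothetical visit: three pageviews at 0 s, 40 s and 1840 s.
visit = [(0, "/home"), (40, "/news/a"), (1840, "/news/b")]
print(time_on_pages(visit))
# prints [('/home', 40), ('/news/a', 1800), ('/news/b', None)]
```

Both problems are visible here: the last page gets no duration at all, and the 1800 seconds attributed to the second page may include time when the visitor was away from the screen.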

Utilizing big data for business-to-business matching and recommendation system

NSERC Engage project with ComLinked Corp. (completed, 2014-15)

The social media industry is experiencing a tremendous amount of growth and innovation with new business models being developed, especially in the B2C space. With the success of social media platforms such as Facebook, Twitter, and LinkedIn, the commercial segment has been looking to consolidate the main features and functionalities of these B2C platforms and apply them to solve real-life B2B problems. ComLinked is an online B2B platform where companies across all industries can create their online business profiles and, in addition to their basic company information, can list specific company information such as their founding year, their products and services and their customers' industries. Based on these elements, the platform uses matching algorithms to recommend companies to other companies to connect to. ComLinked Corp. is seeking to collaborate with the academic community to develop its core set of algorithms utilizing machine learning and big data solutions.

A self-organizing dynamic network model increasing the efficiency of outdoor digital billboards

NSERC Engage project with KEEN Projection Media Ltd. (KPM) (completed, 2014)

KPM is developing a business model for infrastructure development and management (Coop Billboard Network - CBN - www.coopbn.com) with the goal of creating an optimum working platform which consolidates multiple LED outdoor billboards (of various designs, ages, models, suppliers, locations, etc.) under one umbrella (similar to what Expedia does to the hotel business). The company is looking for a dynamic system that assigns user requests to specific billboards and optimizes the network in a self-organizing manner. Modelling should play an important role in this system, since it is expected that the system will be able to predict future requests and available time slots based on the history of the process as well as current trends. The system is supposed to have some artificial intelligence built-in to not only predict these events but also self-correct the network behaviour in order to increase the efficiency and global performance of the network.

Exploiting Big Data for Customized Online News Recommendation System

NSERC Engage project with The Globe and Mail (completed, 2014)

The news industry is undergoing major changes, with competition from non-traditional, international competitors negatively impacting both readership levels (pageviews) and the ad revenue associated with each pageview. The Globe and Mail is seeking a machine learning and big data solution to help them come out on top in this period of change. A system that offers personalized content recommendations to each user would help greatly. However, because their content library, akin to a product catalog at a retailer, changes dramatically every minute with the arrival of fresh news articles, traditional recommender systems would have a very hard time providing good recommendations of fresh articles. Traditional recommender systems also fail to consider popularity as a function of how much a piece of content was promoted, and the business consideration of the revenue driven by a piece of content. This project will combine big data and advanced algorithms to account for these considerations while driving personalized content recommendations.

Personalized Mobile Recommender System

NSERC Engage project with BlackBerry (completed, 2013-14)

We are developing a series of recommendation algorithms to enhance the mobile user experience. The algorithms will utilize mobile user behavioural data and application contents to determine the most relevant applications to recommend to the end users. The system will be developed on the leading-edge big data platform Apache Hadoop, and algorithms will need to be distributed to hundreds of computing nodes and scale to millions of users and items. The leading-edge algorithms we design will be benchmarked against industry-standard algorithms on performance and scalability.

Intelligent Rating System

NSERC Engage project with Mako (completed, 2012-13)

We are developing a series of formulas and algorithms to map a new artificial intelligence rating system online. The core of the platform is to utilize advanced statistical and technological indicators to determine the rank of the reviewed subject, with the biggest nugget being the ability to identify the quality/merit of each review based on many interconnected variables.

Dynamic clustering and prediction of taxi service demand

NSERC Engage project with Winston (completed, 2012)

The Winston mobile phone application completely transforms the archaic end-to-end taxi experience. By leveraging mobile technology and working with established, professional limousine service providers, they are able to connect users to car service in a way that makes sense today. Although they have a large amount of potentially important and relevant data, they have no tools to use it to improve their system efficiency. The goal of this project is to use the aggregated data to improve demand prediction. By better predicting where and when demand is likely to occur using historical data, it should be possible to better position drivers in order to minimize passenger wait time and maximize coverage. The algorithm should automatically adapt and improve as more and more data are aggregated.


Collaboration with industry partners and academia