Dr Santanu Dash

Senior Lecturer

s.k.dash@surrey.ac.uk

Academic and research departments

School of Computer Science and Electronic Engineering, Computer Science Research Centre, Surrey Centre for Cyber Security.

Publications

Alexis Butler, Dan O'Keeffe, Santanu Kumar Dash (2024)Links Between Package Popularity, Criticality, and Security in Software Ecosystems Zenodo

DOI: 10.5281/zenodo.14577849

Docker Volume containing a snapshot of the GraphDataset used in the MSR4P&S submission: Links Between Package Popularity, Criticality, and Security in Software Ecosystems. Insert Raw Bibtex Cite

Alexis Butler, Dan O'Keeffe, Santanu Kumar Dash (2024)Links Between Package Popularity, Criticality, and Security in Software Ecosystems, In: IEEE International Conference on Software Analysis, Evolution and Reengineering - Companion (SANER-C)ahead of printpp. 89-97

DOI: 10.1109/SANER-C66551.2025.00020

With the continued growth of Open Source Software (OSS), maintenance workloads have also continued to expand, this along with additional stressors results in maintainer burnout and churn. Given that the pool of those within a software ecosystem with the expertise and willingness to maintain a project is limited, maintenance efforts should be focussed on minimizing security risks with the greatest potential impact. One would expect a well maintained ecosystem to have strong security across all packages, or at the very least, strong security in packages that are core to the ecosystem. As such, dependency graphs for two ecosystems (Python, and JavaScript/Typescript) were captured to obtain criticality and popularity scores for each package. Security was measured at multiple points across the range of these metrics to establish the relationships between popularity, criticality and security. In doing so, a statistically significant moderate positive correlation between security and popularity for both ecosystems was found, along with mixed results for the correlation between security and criticality. These results can be used to assist in both, feature selection for machine learning based dependency risk measurement, and as a guide for dataset sampling for future security tooling evaluation.

Giuseppe Raffa, Jorge Blasco, Dan O'Keeffe, Santanu Kumar Dash (2024)Towards Inter-Service Data Flow Analysis of Serverless Applications, In: 2024 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER)pp. 654-658 IEEE

DOI: 10.1109/SANER60148.2024.00072

The recent advent of serverless applications has created a need for static analysis tools to analyse them. However, the event-driven architecture of serverless applications, along with the black-box nature of the services they invoke, make static analysis challenging. In this work, we propose a novel approach to statically analysing serverless applications, with a focus on the identification of data flows that can lead to code injection and information leakage. To reach our goal, we first design a new suite of microbenchmarks, which we publicly release. The microbenchmarks are based on documented serverless-specific vulnerabilities and the characterization of an existing dataset. We then introduce our static analysis approach and show how it can factor in the effect of platform services and eventtriggered code execution by extracting relevant information from both infrastructure and application code. This information is used to obtain a synchronous equivalent of the underlying asynchronous system, which can be inspected with a general-purpose static analysis tool. Preliminary evaluation results using a prototype implementation of our approach and the microbenchmark suite confirm the potential of our analysis technique.

Guillermo Suarez-Tangil, Santanu Kumar Dash, Mansour Ahmadi, Johannes Kinder, Giorgio Giacinto, Lorenzo Cavallaro (2017)DroidSieve, In: Proceedings of the Seventh ACM on Conference on Data and Application Security and Privacy309pp. 309-320 ACM

DOI: 10.1145/3029806.3029825

With more than two million applications, Android marketplaces require automatic and scalable methods to efficiently vet apps for the absence of malicious threats. Recent techniques have successfully relied on the extraction of lightweight syntactic features suitable for machine learning classification, but despite their promising results, the very nature of such features suggest they would unlikely--on their own--be suitable for detecting obfuscated Android malware. To address this challenge, we propose DroidSieve, an Android malware classifier based on static analysis that is fast, accurate, and resilient to obfuscation. For a given app, DroidSieve first decides whether the app is malicious and, if so, classifies it as belonging to a family of related malware. DroidSieve exploits obfuscation-invariant features and artifacts introduced by obfuscation mechanisms used in malware. At the same time, these purely static features are designed for processing at scale and can be extracted quickly. For malware detection, we achieve up to 99.82% accuracy with zero false positives; for family identification of obfuscated malware, we achieve 99.26% accuracy at a fraction of the computational cost of state-of-the-art techniques.

Santanu Kumar Dash, Thambipillai Srikanthan (2009)Modeling RTOS Components for Instruction Cache Hit Rate Estimation, In: ISCAS: 2009 IEEE INTERNATIONAL SYMPOSIUM ON CIRCUITS AND SYSTEMS, VOLS 1-55118428pp. 2978-2981 IEEE

DOI: 10.1109/ISCAS.2009.5118428

RTOS components have a direct impact on the instruction cache performance - an aspect that has been reported as a bottleneck in several studies. Therefore, there exists a need for insight into the instruction cache hit rates for various RTOS components at design time for efficient design space exploration. In this paper, we propose a technique to model RTOS components for instruction cache hit rate estimation. Our methods rely on rapid generation of hit rate values for different cache sizes for the RTOS components. We then fit the generated instruction cache hit rates using multivariate regression schemes to account for all parameters influencing the hit rate. The estimation techniques were tested for a large range of cache sizes and numerous parametric as well as non-parametric RTOS components. Comparison with hit rates generated by a cache simulator shows that these models can accurately estimate the hit rates with the mean difference in hit rates ranging from 0.00 to 0.05. The proposed technique also offers significant speed-up as compared to other modeling approaches because it does not require exhaustive cache hierarchy simulation for building the models.

Santanu Kumar Dash, Thambipillai Srikanthan (2009)Instruction Cache Tuning for Embedded Multitasking Applications, In: S Ceballos (eds.), RSP 2009: TWENTIETH IEEE/IFIP INTERNATIONAL SYMPOSIUM ON RAPID SYSTEM PROTOTYPING, PROCEEDINGSpp. 152-158 IEEE

DOI: 10.1109/RSP.2009.16

Cache tuning has been shown to achieve considerable energy savings and methods have also been proposed for tuning the cache for standalone embedded applications. However, with the increasing complexity of modern day embedded applications, RTOS based multitasking systems are fast becoming the norm. Therefore, there exists a need for techniques to tune the cache for multitasking systems. In this paper we present a framework for energy centric tuning of the instruction cache for embedded multitasking systems. Our framework is built upon a formal model for characterizing multitasking systems and is suitable for fast instruction cache tuning using loop profiling. We validate our proposed techniques by applying them to tune a predictor based filter cache hierarchy - a common solution for low power embedded systems. For all the multitasking programs tested, our techniques are able to successfully predict configurations that are optimal or near-optimal. The proposed methods are also able to achieve speed-ups of up to an order of magnitude compared to exhaustive design space exploration techniques.

Santanu Kumar Dash, Miltiadis Allamanis, Earl T. Barr (2018)RefiNym: using names to refine types, In: Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineeringpp. 107-117 ACM

DOI: 10.1145/3236024.3236042

Source code is bimodal: it combines a formal, algorithmic channel and a natural language channel of identifiers and comments. In this work, we model the bimodality of code with name flows, an assignment flow graph augmented to track identifier names. Conceptual types are logically distinct types that do not always coincide with program types. Passwords and URLs are example conceptual types that can share the program type string. Our tool, RefiNym, is an unsupervised method that mines a lattice of conceptual types from name flows and reifies them into distinct nominal types. For string, RefiNym finds and splits conceptual types originally merged into a single type, reducing the number of same-type variables per scope from 8.7 to 2.2 while eliminating 21.9% of scopes that have more than one same-type variable in scope. This makes the code more self-documenting and frees the type system to prevent a developer from inadvertently assigning data across conceptual types.

Santanu Kumar Dash, Thambipillai Srikanthan (2008)Rapid Estimation of Instruction Cache Hit Rates Using Loop Profiling, In: 2008 INTERNATIONAL CONFERENCE ON APPLICATION-SPECIFIC SYSTEMS, ARCHITECTURES AND PROCESSORS4580189pp. 263-268 IEEE

DOI: 10.1109/ASAP.2008.4580189

Estimation of the hit rate curve for an application is the first step in application specific cache tuning. Several techniques have been proposed to meet this objective however most of these have dealt with the data cache with little attention to the instruction cache. In this paper, we propose a novel, lightweight and highly scalable technique for rapid estimation of the instruction cache hit rate curve for a given application. Our technique works at the basic block level and relies on a one-time loop profiling of the weighted control flow graph of the application followed by estimation of the hit rate for different cache sizes. It accounts for the spatial and temporal locality separately and is sensitive to the cache size as well as block size. The proposed technique is highly accurate and when compared with results from an actual cache simulator, the mean error in estimation ranged from 1.11% to 2.46% for the benchmarks tested.

Santanu Kumar Dash, Sven-Bodo Scholz, Bruce Christianson (2013)Modular design of data-parallel graph algorithms, In: 2013 International Conference on High Performance Computing & Simulation (HPCS)6641446pp. 398-404 IEEE

DOI: 10.1109/HPCSim.2013.6641446

Amorphous Data Parallelism has proven to be a suitable vehicle for implementing concurrent graph algorithms effectively on multi-core architectures. In view of the growing complexity of graph algorithms for information analysis, there is a need to facilitate modular design techniques in the context of Amorphous Data Parallelism. In this paper, we investigate what it takes to formulate algorithms possessing Amorphous Data Parallelism in a modular fashion enabling a large degree of code re-use. Using the betweenness centrality algorithm, a widely popular algorithm in the analysis of social networks, we demonstrate that a single optimisation technique can suffice to enable a modular programming style without loosing the efficiency of a tailor-made monolithic implementation.

S. K. Dash, T. Srikanthan (2010)Instruction cache tuning for embedded multitasking applications, In: IET computers & digital techniques4(6)pp. 439-457 Inst Engineering Technology-Iet

DOI: 10.1049/iet-cdt.2009.0066

With the advent of mobile and handheld devices, power consumption in embedded systems has become a key design issue. Recently, it has been shown that cache requirements of the applications vary widely and a significant amount of energy can be saved by tuning the cache parameters according to the needs of the application. To this end, techniques have been proposed to tune the cache for single-task-based systems but no work has been done to extend these techniques to multitasking applications. In this research work, the authors present novel, lightweight and fast techniques for energy-sensitive tuning of the instruction cache hierarchy for multitasking applications. Cache tuning for real-time operating systems (RTOS)-driven multitasking applications is achieved by intelligently separating the user tasks and RTOS components and profiling them in isolation to identify the nature of loops in them. We then apply the proposed techniques to tune a predictor-based filter cache hierarchy for instructions for both single-task-based applications and RTOS-driven multitasking applications. The proposed techniques are able to identify optimal or near-optimal filter and L1 cache sizes for all the applications tested and are up to an order of magnitude faster than exhaustive cache hierarchy simulation techniques. The proposed techniques are also highly scalable and can be relied upon to predict the instruction cache hit rate for any range of instruction cache sizes after a one-time simulation and profiling.

Mederic Hurier, Guillermo Suarez-Tangil, Santanu Kumar Dash, Tegawende F. Bissyande, Yves Le Traon, Jacques Klein, Lorenzo Cavallaro (2017)Euphony: Harmonious Unification of Cacophonous Anti-Virus Vendor Labels for Android Malware, In: 2017 IEEE/ACM 14TH INTERNATIONAL CONFERENCE ON MINING SOFTWARE REPOSITORIES (MSR 2017)7962391pp. 425-435 IEEE

DOI: 10.1109/MSR.2017.57

Android malware is now pervasive and evolving rapidly. Thousands of malware samples are discovered every day with new models of attacks. The growth of these threats has come hand in hand with the proliferation of collective repositories sharing the latest specimens. Having access to a large number of samples opens new research directions aiming at efficiently vetting apps. However, automatically inferring a reference dataset from those repositories is not straightforward and can inadvertently lead to unforeseen misconceptions. On the one hand, samples are often mis-labeled as different parties use distinct naming schemes for the same sample. On the other hand, samples are frequently mis-classified due to conceptual errors made during labeling processes. In this paper, we mine Anti Virus labels and analyze the associations between all labels given by different vendors to systematically unify common samples into family groups. The key novelty of our approach, named EU-PHONY [20], is that no a-priori knowledge on malware families is needed. We evaluate EUPHONY using reference datasets and more than 400 thousands additional samples outside of these datasets. Results show that EUPHONY can accurately label malware with a fine-grained clustering of families, while providing competitive performance against the state-of-the-art.

Santanu Kumar Dash, Sven-Bodo Scholz, Stephan Herhut, Bruce Christianson (2013)A scalable approach to computing representative lowest common ancestor in directed acyclic graphs, In: Theoretical computer science51325pp. 25-37 Elsevier

DOI: 10.1016/j.tcs.2013.09.030

LCA computation for vertex pairs in trees can be achieved in constant time after linear-time preprocessing. However, extension of these techniques to compute LCA for vertex-pairs in DAGs has been not possible due to the non-tree edges in a DAG. In this paper, we present an algorithm for computing the LCA for vertex pairs in a DAG which treats the DAG's spanning tree and its non-tree edges separately. Our approach enables us to tap the efficiency of existing LCA algorithms for trees. Furthermore, our algorithm decomposes the DAG into a set of component trees called clusters which significantly reduces the preprocessing necessary to incorporate non-tree edges in the LCA computation. Our algorithm seamlessly interpolates the performance graph between the best reported algorithms for trees and the best reported algorithms for DAGs depending on the incidence of non-tree edges in the DAG. Using the proposed techniques, it is possible to achieve near-linear preprocessing and constant query time for sparse DAGs. (C) 2013 Elsevier B.V. All rights reserved.

Santanu Kumar Dash, Fan Wu, Michail Basios, Lingbo Li, Leslie Kanthan (2020)Checkers, In: Proceedings of the IEEE/ACM 42nd International Conference on Software Engineering Workshopspp. 291-292 ACM

DOI: 10.1145/3387940.3392173

Advent of microservices has increased the popularity of the API-first design principles. Developers have been focusing on concretising the API to a system before building the system. An API-first approach assumes that the API will be correctly used. Inevitably, most developers, even experienced ones, end-up writing sub-optimal software because of using APIs incorrectly. In this paper, we discuss an automated approach for exploring API equivalence and a framework to synthesise semantically equivalent programs. Unlike existing approaches to API transplantation, we propose an amorphous or formless approach to software translation in which a single API could potentially be replaced by a synthesised sequence of APIs which ensures type progress. Our search is guided by the non-functional goals for the software, a type-theoretic notion of progress, the application's test suite and an automatic multi-modal embedding of the API from its documentation and code analysis.

Amit Deo, Santanu Kumar Dash, Guillermo Suarez-Tangil, Volodya Vovk, Lorenzo Cavallaro (2016)Prescience: Probabilistic Guidance on the Retraining Conundrum for Malware Detection, In: AISEC'16: PROCEEDINGS OF THE 2016 ACM WORKSHOP ON ARTIFICIAL INTELLIGENCE AND SECURITYpp. 71-82 Assoc Computing Machinery

DOI: 10.1145/2996758.2996769

Malware evolves perpetually and relies on increasingly sophisticated attacks to supersede defense strategies. Data driven approaches to malware detection run the risk of becoming rapidly antiquated. Keeping pace with malware requires models that are periodically enriched with fresh knowledge, commonly known as retraining. In this work, we propose the use of Venn-Abers predictors for assessing the quality of binary classification tasks as a first step towards identifying antiquated models. One of the key benefits behind the use of Venn-Abers predictors is that they are automatically well calibrated and offer probabilistic guidance on the identification of nonstationary populations of malware. Our framework is agnostic to the underlying classification algorithm and can then be used for building better retraining strategies in the presence of concept drift. Results obtained over a timeline-based evaluation with about 90K samples show that our framework can identify when models tend to become obsolete.

Santanu Kumar Dash, Guillermo Suarez-Tangil, Salahuddin Khan, Kimberly Tam, Mansour Ahmadi, Johannes Kinder, Lorenzo Cavallaro (2016)DroidScribe: Classifying Android Malware Based on Runtime Behavior, In: 2016 IEEE Security and Privacy Workshops (SPW)7527777pp. 252-261 IEEE

DOI: 10.1109/SPW.2016.25

The Android ecosystem has witnessed a surge in malware, which not only puts mobile devices at risk but also increases the burden on malware analysts assessing and categorizing threats. In this paper, we show how to use machine learning to automatically classify Android malware samples into families with high accuracy, while observing only their runtime behavior. We focus exclusively on dynamic analysis of runtime behavior to provide a clean point of comparison that is dual to static approaches. Specific challenges in the use of dynamic analysis on Android are the limited information gained from tracking low-level events and the imperfect coverage when testing apps, e.g., due to inactive command and control servers. We observe that on Android, pure system calls do not carry enough semantic content for classification and instead rely on lightweight virtual machine introspection to also reconstruct Android-level inter-process communication. To address the sparsity of data resulting from low coverage, we introduce a novel classification method that fuses Support Vector Machines with Conformal Prediction to generate high-accuracy prediction sets where the information is insufficient to pinpoint a single family.

Guillermo Suarez-Tangil, Santanu Kumar Dash, Pedro Garcia-Teodoro, Jose Camacho, Lorenzo Cavallaro (2018)Anomaly-based exploratory analysis and detection of exploits in android mediaserver, In: IET information security12(5)1pp. 404-413 Wiley

DOI: 10.1049/iet-ifs.2017.0460

Smartphone platforms are becoming increasingly complex, which gives way to software vulnerabilities difficult to identify and that might allow malware developers to gain unauthorised privileges through technical exploitation. However, the authors maintain that these types of attacks indirectly renders a number of unexpected behaviours in the system that can be profiled. In this work, the authors present CoME, an anomaly-based methodology aiming at detecting software exploitation in Android systems. CoME models the normal behaviour of a given software component or service and it is capable of identifying any unanticipated behaviour. To this end, they first monitor the normal operation of a given exploitable component through lightweight virtual introspection. Then, they use a multivariate analysis approach to estimate the normality model and detect anomalies. They evaluate their system against one of the most critical vulnerable and widely exploited services in Android, i.e. the mediaserver. Results show that the proposed approach can not only provide a meaningful explanatory of discriminant features for illegitimate activities, but can also be used to accurately detect malicious software exploitations at runtime.

Profir-Petru Pârțachi, Santanu Kumar Dash, Miltiadis Allamanis, Earl T. Barr (2020)Flexeme: untangling commits using lexical flows, In: Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineeringpp. 63-74 ACM

DOI: 10.1145/3368089.3409693

Today, most developers bundle changes into commits that they submit to a shared code repository. Tangled commits intermix distinct concerns, such as a bug fix and a new feature. They cause issues for developers, reviewers, and researchers alike: they restrict the usability of tools such as git bisect, make patch comprehension more difficult, and force researchers who mine software repositories to contend with noise. We present a novel data structure, the -NFG, a multiversion Program Dependency Graph augmented with name flows. A -NFG directly and simultaneously encodes different program versions, thereby capturing commits, and annotates data flow edges with the names/lexemes that flow across them. Our technique, Flexeme, builds a -NFG from commits, then applies Agglomerative Clustering using Graph Similarity to that -NFG to untangle its commits. At the untangling task on a C# corpus, our implementation, Heddle, improves the state-of-the-art on accuracy by 0.14, achieving 0.81, in a fraction of the time: Heddle is 32 times faster than the previous state-of-the-art.

Michal Tereszkowski-Kaminski, Santanu Kumar Dash, Guillermo Suarez-Tangil (2024)A Study of Malicious Source Code Reuse Among GitHub, StackOverflow and Underground Forums, In: 29th European Symposium on Research in Computer Security - Proceedings, Part III14984pp. 45-66 Springer

DOI: 10.1007/978-3-031-70896-1_3

To date, most analysis of collaboration between malware authors has been performed on meta-data and compiled binaries, while ignoring artifacts present in the source code. We collect a vast amount of malicious source code from Underground Forums posts, Underground Forum code attachments, and GitHub repositories and devise a methodology that allows us to filter out most auxiliary code, leaving the measurement to focus on malicious code. We leverage this to perform an in-depth measurement of the reuse of malicious code between these malware centers as well as StackOverflow. We find that our methodology has high precision in identifying malicious code (93.1%) and provides a contemporary snapshot of malware code reuse across the Web, offering insights into the manners in which this takes place.