Dr Santanu Dash
Academic and research departments
School of Computer Science and Electronic Engineering, Computer Science Research Centre, Surrey Centre for Cyber Security.Publications
With more than two million applications, Android marketplaces require automatic and scalable methods to efficiently vet apps for the absence of malicious threats. Recent techniques have successfully relied on the extraction of lightweight syntactic features suitable for machine learning classification, but despite their promising results, the very nature of such features suggest they would unlikely--on their own--be suitable for detecting obfuscated Android malware. To address this challenge, we propose DroidSieve, an Android malware classifier based on static analysis that is fast, accurate, and resilient to obfuscation. For a given app, DroidSieve first decides whether the app is malicious and, if so, classifies it as belonging to a family of related malware. DroidSieve exploits obfuscation-invariant features and artifacts introduced by obfuscation mechanisms used in malware. At the same time, these purely static features are designed for processing at scale and can be extracted quickly. For malware detection, we achieve up to 99.82% accuracy with zero false positives; for family identification of obfuscated malware, we achieve 99.26% accuracy at a fraction of the computational cost of state-of-the-art techniques.
RTOS components have a direct impact on the instruction cache performance - an aspect that has been reported as a bottleneck in several studies. Therefore, there exists a need for insight into the instruction cache hit rates for various RTOS components at design time for efficient design space exploration. In this paper, we propose a technique to model RTOS components for instruction cache hit rate estimation. Our methods rely on rapid generation of hit rate values for different cache sizes for the RTOS components. We then fit the generated instruction cache hit rates using multivariate regression schemes to account for all parameters influencing the hit rate. The estimation techniques were tested for a large range of cache sizes and numerous parametric as well as non-parametric RTOS components. Comparison with hit rates generated by a cache simulator shows that these models can accurately estimate the hit rates with the mean difference in hit rates ranging from 0.00 to 0.05. The proposed technique also offers significant speed-up as compared to other modeling approaches because it does not require exhaustive cache hierarchy simulation for building the models.
Cache tuning has been shown to achieve considerable energy savings and methods have also been proposed for tuning the cache for standalone embedded applications. However, with the increasing complexity of modern day embedded applications, RTOS based multitasking systems are fast becoming the norm. Therefore, there exists a need for techniques to tune the cache for multitasking systems. In this paper we present a framework for energy centric tuning of the instruction cache for embedded multitasking systems. Our framework is built upon a formal model for characterizing multitasking systems and is suitable for fast instruction cache tuning using loop profiling. We validate our proposed techniques by applying them to tune a predictor based filter cache hierarchy - a common solution for low power embedded systems. For all the multitasking programs tested, our techniques are able to successfully predict configurations that are optimal or near-optimal. The proposed methods are also able to achieve speed-ups of up to an order of magnitude compared to exhaustive design space exploration techniques.
Source code is bimodal: it combines a formal, algorithmic channel and a natural language channel of identifiers and comments. In this work, we model the bimodality of code with name flows, an assignment flow graph augmented to track identifier names. Conceptual types are logically distinct types that do not always coincide with program types. Passwords and URLs are example conceptual types that can share the program type string. Our tool, RefiNym, is an unsupervised method that mines a lattice of conceptual types from name flows and reifies them into distinct nominal types. For string, RefiNym finds and splits conceptual types originally merged into a single type, reducing the number of same-type variables per scope from 8.7 to 2.2 while eliminating 21.9% of scopes that have more than one same-type variable in scope. This makes the code more self-documenting and frees the type system to prevent a developer from inadvertently assigning data across conceptual types.
Estimation of the hit rate curve for an application is the first step in application specific cache tuning. Several techniques have been proposed to meet this objective however most of these have dealt with the data cache with little attention to the instruction cache. In this paper, we propose a novel, lightweight and highly scalable technique for rapid estimation of the instruction cache hit rate curve for a given application. Our technique works at the basic block level and relies on a one-time loop profiling of the weighted control flow graph of the application followed by estimation of the hit rate for different cache sizes. It accounts for the spatial and temporal locality separately and is sensitive to the cache size as well as block size. The proposed technique is highly accurate and when compared with results from an actual cache simulator, the mean error in estimation ranged from 1.11% to 2.46% for the benchmarks tested.
Amorphous Data Parallelism has proven to be a suitable vehicle for implementing concurrent graph algorithms effectively on multi-core architectures. In view of the growing complexity of graph algorithms for information analysis, there is a need to facilitate modular design techniques in the context of Amorphous Data Parallelism. In this paper, we investigate what it takes to formulate algorithms possessing Amorphous Data Parallelism in a modular fashion enabling a large degree of code re-use. Using the betweenness centrality algorithm, a widely popular algorithm in the analysis of social networks, we demonstrate that a single optimisation technique can suffice to enable a modular programming style without loosing the efficiency of a tailor-made monolithic implementation.
With the advent of mobile and handheld devices, power consumption in embedded systems has become a key design issue. Recently, it has been shown that cache requirements of the applications vary widely and a significant amount of energy can be saved by tuning the cache parameters according to the needs of the application. To this end, techniques have been proposed to tune the cache for single-task-based systems but no work has been done to extend these techniques to multitasking applications. In this research work, the authors present novel, lightweight and fast techniques for energy-sensitive tuning of the instruction cache hierarchy for multitasking applications. Cache tuning for real-time operating systems (RTOS)-driven multitasking applications is achieved by intelligently separating the user tasks and RTOS components and profiling them in isolation to identify the nature of loops in them. We then apply the proposed techniques to tune a predictor-based filter cache hierarchy for instructions for both single-task-based applications and RTOS-driven multitasking applications. The proposed techniques are able to identify optimal or near-optimal filter and L1 cache sizes for all the applications tested and are up to an order of magnitude faster than exhaustive cache hierarchy simulation techniques. The proposed techniques are also highly scalable and can be relied upon to predict the instruction cache hit rate for any range of instruction cache sizes after a one-time simulation and profiling.
Android malware is now pervasive and evolving rapidly. Thousands of malware samples are discovered every day with new models of attacks. The growth of these threats has come hand in hand with the proliferation of collective repositories sharing the latest specimens. Having access to a large number of samples opens new research directions aiming at efficiently vetting apps. However, automatically inferring a reference dataset from those repositories is not straightforward and can inadvertently lead to unforeseen misconceptions. On the one hand, samples are often mis-labeled as different parties use distinct naming schemes for the same sample. On the other hand, samples are frequently mis-classified due to conceptual errors made during labeling processes. In this paper, we mine Anti Virus labels and analyze the associations between all labels given by different vendors to systematically unify common samples into family groups. The key novelty of our approach, named EU-PHONY [20], is that no a-priori knowledge on malware families is needed. We evaluate EUPHONY using reference datasets and more than 400 thousands additional samples outside of these datasets. Results show that EUPHONY can accurately label malware with a fine-grained clustering of families, while providing competitive performance against the state-of-the-art.
LCA computation for vertex pairs in trees can be achieved in constant time after linear-time preprocessing. However, extension of these techniques to compute LCA for vertex-pairs in DAGs has been not possible due to the non-tree edges in a DAG. In this paper, we present an algorithm for computing the LCA for vertex pairs in a DAG which treats the DAG's spanning tree and its non-tree edges separately. Our approach enables us to tap the efficiency of existing LCA algorithms for trees. Furthermore, our algorithm decomposes the DAG into a set of component trees called clusters which significantly reduces the preprocessing necessary to incorporate non-tree edges in the LCA computation. Our algorithm seamlessly interpolates the performance graph between the best reported algorithms for trees and the best reported algorithms for DAGs depending on the incidence of non-tree edges in the DAG. Using the proposed techniques, it is possible to achieve near-linear preprocessing and constant query time for sparse DAGs. (C) 2013 Elsevier B.V. All rights reserved.
Advent of microservices has increased the popularity of the API-first design principles. Developers have been focusing on concretising the API to a system before building the system. An API-first approach assumes that the API will be correctly used. Inevitably, most developers, even experienced ones, end-up writing sub-optimal software because of using APIs incorrectly. In this paper, we discuss an automated approach for exploring API equivalence and a framework to synthesise semantically equivalent programs. Unlike existing approaches to API transplantation, we propose an amorphous or formless approach to software translation in which a single API could potentially be replaced by a synthesised sequence of APIs which ensures type progress. Our search is guided by the non-functional goals for the software, a type-theoretic notion of progress, the application's test suite and an automatic multi-modal embedding of the API from its documentation and code analysis.
Malware evolves perpetually and relies on increasingly sophisticated attacks to supersede defense strategies. Data driven approaches to malware detection run the risk of becoming rapidly antiquated. Keeping pace with malware requires models that are periodically enriched with fresh knowledge, commonly known as retraining. In this work, we propose the use of Venn-Abers predictors for assessing the quality of binary classification tasks as a first step towards identifying antiquated models. One of the key benefits behind the use of Venn-Abers predictors is that they are automatically well calibrated and offer probabilistic guidance on the identification of nonstationary populations of malware. Our framework is agnostic to the underlying classification algorithm and can then be used for building better retraining strategies in the presence of concept drift. Results obtained over a timeline-based evaluation with about 90K samples show that our framework can identify when models tend to become obsolete.
The Android ecosystem has witnessed a surge in malware, which not only puts mobile devices at risk but also increases the burden on malware analysts assessing and categorizing threats. In this paper, we show how to use machine learning to automatically classify Android malware samples into families with high accuracy, while observing only their runtime behavior. We focus exclusively on dynamic analysis of runtime behavior to provide a clean point of comparison that is dual to static approaches. Specific challenges in the use of dynamic analysis on Android are the limited information gained from tracking low-level events and the imperfect coverage when testing apps, e.g., due to inactive command and control servers. We observe that on Android, pure system calls do not carry enough semantic content for classification and instead rely on lightweight virtual machine introspection to also reconstruct Android-level inter-process communication. To address the sparsity of data resulting from low coverage, we introduce a novel classification method that fuses Support Vector Machines with Conformal Prediction to generate high-accuracy prediction sets where the information is insufficient to pinpoint a single family.
Smartphone platforms are becoming increasingly complex, which gives way to software vulnerabilities difficult to identify and that might allow malware developers to gain unauthorised privileges through technical exploitation. However, the authors maintain that these types of attacks indirectly renders a number of unexpected behaviours in the system that can be profiled. In this work, the authors present CoME, an anomaly-based methodology aiming at detecting software exploitation in Android systems. CoME models the normal behaviour of a given software component or service and it is capable of identifying any unanticipated behaviour. To this end, they first monitor the normal operation of a given exploitable component through lightweight virtual introspection. Then, they use a multivariate analysis approach to estimate the normality model and detect anomalies. They evaluate their system against one of the most critical vulnerable and widely exploited services in Android, i.e. the mediaserver. Results show that the proposed approach can not only provide a meaningful explanatory of discriminant features for illegitimate activities, but can also be used to accurately detect malicious software exploitations at runtime.
Today, most developers bundle changes into commits that they submit to a shared code repository. Tangled commits intermix distinct concerns, such as a bug fix and a new feature. They cause issues for developers, reviewers, and researchers alike: they restrict the usability of tools such as git bisect, make patch comprehension more difficult, and force researchers who mine software repositories to contend with noise. We present a novel data structure, the -NFG, a multiversion Program Dependency Graph augmented with name flows. A -NFG directly and simultaneously encodes different program versions, thereby capturing commits, and annotates data flow edges with the names/lexemes that flow across them. Our technique, Flexeme, builds a -NFG from commits, then applies Agglomerative Clustering using Graph Similarity to that -NFG to untangle its commits. At the untangling task on a C# corpus, our implementation, Heddle, improves the state-of-the-art on accuracy by 0.14, achieving 0.81, in a fraction of the time: Heddle is 32 times faster than the previous state-of-the-art.
To date, most analysis of collaboration between malware authors has been performed on meta-data and compiled binaries, while ignoring artifacts present in the source code. We collect a vast amount of malicious source code from Underground Forums posts, Underground Forum code attachments, and GitHub repositories and devise a methodology that allows us to filter out most auxiliary code, leaving the measurement to focus on malicious code. We leverage this to perform an in-depth measurement of the reuse of malicious code between these malware centers as well as StackOverflow. We find that our methodology has high precision in identifying malicious code (93.1%) and provides a contemporary snapshot of malware code reuse across the Web, offering insights into the manners in which this takes place.