Wednesday, 29 October 2014

Final Year Students Project - 2014 PROJECT LISTS

NETWORKING AND SECURITY:

1.           ON WIRELESS SCHEDULING ALGORITHMS FOR MINIMIZING THE QUEUE-OVERFLOW PROBABILITY
In this paper, we are interested in wireless scheduling algorithms for the downlink of a single cell that can minimize the queue-overflow probability. Specifically, in a large-deviation setting, we are interested in algorithms that maximize the asymptotic decay rate of the queue-overflow probability as the queue-overflow threshold approaches infinity. We first derive an upper bound on the decay rate of the queue-overflow probability over all scheduling policies. We then focus on a class of scheduling algorithms collectively referred to as the “α-algorithms”. For a given α ≥ 1, the α-algorithm picks for service at each time the user with the largest product of the transmission rate and the backlog raised to the power α. We show that when the overflow metric is appropriately modified, the minimum cost to overflow under the α-algorithm can be achieved by a simple linear path, and it can be written as the solution of a vector-optimization problem. Using this structural property, we then show that as α approaches infinity, the α-algorithms asymptotically achieve the largest decay rate of the queue-overflow probability. Finally, this result enables us to design scheduling algorithms that are both close to optimal in terms of the asymptotic decay rate of the overflow probability and empirically shown to maintain small queue-overflow probabilities over queue-length ranges of practical interest.
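The α-algorithm's per-slot rule is simple to sketch. A minimal Python illustration (the names and example numbers are ours, not the paper's):

```python
def alpha_schedule(rates, backlogs, alpha):
    """Pick the user with the largest rate * backlog**alpha product
    (the alpha-algorithm's per-slot scheduling rule)."""
    return max(range(len(rates)), key=lambda i: rates[i] * backlogs[i] ** alpha)

# One simulated downlink slot: user 0 has the higher rate, user 1 the longer queue.
rates = [10.0, 6.0]
backlogs = [5.0, 8.0]
chosen_low_alpha = alpha_schedule(rates, backlogs, 1)   # 10*5 > 6*8   -> user 0
chosen_high_alpha = alpha_schedule(rates, backlogs, 2)  # 10*25 < 6*64 -> user 1
```

Raising α shifts the rule's weight from instantaneous rate toward backlog, which is exactly why large α suppresses long queues and, per the paper, improves the overflow decay rate.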

2.           EFFICIENT AND DYNAMIC ROUTING TOPOLOGY INFERENCE FROM END-TO-END MEASUREMENTS
Inferring the routing topology and link performance from a node to a set of other nodes is an important component in network monitoring and application design. In this paper we propose a general framework for designing topology inference algorithms based on additive metrics. The framework can flexibly fuse information from multiple measurements to achieve better estimation accuracy. We develop computationally efficient (polynomial-time) topology inference algorithms based on the framework. We prove that the probability of correct topology inference of our algorithms converges to one exponentially fast in the number of probing packets. In particular, for applications where nodes may join or leave frequently, such as overlay network construction, application-layer multicast, and peer-to-peer file sharing/streaming, we propose a novel sequential topology inference algorithm that significantly reduces the probing overhead and can efficiently handle node dynamics. We demonstrate the effectiveness of the proposed inference algorithms via Internet experiments.

KNOWLEDGE AND DATA MINING:

3.           BINRANK: SCALING DYNAMIC AUTHORITY-BASED SEARCH USING MATERIALIZED SUBGRAPHS
Dynamic authority-based keyword search algorithms, such as ObjectRank and personalized PageRank, leverage semantic link information to provide high-quality, high-recall search in databases and on the Web. Conceptually, these algorithms require a query-time PageRank-style iterative computation over the full graph. This computation is too expensive for large graphs, and not feasible at query time. Alternatively, building an index of precomputed results for some or all keywords involves very expensive preprocessing. We introduce BinRank, a system that approximates ObjectRank results by utilizing a hybrid approach inspired by materialized views in traditional query processing. We materialize a number of relatively small subsets of the data graph in such a way that any keyword query can be answered by running ObjectRank on only one of the subgraphs. BinRank generates the subgraphs by partitioning all the terms in the corpus based on their co-occurrence, executing ObjectRank for each partition using the terms to generate a set of random walk starting points, and keeping only those objects that receive non-negligible scores. The intuition is that a subgraph that contains all objects and links relevant to a set of related terms should have all the information needed to rank objects with respect to one of these terms. We demonstrate that BinRank can achieve subsecond query execution time on the English Wikipedia data set, while producing high-quality search results that closely approximate the results of ObjectRank on the original graph. The Wikipedia link graph contains about 10^8 edges, which is at least two orders of magnitude larger than what prior state-of-the-art dynamic authority-based search systems have been able to demonstrate. Our experimental evaluation investigates the trade-off between query execution time, quality of the results, and storage requirements of BinRank.

4.           CLOSENESS: A NEW PRIVACY MEASURE FOR DATA PUBLISHING
The k-anonymity privacy requirement for publishing microdata requires that each equivalence class (i.e., a set of records that are indistinguishable from each other with respect to certain “identifying” attributes) contains at least k records. Recently, several authors have recognized that k-anonymity cannot prevent attribute disclosure. The notion of ℓ-diversity has been proposed to address this; ℓ-diversity requires that each equivalence class has at least ℓ well-represented values for each sensitive attribute. In this article, we show that ℓ-diversity has a number of limitations. In particular, it is neither necessary nor sufficient to prevent attribute disclosure. Motivated by these limitations, we propose a new notion of privacy called “closeness”. We first present the base model, t-closeness, which requires that the distribution of a sensitive attribute in any equivalence class be close to the distribution of the attribute in the overall table (i.e., the distance between the two distributions should be no more than a threshold t). We then propose a more flexible privacy model called (n, t)-closeness that offers higher utility. We describe our desiderata for designing a distance measure between two probability distributions and present two distance measures. We discuss the rationale for using closeness as a privacy measure and illustrate its advantages through examples and experiments.
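The t-closeness check itself is mechanical once a distance measure is fixed. A minimal sketch in Python, using total variation distance as a simple stand-in for the distance measures the paper actually proposes (such as Earth Mover's Distance):

```python
from collections import Counter

def distribution(values):
    """Empirical distribution of a sensitive attribute: {value: probability}."""
    counts = Counter(values)
    n = len(values)
    return {v: c / n for v, c in counts.items()}

def variational_distance(p, q):
    """Total variation distance between two discrete distributions."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(v, 0.0) - q.get(v, 0.0)) for v in support)

def satisfies_t_closeness(table, classes, t):
    """True iff every equivalence class's sensitive-value distribution is
    within distance t of the whole table's distribution."""
    overall = distribution(table)
    return all(variational_distance(distribution(c), overall) <= t
               for c in classes)
```

For a table with sensitive values `["flu", "flu", "flu", "cancer"]` split into classes `[["flu", "flu"], ["flu", "cancer"]]`, both classes sit at distance 0.25 from the overall distribution, so the table satisfies 0.25-closeness but not 0.2-closeness under this measure.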

5.           DATA LEAKAGE DETECTION
We study the following problem: A data distributor has given sensitive data to a set of supposedly trusted agents (third parties). Some of the data is leaked and found in an unauthorized place (e.g., on the web or somebody’s laptop). The distributor must assess the likelihood that the leaked data came from one or more agents, as opposed to having been independently gathered by other means. We propose data allocation strategies (across the agents) that improve the probability of identifying leakages. These methods do not rely on alterations of the released data (e.g., watermarks). In some cases we can also inject “realistic but fake” data records to further improve our chances of detecting leakage and identifying the guilty party.
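The core idea can be sketched very simply: score each agent by how much of the leaked set it had received, and plant per-agent fake records as traps. This is an illustrative simplification, not the paper's probabilistic guilt model:

```python
def guilt_scores(leaked, allocations):
    """Rank agents by the fraction of the leaked set each one had received;
    a simple stand-in for the paper's probabilistic guilt assessment."""
    return {agent: len(leaked & records) / len(leaked)
            for agent, records in allocations.items()}

def plant_fakes(allocations, fakes):
    """Give each agent one distinct 'realistic but fake' record; if a fake
    later appears in a leak, it identifies its recipient uniquely."""
    return {agent: records | {fake}
            for (agent, records), fake in zip(allocations.items(), fakes)}
```

An agent whose allocation covers the entire leaked set is far more suspect than one that could only have supplied a fraction of it, which is exactly the intuition the paper's allocation strategies try to maximize.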

6.           PAM: AN EFFICIENT AND PRIVACY-AWARE MONITORING FRAMEWORK FOR CONTINUOUSLY MOVING OBJECTS
Efficiency and privacy are two fundamental issues in moving object monitoring. This paper proposes a privacy-aware monitoring (PAM) framework that addresses both issues. The framework distinguishes itself from the existing work by being the first to holistically address the issues of location updating in terms of monitoring accuracy, efficiency, and privacy, particularly, when and how mobile clients should send location updates to the server. Based on the notions of safe region and most probable result, PAM performs location updates only when they would likely alter the query results. Furthermore, by designing various client update strategies, the framework is flexible and able to optimize accuracy, privacy, or efficiency. We develop efficient query evaluation/reevaluation and safe region computation algorithms in the framework. The experimental results show that PAM substantially outperforms traditional schemes in terms of monitoring accuracy, CPU cost, and scalability while achieving close-to-optimal communication cost.

7.           P2P REPUTATION MANAGEMENT USING DISTRIBUTED IDENTITIES AND DECENTRALIZED RECOMMENDATION CHAINS
Peer-to-peer (P2P) networks are vulnerable to peers who cheat, propagate malicious code, leech on the network, or simply do not cooperate. Traditional security techniques developed for centralized distributed systems, such as client-server networks, are insufficient for P2P networks precisely because they assume a centralized architecture. The absence of a central authority in a P2P network poses unique challenges for reputation management in the network. These challenges include identity management of the peers, secure reputation data management, Sybil attacks, and, above all, availability of reputation data. In this paper, we present a cryptographic protocol for ensuring secure and timely availability of the reputation data of a peer to other peers at extremely low cost. The past behavior of the peer is encapsulated in its digital reputation, which is subsequently used to predict its future actions. As a result, a peer’s reputation motivates it to cooperate and desist from malicious activities. The cryptographic protocol is coupled with self-certification and cryptographic mechanisms for identity management and countering Sybil attacks. We illustrate the security and the efficiency of the system analytically and by means of simulations in a completely decentralized Gnutella-like P2P network.

8.           MANAGING MULTIDIMENSIONAL HISTORICAL AGGREGATE DATA IN UNSTRUCTURED P2P NETWORKS
A P2P-based framework supporting the extraction of aggregates from historical multidimensional data is proposed, which provides efficient and robust query evaluation. When a data population is published, data are summarized in a synopsis, consisting of an index built on top of a set of subsynopses (storing compressed representations of distinct data portions). The index and the subsynopses are distributed across the network, and suitable replication mechanisms taking into account the query workload and network conditions are employed that provide the appropriate coverage for both the index and the subsynopses.

9.           Deriving Concept-Based User Profiles from Search Engine Logs
User profiling is a fundamental component of any personalization application. Most existing user profiling strategies are based on objects that users are interested in (i.e., positive preferences), but not on the objects that users dislike (i.e., negative preferences). In this paper, we focus on search engine personalization and develop several concept-based user profiling methods that are based on both positive and negative preferences. We evaluate the proposed methods against our previously proposed personalized query clustering method. Experimental results show that profiles which capture and utilize both the user’s positive and negative preferences perform best. An important result from the experiments is that profiles with negative preferences can increase the separation between similar and dissimilar queries. The separation provides a clear threshold for an agglomerative clustering algorithm to terminate, improving the overall quality of the resulting query clusters.
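The positive/negative idea can be sketched in a few lines: clicked results push a concept's weight up, skipped results push it down, so disliked concepts end up with negative weights that widen the gap between dissimilar queries. A toy sketch under our own naming, not the paper's actual profiling method:

```python
def concept_profile(clicked, skipped, concepts):
    """Build a concept-weight profile from positive (clicked) and negative
    (skipped) signals: weight = clicks - skips per concept."""
    profile = {c: 0 for c in concepts}
    for c in clicked:
        profile[c] += 1
    for c in skipped:
        profile[c] -= 1
    return profile

profile = concept_profile(
    clicked=["apple", "apple", "fruit"],   # concepts behind clicked results
    skipped=["computer"],                  # concepts behind skipped results
    concepts=["apple", "fruit", "computer"],
)
```

Here `profile` comes out as `{"apple": 2, "fruit": 1, "computer": -1}`: the negative weight on "computer" is the kind of signal that, per the paper, sharpens the separation used to terminate query clustering.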

10.       Parallelizing Itinerary-Based KNN Query Processing in Wireless Sensor Networks
Wireless sensor networks have been proposed for facilitating various monitoring applications (e.g., environmental monitoring and military surveillance) over a wide geographical region. In these applications, spatial queries that collect data from wireless sensor networks play an important role. One such query is the K-Nearest Neighbor (KNN) query that facilitates collection of sensor data samples based on a given query location and the number of samples specified (i.e., K). Recently, itinerary-based KNN query processing techniques, which propagate queries and collect data along a predetermined itinerary, have been developed. Prior studies demonstrate that itinerary-based KNN query processing algorithms are able to achieve better energy efficiency than other existing algorithms developed upon tree-based network infrastructures. However, how to derive itineraries for KNN query based on different performance requirements remains a challenging problem. In this paper, we propose a Parallel Concentric-circle Itinerary based KNN (PCIKNN) query processing technique that derives different itineraries by optimizing either query latency or energy consumption. The performance of PCIKNN is analyzed mathematically and evaluated through extensive experiments. Experimental results show that PCIKNN outperforms the state-of-the-art techniques.

11.       Towards an Automatic Detection of Sensitive Information in a Database
In order to validate user requirements, tests are often conducted on real data. However, developments and tests are more and more outsourced, leading companies to provide external staff with real confidential data. A solution to this problem is known as Data Scrambling. Many algorithms aim at smartly replacing true data by false but realistic ones. However, nothing has been developed to automate the crucial task of detecting the data to be scrambled. In this paper we propose an innovative approach - and its implementation as an expert system - to achieve the automatic detection of the candidate attributes for scrambling. Our approach is mainly based on semantic rules that determine which concepts have to be scrambled, and on a linguistic component that retrieves the attributes that semantically correspond to these concepts. Since attributes cannot be considered independently from each other, we also address the challenging problem of propagating the scrambling across the whole database. An important contribution of our approach is to provide a semantic modeling of sensitive data. This knowledge is made available through production rules, operationalizing the sensitive data detection.

12.       ViDE: A Vision-Based Approach for Deep Web Data Extraction
Deep Web contents are accessed through queries submitted to Web databases, and the returned data records are enwrapped in dynamically generated Web pages (called deep Web pages in this paper). Extracting structured data from deep Web pages is a challenging problem due to the underlying intricate structures of such pages. Until now, a large number of techniques have been proposed to address this problem, but all of them have inherent limitations because they are Web-page-programming-language dependent. Since Web pages are a popular two-dimensional medium, their contents are always displayed regularly for users to browse. This motivates us to seek a different way to perform deep Web data extraction that overcomes the limitations of previous works by utilizing some interesting common visual features of deep Web pages. In this paper, a novel vision-based approach that is Web-page-programming-language independent is proposed. This approach primarily utilizes the visual features of deep Web pages to implement deep Web data extraction, including data record extraction and data item extraction. We also propose a new evaluation measure, revision, to capture the amount of human effort needed to produce perfect extraction. Our experiments on a large set of Web databases show that the proposed vision-based approach is highly effective for deep Web data extraction.

13.       An UpDown Directed Acyclic Graph Approach for Sequential Pattern Mining
Traditional pattern-growth-based approaches for sequential pattern mining derive length-(k + 1) patterns from the projected databases of length-k patterns recursively. At each level of recursion, they unidirectionally grow the length of detected patterns by one along the suffix of detected patterns, which requires k levels of recursion to find a length-k pattern. In this paper, a novel data structure, the UpDown Directed Acyclic Graph (UDDAG), is invented for efficient sequential pattern mining. UDDAG allows bidirectional pattern growth along both ends of detected patterns. Thus, a length-k pattern can be detected in ⌊log2 k⌋ + 1 levels of recursion at best, which results in fewer levels of recursion and faster pattern growth. When minSup is large, such that the average pattern length is close to 1, UDDAG and PrefixSpan have similar performance because the problem degrades into a frequent-item counting problem. However, UDDAG scales up much better. It often outperforms PrefixSpan by almost one order of magnitude in scalability tests. UDDAG is also considerably faster than Spade and LapinSpam. Except for extreme cases, UDDAG uses memory comparable to PrefixSpan and less memory than Spade and LapinSpam. Additionally, the special feature of UDDAG enables its extension toward applications involving searching in large spaces.
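The recursion-depth claim is easy to check numerically: one-item-per-level growth needs k levels for a length-k pattern, while growth that can double the pattern length per level needs only ⌊log2 k⌋ + 1 in the best case. A small sketch of the two depth formulas:

```python
import math

def prefixspan_levels(k):
    """Unidirectional growth adds one item per recursion level,
    so a length-k pattern needs k levels."""
    return k

def uddag_best_case_levels(k):
    """Bidirectional growth can roughly double the detected pattern length
    per level, giving floor(log2 k) + 1 levels at best."""
    return math.floor(math.log2(k)) + 1
```

For k = 8 this gives 8 levels versus 4; the gap widens with k, which is consistent with the order-of-magnitude scalability advantage reported in the abstract.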

14.       Domain-Driven Classification Based on Multiple Criteria and Multiple Constraint-Level Programming for Intelligent Credit Scoring
Extracting knowledge from the transaction records and the personal data of credit card holders has great profit potential for the banking industry. The challenge is to detect/predict bankrupts and to keep and recruit the profitable customers. However, grouping and targeting credit card customers by traditional data-driven mining often does not directly meet the needs of the banking industry, because data-driven mining automatically generates classification outputs that are imprecise, meaningless, and beyond users’ control. In this paper, we provide a novel domain-driven classification method that takes advantage of multiple-criteria and multiple-constraint-level programming for intelligent credit scoring. The method produces a set of customer scores that makes the classification results actionable and controllable through human interaction during the scoring process. Domain knowledge and experts’ experience parameters are built into the criteria and constraint functions of the mathematical programming, and human-machine conversation is employed to generate an efficient and precise solution. Experiments on various data sets validated the effectiveness and efficiency of the proposed methods.

15.       Feature Selection Using f-Information Measures in Fuzzy Approximation Spaces
The selection of nonredundant and relevant features of real-valued data sets is a highly challenging problem. A novel feature selection method is presented here based on fuzzy-rough sets by maximizing the relevance and minimizing the redundancy of the selected features. By introducing the fuzzy equivalence partition matrix, a novel representation of Shannon’s entropy for fuzzy approximation spaces is proposed to measure the relevance and redundancy of features suitable for real-valued data sets. The fuzzy equivalence partition matrix also offers an efficient way to calculate many more information measures, termed as f-information measures. Several f-information measures are shown to be effective for selecting nonredundant and relevant features of real-valued data sets. This paper compares the performance of different f-information measures for feature selection in fuzzy approximation spaces. Some quantitative indexes are introduced based on fuzzy-rough sets for evaluating the performance of proposed method. The effectiveness of the proposed method, along with a comparison with other methods, is demonstrated on a set of real-life data sets.

MOBILE COMPUTING:

16.       SECURE DATA COLLECTION IN WIRELESS SENSOR NETWORKS USING RANDOMIZED DISPERSIVE ROUTES
Compromised-node and denial-of-service are two key attacks in wireless sensor networks (WSNs). In this paper, we study routing mechanisms that circumvent (bypass) black holes formed by these attacks. We argue that existing multi-path routing approaches are vulnerable to such attacks, mainly due to their deterministic nature. So once an adversary acquires the routing algorithm, it can compute the same routes known to the source, and hence endanger all information sent over these routes. In this paper, we develop mechanisms that generate randomized multipath routes. Under our design, the routes taken by the “shares” of different packets change over time. So even if the routing algorithm becomes known to the adversary, the adversary still cannot pinpoint the routes traversed by each packet. Besides randomness, the routes generated by our mechanisms are also highly dispersive and energy-efficient, making them quite capable of bypassing black holes at low energy cost. Extensive simulations are conducted to verify the validity of our mechanisms.
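The "shares" idea can be illustrated with XOR secret sharing plus an independently drawn route per share: unless the adversary's black hole intercepts every route, it learns nothing about the packet. A toy sketch (the random walk stands in for the paper's dispersive route-generation mechanisms, which it does not reproduce):

```python
import secrets
import random

def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def make_shares(payload, n):
    """Split payload into n XOR shares; all n are needed to reconstruct it."""
    shares = [secrets.token_bytes(len(payload)) for _ in range(n - 1)]
    last = payload
    for s in shares:
        last = xor_bytes(last, s)
    return shares + [last]

def reconstruct(shares):
    out = shares[0]
    for s in shares[1:]:
        out = xor_bytes(out, s)
    return out

def random_route(graph, src, dst, rng, max_hops=1000):
    """Toy randomized route: each share takes an independent random walk,
    so knowing the algorithm does not reveal which links a share used."""
    node, path = src, [src]
    for _ in range(max_hops):
        if node == dst:
            return path
        node = rng.choice(graph[node])
        path.append(node)
    return path
```

Each share is individually indistinguishable from random bytes; only the XOR of all of them recovers the payload, which is why dispersing them over distinct random routes raises the cost of a black-hole attack.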

17.       A Novel Dual-Index Design to Efficiently Support Snapshot Location-Based Query Processing in Mobile Environments
Location-based services have become increasingly popular in recent years. Many applications aim to support a large number of users in a metropolitan area (i.e., dense networks). To cope with this challenge, we present a framework that supports location-based services on MOVing objects in road Networks (MOVNet, for short) [26]. MOVNet’s dual-index design utilizes an on-disk R-tree to store the network connectivity and an in-memory grid structure to maintain moving-object position updates. In this paper, we extend the functionality of MOVNet to support snapshot range queries as well as snapshot k-nearest-neighbor queries. Given an arbitrary edge in the space, we analyze the minimum and maximum number of grid cells that are possibly affected. We show that the maximum bound can be used in snapshot range query processing to prune the search space. We demonstrate via theoretical analysis and experimental results that MOVNet yields excellent performance with various networks while scaling to a very large number of moving objects.

PARALLEL AND DISTRIBUTED SYSTEMS:

18.       PRIVACY-CONSCIOUS LOCATION-BASED QUERIES IN MOBILE ENVIRONMENTS
In location-based services, users with location-aware mobile devices are able to make queries about their surroundings anywhere and at any time. While this ubiquitous computing paradigm brings great convenience for information access, it also raises concerns over potential intrusion into user location privacy. To protect location privacy, one typical approach is to cloak user locations into spatial regions based on user-specified privacy requirements, and to transform location-based queries into region-based queries. In this paper, we identify and address three new issues concerning this location cloaking approach. First, we study the representation of cloaking regions and show that a circular region generally leads to a small result size for region-based queries. Second, we develop a mobility-aware location cloaking technique to resist trace analysis attacks. Two cloaking algorithms, namely MaxAccu_Cloak and MinComm_Cloak, are designed based on different performance objectives. Finally, we develop an efficient polynomial algorithm for evaluating circular-region-based kNN queries. Two query processing modes, namely bulk and progressive, are presented to return query results either all at once or in an incremental manner. Experimental results show that our proposed mobility-aware cloaking algorithms significantly improve the quality of location cloaking in terms of an entropy measure without compromising much on query latency or communication cost. Moreover, the progressive query processing mode achieves a shorter response time than the bulk mode by parallelizing the query evaluation and result transmission.
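The basic cloaking step (hiding an exact position inside a circular region) is easy to sketch. A toy illustration only; it does not reproduce the paper's MaxAccu_Cloak or MinComm_Cloak algorithms, and all names are ours:

```python
import math
import random

def cloak_circle(x, y, r_min, r_max, rng):
    """Cloak an exact location into a circular region that contains it.
    The center is randomly offset so the user need not sit at the center
    (which would leak the exact position); r_min/r_max encode the
    user-specified privacy requirement on region size."""
    radius = rng.uniform(r_min, r_max)
    d = rng.uniform(0, 0.9 * radius)        # keep (x, y) strictly inside
    theta = rng.uniform(0, 2 * math.pi)
    return x + d * math.cos(theta), y + d * math.sin(theta), radius

def region_contains(cx, cy, radius, x, y):
    return math.hypot(x - cx, y - cy) <= radius
```

A region-based query is then evaluated against the returned circle rather than the exact point, trading some result precision for location privacy.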

SECURE COMPUTING:

19.       Using Web-Referral Architectures to Mitigate Denial-of-Service Threats
The web is a complicated graph, with millions of websites interlinked together. In this paper, we propose to use this web sitegraph structure to mitigate flooding attacks on a website, using a new web referral architecture for privileged service (“WRAPS”). WRAPS allows a legitimate client to obtain a privileged URL through a simple click on a referral hyperlink from a website trusted by the target website. Using that URL, the client can get privileged access to the target website in a manner that is far less vulnerable to a distributed denial-of-service (DDoS) flooding attack than normal access would be. WRAPS does not require changes to web client software and is extremely lightweight for referrer websites, which makes its deployment easy. The massive scale of the web sitegraph could deter attempts to isolate a website by blocking all referrers. We present the design of WRAPS and the implementation of a prototype system used to evaluate our proposal. Our empirical study demonstrates that WRAPS enables legitimate clients to connect to a website smoothly in spite of a very intensive flooding attack, at the cost of small overheads on the website’s ISP’s edge routers. We discuss the security properties of WRAPS and a simple approach to encourage many small websites to help protect an important site during DoS attacks.

SOFTWARE ENGINEERING:

20.       Vulnerability Discovery with Attack Injection
The increasing reliance put on networked computer systems demands higher levels of dependability. This is even more relevant as new threats and forms of attack are constantly being revealed, compromising the security of systems. This paper addresses this problem by presenting an attack injection methodology for the automatic discovery of vulnerabilities in software components. The proposed methodology, implemented in AJECT, follows an approach similar to hackers and security analysts to discover vulnerabilities in network-connected servers. AJECT uses a specification of the server’s communication protocol and predefined test case generation algorithms to automatically create a large number of attacks. Then, while it injects these attacks through the network, it monitors the execution of the server in the target system and the responses returned to the clients. The observation of an unexpected behavior suggests the presence of a vulnerability that was triggered by some particular attack (or group of attacks). This attack can then be used to reproduce the anomaly and to assist the removal of the error. To assess the usefulness of this approach, several attack injection campaigns were performed with 16 publicly available POP and IMAP servers. The results show that AJECT could effectively be used to locate vulnerabilities, even on well-known servers tested throughout the years.
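The inject-and-monitor loop at the heart of this methodology can be sketched compactly. The protocol handler, commands, and the planted bug below are all hypothetical stand-ins, not AJECT's actual test-generation algorithms:

```python
def generate_attacks(commands, payloads):
    """Cross protocol commands with boundary-case payloads, in the spirit
    of spec-driven test-case generation."""
    return [f"{cmd} {p}" for cmd in commands for p in payloads]

def inject(server, attacks):
    """Run every attack against a server function and collect those that
    trigger unexpected behavior (an exception stands in for a crash/hang)."""
    suspicious = []
    for attack in attacks:
        try:
            server(attack)
        except Exception:
            suspicious.append(attack)
    return suspicious

def toy_pop3(message):
    """Toy POP3-like handler with a planted length-check vulnerability."""
    cmd, _, arg = message.partition(" ")
    if cmd == "USER" and len(arg) > 8:
        raise ValueError("buffer overflow")   # the planted bug
    return "+OK"
```

Each flagged attack doubles as a reproduction recipe for the triggered anomaly, which is what makes the approach useful for locating and then removing the underlying error.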

21.       On Event-Based Middleware for Location-Aware Mobile Applications
As mobile applications become more widespread, programming paradigms and middleware architectures designed to support their development are becoming increasingly important. The event-based programming paradigm is a strong candidate for the development of mobile applications due to its inherent support for the loose coupling between components required by mobile applications. However, existing middleware that supports the event-based programming paradigm is not well suited to supporting location-aware mobile applications in which highly mobile components come together dynamically to collaborate at some location. This paper presents a number of techniques including location-independent announcement and subscription coupled with location-dependent filtering and event delivery that can be used by event-based middleware to support such collaboration. We describe how these techniques have been implemented in STEAM, an event-based middleware with a fully decentralized architecture, which is particularly well suited to deployment in ad hoc network environments. The cost of such location-based event dissemination and the benefits of distributed event filtering are evaluated.

CLOUD COMPUTING:

22.       An Efficient Hybrid Peer-to-Peer System for Distributed Data Sharing
Peer-to-peer overlay networks are widely used in distributed systems. Based on whether a regular topology is maintained among peers, peer-to-peer networks can be divided into two categories: structured peer-to-peer networks, in which peers are connected by a regular topology, and unstructured peer-to-peer networks, in which the topology is arbitrary. Structured peer-to-peer networks usually can provide efficient and accurate services but need to spend a lot of effort maintaining the regular topology. On the other hand, unstructured peer-to-peer networks are extremely resilient to frequent peer joining and leaving, but this is usually achieved at the expense of efficiency. The objective of this work is to design a hybrid peer-to-peer system for distributed data sharing which combines the advantages of both types of peer-to-peer networks and minimizes their disadvantages. The proposed hybrid peer-to-peer system is composed of two parts: the first is a structured core network that forms the backbone of the hybrid system; the second is made up of multiple unstructured peer-to-peer networks, each of which is attached to a node in the core network. The core structured network can accurately narrow a data lookup down to a certain unstructured network, while the unstructured networks provide a low-cost mechanism for peers to join or leave the system freely. A data lookup operation first checks the local unstructured network, and then the structured network. This two-tier hierarchy decouples the flexibility of the system from its efficiency. Our simulation results demonstrate that the hybrid peer-to-peer system can utilize both the efficiency of the structured peer-to-peer network and the flexibility of the unstructured peer-to-peer networks and achieve a good balance between the two types of networks.
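The two-tier lookup order (local unstructured cluster first, structured core second) can be sketched as follows. The dict-based core is a deliberate stand-in for a real DHT, and the class and method names are ours:

```python
class HybridSystem:
    """Toy two-tier lookup: each core node carries an unstructured cluster;
    the structured core is modeled as a dict mapping keys to owner nodes."""

    def __init__(self):
        self.core_index = {}          # structured tier (DHT stand-in)
        self.clusters = {}            # node name -> {key: value}

    def publish(self, node, key, value):
        self.clusters.setdefault(node, {})[key] = value
        self.core_index[key] = node   # register the owner in the core

    def lookup(self, node, key):
        local = self.clusters.get(node, {})
        if key in local:              # tier 1: cheap local (unstructured) check
            return local[key]
        owner = self.core_index.get(key)
        if owner is None:
            return None
        return self.clusters[owner][key]   # tier 2: routed via the core
```

Peers joining or leaving only churn the cheap unstructured tier; the expensive-to-maintain structured tier is consulted only when the local check misses.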

23.       A Model for Reducing Power Consumption in Peer-to-Peer Systems
Information systems based on the cloud computing model and the peer-to-peer (P2P) model are now becoming popular. In the cloud computing model, a cloud of servers supports thin clients with various types of services, such as Web pages and databases. In the P2P model, on the other hand, every computer is a peer and there is no centralized coordinator. It is becoming more important to discuss how to reduce the total electric power consumption of computers in information systems to realize an eco-society. In this paper, we consider a Web type of application on P2P overlay networks. First, we discuss a model showing how much electric power each server peer consumes to perform Web requests from client peers. Then, we discuss algorithms by which a client peer selects a server peer from a collection of server peers so that the total power consumption is reduced while some constraint, such as a deadline, is satisfied. Lastly, we evaluate the algorithms in terms of total power consumption and throughput compared with traditional round-robin algorithms.
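A deadline-constrained, power-aware selection rule of this kind can be sketched in a few lines. The energy model (power × latency) and the field names are our illustrative assumptions, not the paper's notation:

```python
def select_server(servers, deadline):
    """Choose the server minimizing estimated energy (power * latency)
    among those meeting the deadline; if none can meet it, fall back to
    the fastest server."""
    feasible = [s for s in servers if s["latency"] <= deadline]
    if feasible:
        return min(feasible, key=lambda s: s["power"] * s["latency"])["name"]
    return min(servers, key=lambda s: s["latency"])["name"]
```

With a loose deadline the rule prefers a slow but frugal server; as the deadline tightens it falls back toward round-robin-like latency-driven choices, which is the trade-off the paper's evaluation measures.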

24.       A cloud oriented approach for people-provided services
Web 2.0 started a paradigm shift in content generation. Users spend hours browsing and producing content to share, most of the time freely. During these activities, their laptops and home PCs are considerably underused. By combining this idle processing capacity with increasingly fast Internet access, every computer can be considered a “nearby” computational resource. The question is: what to do with them? This paper explores how these resources can be used to give something back to the user in a secure and private way.

INTERNET AND WEB APPLICATION:

25.       A Solution Model and Tool for Supporting the Negotiation of Security Decisions in E-Business Collaborations
Sharing, comparing and negotiating security related actions and requirements across businesses has always been a complicated matter. Issues arise due to semantic gaps, disparity in security documentation and formats, and incomplete security-related information during negotiations, to say the least. As collaborations amongst e-businesses in particular increase, there is a growing, implicit need to address these issues and ease companies’ deliberations on security. Our research has investigated this topic in substantial detail, and in this paper we present a novel solution model and tool for supporting businesses through these tasks. Initial evaluation results and feedback from interviewed security professionals affirm the use and suitability of our proposals in supporting the security actions negotiation process.


26.       WSOTF: An Automatic Testing Tool for Web Services Composition

This paper presents WSOTF (Web Service composition Online Testing Framework), an automatic conformance testing tool that handles timing constraints from a formal specification (TEFSM: Timed Extended Finite State Machine) of a web services composition, implemented by an online testing algorithm. The algorithm combines the ideas of test execution and debugging to generate and execute test cases simultaneously. In this tool, the complete test scenario (timed test case suite and data) is built during test execution. The tool focuses on unit testing, meaning that only the service composition under test is tested, while all of its partners are simulated by WSOTF. The tool also considers timing constraints and synchronous time delays. It can likewise be used for debugging, which is otherwise difficult when developing a composition of Web services.

27.       TRIC: An Infrastructure for Trust and Reputation Across Virtual Communities
Virtual communities are gradually becoming an inherent part of modern life. Trust and reputation systems, which are considered key enablers of virtual communities, support the accumulation of member reputation information and leverage this information to increase the likelihood of successful member interactions and to better protect the community from fraudulent members. Reputation information is an important part of a user's identity, which makes this data both sensitive and desirable for communities to share. Reputation sharing is also motivated by the individual user, who can bootstrap her standing in new communities using the reputation she has gained in others. In previous work we presented the Cross-Community Reputation (CCR) model for sharing reputation knowledge among communities. The CCR model identifies the fundamental terms required for a meaningful sharing of reputation information among communities and proposes means to make such sharing feasible. This paper presents the TRIC infrastructure for enabling CCR systems. The major concerns of TRIC are to protect the user's rights to privacy and control over her data, and to draw architectural guidelines that can be implemented in more than one way. We explain the major components required for this type of infrastructure and discuss known design alternatives and their tradeoffs in the case of TRIC.

28.       Crawling Online Social Graphs
Extensive research has been conducted on online social networks (OSNs), while little attention has been paid to the data collection process. Due to the large scale of OSNs and their privacy control policies, a partial data set is often used for analysis. The data set analyzed is determined by many factors, including the choice of seeds, the node selection algorithm, and the sample size. These factors may introduce biases and further contaminate or even skew the results. To evaluate the impact of these factors, this paper examines the OSN graph crawling problem, where the nodes are OSN users and the edges are the links (or relationships) among these users. More specifically, by looking at various factors in the crawling process, the following problems are addressed in this paper: 1) Efficiency: how fast different crawlers discover nodes/links; 2) Sensitivity: how different OSNs and the number of protected users affect crawlers; 3) Bias: how major graph properties are skewed. To the best of our knowledge, our simulations on four real-world online social graphs provide the first in-depth empirical answers to these questions.
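The crawling factors discussed above (seeds, node selection, protected users) can be made concrete with a toy breadth-first crawler in which protected users are discovered but never expanded, which is one way partial data sets arise. The graph and the protection rule are invented for illustration.

```python
from collections import deque

def bfs_crawl(graph, seeds, protected=frozenset(), budget=None):
    """Crawl with BFS; protected users are discovered but not expanded
    (their friend lists are hidden). Returns the set of discovered nodes."""
    discovered, queue = set(seeds), deque(seeds)
    expanded = 0
    while queue:
        if budget is not None and expanded >= budget:
            break                          # crawling budget exhausted
        node = queue.popleft()
        if node in protected:              # cannot read a protected user's links
            continue
        expanded += 1
        for neighbor in graph.get(node, ()):
            if neighbor not in discovered:
                discovered.add(neighbor)
                queue.append(neighbor)
    return discovered

graph = {
    "a": ["b", "c"], "b": ["a", "d"], "c": ["a", "e"],
    "d": ["b"], "e": ["c", "f"], "f": ["e"],
}
full = bfs_crawl(graph, ["a"])
partial = bfs_crawl(graph, ["a"], protected={"e"})   # "f" is never reached
```

Even this tiny example shows the bias mechanism the paper studies: protecting one well-placed user hides an entire region of the graph from the crawler.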


29.       Combining the Missing Link: an Incremental Topic Model of Document Content and Hyperlink
The content and structure of linked information, such as sets of web pages or research paper archives, are dynamic and constantly changing. Even though different methods have been proposed to exploit both the link structure and the content information, no existing approach can effectively deal with this evolution. We propose a novel joint model, called Link-IPLSI, to combine texts and links in a topic modeling framework incrementally. The model takes advantage of a novel link updating technique that can cope with dynamic changes of online document streams in a faster and more scalable way. Furthermore, an adaptive asymmetric learning method is adopted to freely control the assignment of weights to terms and citations. Experimental results on two different sources of online information demonstrate the time-saving strength of our method and indicate that our model leads to systematic improvements in the quality of classification and link prediction.

30.       Query-Aware Complex Object Buffer Management in XML Information Retrieval

In this paper, we analyse the data access characteristics of a typical XML information retrieval system and propose a new query-aware buffer replacement algorithm based on prediction of Minimum Reuse Distance (MRD for short). The algorithm predicts an object's next reference distance according to the retrieval system's running status and replaces the objects that have the maximum reuse distances. The factors considered in the replacement algorithm include the access frequency, creation cost, and size of objects, as well as the queries being executed. By taking into account the queries currently running or queuing in the system, the MRD algorithm can more accurately predict the reuse distances of index data objects.
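A minimal sketch of the eviction principle: predict each cached object's next-reuse distance and evict the object with the largest prediction. The scoring rule below (age over frequency, with pending queries forcing a near-zero distance) is an invented stand-in; the paper's predictor also weighs creation cost and object size.

```python
def evict_candidate(cache, pending_queries):
    """cache: {obj_id: {"freq": access count, "age": ticks since last access}}
    pending_queries: ids of objects needed by queries running or queued.
    Returns the object predicted to be reused furthest in the future."""
    def reuse_distance(obj_id):
        stats = cache[obj_id]
        if obj_id in pending_queries:
            return 0.0                     # needed soon: keep it resident
        # rarely and long-ago used objects get large predicted distances
        return (stats["age"] + 1.0) / (stats["freq"] + 1.0)
    return max(cache, key=reuse_distance)

cache = {
    "idx:author": {"freq": 50, "age": 2},
    "idx:title":  {"freq": 3,  "age": 40},
    "doc:42":     {"freq": 1,  "age": 5},
}
# doc:42 is needed by a queued query, so idx:title is evicted instead
victim = evict_candidate(cache, pending_queries={"doc:42"})
```

The pending-query check is the "query-aware" part: a frequency-only policy such as LFU would have evicted `doc:42` just before a queued query needed it.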

31.       Recording How-Provenance on Probabilistic Databases
Tracking data provenance (or lineage) has become increasingly important in many large-scale applications, and a few methods for recording data provenance have been proposed recently. However, most previous work focuses on deterministic databases, with the exception of Trio-style lineage, which targets probabilistic databases; the probabilistic setting is much more challenging because of the exponential growth of possible-world instances and the dependence among intermediate tuples. This paper proposes an approach, named PHP-tree, to model how-provenance over probabilistic databases. We also show how to evaluate probabilities based on a PHP-tree. Compared with Trio-style lineage, our approach is independent of intermediate results and can calculate probabilities in both the restricted and the complete propagation cases of data provenance. Detailed experimental results show the effectiveness, efficiency and scalability of our proposed model.
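To see why the probabilistic setting is harder than the deterministic one, the sketch below evaluates a result tuple's probability from its how-provenance by enumerating possible worlds, assuming independent base tuples. The nested and/or expression format is an invented stand-in for the PHP-tree.

```python
from itertools import product

def provenance_prob(expr, probs):
    """expr: a tuple id (str), or ("and", e1, e2, ...) / ("or", e1, e2, ...).
    probs: {tuple_id: probability}. Enumerates all possible worlds
    (exponential in the number of base tuples; fine for illustration)."""
    ids = sorted(probs)

    def holds(e, world):
        if isinstance(e, str):
            return world[e]
        op, *args = e
        if op == "and":
            return all(holds(a, world) for a in args)
        return any(holds(a, world) for a in args)

    total = 0.0
    for bits in product([False, True], repeat=len(ids)):
        world = dict(zip(ids, bits))
        if holds(expr, world):
            p = 1.0
            for tid in ids:                # independent base tuples
                p *= probs[tid] if world[tid] else 1.0 - probs[tid]
            total += p
    return total

# t3 appears in both join branches, so naive multiplication of branch
# probabilities would double-count it; world enumeration does not.
probs = {"t1": 0.5, "t2": 0.8, "t3": 0.9}
p = provenance_prob(("or", ("and", "t1", "t3"), ("and", "t2", "t3")), probs)
```

Here the correct value is P(t3) * P(t1 or t2) = 0.9 * 0.9 = 0.81, which is exactly what the shared tuple t3 forces and what treating the two branches as independent would get wrong.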

32.       Efficient Peer-to-Peer Similarity Query Processing for High-dimensional Data
Objects such as digital images, text documents or DNA sequences are usually represented in a high-dimensional feature space. A fundamental issue in peer-to-peer (P2P) systems is to support efficient similarity search for high-dimensional data in metric spaces. Prior works suffer from some fundamental limitations, such as poor adaptivity to highly dynamic networks, poor search efficiency under skewed data, and large maintenance overhead. In this study, we propose an efficient scheme, Dragon, to support P2P similarity search in metric spaces. Dragon achieves its efficiency through the following designs: 1) Dragon is based on our previously designed P2P network, Phoenix, which has optimal routing efficiency in dynamic scenarios. 2) We design a locality-preserving naming algorithm and a routing tree for each peer in Phoenix to support range queries, and propose a radius-estimation method to transform a kNN query into a range query. 3) A load-balancing algorithm is given to support efficient query processing under skewed data distributions. Extensive experiments verify the superiority of Dragon over existing works.
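The kNN-to-range transformation mentioned above can be illustrated with a simple quantile rule: choose the radius as the distance to the j-th closest object in a sample, with j scaled down from k by the sampling ratio. This one-dimensional toy is only a sketch, not the paper's radius-estimation method.

```python
def estimate_radius(q, sample, k, population_size):
    """Radius = distance to the j-th closest sampled object, where j is k
    scaled down by the sampling ratio len(sample)/population_size."""
    dists = sorted(abs(q - x) for x in sample)
    j = max(1, round(k * len(sample) / population_size))
    return dists[min(j, len(dists)) - 1]

data = list(range(1000))          # stand-in 1-D "feature space"
sample = data[::10]               # a 10% sample a peer might maintain
r = estimate_radius(500, sample, k=20, population_size=len(data))
# the range query equivalent to the kNN query
hits = [x for x in data if abs(x - 500) <= r]
```

On this uniform toy data the estimated radius of 10 returns 21 objects for k = 20; real systems refine the radius iteratively when the estimate under- or overshoots.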


33.       Two Novel Adaptive Symbolic Representations for Similarity Search in Time Series Databases
Over the last decade, we have seen an increasing level of interest in time series data mining due to its variety of real-world applications. Numerous representation models of time series have been proposed for data mining, including piecewise polynomial models, spectral models, and the recently proposed symbolic models, such as Symbolic Aggregate approXimation (SAX) and its multiresolution extension, indexable Symbolic Aggregate approXimation (iSAX). In spite of their many advantages of dimensionality/numerosity reduction and lower-bounding distance measures, the quality of the SAX approximation is highly dependent on the time series being Gaussian distributed, especially at reduced dimensionalities. In this paper, we introduce a novel adaptive symbolic approach based on the combination of SAX and the k-means algorithm, which we call adaptive SAX (aSAX). The proposed representation greatly outperforms the classic SAX not only on highly Gaussian-distributed datasets but also on datasets lacking a Gaussian distribution, across a variety of reduced dimensionalities. In addition to being competitive with, or superior to, the classic SAX, we extend aSAX to the multiresolution symbolic representation called indexable adaptive SAX (iaSAX). Our empirical experiments with real-world time series datasets confirm the theoretical analyses as well as the efficiency of the two proposed algorithms in terms of tightness of lower bound, pruning power and number of random disk accesses.
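The core aSAX idea, replacing SAX's fixed Gaussian breakpoints with breakpoints learned from the data, can be sketched as PAA followed by one-dimensional k-means, with breakpoints taken as midpoints between adjacent centroids. The plain-Python training loop below is illustrative; the paper's procedure may differ in its details.

```python
def paa(series, segments):
    """Piecewise Aggregate Approximation: mean of each equal-width segment."""
    n = len(series)
    means = []
    for i in range(segments):
        start, end = i * n // segments, (i + 1) * n // segments
        chunk = series[start:end]
        means.append(sum(chunk) / len(chunk))
    return means

def kmeans_1d(values, k, iters=20):
    """Tiny Lloyd's algorithm on scalars; returns sorted centroids."""
    centroids = sorted(values)[:: max(1, len(values) // k)][:k]
    for _ in range(iters):
        clusters = [[] for _ in centroids]
        for v in values:
            idx = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[idx].append(v)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(centroids)

def asax_symbols(series, segments, alphabet):
    means = paa(series, segments)
    centroids = kmeans_1d(means, len(alphabet))
    # adaptive breakpoints: midpoints between adjacent learned centroids
    breaks = [(a + b) / 2 for a, b in zip(centroids, centroids[1:])]
    def symbol(v):
        for i, b in enumerate(breaks):
            if v < b:
                return alphabet[i]
        return alphabet[-1]
    return "".join(symbol(m) for m in means)

word = asax_symbols([0, 0, 0, 0, 5, 5, 5, 5, 10, 10, 10, 10], 3, "abc")
```

Because the breakpoints follow the data rather than the Gaussian quantiles, each symbol covers a roughly equal share of the observed values even when the series is far from normally distributed.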

34.       An Asymmetric Similarity Measure for Tag Clustering on Flickr
Web 2.0 tools and environments have made tagging, the act of assigning keywords to online objects, a popular way to annotate shared resources. The success of now-prominent tagging systems makes tagging "the natural way for people to classify objects as well as an attractive way to discover new material". One of the most challenging problems is to harvest the semantics from these systems, which can support a number of applications, including tag clustering and tag recommendation. We conduct detailed studies on different types of tag relations and tag similarity measures, and propose a scalable measure that we name the Reliability Factor Similarity Measure (RFSM). We compare it with two other measures of similar scalability by integrating them into hierarchical clustering methods and performing tag clustering on a subset of Flickr data. The results suggest that RFSM outperforms those two measures when applied to tag clustering. We also present an alternative way of utilizing discovered tag relations to set up tag-refining rules in order to deal with noise in the initial tag sets, which can in turn improve the precision of tag relations.
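RFSM itself is not defined in the abstract, so the sketch below shows a generic asymmetric tag similarity of the kind such work builds on: sim(a→b) = |objects tagged both a and b| / |objects tagged a|, i.e. how strongly tag a implies tag b. The data is invented for illustration.

```python
def asymmetric_sim(tag_a, tag_b, photos_by_tag):
    """Conditional co-occurrence: fraction of tag_a's photos also carrying
    tag_b. Note sim(a, b) != sim(b, a) in general."""
    a = photos_by_tag.get(tag_a, set())
    b = photos_by_tag.get(tag_b, set())
    if not a:
        return 0.0
    return len(a & b) / len(a)

photos_by_tag = {
    "eiffeltower": {1, 2, 3},
    "paris": {1, 2, 3, 4, 5, 6},
}
# every eiffeltower photo is also tagged paris, but not vice versa
to_paris = asymmetric_sim("eiffeltower", "paris", photos_by_tag)
to_tower = asymmetric_sim("paris", "eiffeltower", photos_by_tag)
```

The asymmetry is the point: "eiffeltower" strongly implies "paris" (a specialization relation), while the reverse implication is weak, which a symmetric cosine or Jaccard measure cannot express.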

35.       Effective Clustering of Dense and Concentrated Online Communities

Most clustering algorithms separate large-scale online communities into several meaningful sub-communities by extracting cut points and cut edges. However, these algorithms are not effective on dense and concentrated graphs, which do not have any meaningful cut points. Common problems with previous algorithms are as follows. First, the first cluster is too large, as it may contain many incompatible users. Second, the quality and purity of the clusters are very low. Third, only the dominant first cluster is found to be meaningful. To address these problems, we first propose a graph transformation to separate large-scale online communities into two different types of meaningful subgraphs: an intimacy graph and a reputation graph. Then, we present effective algorithms for discovering good sub-communities and for excluding incompatible users in these subgraphs. The experimental results show that our algorithms extract more suitable and meaningful sub-communities than previous work on dense online networks.

36.       Graph-based Recommendation on Social Networks
Recommender systems have emerged as an essential response to rapidly growing digital information, in which users find it more and more difficult to locate the right information at the right time. Web 2.0 systems allow users not only to rate resources but also to assign tags to them. Tags play a significant role in Web 2.0: they can be used for navigation, browsing, recommendation and so on. In this paper, we propose a novel recommendation algorithm based on social networks. The social network is established among users and items, taking into account both rating and tag information. We consider users' co-tagging behaviors and add similarity relationships to the graph to enhance performance. Our algorithm is based on Random Walk with Restarts, but provides a more natural and efficient way to represent social networks. By considering the influence of tags, the transition matrix becomes denser and the recommendation more accurate. By evaluating our new algorithm on a real-life dataset against a baseline algorithm used in many real-world recommender systems, we conclude that our method outperforms the baseline.
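The Random Walk with Restarts primitive that the algorithm builds on can be sketched with simple power iteration over a mixed user/item graph; in the paper the graph also carries tag-derived similarity edges, which the toy example below omits.

```python
def rwr(graph, restart_node, alpha=0.15, iters=100):
    """graph: {node: [neighbors]}, undirected. Returns steady-state visit
    probabilities of a walker that restarts at restart_node with
    probability alpha at each step."""
    nodes = sorted(graph)
    p = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        nxt = {n: (alpha if n == restart_node else 0.0) for n in nodes}
        for n in nodes:
            share = (1.0 - alpha) * p[n] / len(graph[n])
            for m in graph[n]:            # spread mass to neighbors
                nxt[m] += share
        p = nxt
    return p

graph = {
    "user:ann": ["item:a", "item:b"],
    "item:a": ["user:ann", "user:bob"],
    "item:b": ["user:ann"],
    "user:bob": ["item:a", "item:c"],
    "item:c": ["user:bob"],
}
scores = rwr(graph, "user:ann")
# items near ann in the graph score higher than items reachable only via bob
```

Ranking unseen items by these scores is the recommendation step; adding co-tagging edges, as the paper does, simply gives the walker more (and better-weighted) paths between related users and items.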

37.       Improving a News Recommendation System in Adapting to Interests of a User with Storage of a Constant Size
It is desirable to have a system that can recommend news articles according to the interests of a user, which change over time. In this paper, we improve a news recommendation system based on supervised classification by integrating a clustering method into the system, in order to adapt flexibly to the varied interests of a user. To follow changes of interest over time, not only the classifier but also the clustering module of the system must be easily updatable. We construct clusters in the feature space by combining one-dimensional clusters, so that the storage needed for updating clusters remains constant in size. We let the data distribution in each one-dimensional space be influenced by the clustering results from another one-dimensional space, thereby taking into account the data distribution in the original multidimensional space. The main contribution of this paper is a method that achieves both goals: improving the performance of a news recommendation system and keeping the amount of data held for updating clusters constant. Experimental results are shown, and the strong and weak points of our proposal are discussed.

38.       Dynamically Analyzing Time Constraints in Workflow Systems with Fixed-Date Constraint
In workflow management systems (WFMSs), time management plays an essential role in controlling the lifecycle of business processes. In particular, run-time analysis of time constraints is necessary to help process managers proactively detect possible deadline violations and handle them appropriately. Traditional time constraint analyses either present deterministic results, which are too restrictive in highly uncertain workflow processes, or only consider static analysis at workflow build time. To address this issue, this paper proposes a dynamic approach for analyzing time constraints during process execution. Specifically, based on a Petri-net-extended stochastic model, the approach first analyzes activity instances' continuous probabilities of satisfying time constraints when a process instance is initiated. Afterwards, during the execution of this process instance, the approach dynamically updates these probabilities whenever an activity instance is completed. Finally, an example process instance from a real-world WFMS demonstrates the practicality of our approach.

39.       Retrieving Cultural Heritage Information with Google Earth
In recent years, a great number of digital archives have been built in the field of cultural heritage preservation. Besides providing a systematic way to store and handle digitized versions of valuable artifacts and archival items, these archives allow scientists, preservation officers and other professionals to access documents and information that may otherwise be unavailable. Opening cultural archives for interoperation as well as public access is often a desirable objective. For this purpose, a user interface everybody knows and understands would be a good choice. In this paper, we propose to use Google Earth as a search and retrieval client for public access to digital archives. A mapping from a metadata model comprising geospatial, temporal, and thematic tags onto Google's KML is presented. In this way, Google Earth can easily be extended with cultural heritage information as stored in one or more digital archives. Any user around the world can use Google Earth to locate, display and access the desired information by simply downloading and installing the appropriate KML file.
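A minimal version of the proposed mapping is a script that turns archive records into KML Placemarks. The record fields below (title, lat, lon, year, description) are an assumed metadata subset; a real archive would map its full geospatial, temporal, and thematic tags.

```python
from xml.sax.saxutils import escape

def records_to_kml(records):
    """Render archive records as a KML document Google Earth can load.
    Note KML coordinates are longitude,latitude order."""
    placemarks = []
    for r in records:
        placemarks.append(
            "  <Placemark>\n"
            f"    <name>{escape(r['title'])}</name>\n"
            f"    <description>{escape(r['description'])} "
            f"({r['year']})</description>\n"
            f"    <Point><coordinates>{r['lon']},{r['lat']},0"
            "</coordinates></Point>\n"
            "  </Placemark>"
        )
    return ('<?xml version="1.0" encoding="UTF-8"?>\n'
            '<kml xmlns="http://www.opengis.net/kml/2.2">\n<Document>\n'
            + "\n".join(placemarks) + "\n</Document>\n</kml>\n")

records = [{"title": "Roman Forum inscription", "lat": 41.8925,
            "lon": 12.4853, "year": 1898,
            "description": "Digitized photograph from the archive"}]
kml = records_to_kml(records)
```

Saving `kml` to a `.kml` file and opening it in Google Earth places one clickable marker per record, which is the whole interaction model the paper proposes.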

40.       Hash Tree: A New Hybrid Index for Flash Disks
Flash disks have become a popular alternative to magnetic disks due to their fast I/O speed and other features such as small size, shock resistance, energy efficiency and non-volatility. However, flash disks also exhibit out-of-place updates and asymmetric I/O latencies for read, write, and erase operations, which introduce new challenges for indexing mechanisms on flash disks. Traditional index structures do not take these flash I/O characteristics into account and therefore perform poorly. To address this problem, we present a new hybrid index structure for flash disks called HashTree, which aims at better update performance while keeping relatively high search efficiency. HashTree uses a hash-based index to split the indexed records into several buckets, and a new tree structure named FTree to organize the records in each bucket. Compared with previous tree-based indexes, our policy reduces the cost of maintaining the hierarchical tree structure. We also introduce a tuning mechanism into HashTree to obtain an appropriate trade-off between search performance and update performance. We conducted experiments on a commercial SSD to evaluate the performance of HashTree. Our experimental results show that HashTree outperforms competitors such as BFTL and FD-tree in terms of both update performance and overall performance.
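The hybrid layout can be caricatured as follows: hash each key into a bucket, keep the bucket's records in a small ordered structure, and buffer insertions per bucket so they can be flushed in batches, which suits flash's preference for batched writes over in-place updates. The real HashTree uses an FTree per bucket; this stand-in uses a plain sorted list.

```python
import bisect

class HashIndex:
    """Toy hash-partitioned index with per-bucket write buffering."""

    def __init__(self, buckets=4, buffer_limit=8):
        # each bucket: (sorted flushed keys, unflushed write buffer)
        self.buckets = [([], []) for _ in range(buckets)]
        self.n = buckets
        self.buffer_limit = buffer_limit

    def insert(self, key):
        sorted_keys, buf = self.buckets[hash(key) % self.n]
        buf.append(key)
        if len(buf) >= self.buffer_limit:   # batch flush, not per-key I/O
            for k in buf:
                bisect.insort(sorted_keys, k)
            buf.clear()

    def contains(self, key):
        sorted_keys, buf = self.buckets[hash(key) % self.n]
        if key in buf:                       # check unflushed writes first
            return True
        i = bisect.bisect_left(sorted_keys, key)
        return i < len(sorted_keys) and sorted_keys[i] == key

idx = HashIndex()
for k in range(100):
    idx.insert(k)
```

Hashing shrinks each ordered structure (cheaper maintenance), and the buffer size plays the role of the paper's tuning knob between update and search cost.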

41.       Service Collaboration Network: A Novel Mechanism for Web Service Management
This paper proposes a novel approach to service management and monitoring on the basis of service collaboration relations extracted from service composition history. Accordingly, an undirected, weighted service collaboration network is constructed. On this network, two abstractions are defined: service centrality, which measures a service's collaboration capability relative to other services over a long period, and service activeness, which measures the frequency of a service's collaborations over a short period. The capability for service management among Open APIs is investigated and demonstrated. Experimental data were collected from Programmable Web, covering the period from 14 September 2005 to 21 September 2009. Experiments show that the defined metrics properly reflect the service properties.
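The two metrics can be sketched directly: long-term weighted degree for centrality, and a recent-window collaboration count for activeness. The edge format `(service_a, service_b, weight, day)` and the data are invented for illustration; the paper's exact definitions may differ.

```python
from collections import defaultdict

def centrality(edges):
    """edges: [(svc_a, svc_b, weight, day)].
    Long-period measure: weighted degree over the whole history."""
    deg = defaultdict(float)
    for a, b, w, _day in edges:
        deg[a] += w
        deg[b] += w
    return dict(deg)

def activeness(edges, now, window):
    """Short-period measure: collaborations touching each service
    within the last `window` days."""
    act = defaultdict(int)
    for a, b, _w, day in edges:
        if now - window <= day <= now:
            act[a] += 1
            act[b] += 1
    return dict(act)

edges = [("maps", "geocode", 5.0, 10), ("maps", "photos", 2.0, 90),
         ("photos", "geocode", 1.0, 95)]
cent = centrality(edges)                      # maps leads over the long run
act = activeness(edges, now=100, window=14)   # photos is most active lately
```

The example shows why both views matter: the historically central service and the currently active one need not be the same.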

42.       Suggesting Topic-Based Query Terms as You Type

Query term suggestion, which interactively expands queries, is an indispensable technique for helping users formulate high-quality queries and has attracted much attention in the web search community. Existing methods usually suggest terms based on statistics from documents, query logs and external dictionaries, and they neglect the fact that topic information is crucial for retrieving topically relevant documents. To provide instant gratification, we propose a novel term suggestion method: as the user types in a query letter by letter, we suggest terms that are topically coherent with the query and can instantly retrieve relevant documents. To effectively suggest highly relevant terms, we propose a generative model that incorporates the topical coherence of terms; the model learns topics from the underlying documents using Latent Dirichlet Allocation (LDA). To achieve instant query suggestion, we use a trie structure to index and access terms, and devise an efficient top-k algorithm to suggest terms as users type. Experimental results show that our approach not only improves the effectiveness of term suggestion, but also achieves better efficiency and scalability.
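The trie-plus-top-k machinery can be sketched as follows; the per-term scores that stand in for the LDA-derived topical coherence are fixed toy weights.

```python
import heapq

class Trie:
    """Character trie mapping terms to scores, with top-k prefix lookup."""

    def __init__(self):
        self.root = {}

    def insert(self, term, score):
        node = self.root
        for ch in term:
            node = node.setdefault(ch, {})
        node["$"] = (score, term)        # terminal marker holds (score, term)

    def top_k(self, prefix, k):
        node = self.root
        for ch in prefix:                # walk down to the prefix node
            if ch not in node:
                return []
            node = node[ch]
        found, stack = [], [node]
        while stack:                     # collect all terms below the prefix
            n = stack.pop()
            for key, child in n.items():
                if key == "$":
                    found.append(child)
                else:
                    stack.append(child)
        return [t for _s, t in heapq.nlargest(k, found)]

trie = Trie()
for term, score in [("information", 0.9), ("inference", 0.7),
                    ("index", 0.4), ("topic", 0.8)]:
    trie.insert(term, score)
suggestions = trie.top_k("in", 2)
```

In the paper the scores would come from topical coherence with the partial query, and the top-k traversal would prune rather than enumerate the whole subtree, but the access pattern per keystroke is the same.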

43.       Optimizing Query Processing for the Hidden Web
The term Deep Web (sometimes also called the Hidden Web) refers to data content that is created dynamically as the result of a specific search on the Web. Such content resides outside web pages and is only accessible through interaction with the web site, typically via HTML forms. The size of the Deep Web is believed to be several orders of magnitude larger than that of the so-called Surface Web, i.e., the web that is accessible and indexable by search engines.


44.       A Metric for Measuring Members’ Contribution to Information Propagation in Social Network Sites
The phenomenon of propagation is universal in our daily life. For example, infectious diseases can be transmitted from one person to another, hot news is disseminated widely on the Internet, and classic passages written by popular users can be shared by many other users on online social network sites. With the emergence of online social network sites such as Facebook, YouTube, and Flickr, many studies have attempted to analyse patterns of information propagation and to design effective viral marketing strategies on these sites. However, few metrics have been designed to measure the characteristics of information propagation. In this paper, we propose a novel metric for measuring members' contribution to information propagation in online social network sites. As a case study, we analyse large-scale traces of members' contribution to photo dissemination in Flickr and find that the distribution of members' contribution follows a power law, which reveals that most of the information is created and propagated by a few members.
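The measurement itself is straightforward to sketch: count the propagation events each member is responsible for and inspect the skew of the resulting distribution. The event log below is invented; in the paper the events are photo disseminations in Flickr.

```python
from collections import Counter

def contribution(events):
    """events: [(member, item)] propagation log -> events per member."""
    return Counter(member for member, _item in events)

events = [("ann", "p1"), ("ann", "p2"), ("ann", "p3"), ("ann", "p4"),
          ("bob", "p1"), ("bob", "p5"), ("cat", "p2"), ("dan", "p1")]
contrib = contribution(events)
# share of all propagation driven by the single most active member
top_share = contrib.most_common(1)[0][1] / len(events)
```

Even in this toy log, one member accounts for half the propagation; at Flickr scale the paper finds the same heavy skew, fitting a power-law distribution.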

45.       Domain-oriented Deep Web Data Sources’ Discovery and Identification
As the Deep Web contains a tremendous number of well-structured data sources, integrating them has become a hotspot in current research. Accurately discovering and identifying Deep Web data sources related to a specific domain have become key issues. We propose a Domain-Oriented Deep Web data source Discovery method (DO-DWD) and a novel Domain Identification strategy for Deep Web data sources (DIDW). In the discovery stage, we use machine learning algorithms and some heuristic rules to find the query interfaces of the data sources; in the identification stage, we identify Deep Web data sources associated with the domain by calculating the relevance between a query interface and the domain based on semantic similarity. Finally, extensive experiments on a real data set show that DO-DWD and DIDW achieve high correctness and accuracy.

46.       Towards Video Management over Relational Database
Video has become popular in our daily life for both professional and consumer applications. Both low-level video processing and high-level semantic video analysis are computationally intensive tasks in many application domains. Most current video computing tools are developed for specific analytic tasks; they lack higher-level interoperability with databases and treat the database merely as a relational storage engine rather than an analytic platform, which causes inefficient data access and massive data movement. In this paper, we study how to support video data management over a relational database, and present our initial solutions for video data storage, video data access, and efficient video analytics. We also describe our ongoing prototype system, HybVideo, which is built on a novel architecture and integrates the above solutions to tackle the major challenge of providing a platform for both the storage and analysis of video data.

47.       Phosphor: A Cloud based DRM Scheme with SimCard
As 3G networks provide enhanced data transport capabilities, a considerable number of mobile applications and services involving masses of unstructured digital content, e.g., video and audio, have become available. Meanwhile, piracy and illegal distribution of this digital content are severe issues. Digital Rights Management (DRM) aims at protecting unstructured digital content from abuse by regulating its usage. However, existing mobile DRM schemes suffer to varying degrees from poor compatibility, poor practicality or high cost. Besides, to the best of our knowledge, few existing DRM schemes address efficient unstructured data management. In this paper, we propose Phosphor, a cloud-based mobile DRM scheme using the SIM card. In detail, an unstructured data management system is adopted for efficient data management services in the DRM backend (servers and systems). Meanwhile, the SIM card is introduced into Phosphor, which not only reduces cost but also provides higher security. We have implemented our DRM scheme, demonstrating that Phosphor is efficient, secure and practicable.

SERVICES COMPUTING:

48.       Rule-Based Coordination of Distributed Web Service Transactions

Current approaches to transactional support of distributed processes in service-oriented environments are limited to scenarios where the participant initiating the process maintains a controlling position throughout the lifetime of the process. This constraint impedes support of complex processes where participants may only possess limited local views of the overall process. Furthermore, there is little support for dynamic aspects: the failure or exit of participants usually leads to cancellation of the whole process. In this paper, we address these limitations by introducing a framework that strengthens the role of the coordinator and allows for largely autonomous coordination of dynamic processes. We first discuss motivating examples and analyze existing approaches to transactional coordination. Subsequently, we present our framework, TracG, which is based on WS-BusinessActivity. At its core, it contains a set of rules for deciding on the ongoing confirmation or cancellation status of participants' work, along with protocol extensions for monitoring the progress of a process. Various types of participant vitality for a process are distinguished, facilitating the controlled exit of nonvital participants as well as the continuation of a process in the case of tolerable failures. The implementation of the framework is presented and discussed with regard to interoperability issues.