Project outline: The project focused on efficient image recognition (computer vision, machine learning) for fast storage and retrieval of images by visual content, i.e., query(myimage) retrieves images that are visually similar to myimage. For this purpose, a visual feature extraction process was applied to both the query image (myimage) and the images in the database; it consisted of creating a feature vector with a numerical value for each feature, such as colour and texture. Images in the database were stored along with their feature vectors.
To further improve performance, the database was partitioned among several computing-storage nodes to allow parallel and distributed processing and scalability. Scalability was two-fold: in the number of queries and in the number of computing-storage nodes.
The partitioning of images was a novel similarity content-based scheme built on a two-step machine-learning strategy. In the first step, a coarse partitioning distributes image data among the available nodes so that similar images group together on the same node. The second step groups the images locally on each node by means of clustering. A combination of a k-nearest-neighbour classifier for the coarse step and a radial-basis-function neural network for the clustering step was found to perform best. At search time, the strategy reduces the search space and also allows parallel execution of multiple image queries. The system makes a critical 'intelligent' decision in determining the candidate database node(s) where the best matches are most likely to be found.
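A minimal sketch of the two-step idea in Python, using scikit-learn and substituting k-means for the RBF-network clustering stage; it assumes coarse node labels (e.g., derived from the image class hierarchy) are available for training, and names such as two_step_partition are illustrative, not the project's actual code.

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.cluster import KMeans

    def two_step_partition(features, node_labels, n_nodes, clusters_per_node=4):
        # Step 1 (coarse): a k-NN classifier learns which node each feature
        # vector belongs to, so similar images land on the same node.
        knn = KNeighborsClassifier(n_neighbors=5).fit(features, node_labels)
        assigned = knn.predict(features)
        # Step 2 (fine): group the images locally on each node by clustering.
        local_models = {}
        for node in range(n_nodes):
            subset = features[assigned == node]
            k = min(clusters_per_node, len(subset))
            if k > 0:
                local_models[node] = KMeans(n_clusters=k, n_init=10).fit(subset)
        return knn, local_models

    def route_query(query_vec, knn, local_models):
        # The 'intelligent' decision: pick the candidate node, then the local
        # cluster, so the search shrinks to one cluster on one node.
        node = int(knn.predict(query_vec.reshape(1, -1))[0])
        cluster = int(local_models[node].predict(query_vec.reshape(1, -1))[0])
        return node, cluster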
To evaluate our strategy, an image retrieval system was developed based on a set of hierarchical image classes and on global image features such as colour histograms and texture expressed as the magnitude response of Gabor filters. A feature extractor exploits two levels of parallelism: idle nodes in the network and the potential availability of multiprocessor nodes. In addition, a Web image collector was implemented. The main advantage of the partitioning approach is that it is scalable in both the number of computers and the number of queries, allowing queries to proceed in parallel.
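A minimal sketch of such a global feature vector, assuming NumPy and scikit-image are available and that rgb is an HxWx3 uint8 image with gray its greyscale version; the bin count, frequencies, and orientations are illustrative placeholders, not the project's actual parameters.

    import numpy as np
    from skimage.filters import gabor

    def feature_vector(rgb, gray):
        # Colour: one normalised 16-bin histogram per RGB channel.
        colour = [np.histogram(rgb[..., c], bins=16, range=(0, 255))[0]
                  for c in range(3)]
        colour = np.concatenate(colour) / rgb[..., 0].size
        # Texture: mean magnitude response of a small Gabor filter bank.
        gray = np.asarray(gray, dtype=float)
        texture = []
        for frequency in (0.1, 0.3):
            for theta in (0, np.pi / 4, np.pi / 2, 3 * np.pi / 4):
                real, imag = gabor(gray, frequency=frequency, theta=theta)
                texture.append(np.sqrt(real**2 + imag**2).mean())
        return np.concatenate([colour, np.array(texture)])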
The Pareto-front technique was used to evaluate the effectiveness of the image data distribution approach, which allowed us to configure the system to obtain above 80% query quality. The specific objective of this evaluation was to explore trade-offs between the number of nodes in the network, the distribution of images, and the quality of queries. The evaluation can be carried out before the actual distribution takes place. This may be of great importance, since the trade-off analysis can help decide which distribution achieves the best balance according to the specific needs of the application (the permissible range of query quality) and the computing resources available. The evaluation can also be performed post-distribution, which may be of great significance for a distributed system where the number of classes (or images), the number of nodes, or the required query quality has changed significantly and the system needs to explore the best way to be updated.
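A minimal sketch of the Pareto-front selection over candidate configurations, assuming each configuration records its node count (fewer is better) and a measured query quality (higher is better); the tuples below are made-up illustrations, not measured results.

    def pareto_front(configs):
        # Keep the non-dominated (n_nodes, quality) configurations.
        front = []
        for nodes, quality in configs:
            dominated = any(n2 <= nodes and q2 >= quality
                            and (n2, q2) != (nodes, quality)
                            for n2, q2 in configs)
            if not dominated:
                front.append((nodes, quality))
        return sorted(front)

    # Hypothetical trade-off analysis, runnable before or after distribution.
    candidates = [(4, 0.78), (6, 0.83), (8, 0.84), (6, 0.75), (10, 0.84)]
    print(pareto_front(candidates))   # -> [(4, 0.78), (6, 0.83), (8, 0.84)]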
Fields: Machine Learning, Computer Vision, Distributed Systems, Parallel Computing, Optimisation.
Technologies used: computer vision, machine learning, distributed systems and parallel computing, clustering and classification; probabilistic and statistical models, dimensionality-reduction techniques, data structures for highly dynamic, diverse, and high-dimensional data sets; the Parallel Virtual Machine framework; multi-threaded programming, C++, Python.
An overview of the architecture of the search engine (SE).
A Web image robot was implemented for collecting images from the Web.
Interaction of various software components of the SE.
A hierarchical set of image categories under consideration.
A training/testing set, used to evaluate the strategy, was built from images collected from the Web.
Image data distribution model using Machine Learning (classification and clustering).
To build this model, the novel two-step similarity content-based partitioning strategy described in the project outline above was applied, and its effectiveness was evaluated with the Pareto-front technique.
(This project is also presented in our "Experience - Mobile Computing" section under the title Data Collection from Sensors through Mobile Phones for the Automotive Industry.)
The aim of this project was to identify vehicle features that could be enhanced to improve performance in future vehicle designs.
A large data set was collected from a fleet of different types of vehicles over a predetermined set of trips. Data was collected from several sensors located in different parts of each vehicle while driving. A mobile app was used to collect the data from each vehicle and subsequently upload it to the cloud for further processing. The data collected included petrol consumption, travel distance, brake usage, and gear-change timing, among others.
A wide range of machine learning algorithms and tools was applied to extract insights from the data for the automotive industry.
A small cluster running the Hadoop ecosystem was set up and used, in addition to the cloud.
BDSP is a web system that allows users to perform Big Data tasks from any device with Internet access. BDSP uses web services to integrate different data sources and processing tasks, which are automatically mapped onto software analysis tools such as Mahout and NLTK and run on a MapReduce/Hadoop cluster.
We reported the use of BDSP on these applications:
sentiment analysis over tweets from Twitter (see the sketch after this list)
K-means clustering
data dimensionality reduction
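As an illustration of the first task, here is a minimal, single-machine sketch of tweet sentiment scoring with NLTK's VADER analyser; in BDSP the equivalent task was specified through the web interface and run on the Hadoop cluster, and the sample tweets below are made up.

    import nltk
    from nltk.sentiment import SentimentIntensityAnalyzer

    nltk.download("vader_lexicon")          # one-off lexicon download
    sia = SentimentIntensityAnalyzer()

    tweets = ["I love this phone, the battery lasts forever!",
              "Worst customer service I have ever experienced."]
    for tweet in tweets:
        # compound score lies in [-1, 1]: negative < 0 < positive
        print(sia.polarity_scores(tweet)["compound"], tweet)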
At the time of development, BDSP was comparable only to Databricks (https://databricks.com/), in that both were web systems.
BDSP was designed to serve as a tool for rapid-prototyping of Big Data projects, as a base platform to be tuned and extended according to need, and as a training vehicle both in data analysis and in developing software for Big Data tasks. BDSP was published thus:
BDSP: A Big Data Start Platform. ASONAM 2015: IEEE/ACM Intl. Conf. on Advances in Social Networks Analysis and Mining, Paris, France, August 27-28, 2015, pp. 1110–1117.
MapReduce is a programming model and execution environment developed by Google to process very large amounts of data (exabytes) in parallel on off-the-shelf clusters; it (1) does not require parallel programming, (2) is fault tolerant, and (3) balances the workload transparently to the programmer.
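A minimal, single-process sketch of the model using the canonical word-count example; a real Hadoop job distributes the map and reduce phases across the cluster, but the three stages are the same.

    from collections import defaultdict

    def map_phase(document):
        # map: emit a (word, 1) pair for every word in the input split
        for word in document.split():
            yield word.lower(), 1

    def shuffle(pairs):
        # shuffle: the framework groups all values by key between phases
        groups = defaultdict(list)
        for key, value in pairs:
            groups[key].append(value)
        return groups

    def reduce_phase(key, values):
        # reduce: combine all values emitted for one key
        return key, sum(values)

    docs = ["the quick brown fox", "the lazy dog"]
    pairs = [kv for doc in docs for kv in map_phase(doc)]
    counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
    print(counts)                     # {'the': 2, 'quick': 1, ...}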
Hive is a data warehouse developed by Facebook that provides a database infrastructure on top of MapReduce, translating SQL queries into MapReduce jobs (pairs of map and reduce functions).
We optimised the SQL queries produced by Hive by reducing the number of MapReduce jobs whenever possible. The Hive-QL compiler was extended to carry out the optimisations, which were evaluated by running online analytical processing (OLAP) queries from the TPC-H benchmark on a 20-node Hadoop cluster. Performance improved by up to 30%.
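A minimal sketch of the idea behind the optimisation, using a toy in-memory runner rather than the extended Hive-QL compiler: counting distinct values per key naively takes two MapReduce jobs (deduplicate, then count), but only one job when both steps share the same partition key. The runner and data are illustrative assumptions.

    from collections import defaultdict

    def run_job(records, map_fn, reduce_fn):
        # Toy MapReduce runner: map, shuffle by key, reduce.
        groups = defaultdict(list)
        for rec in records:
            for k, v in map_fn(rec):
                groups[k].append(v)
        out = []
        for k, vs in groups.items():
            out.extend(reduce_fn(k, vs))
        return out

    data = [("a", 1), ("a", 1), ("a", 2), ("b", 3)]

    # Naive plan: two jobs, hence two shuffles over the data.
    dedup = run_job(data, lambda kv: [(kv, None)], lambda k, vs: [k])
    two_jobs = run_job(dedup, lambda kv: [(kv[0], kv[1])],
                       lambda k, vs: [(k, len(vs))])

    # Optimised plan: one job, deduplicating inside the reducer.
    one_job = run_job(data, lambda kv: [(kv[0], kv[1])],
                      lambda k, vs: [(k, len(set(vs)))])

    assert sorted(two_jobs) == sorted(one_job)   # same answer, one fewer job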
A program that plays dominoes was implemented in Matlab. The program plays in two modalities: two players (one against one) and four players (two against two). The AI relies on search trees, the Minimax algorithm, and probability-based heuristics.
The program performs like a medium-level player. Further improvements could be made by incorporating a knowledge database and Reinforcement Learning. Strategies similar to those applied here can be used in problems with uncertainty, where the whole state of the world is unknown.
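A minimal sketch of depth-limited Minimax in Python over a generic game interface, not the actual Matlab dominoes engine; in the dominoes setting the evaluation function would additionally weight positions by the probability that opponents hold particular tiles. The toy subtraction game below is only for demonstration.

    def minimax(state, depth, maximising, moves, apply_move, evaluate, terminal):
        # Depth-limited Minimax: explore the search tree and back up scores.
        if depth == 0 or terminal(state):
            return evaluate(state, maximising)   # heuristic value of a leaf
        scores = [minimax(apply_move(state, m), depth - 1, not maximising,
                          moves, apply_move, evaluate, terminal)
                  for m in moves(state)]
        return max(scores) if maximising else min(scores)

    # Toy game: players alternately take 1-3 stones; taking the last one wins.
    moves = lambda n: [m for m in (1, 2, 3) if m <= n]
    apply_move = lambda n, m: n - m
    terminal = lambda n: n == 0
    # No stones left on your turn means the opponent took the last stone.
    evaluate = lambda n, maximising: (-1 if maximising else 1) if n == 0 else 0
    print(minimax(4, 8, True, moves, apply_move, evaluate, terminal))  # -1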
Two different navigation systems were implemented. The first used a Proportional-Integral-Derivative (PID) controller to drive a mobile robot through corridors. Maps and landmarks were used so that the robot could deliver mail from one office to another.
The second was implemented using supervised learning with artificial neural networks. The robot was trained to wander corridors while keeping a central position between the walls. The programming was done in C and Matlab.
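A minimal sketch of the first, PID-based scheme, written in Python rather than the original C/Matlab; it assumes the controller steers on the lateral offset from the corridor centreline, and the gains and one-line plant model are illustrative.

    class PID:
        def __init__(self, kp, ki, kd, dt):
            self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
            self.integral, self.prev_error = 0.0, 0.0

        def step(self, error):
            # error: lateral offset from the corridor centreline (metres)
            self.integral += error * self.dt
            derivative = (error - self.prev_error) / self.dt
            self.prev_error = error
            return (self.kp * error + self.ki * self.integral
                    + self.kd * derivative)

    pid = PID(kp=2.0, ki=0.1, kd=0.5, dt=0.05)
    offset = 0.4                      # robot starts 0.4 m off-centre
    for _ in range(100):
        steer = pid.step(offset)      # steering command toward the centre
        offset -= steer * pid.dt      # toy first-order response of the robot
    print(round(offset, 3))           # close to 0: the robot re-centres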
This implementation uses real numbers instead of bits (an Evolution Strategy) and was done in Matlab. Several genetic operators were implemented, such as mutation, one-point crossover, multi-point crossover, extrapolation crossover, mean crossover, weighted-mean crossover, and local intermediate crossover.
Each genetic operator acts on a percentage of the whole population. A greater share of the population is given to operators that perform better and a smaller share to those that perform poorly; the shares vary through the evolution as operator performance changes, since some operators perform better in the early stages while others overtake them later. The implementation was tested on an optics optimisation problem in which the parameters of lenses had to be found.
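A minimal NumPy sketch of the adaptive-operator idea, in Python rather than the original Matlab, using only two of the listed operators and the sphere function as a stand-in objective; the credit-assignment rule and all constants are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    objective = lambda x: np.sum(x**2, axis=1)    # stand-in: minimise sphere

    def mutate(parents):
        return parents + rng.normal(0.0, 0.3, parents.shape)

    def mean_crossover(parents):
        mates = parents[rng.permutation(len(parents))]
        return (parents + mates) / 2.0

    operators = [mutate, mean_crossover]
    shares = np.array([0.5, 0.5])                 # population share per operator
    pop = rng.normal(0.0, 2.0, (40, 5))

    for generation in range(100):
        best_parent = objective(pop).min()
        children, gains = [], []
        for op, share in zip(operators, shares):
            group = pop[rng.choice(len(pop), max(2, int(share * len(pop))))]
            kids = op(group)
            children.append(kids)
            # Credit: how much this operator improved on the parents' best.
            gains.append(max(best_parent - objective(kids).min(), 1e-9))
        # Better-performing operators receive a larger share next generation.
        gains = np.array(gains)
        shares = 0.9 * shares + 0.1 * gains / gains.sum()
        shares /= shares.sum()
        # (mu + lambda) survival: keep the best individuals overall.
        pool = np.vstack([pop] + children)
        pop = pool[np.argsort(objective(pool))[:40]]

    print(objective(pop).min(), shares)           # near 0, plus learned shares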