Identification of population patterns using advanced machine learning techniques applied to mobile phone and geolocation data
Source: Thinking Machines
Context and Motivation
Human mobility is a cornerstone of contemporary life: it shapes urban form, steers transportation systems, and influences economic, social, and environmental outcomes. The exponential growth of mobile devices and location-aware services has produced vast spatiotemporal datasets, opening unprecedented opportunities to observe movement patterns at scale. At the same time, these data introduce serious challenges, such as privacy concerns, heterogeneity, noise, sparsity, and the demand for models that are both accurate and interpretable.
Among the various forms of mobile data, Call Detail Records (CDRs) are a particularly rich source for population-level analysis. Mobile network operators generate CDRs passively for billing and operational purposes: each time a user interacts with the network, a record is created containing the time of the event and the identity of the Radio Base Station (RBS) that handled the connection. Positioning accuracy is coarse, ranging from 100 to 1,000 metres, but the extensive spatial and temporal coverage, together with samples that represent a significant fraction of the population, makes CDR data well suited to studying population-level mobility dynamics.
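As a rough illustration of the record structure described above, the following sketch models a CDR event and aggregates activity per RBS and hour of day. The class name, field names, and the `user_id` field are assumptions made for the example; real operator schemas differ and are not specified here.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class CdrEvent:
    """One network event. Field names are illustrative, not an operator schema."""
    user_id: str         # pseudonymised subscriber identifier (assumed field)
    timestamp: datetime  # time of the event
    rbs_id: str          # Radio Base Station that handled the connection


def hourly_activity(events):
    """Count events per (RBS, hour of day) pair."""
    return Counter((e.rbs_id, e.timestamp.hour) for e in events)


# Toy events, invented purely for illustration
events = [
    CdrEvent("u1", datetime(2024, 3, 4, 8, 15), "RBS_A"),
    CdrEvent("u2", datetime(2024, 3, 4, 8, 40), "RBS_A"),
    CdrEvent("u1", datetime(2024, 3, 4, 19, 5), "RBS_B"),
]
activity = hourly_activity(events)
```

Temporal activity profiles of this kind, one per RBS coverage area, are one plausible form of input feature for place identification; the exact feature set used in the thesis is not stated in this summary.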
Beyond traditional CDR analysis, emerging data sources such as geolocated social media offer complementary perspectives on human mobility. Platforms like Twitter/X provide geotagged content that can capture cross-border movements and international mobility patterns that CDRs, limited by operator licensing boundaries, typically cannot observe.
Research Gaps
Despite substantial progress in leveraging mobile phone data for mobility analysis, several critical research gaps persist:
- Limitations of unsupervised approaches: Traditional clustering methods (e.g., DBSCAN, k-means) suffer from parameter sensitivity, lack of semantic grounding, and single-label constraints, and they report lower accuracy (41–74%) than supervised approaches (78% and above).
- Multi-label and contextual modelling: Urban areas often serve multiple overlapping functions (e.g., residential + retail), yet single-label approaches dominate the literature.
- Generalisation and transferability: Models trained on one city rarely generalise to another with different spatial structures, cultural contexts, or network topologies.
- Cross-border mobility: CDRs are limited to national operator boundaries, leaving transnational commuting patterns poorly measured.
- Integration with urban semantics: Optimal strategies for integrating auxiliary data sources (land use maps, Points of Interest, census data) remain an active area of investigation.
Objectives
The primary aim of this PhD thesis is to develop, implement, and evaluate advanced machine learning methods, especially supervised and multi-label classification techniques, for robustly identifying population mobility patterns from mobile phone data, with emphasis on generalisable and interpretable solutions applicable across diverse geographic contexts. The specific objectives are:
- Systematic review and benchmarking of existing methods for identifying mobility and population patterns from positioning data.
- Performance enhancement via data and feature engineering, including scaling, sampling strategies, instance selection, and spatially-aware cross-validation.
- Application of multi-label classification for places and movements, using both problem transformation methods (Binary Relevance, Classifier Chains, Label Powerset) and algorithm adaptation methods (ML-kNN).
- Evaluation, generalisation, and transfer learning across different cities and contexts with varying urban morphologies.
- Emphasis on interpretability and practical deployment via feature importance analysis and Shapley values.
- Extension to cross-border mobility analysis using alternative data sources such as geolocated social media.
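The problem transformation methods named in the objectives can be sketched in a few lines. The example below shows how Binary Relevance and Label Powerset each rewrite a multi-label target into targets a standard classifier can handle; Classifier Chains and ML-kNN are omitted. The function names and the land-use labels are hypothetical, and this is a minimal illustration rather than the implementation used in the thesis.

```python
def binary_relevance_targets(label_sets, labels):
    """Binary Relevance: one independent binary target vector per label."""
    return {lab: [int(lab in s) for s in label_sets] for lab in labels}


def label_powerset_targets(label_sets):
    """Label Powerset: each distinct label combination becomes one class."""
    class_of = {}
    targets = []
    for s in label_sets:
        key = frozenset(s)
        if key not in class_of:
            class_of[key] = len(class_of)  # assign the next unused class index
        targets.append(class_of[key])
    return targets, class_of


# Hypothetical land-use labels for four areas (illustrative only)
label_sets = [
    {"residential"},
    {"residential", "retail"},
    {"retail"},
    {"residential", "retail"},
]
labels = ["residential", "retail"]

br = binary_relevance_targets(label_sets, labels)
lp, classes = label_powerset_targets(label_sets)
```

Binary Relevance trains one binary classifier per label and so ignores label co-occurrence, whereas Label Powerset preserves co-occurrence at the cost of a class count that can grow with the number of distinct label combinations.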
Core Publications
This thesis is submitted as a compendium of publications, bringing together six journal articles, four published and two under revision:
- SAMPLID (ISPRS International Journal of Geo-Information, 2024) — Introduces supervised place identification from CDRs, achieving up to 78.5% overall accuracy in Milan.
- Trento Extension (Applied Sciences, 2025) — Evaluates supervised land-use classification in an alpine environment, with Random Forest achieving 64.7% overall accuracy.
- MAPLID (International Journal of Geographical Information Science, 2026) — Extends place identification to a multi-label setting, reporting 88.3% average precision with Label Powerset + Random Forest.
- Local-k Selection (Engineering Applications of Artificial Intelligence, 2022) — Proposes local (instance-wise) selection of k for multi-label kNN.
- ML-proxkNN (under revision, Information Processing & Management, 2026) — Introduces a spatially-aware multi-label kNN achieving 92.0% average precision on Milan.
- Cross-Border Mobility (under revision, EPJ Data Science, 2026) — Detects cross-border commuters using geolocated social media, achieving 98% accuracy with zero-shot transfer across regions.