Some flavor of machine learning with
applications in geophysics


Léonard Seydoux
ISTerre, Université Grenoble Alpes

leonard.seydoux@univ-grenoble-alpes.fr


Frontiers in Geosciences

Main targets

  • Fly over machine-learning basics and main applications
  • Decipher a machine-learning-based geoscience paper
  • Implement machine-learning solutions when needed



We will also answer any questions you may have

Beyreuther et al. GJI (2012)

References

The deep learning book

  • History of machine learning
  • Introduction to linear algebra
  • Machine-learning basics
  • Advanced deep learning
  • Full online content at www.deeplearningbook.org

Goodfellow et al., MIT press (2016)

scikit-learn
a machine learning library

  • Open-source Python library
  • Algorithms linked with courses
  • Grey-box style

Lots of examples there!

Pedregosa et al., JMLR (2011)

3blue1brown.com

  • Animated math
  • Ultra-pedagogical
  • Some videos about deep nets

I strongly recommend it

What is machine learning?

Machine Learning – high-level definition

Machine learning is the research field that studies algorithms which
improve their performance $P$ at solving a task $T$ from experience $E$

A well-defined learning problem is given by $P$, $T$ and $E$

Other possible definitions that also work

  • Machine learning = solving inverse problems
  • Machine learning = statistical model testing
  • Machine learning = statistical model inference

Machine learning is useful when

  • human expertise does not exist (navigation on Mars)
  • human expertise is hard to formalize (face recognition)
  • too many variables must be taken into account (medical decisions)
  • the data is too big (data compression)
  • performance improvement is needed (computational speed)

No need to use machine learning if classical strategies work

Famous examples – Image classification


  • Task: recognize and locate objects in image
  • Performance: classification/location error
  • Experience: labeled images

Various applications in medicine, anomaly detection, etc.

Famous examples – Image captioning



  • Task: generate caption from image
  • Performance: caption quality
  • Experience: already captioned images

Example of application: autonomous cars

Pranoy Radhakrishnan (2019)

Famous examples – Image captioning

https://vimeo.com/146492001

Famous examples – Data compression


Data compression with autoencoder

Jiang et al. IEEE (2017)

Other applications ?

👏👏👏

Famous examples – Hand-written digits recognition



  • Task: recognize digits from pixels
  • Performance: recognition success
  • Experience: labeled hand-written digits

How would you proceed without learning?

Main types of machine learning

SUPERVISED LEARNING
Find the mapping $f(x) = y$ given training data
samples $x$ and labels $y$ (e.g. regression, classification, ...)


UNSUPERVISED LEARNING
Given training data $x$, infer any existing internal structure
(clustering, anomaly detection, blind source separation, ...)

Supervised machine learning

Solve $y = f(x)$, please.

Useful definitions

| Symbol | Name |
| --- | --- |
| $\left\{ \vec{x}_i \in \mathbb{X} \right\}_{i = 1\ldots N}$ | Collection of data samples |
| $\left\{ \vec{y}_i \in \mathbb{Y} \right\}_{i = 1\ldots N}$ | Collection of labels |
| $\vec{x}=\{x_1, \ldots x_F\}$ | Set of sample features |
| $\vec{y}=\{y_1, \ldots y_T\}$ | Set of label targets |
| $N$ | Dataset dimension |
| $F$ | Feature space dimension |
| $T$ | Target space dimension |
| $\mathbb{X}$ | Data space |
| $\mathbb{Y}$ | Label space |

Common supervised tasks


  • Regression: both $x$ and $y$ are continuous
  • Classification: $x$ is continuous, $y$ is discrete

Supervised task – Regression

Given data samples $x \in \mathbb{X}$ and desired corresponding
labels $y \in \mathbb{Y}$, find a mapping function $f : \mathbb{X} \rightarrow \mathbb{Y}$
between $x$ and $y$ so that $f(x) = y$ is true most of the time.

→ Illustration with linear regression

Linear Regression

Which coefficients $a, b \in \mathbb{R}$ best explain the relationship between $x$ and $y \in \mathbb{R}$ so that

$$y = ax + b \, ?$$

We need to optimize $a$ and $b$. Here, optimize = learn

Linear Regression

Brute-force learning
– a modern way of
"testing plenty of solutions"

How to select the best model?

Linear Regression

$\tilde{y}_i$ is the label predicted by the model for a given data sample $x_i$

$$\tilde{y}_i = ax_i + b$$

With the true label $y_i$, we can define the prediction loss:

$$\epsilon_P = \sum_{i=1}^{N} \left( \tilde{y}_i - y_i\right)^2$$

Possible loss – least squares
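A minimal NumPy sketch of this loss, with made-up data standing in for real samples and labels (illustration only, not the data of any slide):

```python
import numpy as np

# Hypothetical data: noisy samples around the line y = 2x + 1
rng = np.random.default_rng(42)
x = np.linspace(0, 1, 100)
y = 2.0 * x + 1.0 + 0.1 * rng.standard_normal(x.size)

def prediction_loss(a, b, x, y):
    """Least-squares prediction loss for the linear model y ~ a * x + b."""
    y_tilde = a * x + b
    return np.sum((y_tilde - y) ** 2)

print(prediction_loss(2.0, 1.0, x, y))  # low loss: close to the true model
print(prediction_loss(0.0, 0.0, x, y))  # high loss: poor model
```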

Linear Regression

The task becomes finding
$\theta^* = (a^*, b^*) \in \mathbb{R}^2$ so that

$$\theta^* = \argmin_{a, b}\sum_i \left( \tilde{y}_i - y_i \right)^2$$

Brute-force can get a little crazy
Any better idea?

Parameter search with gradient descent

How do we infer $\theta^* = \displaystyle\argmin_{a, b}\sum_i \left( \tilde{y}_i - y_i \right)^2$ efficiently?

  1. Define an initial model $\theta_0 = (a_0, b_0)$
  2. Observe $\delta \epsilon$ for $\theta_1 = (a_0 + \delta a, b_0 + \delta b)$
  3. Update the model accordingly: $\theta_1 \rightarrow \theta_0$
  4. Iterate until the loss does not evolve anymore

The perturbation amplitude is called the learning rate
An iteration is called a training epoch
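A minimal numerical sketch of this loop on the toy linear-regression problem, using a finite-difference estimate of the gradient (the data, learning rate and number of epochs are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 100)
y = 2.0 * x + 1.0 + 0.1 * rng.standard_normal(x.size)

def loss(theta):
    a, b = theta
    return np.sum((a * x + b - y) ** 2)

theta = np.array([0.0, 0.0])      # 1. initial model (a0, b0)
learning_rate = 1e-3              # perturbation amplitude
for epoch in range(500):          # each iteration is a training epoch
    # 2. observe how the loss changes for small perturbations of a and b
    grad = np.array([
        (loss(theta + [1e-6, 0.0]) - loss(theta)) / 1e-6,
        (loss(theta + [0.0, 1e-6]) - loss(theta)) / 1e-6,
    ])
    # 3. update the model accordingly
    theta = theta - learning_rate * grad

print(theta)  # should approach (2, 1)
```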

Parameter search with gradient descent



Every data scientist needs to cross-validate the learning rate

Different gradient descent techniques



Need to select the right optimizer (watch the loss, use physical intuition, ...)

Overfitting

Many parameters may be used for complex regressions.

The green model's error $\rightarrow 0$, yet the model does not generalize.

How do we prevent overfitting?

Prevent overfitting with
training & testing data

The parameters are learned from the training data with the training loss


The algorithm must also minimize the testing loss
(= generalization error)

We stop at the tradeoff

Prevent overfitting with
regularization penalty

Add a penalty to the loss function to prevent the model from fitting the noise

$$\min_\theta \sum_i L \left(f_\theta(x_i), y_i\right) + \color{red}{R(f_\theta)}$$

How strong? Which penalty?
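For instance, ridge regression adds an L2 penalty $R(f_\theta) = \alpha \|\theta\|^2$. A minimal scikit-learn sketch (the toy data and the penalty weight `alpha` are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 30)[:, None]
y = np.sin(2 * np.pi * x).ravel() + 0.3 * rng.standard_normal(30)

# High-degree polynomial without penalty: prone to overfitting
unregularized = make_pipeline(PolynomialFeatures(degree=15), LinearRegression())
# Same model with an L2 penalty on the coefficients
regularized = make_pipeline(PolynomialFeatures(degree=15), Ridge(alpha=1e-3))

unregularized.fit(x, y)
regularized.fit(x, y)
```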

Which regression algorithm?

  • Different loss functions
    e.g.: least squares, entropy, likelihood, ...
  • Different regularization
    e.g.: L1, L2, random dropout, ...
  • Different models
    e.g.: linear, nearest neighbors, ...

As many definitions as regression algorithms 👉

Another supervised algorithm – Support-vector machine

Works for regression or classification

Finding the right hyperplane

In two dimensions, a hyperplane is a straight line.
What if the data cannot be split by a line?



Representation matters! Here comes the kernel trick

kernel trick

Increase the data dimension with non-linear feature combinations, and find the hyperplane "there" in the high-dimensional space.



Here, we have the Radial Basis Function (RBF)-like feature map $\phi(x, y) = \left(x, y, e^{-(x^2 + y^2)}\right)$

Back in 2D, the kernel projection is non-linear

Example of Support vector machine with different kernels
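A minimal scikit-learn sketch comparing kernels on a toy, non-linearly separable dataset (the dataset and parameters are illustrative, not the figures of the slide):

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric circles: not separable by a straight line in 2D
X, y = make_circles(n_samples=200, factor=0.4, noise=0.1, random_state=0)

for kernel in ("linear", "poly", "rbf"):
    clf = SVC(kernel=kernel).fit(X, y)
    print(kernel, clf.score(X, y))  # the RBF kernel should do best here
```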


Supervised machine learning strategies

  • Representation matters
  • Kernel tricks are common strategies to improve performance
  • As many regression/classification algorithms as combinations of losses, models and optimizers

Machine learning for signal analysis


How do we deal with waveforms?

Naive attempt – Waveform as a set of features

An $N$-point waveform can be seen as a
single data sample with $N$ features


Naive attempt – Waveform as a set of features

These two waveforms are shifted by a tiny bit in the time domain.
This makes a huge change in the feature comparison...


A simple distance measurement leads to a 32% correlation between the waveforms
We should change the cost function or extract more relevant features
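A small NumPy sketch of the issue on a synthetic wavelet (the waveform and the shift are made up for illustration; the exact correlation value will differ from the 32% quoted above):

```python
import numpy as np

t = np.linspace(0, 1, 512)

def wavelet(t0):
    """A synthetic 40 Hz wavelet centred on time t0 (illustration only)."""
    return np.exp(-((t - t0) / 0.02) ** 2) * np.sin(2 * np.pi * 40 * (t - t0))

w1 = wavelet(0.50)   # original waveform
w2 = wavelet(0.51)   # same waveform, shifted by 10 ms

# The sample-by-sample (feature-wise) comparison collapses under a small shift
print(f"correlation: {np.corrcoef(w1, w2)[0, 1]:.2f}")
```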

Manual feature extraction for classification of volcanic events

Malfante et al. MLVSS (2017)

Manual feature extraction for classification of volcanic events



What if we used the spectrogram directly?
What would be the drawback?

Malfante et al. MLVSS (2017)

Manual feature extraction for time-to-failure earthquake prediction



Task – predict the time-to-failure from the "seismic" data

Manual feature extraction for time-to-failure earthquake prediction



Random forest regressor

Random forest = forest of decision trees


While individual decision trees tend to overfit, averaging many of them makes random forests powerful
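A minimal scikit-learn sketch of a random forest regressor (the features and the target below are random placeholders, not the laboratory data of the paper):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Placeholder statistical features (e.g. variance, kurtosis, ...) and a fake target
rng = np.random.default_rng(0)
features = rng.standard_normal((500, 10))
time_to_failure = features[:, 0] ** 2 + 0.1 * rng.standard_normal(500)

X_train, X_test, y_train, y_test = train_test_split(
    features, time_to_failure, random_state=0)
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)

print(model.score(X_test, y_test))       # coefficient of determination R^2
print(model.feature_importances_[:3])    # which features help most
```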

Manual feature extraction for time-to-failure earthquake prediction



Here we go! Remember, that's in a laboratory!

Manual feature extraction for time-to-failure earthquake prediction

Here are the features that are of greatest help to perform the task

Deep neural networks


Learning abstract features from data

Which features should matter?

An artificial neuron
$$y = \varphi\left(\Sigma\, \omega x + b\right)$$


Where $x \in \mathbb{R}^m$ is an input data sample, $w \in \mathbb{R}^{m}$ are the weights and $b \in \mathbb{R}$ the bias, and the function $\varphi$ is the non-linear activation. The output is a new feature $y \in \mathbb{R}$.


The idea is to transform the input data into another feature space
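A minimal NumPy sketch of a single neuron (the weights, bias and input are arbitrary; the ReLU is one common choice of $\varphi$):

```python
import numpy as np

def neuron(x, w, b, activation=lambda z: np.maximum(z, 0.0)):
    """One artificial neuron: y = phi(sum(w * x) + b), here with a ReLU activation."""
    return activation(np.dot(w, x) + b)

x = np.array([0.2, -1.3, 0.7])   # input sample with m = 3 features
w = np.array([0.5, 0.1, -0.4])   # weights (learned in practice)
b = 0.05                          # bias (learned in practice)
print(neuron(x, w, b))
```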

Activations allow for non-linear regression



Fully-connected neural network

The depth corresponds to the number of hidden layers

Fully-connected neural network

HUGE number of parameters to invert!
Requires BIG datasets
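To get a feel for the size, here is a tiny fully-connected network written with Keras (assuming TensorFlow/Keras is available; the layer sizes are arbitrary and only illustrate how fast the parameter count grows):

```python
from tensorflow import keras

# A small fully-connected network: layer sizes are illustrative only
model = keras.Sequential([
    keras.Input(shape=(512,)),
    keras.layers.Dense(256, activation="relu"),
    keras.layers.Dense(128, activation="relu"),
    keras.layers.Dense(1),
])
model.summary()  # ~164,000 trainable parameters for this tiny network already
```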

Let's try out


Convolutional neural networks



Instead of dense weights and biases, one can use convolutional filters
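A minimal NumPy sketch of what one 1-D convolutional filter does to a waveform (the filter here is a hand-picked moving average, only to illustrate the operation; in a CNN the filter coefficients are learned):

```python
import numpy as np

rng = np.random.default_rng(0)
waveform = rng.standard_normal(512)      # placeholder input trace

kernel = np.ones(16) / 16.0              # one 1-D convolutional filter (here: moving average)
feature_map = np.convolve(waveform, kernel, mode="same")

# A convolutional layer applies many such filters and shares their few
# coefficients across the whole trace, instead of one weight per sample.
print(waveform.shape, feature_map.shape, kernel.size)
```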

Let's try out with convolutional filters


Different convolution arithmetic


ConvNetQuake – a CNN for earthquake detection and location



Black dots are the test data... 🧐

ConvNetQuake – a CNN for earthquake detection and location



Results show that we can (roughly) learn the
location of an earthquake from a single station!

PhaseNet – a seismic phase detector



Task: learn to pick P and S phases from manual picks

PhaseNet – a seismic phase detector



P accuracy: 93%
S accuracy: 85%

PhaseNet – a seismic phase detector



P accuracy: 93%
S accuracy: 85%

PhaseNet – a seismic phase detector



P accuracy: 93%
S accuracy: 85%

Unsupervised learning


Exploring the data with implicit models

Unsupervised-learning common tasks

  • Clustering
  • Blind source separation
  • Data compression
  • De-noising

Clustering – most common definitions


Depending on the definition used, the techniques and results may vary.
Clustering is an exploratory task: every result makes sense

Most simple (yet, powerful) example – K-means

Find $K$ cluster centroids $\mu_k$ that minimize
the within-cluster variance. In other words:

$$\displaystyle\argmin_{S} \sum_{k=1}^K \sum_{\mathbf{x} \in C_k} \| \mathbf{x} - \mathbf{\mu}_k\| ^ 2$$

We have the data and a task to solve

Most simple (yet, powerful) example – K-means

  1. Initialize with (random) centroids
  2. Assign each data sample to the closest centroid
  3. Update each centroid with its within-cluster data
  4. Iterate until a minimum is reached

Most simple (yet, powerful) example – K-means

  1. Initialize with (random) centroids
  2. Assign each data sample to the closest centroid
  3. Update each centroid with its within-cluster data
  4. Iterate until a minimum is reached

The final decision boundaries are called Voronoi cells
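A minimal scikit-learn sketch of these four steps (the toy dataset and the number of clusters are illustrative):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy dataset with three well-separated blobs
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.cluster_centers_)   # the learned centroids mu_k
print(kmeans.labels_[:10])       # cluster assignment of the first samples
```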

How many clusters?

Silhouette = distance of $C_i$- to $C_j$-samples. When ~ 0, it indicates bad splitting.

How many clusters?

Cluster splitting occurs when silhouettes fall below the average silhouette score
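A minimal sketch of how the silhouette score can guide the choice of $K$ with scikit-learn (toy data, illustrative only):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=1)

for n_clusters in range(2, 7):
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X)
    print(n_clusters, silhouette_score(X, labels))  # highest score near the true K
```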

How many clusters?

$K$-means is adapted to linearly separable and balanced problems

Is your data linearly separable?



What algorithm is the best in each case?

Is your data linearly separable?



Should we change the features or the algorithm? We will try it out tomorrow 😌

Clustering of volcanic tremors at Kamchatka



We will do better than this!

Unsupervised learning
Principal component analysis


Let us consider $\mathbf{x} \in \mathbb{R}^{F \times N}$ and find a decomposition of $\mathbf{x}$ so that

$$\mathbf{x} = \sum_{i=1}^{F} \mathbf{u}_i \lambda_i \mathbf{v}_i^t$$

The $\mathbf{u}_i$ are the (orthogonal)
principal components.

Principal component analysis for compression

$$\tilde{\mathbf{x}} = \sum_{i=1}^{F'< F} \mathbf{u}_i \lambda_i \mathbf{v}_i^t$$

The image looks good with 64 principal components (instead of 128)
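A minimal scikit-learn sketch of PCA-based compression (random data stands in for the image; 64 components kept out of 128 features, mirroring the numbers above):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
x = rng.standard_normal((1000, 128))         # placeholder data: N = 1000 samples, F = 128 features

pca = PCA(n_components=64)                    # keep F' = 64 principal components
compressed = pca.fit_transform(x)             # shape (1000, 64)
x_tilde = pca.inverse_transform(compressed)   # approximate reconstruction, shape (1000, 128)
print(pca.explained_variance_ratio_.sum())    # fraction of variance retained
```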

Is your data linearly separable? 😌





If not, you can use the kernel PCA!
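A minimal scikit-learn sketch of kernel PCA on a non-linearly separable toy dataset (the dataset and parameters are illustrative):

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA

X, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)

linear = PCA(n_components=2).fit_transform(X)   # stays non-separable
kernelized = KernelPCA(n_components=2, kernel="rbf", gamma=10).fit_transform(X)
# In the kernel PCA space the two circles become (almost) linearly separable
```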

Another decomposition: Independent component analysis


Same as PCA, without the orthogonality constraint

Independent component analysis for blind-source separation


PCA fails to separate the different sources in that case → ICA does better
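A minimal scikit-learn sketch of blind source separation with FastICA (the synthetic sources and the mixing matrix are made up for illustration):

```python
import numpy as np
from sklearn.decomposition import FastICA

t = np.linspace(0, 8, 2000)
sources = np.c_[np.sin(2 * t), np.sign(np.sin(3 * t))]   # two independent sources
mixing = np.array([[1.0, 0.5], [0.4, 1.0]])
mixtures = sources @ mixing.T                              # observed mixed signals

ica = FastICA(n_components=2, random_state=0)
recovered = ica.fit_transform(mixtures)                    # estimated sources (up to scale and order)
```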

Example of clustering – Human-induced seismic sources

Holtzman et al. (2018)

Example of clustering – Human-induced seismic sources


Feature extraction from seismograms

Example of clustering – Human-induced seismic sources

The clusters naturally highlight the amount of injected water

Holtzman et al. (2018)

Deep unsupervised learning


Learn how to encode the data from itself

Deep autoencoders

Task: learn to reconstruct the data ($\tilde{X}$) from itself ($X$) with constraints



💡 an autoencoder is a deep-net extension of principal component analysis
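As a rough illustration of the idea, a minimal dense autoencoder sketch in Keras with a 32-value bottleneck for 512-point traces (assuming TensorFlow/Keras is available; the sizes, data and training settings are placeholders, not the architecture of the papers below):

```python
import numpy as np
from tensorflow import keras

encoder = keras.Sequential([
    keras.Input(shape=(512,)),
    keras.layers.Dense(128, activation="relu"),
    keras.layers.Dense(32),                       # compressed representation
])
decoder = keras.Sequential([
    keras.Input(shape=(32,)),
    keras.layers.Dense(128, activation="relu"),
    keras.layers.Dense(512),                      # reconstruction X~
])
autoencoder = keras.Sequential([encoder, decoder])
autoencoder.compile(optimizer="adam", loss="mse")

waveforms = np.random.default_rng(0).standard_normal((1000, 512))  # placeholder data
autoencoder.fit(waveforms, waveforms, epochs=5, batch_size=32, verbose=0)
```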

Autoencoder in action
Seismic data compression

Learn to encode a 512-point seismogram into a 32-point representation with the best reconstruction quality


Valentine & Trampert GJI (2012)

Autoencoder in action II
Data quality assessment

Rejecting the seismograms that cannot be decoded with a low error.

Valentine & Trampert GJI (2012)

The end !

Any 🧐 ?

Acknowledgements

  • These slides have been made with Marp Next, a framework for slide formatting with Markdown.
  • The content of these slides was inspired by the scikit-learn toolkit, a wonderful open-source Python library.
  • Also, the structure of the course was largely based on that of the Deep Learning book.
  • Online videos from 3Blue1Brown were extremely inspiring for pedagogical explanations, and some of them are used in this course.