Many docking programs have been developed. Although these differ in the algorithms used, every docking program must be able to perform three (not necessarily distinct) basic operations:
Now we will start the molecular docking of Chymotrypsin C (CTRC) with Doxorubicin.
Let's look at the flowchart for the process we will go through.
We will now build our ligand, i.e. Doxorubicin, and optimize it for docking.
Check out this video-based tutorial:
Linear regression attempts to fit a line of best fit to a data set, using one or more features as coefficients for a linear equation. It is an approach for modelling the relationship between a dependent variable and one or more independent variables.
In a linear regression model, each target (dependent) variable is estimated as a weighted sum of the input variables, offset by a constant known as a bias:
$$ Y = X.W^T + b \tag{1} $$
Where,
$$
Y = \begin{bmatrix}y_1\\y_2\\\vdots\\y_n\end{bmatrix}_{n\times1} \quad X = \begin{bmatrix}x_{11} & x_{12} & \cdots & x_{1n} \\x_{21} & x_{22} & \cdots & x_{2n} \\ \vdots & \vdots & \ddots & \vdots \\x_{n1} & x_{n2} & \cdots & x_{nn}\end{bmatrix}_{n\times n} \quad W^T = \begin{bmatrix}w_1\\ w_2\\\vdots\\w_n\end{bmatrix}_{n\times 1} \quad b = \begin{bmatrix}b_1\\ b_2\\\vdots\\b_n\end{bmatrix}_{n\times 1}
$$
Expanding equation 1 gives, for the first row: $$ y_1 = w_1x_{11} + w_2x_{12} + \dots + w_nx_{1n} + b_1 $$
Now, let's take an example for a better explanation. Here, we take a look at advertising data with the following contents:
The above data consists of the sales of a particular product along with the advertising budget for the product in TV, radio and newspaper media. Our objective is to increase the sales of the product, and we can control the advertising budget. So, if we determine the relationship between advertising budget and sales, we can figure out how to increase the sales of the product by changing the advertising budget. Here, the independent variables \((x_i)\) are the advertising budgets for each of the three media and the dependent variable \((y)\) is the sales of the product. The relationship between the independent and dependent variables in this data can be defined as: $$ y = w_1x_1 + w_2x_2 + w_3x_3 + b \tag{2} $$ where \(y\) is sales and \(x_1,x_2,x_3\) are the advertising budgets for TV, radio and newspaper respectively. The above equation can be written in matrix form as: $$ y = X.W^T +b $$
where $$ X = \begin{bmatrix}x_1 & x_2 & x_3\end{bmatrix}_{1\times 3} \hspace{1cm} W = \begin{bmatrix}w_1 & w_2 & w_3\end{bmatrix}_{1\times 3} $$
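As a quick numerical check of this matrix form, here is a tiny sketch; the budget and weight values are made up purely for illustration:

```python
import numpy as np

# Toy evaluation of y = X.W^T + b (numbers are illustrative, not fitted).
X = np.array([[230.1, 37.8, 69.2]])   # TV, radio, newspaper budgets (1 x 3)
W = np.array([[0.05, 0.18, 0.01]])    # weights (1 x 3)
b = 3.0
y = X @ W.T + b                       # predicted sales (1 x 1 matrix)
```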
Now that we have discussed the linear regression equation, let's move on to the implementation and discuss the concepts involved.
I am using the Advertising data which was also used in the ISLR book.
We can remove the inbuilt index in the data.
# removing the inbuilt index column
df.drop('Unnamed: 0', axis = 1, inplace=True)
df.head()
Now let's get some more information on the dataset.
Let's divide the dataset into target and input variables and then split it into training and test data.
import torch
from sklearn.model_selection import train_test_split

x = df.drop('sales', axis=1).values
y = df[['sales']].values
# Convert the numpy arrays to pytorch tensors.
inputs = torch.from_numpy(x).float()
targets = torch.from_numpy(y).float()
# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(inputs, targets, test_size=0.20, random_state=0)
For the data split, I decided on an 80–20 split for the train and test sets.
Our model is simply a function that performs a matrix multiplication of the inputs and the transposed weights w, and adds the bias b (see equation 2). So we initialize the weight matrix and bias.
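A minimal sketch of this initialization; the shapes assume the three advertising features and one target, and random initialization is one common choice:

```python
import torch

# Random-initialise weights and bias; requires_grad=True lets PyTorch
# compute gradients of the loss w.r.t. these tensors later.
torch.manual_seed(0)
w = torch.randn(1, 3, requires_grad=True)  # one row per target, one column per feature
b = torch.randn(1, requires_grad=True)
```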
We can define the model as follows:
def model(x):
    return x @ w.t() + b
The @ operator represents matrix multiplication in PyTorch, and the .t() method returns the transpose of a tensor. The matrix obtained by passing the input data into the model is a set of predictions for the target variables (see equation 2).
We need a way to evaluate how well our model is performing. We can compare the model’s predictions with the actual targets, using the following method:
The result is a single number, known as the mean squared error (MSE).
def mse(t1, t2):
    diff = t1 - t2
    return torch.sum(diff * diff) / diff.numel()
torch.sum returns the sum of all the elements in a tensor, and the .numel method returns the number of elements in a tensor.
Let’s compute the mean squared error for the current predictions of our model.
Here’s how we can interpret the result: on average, each element in the prediction differs from the actual target by about 276.83 (the square root of the loss 76634.3438). And that’s pretty bad, considering the numbers we are trying to predict are themselves in the range 1–27. The result is called the loss, because it indicates how bad the model is at predicting the target variables. The lower the loss, the better the model.
We’ll now minimize the loss function using the gradient descent algorithm. Intuitively, gradient descent takes small, linear steps down the slope of a function in each feature dimension, with the size of each step determined by the partial derivative of the cost function with respect to that feature and a learning rate multiplier \(\eta\). If tuned properly, the algorithm converges on a global minimum by iteratively adjusting feature weights \(\theta\) of the cost function, as shown here for two feature dimensions: $$ \theta_0 := \theta_0 - \eta\frac{\partial}{\partial\theta_0} J(\theta_0,\theta_1) $$ $$ \theta_1 := \theta_1 - \eta\frac{\partial}{\partial\theta_1} J(\theta_0,\theta_1) $$
Given that : $$ h_\theta(x) = \theta_0 + \theta_1x_1 $$ $$ J(\theta) = \frac{1}{2m}\sum\limits_{i = 1}^{m}(h_\theta(x^{(i)}) - y^{(i)})^2 $$
For more on this, read the notes here: CS229_Notes
With PyTorch, we can automatically compute the gradient or derivative of the loss w.r.t. the weights and biases, because they have requires_grad set to True.
We’ll reduce the loss and improve our model using the gradient descent optimization algorithm, which has the following steps:
Generate predictions
Calculate the loss
Compute gradients w.r.t the weights and biases
Adjust the weights by subtracting a small quantity proportional to the gradient
Reset the gradients to zero
Let’s implement the above step by step.
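Putting the five steps together, here is a self-contained sketch of one epoch; the synthetic data, seed, and learning rate are illustrative stand-ins for the tutorial's tensors:

```python
import torch

# Synthetic stand-ins for the tutorial's X_train / y_train.
torch.manual_seed(0)
inputs = torch.randn(160, 3)
targets = inputs @ torch.tensor([[0.05], [0.18], [0.01]]) + 3.0

w = torch.randn(1, 3, requires_grad=True)
b = torch.randn(1, requires_grad=True)

def model(x):
    return x @ w.t() + b

def mse(t1, t2):
    diff = t1 - t2
    return torch.sum(diff * diff) / diff.numel()

preds = model(inputs)               # 1. generate predictions
loss_before = mse(preds, targets)   # 2. calculate the loss
loss_before.backward()              # 3. compute gradients w.r.t. w and b
with torch.no_grad():
    w -= w.grad * 1e-2              # 4. adjust weights and bias
    b -= b.grad * 1e-2
    w.grad.zero_()                  # 5. reset the gradients to zero
    b.grad.zero_()
loss_after = mse(model(inputs), targets)
```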
Let’s take a look at the loss after 1 epoch.
We have already achieved a significant reduction in the loss, simply by adjusting the weights and biases slightly using gradient descent.
To reduce the loss further, we can repeat the process of adjusting the weights and biases using the gradients multiple times. Each iteration is called an epoch. Let’s train the model for 1000 epochs.
Now let's look at the final loss:
As you can see, the loss is now much lower than what we started out with.
Let’s look at the model’s predictions and compare them with the targets.
y_preds = model(X_test)
Let's plot actual vs. predicted values:
Now, this whole process can be done using PyTorch built-ins. Check it out:
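For instance, a minimal sketch with built-ins might look like this: nn.Linear replaces the manual w, b and model(), nn.MSELoss replaces mse(), and optim.SGD performs the update and gradient-reset steps (the data here is synthetic, for illustration):

```python
import torch
import torch.nn as nn

# Synthetic stand-ins for the tutorial's tensors.
torch.manual_seed(0)
inputs = torch.randn(160, 3)
targets = inputs @ torch.tensor([[0.05], [0.18], [0.01]]) + 3.0

model = nn.Linear(3, 1)                              # w and b live inside the layer
loss_fn = nn.MSELoss()
opt = torch.optim.SGD(model.parameters(), lr=1e-2)

initial_loss = loss_fn(model(inputs), targets).item()
for epoch in range(200):
    loss = loss_fn(model(inputs), targets)
    opt.zero_grad()      # reset gradients
    loss.backward()      # compute gradients
    opt.step()           # adjust weights and bias
final_loss = loss_fn(model(inputs), targets).item()
```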
SIFT (Scale-Invariant Feature Transform) is a feature detection algorithm in computer vision used to detect and describe local features in images. It was created by David Lowe of the University of British Columbia in 1999. Lowe presents the SIFT algorithm in his original paper, Distinctive Image Features from Scale-Invariant Keypoints.
Image features extracted by SIFT are reasonably invariant to various changes such as illumination, image noise, rotation, scaling, and small changes in viewpoint.
There are four main stages involved in the SIFT algorithm:
We will now examine these stages in detail.
Before going into these, we will visit the idea of scale-space theory and then see how it is used in SIFT.
Scale-space theory is a framework for multiscale image representation, which has been developed by the computer vision community with complementary motivations from physics and biological vision. The idea is to handle the multiscale nature of real-world objects, which implies that objects may be perceived in different ways depending on the scale of observation.
The first stage is to identify locations and scales that can be repeatably assigned under differing views of the same object. Detecting locations that are invariant to scale change of the image can be accomplished by searching for stable features across all possible scales, using a continuous function of scale known as scale space.
The scale space is defined by the function:
$$ L(x, y, \sigma) = G(x, y, \sigma)* I(x, y) $$
where \(G(x, y, \sigma)\) is a variable-scale Gaussian kernel, \(I(x, y)\) is the input image, and \(*\) denotes the convolution operation.
So, we first take the original image and blur it using a Gaussian convolution. What follows is a sequence of further convolutions with increasing standard deviation (σ). Images of the same size (with different blur levels) are called an octave. Then we downsize the image by a factor of 2, which starts another row of convolutions. We repeat this process until the images are too small to proceed.
Now we have constructed a scale space. We do this to handle the multiscale nature of real-world objects.
Since we are looking for the most stable image features, we consider the Laplacian of Gaussian (LoG). In detailed experimental comparisons, Mikolajczyk (2002) found that maxima and minima of the Laplacian of Gaussian produce the most stable image features compared to a range of other possible image functions, such as the gradient, Hessian, or Harris corner function.
The problem is that calculating all those second-order derivatives is computationally intensive, so we use the Difference of Gaussians (DoG), an approximation of the LoG. The Difference of Gaussians is obtained as the difference of Gaussian blurrings of an image with two different σ and is given by:
$$ D(x,y,\sigma) = (G(x,y,k\sigma) - G(x,y,\sigma)) * I(x,y) $$ $$ D(x,y,\sigma) = L(x,y,k\sigma) - L(x,y,\sigma) $$ It is represented in below image:
This is done for all octaves. The resulting images are an approximation of the scale-invariant Laplacian of Gaussian (which produces stable image keypoints).
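A minimal sketch of computing one DoG image, here using scipy's gaussian_filter on a synthetic image (the σ and k values are just illustrative):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

# One Difference-of-Gaussians image: D = L(k*sigma) - L(sigma).
rng = np.random.default_rng(0)
img = rng.random((64, 64))            # stand-in for a grayscale image
sigma, k = 1.6, np.sqrt(2)            # illustrative scale-space parameters
L1 = gaussian_filter(img, sigma)      # L(x, y, sigma)
L2 = gaussian_filter(img, k * sigma)  # L(x, y, k*sigma)
D = L2 - L1
```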
Now that we have found potential keypoints, we have to refine them further for more accurate results.
The first step is to locate the maxima and minima of the Difference of Gaussian (DoG) images. Each pixel in the DoG images is compared to its 8 neighbours at the same scale, plus the 9 corresponding neighbours in each of the two adjacent scales (26 neighbours in total). If the pixel is a local maximum or minimum, it is selected as a candidate keypoint.
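The 26-neighbour comparison can be sketched as follows; dog is assumed to be a (scale, row, col) stack of DoG images, and the helper name is hypothetical:

```python
import numpy as np

# Check whether the DoG value at (scale s, row i, col j) is larger or
# smaller than all 26 neighbours: 8 at the same scale and 9 in each
# adjacent scale.
def is_extremum(dog, s, i, j):
    cube = dog[s-1:s+2, i-1:i+2, j-1:j+2]     # 3x3x3 neighbourhood
    val = dog[s, i, j]
    neighbours = np.delete(cube.ravel(), 13)  # drop the centre element
    return val > neighbours.max() or val < neighbours.min()
```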
Once a keypoint candidate has been found by comparing a pixel to its neighbors, the next step is to refine the location of these feature points to sub-pixel accuracy whilst simultaneously removing any poor features.
The sub-pixel localization proceeds by fitting a 3D quadratic function to the local sample points to determine the interpolated location of the maximum. This approach uses the Taylor expansion (up to the quadratic terms) of the scale-space function, D(x, y, σ), shifted so that the origin is at the sample point:
$$ D(x) = D + \frac{\partial D^T}{\partial x} x + \frac{1}{2}x^T \frac{\partial^2 D}{\partial x^2}x $$
The location of the extremum, \(\hat{x}\), is determined by taking the derivative of this function with respect to x and setting it to zero, giving:
$$ \hat{x} = - \left(\frac{\partial^2 D}{\partial x^2}\right)^{-1}\frac{\partial D}{\partial x} $$
On solving, we’ll get sub-pixel keypoint locations. Now we need to remove keypoints which have low contrast or lie along an edge, as they are not useful to us.
The function value at the extremum, \(D(\hat{x})\), is useful for rejecting unstable extrema with low contrast. It can be obtained by substituting the extremum \(\hat{x}\) into the Taylor expansion (up to quadratic terms) given above: $$ D(\hat{x} ) = D + \frac{1}{2} \frac{\partial D^T}{\partial x}\hat{x} $$
This is achieved by using a 2x2 Hessian matrix (H) to compute the principal curvature. A poorly defined peak in the difference-of-Gaussian function will have a large principal curvature across the edge but a small one in the perpendicular direction.
Using the calculation from the Hessian matrix, we reject keypoints in flat regions and along edges, and keep the corner keypoints.
To determine the keypoint orientation, a gradient orientation histogram is computed in the neighbourhood of the keypoint.
The magnitude and orientation are calculated for all pixels around the keypoint. Then, a histogram with 36 bins covering 360 degrees is created.
$$ m(x,y) = \sqrt{(L(x+1,y) - L(x-1,y))^2 + (L(x,y+1) - L(x,y-1))^2 } $$
$$ \theta(x,y) = \tan^{-1} {\frac{L(x,y+1) - L(x,y-1)}{L(x+1,y) - L(x-1,y)}} $$
\(m(x,y)\) is magnitude and \(\theta(x,y)\) is the orientation of the pixel at x,y location. Each sample added to the histogram is weighted by its gradient magnitude and by a Gaussian-weighted circular window with a σ that is 1.5 times that of the scale of the keypoint.
When this is done for all the pixels around the keypoint, the histogram will have a peak at some point. Any peak above 80% of the highest peak is converted into a new keypoint. This new keypoint has the same location and scale as the original, but its orientation is equal to that other peak.
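The orientation-histogram voting can be sketched like this; the magnitudes and angles are synthetic stand-ins for the gradient values around a keypoint:

```python
import numpy as np

# 36 bins of 10 degrees each cover the full 360-degree range; each
# sample votes with its gradient magnitude as the weight.
rng = np.random.default_rng(0)
mags = rng.random(100)                    # synthetic gradient magnitudes
thetas = rng.uniform(0, 360, 100)         # synthetic gradient orientations
hist, _ = np.histogram(thetas, bins=36, range=(0, 360), weights=mags)
dominant = np.argmax(hist) * 10           # lower edge of the peak bin, in degrees
```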
Once a keypoint orientation has been selected, the feature descriptor is computed as a set of orientation histograms.
To do this, a 16x16 window around the keypoint is taken. It is divided into 16 sub-blocks of 4x4 size.
Within each 4x4 window, gradient magnitudes and orientations are calculated. These orientations are put into an 8 bin histogram.
Each histogram contains 8 bins, and each descriptor contains a 4×4 array of histograms around the keypoint. This leads to a SIFT feature vector with 4 × 4 × 8 = 128 elements.
This feature vector introduces a few complications.
In this section we will perform object detection using SIFT with the help of the OpenCV library in Python.
Before starting object detection, let's first look at keypoint detection.
import cv2
import numpy as np
import matplotlib.pyplot as plt
train_img = cv2.imread('train.jpg') # train image
query_img = cv2.imread('query.jpg') # query/test image
# Turn Images to grayscale
def to_gray(color_img):
    gray = cv2.cvtColor(color_img, cv2.COLOR_BGR2GRAY)
    return gray
train_img_gray = to_gray(train_img)
query_img_gray = to_gray(query_img)
# Initialise SIFT detector (in OpenCV >= 4.4 SIFT is in the main module: cv2.SIFT_create())
sift = cv2.xfeatures2d.SIFT_create()
# Generate SIFT keypoints and descriptors
train_kp, train_desc = sift.detectAndCompute(train_img_gray, None)
query_kp, query_desc = sift.detectAndCompute(query_img_gray, None)
plt.figure(1)
plt.imshow((cv2.drawKeypoints(train_img_gray, train_kp, train_img.copy())))
plt.title('Train Image Keypoints')
plt.figure(2)
plt.imshow((cv2.drawKeypoints(query_img_gray, query_kp, query_img.copy())))
plt.title('Query Image Keypoints')
plt.show()
Here I took pictures of Taj Mahal from different viewpoints for train and query image.
As you can see in the code above, the following snippet:
# Initialise SIFT detector
sift = cv2.xfeatures2d.SIFT_create()
# Generate SIFT keypoints and descriptors
train_kp, train_desc = sift.detectAndCompute(train_img_gray, None)
query_kp, query_desc = sift.detectAndCompute(query_img_gray, None)
is the one that does the computations for the SIFT algorithm and returns keypoints and descriptors of the image.
We can use the keypoints and descriptors for feature matching between two objects and finally find object in the query image.
Now we move onto feature matching part. We will match features in one image with others.
For feature matching we are using the Brute-Force matcher provided by OpenCV. You can also use the FLANN matcher in OpenCV, as I do in a later section of the tutorial.
The Brute-Force matcher takes the descriptor of one feature in the first set and matches it against all the features in the second set using some distance calculation, and the closest one is returned.
# create a BFMatcher object which will match up the SIFT features
bf = cv2.BFMatcher(cv2.NORM_L2, crossCheck=True)
matches = bf.match(train_desc, query_desc)
# Sort the matches in the order of their distance.
matches = sorted(matches, key = lambda x:x.distance)
# draw the top N matches
N_MATCHES = 100
match_img = cv2.drawMatches(
    train_img, train_kp,
    query_img, query_kp,
    matches[:N_MATCHES], query_img.copy(), flags=0)
plt.figure(3)
plt.imshow(match_img)
plt.show()
The visualization of how the SIFT features match up across the two images is as follows:
So far we have found the keypoints and descriptors for the train and query images, matched the top keypoints and visualised them. But this is still not sufficient to find the object.

For that, we can use the function cv2.findHomography(). If we pass the sets of points from both images, it will find the perspective transformation of that object. Then we can use cv2.perspectiveTransform() to find the object. It needs at least four correct points to find the transformation.

So now we will use a train image and then try to detect it in real time.
import cv2
import numpy as np
import matplotlib.pyplot as plt
# Threshold
MIN_MATCH_COUNT=30
# Initiate SIFT detector
sift=cv2.xfeatures2d.SIFT_create()
# Create the FLANN matcher object
FLANN_INDEX_KDTREE = 1
flannParam = dict(algorithm=FLANN_INDEX_KDTREE, trees=5)
flann = cv2.FlannBasedMatcher(flannParam, {})
train_img = cv2.imread("obama1.jpg",0) # train image
# find the keypoints and descriptors with SIFT
kp1,desc1 = sift.detectAndCompute(train_img,None)
# draw keypoints of the train image
train_img_kp= cv2.drawKeypoints(train_img,kp1,None,(255,0,0),4)
# show the train image keypoints
plt.imshow(train_img_kp)
plt.show()
# start capturing video
cap = cv2.VideoCapture(0)
while True:
    ret, frame = cap.read()
    # turn the frame captured into grayscale
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # find the keypoints and descriptors of the captured frame with SIFT
    kp2, desc2 = sift.detectAndCompute(gray, None)
    # obtain matches using the k-nearest-neighbour method;
    # 'matches' holds the similar matches found in both images
    matches = flann.knnMatch(desc2, desc1, k=2)
    # store all the good matches as per Lowe's ratio test
    goodMatch = []
    for m, n in matches:
        if m.distance < 0.75 * n.distance:
            goodMatch.append(m)
    # If enough matches are found, we extract the locations of matched
    # keypoints in both images. They are passed to find the perspective
    # transformation, and then we are able to locate our object.
    if len(goodMatch) > MIN_MATCH_COUNT:
        tp = []  # src_pts
        qp = []  # dst_pts
        for m in goodMatch:
            tp.append(kp1[m.trainIdx].pt)
            qp.append(kp2[m.queryIdx].pt)
        tp, qp = np.float32((tp, qp))
        H, status = cv2.findHomography(tp, qp, cv2.RANSAC, 3.0)
        h, w = train_img.shape
        train_outline = np.float32([[[0, 0], [0, h - 1], [w - 1, h - 1], [w - 1, 0]]])
        query_outline = cv2.perspectiveTransform(train_outline, H)
        cv2.polylines(frame, [np.int32(query_outline)], True, (0, 255, 0), 5)
        cv2.putText(frame, 'Object Found', (50, 50), cv2.FONT_HERSHEY_COMPLEX, 2, (0, 255, 0), 2)
        print("Match Found-")
        print(len(goodMatch), MIN_MATCH_COUNT)
    else:
        print("Not Enough match found-")
        print(len(goodMatch), MIN_MATCH_COUNT)
    cv2.imshow('result', frame)
    if cv2.waitKey(1) == 13:
        break
cap.release()
cv2.destroyAllWindows()
Result :
To see the full code for this post check out this repository
Navneet Dalal and Bill Triggs introduced Histogram of Oriented Gradients (HOG) features in 2005. HOG is a feature descriptor used in image processing, mainly for object detection. A feature descriptor is a representation of an image or an image patch that simplifies the image by extracting useful information from it.
The principle behind the histogram of oriented gradients descriptor is that local object appearance and shape within an image can be described by the distribution of intensity gradients or edge directions. The x and y derivatives of an image (Gradients) are useful because the magnitude of gradients is large around edges and corners due to abrupt change in intensity and we know that edges and corners pack in a lot more information about object shape than flat regions. So, the histograms of directions of gradients are used as features in this descriptor.
Now that we know the basic principle of the Histogram of Oriented Gradients, we will move on to how the histograms are calculated and how the feature vectors obtained from the HOG descriptor are used by a classifier such as an SVM to detect the object of interest.
Preprocessing involves normalising the image, but it is entirely optional; it is used to improve the performance of the HOG descriptor. Since we are building a simple descriptor here, we don't use any normalisation in preprocessing.
The first actual step in the HOG descriptor is to compute the image gradient in both the x and y direction.
Let us take an example. Say the pixel Q has values surrounding it as shown below:
We can calculate the Gradient magnitude for Q in x and y direction as follow: $$ G_x = 100 -50 =50 $$
$$ G_y = 120 -70 =50 $$
We can get the magnitude of the gradient as:
$$ G= \sqrt{(G_x)^2 + (G_y)^2} = 70.7 $$ And the direction of the gradient as :
$$ \theta = \arctan\left({\frac {G_y} {G_x}}\right) = 45^\circ $$
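The worked example above can be checked in a few lines; gx and gy use the neighbouring pixel values quoted for Q:

```python
import numpy as np

# Gradient magnitude and direction for the example pixel Q.
gx = 100 - 50                               # horizontal difference
gy = 120 - 70                               # vertical difference
magnitude = np.sqrt(gx**2 + gy**2)          # sqrt(50^2 + 50^2) ~ 70.7
direction = np.degrees(np.arctan2(gy, gx))  # 45 degrees
```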
The image is divided into 8×8 cell blocks and a histogram of gradients is calculated for each 8×8 cell block.
The histogram is essentially a vector of 9 buckets (numbers) corresponding to angles from \(0^ \circ\) to \(180^ \circ\) in \(20^ \circ\) increments.
The values of the 64 cells (8×8) are binned and cumulatively added into these 9 buckets.
This essentially reduces 64 values into 9 values.
A great illustration of this is shown on learnopencv. The following figure shows how it is done. The pixel encircled in blue has an angle of \(80^ \circ\) and a magnitude of 2, so it adds 2 to the 5th bin. The gradient at the pixel encircled in red has an angle of \(10^ \circ\) and a magnitude of 4. Since \(10^ \circ\) is halfway between \(0^ \circ\) and \(20^ \circ\), the vote from this pixel splits evenly between the two bins.
After creating the histogram of oriented gradients we need to do one more thing. The gradient is sensitive to overall lighting: if we divide or multiply all pixel values by some constant to make the image darker or lighter, the gradient magnitudes will change and so will the histogram values. We want the histogram values to be independent of lighting, so normalisation is performed on the histogram vector v within a block, using a norm such as the L2 norm.
Now, we could simply normalise the 9×1 histogram vector but it is better to normalise a bigger sized block of 16×16. A 16×16 block has 4 histograms (8×8 cell results to one histogram) which can be concatenated to form a 36 x 1 element vector and normalised. The 16×16 window then moves by 8 pixels and a normalised 36×1 vector is calculated over this window and the process is repeated for the image.
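Block normalisation can be sketched as follows, here with the L2 norm on a synthetic 36-element block vector (4 concatenated 9-bin histograms):

```python
import numpy as np

# L2-normalise a 36x1 block vector; the small epsilon guards against
# division by zero for all-zero blocks.
rng = np.random.default_rng(0)
block = rng.random(36)                 # synthetic concatenated histograms
normalised = block / np.sqrt(np.sum(block ** 2) + 1e-12)
```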
This vector is now used to train classifiers such as SVM and then do object detection.
Here is a snippet, provided in Scikit-Image's docs, to visualise the HOG features of an image.
import matplotlib.pyplot as plt
from skimage.feature import hog
from skimage import data, exposure
image = data.astronaut()
fd, hog_image = hog(image, orientations=8, pixels_per_cell=(16, 16),
                    cells_per_block=(1, 1), visualize=True, multichannel=True)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 6), sharex=True, sharey=True)
ax1.axis('off')
ax1.imshow(image, cmap=plt.cm.gray)
ax1.set_title('Input image')
# Rescale histogram for better display
hog_image_rescaled = exposure.rescale_intensity(hog_image, in_range=(0, 10))
ax2.axis('off')
ax2.imshow(hog_image_rescaled, cmap=plt.cm.gray)
ax2.set_title('Histogram of Oriented Gradients')
plt.show()
The output is as follows:
In this tutorial we will be performing a simple Face Detection using HOG features.
We need to first train the classifier in order to do face detection, so we will need a training set for it.
The Labelled Faces in the Wild dataset provided by Scikit-Learn consists of a variety of faces, which is perfect for our positive set.
from sklearn.datasets import fetch_lfw_people
faces = fetch_lfw_people()
positive_patches = faces.images
For the negative set we need images without faces in them. Scikit-Image offers images which can be used in this case. To increase the size of the negative set we extract patches of each image at different scales using the PatchExtractor from Scikit-Learn.
from skimage import color, data, transform
from sklearn.feature_extraction.image import PatchExtractor
import numpy as np

imgs_to_use = ['camera', 'text', 'coins', 'moon',
               'page', 'clock', 'immunohistochemistry',
               'chelsea', 'coffee', 'hubble_deep_field']
images = [color.rgb2gray(getattr(data, name)())
          for name in imgs_to_use]

def extract_patches(img, N, scale=1.0, patch_size=positive_patches[0].shape):
    extracted_patch_size = tuple((scale * np.array(patch_size)).astype(int))
    extractor = PatchExtractor(patch_size=extracted_patch_size,
                               max_patches=N, random_state=0)
    patches = extractor.transform(img[np.newaxis])
    if scale != 1:
        patches = np.array([transform.resize(patch, patch_size)
                            for patch in patches])
    return patches

negative_patches = np.vstack([extract_patches(im, 1000, scale)
                              for im in images for scale in [0.5, 1.0, 2.0]])
Scikit-Image’s feature module offers the function skimage.feature.hog, which extracts Histogram of Oriented Gradients (HOG) features for a given image. We combine the positive and negative sets and compute the HOG features:
from skimage import feature
from itertools import chain
X_train = np.array([feature.hog(im)
                    for im in chain(positive_patches,
                                    negative_patches)])
y_train = np.zeros(X_train.shape[0])
y_train[:positive_patches.shape[0]] = 1
We will use Scikit-Learn’s LinearSVC with a grid search over a few choices of the C parameter:
from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(LinearSVC(dual=False), {'C': [1.0, 2.0, 4.0, 8.0]},cv=3)
grid.fit(X_train, y_train)
grid.best_score_
We will take the best estimator and then build a model.
model = grid.best_estimator_
model.fit(X_train, y_train)
Now that we have built the Model we can test it on a new image to see how it detects the faces.
from skimage import io, transform
img = io.imread('testpic.jpg', as_gray=True)
img = transform.rescale(img, 0.5)
indices, patches = zip(*sliding_window(img))
patches_hog = np.array([feature.hog(patch) for patch in patches])
labels = model.predict(patches_hog)
We are detecting the face by using a sliding window which goes over patches of the image. Then we find the HOG features of these patches. Finally, we run them through the classification model that we built and predict the faces in the image. The image below is one of the test images. We can see that the classifier detected patches, and most of them overlap the face in the image.
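The snippet above relies on a sliding_window helper that is not shown in this excerpt; a minimal sketch is below, where the patch size (the LFW face size) and step values are assumptions:

```python
import numpy as np
from skimage import transform

# Slide a fixed-size window across the image, yielding each window's
# top-left index and the (optionally rescaled) patch.
def sliding_window(img, patch_size=(62, 47), istep=2, jstep=2, scale=1.0):
    Ni, Nj = (int(scale * s) for s in patch_size)
    for i in range(0, img.shape[0] - Ni, istep):
        for j in range(0, img.shape[1] - Nj, jstep):
            patch = img[i:i + Ni, j:j + Nj]
            if scale != 1:
                patch = transform.resize(patch, patch_size)
            yield (i, j), patch
```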
To see the full code for this post check out this repository
A neural network is a mathematical or computational model inspired by biological neural networks. It consists of an interconnected group of artificial neurons. The structure and functioning of the central nervous system, with the neurons, axons, dendrites and synapses which make up the processing parts of biological neural networks, were the original inspiration that led to the development of computational models of neural networks.
The first computational model of a neuron was presented in 1943 by W. McCulloch and W. Pitts. They called this model threshold logic. The model paved the way for neural network research to split into two distinct approaches: one focused on biological processes in the brain, the other on the application of neural networks to artificial intelligence.
In 1958, Rosenblatt conceived the Perceptron, an algorithm for pattern recognition based on a two-layer learning computer network using simple addition and subtraction. His work had a big impact, but in 1969 a sharp critique by Minsky and Papert was published.
Work on neural networks slowed down, but John Hopfield, convinced of the power of neural networks, came out with his model in 1982 and boosted research in this field. The Hopfield network is a particular case of a neural network. It is grounded in physics, inspired by spin systems.
The Hopfield network is a recurrent neural network with bipolar threshold neurons. It consists of a set of interconnected neurons which update their activation values asynchronously. The activation values are binary, usually {-1, 1}. The update of a unit depends on the other units of the network and on itself.
A neuron in the Hopfield model is binary and defined by the standard McCulloch-Pitts model of a neuron:
\[n_i (t+1)= \theta(\sum_{j}w_{ij} n_j (t) - \mu_i) \tag{1}\]
where \(n_i(t+1)\) is the \(i\)-th neuron at time \(t+1\), \(n_j(t)\) is the \(j\)-th neuron at time \(t\), \(w_{ij}\) is the weight matrix (the synaptic weights), \(\theta\) is the step function and \(\mu_i\) is the bias. In the Hopfield model the neurons have a binary output taking values -1 and 1. Thus the model has the following form:
\[S_i(t+1) = sgn(\sum_{j}w_{ij} S_j(t) - \vartheta_i) \tag{2} \]
where \(S_i\) and \(n_i\) are related through the formula \(S_i = 2n_i - 1\) (since \(n_i \in \{0,1\}\) and \(S_i \in \{-1,1\}\)). \(\vartheta_i\) is the threshold, so if the input is above the threshold the neuron fires 1. Here \(S\) represents the neurons that were denoted \(n\) in equation 1. The function sgn is the signum function, defined as follows:
$$
sgn(x) =
\begin{cases}
-1 & \text{if } x < 0,\\
0 & \text{if } x = 0,\\
1 & \text{if } x > 0
\end{cases}
$$
For ease of analysis in this post we will drop the threshold (\(\vartheta_i = 0\)), as we will mainly analyse random patterns, and thresholds are not useful in this context. In this case the model is written as:
$$ S_i(t+1) = sgn(\sum_{j}w_{ij} S_j(t) ) \tag{3} $$
In this post we are looking at Auto-associative model of Hopfield Network. It can store useful information in memory and later it is able to reproduce this information from partially broken patterns.
The training procedure doesn’t require any iterations. It involves just an outer product between the input vector and the transposed input vector to fill the weight matrix \(w_{ij}\) (the synaptic weights); in the case of many patterns it is as follows:
$$ w_{i,j} = \frac{1}{N} \sum_{\mu=1}^p \epsilon_i^\mu\epsilon_j^\mu \tag{4} $$
where \(\epsilon_i^\mu\) is the \(i\)-th component of pattern \(\mu\), \(p\) is the total number of patterns, and \(N\) is the number of neurons.
The main advantage of an auto-associative network is that it is able to recover a pattern from memory using just partial information about the pattern. There are two main update approaches: synchronous, where all neurons are updated at once, and asynchronous, where one neuron is updated at a time.
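Equation 4 can be written compactly as a sum of outer products; the two tiny patterns below are made up for illustration:

```python
import numpy as np

# Hebbian weight matrix: W = (1/N) * sum of outer products of the
# stored patterns, with the diagonal (self-connections) set to zero.
patterns = np.array([[1, -1, 1, -1],
                     [1, 1, -1, -1]], dtype=float)
p, N = patterns.shape                  # p patterns of N neurons each
W = (patterns.T @ patterns) / N        # sums the outer products over all patterns
np.fill_diagonal(W, 0.0)
```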
We update the neurons as specified in equation 3:
Now that we have covered the basics, let's start implementing the Hopfield network.
import matplotlib.pyplot as plt
import numpy as np
nb_patterns = 4  # Number of patterns to learn
pattern_width = 5
pattern_height = 5
max_iterations = 10
# Define Patterns
patterns = np.array([
    [1,-1,-1,-1,1,1,-1,1,1,-1,1,-1,1,1,-1,1,-1,1,1,-1,1,-1,-1,-1,1.],    # Letter D
    [-1,-1,-1,-1,-1,1,1,1,-1,1,1,1,1,-1,1,-1,1,1,-1,1,-1,-1,-1,1,1.],    # Letter J
    [1,-1,-1,-1,-1,-1,1,1,1,1,-1,1,1,1,1,-1,1,1,1,1,1,-1,-1,-1,-1.],     # Letter C
    [-1,1,1,1,-1,-1,-1,1,-1,-1,-1,1,-1,1,-1,-1,1,1,1,-1,-1,1,1,1,-1.]],  # Letter M
    dtype=float)
So we import the necessary libraries and define the patterns we want the network to learn. Here, we are defining 4 patterns. We can visualise them with the help of the code below:
# Show the patterns
fig, ax = plt.subplots(1, nb_patterns, figsize=(15, 10))
for i in range(nb_patterns):
    ax[i].matshow(patterns[i].reshape((pattern_height, pattern_width)), cmap='gray')
    ax[i].set_xticks([])
    ax[i].set_yticks([])
which gives the following output:
We now train the network by filling the weight matrix as defined in equation 4:
# Train the network
W = np.zeros((pattern_width * pattern_height, pattern_width * pattern_height))
for i in range(pattern_width * pattern_height):
    for j in range(pattern_width * pattern_height):
        if i == j or W[i, j] != 0.0:
            continue
        w = 0.0
        for n in range(nb_patterns):
            w += patterns[n, i] * patterns[n, j]
        W[i, j] = w / patterns.shape[0]
        W[j, i] = W[i, j]
Now that the network is trained, we will create a corrupted pattern to test it on.
# Test the Network
# Create a corrupted pattern S
S = np.array([1,-1,-1,-1,-1,1,1,1,1,1,-1,1,1,1,1,-1,1,1,1,1,1,1,-1,-1,-1.],
             dtype=float)
# Show the corrupted pattern
fig, ax = plt.subplots()
ax.matshow(S.reshape((pattern_height, pattern_width)), cmap='gray')
The corrupted pattern here is obtained by simply editing some bits in the pattern array of letter C. The corrupted pattern looks as follows:
We pass the corrupted pattern through the network, where it is updated as defined in equation 3, so each iteration applies some update to the corrupted pattern. After every iteration we compute the Hamming distance between the updated pattern and each of the stored patterns; the stored pattern with the smallest Hamming distance is taken as the closest match.
h = np.zeros(pattern_width * pattern_height)

# Hamming-distance matrix, used to watch convergence
hamming_distance = np.zeros((max_iterations, nb_patterns))

for iteration in range(max_iterations):
    # Asynchronous updates: pick a random neuron and recompute its activation
    for _ in range(pattern_width * pattern_height):
        i = np.random.randint(pattern_width * pattern_height)
        h[i] = 0
        for j in range(pattern_width * pattern_height):
            h[i] += W[i, j] * S[j]
        S[i] = 1 if h[i] >= 0 else -1    # update only the chosen neuron (equation 3)
    for i in range(nb_patterns):
        hamming_distance[iteration, i] = ((patterns - S)[i] != 0).sum()

fig, ax = plt.subplots()
ax.matshow(S.reshape((pattern_height, pattern_width)), cmap='gray')
hamming_distance
Here we see that the Hamming distance between the corrupted pattern and the third pattern, i.e. letter C, becomes 0 after a few iterations, thus correcting the corrupted pattern.
We can also see the plot of all the Hamming distances below:
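A plot like this can be produced with a short matplotlib sketch. The `hamming_distance` values below are illustrative stand-ins for the array filled by the recovery loop above (rows = iterations, columns = patterns):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")              # render off-screen
import matplotlib.pyplot as plt

# Stand-in for the hamming_distance array filled by the recovery loop
hamming_distance = np.array([[9, 11, 4, 13],
                             [8, 12, 1, 14],
                             [8, 12, 0, 14]])

fig, ax = plt.subplots()
for i in range(hamming_distance.shape[1]):
    ax.plot(hamming_distance[:, i], label=f'pattern {i}')
ax.set_xlabel('iteration')
ax.set_ylabel('Hamming distance to S')
ax.legend()
fig.savefig('hamming.png')
```

One curve per stored pattern makes it easy to see which pattern the network converges to: its curve drops to 0.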
The formal definition for object detection is as follows:
A Computer Vision technique to locate the presence of objects in images or videos. Object detection comprises two tasks: image classification and object localization.
Image classification answers the question "What is in the picture/frame?". It takes an image and predicts the object in it. For example, from the pictures below we can build a classifier that detects a person in one picture and a bicycle in the other.
But if both of them are in the same image, it becomes a problem. We could train a multi-label classifier, but we still would not know the positions of the bicycle or the person. The task of locating the object in the image is called object localisation.
Object detection is a widely used technique in production systems. There are variants of object detection problem such as:
An image has multiple objects, but every application focuses on a particular thing: a face detection application is focused on finding faces, a traffic control system on vehicles, and a driving technology on differentiating between vehicles and living beings. Along the same lines, object detection helps identify the image segment that the application needs to focus on.
It can also be used to reduce the image to only the region containing the object of interest, greatly improving execution time.
Generally, object detection is achieved using either machine-learning-based or deep-learning-based approaches.
In this approach, we define the features and then train the classifier (such as SVM) on the feature-set. Following are the machine learning based object detection techniques:
SIFT was created by David Lowe of the University of British Columbia in 1999. The SIFT approach to image feature generation takes an image and transforms it into a large collection of local feature vectors, each of which is invariant to scaling, rotation, or translation of the image. There are four steps involved in the SIFT algorithm:
These resulting vectors are known as SIFT keys and are used in a nearest-neighbour approach to identify possible objects in an image.
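The nearest-neighbour step can be sketched with a simple distance-and-ratio test over descriptor vectors. This is an illustrative sketch, not Lowe's implementation: `match_descriptors` and the `0.8` ratio threshold are assumptions, and the 2-D "descriptors" stand in for real 128-dimensional SIFT vectors:

```python
import numpy as np

def match_descriptors(query, reference, ratio=0.8):
    """Nearest-neighbour matching with a ratio test: keep a match only if
    the best reference descriptor is clearly closer than the second best."""
    matches = []
    for qi, q in enumerate(query):
        dists = np.linalg.norm(reference - q, axis=1)   # distance to every reference
        order = np.argsort(dists)
        best, second = order[0], order[1]
        if dists[best] < ratio * dists[second]:
            matches.append((qi, int(best)))
    return matches

# Toy 2-D "descriptors": the first query clearly matches reference 0,
# the second is equidistant from two references and is discarded
reference = np.array([[0.0, 0.0], [10.0, 10.0], [20.0, 0.0]])
query = np.array([[0.5, 0.5], [5.0, 5.0]])
print(match_descriptors(query, reference))   # [(0, 0)]
```

The ratio test is what makes this approach robust: an ambiguous key that lies roughly equidistant from two candidates produces no match at all rather than a wrong one.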
Deep Learning techniques are able to do end-to-end object detection without specifically defining features, and are typically based on convolutional neural networks (CNN). A Convolutional Neural Network (CNN, or ConvNet) is a special kind of multi-layer neural networks, designed to recognize visual patterns directly from pixel images.
In 2012, AlexNet significantly outperformed all prior competitors at the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) and won the challenge. Convolutional Neural Networks became the gold standard for image classification after Krizhevsky's CNN's performance at ImageNet.
Running a CNN classifier over every possible region of an image is too slow and computationally very expensive. R-CNN solves this problem by using an object proposal algorithm called Selective Search, which reduces the number of bounding boxes fed to the classifier to roughly 2000 region proposals.
In R-CNN, the selective search method developed by J.R.R. Uijlings et al. (2012) is an alternative to exhaustive search in an image to capture object location. It looks at the image through windows of different sizes, and for each size tries to group together adjacent pixels by texture, color, or intensity to identify objects.
The main idea is composed of two steps. First, using selective search, it identifies a manageable number of bounding-box object region candidates (region of interest). And then it extracts CNN features from each region independently for classification.
R-CNN was improved over time for better performance. Fast Region-based Convolutional Network (Fast R-CNN), developed by R. Girshick (2015), reduced the time consumption caused by the high number of models needed to analyse all region proposals in R-CNN.
The YOLO model (J. Redmon et al., 2016) directly predicts bounding boxes and class probabilities with a single network in a single evaluation. They reframe the object detection as a single regression problem, straight from image pixels to bounding box coordinates and class probabilities.
YOLO divides each image into a grid of S x S and each grid predicts N bounding boxes and confidence. The confidence score tells us how certain it is that the predicted bounding box actually encloses some object.
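The grid assignment, and the IoU (intersection over union) that underlies the confidence score, can be sketched in a few lines. `grid_cell` and `iou` are illustrative helper names, not functions from the YOLO code:

```python
def grid_cell(cx, cy, img_w, img_h, S):
    """(row, col) of the S x S grid cell containing a box center (cx, cy)."""
    col = min(int(cx / img_w * S), S - 1)
    row = min(int(cy / img_h * S), S - 1)
    return row, col

def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union

# A box centered at (320, 240) in a 640 x 480 image lands in cell (3, 3) of a 7 x 7 grid
print(grid_cell(320, 240, 640, 480, 7))   # (3, 3)
```

During training, the grid cell containing an object's center is the one responsible for predicting it, and the confidence target is the IoU between the predicted and the ground-truth box.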
Over time, it has become faster and better, with its versions named YOLO V1, YOLO V2 and YOLO V3. YOLO V2 is better than V1 in terms of accuracy and speed, and YOLO V3 is more accurate than V2.
The SSD model was published by Wei Liu et al. in 2015, shortly after the YOLO model, and was later refined in a subsequent paper.
Unlike YOLO, SSD does not split the image into grids of arbitrary size but predicts offsets of predefined anchor boxes for every location of the feature map. Each box has a fixed size and position relative to its corresponding cell, and together the anchor boxes tile the whole feature map in a convolutional manner.
Feature maps at different levels have different receptive field sizes. The anchor boxes on different levels are rescaled so that one feature map is only responsible for objects at one particular scale.
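The tiling idea can be sketched as follows. This is a simplified illustration with one square anchor per location (`tile_anchors` is a hypothetical helper; real SSD places several boxes with different aspect ratios at each location):

```python
import numpy as np

def tile_anchors(fmap_size, scale):
    """One square anchor of side `scale`, centered on every cell of an
    fmap_size x fmap_size feature map; coordinates normalised to [0, 1]."""
    step = 1.0 / fmap_size
    boxes = [((j + 0.5) * step, (i + 0.5) * step, scale, scale)
             for i in range(fmap_size)
             for j in range(fmap_size)]
    return np.array(boxes)       # shape (fmap_size**2, 4) as (cx, cy, w, h)

# A fine feature map gets many small anchors, a coarse one a few large anchors
fine = tile_anchors(8, scale=0.2)    # 64 anchors for small objects
coarse = tile_anchors(2, scale=0.7)  # 4 anchors for large objects
```

This makes the division of labour concrete: the 8 x 8 map is responsible only for small-scale objects, while the 2 x 2 map, with its larger receptive field, handles large ones.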