ETL Process for Group 11

Ashraff Hadi
3 min readJan 22, 2021

Extract, transform, load (ETL) is, according to Wikipedia, the general procedure of copying data from one or more sources into a destination system that represents the data differently from, or in a different context than, the source(s). ETL became a common concept in the 1970s and is mostly used in data warehousing.
Have you ever wondered why data warehousing is important? Believe it or not, a well-built ETL pipeline provides a company with real benefits.
So, in this post, I will walk you through an ETL pipeline step by step, using Python code.
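As a roadmap before we dive in, the three stages can be sketched as small functions chained together. This is only an illustration with toy data; the function names are not part of the project code:

```python
# A minimal ETL skeleton: each stage is a function, and the pipeline
# simply chains them. The names and the toy rows are illustrative.
import csv

def extract():
    # In the real pipeline this scrapes a web page; here we return toy rows.
    return [{"title": "Car A", "price": "RM 45,000"},
            {"title": "Car B", "price": "RM 61,500"}]

def transform(rows):
    # Clean the raw price strings into numeric values.
    for row in rows:
        row["price"] = float(row["price"].replace("RM ", "").replace(",", ""))
    return rows

def load(rows, path="cars.csv"):
    # Write the cleaned rows out; return them for easy inspection.
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["title", "price"])
        writer.writeheader()
        writer.writerows(rows)
    return rows

result = load(transform(extract()))
print(result)
```

The rest of the article fills in each of these stages with real scraping, cleaning, and CSV-writing code.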

First step : Extract

1. Send an HTTP request to the URL, then parse the HTML content.

!pip install beautifulsoup4
!pip install requests

from bs4 import BeautifulSoup
import requests

url = "https://oto.my/search/"
response = requests.get(url)
data = response.text
soup = BeautifulSoup(data, 'html.parser')

Below are the attributes we assigned to each group member, starting with:

1. Title Attribute

title_tag = []
for div in soup.findAll("li", {"class": "price"}):
    titles = div.findAll("b")
    for title in titles:
        # remove whitespace and strip the currency sign
        title_tag.append(float(title.text.strip().replace('US$ ', '')))
for title in title_tag:
    print(title)

2. Description Attribute

title_tag = []
for div in soup.findAll("div", {"class": "info-title"}):
    titles = div.findAll("a")
    for title in titles:
        title_tag.append(title.text.strip())
for title in title_tag:
    print(title)

3. Volume Attribute

volume_tag = []
for div in soup.findAll("div", {"class": "info-right-row"}):
    volumes = div.findAll("span")
    for volume in volumes:
        volume_tag.append(volume.text.replace('L', ''))  # strip the litre suffix
vol = volume_tag[3::10]  # every 10th span, starting at index 3, holds the volume
for v in vol:
    print(v)

4. Transmission Attribute

transmission_tag = []
for div in soup.findAll("div", {"class": "info-right-row"}):
    trans = div.findAll("span")
    for tran in trans:
        transmission_tag.append(tran.text)
tra = transmission_tag[5::10]  # every 10th span, starting at index 5, holds the transmission
for t in tra:
    print(t)

5. Car Type Attribute

car_tag = []
for div in soup.findAll("div", {"class": "info-right-row"}):
    cars = div.findAll("span")
    for car in cars:
        car_tag.append(car.text)
car_types = car_tag[7::10]  # every 10th span, starting at index 7, holds the car type
for c in car_types:
    print(c)

6. Location Attribute

location_tag = []
for div in soup.findAll("div", {"class": "info-right-row"}):
    locations = div.findAll("span")
    for location in locations:
        location_tag.append(location.text)
locs = location_tag[9::10]  # every 10th span, starting at index 9, holds the location
for l in locs:
    print(l)

7. Price Attribute

price_tag = []
p_tag = []
for div in soup.findAll("div", {"class": "info-price"}):
    price_tag.append(div.text.strip())
for p in price_tag[:10]:
    print(p)
    p_tag.append(p)

8. Dealer Attribute

d_tag = []
for div in soup.findAll("div", {"class": "seller-container"}):
    deals = div.findAll("small")
    for deal in deals:
        d_tag.append(deal.text)
for deal in d_tag:
    print(deal)

Now that we have extracted all the data required, we can move on to the next step.
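The transformation step below refers to a DataFrame `df`, which means the per-attribute lists built above first have to be combined into one table. The glue code is not shown in the article, so here is a hedged sketch of how the lists could be assembled with pandas, assuming they all have the same length (the sample values are stand-ins for the scraped data):

```python
import pandas as pd

# Illustrative stand-ins for the lists built during extraction;
# in the notebook these come from the scraping loops above.
title_tag = ["Perodua Myvi", "Proton Saga"]
volume_tag = ["1.5", "1.3"]          # after stripping the 'L' suffix
price_tag = ["RM 45,000", "RM 32,000"]

# Combine the per-attribute lists into a single DataFrame,
# one column per attribute, one row per car listing.
df = pd.DataFrame({
    "Title": title_tag,
    "Volume": volume_tag,
    "Price": price_tag,
})
print(df.shape)
```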

Second step : Transformation

  1. Use the latest Matplotlib
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt

mpl.style.use('ggplot')  # optional: for a ggplot-like style

# check the installed version of Matplotlib
print('Matplotlib version: ', mpl.__version__)  # >= 2.0.0

In this project, there is no numeric data in the original dataset, i.e. before cleaning, so there is not much we can show. We can only visualize how many cars are for sale by engine volume.
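Because the scraped values arrive as strings, the `Volume` column must be converted to a numeric type before it can be binned. One way to do this (a sketch, assuming a pandas DataFrame `df` with a string `Volume` column like the one used below) is:

```python
import pandas as pd

# Toy frame standing in for the scraped data; in the notebook,
# df is the DataFrame built from the extracted lists.
df = pd.DataFrame({"Volume": ["1.5", "1.3", "2.0", "1.5"]})

# Coerce the string column to floats; invalid entries become NaN
# and can then be dropped before plotting.
df["Volume"] = pd.to_numeric(df["Volume"], errors="coerce")
df = df.dropna(subset=["Volume"])
print(df["Volume"].dtype)
```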

Set the frequencies and bin edges for the data.

import numpy as np

# np.histogram returns 2 values
count, bin_edges = np.histogram(df['Volume'])
print(count)      # frequency count
print(bin_edges)  # bin ranges, default = 10 bins

After that, plot a histogram to show the number of cars for sale by engine volume.

df['Volume'].plot(kind='hist', figsize=(8, 5), xticks=bin_edges)
plt.title('Histogram of the Volume of Cars for Sale')  # add a title to the histogram
plt.ylabel('Number of Cars')    # add y-label
plt.xlabel('Volume of Engine')  # add x-label
plt.show()

Final step : Load

In this final step, the data that has been scraped is loaded into a CSV file. This represents the final output of the ETL process.

df.to_csv("car_sale_updated.csv", index=False, encoding="utf-8")
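A quick way to confirm the load succeeded is to read the file back and compare it against the original DataFrame. This is a sketch using a toy `df` in place of the scraped one:

```python
import pandas as pd

# Toy DataFrame standing in for the scraped data.
df = pd.DataFrame({"Title": ["Perodua Myvi", "Proton Saga"],
                   "Volume": [1.5, 1.3]})

# Write the data out, then read it back to verify the round trip.
df.to_csv("car_sale_updated.csv", index=False, encoding="utf-8")
check = pd.read_csv("car_sale_updated.csv")
print(check.shape == df.shape)
```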

I hope you enjoy reading this article and get something out of it.

You can find the full code here:

https://colab.research.google.com/drive/1I0f-XTn9lYe83lvHKAtPWwCuc8XC_yAE?usp=sharing#scrollTo=JnOKrytzNXGI


This is a compilation of my group work for our data engineering subject, hope it helps those in need!