Importing CSV Data into Elasticsearch
In this blog, I am going to explain how you can import publicly available CSV data into Elasticsearch. The Elastic Stack makes it easy to analyze any data and to create dashboards with key performance indicators. CSV data for different domains such as healthcare, crime, and agriculture is available on various government sites and can easily be downloaded. After downloading a CSV file, we can push it into Elasticsearch to perform search and analysis on top of that data. I have often seen that people do not know how to import such CSV data into Elasticsearch, which is why in this blog I explain the process step by step.
After the data import, you can use this data for analysis or for creating different dashboards. Here I am taking the example of the 'Crimes - 2001 to present' dataset from the data.gov website (https://catalog.data.gov/dataset?res_format=CSV), where you can download many types of data in CSV format. The size of this CSV file is approximately 1.6 GB.
Now, let us start the process to import this data into Elasticsearch. You need to do the following:
- Download the CSV file (crimes_2001.csv) from the "https://catalog.data.gov/dataset?res_format=CSV" website. This file has the following fields:
ID | Case Number | Date | Block | IUCR | Primary Type | Description | Location Description | Arrest | Domestic | Beat | District | Ward | Community Area | FBI Code | X Coordinate | Y Coordinate | Year | Updated On | Latitude | Longitude | Location
- Create a Logstash configuration file for reading the CSV data and writing it to Elasticsearch. You need to put the following configuration in the Logstash configuration file (crimes.conf):
input {
  file {
    # Path to the downloaded CSV file
    path => "/home/user/Downloads/crimes_2001.csv"
    # Read the file from the beginning instead of tailing it
    start_position => "beginning"
  }
}
filter {
  csv {
    # Map each CSV column to a named field
    columns => [
      "ID",
      "Case Number",
      "Date",
      "Block",
      "IUCR",
      "Primary Type",
      "Description",
      "Location Description",
      "Arrest",
      "Domestic",
      "Beat",
      "District",
      "Ward",
      "Community Area",
      "FBI Code",
      "X Coordinate",
      "Y Coordinate",
      "Year",
      "Updated On",
      "Latitude",
      "Longitude",
      "Location"
    ]
    separator => ","
  }
}
output {
  # Print each parsed event to the console for debugging
  stdout {
    codec => rubydebug
  }
  # Index each event into the "crimes" index in Elasticsearch
  elasticsearch {
    action => "index"
    hosts => ["127.0.0.1:9200"]
    index => "crimes"
  }
}
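Note that the csv filter indexes every value as a string, which is why numbers and dates appear as quoted strings in the sample document later in this blog. If you also want proper numeric and date fields, you can optionally add something like the following inside the filter block. This is only a minimal sketch and is not required for the import; the field names come from the columns above, and the date pattern assumes the 'MM/dd/yyyy hh:mm:ss a' format used in this dataset:
mutate {
  # Convert selected string fields to numeric types for range queries and aggregations
  convert => {
    "Latitude"  => "float"
    "Longitude" => "float"
    "Year"      => "integer"
  }
}
date {
  # Parse the crime date (e.g. "03/18/2015 11:00:00 PM") into @timestamp
  match  => ["Date", "MM/dd/yyyy hh:mm:ss a"]
  target => "@timestamp"
}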
- After creating the Logstash configuration file, execute the configuration with the following command:
/usr/share/logstash/bin/logstash -f /etc/logstash/conf.d/crimes.conf
This command creates the pipeline that reads the CSV data and writes it into Elasticsearch.
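Before starting the pipeline, you can optionally check the configuration file for syntax errors with Logstash's --config.test_and_exit flag:
/usr/share/logstash/bin/logstash -f /etc/logstash/conf.d/crimes.conf --config.test_and_exit
Logstash will validate the configuration and exit instead of starting the pipeline.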
If you want to know the basics of Logstash, please refer to the "Introduction to Logstash" blog, where I have explained them.
- You can verify the creation of the "crimes" index in Elasticsearch by listing the indices in the browser:
http://localhost:9200/_cat/indices?v
- If your index "crimes" is listed, you can see the data in Elasticsearch:
http://localhost:9200/crimes/_search?pretty
- After opening the above URL, you can see the data in the following format:
{
  "_index" : "crimes",
  "_type" : "_doc",
  "_id" : "BTTXFWoB75utKkMR2zRC",
  "_score" : 1.0,
  "_source" : {
    "Case Number" : "HY190059",
    "Block" : "066XX S MARSHFIELD AVE",
    "FBI Code" : "26",
    "IUCR" : "4625",
    "X Coordinate" : "1166468",
    "Ward" : "15",
    "Y Coordinate" : "1860715",
    "Beat" : "0725",
    "Location Description" : "STREET",
    "Domestic" : "false",
    "Community Area" : "67",
    "Updated On" : "02/10/2018 03:50:01 PM",
    "Primary Type" : "OTHER OFFENSE",
    "Year" : "2015",
    "host" : "KELLGGNLPTP0305",
    "Date" : "03/18/2015 11:00:00 PM",
    "path" : "/home/user/Downloads/crimes_2001.csv",
    "Arrest" : "true",
    "Longitude" : "-87.665319468",
    "ID" : "10000094",
    "Description" : "PAROLE VIOLATION",
    "@timestamp" : "2019-04-13T08:37:05.351Z",
    "District" : "007",
    "Latitude" : "41.773371528",
    "@version" : "1"
  }
}
The above Elasticsearch document from the 'crimes' index represents a single record from the CSV file.
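Once the data is indexed, you can also query it directly through the Elasticsearch search API. As a minimal sketch (the field and value below are taken from the sample document above), the following curl command returns crimes whose Primary Type matches "OTHER OFFENSE":
curl -X GET "http://localhost:9200/crimes/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "match": {
      "Primary Type": "OTHER OFFENSE"
    }
  }
}'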
In this way, you can push any CSV data into Elasticsearch and then perform search and analytics or create dashboards using that data. If you have any queries, please comment.
Other Blogs on Elastic Stack:
Introduction to Elasticsearch
Elasticsearch Installation and Configuration on Ubuntu 14.04
Log analysis with Elastic stack
Elasticsearch Rest API
Basics of Data Search in Elasticsearch
Wildcard and Boolean Search in Elasticsearch
Configure Logstash to push MySQL data into Elasticsearch
Configure Logstash to push MongoDB data into Elasticsearch
Metrics Aggregation in Elasticsearch
Bucket Aggregation in Elasticsearch
How to create Elasticsearch Cluster
In case of any doubt, please leave your comments. You can also follow me on Twitter: https://twitter.com/anubioinfo
If you found this article interesting, then you can explore “Mastering Kibana 6.0”, “Kibana 7 Quick Start Guide”, “Learning Kibana 7”, and “Elasticsearch 7 Quick Start Guide” books to get more insight into the Elastic Stack, how to perform data analysis, and how you can create dashboards for key performance indicators using Kibana.
Comments (5)
Ram P
Apr 12, 2019, 8:38:45 PM
@anurag You may want to remove the column names from the quotes as logstash doesn't know how to filter your column names
Anurag Srivastava
Apr 13, 2019, 8:30:12 AM
@Ram: this is required to map the field names as if we remove the column from filter it will generate following source in Elasticsearch: "_source" : { "column14" : "25", "column21" : "-87.744384567", "@version" : "1", "column2" : "HY189866", "column12" : "011", "column8" : "STREET", "column20" : "41.891398861", "host" : "KELLGGNLPTP0305", "column15" : "04B", "column1" : "10000092", "column13" : "28", "column10" : "false", "column7" : "AGGRAVATED: HANDGUN", "column5" : "041A", "column3" : "03/18/2015 07:44:00 PM", "path" : "/home/user/Downloads/crimes.csv", "column17" : "1903566", "column9" : "false", "column19" : "02/10/2018 03:50:01 PM", "column6" : "BATTERY", "column16" : "1144606", "column22" : "(41.891398861, -87.744384567)", "@timestamp" : "2019-04-13T08:26:32.732Z", "column4" : "047XX W OHIO ST", "column11" : "1111", "column18" : "2015", "message" : """10000092,HY189866,03/18/2015 07:44:00 PM,047XX W OHIO ST,041A,BATTERY,AGGRAVATED: HANDGUN,STREET,false,false,1111,011,28,25,04B,1144606,1903566,2015,02/10/2018 03:50:01 PM,41.891398861,-87.744384567,"(41.891398861, -87.744384567)"""" }
Ram P
Apr 13, 2019, 2:25:01 PM
@anurag I meant to say is to remove the quotes. You would need the column names for identification. Also, did you try auto_detect_column_names => true? By defining this, ELK would take the first row as column names when the input is CSV
Anurag Srivastava
Apr 13, 2019, 3:27:09 PM
@Ram: Thanks for the comment, I have just checked the auto_detect_column_names =>true and it is working.
Abhishek Kulkarni
Feb 24, 2022, 11:03:53 AM
How to load multiple CSV files to Logstash