Select, drop or extract Columns

I am using a SQL database

Use SQL select before dumping out files for your ML project

SELECT column1, column2 ,column3 INTO OUTFILE '/tmp/out.csv'
  FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
  LINES TERMINATED BY '\n'
  FROM my_table;

Or, use Athena as shown here

Files are in a folder, and I like linux commands

The easiest way to do this is using the linux cut command. Suppose you have a file called in.csv or a directory of files that look like

Folder
├── in1.csv
├── in2.csv
├── .
├── .
└── in2000.csv

… and assuming the delimiter used is a comma (,), and you want to select the first three columns, do

cut -d "," -f1-3 Folder/in*.csv > outfile.csv

If you need a specific list of columns, do:

cut -d "," -f1-10,20-25,30-33 Folder/in*.csv > outfile.csv

You can also use awk on linux and MacOS to select and reorder columns:

awk -F "|" '{ print $1","$3","$5 }' Folder/in*csv > outfile.csv

Here, we read from a pipe-delimited file, select and reorder columns, and then write a comma separated file.

To remove additional characters, like a double-quote, add a {gsub(/\"/,"")}; before the print keyword.

How about in Python?

import pandas as pd
df = pd.read_csv('in.csv')

#Select columns named column1 and column2
df[['column1', 'column2']].to_csv('out.csv')

#Or select using column number
df.iloc[:, 0:2].to_csv('out.csv')

Related content: