by Shantnu Tiwaridata-science
Readexcel method is used to read the excel file in python.And then you have to pass file as an argument. Print (data) simply prints the data of excel file. Now on running the above chunks of code we got the output as below. Conversion of Cell Contents. Some times you want to do conversion of your cell contents from excel.So, here you can.
In this tutorial you’re going to learn how to work with large Excel files in Pandas, focusing on reading and analyzing an xls file and then working with a subset of the original data.
Free Bonus:Click here to download an example Python project with source code that shows you how to read large Excel files.
This tutorial utilizes Python (tested with 64-bit versions of v2.7.9 and v3.4.3), Pandas (v0.16.1), and XlsxWriter (v0.7.3). We recommend using the Anaconda distribution to quickly get started, as it comes pre-installed with all the needed libraries.
Reading the File
The first file we’ll work with is a compilation of all the car accidents in England from 1979-2004, to extract all accidents that happened in London in the year 2000.
Download Samsung Tool PRO for Windows PC from FileHorse. 100% Safe and Secure Free Download (32-bit/64-bit) Latest Version 2020. Z3x shell crack.
Excel
Start by downloading the source ZIP file from data.gov.uk, and extract the contents. Then try to open Accidents7904.csv in Excel. Be careful. If you don’t have enough memory, this could very well crash your computer.
What happens?
You should see a “File Not Loaded Completely” error since Excel can only handle one million rows at a time.
We tested this in LibreOffice as well and received a similar error - “The data could not be loaded completely because the maximum number of rows per sheet was exceeded.”
To solve this, we can open the file in Pandas. Before we start, the source code is on Github.
Pandas
Within a new project directory, activate a virtualenv, and then install Pandas:
Now let’s build the script. Create a file called pandas_accidents.py and the add the following code:
Here, we imported Pandas, read in the file—which could take some time, depending on how much memory your system has—and outputted the total number of rows the file has as well as the available headers (e.g., column titles).
When ran, you should see:
So, there are over six millions rows! No wonder Excel choked. Turn your attention to the list of headers, the first one in particular:
This should read Accident_Index
. What’s with the extra xefxbbxbf
at the beginning? Well, the x
actually means that the value is hexadecimal, which is a Byte Order Mark, indicating that the text is Unicode.
Why does it matter to us?
You cannot assume the files you read are clean. They might contain extra symbols like this that can throw your scripts off.
This file is good, in that it is otherwise clean - but many files have missing data, data in internal inconsistent format, etc. So any time you have a file to analyze, the first thing you must do is clean it. How much cleaning? Enough to allow you to do some analysis. Follow the KISS principle.
What sort of cleanup might you require?
- Fix date/time. The same file might have dates in different formats, like the American (mm-dd-yy) or European (dd-mm-yy) formats. These need to be brought into a common format.
- Remove any empty values. The file might have blank columns and/or rows, and this will come up as NaN (Not a number) in Pandas. Pandas provides a simple way to remove these: the
dropna()
function. We saw an example of this in the last blog post. - Remove any garbage values that have made their way into the data. These are values which do not make sense (like the byte order mark we saw earlier). Sometimes, it might be possible to work around them. For example, there could be a dataset where the age was entered as a floating point number (by mistake). The
int()
function then could be used to make sure all ages are in integer format.
Analyzing
For those of you who know SQL, you can use the SELECT, WHERE, AND/OR statements with different keywords to refine your search. We can do the same in Pandas, and in a way that is more programmer friendly.
To start off, let’s find all the accidents that happened on a Sunday. Looking at the headers above, there is a Day_of_Weeks
field, which we will use.
In the ZIP file you downloaded, there’s a file called Road-Accident-Safety-Data-Guide-1979-2004.xls, which contains extra info on the codes used. If you open it up, you will see that Sunday has the code 1
.
That’s how simple it is.
Here, we targeted the Day_of_Weeks
field and returned a DataFrame with the condition we checked for - day of week 1
.
When ran you should see:
As you can see, there were 693,847 accidents that happened on a Sunday.
Let’s make our query more complicated: Find out all accidents that happened on a Sunday and involved more than twenty cars:
Run the script. Now we have 10 accidents:
Let’s add another condition: weather.
Open the Road-Accident-Safety-Data-Guide-1979-2004.xls, and go to the Weather sheet. You’ll see that the code 2
means, “Raining with no heavy winds”.
Add that to our query:
So there were four accidents that happened on a Sunday, involving more than twenty cars, while it was raining:
We could continue making this more and more complicated, as needed. For now, we’ll stop since our main interest is to look at accidents in London.
If you look at Road-Accident-Safety-Data-Guide-1979-2004.xls again, there is a sheet called Police Force. The code for 1
says, “Metropolitan Police”. This is what is more commonly known as Scotland Yard, and is the police force responsible for most (though not all) of London. For our case, this is good enough, and we can extract this info like so:
Run the script. This created a new DataFrame with the accidents handled by the “Metropolitan Police” from 1979 to 2004 on a Sunday:
What if you wanted to create a new DataFrame that only contains accidents in the year 2000?
The first thing we need to do is convert the date format to one which Python can understand using the pd.to_datetime()
function. This takes a date in any format and converts it to a format that we can understand (yyyy-mm-dd). Then we can create another DataFrame that only contains accidents for 2000:
When ran, you should see:
So, this is a bit confusing at first. Normally, to filter an array you would just use a for
loop with a conditional:
However, you really shouldn’t define your own loop since many high-performance libraries, like Pandas, have helper functions in place. In this case, the above code loops over all the elements and filters out data outside the set dates, and then returns the data points that do fall within the dates.
Nice!
Converting
Chances are that, while using Pandas, everyone else in your organization is stuck with Excel. Want to share the DataFrame with those using Excel?
First, we need to do some cleanup. Remember the byte order mark we saw earlier? That causes problems when writing this data to an Excel file - Pandas throws a UnicodeDecodeError. Why? Because the rest of the text is decoded as ASCII, but the hexadecimal values can’t be represented in ASCII.
We could write everything as Unicode, but remember this byte order mark is an unnecessary (to us) extra we don’t want or need. So we will get rid of it by renaming the column header:
This is the way to rename a column in Pandas; a bit complicated, to be honest. inplace = True
is needed because we want to modify the existing structure, and not create a copy, which is what Pandas does by default.
Now we can save the data to Excel:
Make sure to install XlsxWriter before running:
If all went well, this should have created a file called London_Sundays_2000.xlsx, and then saved our data to Sheet1. Open this file up in Excel or LibreOffice, and confirm that the data is correct.
Conclusion
So, what did we accomplish? Well, we took a very large file that Excel could not open and utilized Pandas to-
- Open the file.
- Perform SQL-like queries against the data.
- Create a new XLSX file with a subset of the original data.
Keep in mind that even though this file is nearly 800MB, in the age of big data, it’s still quite small. What if you wanted to open a 4GB file? Even if you have 8GB or more of RAM, that might still not be possible since much of your RAM is reserved for the OS and other system processes. In fact, my laptop froze a few times when first reading in the 800MB file. If I opened a 4GB file, it would have a heart attack.
Free Bonus:Click here to download an example Python project with source code that shows you how to read large Excel files.
So how do we proceed?
The trick is not to open the whole file in one go. That’s what we’ll look at in the next blog post. Until then, analyze your own data. Leave questions or comments below. You can grab the code for this tutorial from the repo.
🐍 Python Tricks 💌
Get a short & sweet Python Trick delivered to your inbox every couple of days. No spam ever. Unsubscribe any time. Curated by the Real Python team.
About Shantnu Tiwari
Shantnu has worked in the low level/embedded domain for ten years. Shantnu suffered at the hands of C/C++ for several years before he discovered Python, and it felt like a breath of fresh air.
» More about ShantnuMaster Real-World Python Skills With Unlimited Access to Real Python
Join us and get access to hundreds of tutorials, hands-on video courses, and a community of expert Pythonistas:
Master Real-World Python Skills
With Unlimited Access to Real Python
Join us and get access to hundreds of tutorials, hands-on video courses, and a community of expert Pythonistas:
What Do You Think?
Real Python Comment Policy: The most useful comments are those written with the goal of learning from or helping out other readers—after reading the whole article and all the earlier comments. Complaints and insults generally won’t make the cut here.
What’s your #1 takeaway or favorite thing you learned? How are you going to put your newfound skills to use? Leave a comment below and let us know.
Keep Learning
Master Real-World Python Skills With Unlimited Access to Real Python
Already a member? Sign-In
Join us and get access to hundreds of tutorials, hands-on video courses, and a community of expert Pythonistas:
Latest versionReleased:
Library to create spreadsheet files compatible with MS Excel 97/2000/XP/2003 XLS files, on any platform, with Python 2.6, 2.7, 3.3+
Project description
xlwt
This is a library for developers to use to generatespreadsheet files compatible with Microsoft Excel versions 95 to 2003.
The package itself is pure Python with no dependencies on modules or packagesoutside the standard Python distribution.
Please read this before using this package:https://groups.google.com/d/msg/python-excel/P6TjJgFVjMI/g8d0eWxTBQAJ Quicktime 2x speed no sound.
Quick start
Documentation
Documentation can be found in the docs directory of the xlwt package.If these aren’t sufficient, please consult the code in theexamples directory and the source code itself.
The latest documentation can also be found at:https://xlwt.readthedocs.org/en/latest/
Problems?
Try the following in this order:
- Read the source
- Ask a question on https://groups.google.com/group/python-excel/
Acknowledgements
xlwt is a fork of the pyExcelerator package, which was developed byRoman V. Kiseliov. This product includes software developed byRoman V. Kiseliov <roman@kiseliov.ru>.
xlwt uses ANTLR v 2.7.7 to generate its formula compiler.
Release historyRelease notifications RSS feed
1.3.0
1.2.0
1.1.2
1.1.1
1.0.0
0.7.5
0.7.4
0.7.3
0.7.2
0.7.1
0.7.0
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Filename, size | File type | Python version | Upload date | Hashes |
---|---|---|---|---|
Filename, size xlwt-1.3.0-py2.py3-none-any.whl (100.0 kB) | File type Wheel | Python version py2.py3 | Upload date | Hashes |
Filename, size xlwt-1.3.0.tar.gz (153.9 kB) | File type Source | Python version None | Upload date | Hashes |
Hashes for xlwt-1.3.0-py2.py3-none-any.whl
Algorithm | Hash digest |
---|---|
SHA256 | a082260524678ba48a297d922cc385f58278b8aa68741596a87de01a9c628b2e |
MD5 | 085e6a73f9bffa8de4abd2c131b8afd5 |
BLAKE2-256 | 4448def306413b25c3d01753603b1a222a011b8621aed27cd7f89cbc27e6b0f4 |
Hashes for xlwt-1.3.0.tar.gz
Algorithm | Hash digest |
---|---|
SHA256 | c59912717a9b28f1a3c2a98fd60741014b06b043936dcecbc113eaaada156c88 |
MD5 | 4b1ca8a3cef3261f4b4dc3f138e383a8 |
BLAKE2-256 | 069756a6f56ce44578a69343449aa5a0d98eefe04085d69da539f3034e2cd5c1 |