Read Word Document using JAVA

Hi guys,

Following code will enable us to read Microsoft Word Document file using JAVA API.

/*
* To change this license header, choose License Headers in Project Properties.
* To change this template file, choose Tools | Templates
* and open the template in the editor.
*/
package com.milind.mr.doc.test;
/**
*
* @author milind
*/
import java.io.File;
import java.io.FileInputStream;
import java.util.List;
import org.apache.commons.io.FilenameUtils;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.extractor.WordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;
public class MicrosoftWordDocReader {
public static void readDocFile(String fileName) {
try {
File file = new File(fileName);
FileInputStream fis = new FileInputStream(file.getAbsolutePath());
HWPFDocument doc = new HWPFDocument(fis);
WordExtractor we = new WordExtractor(doc);
String[] paragraphs = we.getParagraphText();
System.out.println("Total no of paragraph " + paragraphs.length);
for (String para : paragraphs) {
System.out.println(para.toString());
}
fis.close();
} catch (Exception e) {
e.printStackTrace();
}
}
public static void readDocxFile(String fileName) {
try {
File file = new File(fileName);
FileInputStream fis = new FileInputStream(file.getAbsolutePath());
XWPFDocument document = new XWPFDocument(fis);
List<XWPFParagraph> paragraphs = document.getParagraphs();
System.out.println("Total no of paragraph " + paragraphs.size());
for (XWPFParagraph para : paragraphs) {
System.out.println(para.getText());
}
fis.close();
} catch (Exception e) {
e.printStackTrace();
}
}
public static void main(String[] args) {
String ext = FilenameUtils.getExtension("D:\\test.docx");
System.out.println("extension : " + ext);
if ("docx".equalsIgnoreCase(ext)) {
readDocxFile("D:\\Test.docx");
} else if ("doc".equalsIgnoreCase(ext)) {
readDocFile("D:\\Test.doc");
} else {
System.out.println("INVALID FILE TYPE. ONLY .doc and .docx are permitted.");
}
}
}

Following is the pom.xml contents

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.milind</groupId>
<artifactId>mr-doc</artifactId>
<version>1.0</version>
<packaging>jar</packaging>
<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<maven.compiler.source>1.7</maven.compiler.source>
<maven.compiler.target>1.7</maven.compiler.target>
</properties>
<dependencies>
<dependency>
<groupId>org.apache.poi</groupId>
<artifactId>poi-scratchpad</artifactId>
<version>3.0.1-FINAL</version>
</dependency>
<dependency>
<groupId>org.apache.poi</groupId>
<artifactId>poi-ooxml</artifactId>
<version>3.9</version>
</dependency>
<dependency>
<groupId>commons-io</groupId>
<artifactId>commons-io</artifactId>
<version>2.4</version>
<type>jar</type>
</dependency>
</dependencies>
</project>

view raw
pom.xml
hosted with ❤ by GitHub

Following is the word file contents

Input File
It is input word file

 

Following is the output of code.

Output Screenshot
Code Output

 

Thanks for having a read.

Do comment below for your queries.

Published by milindjagre

I founded my blog www.milindjagre.co four years ago and am currently working as a Data Scientist Analyst at the Ford Motor Company. I graduated from the University of Connecticut pursuing Master of Science in Business Analytics and Project Management. I am working hard and learning a lot of new things in the field of Data Science. I am a strong believer of constant and directional efforts keeping the teamwork at the highest priority. Please reach out to me at milindjagre@gmail.com for further information. Cheers!

2 thoughts on “Read Word Document using JAVA

    1. Did you mean a specific page in the word document?
      IF YES, I have not explored page specific data extraction.

      Please let me know if you found the solution for this.
      Appreciate it.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.

%d bloggers like this: