How to remove all html tags from html code

Hello. Is there a function in PHP that removes all HTML tags from an external text file that is an HTML document?

In other words, I would like to save to a different external file only the words that a browser actually displays.

Is there a quick solution already built in, e.g. in PHP, or do I have to code a loop that goes through the text file and keeps only the characters between every > and <?

Thank you.

@ginerjm (Nov 27.2021): How about this: https://www.php.net/manual/en/function.strip-tags.php

@codewitch (author, Nov 27.2021): Thank you for your reply.

The strip_tags() function is indeed a solution. The only problem is that it also returns the CSS written inside the HTML document. With an external stylesheet, it works exactly as desired.
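
A minimal sketch of the behaviour described above: strip_tags() removes the tags themselves but keeps the text between them, so the rules inside an inline `<style>` block survive. The sample string is made up for illustration.

```php
<?php
// Hypothetical sample: an inline stylesheet followed by visible text.
$html = '<style>h1{color: blue;}</style><h1>Title</h1><p>Hello</p>';

// strip_tags() drops the tags but keeps the text between them,
// including the CSS rules inside <style>.
$text = strip_tags($html);

echo $text, PHP_EOL; // h1{color: blue;}TitleHello
```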

@ginerjm (Nov 27.2021): HTH

@NogDog (Nov 27.2021): You might have issues with `<script>/* Javascript code here */</script>`, too. If you need to eliminate non-text elements, you might need to use the [DOM Extension](https://www.php.net/manual/en/book.dom.php) and/or experiment with regexes in [preg_replace()](https://www.php.net/preg_replace)
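
One possible reading of the regex suggestion, as a rough sketch only: remove `<script>` and `<style>` elements together with their contents before calling strip_tags(). The helper name `visibleText` and the sample markup are assumptions, and a regex like this can be fooled by malformed or unusual HTML.

```php
<?php
// Hypothetical helper name; not part of PHP or of this thread.
function visibleText(string $html): string
{
    // Remove <script>...</script> and <style>...</style> with their contents.
    // Flags: i = case-insensitive, s = the dot also matches line breaks.
    // preg_replace() returns null on failure, so fall back to the input.
    $html = preg_replace('#<(script|style)\b[^>]*>.*?</\1\s*>#is', '', $html) ?? $html;

    // Strip the remaining tags, keeping the text between them.
    return strip_tags($html);
}

echo visibleText('<p>Hi</p><SCRIPT type="text/javascript">alert(1);</SCRIPT><style>p{}</style>');
// Hi
```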

@codewitch (author, Nov 28.2021): Hello NogDog. Thank you for your post. I have found a way to get rid of both the CSS code and the script issue:

```php
<?php

// This code returns all words located in the body of the HTML document.

$myfile = fopen("file.htm", "r") or die("Unable to open file.");
$s = fread($myfile, filesize("file.htm"));
fclose($myfile);

// Keep only what follows the closing head tag; this drops the <style> block in the head.
$s = explode("</head>", $s, 2)[1];

$t7 = '';

// Copy the text in front of each script block, then skip past its closing tag.
// stripos() is case-insensitive, so <script and <SCRIPT are both found.
while (($start = stripos($s, "<script")) !== false) {
    $t7 .= substr($s, 0, $start);
    $end = stripos($s, "</script>", $start);
    // An unclosed script block swallows the rest of the document.
    $s = ($end === false) ? '' : substr($s, $end + strlen("</script>"));
}

// Keep the text that follows the last script block.
$t7 .= $s;

echo strip_tags($t7);

?>
```


The code searches for `<script` only, without the closing `>`, because the opening tag can be longer, e.g. `<script type="text/javascript">`.

Hopefully these are all the issues to deal with here.

@Epicblue (Nov 28.2021): strip_tags() is certainly an answer. The main issue is that it also returns the CSS written in the HTML code. With an external stylesheet, however, it works exactly as wanted.

@Sempervivum (Nov 28.2021): Not being an expert on DOMDocument, I figured out by some research and trial and error that the property textContent of a DOM node contains all the text inside, including nested elements and excluding script and style tags. Taking advantage of this, getting the text becomes fairly simple:
```php
$doc = new DOMDocument();
$doc->loadHTMLFile('my-test-document.html', LIBXML_NOWARNING | LIBXML_NOERROR);
$body = $doc->getElementsByTagName('body');

foreach ($body as $item) {
    echo $item->textContent;
}
```

Feel free to add a check that only one body exists.
Tested with this HTML I had at hand:
```html
<!DOCTYPE html>
<html lang="en">

<head>
  <meta charset="UTF-8">
  <meta name="viewport" content="width=device-width, initial-scale=1">
  <title>Nested Test Layout</title>
</head>

<body>
  <style>
    html,
    body {
      height: 100%;
      margin: 0em;
      padding: 0em;
    }

    body {
      display: flex;
    }

    section#left {
      flex-grow: 2;
      background-color: darkkhaki;
      align-items: stretch;
    }

    section#right {
      display: flex;
      flex-direction: column;
      flex-grow: 3;
    }

    #right div {
      padding: 0em;
      margin: 0em;

    }

    section#right div:first-child {
      background-color: darkorange;
      flex: 3;
    }

    section#right div:nth-child(3) {
      background-color: darkturquoise;
      flex: 3;

    }

    section#right div:nth-child(4) {
      background-color: dodgerblue;
      flex: 1;
    }

    section#right div:nth-child(4) p {
      margin: 0em;
    }

    section#right .row {
      display: flex;
      flex-direction: row;
      background-color: darkred;
      flex: 3;
    }

    section#right .row article {
      flex: 3;
      border: 1px solid black;
    }
  </style>
  <section id="left">
    <p>Left</p>
  </section>
  <section id="right">
    <div>
      <p>Right - First</p>
    </div>
    <div class="row">
      <article>
        <p>Lorem Ipsum</p>
      </article>
      <article>
        <p>Lorem Ipsum</p>
      </article>
      <article>
        <p>Lorem Ipsum</p>
      </article>
    </div>
    <div>
      <p>Right - Third</p>
    </div>
    <div>
      <p>Right - Fourth</p>
    </div>
  </section>
  <script>
    // some javascript here
  </script>
</body>

</html>
```

Although this is not valid HTML I placed the style inside the body in order to verify that it's not included in the text.
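
The check for a single body mentioned in this post could look roughly like the sketch below. The helper name `bodyText` and the sample markup are assumptions for illustration; the parser may already normalise some malformed documents, so the check is only a safety net.

```php
<?php
// Hypothetical helper name, for illustration only.
function bodyText(DOMDocument $doc): string
{
    $bodies = $doc->getElementsByTagName('body');

    // Refuse documents that do not have exactly one <body>.
    if ($bodies->length !== 1) {
        throw new RuntimeException('Expected exactly one <body>, found ' . $bodies->length);
    }

    return $bodies->item(0)->textContent;
}

$doc = new DOMDocument();
// loadHTML() parses a string; the markup here is a made-up sample.
$doc->loadHTML('<html><body><p>Left</p><p>Right</p></body></html>', LIBXML_NOWARNING | LIBXML_NOERROR);

echo bodyText($doc); // LeftRight
```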

@codewitch (author, Nov 29.2021): Hello Sempervivum. The code you posted does not deal with the script tags. In addition, it dropped the content of the div "I am a text to be displayed.". Here are the results. Code:
```php
<?php

// code credit: Sempervivum

$doc = new DOMDocument();
$doc->loadHTMLFile('file.htm', LIBXML_NOWARNING | LIBXML_NOERROR);
$body = $doc->getElementsByTagName('body');

foreach ($body as $item) {
    echo $item->textContent;
}

?>
```

My file.htm:
```html
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8"/>
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>hey I am a title</title>
<link rel="stylesheet" href="style.css">
<style>
h1{color: blue;}

 </style>
</head>
<body class="code0">


<div style="background-color: red; width: 100%; ">
I am a text to be displayed.
</div>


<div style="background-color: green; width: 100%; ">

<form action="file.php">
<center>
   <span style="width: 50px; height: 100%; text-align: center; " class=vv>aaa</span>
<input type="text" name="c2" style="width: 79%; font-size:25pt;"></span>
    <span style="width: 50px; height: 100%; text-align: center; " class=vv><input type="submit" value="send" class=c7></span>
</center>
</form>

</div>



<form action="file.php">
<span class="form-container">
<input type="text" name="c2" style="width: 79%; font-size:25pt;">
<input type="submit" value="send" style="position: relative; left: 15px">
</span>
</form>


Hey I am a cool text.





<div class="mycontainer">
<div class="my1">

  <!-- here begins the my1 content -->


  aaa


  <!-- here ends the my1 content -->

  </div>
  <div class="my2">

  <!-- here begins the my2 content -->


  bbb


  <!-- here ends the my2 content -->

  </div>
  <div class="my3">

  <!-- here begins the my3 content -->


  ccc


  <!-- here ends the my3 content -->

  </div>
  <div class="my4">

  <!-- here begins the my4 content -->


  ddd


  <!-- here ends the my4 content -->

  </div>
</div>

I am a text to be included.

<script>I am a text not to be included..</script>

<script>I am a text not to be included..</script>

<SCRIPT>I am a text not to be included..</SCRIPT>




</body>
</html>
```

output:
```text
 aaa Hey I am a cool text. aaa bbb ccc ddd I am a text to be included. I am a text not to be included.. I am a text not to be included.. I am a text not to be included..
```

@Sempervivum (Nov 29.2021): Ah, so sorry. I retested this and noticed that I had mixed up the versions of my test document: I used the wrong one, which did not contain script tags.

@NogDog (Nov 29.2021): Brute force approach:
```php
$doc = new DOMDocument();
$doc->loadHTMLFile(__DIR__.'/file.htm', LIBXML_NOWARNING | LIBXML_NOERROR);

// Blank out the text of every script element.
$scripts = $doc->getElementsByTagName('script');
foreach ($scripts as $script) {
    $script->textContent = '';
}

$body = $doc->getElementsByTagName('body');
foreach ($body as $item) {
    // Collapse each line break plus the whitespace after it into one newline.
    echo preg_replace('/\n\s+/', "\n", $item->textContent);
}
```

Output:
```text
I am a text to be displayed.
aaa
Hey I am a cool text.
aaa
bbb
ccc
ddd
I am a text to be included.
```

@Sempervivum (Nov 29.2021): That's fine. A different approach I coded just now removes the script nodes first:
```php
$doc = new DOMDocument();
$doc->loadHTMLFile('thread777-layout-stef-codewitch.html', LIBXML_NOWARNING | LIBXML_NOERROR);
$body = $doc->getElementsByTagName('body');

// remove script nodes first:
$scriptNodes = $doc->getElementsByTagName('script');
for ($i = $scriptNodes->length; --$i >= 0;) {
    $node = $scriptNodes[$i];
    $node->parentNode->removeChild($node);
}

foreach ($body as $item) {
    // textContent of the body will include nested elements:
    echo $item->textContent;
}
```
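
A note on the backwards loop: getElementsByTagName() returns a live list, so every removal shrinks it, and a forward loop over the same list can skip nodes. An alternative, sketched here with made-up sample markup, is to copy the list into an array with iterator_to_array() and loop over that snapshot.

```php
<?php
$doc = new DOMDocument();
// Made-up sample markup with two script elements.
$doc->loadHTML(
    '<html><body><p>Shown</p><script>a();</script><script>b();</script></body></html>',
    LIBXML_NOWARNING | LIBXML_NOERROR
);

// iterator_to_array() takes a snapshot of the live list, so removing
// nodes does not disturb the iteration.
foreach (iterator_to_array($doc->getElementsByTagName('script')) as $node) {
    $node->parentNode->removeChild($node);
}

$text = $doc->getElementsByTagName('body')->item(0)->textContent;
echo $text; // Shown
```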

@NogDog (Nov 29.2021):
> @Sempervivum#1639988

That's probably cleaner. I was too impatient to figure out how to get removeChild() to work. :)
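
To come back to the original question (saving only the displayed text to a second file), the pieces from this thread could be combined roughly as follows. This is a sketch under assumptions: the helper name `htmlToText` and the output name `file.txt` are hypothetical, and style elements are removed the same way as script elements.

```php
<?php
// Hypothetical helper; the name and the output file name are assumptions.
function htmlToText(string $html): string
{
    $doc = new DOMDocument();
    $doc->loadHTML($html, LIBXML_NOWARNING | LIBXML_NOERROR);

    // Remove script and style elements. Loop over snapshots because
    // the node lists are live.
    foreach (['script', 'style'] as $tag) {
        foreach (iterator_to_array($doc->getElementsByTagName($tag)) as $node) {
            $node->parentNode->removeChild($node);
        }
    }

    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }

    // Collapse each run of whitespace containing a line break into one newline.
    return trim(preg_replace('/\s*\n\s*/', "\n", $body->textContent));
}

// Usage: read file.htm and save its text to a second file.
if (is_readable('file.htm')) {
    file_put_contents('file.txt', htmlToText(file_get_contents('file.htm')));
}
```

How much whitespace survives between elements depends on the parser, so the saved text may need further tidying.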

@codewitch (author, Nov 30.2021): @NogDog @Sempervivum: Thank you both for your replies.