Hello. Is there a function in PHP that removes all HTML tags from an external text file that is an HTML document?
In other words, I would like to save to a different external file only the words that a browser actually displays.
Is there a quick solution already built into PHP, or do I have to code a loop that goes through the text file and keeps only the characters between every > and <?
Thank you.
<script>/* Javascript code here */</script>
preg_replace()
<?php
// This code returns all words located in the body tag of the HTML document.
$myfile = fopen("file.htm", "r") or die("Unable to open file.");
$s = fread($myfile, filesize("file.htm"));
fclose($myfile);
// Keep only what follows the closing head tag.
$s = explode("</head>", $s, 2)[1];
// Cut out every script block: lowercase tags first, then uppercase ones.
foreach ([["<script", "</script>"], ["<SCRIPT", "</SCRIPT>"]] as [$open, $close]) {
    $n = substr_count($s, $close);
    $t7 = '';
    for ($i = 1; $i <= $n; $i++) {
        $t7 .= explode($open, $s, 2)[0]; // text before the next script block
        $s = explode($close, $s, 2)[1];  // continue after that block
    }
    $s = $t7 . $s; // keep the text that follows the last block
}
echo strip_tags($s);
?>
```php
$doc = new DOMDocument();
$doc->loadHTMLFile('my-test-document.html', LIBXML_NOWARNING | LIBXML_NOERROR);
$body = $doc->getElementsByTagName('body');
foreach ($body as $item) {
    echo $item->textContent;
}
```
Feel free to add a check that only one body exists.
Tested with this HTML I had at hand:
```html
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1">
    <title>Nested Test Layout</title>
</head>
<body>
    <style>
        html,
        body {
            height: 100%;
            margin: 0em;
            padding: 0em;
        }
        body {
            display: flex;
        }
        section#left {
            flex-grow: 2;
            background-color: darkkhaki;
            align-items: stretch;
        }
        section#right {
            display: flex;
            flex-direction: column;
            flex-grow: 3;
        }
        #right div {
            padding: 0em;
            margin: 0em;
        }
        section#right div:first-child {
            background-color: darkorange;
            flex: 3;
        }
        section#right div:nth-child(3) {
            background-color: darkturquoise;
            flex: 3;
        }
        section#right div:nth-child(4) {
            background-color: dodgerblue;
            flex: 1;
        }
        section#right div:nth-child(4) p {
            margin: 0em;
        }
        section#right .row {
            display: flex;
            flex-direction: row;
            background-color: darkred;
            flex: 3;
        }
        section#right .row article {
            flex: 3;
            border: 1px solid black;
        }
    </style>
    <section id="left">
        <p>Left</p>
    </section>
    <section id="right">
        <div>
            <p>Right - First</p>
        </div>
        <div class="row">
            <article>
                <p>Lorem Ipsum</p>
            </article>
            <article>
                <p>Lorem Ipsum</p>
            </article>
            <article>
                <p>Lorem Ipsum</p>
            </article>
        </div>
        <div>
            <p>Right - Third</p>
        </div>
        <div>
            <p>Right - Fourth</p>
        </div>
    </section>
    <script>
        // some javascript here
    </script>
</body>
</html>
```
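The check suggested earlier, that only one body exists, could look roughly like the sketch below. The function name `bodyText` is hypothetical and not from the thread, and it parses a string with `loadHTML()` rather than a file so that it can be tried without any test document on disk.

```php
<?php
// Hypothetical helper (not from the thread): returns the text of the single
// <body> element, or throws if the parsed document has none or several.
function bodyText(string $html): string
{
    $doc = new DOMDocument();
    // loadHTML() parses a string; loadHTMLFile() would read from a path instead.
    $doc->loadHTML($html, LIBXML_NOWARNING | LIBXML_NOERROR);

    $bodies = $doc->getElementsByTagName('body');
    if ($bodies->length !== 1) {
        throw new RuntimeException(
            "Expected exactly one <body>, found {$bodies->length}."
        );
    }

    return $bodies->item(0)->textContent;
}

// prints: Left
echo bodyText('<html><head><title>t</title></head><body><p>Left</p></body></html>');
```

Throwing an exception is only one option; returning an empty string or logging a warning may suit a batch job better.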
<?php
// Code credit: Sempervivum
$doc = new DOMDocument();
$doc->loadHTMLFile('file.htm', LIBXML_NOWARNING | LIBXML_NOERROR);
$body = $doc->getElementsByTagName('body');
foreach ($body as $item) {
    echo $item->textContent;
}
?>
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8"/>
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>hey I am a title</title>
<link rel="stylesheet" href="style.css">
<style>
h1{color: blue;}
</style>
</head>
<body class="code0">
<div style="background-color: red; width: 100%; ">
I am a text to be displayed.
</div>
<div style="background-color: green; width: 100%; ">
<form action="file.php">
<center>
<span style="width: 50px; height: 100%; text-align: center; " class=vv>aaa</span>
<input type="text" name="c2" style="width: 79%; font-size:25pt;"></span>
<span style="width: 50px; height: 100%; text-align: center; " class=vv><input type="submit" value="send" class=c7></span>
</center>
</form>
</div>
<form action="file.php">
<span class="form-container">
<input type="text" name="c2" style="width: 79%; font-size:25pt;">
<input type="submit" value="send" style="position: relative; left: 15px">
</span>
</form>
Hey I am a cool text.
<div class="mycontainer">
<div class="my1">
<br/>
<!-- here begins the my1 content -->

aaa

<!-- here ends the my1 content -->

</div>
<div class="my2">

<!-- here begins the my2 content -->

bbb

<!-- here ends the my2 content -->
</div>
<div class="my3">

<!-- here begins the my3 content -->

ccc

<!-- here ends the my3 content -->
</div>
<div class="my4">

<!-- here begins the my4 content -->

ddd

<!-- here ends the my4 content -->
</div>
</div>
I am a text to be included.
<script>I am a text not to be included..</script>
<script>I am a text not to be included..</script>
<SCRIPT>I am a text not to be included..</SCRIPT>
</body>
</html>
aaa Hey I am a cool text. aaa bbb ccc ddd I am a text to be included. I am a text not to be included.. I am a text not to be included.. I am a text not to be included..
[code=php]
$doc = new DOMDocument();
$doc->loadHTMLFile(__DIR__.'/file.htm', LIBXML_NOWARNING | LIBXML_NOERROR);
// Empty every script element so its code does not show up in the text.
$scripts = $doc->getElementsByTagName('script');
foreach ($scripts as $script) {
    $script->textContent = '';
}
$body = $doc->getElementsByTagName('body');
foreach ($body as $item) {
    // Collapse each newline plus the whitespace after it into a single newline.
    echo preg_replace('/\n\s+/', "\n", $item->textContent);
}
[/code]
[code=text]
I am a text to be displayed.
aaa
Hey I am a cool text.
aaa
bbb
ccc
ddd
I am a text to be included.
[/code]
```php
$doc = new DOMDocument();
$doc->loadHTMLFile('thread777-layout-stef-codewitch.html', LIBXML_NOWARNING | LIBXML_NOERROR);
$body = $doc->getElementsByTagName('body');
// remove script nodes first:
$scriptNodes = $doc->getElementsByTagName('script');
for ($i = $scriptNodes->length; --$i >= 0;) {
    $node = $scriptNodes[$i];
    $node->parentNode->removeChild($node);
}
foreach ($body as $item) {
    // textContent of the body will include nested elements:
    echo $item->textContent;
}
```
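Putting the pieces of this thread together, here is a minimal sketch that answers the original question of keeping only the displayed words. It is an assumption-laden illustration, not anyone's posted code: the name `visibleText` is hypothetical, and it also removes `<style>` elements, because `textContent` returns their CSS text just as it returns script code.

```php
<?php
// Hypothetical helper (name and behaviour are assumptions, not from the thread).
function visibleText(string $html): string
{
    $doc = new DOMDocument();
    $doc->loadHTML($html, LIBXML_NOWARNING | LIBXML_NOERROR);

    // Remove elements whose text a browser does not render.
    foreach (['script', 'style'] as $tag) {
        $nodes = $doc->getElementsByTagName($tag);
        // The node list is live, so walk it backwards while removing.
        for ($i = $nodes->length - 1; $i >= 0; $i--) {
            $node = $nodes->item($i);
            $node->parentNode->removeChild($node);
        }
    }

    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }

    // Trim every line and drop the ones that end up empty.
    $lines = array_map('trim', explode("\n", $body->textContent));
    return implode("\n", array_filter($lines, 'strlen'));
}

// prints: Hello
echo visibleText('<body><p>Hello</p><script>ignored();</script></body>');
```

To write the result to a different file, something like `file_put_contents('file.txt', visibleText(file_get_contents('file.htm')));` should do, where `file.txt` is an assumed output name. Note that this still keeps text hidden through CSS, since the DOM parser does not evaluate stylesheets.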
>@Sempervivum#1639988
removeChild()