www.webdeveloper.com

Thread: [RESOLVED] Regular Expressions drive me crazy....

  1. #1
    Join Date
    Feb 2008
    Posts
    70

    [RESOLVED] Regular Expressions drive me crazy....

    Hello there guys. Once again I need some help with regular expressions.
    Somehow I can't seem to grasp the concept behind them. I read all those examples and I seem to understand them, but when it comes to writing them myself I get into trouble....

    Anyway, on to the actual question.
    I am building a Content Management System which supports templating. The templates are html files.
    When they get imported into the system their paths for images, external css and scripts no longer work so I need to replace them with the actual path.

    Currently I am replacing them like this:
    $this->template = str_replace('href="', 'href="'.$path.'/', $this->template);
    $this->template = str_replace('src="', 'src="'.$path.'/', $this->template);

    The problem with this is that when I replace the href attribute to fix the change of path for CSS files, it also rewrites ordinary links (<a> tags also have an href attribute). This way I also have to scan the file twice, which makes things slower.

    So what I actually need is a regular expression that will replace the path
    for:
    <img src="images/image.jpg" />
    <script type="text/javascript" src="js/script.js"></script>
    <link href="css/styles.css" rel="stylesheet" type="text/css" />
    but not for:
    <a href="index.php?page=76">Link</a>


    Any help will be MUCH appreciated

  2. #2
    Join Date
    Jan 2009
    Location
    Insanity
    Posts
    1,131
    Introduce a tag:

    href='##mytag##'

    and replace the ##mytag## string with the data you need. Apply a similar technique to any other places where you need to insert data. Job done.
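    A minimal sketch of that placeholder idea, assuming the template author has already written ##mytag## where the path belongs. The sample markup and the folder name are made up for illustration:

```php
<?php
// Hypothetical template text using the ##mytag## placeholder from the post.
$template = '<link href="##mytag##/css/styles.css" rel="stylesheet" type="text/css" />'
          . "\n" . '<a href="index.php?page=76">Link</a>';

// Assumed deployment folder; not a value taken from the thread.
$path = 'templates/template_name';

// Only the marked spot changes; the ordinary link has no placeholder and is left alone.
$output = str_replace('##mytag##', $path, $template);

echo $output, "\n";
```

    The trade-off is that every template has to be edited by hand to add the placeholder before it is imported.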

  3. #3
    Join Date
    Mar 2004
    Posts
    3,081
    There are lots of ways to fix this problem. The way you're trying to do it is probably the most obvious and sensible at first, but I think it might be better and much more efficient to eliminate the problem of the broken links rather than working around them with corrections. It depends on how and why these links are being broken.

    For example, if the links are broken simply because of a mismatch in a directory name in the URL, then you could easily fix all URLs that use that mismatched folder name by using something called a symbolic link. A symbolic link is basically just a folder or file that points to another folder or file. So if you had a symbolic link masquerading as a folder, it would just point to another folder somewhere else instead of actually containing files.

    Say, for example, that a URL that needed fixing in one of your templates had the address of
    Code:
    /a/b/k/d/file.txt
    , but the file the URL was supposed to reference was actually at
    Code:
    /a/b/c/d/file.txt
    Instead of rewriting every URL that goes through the /a/b/k/d/ folders, you could just put a symbolic link in the folder called "b". You just make a folder link called "k" that points to "c". From then on, any time something wants to go inside folder "k" it will actually end up in the /a/b/c/ folder, but the address will still say /a/b/k/.

    I know I probably haven't explained that very well and made it sound much more complicated than it actually is, but I can assure you it is actually very simple indeed. The shell command to make a link like in the example above would be
    Code:
    ln -s /a/b/c /a/b/k
    .
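    For anyone without shell access, the same link can probably be created from PHP with the built-in symlink() function. This is only a sketch: it builds the a/b/c/d layout inside a throwaway temp directory (an assumption made so it is safe to run) and assumes the filesystem and account are allowed to create symlinks.

```php
<?php
// Sketch only: everything lives in a throwaway temp directory.
$base = sys_get_temp_dir() . '/symlink_demo_' . uniqid();
mkdir("$base/a/b/c/d", 0777, true);                 // the real folders
file_put_contents("$base/a/b/c/d/file.txt", 'hello');

// Equivalent of: ln -s /a/b/c /a/b/k   (target first, link name second)
symlink("$base/a/b/c", "$base/a/b/k");

// The "k" address now resolves to the file stored under "c".
$via_link = file_get_contents("$base/a/b/k/d/file.txt");
echo $via_link, "\n";
```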

    You could also probably achieve much the same effect by using a redirection in your Apache config.

    If you are going to go for regular expressions, though, you'll probably want to check out using backreferences for matching the closing tag of an element. I would give you an example but I'm really rusty on this stuff and can't remember the exact syntax from memory. I've been trying to get a development setup installed on my PC so that I could help you since last night, but for some reason MySQL won't compile at the moment. So I can't.

    Oh and you might want to look at using preg_replace() so that you can do all replacements in one go, rather than repeatedly calling some other function.

    I don't really understand Junk's suggestion, but if it's as efficient as using a symbolic link and easier to implement (maybe you might not have access to the CLI) then his solution might well be best.

    Can you give examples of these URLs so we know what we're dealing with, or do you think you can take it from here?
    I'm thuper, thanks for asking.

    It lives! http://www.stephenphilbin.com/ (Well it kinda' does anyway).
    My portable colour selection tool

  4. #4
    Join Date
    Feb 2008
    Posts
    70
    Thank you JunkMale and Stephen for your replies. Unfortunately those solutions won't work in my case.

    The CMS should work with any template without having to add special tags for link formatting etc., so it will work with any template you download from the internet without needing extra knowledge, on your part or the designer's, to adapt the template to the system.

    Also, the CMS supports multiple templates, each having its own folder, and does not have any restrictions on the folder name. The solution Stephen proposed would work if there was only one template and if the name of the folder containing it was known beforehand. It also requires reconfiguring the server each time a new site is built with the CMS, and probably a lot of users won't know how to do that.

    I think that regular expressions are the only way in my case, since they would make any template work without imposing any restrictions or requiring special handling.

    Anyway thanks again guys

  5. #5
    Join Date
    Mar 2004
    Posts
    3,081
    Can you give some better code samples to illustrate exactly what you mean and the variances that are to be expected?

    I think I might understand your problem, but I'm not sure. It actually sounds like you need standardization rather than regular expressions, but I'll try to lend a hand with expressions once I get my host software back up and running (should be very soon).

    It sounds like you're trying to build a system that would allow others to create a package of graphics and CSS/Javascript to customize its appearance and somehow know how to use these files. If that is what you're after, then you're probably best off making a standard for the naming of files and the structure of the package. It sounds like you'd just end up having to come up with a set of regular expressions to support each package that someone might make.

  6. #6
    Join Date
    Jul 2009
    Location
    Falls Church, Va.
    Posts
    780
    @OP

    Based on your comment:

    When they get imported into the system their paths for images, external css and scripts no longer work so I need to replace them with the actual path.
    You might want to approach this differently, in my view. Let me explain/justify...

    Template files uploaded by users should *never* be fully formed HTML files, otherwise it's not a "template" file. Templates should always be formatted to include meta data in the form of tags specific to that CMS. Document all the tags and their translations in the page where they upload the template so users know what format your CMS expects.

    You should already have separate GLOBAL template files for doc type, HTML and body tags, linking CSS style sheets and Javascript code, head meta tags and any other essential HTML tags used in all pages. Maybe even create your own header/footer templates to keep the site consistent or common elements on a page. Add all this to your documentation and allow users to upload specific page content only, devoid of html, body, css, certain head meta data, etc, that works with your global templates.

    Then, handle exceptions by validating their uploaded template, which means stripping any tags intended for or used by your global template(s) and leaving the rest.

    Example:

    Allow users to upload a stylesheet (.css file), not a "link" to one (which has already been stripped by this point). Then parse its name into whatever template deals with stylesheets, the path being set automatically. You can do the same for calling Javascript code, and so on.

    Key point:

    The point is YOU control the organization of all these templates, how parsing is done in which ones, all of which when combined in the proper sequence render the final page. This is not the same as a user uploading 100% of the HTML and you having to account for a million variations and possibilities not only in syntax but in positioning and location of tags sometimes which can reference URL's and not just local paths. You cannot possibly account for all these things.

    Nobody said this was easy, but this is the proper approach in a commercial application. In a home made, DIY "I don't care" site with a few users you can trust to follow the specifications and create well formed HTML all the time, obviously this approach isn't necessary.

    And finally, what you actually asked for:

    I'm putting this at the end as I want you to really slow down and read the advice I just gave, it's a wise approach when CMS's are involved. Parsing tags and their attributes is best done with regular expressions, just as you asked about. Here is an example:

    PHP Code:

    //$html is the HTML source of an uploaded user template
    //$new_header is a global template you set up with head and meta/title tags
    if (preg_match("=<title[^>]*>(.*)</title>=siU", $html, $matches)) {
        //Copy the uploaded title into the global header template
        $new_header = preg_replace("=<title[^>]*>.*</title>=siU", "\n<title>$matches[1]</title>\n", $new_header);
    }

    //Remove the title tag from the uploaded template
    $html = preg_replace("=<title[^>]*>.*</title>=siU", '', $html);
    That takes the source of an uploaded template, gets the contents of the title tag into $matches array and parses it into a global template. The last line removes the title tag entirely as a demonstration of how to strip tags beyond PHP's strip_tags() built in method. I know the example is redundant, but it's just to demo the technique of parsing meta data using regular expressions.

    It's up to you to design a sensible template system, but in a nutshell, this is how it's usually done on most CMS's where users play an active role in adding content in the form of HTML. So you have full control.

    Sorry for the long post, but wanted to fully explain the concepts here.

    -jim

  7. #7
    Join Date
    Feb 2008
    Posts
    70
    hello everyone, again I appreciate your help very much.

    SrWebDeveloper I get what you mean and your way was considered at the designing phase and is actually one part of the current implementation

    to clarify things a little bit more I will explain how my CMS works.

    First of all let me start by saying I am way past designing the system. The system has been under development for a year now and most of its parts are complete. During development I had to come up with some 'quick fixes' for some of the problems, such as the above, to push the development forward...it's now time to remove the sloppy code.

    The theory behind applying templates to the cms goes like this...you upload a zip file containing html, css, js and image files to the system (exactly as you would get them from your designer or buy them from template monster for instance).

    The system deploys the files into '/templates/template_name/'.

    The system adds/replaces headers, meta tags and global scripts automatically to the template (and can be altered within the CMS).

    The CMS provides the administrator with a visual tool to define 'blocks'. Blocks are essentially boxes defined in the template as {tags} where you can add components.

    Essentially no coding knowledge or special formatting is required by the designer/user, everything can be done within the CMS.



    Now about the problem....

    The CMS uses a single entry point, so the only way to run the page is through '/index.php'.
    This causes relative links contained in the template to break. For example, if an image path is 'images/image_name.jpg' it is not found, since the file is now located in '/templates/template_name/images/image_name.jpg'.

    As a quick fix to this problem I wrote the following script, which essentially replaces all href=" and src=" strings with href="templates/template_name/ and src="templates/template_name/ respectively

    PHP Code:
    $this->template = str_replace('href="', 'href="'.$path.'/', $this->template);
    $this->template = str_replace('src="', 'src="'.$path.'/', $this->template);
    the problems I am facing with the above code can be summarized as follows:
    1. the attribute could be written in a looser way such as href=' (single quote) or href= " (space between = and ") or href =" (space between href and =) or HREF=" (uppercase) or any combination of the above.

    2. the anchor (<a>) tag also has an href attribute that should not get replaced.

    3. tags sometimes reference URL's and not local paths...they should not be replaced.

    Essentially what I need is a regular expression that, using preg_replace, would make the above replacement....in one step if possible.

    If you consider yourself a regular expressions GURU please give it a try...
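    The failure cases listed above can be reproduced with a short script. This is a rough sketch of the quick fix run against made-up sample tags. The folder name and the example.com URL are placeholders, not values from the actual CMS:

```php
<?php
$path = 'templates/template_name';   // assumed folder name

// One hypothetical tag per problem described in the list above.
$samples = array(
    'anchor'        => '<a href="index.php?page=76">Link</a>',
    'single_quotes' => "<img src='images/image.jpg' />",
    'uppercase'     => '<IMG SRC="images/image.jpg" />',
    'remote_url'    => '<script src="http://example.com/lib.js"></script>',
);

$results = array();
foreach ($samples as $name => $tag) {
    // The same two str_replace() calls as the quick fix.
    $fixed = str_replace('href="', 'href="' . $path . '/', $tag);
    $fixed = str_replace('src="', 'src="' . $path . '/', $fixed);
    $results[$name] = $fixed;
    echo $name, ': ', $fixed, "\n";
}
```

    The anchor and the remote URL get a prefix they should not have, while the single-quoted and uppercase variants are missed entirely because str_replace() only matches the exact, case-sensitive string.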
    Last edited by paishin; 05-17-2010 at 05:15 PM.

  8. #8
    Join Date
    Feb 2008
    Posts
    70
    ok here's the solution I came up with today....

    I know the code looks horrible and I don't even know what the impact would be on the system's speed since it uses a lot of expressions to achieve what should be done with just one.

    Anyway, for those following this post here is the code...please comment on improvements

    PHP Code:
    $html_template = file_get_contents($file, true); //Read the file

    //Extract (script, image and link) tags from template and store in $html_tags array
    preg_match_all(
        '/<\s*(script|img|link)+.+(src|href)+[^<]+?>/',
        $html_template,
        $html_tags
    );

    //Insert template path into the extracted html tags
    $altered_html_tags = preg_replace(
        array(
            '/(href)+\s*=\s*(\")\s*(?!\s*http)/',   //select href=" not followed by http
            "/(href)+\s*=\s*(\')\s*(?!\s*http)/",   //select href=' not followed by http
            '/(src)+\s*=\s*(\")\s*(?!\s*http)/',    //select src=" not followed by http
            "/(src)+\s*=\s*(\')\s*(?!\s*http)/"     //select src=' not followed by http
        ),
        array(
            'href="'.$template_path.'/',
            "href='".$template_path.'/',
            'src="'.$template_path.'/',
            "src='".$template_path.'/'
        ),
        $html_tags[0]
    );

    //Replace HTML tags of template with the new ones.
    $this->template = str_replace($html_tags[0], $altered_html_tags, $html_template);
    Last edited by paishin; 05-17-2010 at 05:36 PM.

  9. #9
    Join Date
    Dec 2002
    Location
    Pleasanton, CA
    Posts
    2,132
    This may help.
    It's not exactly what you want because it's JavaScript (I don't know PHP), but the regex used here should do what you want.
    If you un-comment each line in turn, it will add the appropriate path, except for the last one.
    Code:
    var template_path = 'usr/local/site/';
    
    str = '<img src="images/image.jpg" />';
    //str = '<script type="text/javascript" src="js/script.js">';
    //str = '<link href="css/styles.css" rel="stylesheet" type="text/css" />';
    //str = '<a href="index.php?page=76">Link</a>';
    
    var newstr = str.replace(/(<img src=\"|<link href=\"|<script .+? src=")/,"$1"+template_path);
    alert(newstr);
    
    PHP version, maybe! :)
    $this->template = preg_replace('/(<img src="|<link href="|<script .+? src=")/', '${1}'.$template_path, $linetobechanged);

  10. #10
    Join Date
    Jul 2009
    Location
    Falls Church, Va.
    Posts
    780
    @paishin:

    If I follow you correctly, the original templates being uploaded do not, for example, have meta data representing certain crucial path info and common elements. If true, your template system is not as I described and you're in for a world of hurt trying to code a simple parsing system to account for the chosen design. You even hit on the same problems I mentioned, i.e. the paths vs. URL problem. If so, the code you just posted is a good start as any - because I'm suggesting to you now it might be unwise to spend a lot of time making that code perfect. Don't worry about system resources. You can't make it perfect, so close is good enough. To get it closer than you have now, re-study the code examples I supplied to you which does a decent job accounting for spaces and other syntax irregularities in HTML. MODIFY MY CODE. Notice the use of the equal sign, the pattern modifiers being used. All tips to help you create your code.

    If you think you might need to invest more time "tweaking" than you thought, you should ponder biting the bullet to re-code certain aspects of the template system apparently designed by others. That way, meta tags used by the CMS for paths and so on are part of the uploaded files prior to processing. Doing so will make server side processing more efficient and faster, with fewer problems, and you can even do things like check to see if key meta tags exist and be able to reject uploads as "improper template, not supported by this CMS". I'm very sorry after a full year you're where you are now, but it would be irresponsible of me to not point this out, if I am indeed following you correctly. I read your comments carefully, you certainly do seem to be parsing data after the upload, you do not seem to be creating a CMS proprietary set of meta tags required in the uploaded files, and you even described the uploaded files as similar to those found in Template Monster, which is of course HTML all the way -- what they call "ready made". Not to be confused with templates proprietary to specific CMS's such as Drupal, Joomla, WordPress, etc. where they only work within their host CMS environment, which is how it usually is done.

    I leave it to others to assist you now, I've done my share and thank you for your time and patience in reading this and my other response. Cheers.

    -jim

    ps: @nedals - this solution must be in PHP, the OP is processing uploaded files server side. FYI.
    Last edited by SrWebDeveloper; 05-17-2010 at 07:44 PM. Reason: fixed a typo and some formatting, added nedal comment when I saw their reply also

  11. #11
    Join Date
    Dec 2002
    Location
    Pleasanton, CA
    Posts
    2,132
    Quote: Originally Posted by SrWebDeveloper
    @nedals - this solution must be in PHP, the OP is processing uploaded files server side. FYI.
    Thanks, but I am aware that this needs to be PHP, as noted in my post. My sample code was to demonstrate the regexp. It seems to me that this is a pretty simple problem unless I'm missing something.

    Parse a template file adding a path to the following tags
    <img src="images/image.jpg" />
    <script type="text/javascript" src="js/script.js">
    <link href="css/styles.css" rel="stylesheet" type="text/css" />

    A really rough PHP version might look something like this (I work mostly with perl)
    Code:
     
    $template_path = 'path/to/files/';
    $this->template = '';
    $filedata = file('filename');
    foreach ($filedata as $line) {
    	// ${1} keeps the matched prefix; the template path is appended right after it
    	$this->template .= preg_replace('/(<img src="|<link href="|<script .+? src=")/', '${1}'.$template_path, $line);
    }

  12. #12
    Join Date
    Feb 2008
    Posts
    70
    As I said in the previous post, I am partially using what you have suggested (used for removing/inserting html tags), just not for altering the path.
    I am just using a small section of the template class code to illustrate the problem, so it's not easy to get the bigger picture.

    In my CMS you do not have to have any special format for the template; any HTML file is good to go. Unlike Joomla, where you have to have a template specifically written for the CMS, in my CMS you can use any HTML file and the CMS will convert it to a CMS template automatically.

    Anyway, I have been battling with this regular expression thing for over 10 hours now and I think I've reached an almost perfect solution. These 3 lines can replace all of the above code:

    PHP Code:
    //Extract (script, image and link) tags from template and store in $html_tags array
    preg_match_all('/<\s*(script|img|link)+.+(src|href)+[^<]+?>/', $html_template, $html_tags);

    //Correct template path of the extracted html tags
    $altered_html_tags = preg_replace('/(src|href)+\s*=\s*(\"|\')\s*(?!\s*http)/', '$0'.$template_path.'/', $html_tags[0]);

    //Replace HTML tags of template with the new ones.
    $this->template = str_replace($html_tags[0], $altered_html_tags, $html_template);
    Using this you can replace all src and href paths with the actual path. It also ignores attributes that contain external links (given that they have http in front).

    I have done this in three lines, but I know it can be done in a single line by combining the above two regular expressions. For the life of me I just can't get it to work.

  13. #13
    Join Date
    Aug 2004
    Location
    Ankh-Morpork
    Posts
    19,616
    I think this might do what you are asking:
    PHP Code:
    $text = preg_replace(
        '#(<\s*(?:img|link|script)[^>]*(?:href|src)\s*=\s*)(?P<quote>[\'"])(.*?)(?P=quote)#si',
        "$1$2$template_path/$3$2",
        $html_template
    );
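    As a quick check, here is the same pattern run against the four sample tags from the first post. The folder name is an assumed placeholder, and the replacement references are written as ${1}, ${2}, ${3} instead of $1, $2, $3. That is a small precaution added here, since a template path starting with a digit would otherwise run into the reference number:

```php
<?php
$template_path = 'templates/template_name';   // assumed folder name

// The four sample tags from the original question.
$html_template = implode("\n", array(
    '<img src="images/image.jpg" />',
    '<script type="text/javascript" src="js/script.js"></script>',
    '<link href="css/styles.css" rel="stylesheet" type="text/css" />',
    '<a href="index.php?page=76">Link</a>',
));

$text = preg_replace(
    '#(<\s*(?:img|link|script)[^>]*(?:href|src)\s*=\s*)(?P<quote>[\'"])(.*?)(?P=quote)#si',
    '${1}${2}' . $template_path . '/${3}${2}',
    $html_template
);

$lines = explode("\n", $text);
echo $text, "\n";
```

    The img, script and link tags pick up the template path, and the anchor tag is left untouched because the pattern only starts matching at those three tag names.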
    "Please give us a simple answer, so that we don't have to think, because if we think, we might find answers that don't fit the way we want the world to be."
    ~ Terry Pratchett in Nation

    eBookworm.us

  14. #14
    Join Date
    Jul 2009
    Location
    Falls Church, Va.
    Posts
    780
    The only problem remaining is a path might be a URL to content on a remote site so you'd have to account for that. You noted it once, so did I, and it should not be forgotten. If it's a factor, this entire method is wrong and the path should be converted to a meta tag in the uploaded template. Period.

    Otherwise, it is simple regular expression replacement as everyone is now discussing and the consolidated version seems to be a great solution.

    -jim

  15. #15
    Join Date
    Aug 2004
    Location
    Ankh-Morpork
    Posts
    19,616
    Add a negative assertion on "http":
    PHP Code:
    $text = preg_replace(
        '#(<\s*(?:img|link|script)[^>]*(?:href|src)\s*=\s*)(?P<quote>[\'"])(?!http)(.*?)(?P=quote)#si',
        "$1$2$template_path/$3$2",
        $html_template
    );
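    The (?!http) assertion covers http and https URLs, but it does not cover protocol-relative URLs (//host/...), root-relative paths (/css/...) or data: URIs, and it would also skip a local folder whose name happens to start with "http". One possible extension, not something posted in the thread, is to switch to preg_replace_callback() and decide per attribute. The function name prefix_asset_paths and the sample folder are hypothetical:

```php
<?php
// Hypothetical helper, not from the thread: prefix only relative asset paths.
function prefix_asset_paths(string $html, string $template_path): string
{
    // Same idea as the pattern above, but it stops at the first src/href
    // attribute in the tag and captures the quote in group 2.
    $pattern = '#(<\s*(?:img|link|script)\b[^>]*?\s(?:href|src)\s*=\s*)([\'"])(.*?)\2#si';

    $result = preg_replace_callback($pattern, function (array $m) use ($template_path): string {
        $url = $m[3];
        // Leave alone: empty values, anything with a scheme (http:, https:, data:),
        // protocol-relative or root-relative paths (leading /), and fragments (#x).
        if ($url === '' || preg_match('~^(?:[a-z][a-z0-9+.\-]*:|/|#)~i', $url)) {
            return $m[0];
        }
        return $m[1] . $m[2] . rtrim($template_path, '/') . '/' . $url . $m[2];
    }, $html);

    // preg_replace_callback() returns null on a regex error; fall back to the input.
    return $result ?? $html;
}

echo prefix_asset_paths("<IMG SRC = 'images/b.png'>", 'templates/template_name'), "\n";
```

    The callback costs a little more per match than a plain preg_replace(), but it keeps the "is this path local?" rule in ordinary PHP where it is easier to adjust.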
